BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation

Autonomous vehicles are equipped with different sensors that deliver complementary data: cameras capture semantic information, radars provide velocity estimation, and LiDARs offer spatial information. For accurate perception, a unified representation suitable for multi-task multi-modal feature fusion has to be found.

LiDAR camera mounted to the top of a vehicle. Image credit: Oregon Department of Transportation via Flickr, CC BY 2.0

A recent paper on arXiv.org proposes BEVFusion to unify multi-modal features in a shared bird's-eye view (BEV) representation space for task-agnostic learning. The approach makes it possible to maintain both geometric structure and semantic density and naturally supports most 3D perception tasks.
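To make the idea concrete, here is a minimal sketch of fusing two modalities that already live on the same BEV grid: once camera and LiDAR features share one grid, fusion reduces to an ordinary convolutional block. The channel counts, grid size, and concat-then-convolve module below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NaiveBEVFuser(nn.Module):
    """Fuse per-modality BEV feature maps defined on one shared grid.

    Hypothetical sketch: the channel counts and the simple
    concatenate-then-convolve fusion are assumptions for illustration.
    """
    def __init__(self, cam_channels=80, lidar_channels=256, out_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Both inputs: (batch, channels, H_bev, W_bev) on the same BEV grid,
        # so fusion is just concatenation along the channel dimension.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# Example: a 180x180 BEV grid shared by both modalities.
cam_bev = torch.randn(1, 80, 180, 180)
lidar_bev = torch.randn(1, 256, 180, 180)
fused = NaiveBEVFuser()(cam_bev, lidar_bev)  # shape: (1, 256, 180, 180)
```

Because every downstream head (detection, segmentation) consumes the same fused BEV map, this is also where the task-agnostic quality of the representation comes from: swapping tasks means swapping heads, not redesigning the fusion.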

The method sets new state-of-the-art performance. On 3D object detection, it ranks first on the nuScenes benchmark leaderboard among all solutions that do not use test-time augmentation and model ensembles. It also demonstrates significant improvements on BEV map segmentation.

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
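The BEV pooling the abstract mentions aggregates the millions of features lifted from the camera views into BEV grid cells, and it is this step that dominates view-transformation latency. Below is a deliberately naive scatter-add version of that aggregation in plain PyTorch; the paper's reported 40x latency reduction comes from an optimized kernel, which this sketch does not attempt to reproduce, and all shapes here are assumptions.

```python
import torch

def naive_bev_pooling(feats, coords, grid_h, grid_w):
    """Sum lifted camera features into BEV grid cells (unoptimized sketch).

    feats:  (N, C) features lifted from the camera frustums.
    coords: (N, 2) integer (row, col) BEV cell of each point; points
            falling outside the grid are assumed to be filtered already.
    Returns a (C, grid_h, grid_w) BEV feature map.
    """
    n, c = feats.shape
    # Flatten each (row, col) cell coordinate into a single index.
    flat_idx = coords[:, 0] * grid_w + coords[:, 1]
    bev = feats.new_zeros(grid_h * grid_w, c)
    # Scatter-add every point's feature vector into its BEV cell.
    bev.index_add_(0, flat_idx, feats)
    return bev.t().reshape(c, grid_h, grid_w)

# Example: 100k lifted points pooled onto a 180x180 grid.
feats = torch.randn(100_000, 80)
coords = torch.randint(0, 180, (100_000, 2))
bev = naive_bev_pooling(feats, coords, 180, 180)  # shape: (80, 180, 180)
```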

Research article: Liu, Z., "BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation", 2022. Link: https://arxiv.org/abs/2205.13542
Project page: https://bevfusion.mit.edu/