BEV | Less is More

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

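Introduction
- BEVFormer: look up and aggregate
  - spatial cross-attention: Transformer-based; each BEV query extracts features from its regions of interest across camera views
  - temporal self-attention: adds temporal structure by recurrently fusing the previous BEV feature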
- the BEV features can support
  - 3D perception tasks such as 3D object detection
  - map segmentation
- Camera-based 3D Perception
  - DETR3D projects learnable 3D queries onto 2D images
  - BEV: transform image features into BEV features and predict 3D bounding boxes from the top-down view
Method
- Overview
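  - The model body is a stack of 6 encoder layers; each encoder layer has three parts
    - the running query Q is the set of grid-shaped BEV queries
    - first, temporal self-attention with the BEV feature from the previous timestep
    - then, spatial cross-attention with the encoded multi-camera features
    - finally, an FFN, which outputs the BEV feature for the current timestep t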
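The layer ordering above can be sketched as plain function composition. This is only an illustration of the data flow: `run_encoder` and the three `toy_*` stand-ins are hypothetical names, the stand-ins do not implement real attention, and residual connections and normalization are omitted.

```python
from typing import Callable, List, Optional

BEV = List[List[float]]  # flattened H*W cells, each a C-dimensional feature


def run_encoder(
    bev_queries: BEV,
    prev_bev: Optional[BEV],
    temporal_self_attn: Callable[[BEV, Optional[BEV]], BEV],
    spatial_cross_attn: Callable[[BEV], BEV],
    ffn: Callable[[BEV], BEV],
    num_layers: int = 6,
) -> BEV:
    """Apply the three sub-blocks in the order listed in the notes."""
    q = bev_queries
    for _ in range(num_layers):
        q = temporal_self_attn(q, prev_bev)  # 1. look up the previous BEV feature
        q = spatial_cross_attn(q)            # 2. look up multi-camera features
        q = ffn(q)                           # 3. per-cell feed-forward
    return q                                 # BEV feature for the current timestep


# Toy stand-ins (assumptions, not the real sub-blocks).
def toy_temporal(q: BEV, prev: Optional[BEV]) -> BEV:
    if prev is None:          # first frame: no history available in this toy
        return q
    return [[0.5 * (a + b) for a, b in zip(qc, pc)] for qc, pc in zip(q, prev)]


def toy_spatial(q: BEV) -> BEV:
    return [[v + 1.0 for v in cell] for cell in q]


def toy_ffn(q: BEV) -> BEV:
    return [[max(v, 0.0) for v in cell] for cell in q]
```

Passing the sub-blocks as callables keeps the ordering visible on its own, separate from how each attention is implemented.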
- BEV Queries
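  - given the $H \times W$ BEV plane, the grid-shaped queries $Q \in \mathbb{R}^{H \times W \times C}$ each own one grid cell's feature
  - the center of the BEV plane corresponds to the ego vehicle
  - add a learnable positional embedding (PE) to the queries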
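A minimal sketch of the query grid, assuming random initialization for both the queries and the positional embedding (in a real model both would be trained parameters). `make_bev_queries` and `ego_cell` are hypothetical helper names, and the grid sizes in the usage note are illustrative.

```python
import random
from typing import List, Tuple

Grid = List[List[List[float]]]  # indexed as grid[y][x][c]


def make_bev_queries(H: int, W: int, C: int, seed: int = 0) -> Grid:
    """Return an H x W grid of C-dim queries with a positional embedding added."""
    rng = random.Random(seed)

    def init_grid() -> Grid:
        return [[[rng.gauss(0.0, 0.02) for _ in range(C)]
                 for _ in range(W)] for _ in range(H)]

    queries = init_grid()    # stands in for the learnable BEV queries
    pos_embed = init_grid()  # stands in for the learnable positional embedding
    return [[[q + p for q, p in zip(queries[y][x], pos_embed[y][x])]
             for x in range(W)] for y in range(H)]


def ego_cell(H: int, W: int) -> Tuple[int, int]:
    """(row, col) of the cell at the center of the plane, where the ego vehicle sits."""
    return (H // 2, W // 2)
```

For example, `make_bev_queries(4, 6, 8)` gives a 4 x 6 grid of 8-dimensional queries, and `ego_cell(4, 6)` is `(2, 3)`.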
- Spatial Cross-Attention
  - deformable attention：query only interacts with its regions of interest across camera views
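  - Unlike the original deformable attention, the reference points are not learned: each query is first lifted into a pillar (bin query), 3D reference points are sampled from the pillar, and these points are then projected onto the 2D views
    - given grid scale $s$, the real-world location is $x' = (x - \frac{W}{2}) \cdot s$, $y' = (y - \frac{H}{2}) \cdot s$
    - define a set of anchor heights $\{z'_j\}_{j=1}^{N_{ref}}$, one height per sampling point along the pillar
    - so each grid query has $N_{ref}$ 3D reference points to sample from
    - finally, project the world coordinates onto each camera's 2D feature map using the camera parameters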
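The reference-point construction above can be sketched in a few functions. This assumes a pinhole camera given by an intrinsic matrix `K` and a world-to-camera rotation `R` and translation `t`, uses $\frac{H}{2}$ for the y axis, and drops points that land behind the camera or outside the image. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]


def grid_to_world(x: int, y: int, H: int, W: int, s: float) -> Tuple[float, float]:
    """Map a BEV cell index (x, y) to metric coordinates centered on the ego vehicle."""
    return (x - W / 2) * s, (y - H / 2) * s


def pillar_points(x: int, y: int, H: int, W: int, s: float,
                  anchor_heights: Sequence[float]) -> List[Point3]:
    """Lift one BEV cell into N_ref 3D reference points, one per anchor height."""
    xw, yw = grid_to_world(x, y, H, W, s)
    return [(xw, yw, z) for z in anchor_heights]


def project_to_image(p: Point3, K, R, t, img_w: float, img_h: float
                     ) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel (u, v); None if it does not hit this view."""
    # World -> camera frame: pc = R @ p + t
    pc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if pc[2] <= 1e-5:  # behind (or on) the camera plane
        return None
    # Camera frame -> homogeneous pixel coordinates: uvw = K @ pc
    uvw = [sum(K[i][j] * pc[j] for j in range(3)) for i in range(3)]
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    if 0.0 <= u < img_w and 0.0 <= v < img_h:
        return (u, v)
    return None  # outside the image: this camera is not used for the point
```

Each surviving `(u, v)` then serves as a reference location for deformable attention on that camera's feature map; cameras that return `None` for every point of a pillar are skipped for that query.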
- Temporal Self-Attention
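  - first, align the previous BEV feature to the current ego position (the grid centers must match)
  - then apply deformable attention
  - unlike the original deformable attention, the learnable sampling offsets are predicted not from the query alone but from the concatenation of the query and the aligned previous query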
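A rough sketch of the alignment and concatenation steps, assuming the ego motion is given as a 2D pose `(dx, dy, yaw)` of the current frame expressed in the previous frame, and using nearest-neighbor lookup with zero fill. Both choices are simplifications made here for illustration, and the function names are hypothetical.

```python
import math
from typing import List

Grid = List[List[List[float]]]  # indexed as grid[y][x][c]


def align_prev_bev(prev_bev: Grid, dx: float, dy: float, yaw: float, s: float) -> Grid:
    """Resample the previous BEV so that cell (x, y) covers the same world
    location as cell (x, y) of the current BEV."""
    H, W, C = len(prev_bev), len(prev_bev[0]), len(prev_bev[0][0])
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    aligned: Grid = []
    for y in range(H):
        row = []
        for x in range(W):
            # Cell location in the current ego frame (meters).
            xc, yc = (x - W / 2) * s, (y - H / 2) * s
            # Same physical location in the previous ego frame.
            xp = cos_y * xc - sin_y * yc + dx
            yp = sin_y * xc + cos_y * yc + dy
            # Back to grid indices of the previous BEV (nearest neighbor).
            px = int(math.floor(xp / s + W / 2 + 0.5))
            py = int(math.floor(yp / s + H / 2 + 0.5))
            if 0 <= px < W and 0 <= py < H:
                row.append(list(prev_bev[py][px]))
            else:
                row.append([0.0] * C)  # no history for newly visible area
        aligned.append(row)
    return aligned


def offset_input(query: Grid, aligned_prev: Grid) -> Grid:
    """Concatenate each query with its aligned previous feature (C -> 2C);
    this is the tensor from which sampling offsets would be predicted."""
    return [[qc + pc for qc, pc in zip(qrow, prow)]
            for qrow, prow in zip(query, aligned_prev)]
```

With `yaw = 0` and the ego moving one cell forward along x, every feature shifts one cell toward lower x and the newly exposed column is zero-filled.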
Applications
- given the BEV feature of shape $H \times W \times C$
  - 3D Det based on DETR
    - replace the image features with the single-scale BEV feature
    - predict 3D bounding boxes and velocity
    - the 3D box regression loss uses only L1 loss
  - map segmentation based on Panoptic SegFormer
    - essentially the same as the Mask2Former decoder
    - N learnable queries go through cross-attention and self-attention, followed by semantic aggregation
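For the detection head's box regression, an L1-only loss can be sketched as below. It assumes predictions have already been matched one-to-one with ground-truth boxes, and that each box is a flat list of parameters with velocity included; the exact parameter layout and the name `l1_box_loss` are hypothetical.

```python
from typing import List, Sequence


def l1_box_loss(pred: Sequence[Sequence[float]],
                target: Sequence[Sequence[float]]) -> float:
    """Sum of absolute errors over all box parameters, averaged over boxes.

    pred[i] and target[i] are assumed to be an already-matched pair.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boxes")
    total = sum(abs(p - t)
                for pb, tb in zip(pred, target)
                for p, t in zip(pb, tb))
    return total / max(len(pred), 1)  # avoid division by zero on empty input
```

For example, `l1_box_loss([[1.0, 2.0]], [[0.0, 4.0]])` is `3.0`.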