BEV

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

  1. Introduction

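    • BEVFormer: BEV queries look up spatial and temporal information and aggregate it into one BEV representation

      • spatial cross-attn: Transformer-based; each BEV query extracts spatial features from the multi-camera images
      • temporal self-attn: fuses the BEV feature from the previous timestep, giving the model temporal structure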

    • the BEV features can support

      • 3D perception tasks such as 3D object detection
      • map segmentation
    • Camera-based 3D Perception

      • DETR3D projects learnable 3D queries into 2D images
      • BEV-based methods: transform image features into BEV features and predict 3D bounding boxes from the top-down view
  2. Method

    • Overview

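      • The model body is a stack of 6 encoder layers; each encoder layer has three parts

        • the running queries Q are grid-shaped BEV queries
        • first, temporal self-attention with the BEV feature from the previous timestep
        • then, spatial cross-attention with the encoded multi-camera features
        • finally, an FFN, which outputs the BEV feature for the current timestep t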
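
The data flow through the encoder stack described above can be sketched as follows. This is only an illustration of the ordering of the three parts: the function name `bev_encoder` and the three callables per layer are hypothetical stand-ins, and residual connections and normalization are omitted.

```python
def bev_encoder(bev_queries, prev_bev, cam_feats, layers):
    """Run the BEV queries through a stack of encoder layers.

    `layers` is a list of (temporal_self_attn, spatial_cross_attn, ffn)
    callables; all three are hypothetical placeholders for the real modules.
    """
    q = bev_queries
    for temporal_self_attn, spatial_cross_attn, ffn in layers:
        q = temporal_self_attn(q, prev_bev)    # attend to the previous BEV feature
        q = spatial_cross_attn(q, cam_feats)   # attend to multi-camera features
        q = ffn(q)                             # feed-forward network
    return q  # BEV feature for the current timestep t


# Toy usage: scalar "features" and arithmetic stand-ins, 6 layers as in the notes.
toy_layer = (lambda q, p: q + p, lambda q, f: q * f, lambda q: q - 1)
bev_t = bev_encoder(1, 2, 3, [toy_layer] * 6)
```

The output `bev_t` would then be passed in as `prev_bev` at the next timestep.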

    • BEV Queries

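      • given the $H \times W$ BEV plane, the grid-shaped queries $Q \in \mathbb{R}^{C \times H \times W}$ are responsible for the grid features, one query per grid cell
      • the center of the BEV feature corresponds to the ego vehicle
      • add positional embedding (PE) to $Q$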
    • Spatial Cross-Attention

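      • deformable attention: each query interacts only with its regions of interest across camera views

      • Unlike the original deformable attention, the reference points are not learnable: each query is first lifted into a pillar-like "bin" query, 3D reference points are sampled from the bin query, and these points are then projected onto the 2D views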

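        • given grid scale $s$ (the real-world size of one BEV grid cell), the real-world location of the query at grid position $(x, y)$ is $x'=(x-\frac{W}{2})s$, $y'=(y-\frac{H}{2})s$
        • define a set of anchor heights $\{z'_i\}_{i=1}^{N_{ref}}$, one height per sampling point
        • so each grid query has $N_{ref}$ 3D reference points to sample from
        • finally, project the world coordinates onto the 2D image feature maps using the camera parameters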
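
The steps above (grid position to world coordinates, anchor heights, projection to a camera view) can be sketched in NumPy. This is a minimal sketch under assumptions: the function names are hypothetical, the camera is represented by a single $3 \times 4$ projection matrix `T`, and points behind the camera are only flagged as invalid, not filtered out.

```python
import numpy as np


def bev_reference_points(H, W, s, anchor_heights):
    """Return (H, W, N_ref, 3) world-coordinate reference points per BEV grid cell."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_world = (xs - W / 2) * s          # x' = (x - W/2) * s
    y_world = (ys - H / 2) * s          # y' = (y - H/2) * s
    z = np.asarray(anchor_heights, dtype=float)
    pts = np.empty((H, W, z.shape[0], 3))
    pts[..., 0] = x_world[..., None]    # same x' for every anchor height
    pts[..., 1] = y_world[..., None]    # same y' for every anchor height
    pts[..., 2] = z                     # one z' per reference point
    return pts


def project_to_image(points, T):
    """Project (..., 3) world points with an assumed 3x4 projection matrix T.

    Returns (..., 2) image coordinates and a boolean mask that is True
    where the point lies in front of the camera (positive depth).
    """
    ones = np.ones(points.shape[:-1] + (1,))
    homo = np.concatenate([points, ones], axis=-1)   # homogeneous coordinates
    cam = homo @ T.T                                 # (..., 3): depth * (u, v, 1)
    depth = cam[..., 2]
    valid = depth > 1e-5
    uv = cam[..., :2] / np.where(valid, depth, 1.0)[..., None]
    return uv, valid
```

In a multi-camera setup one would call `project_to_image` once per camera and keep, for each query, only the views where `valid` is true and the projected point falls inside the image.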

    • Temporal Self-Attention

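      • first, align the center points, so that the previous BEV feature lines up with the current one
      • then apply deformable attention
      • Unlike the original deformable attention, the learnable sampling offsets are predicted not from the query alone, but from the concatenation of the query and the aligned previous-timestep query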
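
The difference from the original deformable attention noted above is only in what the offset predictor sees. A minimal NumPy sketch of that one step follows; the function name, the single linear layer, and the weight shapes are assumptions made for illustration, and the sampling and weighted aggregation that come after are not shown.

```python
import numpy as np


def predict_offsets(Q, B_prev_aligned, W_off, b_off, n_points):
    """Predict 2D sampling offsets from concat(Q, aligned previous BEV).

    Q, B_prev_aligned: (H*W, C) flattened BEV queries / aligned previous feature.
    W_off: (2*C, n_points*2) and b_off: (n_points*2,) are hypothetical
    weights of one linear layer. Returns offsets of shape (H*W, n_points, 2).
    """
    x = np.concatenate([Q, B_prev_aligned], axis=-1)   # (H*W, 2C)
    out = x @ W_off + b_off                            # (H*W, n_points*2)
    return out.reshape(Q.shape[0], n_points, 2)


# Toy usage with random stand-in weights.
rng = np.random.default_rng(0)
HW, C, P = 12, 8, 4
Q = rng.normal(size=(HW, C))
B_prev = rng.normal(size=(HW, C))
offsets = predict_offsets(Q, B_prev, rng.normal(size=(2 * C, P * 2)), np.zeros(P * 2), P)
```

Because the input is the concatenation, the offsets depend on both the current query and the history feature, which is the point of the design.
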
  3. Applications

    • given the BEV feature of shape $C \times H \times W$
      • 3D Det based on DETR
        • replace the image features with the single-scale BEV feature
        • predict 3D bounding boxes and velocity
        • the 3D box regression loss uses only L1 loss
      • map segmentation based on Panoptic SegFormer
        • essentially the same as the Mask2Former decoder
        • N learnable queries go through cross-attention and self-attention, followed by a final semantic aggregation