dag

  • 美研院的论文,检测,用于腰椎/髋关节关键点提取
  • preparations
    1. hrnet
    2. pspModule

Structured Landmark Detection via Topology-Adapting Deep Graph Learning

  1. 动机

    • landmark detection
      • 特征点检测
      • identify the locations of predefined fiducial points
      • capture relationships among 解剖学特征点
    • 一个难点:遮挡/复杂前景状态下,landmark的准确检测和定位——structual information
    • the proposed method
      • 用于facial and medical landmark detection
      • topology-adapting:learnable connectivity
      • learn end-to-end with two GCNs
  2. 论点

    • heatmap regression based methods
      • 将landmarks建模成heatmaps,然后回归
      • lacking a global representation
      • 核心要素有bottom-up/top-down paths & multi-scale fusions & high resolution heatmap outputs
    • coordinate regression based methods
      • potentially incorporate structural knowledge but a lot yet to be explored
      • falls behind heatmap-based ones
      • 核心要素是cascaded & global & local
      • 好处是结构化,不丢点,不多点,但是不一定准
    • graph methods
      • 基于landmark locations和landmark-to-landmark-relationships构建图结构
      • most methods relies on heatmap detection results
      • we would directly regress landmark locations from raw input image
    • we propose
      • DAG:deep adaptive graph
      • 将landmarks建模成graph图
      • employ global-to-local cascaded Graph Convolution Networks逐渐将landmark聚焦在目标位置
      • graph signals combines
        • local image features
        • graph shape features
      • cascade
        • two GCNs
        • 第一个预测一个global transform
        • 第二个预测local offsets to further adjust
      • contributions
        • effectively exploit the structural knowledge
        • allow rich exchange among landmarks
        • narrow the gap between coordinate & heatmap based methods
  3. 方法

    • the cascaded-regression framework

      • input

        • image
        • initial landmarks from the mean shape
      • outputs

        • predicted landmark coordinates in multiple steps
      • feature

        • use graph representation
        • G = (V,E,F)
          • V是节点,代表landmarks,也就是特征点,表示为(x,y)的坐标
          • E是边,代表connectivity between landmarks,表示为(id_i, id_k)的无向/有向映射,整体的E matrix是个稀疏矩阵
          • F是graph signals,capturing appearance and shape information,表示为高维向量,如256-dim vec,与节点V一一对应,用于储存节点信息,在GCN中实际进行计算交互
      • overview

        • summary
          • cascade:一个GCN-global做粗定位,迭代多个GCN-local做precise定位
          • interpolation:feature map到feature nodes的转换,通过interpolation,【是global interp吗,是基于initial mean coords吗】
          • regression:【targets的具体坐标表示???】
          • inital graph:训练集的平均值
          • graph signal:visual feature和shape feature
    • Cascaded GCNs

      • GCN-global:global transformation

      • GCN-local:coordinate offsets

      • share the same GCN architecture

      • graph convolution

        • 核心思想就是:给定一个图结构(with connectivity E),每一次堆叠graph convolution,就是在对每个图节点,基于其自身$f_k^i$和邻居节点$f_k^j$的当前graph feature,weighted aggregating,结果作为这个节点这次图卷积的输出$f_{k+1}^i$

        • learnable weight matrices $W_1$ 和 $W_2$

        • 可以看作是邻居节点间信息交互的一种方式

      • Global Transformation GCN

        • 这个model的作用是将initial landmarks变换到coarse targets

        • 参照STN,

          • recall STN

          • 使用perspective transformation透视变换,引入9个scalars,进行图形变

        • workflow

          • given a target image
          • initialize landmark locations $V^0$ using trainingset mean
          • GCN-global + GIN 预测perspective transformation
          • 进而得到变换后的节点位置
        • graph isomorphism network (GIN)

          • 图的线性层

          • 输入是GCN-global的graph features $\{f_k^i\}$

          • 输出是9-dim vector

          • 计算方式

            • READOUT:sum the features from all nodes
            • CONCAT:得到一个高维向量
            • MLP:9-dim fc
            • 最后得到9-dim的perspective transformation scalar
        • coordinate update

          • 将9-dim $f^G$ reshape成3x3 transformation matrix M
          • 然后在当前的landmark locations $V^0$上施加变换——矩阵左乘

      • Local Refinement GCN

        • GCN结构与global的一致,但是不share权重

        • 最后的GIN头变了

          • 输出改成2-dim vector
          • represents coordinate offsets
        • coordinate update

          • 加法,分别在x/y轴坐标上

        • we perform T=3 iterations

    • Graph signal with appearance and shape information

      • Visual Feature
        • denote CNN输出的feature map H with D channels
        • encoding整个feature map:bi-linear interpolation at the landmark location $v_i$,记作$p_i$,是个D-dim vector
      • Shape Feature
        • visual feature对节点间关系的建模,基于global map全局信息提取,比较隐式、间接
        • 事实上图结构能够直接对global landmarks shape进行encoding
        • 本文用displacement vectors,就是距离,每个节点的displacement vector记作$q_i=\{v_j-v_i\}_{j!=i}$,flatten成一维,对有N个节点的图,每个节点的q-vec维度为2*(N-1)
        • shape feature保存了structural information,当人脸的嘴被遮住的情况下,基于眼睛和鼻子以及结构性信息,就能够推断嘴的位置,这是Visual Feature不能直接表达的
      • graph signal
        • concat
        • result in a feature vector $f_i \in R^{D+2(N-1)}$
    • Landmark graph with learnable connectivity

      • 大多数方法的图基于先验知识构建
      • we learn task-specific graph connectivity during training phase
      • 图的connectivity serves as a gate,用邻接矩阵表示,并将其作为learnable weights
    • training

      • GCN-global

        • margin loss

        • $v_i^1$是GCN-global的预测节点坐标

        • m是margin

        • $[u]_+$是$max(0,u)$

        • push节点坐标到比较接近ground truth就停止了,防止不稳定

      • GCN-local

        • L1 loss

        • $v_i^T$是第T个iteration GCN-local的预测节点坐标

      • overall loss

        • 加权和
  4. 网络结构

    • GCN-global
      • 三层basic graph convolution layer with residual(id path)
      • concat distance vector
      • 一层basic graph convolution
      • mean axis1(node axis)
      • fc,输出9-dim scalar,(b,9)
    • GCN-local
      • 三层basic graph convolution layer with residual(id path)
      • relu
      • concat distance vector
      • 一层basic graph convolution
      • fc,输出2-dim coords for each node,(b,24,2)