Less is More



transform in CNN

发表于 2021-02-03 |

综述

  1. 几何变换
    • STN:
      • 普通的CNN能够隐式地学习一定的平移、旋转不变性,让网络能够适应这种变换:降采样结构本身能够使得网络对变换不敏感
      • 从数据角度出发,我们还会引入各种augmentation,强化网络对变换的不变能力
      • deepMind为网络设计了一个显式的变换模块来学习各种变换,将distorted的输入变换回去,让网络学习更简单的东西
      • 参数量:就是变换矩阵的参数,通常是2x3的仿射变换矩阵,也就是6个参数
    • deformable conv:
      • based on STN
      • 针对分类和检测分别提出deformable convolution和deformable RoI pooling:
      • 感觉deformable RoI pooling和guiding anchor里面的feature adaption是一个东西
      • 参数量:regular kernel params 3x3 + deformable offsets 3x3x2
      • what’s new?
        • 个人认为主要是引入了更多参数,带来更强的变换建模能力
        • 首先STN是从output到input的映射,使用变换矩阵M通常只能表示可参数化的全局变换(如仿射),且全图只有1个transformation
        • 其次STN的sampling kernel也是预定义的算法,对kernel内的所有pixel使用相同的变换,也就是1个weight factor
        • deformable conv是从input到output的映射,映射可以是任意的transformation,且3x3x2的参数最多可以包含3x3种transformation
        • sampling kernel对kernel内的每个点,也可以有不同的权重,也就是3x3个weight factor
    • 还有啥跟形变相关的
  2. attention机制
    • spatial attention:STN,sSE
    • channel attention:SENet
    • 同时使用空间attention和通道attention机制:CBAM
  3. papers

    • [STN] STN: Spatial Transformer Networks,STN的变换是pre-defined的,是针对全局featuremap的变换
    • [DCN 2017] Deformable Convolutional Networks,DCN的变换更自由(learned),是针对局部kernel分别进行的变换,基于卷积核添加location-specific shift
    • [DCNv2 2018] Deformable ConvNets v2: More Deformable, Better Results,进一步消除irrelevant context,基于卷积核添加weighted-location-specific shift,提升performance
    • [attention系列paper] [SENet & SKNet & CBAM & GC-Net](https://amberzzzz.github.io/2020/03/13/attention%E7%B3%BB%E5%88%97/)

STN: Spatial Transformer Networks

  1. 动机

    • 传统卷积:lack the ability to be spatially invariant
    • propose a new learnable module
      • can be inserted into CNN
      • spatially manipulate the data
      • without any extra supervision
      • models learn to be invariant to transformations
  2. 论点

    • spatially invariant
      • the ability of being invariant to large transformations of the input data
    • max-pooling
      • 在一定程度上spatially invariant
      • 因为receptive fields are fixed and local and small
      • 必须叠加到比较深层的时候才能实现,intermediate feature layers对large transformations不太行
      • 是一种pre-defined mechanism,跟sample无关
    • spatial transformation module
      • conditioned on individual samples
      • dynamic mechanism
      • produce a transformation and perform it on the entire feature map
    • task场景
      • distorted digits分类:对输入做transform能够simplify后面的分类任务
      • co-localisation:
      • spatial attention
    • related work
      • 生成器用来生成transformed images,从而判别器能够学习分类任务from transformation supervision
      • 一些methods试图从网络结构、feature extractors的角度的获得invariant representations,while STN aims to achieve this by manipulating the data
      • manipulating the data通常就是基于attention mechanism,crop涉及differentiable问题
  3. 方法

    • formulation

      • localisation network:predict transform parameters
      • grid generator:基于predicted params生成sampling grid
      • sampler:在sampling grid的每个点上apply sampling kernel(如bilinear),采样得到输出feature map

    • localisation network

      • input feature map $U \in R^{H\times W\times C}$
      • same transformation is applied to each channel
      • generate parameters of transformation $\theta$:1-d vector
      • fc / conv + final regression layer
    • parameterised sampling grid

      • sampling kernel

      • applied by pixel

      • general affine transformation:cropping,translation,rotation,scale,skew

      • output map上任意一点一定来自变换前的某一点,反之不一定,input map上某一点可能是bg,被crop掉了,所以pointwise transformation写成从target到source的映射:$(x_i^s, y_i^s)^T = A_{\theta} (x_i^t, y_i^t, 1)^T$

      • target points构成的点集就是sampling points on the input feature map

    • differentiable image sampling

      • 通过上一步的矩阵transformation,得到input map上需要保留的source point set

      • 对点集中每一点apply kernel

      • 通用的插值表达式:$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)$

      • 最近邻kernel是个pulse函数

      • bilinear kernel把distance>1的点权重置0:$k(d)=max(0, 1-|d|)$,分段线性、分段可导,见下面的sketch
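
      一个minimal sketch(非论文官方实现),用PyTorch自带的F.affine_grid / F.grid_sample把localisation network → grid generator → sampler串起来;其中localisation network的结构是假设的,最后的回归层按论文建议零初始化、bias初始化为恒等变换:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SimpleSTN(nn.Module):
          def __init__(self, in_ch=1):
              super().__init__()
              # localisation network:回归2x3仿射矩阵的6个参数
              self.loc = nn.Sequential(
                  nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
                  nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(10, 6))
              self.loc[-1].weight.data.zero_()
              self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

          def forward(self, x):
              theta = self.loc(x).view(-1, 2, 3)                           # (b,2,3)仿射参数
              grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
              return F.grid_sample(x, grid, align_corners=False)          # bilinear sampler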

    • STN:Spatial Transformer Networks

      • 把spatial transformer嵌进CNN去:learn how to actively transform the features to help minimize the overall cost
      • computationally fast
      • 几种用法
        • feed the output of the localization network $\theta$ to the rest of the network:因为transform参数explicitly encodes目标的位置姿态信息
        • place multiple spatial transformers at increasing depth:串行能够让深层的transformer学习更抽象的变换
        • place multiple spatial transformers in parallel:并行的变换使得每个变换针对不同的object
  4. 实验

    • R、RTS、P、E:distortion ahead
    • aff、proj、TPS:transformer predefined
      • aff:给定角度??
      • TPS:薄板样条插值

Deformable Convolutional Networks

  1. 动机

    • CNN:fixed geometric structures
    • enhance the transformation modeling capability
      • deformable convolution
      • deformable RoI pooling
    • without additional supervision
    • share similar spirit with STN
  2. 论点

    • to accommodate geometric variations

      • data augmentation is limited to model large, unknown transformations
      • fixed receptive fields is undesirable for high level CNN layers that encode the semantics
      • 使用大量增广的数据,枚举不全,而且收敛慢,所需网络参数量大
      • 对于提取语义特征的高层网络来讲,固定的感受野对不同目标不友好
    • introduce two new modules

      • deformable convolution
        • learning offsets for each kernel via additional convolutional layers
      • deformable RoI pooling

        • learning offset for each bin partition of the previous RoI pooling

  3. 方法

    • overview

      • operate on the 2D spatial domain
      • remains the same across the channel dimension
    • deformable convolution

      • 正常的卷积:
        • $y(p_0) = \sum w(p_n)*x(p_0 + p_n)$
        • $p_n \in \{(-1,-1),(-1,0),\dots,(0,0),\dots,(1,1)\}$
      • deformable conv:with offsets $\Delta p_n$
        • $y(p_0) = \sum w(p_n)*x(p_0 + p_n + \Delta p_n)$
        • offset value is typically fractional
        • bilinear interpolation:
          • $x(p) = \sum_q G(q,p)x(q)$
          • 其中$G(q,p)$是bilinear kernel:$G(q,p)=max(0, 1-|q_x-p_x|) \cdot max(0, 1-|q_y-p_y|)$
          • 只计算和offset点距离小于1个单位的邻近点
      • 实现
        • offsets conv和特征提取conv是一样的kernel:same spatial resolution and dilation(N个position)
        • the channel dimension 2N:因为是x和y两个方向的offset
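
      一个minimal sketch(非原论文实现),借助torchvision.ops.deform_conv2d:offset由一个并行的、零初始化的conv预测,对应上面说的2N个channel:

      import torch
      import torch.nn as nn
      from torchvision.ops import deform_conv2d

      class DeformConvBlock(nn.Module):
          def __init__(self, in_ch, out_ch, k=3):
              super().__init__()
              self.k = k
              # offset分支:输出2*k*k个channel(x、y两个方向),零初始化
              self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
              nn.init.zeros_(self.offset_conv.weight)
              nn.init.zeros_(self.offset_conv.bias)
              self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

          def forward(self, x):
              offset = self.offset_conv(x)                  # (b, 2*k*k, h, w)
              return deform_conv2d(x, offset, self.weight, padding=self.k // 2)

      y = DeformConvBlock(64, 64)(torch.randn(1, 64, 32, 32))   # (1, 64, 32, 32)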
    • deformable RoI pooling

      • RoI pooling converts an input feature map of arbitrary size into fixed size features

      • 常规的RoI pooling

        • divides ROI into k*k bins and for each bin:$y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p)/n_{ij}$
        • 对feature map上划分到每个bin里面所有的点
      • deformable RoI pooling:with offsets $\Delta p_{ij}$

        • $y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p+\Delta p_{ij})/n_{ij}$
        • scaled normalized offsets:$\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w,h)$,其中$\Delta \hat{p}_{ij}$是fc预测的归一化offset
        • normalized offset value is fractional
        • bilinear interpolation on the pooled map as above
      • 实现

        • fc layer:k*k*2个element(sigmoid?)
      • position sensitive RoI Pooling

        • fully convolutional
        • input feature map先通过卷积扩展成k*k*(C+1)通道
        • 对每个类别通道(共C+1个,各对应k*k个score map),用另一个conv branch输出全图的offset fields(2*k*k个)

    • deformable convNets

      • initialized with zero weights
      • learning rates are set to $\beta$ times of the learning rate for the existing layers
        • $\beta=1.0$ for conv
        • $\beta=0.01$ for fc
      • feature extraction
        • back:ResNet-101 & Aligned-Inception-ResNet
        • withoutTop:A randomly initialized 1x1 conv is added at last to reduce the channel dimension to 1024
        • last block
          • stride is changed from 2 to 1
          • the dilation of all the convolution filters with kernel size>1 is changed from 1 to 2
        • Optionally last block
          • use deformable conv in res5a,b,c
      • segmentation and detection
        • deeplab predicts 1x1 score maps
        • Category-Aware RPN run region proposal with specific class
        • modified faster R-CNN:add ROI pooling at last conv
        • optional faster R-CNN:use deformable ROI pooling
        • R-FCN:state-of-the-art detector
        • optional R-FCN:use deformable ROI pooling

  4. 实验

    • Accuracy steadily improves when more deformable convolution layers are used:使用越多层deform conv越好,经验取了3

    • the learned offsets are highly adaptive to the image content:大目标的间距大,因为receptive field大,consistent in different layers

    • atrous convolution also improves:default networks have too small receptive fields,但是dilation需要手调到最优

    • using deformable RoI pooling alone already produces noticeable performance gains, using both obtains significant accuracy improvements

Deformable ConvNets v2: More Deformable, Better Results

  1. 动机

    • DCN能够adapt一定的geometric variations,但是仍存在extend beyond image content的问题
    • to focus on pertinent image regions
      • increased modeling power
        • more deformable layers
        • updated DCNv2 modules
      • stronger training
        • propose feature mimicking scheme
    • verified on
      • incorporated into Faster-RCNN & Mask RCNN
      • COCO for det & set
    • still lightweight and easy to incorporate
  2. 论点

    • DCNv1
      • deformable conv:在standard conv的基础上generate location-specific offsets which are learned from the preceding feature maps
      • deformable pooling:offsets are learned for the bin positions in RoIpooling
      • 通过可视化散点图发现有部分散点落在目标外围
    • propose DCNv2
      • equip more convolutional layers with offset
      • modified module
        • each sample not only undergoes a learned offset
        • but also a learned feature amplitude
      • effective training
        • use RCNN as the teacher network since RCNN learns features unaffected by irrelevant info outside the ROI
        • feature mimicking loss
  3. 方法

    • stacking more deformable conv layers

      • replace more regular conv layers by their deformable counterparts:
        • resnet50的stage3、4、5的3x3conv都替换成deformable conv:13个conv layer
          • DCNv1是把stage5的3个resblock的3x3 conv替换成deformable conv:3个deformable conv layer
      • 因为DCNv1里面在PASCAL上面实验发现再多的deformable conv精度就饱和了,但是DCNv2是在harder dataset COCO上面的best-acc-efficiency-tradeoff
    • modulated deformable conv

      • modulate the input feature amplitudes from different spatial locations/bins
        • set the learnable offset & scalar for the k-th location:$\Delta p_k$和$\Delta m_k$
        • set the conv kernel dilation:$p_k$,resnet里面都是1
        • the value for location p is:$y(p) = \sum_{k=1}^K w_k x(p+p_k+\Delta p_k)\Delta m_k$,bilinear interpolation
      • 目的是抑制无关信号
      • learnable offset & scalar obtained via a separate conv layer over the same input feature map x
      • 输出有3K个channel:2K for xy-offset,K for scalar
        • offset的conv后面没激活函数,因为范围无限
        • scalar的conv后面有个sigmoid,将range控制在[0,1]
        • 两个conv全0初始化
        • 两个conv layer的learning rate比existing layers小一个数量级
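
      一个minimal sketch(非原论文实现):同一个conv分支输出3K个channel(2K offset + K个过sigmoid的modulation scalar),需要较新版本torchvision的deform_conv2d支持mask参数:

      import torch
      import torch.nn as nn
      from torchvision.ops import deform_conv2d

      class ModulatedDeformConv(nn.Module):
          def __init__(self, in_ch, out_ch, k=3):
              super().__init__()
              self.k = k
              self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=k // 2)  # 3K个channel,零初始化
              nn.init.zeros_(self.offset_mask.weight)
              nn.init.zeros_(self.offset_mask.bias)
              self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

          def forward(self, x):
              o1, o2, mask = torch.chunk(self.offset_mask(x), 3, dim=1)
              offset = torch.cat([o1, o2], dim=1)       # 2K个offset channel
              mask = torch.sigmoid(mask)                # K个[0,1]的modulation scalar
              return deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)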
    • modulated deformable RoIpooling

      • given an input ROI
      • split into K(7x7) spatial bins
      • average pooling over the sampling points for each bin计算bin的value
      • the bin value is:$y(k) = \sum_{j=1}^{n_k} x(p_{kj}+\Delta p_k)\Delta m_k /n_k$,bilinear interpolation
      • a sibling branch
        • 2个1024d-fc:gaussian initialization with 0.01 std dev
        • 1个3Kd-fc:全0初始化
        • last K channels + sigmoid
        • learning rate跟existing layers保持一致
    • RCNN feature mimicking

      • 发现无论是regular conv还是deformable conv,error-bound都很大
      • 尽管从设计思路上,DCNv2是带有mute irrelevant的能力的,但是事实上并没做到
      • 说明such representation cannot be learned well through standard FasterRCNN training procedure:
        • 说白了就是supervision力度不够
        • 需要additional guidance
    • feature mimic loss

      • enforced only on positive ROIs:因为背景类往往需要更长距离/更大范围的context信息

      • architecture

        • add an additional RCNN branch
        • RCNN input cropped images,generate 14x14 featuremaps,经过两个fc变成1024-d
        • 和FasterRCNN里对应的counterpart,计算cosine similarity
        • 这个太扯了不展开了

spineNet

发表于 2021-01-28 |

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

  1. 动机

    • object detection task
      • requiring simultaneous recognition and localization
      • solely encoder performs not well
      • while encoder-decoder architectures are ineffective
    • propose SpineNet

      • scale-permuted intermediate features
      • cross-scale connections
      • searched by NAS on detection COCO
      • can transfer to classification tasks
      • 在轻量和重量back的一阶段网络中都涨点领先

  2. 论点

    • scale-decreasing backbone
      • throws away the spatial information by down-sampling
      • challenging to recover
      • 接一个轻量的FPN:
    • scale-permuted model
      • scales of features can increase/decrease anytime:retain the spacial information
      • connections go across scales:multi-scale fusion
      • searched by NAS
      • 是一个完整的FPN,不是encoder-decoder那种可分的形式
      • directly connect to classification and bounding box regression subnets
      • base on ResNet50
        • use bottleneck feature blocks
        • two inputs for each feature blocks
        • roughly the same computation
  3. 方法

    • formulation

      • overall architecture
        • stem:scale-decreased architecture
        • scale-permuted network
        • blocks in the stem network can be candidate inputs for the following scale-permuted network
      • scale-permuted network
        • building blocks:$B_k$
        • feature level:$L_3 - L_7$
        • output features:1x1 conv,$P_3 - P_7$
    • search space

      • scale-permuted network:

        • block只能从前往后connect
        • based on resNet blocks
        • channel 256 for $L_5, L_6, L_7$
      • cross-scale connections:

        • two input connections for each block

        • from lower ordering block / stem

        • resampling

          • narrow factor $\alpha$:1x1 conv
          • 上采样:interpolation
          • 下采样:3x3 s2 conv
          • element-wise add

      • block adjustment

        • intermediate blocks can adjust its scale level & type
        • level from {-1, 0, 1, 2}
        • select from bottleneck / residual block
    • family of models

      • R[N] - SP[M]:N feature layers in stem & M feature layers in scale-permuted layers
      • gradually shift from stem to SP
      • with size decreasing

    • spineNet family

      • basic:spineNet-49
      • spineNet-49S:channel数scaled down by 0.65
      • spineNet-96:double the number of blocks
      • spineNet-143:repeat 3 times,fusion narrow factor $\alpha=1$
      • spineNet-190:repeat 4 times,fusion narrow factor $\alpha=1$,channel数scaled up by 1.3
  4. 实验

    • 在mid/heavy量级上,比resnet-family-FPN涨出两个点

    • 在light量级上,比mobileNet-family-FPN涨出一个点

guided anchoring

发表于 2021-01-27 |

原作者知乎reference:https://zhuanlan.zhihu.com/p/55854246

  • 不完全是anchor-free,因为还是有decision grid to choose from的,应该说是adaptive anchor instead of hand-picked
  • 为了特征和adaptive anchor对齐,引入deformable conv

Region Proposal by Guided Anchoring

  1. 动机

    • most methods
      • predefined anchors
      • do a uniformed dense prediction
    • our method
      • use sematic features to guide the anchoring
      • anchor size也是网络预测参数,compute from feature map
      • arbitrary aspect ratios
    • feature inconsistency
      • 不同的anchor loc都是对应feature map上某一个点
      • 变化的anchor size和固定的位置向量之间存在inconsistency
      • 引入feature adaption module
    • use high-quality proposals
      • GA-RPN提升了proposal的质量
      • 因此我们对proposal进入stage2的条件更严格
    • adopt in Fast R-CNN, Faster R-CNN and RetinaNet均涨点
      • RPN提升显著:9.1
      • MAP也有涨点:1.2-2.7
    • 还可以boosting trained models
      • boosting a two-stage detector by a fine-tuning schedule
  2. 论点

    • alignment & consistency

      • 我们用feature map的pixels作为anchor representations,那么anchor centers必须跟feature pixels保持align

      • 不同pixel的reception field必须跟对应的anchor size保持匹配

      • previous sliding window scheme对每个pixel都做一样的操作,用同样一组anchor,因此是align和consist的
      • previous progressively refining scheme对anchor的位置大小做了refinement,ignore the alignment & consistency issue,是不对的!!
    • disadvantage of predefined anchors

      • hard hyperparams
      • huge pos/neg imbalance & computation
    • we propose GA-RPN

      • learnable anchor shapes to mitigate the hand-picked issue
      • feature adaptation to solve the consistency issue
      • key concerns in this paper
        • learnable anchors
        • joint anchor distribution
        • alignment & consistency
        • high-quality proposals
  3. 方法

    • formulation

      • $p(x,y,w,h|I) = p(x,y|I)p(w,h|x,y,I)$
      • 将问题解耦成位置和尺寸的预测,首先anchor的loc服从full image的均匀分布,anchor的size建立在loc存在的基础上
      • two branches for loc & shape prediction
        • loc:binary classification,hxwx1
        • shape:location-dependent shapes,hxwx2
        • anchors:loc probabilities above a certain threshold & corresponding ‘most probable’ anchor shape
      • multi-scale
        • the anchor generation parameters are shared
      • feature adaptation module

        • adapts the feature according to the anchor shape

    • anchor location prediction

      • indicates the probability of an object’s center
      • 一层卷积:1x1 conv,channel1,sigmoid
      • transform back:each grid(i,j) corresponds to coords ((i+0.5)*s, (j+0.5)*s) on the origin map
      • filter out 90% of the regions
      • thus replace the ensuing conv layers by masked convs
      • ground truth
        • binary label map
        • each level:center region & ignore region & outside region,基于object center的方框
          • $\sigma_1=0.2,\sigma_2=0.5$:region box的长宽系数
          • ???用centerNet的heatmap会不会更好???
      • focal loss $L_{loc}$
    • anchor shape prediction
      • predicts the best shape for each location
      • best shape:a shape that lead to best iou with the nearest gt box
      • 一层卷积:1x1 conv,channel2,[-1,1]
      • transform layer:transform direct [-1,1] outputs to real box shape
        • $w = \sigma s e^{dw}$
        • $h = \sigma s e^{dh}$
        • s:stride
        • $\sigma$:经验参数,8 in experiments
      • set 9 pairs of (w,h) as RetinaNet,calculate the IoU of these sampled anchors with gt,take the max as target value
      • bounded iou loss:$L_{shape} = L_1(1-min(\frac{w}{w_g}, \frac{w_g}{w})) + L_1(1-min(\frac{h}{h_g}, \frac{h_g}{h}))$
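
      一个minimal sketch(非原论文实现),对应上面的指数变换和bounded iou loss,其中sigma、stride等超参只是示意:

      import torch
      import torch.nn.functional as F

      def decode_shape(dw, dh, stride, sigma=8.0):
          # w = sigma * s * e^dw, h = sigma * s * e^dh
          return sigma * stride * torch.exp(dw), sigma * stride * torch.exp(dh)

      def shape_loss(w, h, w_gt, h_gt):
          # L_shape = L1(1 - min(w/wg, wg/w)) + L1(1 - min(h/hg, hg/h))
          tw = 1 - torch.min(w / w_gt, w_gt / w)
          th = 1 - torch.min(h / h_gt, h_gt / h)
          zeros = torch.zeros_like(tw)
          return F.smooth_l1_loss(tw, zeros) + F.smooth_l1_loss(th, zeros)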
    • feature adaptation
      • intuition:the feature corresponding to different size of anchor shapes应该encode different content region
      • inputs:feature map & anchor shape
      • location-dependent transformation:3x3 deformable conv
      • deformable conv的offset是anchor shape得到的
      • outputs:adapted features
    • with adapted features
      • then perform further classification and bounding-box regression
    • training
      • jointly optimize:$L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg}$
      • $\lambda_1=0.2,\lambda_2=0.5$
      • each level of feature map should only target objects of a specific scale range:但是ASFF论文主张说这种arrange by scale的模式会引入前背景inconsistency??
    • High-quality Proposals
      • set a higher positive/negative threshold
      • use fewer samples

ASFF

发表于 2021-01-25 |

Learning Spatial Fusion for Single-Shot Object Detection

  1. 动机

    • inconsistency when fuse across different feature scales
    • propose ASFF
      • suppress the inconsistency
      • spatially filter conflictive information:想法应该跟SSE-block类似
    • build on yolov3

      • introduce a bag of tricks
      • anchor-free pipeline

  2. 论点
    • ssd is one of the first to generate pyramidal feature representations
      • deeper layers reuse the formers
      • bottom-up path
      • small instances suffer low acc because they contain insufficient semantic info
    • FPN use top-down path
      • shares rich semantics at all levels
      • improvement:more strengthening feature fusion
    • 在使用FPN时,通常不同scale的目标绑定到不同的level上面
      • inconsistency:其他level的feature map对应位置的信息则为背景
      • some methods set ignore region in adjacent features
  3. 方法

    • introduce advanced techniques
      • mixup
      • cosine learning rate schedule
      • sync-bn
      • an anchor-free branch to run jointly with anchor-based ones
      • L1 loss + IoU loss
    • fusion
      • 全连接而非adjacent merge:三个level的fuse map都来自三个level的feature map
      • 上采样:
        • 1x1 conv:对齐channel
        • upsamp with interpolation
      • 下采样:
        • s2:3x3 s2 conv
        • s4:maxpooling + 3x3 s2 conv
      • adaptive fusion
        • pixel level的reweight
        • shared across channels:hxwx1
        • 对来自三个level的feature map,resolution对齐以后,分别1x1conv,channel 1
        • norm the weights:softmax
        • 为啥能suppress inconsistency:对三个level同一位置的像素,学到“只激活一个、另外两个为0”的组合是绝对不harm的,相当于把上面ignore region的方法拓展成adaptive的
    • training
      • apply mixup on the classification pretraining of D53
      • turn off mixup augmentation for the last 30 epochs.
    • inference
      • the detection header at each level first predicts the shape of anchors???这个不太懂
    • ASFF & ASFF*
      • enhanced version of ASFF by integrating other lightweight modules
      • dropblock & RFB
  4. 实现

    # 以下代码来自原repo;conv_bn / conv_bias是repo里conv+bn、conv+bias的辅助构造函数
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASFF(nn.Module):
        def __init__(self, level, activate, rfb=False, vis=False):
            super(ASFF, self).__init__()
            self.level = level
            self.dim = [512, 256, 128]
            self.inter_dim = self.dim[self.level]
            if level == 0:
                self.stride_level_1 = conv_bn(256, self.inter_dim, kernel=3, stride=2)
                self.stride_level_2 = conv_bn(128, self.inter_dim, kernel=3, stride=2)
                self.expand = conv_bn(self.inter_dim, 512, kernel=3, stride=1)
            elif level == 1:
                self.compress_level_0 = conv_bn(512, self.inter_dim, kernel=1)
                self.stride_level_2 = conv_bn(128, self.inter_dim, kernel=3, stride=2)
                self.expand = conv_bn(self.inter_dim, 256, kernel=3, stride=1)
            elif level == 2:
                self.compress_level_0 = conv_bn(512, self.inter_dim, kernel=1, stride=1)
                self.compress_level_1 = conv_bn(256, self.inter_dim, kernel=1, stride=1)
                self.expand = conv_bn(self.inter_dim, 128, kernel=3, stride=1)
            compress_c = 8 if rfb else 16
            self.weight_level_0 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_level_1 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_level_2 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_levels = conv_bias(compress_c * 3, 3, kernel=1, stride=1, padding=0)
            self.vis = vis

        def forward(self, x_level_0, x_level_1, x_level_2):
            # 跟论文描述一样:上采样先1x1conv对齐,再upinterp,下采样3x3 s2 conv
            if self.level == 0:
                level_0_resized = x_level_0
                level_1_resized = self.stride_level_1(x_level_1)
                level_2_downsampled_inter = F.max_pool2d(x_level_2, 3, stride=2, padding=1)
                level_2_resized = self.stride_level_2(level_2_downsampled_inter)
            elif self.level == 1:
                level_0_compressed = self.compress_level_0(x_level_0)
                sh = torch.tensor(level_0_compressed.shape[-2:]) * 2
                level_0_resized = F.interpolate(level_0_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_1_resized = x_level_1
                level_2_resized = self.stride_level_2(x_level_2)
            elif self.level == 2:
                level_0_compressed = self.compress_level_0(x_level_0)
                sh = torch.tensor(level_0_compressed.shape[-2:]) * 4
                level_0_resized = F.interpolate(level_0_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_1_compressed = self.compress_level_1(x_level_1)
                sh = torch.tensor(level_1_compressed.shape[-2:]) * 2
                level_1_resized = F.interpolate(level_1_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_2_resized = x_level_2
            # 这里得到的resized特征图不直接转换成一通道的weighting map,
            # 而是先1x1conv降维到8/16,然后concat,然后1x1 conv生成3通道的weighting map
            # weighting map相当于一个prediction head,所以是conv_bias_softmax,无bn
            level_0_weight_v = self.weight_level_0(level_0_resized)
            level_1_weight_v = self.weight_level_1(level_1_resized)
            level_2_weight_v = self.weight_level_2(level_2_resized)
            levels_weight_v = torch.cat((level_0_weight_v, level_1_weight_v, level_2_weight_v), 1)
            levels_weight = self.weight_levels(levels_weight_v)
            levels_weight = F.softmax(levels_weight, dim=1)

            # reweighting
            fused_out_reduced = level_0_resized * levels_weight[:, 0:1, :, :] + \
                level_1_resized * levels_weight[:, 1:2, :, :] + \
                level_2_resized * levels_weight[:, 2:, :, :]

            # 3x3的conv,是特征图平滑
            out = self.expand(fused_out_reduced)

            if self.vis:
                return out, levels_weight, fused_out_reduced.sum(dim=1)
            else:
                return out

VoVNet

发表于 2021-01-22 |

An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

  1. 动机

    • denseNet
      • dense path:diverse receptive fields
      • heavy memory cost & low efficiency
    • we propose a backbone
      • preserve the benefit of concatenation
      • improve denseNet efficiency
      • VoVNet comprised of One-Shot Aggregation (OSA)
    • apply to one/two stage object detection tasks
      • outperforms denseNet & resNet based ones
      • better small object detection performance
  2. 论点

    • main difference between resNet & denseNet
      • aggregation:summation & concatenation
        • summation would washed out the early features
        • concatenation last as it preserves
    • GPU parallel computation
      • computing utilization is maximized when operand tensor is larger
      • many 1x1 convs for reducing dimension
      • dense connections in intermediate layers are inducing the inefficiencies
    • VoVNet

      • hypothesize that the dense connections are redundant
      • OSA:aggregates intermediate features at once
      • test as object detection backbone:outperforms DenseNet & ResNet with better energy efficiency and speed
    • factors for efficiency

      • FLOPS and model sizes are indirect metrics
      • energy per image and frame per second are more practical
      • MAC:
        • memory accesses cost,$hw(c_i+c_o) + k^2 c_ic_o$
        • memory usage不止跟参数量有关,还跟特征图尺寸相关
        • MAC can be minimized when input channel size equals the output
      • FLOPs/s
        • splitting a large convolution operation into several fragmented smaller operations makes GPU computation inefficient as fewer computations are processed in parallel
        • 所以depthwise/bottleneck理论上降低了计算量FLOP,但是从GPU并行的角度efficiency降低,并没有显著提速:cause more sequential computations
        • 以时间为单位的FLOPs才是fair的
  3. 方法

    • hypothesize

      • dense connection makes similar between neighbor layers
      • redundant
    • OSA

      • dense connection:former features concats in every following features
      • one-shot connection:former features concats once in the last feature

      • 最开始跟dense block保持参数一致:一个block里面12个layers、channel 20;后来发现深层特征contributes less,所以换成浅层:5个layers、channel 43,发现有涨点,implies that building deep intermediate feature via dense connection is less effective than expected

      • in/out channel数相同

        • much less MAC:
          • denseNet40:3.7M
          • OSA:5layers,channel43,2.5M
          • 对于higher resolution的detection任务impies more fast and energy efficient
        • GPU efficiency
          • 不需要那好几十个1x1
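
      一个minimal sketch(非原repo实现)的OSA block:若干3x3 conv顺序堆叠,所有中间特征只在最后concat一次,再用1x1 conv聚合:

      import torch
      import torch.nn as nn

      class OSABlock(nn.Module):
          def __init__(self, in_ch, stage_ch, out_ch, n_layers=5):
              super().__init__()
              self.layers = nn.ModuleList()
              ch = in_ch
              for _ in range(n_layers):
                  self.layers.append(nn.Sequential(
                      nn.Conv2d(ch, stage_ch, 3, padding=1, bias=False),
                      nn.BatchNorm2d(stage_ch), nn.ReLU(inplace=True)))
                  ch = stage_ch
              self.concat_conv = nn.Conv2d(in_ch + n_layers * stage_ch, out_ch, 1)

          def forward(self, x):
              feats = [x]
              for layer in self.layers:
                  x = layer(x)
                  feats.append(x)
              return self.concat_conv(torch.cat(feats, dim=1))   # one-shot aggregation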
    • architecture

      • stem:3个3x3conv
      • downsamp:s2的maxpooling
      • stages:increasing channels enables more rich semantic high-level information,better feature representation
      • deeper:makes more modules in stage3/4

  4. 实验

    • one-stage:refineDet
    • two-stage:Mask-RCNN

GCN

发表于 2021-01-18 |

reference:https://mp.weixin.qq.com/s/SWQHgogAP164Kr082YkF4A

  1. 图

    • $G = (V,E)$:节点 & 边,连通图 & 孤立点
    • 邻接矩阵A:NxN,有向 & 无向
    • 度矩阵D:NxN对角矩阵,每个节点连接的节点
    • 特征矩阵X:NxF,每个1-dim F是每个节点的特征向量
  2. 特征学习

    • 可以类比CNN:对其邻域(kernel)内特征进行线性变换(w加权),然后求和,然后激活函数
    • $H^{k+1} = f(H^{k},A) = \sigma(AH^{k}W^{k})$
      • H:running updating 特征矩阵,NxFk
      • A:0-1邻接矩阵,NxN
      • W:权重,$F_k$x$F_{k+1}$
    • 权重所有节点共享
    • 节点的邻接节点可以看做感受野
    • 网络加深,感受野增大:节点的特征融合了更多节点的信息
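
    一个minimal sketch(示意用):$H^{k+1} = \sigma(AH^{k}W^{k})$的单层前向,A这里直接用带自环、已归一化的邻接矩阵代替:

      import numpy as np

      def gcn_layer(A_hat, H, W):
          """A_hat: 归一化邻接矩阵(N,N); H: 特征矩阵(N,F_in); W: 权重(F_in,F_out)"""
          return np.maximum(A_hat @ H @ W, 0)        # ReLU激活

      N, F_in, F_out = 4, 8, 16
      A_hat = np.eye(N)                              # 假设的邻接矩阵(仅示意)
      H = np.random.randn(N, F_in)
      W = np.random.randn(F_in, F_out) * 0.1
      H_next = gcn_layer(A_hat, H, W)                # (4, 16)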
  3. 图卷积

    • A中没有考虑自己的特征:添加自连接

      • A = A + I
    • 加法规则对度大的节点,特征会越来越大:归一化

      • 使得邻接矩阵每行和为1:左乘度矩阵的逆

      • 数学实质:求平均

      • one step further:不单对行做平均,对度较大的邻接节点也做punish

    • GCN网络

  4. 实现

    • weights:in x out,kaiming_uniform_initialize

    • bias:out,zero_initialize

    • activation:relu

    • A x H x W:左乘是稀疏矩阵乘法

    • 邻接矩阵的结构从输入开始就不变了,和每层的特征矩阵一起作为输入,传入GCN

    • 分类头:最后一层预测Nxn_class的特征向量,提取感兴趣节点F(n_class),然后softmax,对其分类

    • 归一化

      import numpy as np
      import scipy.sparse as sp

      # 对称归一化
      def normalize_adj(adj):
          """compute L = D^-0.5 * (A+I) * D^-0.5"""
          adj += sp.eye(adj.shape[0])
          degree = np.array(adj.sum(1))
          d_hat = sp.diags(np.power(degree, -0.5).flatten())
          norm_adj = d_hat.dot(adj).dot(d_hat)
          return norm_adj


      # 均值归一化
      def normalize_adj(adj):
          """compute L = D^-1 * (A+I)"""
          adj += sp.eye(adj.shape[0])
          degree = np.array(adj.sum(1))
          d_hat = sp.diags(np.power(degree, -1).flatten())
          norm_adj = d_hat.dot(adj)
          return norm_adj
  5. 应用场景

    [半监督分类GCN]:SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS,提出GCN

    [skin GCN]:Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks,体素,一个单独的基于图的相关性分支,给feature加权

    [Graph Attention]:Graph Attention Networks,图注意力网络

Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks

  1. 动机

    • 皮肤病:发病率高,experts少
    • differential diagnosis:鉴别诊断,就是从众多疾病类别中跳出正确类别
    • still challenging:timely and accurate
    • propose a DLS(deep learning system)
      • clinical images
      • multi-label classification
      • 80 conditions,覆盖病种
      • labels incompleteness:用GCN建模成Co-occurrence supervision,benefit top5
  2. 论点

    • google的DLS
      • 26种疾病
      • 建模成multi-class classification problem:非0即1的多标签表达破坏了类别间的correlation
    • our DLS:GCN-CNN
      • multi-label classification task over 80 conditions
      • incomplete image labels:GCN that characterizes label co-occurrence supervision
      • combine the classification network with the GCN
      • 数据量:136,462 clinical images
      • 精度:test on 12,378 user taken images,top-5 acc 93.6%
    • GCN
      • original application:
        • nodes classification,only a small subset of nodes had their labels available:半监督文本分类问题,只有一部分节点用于训练
        • the graph structure is contructed from data
      • ML-GCN:
        • multi-label classification task
        • correlation map(图结构)则是通过数据直接建立
        • 图节点是每个类别的semantic embeddings
  3. 方法

    • overview

      • 一个trainable的CNN,将图片转化成feature vector
      • 一个GCN branch:两层图卷积,都是order-1,图结构是基于训练集计算,无向图,encoding的是图像labels之间的dependency,用它 implicitly supervises the classification task
      • 然后两个feature vector相乘,给出最终结果

    • GCN branch

      • two graph convolutional (GC) layers
      • 一种estimated图结构:build co-occurence graph using only training data
        • node embed semantic meaning to labels
        • 边的值定义有点像类别间的相关性强度:$e_{ij} = 1(\frac{C(i,j)}{C(i)+C(j)} \geq t)$,分子是有两种标签的样本量,分母是各自样本量
      • 一种designed图结构:intial value是基于有经验的专家构建
      • node representation
        • graph branch的输入 label embedding
        • 用了BioSentVec,一个基于生物医学语料库训练的word bag
      • GCN
        • randomly initialize
        • GCN-0:dim 700
        • GCN-1:dim 1024
        • GCN-2:dim 2048
        • 最终得到(cls,2048)的node features
    • cls branch

      • input:downsized to 448x448
      • resnet101:执行到FC-2048,作为image features
      • 先训练300 epochs,lr 0.1,step decay
    • GCN-CNN

      • 先预训练resnet backbone,
      • 然后整体一起训练300 epochs,lr 0.0003,
      • image feature和node features通过dot product融合,得到(cls, )的cls vec,
  4. 实验

    • 图结构不能random initialization,会使结果变差
    • 基于数据集估计的graph initialization有显著提升

    • 基于专家设计的graph initialization有进一步提升,但是不明显,考虑到标注工作繁重不太推荐

SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

  1. reference

    • http://tkipf.github.io/graph-convolutional-networks/,官方博客
    • https://zhuanlan.zhihu.com/p/35630785,知乎笔记
  2. 论点

    • 场景
      • semi-supervised learning
      • on graph-structured data
      • 比如:在一个citation network,classifying nodes (such as documents),labels are only available for a small subset of nodes,任务的目标是对大部分未标记的节点预测类别
    • previous approach
      • Standard Approach
        • loss由两部分组成:单个节点的fitting error,和相邻节点的distance error
        • 基于一个假设:相邻节点间的label相似
        • 限制了模型的表达能力
      • Embedding-based Approach
        • 分两步进行:先学习节点的embedding,再基于embedding训练分类器
        • 不end-to-end,两个task分别执行,不能保证学到的embedding是适合第二个任务的
    • 思路
      • train on a supervised target for nodes with labels
      • 然后通过图的连通性,trainable adjacency matrix,传递梯度给unlabeled nodes
      • 使得全图得到监督信息
    • contributions
      • introduce a layer-wise propagation rule,使得神经网络能够operate on graph,实现end-to-end的图结构分类器
      • use this graph-based neural network model,训练一个semi-supervised classification of nodes的任务
  3. 方法

    • fast approximate convolutions on graphs

      • given:

        • layer input:$H^l$
        • layer output:$H^{l+1}$
        • kernel pattern:$A$,在卷积里面是fixed kxk 方格,在图里面就是自由度更高的邻接矩阵
        • kernel weights:$W$
      • general layer form:$H^{l+1}=f(H^l,A)$

      • inspiration:卷积其实是一种特殊的图,每个grid看作一个节点,每个节点都加上其邻居节点的信息,也就是:

        • W是在对grids加权
        • A是在对每个grids加上他的邻接节点

      • details in practice

        • 自环:保留自身节点信息,$\hat A=A+I$
        • 正则化:stabilize the scale,$H^{l+1}=\sigma(\hat D^{-\frac{1}{2}}\hat A\hat D^{-\frac{1}{2}}H^lW)$
        • 一个实验:只利用图的邻接矩阵,就能够学得效果不错

    • semi-supervised node classification

      • 思路就是在所有有标签节点上计算交叉熵loss
      • 模型结构
        • input:X,(b,N,D)
        • 两层图卷积
          • GCN1-relu:hidden F,(b,N,F)
          • GCN2-softmax:output Z,(b,N,cls)
        • 计算交叉熵
  4. code

    • torch/keras/tf官方都有:
      • https://github.com/tkipf/gcn,论文里给的tf这个链接
      • torch和keras的readme里面有说明,initialization scheme, dropout scheme, and dataset splits和tf版本不同,不是用来复现论文
      • python setup.py bdist_wheel
    • 数据集:Cora dataset,是一个图数据集,用于分类任务,数据集介绍https://blog.csdn.net/yeziand01/article/details/93374216
      • cora.content是所有论文的独自的信息,总共2708个样本,每一行都是论文编号+词向量1433-dim+论文类别
      • cora.cites是论文之间的引用记录,A to B的reflect pair,5429行,用于创建邻接矩阵

transformers

发表于 2021-01-18 |

startup

reference1:https://mp.weixin.qq.com/s/Rm899vLhmZ5eCjuy6mW_HA

reference2:https://zhuanlan.zhihu.com/p/308301901

  1. NLP & RNN

    • 文本涉及上下文关系
    • RNN时序串行,建立前后关系
    • 缺点:对超长依赖关系失效,不好并行化
  2. NLP & CNN

    • 文本是1维时间序列
    • 1D CNN,并行计算
    • 缺点:CNN擅长局部信息,卷积核尺寸和长距离依赖的balance
  3. NLP & transformer

    • 对流入的每个单词,建立其对词库的权重映射,权重代表attention
    • 自注意力机制
    • 建立长距离依赖

  4. put in CV

    • 插入类似的自注意力层
    • 完全抛弃卷积层,使用Transformers
  5. RNN & LSTM & GRU cell

    • 标准要素:输入x、输出y、隐层状态h

    • RNN

      • RNN cell每次接收一个当前输入$x_t$,和前一步的隐层输出$h_{t-1}$,然后产生一个新的隐层状态$h_t$,也是当前的输出$y_t$
      • formulation:$y_t, h_t = f(x_t, h_{t-1})$
      • same parameters for each time step:同一个cell每个time step的权重共享
      • 一个问题:梯度消失/爆炸

        • 考虑hidden states’ chain的简化形式:$h_t = \theta^t h_0$,一个sequence forward下去就是same weights multiplied over and over again
        • 另外tanh也是会让神经元梯度消失/爆炸

    • LSTM

      • key ingredient

        • cell:增加了一条cell state workflow,优化梯度流
        • gate:通过门结构删选携带信息,优化长距离关联

      • 可以看到LSTM的循环状态有两个:细胞状态$c_t$和隐层状态$h_t$,输出的$y_t$仍旧是$h_t$

    • GRU

      • LSTM的变体,仍旧是门结构,比LSTM结构简单,参数量小,据说更好训练

  6. papers

    [一个列了很多论文的主页] https://github.com/dk-liang/Awesome-Visual-Transformer

    [经典考古]

    ​ * [Seq2Seq 2014] Sequence to Sequence Learning with Neural Networks,Google,最早的encoder-decoder stacking LSTM用于机翻

    ​ * [self-attention/Transformer 2017] Transformer: Attention Is All You Need,Google,

    ​ * [bert 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,Google,NLP,输入single sentence/patched sentences,用Transformer encoder提取bidirectional cross sentence representation,用输出的第一个logit进行分类

    [综述]

    ​ * [综述2020] Efficient Transformers: A Survey,Google,

    ​ * [综述2021] Transformers in Vision: A Survey,迪拜,

    ​ * [综述2021] A Survey on Visual Transformer,华为,

    [classification]

    ​ * [ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE,Google,分类任务,用transformer的encoder替换CNN再加分类头,每个feature patch作为一个input embedding,channel dim是vector dim,可以看到跟bert基本一样,就是input sequence换成patch,后续基于它的提升有DeiT、LV-ViT

    ​ * [BotNet 2021] Bottleneck Transformers for Visual Recognition,Google,将CNN backbone最后几个stage替换成MSA

    ​ * [CvT 2021] CvT: Introducing Convolutions to Vision Transformers,微软,

    ​ * [Swin 2021] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,微软

    ​ * [PVT2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,跟swin一样也是multi-scale features

    [detection]

    ​ * [DeTR 2020] DeTR: End-to-End Object Detection with Transformers,Facebook,目标检测,CNN+transformer(en-de)+预测头,每个feature pixel作为一个input embedding,channel dim是vector dim

    ​ * [Deformable DETR]

    ​ * [Anchor DETR]

    ​ * 详见《det-transformers》

    [segmentation]

    ​ * [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,复旦,水,感觉就是把FCN的back换成transformer

    [Unet+Transformer]:

    ​ * [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation,英伟达,直接使用transformer encoder做unet encoder

    ​ * [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,encoder stream里面加transformer block

    ​ * [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation,大学,CNN feature和Transformer feature进行bifusion

    ​ * 详见《seg-transformers》

Sequence to Sequence

  1. [a keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

    • general case

      • extract the information of the entire input sequence
      • then start generate the output sequence
    • seq2seq model workflow

      • a (stacking of) RNN layer acts as encoder
        • processes the input sequence
        • returns its own internal state:不要RNN的outputs,只要internal states
        • encoder编码得到的东西叫Context Vector
      • a (stacking of) RNN layer acts as decoder
        • given previous characters of the target sequence
        • it is trained to predict the next characters of the target sequence
        • teacher forcing:
          • 输入是target sequence,训练目标是使模型输出offset by one timestep的target sequence
          • 也可以不teacher forcing:直接把预测作为next step的输入
        • Context Vector的同质性:每个step,decoder都读取一样的Context Vector作为initial_state
      • when inference
        • 第一步获取input sequence的state vectors
        • repeat
          • 给decoder输入input states和out sequence(begin with a 起始符)
          • 从prediction中拿到next character
        • append the character to the output sequence
        • until:得到end character / hit the character limit
    • implementation

      https://github.com/AmberzzZZ/transformer/blob/master/seq2seq.py

  2. one step further

    • 改进方向
      • bi-directional RNN:粗暴反转序列,有效涨点
      • attention:本质是将encoder的输出Context Vector加权
      • ConvS2S:还没看
    • 主要都是针对RNN的缺陷提出
  3. 动机

    • present a general end-to-end sequence learning approach
      • multi-layered LSTMs
      • encode the input seq to a fix-dim vector
      • decode the target seq from the fix-dim vector
    • LSTM did not have difficulty on long sentences
    • reversing the order of the words improved performance

  4. 方法

    • standard RNN

      • given a sequence $(x_1, x_2, …, x_T)$

      • iterating:

      • 如果输入、输出的长度事先已知且固定,一个RNN网络就能建模seq2seq model了

      • 如果输入、输出的长度不同、并且服从一些更复杂的关系?就得用两个RNN网络,一个将input seq映射成fixed-sized vector,另一个将vector映射成output seq,but long-term-dependency issue

    • LSTM

      • LSTM是始终带着全部seq的信息的,如上图那样
    • our actual model

      • use two LSTMs:encoder-decoder能够增加参数量
      • an LSTM with four layers:deeper
      • input sequence倒序:真正的句首更接近trans的句首,makes it easy for SGD to establish communication
    • training details

      • LSTM:4 layers,1000 cells
      • word-embedding:1000-dim,(input vocab 160,000, output vocab 80,000)
      • naive softmax
      • uniform initialization:(-0.08, 0.08)
      • SGD,lr=0.7,half by every half epoch,total 7.5 epochs
      • gradient norm [10, 25]
      • all sentences in a minibatch are roughly of the same length

Transformer: Attention Is All You Need

  1. 动机

    • sequence2sequence models
      • encoder + decoder
      • RNN / CNN + an attention path
    • we propose Transformer
      • base solely on attention mechanisms
      • more parallelizable and less training time
  2. 论点

    • sequence modeling
      • 主流:RNN,LSTM,gated
        • align the positions to computing time steps
        • sequential本质阻碍并行化
      • Attention mechanisms acts as a integral part
        • in previous work used in conjunction with the RNN
      • 为了并行化
        • some methods use CNN as basic building blocks
        • difficult to learn dependencies between distant positions
    • we propose Transformer
      • rely entirely on an attention mechanism
      • draw global dependencies
    • self-attention
      • relating different positions of a single sequence
      • to generate a overall representation of the sequence
  3. 方法

    • encoder-decoder

      • encoder:doc2emb
        • given an input sequence of symbol representation $(x_1, x_2, …, x_n)$
        • map to a sequence of continuous representations $(z_1, z_2, …, z_n)$,(embeddings)
      • decoder:hidden layers
        • given embeddings z
        • generate an output sequence $(y_1, y_2, …, y_m)$ one element at a time
        • the previous generated symbols are served as additional input when computing the current time step
    • Transformer Architecture

      • Transformer use

        • for both encoder and decoder
        • stacked self-attention and point-wise fully-connected layers

      • encoder

        • N=6 identical layers
        • each layer has 2 sub-layers
          • multi-head self-attention mechanism
          • postision-wise fully connected layer
        • residual
          • for two sub-layers independently
          • add & layer norm
        • d=512
      • decoder

        • N=6 identical layers
        • 3 sub-layers
          • [new] masked multi-head self-attention:combine了先验知识,output embedding只能基于在它之前的time-step的embedding计算
          • multi-head self-attention mechanism
          • postision-wise fully connected layer
        • residual
      • attention

        • reference:https://bbs.cvmart.net/articles/4032
        • step1:project embedding to query-key-value pairs
          • $Q = W_Q^{d\times d} A^{d\times N}$
          • $K = W_K^{d\times d} A^{d\times N}$
          • $V = W_V^{d\times d} A^{d\times N}$
        • step2:scaled dot-product attention
          • $A^{N\times N}=softmax(K^TQ/\sqrt{d})$
          • $B^{d\times N} = V^{d\times N}A^{N\times N}$
        • multi-head attention
          • 以上的step1&step2操作performs a single attention function
          • 事实上我们可以用多组projection得到多组$\{Q,K,V\}^h$,in parallel地执行attention运算,得到多组$\{B^{d\times N}\}^h$
          • concat & project
            • concat in d-dim:$B\in R^{(d\cdot h)\times N}$
            • linear project:$out = W^{d\times (d\cdot h)} B$
          • h=8
          • $d_{in}/h=64$:embedding的dim
          • $d_{out}=64$:query-key-value的dim
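
        一个minimal sketch(示意用),单组scaled dot-product attention;这里按更常见的(B, N, d)布局实现,与上文(d, N)的记号只是转置关系:

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            d = q.size(-1)
            attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, N, N)
            return attn @ v                                                # (B, N, d)

        # self-attention时Q、K、V来自同一输入的不同线性投影,这里简化为同一个张量
        q = k = v = torch.randn(2, 10, 64)
        out = scaled_dot_product_attention(q, k, v)                        # (2, 10, 64)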
      • positional encoding

        • 数学本质是一个hand-crafted的映射矩阵$W^P$和one-hot的编码向量$p$:

        • 用PE表示:$PE(pos, 2i) = sin(pos/10000^{2i/d}),\ PE(pos, 2i+1) = cos(pos/10000^{2i/d})$

          • pos是sequence x上的position
          • 2i和2i+1是embedding a上的idx
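
        一个minimal sketch(示意用)的sin/cos positional encoding矩阵:

        import numpy as np

        def positional_encoding(n_pos, d_model):
            pos = np.arange(n_pos)[:, None]                          # (n_pos, 1)
            i = np.arange(d_model)[None, :]                          # (1, d_model)
            angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
            pe = np.zeros((n_pos, d_model))
            pe[:, 0::2] = np.sin(angle[:, 0::2])                     # 偶数维用sin
            pe[:, 1::2] = np.cos(angle[:, 1::2])                     # 奇数维用cos
            return pe

        pe = positional_encoding(50, 512)                            # (50, 512)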
      • point-wise feed-forward network

        • fc-ReLU-fc
        • dim_fc=2048
        • dim_in & dim_out = 512
    • 运行过程

      • encoder是可以并行计算的

        • 输入是sequence embedding和positional embedding:$A\in R^{d*N}$
        • 经过repeated blocks
        • 输出是另外一个sequence:$B\in R^{d*N}$
        • self-attention:Q、K、V是一个东西
        • encoder的本质就是在解析自注意力:
        • 并行的全局两两比较,一步到位
          • RNN要by step
        • CNN要stack layers
  • decoder在训练阶段是可以并行的,在inference阶段by step

    • 输入是encoder的输出和上一个time-step decoder的输出embedding

    • 输出是当前time-step对应position的输出词的概率

    • 第一个attention layer是out embedding的self-attention:要实现像RNN一样依次解码出来,每个time step要用到上一个位置的输出作为输入——masking

      • given输入sequence是 <start> I have a cat,5个元素
      • 那么mask就是$R^{5*5}$的下三角矩阵
      • 输入embedding经过transformation变成Q、K、V三个矩阵

      • 仍旧是$A=K^TQ$计算attention

      • 这里有一些attention是非法的:位置靠前的query只能用到比它位置更靠前的query,因此要乘上mask矩阵(逐元素相乘):$A=M \odot A$

      • softmax:$A=softmax(A)$

      • scale:$B = VA$

      • concat & projection
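
      一个minimal sketch(示意用)的下三角mask,非法位置在softmax前置成-inf,和上面“乘上mask矩阵”等价地起到屏蔽作用:

      import torch
      import torch.nn.functional as F

      N = 5                                                   # e.g. "<start> I have a cat"
      mask = torch.tril(torch.ones(N, N))                     # 下三角,1表示可见
      scores = torch.randn(N, N)                              # K^T Q / sqrt(d)
      scores = scores.masked_fill(mask == 0, float('-inf'))   # 屏蔽非法attention
      attn = F.softmax(scores, dim=-1)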

        • 第二个attention layer是in & out sequence的注意力,其key和value来自encoder,query来自上一个decoder block的输出

  1. why self-attention

    • 衡量维度
      • total computational complexity per layer
      • amount of computation that can be parallelized
      • path-length between long-range dependencies
    • given input sequence with length N & dim $d_{in}$,output sequence with dim $d_{out}$
      • RNN need N sequential operations of $W\in R^{d_{in} \times d_{out}}$
      • CNN need N/k stacking layers of $d_{in} \times d_{out}$ sequence operations of $W\in R^{k\times k}$,generally是RNN的k倍
  2. training

    • optimizer:$Adam(lr, \beta_1=0.9, \beta_2=0.98, \epsilon=10^{-9})$
    • lrschedule:warmup by 4000 steps,then decay
    • dropout

      • residual dropout:对每个sub-layer的输出先dropout,再做residual add和layer norm
      • dropout to the sum of embeddings & PE for both encoder and decoder
      • drop_rate = 0.1
    • label smoothing:smooth_factor = 0.1

  3. 实验

    • A:vary the number of attention heads,发现多了少了都hurts
    • B:reduce the dim of attention key,发现hurts
    • C & D:大模型+dropout helps
    • E:learnable & sincos PE:nearly identical
    • 最后是big model的参数

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  1. 动机

    • BERT:Bidirectional Encoder Representations from Transformers
      • Bidirectional
      • Encoder
      • Representations
      • Transformers
    • workflow
      • pretrain bidirectional representations from unlabeled text
      • tune with one additional output layer to obtain the model
    • SOTA
      • GLUE score 80.5%
  2. 论点

    • pretraining is effective in NLP tasks
      • feature-based method:use task-specfic architectures,仅使用pretrained model的特征
      • fine-tuining method:直接fine-tune预训练模型
      • 两种方法在预训练阶段训练目标一致:use unidirectional language models to learn general language representations
      • reduce the need for many heavily-engineered task- specific architectures
    • current methods’ limitations
      • unidirectional:
        • limit the choice of architectures
        • 事实上token的上下文都很重要,不能只看上文
      • 简单的concat两个independent的L2R和R2L模型(biRNN)
        • independent
        • shallow concat
    • BERT
      • masked language model:在一个sequence中预测被遮挡的词
      • next sentence prediction:trains text-pair representations
  3. 方法

    • two steps

      • pre-training
        • unlabeled data
        • different pretraining tasks
      • fine-tuning
        • labeled data of the downstream tasks
        • fine-tune all the params
      • 两个阶段的模型,只有输出层不同

        • 例如问答模型
        • pretraining阶段,输入是两个sentence,输入的起始有一个CLS symbol,两个句子的分隔有一个SEP symbol
        • fine-tuning阶段,输入分别是问和答,【输出是啥?】

    • architecture

      • multi-layer bidirectional Transformer encoder

        • number of transfomer blocks L
        • hidden size H
        • number of self-attention heads A
        • FFN dim 4H
      • Bert base:L=12,H=768,A=12

      • Bert large:L=24,H=1024,A=16

      • input/output representations

        • a single sentence / two packed up sentence:
          • 拼接的sentence用特殊token SEP衔接
          • segment embedding:同时add a learned embedding to every token indicating who it belongs
        • use WordPiece embeddings with 30000 token vocabulary
        • 输入sequence的第一个token永远是一个特殊符号CLS,它对应的final state输出作为sentence整体的representation,用于分类任务
        • overall网络的input representation是通过将token embeddings拼接上上特殊符号,加上SE和PE得到

    • pre-training

      • two unsupervised tasks
        • Masked LM (MLM)
          • mask some percentage of the input tokens at random:15%
            • 80%的概率用MASK token替换
            • 10%的概率用random token替换
            • 10%的概率unchanged
          • then predict those masked tokens
          • the final hidden states corresponding to the masked tokens are fed into a softmax
          • 相比较于传统的left2right/right2left/concat模型
            • 既有前文又有后文
            • 只预测masked token,而不是全句预测
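
          一个minimal sketch(示意用)的15%采样 + 80/10/10替换策略,其中mask_id、vocab_size等都是假设的占位:

          import random

          def mask_tokens(tokens, mask_id, vocab_size, p=0.15):
              labels = [-100] * len(tokens)              # -100表示该位置不计loss
              for i in range(len(tokens)):
                  if random.random() < p:
                      labels[i] = tokens[i]              # 只在被选中的位置预测原token
                      r = random.random()
                      if r < 0.8:
                          tokens[i] = mask_id            # 80%: 替换成[MASK]
                      elif r < 0.9:
                          tokens[i] = random.randrange(vocab_size)   # 10%: 随机token
                      # 剩余10%: 保持不变
              return tokens, labels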
        • Next Sentence Prediction (NSP)
          • 对于relationship between sentences:
            • 例如question&answer,句子推断
            • not direatly captured by language modeling,模型直观学习的是token relationship
          • binarized next sentence prediction task
            • 选取sentence A&B:
              • 50%的概率是真的上下文(IsNext)
              • 50%的概率是random(NotNext)
            • 构成了一个二分类问题:仍旧用CLS token对应的hidden state C来预测
    • fine-tuning

      • BERT兼容many downstream tasks:single text or text pairs
      • 直接组好输入,end-to-end fine-tuning就行
      • 输出还是用CLS token对应的hidden state C来预测,接分类头

A Survey on Visual Transformer

  1. 动机

    • provide a comprehensive overview of the recent advances in visual transformers
    • discuss the potential directions for further improvement
    • develop timeline

  2. 按照应用场景分类

    • backbone:分类
    • high/mid-level vision:通常是语义相关的,检测/分割/姿态估计
    • low-level vision:对图像本身进行操作,超分/图像生成,目前应用较少
    • video processing

  3. revisiting transformer

    • key-concepts:sentence、embedding、positional encoding、encoder、decoder、self-attention layer、encoder-decoder attention layer、multi-head attention、feed-forward neural network

    • self-attention layer

      • input vector is transformed into 3 vectors
        • input vector is embedding+PE(pos,i):pos是word在sequence中的位置,i是PE-element在embedding vec中的位置
        • query vec q
        • key vec k
        • value vec v
        • $d_q = d_k = d_v = d_{model} = 512$
      • then calculate:$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
      • encoder-decoder attention layer
        • K和V是从encoder中拿到
        • Q是从前一层拿到
        • 计算是相似的
    • multi-head attention
      • 一个attention是一个softmax,对应了一对强相关,同时抑制了其他word的相关性
      • 考虑一个词往往与几个词强相关,这就需要多个attention
      • multi-head:different QKV matrices are used for different heads
      • given a input vector,the number of heads h
        • 先产生h个 pairs
        • $d_q=d_k=d_v=d_{model}/h=64$
        • 这h个pair,分别计算attention vector,得到h个[b,d]的context vector
        • concat along-d-axis and linear projection to final [b,d] vector
    • residual & layer-norm:layer-norm在residual-add以后
    • feed-forward network
      • fc-GeLU-fc
      • $d_h=2048$
    • final-layer in decoder
      • dense+softmax
      • $d_{words}=$ number of words in the vocabulary
    • when applied in CV tasks
      • most transformers adopt the original transformer’s encoder module
      • used as a feature selector
      • 相比较于CNN,能够capture long-distance characteristics,derive global information
      • 相比较于RNN,能够并行计算
    • 计算量
      • 首先是三个线性层:线性时间复杂度O(n),计算量与$d_{model}$成正比
      • 然后是self-attention层:QKV矩阵乘法运算,平方时间复杂度O(n^2)
      • multi-head的话,还有一个线性层:平方时间复杂度O(n^2)
  4. revisiting transformers for NLP

    • 最早期的RNN + attention:rnn的sequential本质影响了长距离/并行化/大模型
    • transformer的solely attention结构:解决以上问题,促进了large pre-trained models (PTMs) for NLP

    • BERT and its variants

      • are a series of PTMs built on the multi-layer transformer encoder architecture
      • pre-trained
        • Masked language modeling
        • Next sentence prediction
      • fine-tuned
        • add an output layer
    • Generative Pre-trained Transformer models (GPT)
      • are another type of PTMs based on the transformer decoder architecture
      • masked self-attention mechanisms
      • pre-trained
        • 与BERT最大的不同是有向性
  5. visual transformer

    • 【category1】: backbone for image classification

      • transformer的输入是tokens,在NLP里是embedding形式的分词序列,在CV里就是representing a certain semantic concept的visual token
        • visual token可以来自CNN的feature
        • 也可以直接来自image的小patch
      • purely use transformer来做image classification任务的模型有iGPT、ViT、DeiT

      • iGPT

        • pretraining stage + finetuning stage
        • pre-training stage
          • self-supervised:自监督,所以结果较差
          • given an unlabeled dataset
          • train the model by minimizing the -log(density),感觉是在force光栅排序正确
        • fine-tuning stage
          • average pool + fc + softmax
          • jointly train with L_gen & L_CE
      • ViT
        • pre-trained on large datasets
          • standard transformer’s encoder + MLP head
          • treats all patches equally
          • 有一个类似BERT class token的东西
            • 从训练的角度,gather knowledge of the entire class
            • inference的时候,只拿了这第一个logit用来做预测
        • fine-tuning
          • 换一个zero-initialized的MLP head
          • use higher resolution & 插值pe
      • DeiT
        • Data-efficient image transformer
        • better performance with
          • a more cautious training strategy
          • and a token-based distillation
    • 【category2】: High/Mid-level Vision

    • 【category3】: Low-level Vision

    • 【category4】: Video Processing

    • efficient transformer:瘦身&加速

      • Pruning and Decomposition
      • Knowledge Distillation
      • Quantization
      • Compact Architecture Design

ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

  1. 动机

    • attention in vision
      • either in conjunction with CNN
      • or replace certain part of a CNN
      • overall都还是CNN-based
    • use a pure transformer to sequence of image patches
    • verified on image classification tasks in supervised fashion
  2. 论点

    • transformer lack some inductive biases inherent to CNNs,所以在insufficient data上not generalize well
    • however large scale training trumps inductive bias,大数据集上ViT更好
    • naive application of self-attention
      • 建立pixel之间的两两关联:计算量太大了
      • 需要approximation:local/改变size
    • we use transformer
      • wih global self-attention
      • to full-sized images
  3. 方法

    • input 1D-embedding sequence

      • 将image $x\in R^{H\times W\times C}$ 展开成patches $\{x_p \in R^{P^2 \cdot C}\}$
      • thus sequence length $N=HW/P^2$
      • patch embedding:
        • use a trainable linear projection
        • fixed dimension size through-all
      • position embedding:
        • add to patch embedding
        • standard learnable 1D position embedding
      • prepended embedding:
        • 前置的learnable embedding $x_{class}$
        • similar to BERT’s class token
      • 以上三个embedding组合起来,作为输入sequence
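
      一个minimal sketch(非原论文实现)的patch embedding:用kernel=stride=P的conv等价实现“切patch + 线性投影”,再前置class token、加可学习的1D position embedding:

      import torch
      import torch.nn as nn

      class PatchEmbed(nn.Module):
          def __init__(self, img=224, patch=16, in_ch=3, dim=768):
              super().__init__()
              n = (img // patch) ** 2
              self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
              self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
              self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))

          def forward(self, x):
              x = self.proj(x).flatten(2).transpose(1, 2)          # (b, N, dim)
              cls = self.cls_token.expand(x.size(0), -1, -1)
              return torch.cat([cls, x], dim=1) + self.pos_embed   # (b, N+1, dim)

      tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))           # (1, 197, 768)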
    • transformer encoder

      • follow the original Transformer
      • 交替的MSA和MLP
      • layer norm LN
      • residual
      • GELU

    • hybrid architecture

      • input sequence也可以来源于CNN的feature maps
      • patch size可以是1x1
    • classification head

      • attached to $z_L^0$:是class token用来做预测
      • pre-training的时候是MLP
      • fine-tuning的时候换一个zero-initialized的single linear layer
    • workflow

      • typically先pre-train on large datasets
      • 再fine-tune to downstream tasks
      • fine-tune的时候替换一个zero-initialized的新线性分类头
      • when feeding images with higher resolution
        • keep the patch size
        • results in larger sequence length
        • 这时候pre-trained PE就no longer meaningful了
        • we therefore perform 2D interpolation基于它在原图上的位置
    • training details

      • Adam:$\beta_1=0.9,\beta_2=0.999$
      • batch size 4096
      • high weight decay 0.1
      • linear lr warmup & decay
    • fine-tuning details

      • SGDM
      • cosine LR
      • no weight decay
      • average 0.9999:对模型权重做Polyak averaging(EMA),factor 0.9999

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  1. 动机

    • use Transformer as visual tasks’ backbone
    • challenges of Transformer in vision domain
      • large variations of scales of the visual entities
      • high resolution of pixels
    • we propose hierarchical Transformer
      • shifted windows
      • self-attention in local windows
      • cross-window connection
    • verified on
      • classification:ImageNet top1 acc 86.4
      • detection:COCO box-MAP 58.7
      • segmentation:ADE20K
      • this paper主要介绍分类,检测是以swin作为backbone,用MaskRCNN等二阶段架构来训练的,分割是以swin作为backbone,用UperNet去训练的,具体模型配置official repo的readme里面有详细列表
  2. 论点

    • when transfer Transformer’s high performance in NLP domain to CV domain
      • differences between the two modalities
        • scale:NLP里面,word tokens serves as the basic element,但是CV里面,patch的形态大小都是可变的,previous methods里面,都是统一设定固定大小的patch token
        • resolution:主要问题就是self-attention的计算复杂度,是image size的平方
      • we propose Swin Transformer
        • hierarchial feature maps
        • linear computational complexity to image size
    • hierarchical
      • start from small patches
      • merge in deeper layers
      • 所以对不同尺度的特征patch进行了融合
    • linear complexity

      • compute self-attention locally in each window
      • the number of patches per window is fixed, while the number of windows is proportional to image size
      • hence the overall cost is linear (see the complexity formulas below)
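      • concretely, from the Swin paper, for an $h \times w$ patch feature map with channel dim $C$ and window size $M$:
        • $\Omega(MSA)=4hwC^2+2(hw)^2C$
        • $\Omega(W\text{-}MSA)=4hwC^2+2M^2hwC$
        • the former is quadratic in $hw$, the latter linear once $M$ is fixed (default $M=7$)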

    • shifted window approach

      • shifting the windows between consecutive layers builds bridges between neighboring windows
      • all query patches within a window share the same key set, which makes memory access hardware-friendly and gives windowed attention its latency advantage over sliding-window attention

    • previous attempts at applying Transformers to vision

      • self-attention based backbone architectures
        • 将部分/全部conv layers替换成self-attention
        • 模型主体架构还是ResNet
        • slightly better acc
        • larger latency caused by self-att
      • self-attention complement CNNs
        • 作为additional block,给到backbone/head,提供长距离信息
        • 有些检测/分割网络也开始用了transformer的encoder-decoder结构
      • transformer-based vision backbones
        • 主要就是ViT及其衍生品
        • ViT requires large-scale training sets
        • DeiT introduces training strategies
        • 但是还存在high resolution计算量的问题
  3. 方法

    • overview

      • Swin-T:tiny version
      • step 1 is patch partition:
        • split the RGB image into non-overlapping patches
        • each patch is a token, the basic element
        • feature input dim: with patch size 4x4, dim = 4x4x3 = 48
      • then a linear embedding layer
        • re-projects the raw feature to a specified dimension
        • specified dimension C: default = 96
      • next come the Swin Transformer blocks
        • the number of tokens is maintained
      • patch merging layers are responsible for reducing the number of tokens
        • the first patch merging layer concats every 2x2 group of neighboring patches: a 4C-dim vector each
        • followed by a linear re-projection layer
        • number of tokens (resolution): (H/4*W/4)/4 = H/8*W/8, the same progression as in a regular CNN
        • token dims: 2C
        • followed by another stack of Transformer blocks
        • together these form stage 2 (and likewise stage 3, stage 4); a merging sketch follows below
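
A rough PyTorch sketch of such a patch merging layer (the official repo works on flattened [B, L, C] tokens; this version keeps the [B, H, W, C] layout for readability):

```python
# Sketch (PyTorch) of a patch merging layer: concat each 2x2 neighborhood (C -> 4C),
# then a linear reduction to 2C, halving the token resolution.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: [B, H, W, C], H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # [B, H/2, W/2, 4C]
        return self.reduction(self.norm(x))                           # [B, H/2, W/2, 2C]

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))                    # -> [1, 28, 28, 192]
```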
    • Swin Transformer blocks

      • compared with the original Transformer block, the only change is replacing the standard MSA with a window-based MSA

      • 原始的attention:global computation leads to quadratic complexity

      • window-based attention:

        • attention的计算只发生在每个window内部
        • non-overlapping partition
        • 很显然lacks connections across windows
      • shifted window partitioning in successive blocks

        • two consecutive attention blocks

        • the first uses the regular window partitioning strategy: starting from the top-left corner, e.g. M=4, window size 4x4 (each window contains 4x4 patches)

        • the second layer’s windows are shifted by M/2 in each direction relative to the previous layer

        • this introduces connections between neighboring non-overlapping windows of the previous layer

        • efficient computation

          • naively, the shifted partition yields windows of unequal size, which is bad for parallel computation
          • the paper instead cyclic-shifts the feature map toward the top-left, keeps the regular partition, masks the attention between sub-windows that are not actually adjacent, and reverses the shift afterwards (see the sketch after this list)

      • relative position bias

        • local attention is computed inside each MxM window, so the input sequence length (time-steps) is $M^2$
        • Q、K、V $\in R ^ {M^2 \times d}$
        • $Attention(Q,K,V)=Softmax(QK^T/\sqrt{d}+B)V$
        • B is the local relative position bias; in 2D the relative offset along each axis lies in [-M+1, M-1]
        • we therefore parameterize a smaller bias matrix $\hat B\in R ^{(2M-1)\times(2M-1)}$
        • values in $B \in R ^ {M^2\times M^2}$ are taken from $\hat B$
        • the learnt relative position bias can be used to initialize the fine-tuned model
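
A PyTorch sketch of the pieces above: window partition, the cyclic shift used by the shifted-window blocks (the attention mask is omitted), and the relative-position index that gathers $B$ from the small table $\hat B$. Names and the 3-head example are illustrative assumptions:

```python
# Sketch (PyTorch) of window partition, cyclic shift, and the relative-position index
# into the (2M-1)^2 bias table; the masking of non-adjacent sub-windows is omitted.
import torch

def window_partition(x, M):                        # x: [B, H, W, C] -> [B*num_windows, M, M, C]
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def shift_then_partition(x, M):
    # roll by M//2 so the new windows straddle the previous window borders;
    # the real implementation masks attention between wrapped-around sub-windows
    shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
    return window_partition(shifted, M)

def relative_position_index(M):
    coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # [2, M, M]
    coords = coords.flatten(1)                                   # [2, M*M]
    rel = coords[:, :, None] - coords[:, None, :]                # [2, M*M, M*M], each axis in [-(M-1), M-1]
    rel = rel.permute(1, 2, 0) + (M - 1)                         # shift to [0, 2M-2]
    return rel[..., 0] * (2 * M - 1) + rel[..., 1]               # flat index into the (2M-1)^2 table

M, num_heads = 7, 3
idx = relative_position_index(M)                                 # [49, 49]
bias_table = torch.zeros((2 * M - 1) ** 2, num_heads)            # learnable \hat B
B = bias_table[idx.view(-1)].view(M * M, M * M, num_heads)       # the bias added inside the softmax
```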
    • Architecture variants

      • base model:Swin-B,参数量对标ViT-B

      • Swin-T:0.25x,对标ResNet-50 (DeiT-S)

      • Swin-S:0.5x,对标ResNet-101

      • Swin-L:2x

      • window size:M=7

      • query dim:d=32,(每个stage的input sequence dim逐渐x2,heads num逐渐x2)

      • MLP:expansion ratio=4

      • channel number C: the embedding dim of the first stage (doubled in each subsequent stage)

      • hypers:

        • drop_rate:0.0
        • drop_path_rate:0.1

      • acc

  4. official repo: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md

    • keras官方也出了一版:https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py

    • model zoo

      | model | resolution | C | num_layers | num_heads | window_size |
      | --- | --- | --- | --- | --- | --- |
      | Swin-T | 224 | 96 | {2,2,6,2} | {3,6,12,24} | 7 |
      | Swin-S | 224 | 96 | {2,2,18,2} | {3,6,12,24} | 7 |
      | Swin-B | 224/384 | 128 | {2,2,18,2} | {4,8,16,32} | 7/12 |
      | Swin-L | 224/384 | 192 | {2,2,18,2} | {6,12,24,48} | 7/12 |

    • models/build.py

      • SwinTransformer & SwinMLP: the former is the model from the paper, whose basic block is the Transformer MSA plus MLP layers; the latter drops MSA and instead models the relationships across neighboring windows with MLPs implemented as conv1d.

DETR: End-to-End Object Detection with Transformers

  1. 动机

    • new task formulation:a direct set prediction problem
    • main ingredients
      • a set-based global loss
      • a transformer encoder-decoder architecture
      • removal of hand-designed components like NMS & anchors
    • acc & run-time on par with Faster R-CNN on COCO
      • significantly better performance on large objects
      • lower performances on small objects
  2. 论点

    • modern detectors run object detection in an indirect way

      • regression and classification are performed on grid cells / anchors / proposals
      • performance is constrained by the NMS mechanism, the anchor design, and the target-anchor matching scheme
    • end-to-end approach

      • the transformer’s self-attention explicitly models all pairwise interactions between elements, which implicitly provides the de-duplication ability of NMS
      • bipartite matching: a set loss function that matches predicted and gt boxes one-to-one; predictions are decoded in parallel
      • DETR does not require any customized layers, thus can be reproduced easily
      • extends to segmentation: a simple segmentation head trained on top of a pre-trained DETR
    • set prediction:to predict a set of bounding boxes and the categories for each

      • basic:multilabel classification
      • detection task has near-duplicates issues
      • set prediction是postprocessing-free的,它的global inference schemes能够avoid redundancy
      • usual loss:bipartite match
    • object detection

      • set-based loss
        • modern detectors use non-unique assignment rules together with NMS
        • bipartite matching是target和pred一一对应
  3. 方法

    • overall

      • three main components
        • a CNN backbone
        • an encoder-decoder transformer
        • a simple FFN
    • backbone

      • conventional r50
      • input:$[H_0, W_0, 3]$
      • output:$[H,W,C], H=\frac{H_0}{32}, W=\frac{W_0}{32}, C=2048$
    • transformer encoder

      • reduce channel dim to $d$ with a 1x1 conv ($d=256$ in the base DETR config)
      • collapse the spatial dimensions: feature sequence [d, HW], each spatial pixel acts as one feature
      • fixed positional encodings:
        • added to the input of each attention layer
        • concretely they are added to Q and K (not to V) at every attention layer, rather than only once to the input embedding
    • transformer decoder

      • the decoder takes N embeddings of dim d as input
        • called object queries: the model always predicts a fixed number N of objects
        • because the decoder is also permutation-invariant (everything is shared), the N input embeddings must be distinct
        • they are learnt positional encodings
        • added to the input of each attention layer
      • decodes the N objects in parallel
    • prediction FFN

      • a 3-layer MLP with ReLU activations
      • box prediction: normalized center coords & height & width
      • class prediction:
        • an additional class label $\varnothing$ denotes “no object”
    • auxiliary losses

      • each decoder layer后面都接一个FFN prediction和Hungarian loss
      • shared FFN
      • an additional shared LN to norm the inputs of FFN
      • three components of the loss
        • class loss:CE loss
        • box loss
          • GIOU loss
          • L1 loss
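
A sketch of the bipartite matching step using SciPy’s Hungarian solver; the cost weights and the omission of the GIoU term are simplifying assumptions, not DETR’s exact configuration:

```python
# Sketch of set-based matching: build a cost matrix between the N predictions and the
# ground-truth boxes from class probability and an L1 box term, then solve it optimally.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    # pred_probs: [N, num_classes+1]; pred_boxes / gt_boxes: normalized (cx, cy, w, h)
    cost_cls = -pred_probs[:, gt_labels]                                  # [N, num_gt]
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_l1 * cost_l1                              # (+ GIoU term in the paper)
    pred_idx, gt_idx = linear_sum_assignment(cost)                        # one-to-one assignment
    return pred_idx, gt_idx                                               # unmatched preds -> "no object"

probs = np.random.rand(100, 92); probs /= probs.sum(-1, keepdims=True)
pi, gi = hungarian_match(probs, np.random.rand(100, 4), np.array([3, 17]), np.random.rand(2, 4))
```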
    • technical details

      • AdamW:
        • initial transformer lr = $10^{-4}$
        • initial backbone lr = $10^{-5}$
        • weight decay = $10^{-4}$
      • Xavier init
      • imagenet-pretrained resnet weights with frozen batchnorm layers:r50 & r101,DETR & DETR-R101
      • a variant:
        • increase feature resolution version
        • remove stage5’s stride and add a dilation
        • DETR-DC5 & DETR-DC5-R101
        • improve performance for small objects
        • overall 2x computation increase
      • augmentation
        • resize input
        • random crop:with 0.5 prob then resize
      • transformer default dropout 0.1
      • lr schedule
        • 300 epochs
        • drop by factor 10 after 200 epochs
      • 4 images per GPU,total batch 64
    • for the segmentation task: panoptic segmentation

      • add a mask head on top of the decoder outputs
      • compute multi-head attention among
        • decoder box predictions
        • encoder outputs
      • generate M attention heatmaps per object
      • add a FPN styled CNN to recover resolution
      • pixel-wise argmax

UNETR: Transformers for 3D Medical Image Segmentation

  1. 动机

    • U-Net style architectures for medical segmentation
      • the encoder learns global context
      • the decoder utilizes these representations to predict the semantic outputs
      • the locality of CNNs limits long-range spatial dependencies
    • our method
      • use a pure transformer as the encoder
      • learn sequence representations of the input volume
      • global
      • multi-scale
      • encoder directly connects to decoder with skip connections
  2. 论点

    • unet结构
      • encoder用来提取全图特征
      • decoder用来recover
      • skip connections用来补充spatial information that is lost during downsampling
      • localized receptive fields:
        • disadvantage in capturing multi-scale contextual information
        • 如不同尺寸的脑肿瘤
        • 缓和手段:atrous convs,still limited
    • transformer
      • self-attention mechanism in NLP
        • highlight the important features of word sequences
        • learn its long-range dependencies
      • in ViT
        • an image is represented as a patch embedding sequence
    • our method
      • formulation
        • 1D seq2seq problem
        • use embedded patches
      • the first completely transformer-based encoder
    • other unet- transformer methods
      • 2D (ours 3D)
      • employ only in the bottleneck (ours pure transformer)
      • CNN & transformer in separate streams and fuse
  3. 方法

    • overview

    • transformer encoder

      • input:1D sequence of input embeddings
      • given 3D volume $x \in R^{HWDC}$
      • divide it into flattened uniform non-overlapping patches $x_p\in R^{L \times (N^3 C)}$
        • $L=HWD/N^3$:the sequence length
        • $N^3$:the patch resolution (N voxels per side)
      • linear projection to a K-dim embedding space, giving $E \in R^{L \times K}$, which remains constant throughout the transformer
      • 1D learnable positional embedding $E_{pos} \in R^{L \times K}$
      • 12 self-att blocks:MSA + MLP
    • decoder &skip connections
      • take the outputs of encoder blocks {3, 6, 9, 12}
      • reshape each back to a 3D volume $[\frac{H}{N},\frac{W}{N},\frac{D}{N},K]$
      • consecutive 3x3x3 conv+BN+ReLU
      • bottleneck
        • deconv by 2 to increase resolution
        • then concat with the previous resized feature
        • then jointly apply consecutive convs
        • then upsample with deconv…
      • once back at the original resolution, apply consecutive convs, then a 1x1x1 conv + softmax
    • loss
      • dice loss
        • dice: compute dice for each class channel, then average over classes
        • loss = 1 - dice
      • ce loss
        • for each voxel, compute the cross-entropy, then average over all voxels (a combined sketch follows below)
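
A minimal PyTorch sketch of this Dice + CE combination (class-averaged soft Dice plus voxel-averaged cross-entropy); the exact weighting and smoothing in UNETR may differ:

```python
# Sketch (PyTorch): soft Dice averaged over class channels plus voxel-wise cross-entropy,
# for softmax probabilities p and one-hot targets y.
import torch

def dice_ce_loss(p, y, eps=1e-6):
    # p, y: [B, num_classes, H, W, D]
    dims = (0, 2, 3, 4)                                         # sum over batch + spatial dims
    dice = (2 * (p * y).sum(dims) + eps) / (p.sum(dims) + y.sum(dims) + eps)
    dice_loss = 1 - dice.mean()                                 # class-averaged
    ce_loss = -(y * torch.log(p.clamp_min(eps))).sum(1).mean()  # voxel-averaged CE
    return dice_loss + ce_loss

p = torch.softmax(torch.randn(1, 4, 16, 16, 16), dim=1)
y = torch.nn.functional.one_hot(torch.randint(0, 4, (1, 16, 16, 16)), 4).permute(0, 4, 1, 2, 3).float()
print(dice_ce_loss(p, y))
```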

pre-training & self-training

发表于 2021-01-17 |

[pre-training] Rethinking ImageNet Pre-training, He Kaiming: ImageNet pre-training does not really help accuracy, it only speeds up convergence; random initialization can reach no-worse results, provided the data is plentiful and the augmentation is strong. For small teams with little data this changes nothing, since the speedup is exactly what we want.

[pre-training & self-training] Rethinking Pre-training and Self-training, Google Brain: task-specific pseudo labels beat whatever labels pre-training produces, again provided you can pile on data; for small teams this is of limited use, and in the low-data regime pre-training remains the safe bet.

Both papers are essentially discussions of the value of ImageNet pre-training under cross-task transfer:

  • for classification, pre-trained weights are still worth using
  • Kaiming’s paper states a fact but offers little practical guidance
  • Google’s goes one step further: self-training is worth a try under realistic conditions

Rethinking Pre-training and Self-training

  1. 动机

    • given fact:ImageNet pre-training has limited impact on COCO object detection
    • investigate self-training to utilize the additional data
  2. 论点

    • common practice pre-training
      • supervised pre-training
        • 首先要求数据有标签
        • pre-train the backbone on ImageNet as a classification task
      • 弱监督学习
        • with pseudo/noisy label
        • kaiming:Exploring the limits of weakly supervised pretraining
      • self-supervised pre-training
        • 无标签的海量数据
        • 构造学习目标:autoencoder,contrastive,…
        • https://zhuanlan.zhihu.com/p/108906502
    • self-training paradigm on COCO
      • train an object detection model on COCO
      • generate pseudo labels on ImageNet
      • both labeled data are combined to train a new model
      • 基本基于noisy student的方法
    • observations
      • with stronger data augmentation, pre-training hurts the accuracy, but helps in self-training
      • both supervised and self-supervised pre-training methods fails
      • the benefit of pre-training does not cancel out the gain by self-training
      • flexible about unlabeled data sources, model architectures and computer vision tasks
  3. 方法

    • data augmentation
      • vary the strength of data augmentation as 4 levels
    • pre-training

      • efficientNet-B7
      • AutoAugment weights & noisy student weights

    • self-training

      • noisy student scheme
      • 实验发现self-training with this standard loss function can be unstable
      • implement a loss normalization technique
    • experimental settings
      • object detection
        • COCO dataset for supervised learning
        • unlabeled ImageNet and OpenImages dataset for self-training:score thresh 0.5 to generate pesudo labels
        • retinaNet & spineNet
        • batch:half supervised half pesudo
      • semantic segmentation
        • PASCAL VOC 2012 for supervised learning
        • augmented PASCAL & COCO & ImageNet for self-training:score thresh 0.5 to generate pesudo masks & multi-scale
        • NAS-FPN
  4. 实验

    • pre-training

      • Pre-training hurts performance when stronger data augmentation is used (perhaps because it sharpens the mismatch between the datasets?)
      • More labeled data diminishes the value of pre-training (our own datasets are usually a small fraction of ImageNet’s size, so in theory it should not harm?)
      • self-supervised pre-training hurts in the same way once augmentation is strengthened

    • self-training

      • Self-training helps in high data/strong augmentation regimes, even when pre-training hurts:不同的augment level,self-training对最终结果都有加成

      • Self-training works across dataset sizes and is additive to pre-training:不同的数据量,也都有加成,但是low data regime下enjoys the biggest gain

    • discussion

      • weak performance of pre-training is that pre-training is not aware of the task of interest and can fail to adapt
      • jointly training also helps:address the mismatch between two dataset
      • noisy labeling is worse than targeted pseudo labeling
    • overall conclusion: with small sample sizes, pre-training still adds value and self-training improves it further; with plenty of data, go straight to self-training

Rethinking ImageNet Pre-training

  1. 动机
    • thinking random initialization & pre-training
    • ImageNet pre-training
      • speed up
      • but not necessarily improving
    • random initialization
      • can achieve no worse result
      • robust to data size, models, tasks and metrics
    • rethink current paradigm of ‘pre- training and fine-tuning’
  2. 论点
    • no fundamental obstacle preventing us from training from scratch
      • if use normalization techniques appropriately
      • if train sufficiently long
    • pre-training
      • speed up
      • when fine-tuning on small dataset new hyper-parameters must be selected to avoid overfitting
      • localization-sensitive task benefits limited from pre-training
      • aimed at communities that don’t have enough data or computational resources
  3. 方法
    • normalization
      • form
        • normalized parameter initialization
        • normalization layers
      • BN layers makes training from scratch difficult
        • small batch size degrade the acc of BN
        • fine-tuning可以freeze BN
        • alternatives
          • GN:对batch size不敏感
          • syncBN
      • with appropriately normalized initialization可以train from scratch VGG这种不用BN层的
    • convergence
      • a pre-trained model has learned low-level features that do not need to be re-learned during fine-tuning
      • random-initial training need more iterations to learn both low-level and semantic features
  4. 实验
    • investigate maskRCNN
      • 替换BN:GN/sync-BN
      • learning rate:
        • training longer for the first (large) learning rate is useful
        • but training for longer on small learning rates often leads to overfitting
    • from 10k COCO images upwards, train-from-scratch results can catch up with pre-training results, as long as training runs long enough
    • at 1k and 3.5k COCO images, convergence is no worse, but validation results are somewhat worse: strong overfitting due to lack of data
    • PASCAL results are also a bit worse, because it has fewer instances and categories and is not directly comparable to the same number of COCO images: fewer instances and categories have a similar negative impact as insufficient training data

long-tailed

发表于 2021-01-11 |

[bag of tricks] Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks: the conclusion is a two-stage recipe; the combination of input mixup + CAM-based DRS + mixup-muted fine-tuning works best

[balanced-meta softmax] Balanced Meta-Softmax for Long-Tailed Visual Recognition:商汤

[eql] Equalization Loss for Long-Tailed Object Recognition

[eql2] Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection

[Class Rectification Loss] Imbalanced Deep Learning by Minority Class Incremental Rectification: proposes CRL so that the model can learn the boundaries of sparsely distributed minority classes, counteracting the dominance of the majority classes

Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks

  1. 动机

    • to give a detailed experimental guideline of common tricks
    • to obtain the effective combinations of these tricks
    • propose a novel data augmentation approach
  2. 论点

    • long-tailed datasets
      • poor accuray on the under-presented minority
      • long-tailed CIFAR:
        • 指数型衰减
        • imbalance factor:50/100
        • test set unchanged
      • ImageNet-LT
        • sampling the origin set follow the pareto distribution
        • test set is balanced
      • iNaturalist
        • extremely imbalanced real world dataset
        • fine-grained problem
    • different learning paradigms
      • metric learning
      • meta learning
      • knowledge transfer
      • suffer from high sensitivity to hyper-parameters
    • training tricks
      • re-weighting
      • re-sample
      • mixup
      • two-stage training
      • different tricks might hurt each other
      • propose a novel data augmentation approach based on CAM:generate images with transferred foreground and unchanged background
  3. 方法

    • start from baseline

    • re-weighting

      • baseline:CE
      • re-weighting methods:
        • cost-sensitive CE: weight each class linearly by sample count, $\frac{n_c}{n_{min}}$
        • focal loss: up-weight hard samples
        • class-balanced loss:
          • effective number rather than raw sample count $n_c$
          • hyperparameter $\beta$ and weighting factor $\frac{1-\beta}{1-\beta^{n_c}}$
        • effective on CIFAR-10, but no longer on CIFAR-100 (a weighting sketch follows below)
          • direct application throughout the training procedure is not a proper choice
          • especially when the number of classes grows and the imbalance worsens
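
A small sketch of the class-balanced weighting described above (effective-number weights plugged into a weighted CE); the final normalization is an illustrative choice:

```python
# Sketch (PyTorch) of class-balanced re-weighting via the effective number of samples:
# weight_c = (1 - beta) / (1 - beta^{n_c}), then plugged into a weighted cross-entropy.
import torch
import torch.nn.functional as F

def class_balanced_weights(n_per_class, beta=0.9999):
    n = torch.as_tensor(n_per_class, dtype=torch.float)
    w = (1.0 - beta) / (1.0 - beta ** n)
    return w / w.sum() * len(n)          # normalize so the weights average to 1 (illustrative)

counts = [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]   # a long-tailed CIFAR-10-like split
loss = F.cross_entropy(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                       weight=class_balanced_weights(counts))
```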
    • re-sampling

      • re-sampling methods
        • over-sampling:
          • 随机复制minority
          • might leads to overfitting
        • under-sampling
          • 随机去掉一些majority
          • be preferable to over-sampling
        • 有规律地sampling
          • 大体都是imbalanced向着lighter imbalanced向着balanced推动
        • artificial sampling methods
          • create artificial samples
          • sample based on gradients and features
          • likely to introduce noisy data
      • 观察到提升效果不明显
    • mixup

      • input mixup: can be further improved if we remove the mixup in the last several epochs
      • manifold mixup: applied on only one layer
      • the two mixup variants look comparable at first, but input mixup turns out to be slightly better

        • fine-tuning a few extra epochs with input mixup removed improves results further, whereas removing manifold mixup makes them worse

    • two-stage training

      • imbalanced training + balanced fine-tuning
      • vanilla training schedule on the imbalanced data
        • learn the features first
      • fine-tune on balanced subsets
        • then adjust for recognition accuracy
        • deferred re-balancing by re-sampling (DRS): propose CAM-based sampling
        • deferred re-balancing by re-weighting (DRW)
      • proposed CAM-based sampling
        • plain DRS can only replicate or remove samples
        • for each sampled image, apply the trained model & its ground-truth label to generate a CAM
        • use the mean value of the heatmap as a threshold to separate foreground from background
        • apply transformations to the foreground only
          • horizontal flipping
          • translation
          • rotating
          • scaling
      • re-sampling during fine-tuning is found to be better than re-sampling from the start
      • the proposed CAM-based sampling beats other sampling schemes, with CAM-based balance-sampling the best
      • ImageTrans balance-sampling applies the same transforms but without CAM foreground/background separation; it is worse than CAM-based, which shows the CAM step matters

      • re-weighting during fine-tuning is likewise better than re-weighting from the start

      • among the re-weighting options, CSCE (linear weighting by sample count) is the best

      • overall, DRS results are slightly better than DRW

    • trick combinations

      • in the two-stage setting, CAM-based DRS is slightly better than DRW, and using both together does not further improve
      • when mixup is added on top, input mixup is somewhat better than manifold mixup
      • the conclusion: input mixup + CAM-based DRS + mixup-muted fine-tuning, applying the tricks incrementally

Balanced Meta-Softmax for Long-Tailed Visual Recognition

  1. 动机

    • long-tailed:mismatch between training and testing distributions
    • softmax:biased gradient estimation under the long-tailed setup
    • propose
      • Balanced Softmax:an elegant unbiased extension of Softmax
      • apply a complementary Meta Sampler:optimal sample rate
    • classification & segmentation
  2. 论点

    • raw baseline:a model that minimizes empirical risk on long-tailed training datasets often underperforms on a class-balanced test set
    • most methods use re-sampling or re-weighting
      • to simulate a balanced dataset
      • may under-class the majority or have gradient issue
    • meta-learning
      • optimize the weight per sample
      • need a clean and unbiased dataset
    • decoupled training
      • 就是上面一篇论文中的两阶段,第一阶段先学表征,第二阶段调整分布fine-tuning
      • not adequate for datasets with extremely high imbalance factor
    • LDAM
      • Label-Distribution-Aware Margin Loss
      • larger generalization error bound for minority
      • suit for binary classification
    • we propose BALMS
      • Balanced Meta-Softmax
      • theoretically equivalent with generalization error bound
      • for datasets with high imbalance factors should combine Meta Sampler
  3. 方法

    • balanced softmax

      • biased: viewed through Bayes’ rule, the standard softmax implicitly assumes a uniform prior p(y); under a long-tailed label distribution this estimate is biased
      • re-weighting:

        • applied inside the softmax term
        • weighted linearly by the per-class sample count (a sketch follows below)

      • mathematically: we need to focus on minimizing the training loss of the tail classes
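
A sketch of Balanced Softmax as a logit adjustment: adding $\log n_j$ to the logits before the ordinary softmax CE reproduces the sample-count-weighted softmax above:

```python
# Sketch (PyTorch) of Balanced Softmax: shift each logit by the log class prior before CE,
# equivalent to softmax with n_j * e^{z_j} / sum_k n_k * e^{z_k}.
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits, targets, n_per_class):
    prior = torch.as_tensor(n_per_class, dtype=torch.float, device=logits.device)
    return F.cross_entropy(logits + prior.log(), targets)

counts = [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]
loss = balanced_softmax_ce(torch.randn(8, 10), torch.randint(0, 10, (8,)), counts)
```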

    • meta sampler

      • directly combining re-sampling and re-weighting may worsen performance
      • class-balanced re-sampling can suffer from an over-balance issue
    • combination procedure

      • with the current sampling distribution, compute the Balanced Softmax loss and keep a copy of the model after one gradient update
      • compute this temporary model’s CE on the meta set and update the distribution embedding by gradient: this evaluates how good the current distribution is and corrects it in the right direction
      • update the real model with the new distribution using the Balanced Softmax loss: the optimized distribution guides the model’s learning
  4. 实验

    • CE的结果呈现明显的长尾同分布趋势
    • CBS有缓解
    • BS更好
    • BS+CBS会over sample
    • BS+meta最好

Imbalanced Deep Learning by Minority Class Incremental Rectification

  1. 动机

    • significantly imbalanced training data
    • propose
      • batch-wise incremental minority class rectification model
      • Class Rectification Loss (CRL)
    • bring benefits to both minority and majority class boundary learning

  2. 论点

    • Most methods produce learning bias towards the majority classes
      • to eliminate bias
        • lifting the importance of minority classes:over-sampling can easily cause model overfitting,可能造成对小类别的过分关注,而对大类别不够重视,影响模型泛化能力
        • cost-sensitive learning:difficult to optimise
        • threshold-adjustment technique:given by experts
    • previous methods mainly investigate single-label binary-class with small imbalance ratio
    • real data
      • large ratio:power-law distributions
      • Subtle appearance discrepancy
    • hard sample mining
      • hard negatives are more informative than easy negatives as they violate a model class boundary
      • we only consider hard mining on the minority classes for efficiency
      • our batch-balancing hard mining strategy:eliminating exhaustive searching
    • LMLE
      • 唯一的竞品:考虑了data imbalance的细粒度分类
      • not end-to-end
      • global hard mining
      • computationally complex and expensive
  3. 方法

    • CRL overview

      • explicitly imposing structural discrimination of minority classes
      • batch-wise
      • operate on CE
      • focus on the minority classes only: the conventional CE loss can already model the majority classes well
    • limitations of CE

      • CE treat the individual samples and classes as equally important
      • the learned model is suboptimal
      • boundaries are biased towards majority classes
    • profile the class distribution for each class

      • hard mining
      • overview

    • minority class hard sample mining

      • selectively “borrowing” majority class samples from class decision boundary

      • to minority class’s perspective:mining both hard-positive and hard-negative samples

      • define minority class:selected in each mini-batch

      • Incremental refinement:

        • eliminates the LMLE’s drawback in assuming that local group structures of all classes can be estimated reliably by offline global clustering
        • mini-batch的data distribution和训练集不是完全一致的
      • steps

        • profile the minority and majority classes per label in each training mini-batch

          • for each sample,for each class $j$,for each pred class $k$,we have $h^j=[h_1^j, …, h_k^j, …, h_{n_cls}^j]$
          • sort $h_k^j$ in descent order,define the minority classes for each class with $C_{min}^j = \sum_{k\in C_{min}^j}h_k^j \leq \rho * n_{bs}$,with $\rho=0.5$
        • hard mining

          • hardness

            • score based:prediction score,class-level
            • feature based:feature distance,instance-level
          • class-level,for class c

            • hard-positives:same gt class,but low prediction
            • hard-negative:different gt class,with high prediction
          • instance-level,for each sample in class c

            • hard-positives:same gt class,large distance with current sample
            • hard-negative:different gt class,small distance with current sample
          • top-k mining

            • hard-positives:bottom-k scored on c/top-k distance on c
            • hard-negative:top-k scored on c/bottom-k distance on c
          • score-based yields superior to distance-based

    • CRL

      • final weighted loss:$L = \alpha L_{crl}+(1-\alpha)L_{ce}$,$\alpha=\eta\Omega_{imbalance}$
      • class imbalance measure $\Omega$:more weighting is assigned to more imbalanced labels
      • form
        • triplet loss:类内+类间
        • contrastive loss:类内
        • modelling the distribution relationship of positive and negative pairs:没看懂
  4. 总结

    In short, it just plugs a per-batch, changing definition of the minority classes into off-the-shelf metric learning; not much substance.

    Bottom line: big data — CE; small data — metric learning.

refineDet

发表于 2021-01-08 |

Not related to RefineNet in any way

RefineDet: Single-Shot Refinement Neural Network for Object Detection

  1. 动机

    • inherit the merits of both two-stage and one-stage:accuracy and efficiency
    • single-shot
    • multi-task
    • refineDet
      • anchor refinement module (ARM)
      • object detection module (ODM)
      • transfer connection block (TCB)
  2. 论点

    • three advantages that two-stage superior than one-stage
      • RPN:handle class imbalance
      • two step regress:coarse to refine
      • two-stage features: the RPN task and the regression task each get their own features
    • mimic the two-stage detector’s RPN by first discarding the large number of negative boxes for the classifier, but as parallel multi-task branches rather than two sequential stages
    • decouple the one-stage detector’s objectness and box regression tasks, connecting the two via the transfer block
    • ARM
      • remove negative anchors to reduce search space for the classifier
      • coarsely adjust the locations and sizes of anchors to provide better initialization for regression
    • ODM
      • further improve the regression
      • predict multi labels
    • TCB

      • transfer the features in the ARM to handle the more challenging tasks in the ODM

  3. 方法

    • Transfer Connection Block

      • 没什么新的东西,上采样用了deconv,conv-relu,element-wise add

    • Two-Step Cascaded Regression

      • fisrt step ARM prediction
        • for each cell,for each predefined anchor boxes,predict 4 offsets and 2 scores
        • obtain refined anchor boxes
      • second step ODM prediction
        • with justified feature map,with refined anchor boxes
        • generate accurate boxes offset to refined boxes and multi-class scores,c+4
    • Negative Anchor Filtering

      • reject well-classified negative anchors
      • if the negative confidence is larger than 0.99,discard it in training the ODM
      • the ODM receives all predicted positives plus the mined hard negatives (filtering sketch below)
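
A tiny sketch of that filtering rule (illustrative tensor shapes; the real implementation also feeds the refined anchor coordinates and TCB features into the ODM):

```python
# Sketch (PyTorch) of negative anchor filtering: ARM anchors whose background score
# exceeds 0.99 are dropped before computing the ODM targets and losses.
import torch

def filter_refined_anchors(arm_cls, refined_anchors, theta=0.99):
    # arm_cls: [N, 2] softmax over (background, object); refined_anchors: [N, 4]
    keep = arm_cls[:, 0] <= theta               # discard well-classified easy negatives
    return refined_anchors[keep], keep

arm_cls = torch.softmax(torch.randn(6000, 2), dim=-1)
kept_anchors, keep_mask = filter_refined_anchors(arm_cls, torch.rand(6000, 4))
```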
    • Training and Inference details

      • back:VGG16 & resnet101
        • fc6 & fc7变成两个conv
        • different feature scales
        • L2 norm
        • two extra convolution layers and one extra residual block
      • 4 feature strides
        • each level:1 scale & 3 ratios
        • ensures that different scales of anchors have the same tiling density on the image
      • matching
        • each GT box is matched to the anchor box with the highest overlap score
        • each anchor box is then matched to the best-overlapping gt box with IoU greater than 0.5
        • which effectively turns what would normally be the ignore region into positives
      • Hard Negative Mining
        • select negative anchor boxes with top loss values
        • n & p ratio:3:1
      • Loss Function
        • ARM loss
          • binary class: computed on positive samples only???
          • box: computed on positive samples only
        • ODM loss
          • pass on the refined anchors whose negative confidence is below the threshold
          • multi-class: computed on the balanced positive and negative samples
          • box: computed on positive samples only
        • when there are no positives, both losses are 0: a purely negative image contributes nothing??