Less is More



transform in CNN

发表于 2021-02-03 |

综述

  1. 几何变换
    • STN:
      • 普通的CNN能够隐式地学习一定的平移、旋转不变性,让网络能够适应这种变换:降采样结构本身能够使得网络对变换不敏感
      • 从数据角度出发,我们还会引入各种augmentation,强化网络对变换的不变能力
      • deepMind为网络设计了一个显式的变换模块来学习各种变换,将distorted的输入变换回去,让网络学习更简单的东西
      • 参数量:就是变换矩阵的参数,通常是2x3的仿射变换矩阵,也就是6个参数
    • deformable conv:
      • based on STN
      • 针对分类和检测分别提出deformable convolution和deformable RoI pooling:
      • 感觉deformable RoI pooling和guiding anchor里面的feature adaption是一个东西
      • 参数量:regular kernel params 3x3 + deformable offsets 3x3x2
      • what’s new?
        • 个人认为主要是引入了更多参数,带来更强的变换建模能力
        • 首先STN是从output到input的映射,使用变换矩阵M通常只能表示可参数化的全局变换(如仿射),且全图只有1个transformation
        • 其次STN的sampling kernel也是预定义的算法,对kernel内的所有pixel使用相同的变换,也就是1个weight factor
        • deformable conv是从input到output的映射,映射可以是任意的transformation,且3x3x2的参数最多可以包含3x3种transformation
        • sampling kernel对kernel内的每个点,也可以有不同的权重,也就是3x3个weight factor
    • 还有啥跟形变相关的
  2. attention机制
    • spatial attention:STN,sSE
    • channel attention:SENet
    • 同时使用空间attention和通道attention机制:CBAM
  3. papers

    • [STN] STN: Spatial Transformer Networks,STN的变换是pre-defined的,是针对全局featuremap的变换
    • [DCN 2017] Deformable Convolutional Networks,DCN的变换更自由(learned),是针对局部kernel分别进行的变换,基于卷积核添加location-specific shift
    • [DCNv2 2018] Deformable ConvNets v2: More Deformable, Better Results,进一步消除irrelevant context,基于卷积核添加weighted-location-specific shift,提升performance
    • [attention系列paper] [SENet & SKNet & CBAM & GC-Net](https://amberzzzz.github.io/2020/03/13/attention%E7%B3%BB%E5%88%97/)

STN: Spatial Transformer Networks

  1. 动机

    • 传统卷积:lack the ability to be spatially invariant
    • propose a new learnable module
      • can be inserted into CNN
      • spatially manipulate the data
      • without any extra supervision
      • models learn to be invariant to transformations
  2. 论点

    • spatially invariant
      • the ability of being invariant to large transformations of the input data
    • max-pooling
      • 在一定程度上spatially invariant
      • 因为receptive fields are fixed and local and small
      • 必须叠加到比较深层的时候才能实现,intermediate feature layers对large transformations不太行
      • 是一种pre-defined mechanism,跟sample无关
    • spatial transformation module
      • conditioned on individual samples
      • dynamic mechanism
      • produce a transformation and perform it on the entire feature map
    • task场景
      • distorted digits分类:对输入做transform能够simplify后面的分类任务
      • co-localisation:
      • spatial attention
    • related work
      • 生成器用来生成transformed images,从而判别器能够学习分类任务from transformation supervision
      • 一些methods试图从网络结构、feature extractors的角度的获得invariant representations,while STN aims to achieve this by manipulating the data
      • manipulating the data通常就是基于attention mechanism,crop涉及differentiable问题
  3. 方法

    • formulation

      • localisation network:predict transform parameters
      • grid generator:基于predicted params生成sampling grid
      • sampler:在sampling grid的每个点上apply sampling kernel(如bilinear),采样得到输出feature map

    • localisation network

      • input feature map $U \in R^{H\times W\times C}$
      • same transformation is applied to each channel
      • generate parameters of transformation $\theta$:1-d vector
      • fc / conv + final regression layer
    • parameterised sampling grid

      • sampling kernel

      • applied by pixel

      • general affine transformation:cropping,translation,rotation,scale,skew

      • output map上任意一点一定来自变换前的某一点,反之不一定,input map上某一点可能是bg,被crop掉了,所以pointwise transformation写成从target到source的映射:$(x_i^s, y_i^s)^T = A_{\theta} (x_i^t, y_i^t, 1)^T$

      • target points构成的点集就是sampling points on the input feature map

    • differentiable image sampling

      • 通过上一步的矩阵transformation,得到input map上需要保留的source point set

      • 对点集中每一点apply kernel

      • 通用的插值表达式:$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)$

      • 最近邻kernel是个pulse函数

      • bilinear kernel把distance>1的点权重置0:$k(d)=max(0, 1-|d|)$,分段线性、分段可导,见下面的sketch
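
      一个minimal sketch(非论文官方实现),用PyTorch自带的F.affine_grid / F.grid_sample把localisation network → grid generator → sampler串起来;其中localisation network的结构是假设的,最后的回归层按论文建议零初始化、bias初始化为恒等变换:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SimpleSTN(nn.Module):
          def __init__(self, in_ch=1):
              super().__init__()
              # localisation network:回归2x3仿射矩阵的6个参数
              self.loc = nn.Sequential(
                  nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
                  nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(10, 6))
              self.loc[-1].weight.data.zero_()
              self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

          def forward(self, x):
              theta = self.loc(x).view(-1, 2, 3)                           # (b,2,3)仿射参数
              grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
              return F.grid_sample(x, grid, align_corners=False)          # bilinear sampler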

    • STN:Spatial Transformer Networks

      • 把spatial transformer嵌进CNN去:learn how to actively transform the features to help minimize the overall cost
      • computationally fast
      • 几种用法
        • feed the output of the localization network $\theta$ to the rest of the network:因为transform参数explicitly encodes目标的位置姿态信息
        • place multiple spatial transformers at increasing depth:串行能够让深层的transformer学习更抽象的变换
        • place multiple spatial transformers in parallel:并行的变换使得每个变换针对不同的object
  4. 实验

    • R、RTS、P、E:distortion ahead
    • aff、proj、TPS:transformer predefined
      • aff:给定角度??
      • TPS:薄板样条插值

Deformable Convolutional Networks

  1. 动机

    • CNN:fixed geometric structures
    • enhance the transformation modeling capability
      • deformable convolution
      • deformable RoI pooling
    • without additional supervision
    • share similar spirit with STN
  2. 论点

    • to accommodate geometric variations

      • data augmentation is limited to model large, unknown transformations
      • fixed receptive fields is undesirable for high level CNN layers that encode the semantics
      • 使用大量增广的数据,枚举不全,而且收敛慢,所需网络参数量大
      • 对于提取语义特征的高层网络来讲,固定的感受野对不同目标不友好
    • introduce two new modules

      • deformable convolution
        • learning offsets for each kernel via additional convolutional layers
      • deformable RoI pooling

        • learning offset for each bin partition of the previous RoI pooling

  3. 方法

    • overview

      • operate on the 2D spatial domain
      • remains the same across the channel dimension
    • deformable convolution

      • 正常的卷积:
        • $y(p_0) = \sum w(p_n)*x(p_0 + p_n)$
        • $p_n \in \{(-1,-1),(-1,0),\dots,(0,0),\dots,(1,1)\}$
      • deformable conv:with offsets $\Delta p_n$
        • $y(p_0) = \sum w(p_n)*x(p_0 + p_n + \Delta p_n)$
        • offset value is typically fractional
        • bilinear interpolation:
          • $x(p) = \sum_q G(q,p)x(q)$
          • 其中$G(q,p)$是bilinear kernel:$G(q,p)=max(0, 1-|q_x-p_x|) \cdot max(0, 1-|q_y-p_y|)$
          • 只计算和offset点距离小于1个单位的邻近点
      • 实现
        • offsets conv和特征提取conv是一样的kernel:same spatial resolution and dilation(N个position)
        • the channel dimension 2N:因为是x和y两个方向的offset
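
      一个minimal sketch(非原论文实现),借助torchvision.ops.deform_conv2d:offset由一个并行的、零初始化的conv预测,对应上面说的2N个channel:

      import torch
      import torch.nn as nn
      from torchvision.ops import deform_conv2d

      class DeformConvBlock(nn.Module):
          def __init__(self, in_ch, out_ch, k=3):
              super().__init__()
              self.k = k
              # offset分支:输出2*k*k个channel(x、y两个方向),零初始化
              self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
              nn.init.zeros_(self.offset_conv.weight)
              nn.init.zeros_(self.offset_conv.bias)
              self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

          def forward(self, x):
              offset = self.offset_conv(x)                  # (b, 2*k*k, h, w)
              return deform_conv2d(x, offset, self.weight, padding=self.k // 2)

      y = DeformConvBlock(64, 64)(torch.randn(1, 64, 32, 32))   # (1, 64, 32, 32)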
    • deformable RoI pooling

      • RoI pooling converts an input feature map of arbitrary size into fixed size features

      • 常规的RoI pooling

        • divides ROI into k*k bins and for each bin:$y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p)/n_{ij}$
        • 对feature map上划分到每个bin里面所有的点
      • deformable RoI pooling:with offsets $\Delta p_{ij}$

        • $y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p+\Delta p_{ij})/n_{ij}$
        • scaled normalized offsets:$\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w,h)$,其中$\Delta \hat{p}_{ij}$是fc预测的归一化offset
        • normalized offset value is fractional
        • bilinear interpolation on the pooled map as above
      • 实现

        • fc layer:k*k*2个element(sigmoid?)
      • position sensitive RoI Pooling

        • fully convolutional
        • input feature map先通过卷积扩展成k*k*(C+1)通道
        • 对每个类别通道(共C+1个,各对应k*k个score map),用另一个conv branch输出全图的offset fields(2*k*k个)

    • deformable convNets

      • initialized with zero weights
      • learning rates are set to $\beta$ times of the learning rate for the existing layers
        • $\beta=1.0$ for conv
        • $\beta=0.01$ for fc
      • feature extraction
        • back:ResNet-101 & Aligned-Inception-ResNet
        • withoutTop:A randomly initialized 1x1 conv is added at last to reduce the channel dimension to 1024
        • last block
          • stride is changed from 2 to 1
          • the dilation of all the convolution filters with kernel size>1 is changed from 1 to 2
        • Optionally last block
          • use deformable conv in res5a,b,c
      • segmentation and detection
        • deeplab predicts 1x1 score maps
        • Category-Aware RPN run region proposal with specific class
        • modified faster R-CNN:add ROI pooling at last conv
        • optional faster R-CNN:use deformable ROI pooling
        • R-FCN:state-of-the-art detector
        • optional R-FCN:use deformable ROI pooling

  4. 实验

    • Accuracy steadily improves when more deformable convolution layers are used:使用越多层deform conv越好,经验取了3

    • the learned offsets are highly adaptive to the image content:大目标的间距大,因为receptive field大,consistent in different layers

    • atrous convolution also improves:default networks have too small receptive fields,但是dilation需要手调到最优

    • using deformable RoI pooling alone already produces noticeable performance gains, using both obtains significant accuracy improvements

Deformable ConvNets v2: More Deformable, Better Results

  1. 动机

    • DCN能够adapt一定的geometric variations,但是仍存在extend beyond image content的问题
    • to focus on pertinent image regions
      • increased modeling power
        • more deformable layers
        • updated DCNv2 modules
      • stronger training
        • propose feature mimicking scheme
    • verified on
      • incorporated into Faster-RCNN & Mask RCNN
      • COCO for det & set
    • still lightweight and easy to incorporate
  2. 论点

    • DCNv1
      • deformable conv:在standard conv的基础上generate location-specific offsets which are learned from the preceding feature maps
      • deformable pooling:offsets are learned for the bin positions in RoIpooling
      • 通过可视化散点图发现有部分散点落在目标外围
    • propose DCNv2
      • equip more convolutional layers with offset
      • modified module
        • each sample not only undergoes a learned offset
        • but also a learned feature amplitude
      • effective training
        • use RCNN as the teacher network since RCNN learns features unaffected by irrelevant info outside the ROI
        • feature mimicking loss
  3. 方法

    • stacking more deformable conv layers

      • replace more regular conv layers by their deformable counterparts:
        • resnet50的stage3、4、5的3x3conv都替换成deformable conv:13个conv layer
          • DCNv1是把stage5的3个resblock的3x3 conv替换成deformable conv:3个deformable conv layer
      • 因为DCNv1里面在PASCAL上面实验发现再多的deformable conv精度就饱和了,但是DCNv2是在harder dataset COCO上面的best-acc-efficiency-tradeoff
    • modulated deformable conv

      • modulate the input feature amplitudes from different spatial locations/bins
        • set the learnable offset & scalar for the k-th location:$\Delta p_k$和$\Delta m_k$
        • set the conv kernel dilation:$p_k$,resnet里面都是1
        • the value for location p is:$y(p) = \sum_{k=1}^K w_k x(p+p_k+\Delta p_k)\Delta m_k$,bilinear interpolation
      • 目的是抑制无关信号
      • learnable offset & scalar obtained via a separate conv layer over the same input feature map x
      • 输出有3K个channel:2K for xy-offset,K for scalar
        • offset的conv后面没激活函数,因为范围无限
        • scalar的conv后面有个sigmoid,将range控制在[0,1]
        • 两个conv全0初始化
        • 两个conv layer的learning rate比existing layers小一个数量级
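
      一个minimal sketch(非原论文实现):同一个conv分支输出3K个channel(2K offset + K个过sigmoid的modulation scalar),需要较新版本torchvision的deform_conv2d支持mask参数:

      import torch
      import torch.nn as nn
      from torchvision.ops import deform_conv2d

      class ModulatedDeformConv(nn.Module):
          def __init__(self, in_ch, out_ch, k=3):
              super().__init__()
              self.k = k
              self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=k // 2)  # 3K个channel,零初始化
              nn.init.zeros_(self.offset_mask.weight)
              nn.init.zeros_(self.offset_mask.bias)
              self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

          def forward(self, x):
              o1, o2, mask = torch.chunk(self.offset_mask(x), 3, dim=1)
              offset = torch.cat([o1, o2], dim=1)       # 2K个offset channel
              mask = torch.sigmoid(mask)                # K个[0,1]的modulation scalar
              return deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)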
    • modulated deformable RoIpooling

      • given an input ROI
      • split into K(7x7) spatial bins
      • average pooling over the sampling points for each bin计算bin的value
      • the bin value is:$y(k) = \sum_{j=1}^{n_k} x(p_{kj}+\Delta p_k)\Delta m_k /n_k$,bilinear interpolation
      • a sibling branch
        • 2个1024d-fc:gaussian initialization with 0.01 std dev
        • 1个3Kd-fc:全0初始化
        • last K channels + sigmoid
        • learning rate跟existing layers保持一致
    • RCNN feature mimicking

      • 发现无论是regular conv还是deformable conv,error-bound都很大
      • 尽管从设计思路上,DCNv2是带有mute irrelevant的能力的,但是事实上并没做到
      • 说明such representation cannot be learned well through standard FasterRCNN training procedure:
        • 说白了就是supervision力度不够
        • 需要additional guidance
    • feature mimic loss

      • enforced only on positive ROIs:因为背景类往往需要更长距离/更大范围的context信息

      • architecture

        • add an additional RCNN branch
        • RCNN input cropped images,generate 14x14 featuremaps,经过两个fc变成1024-d
        • 和FasterRCNN里对应的counterpart,计算cosine similarity
        • 这个太扯了不展开了

spineNet

发表于 2021-01-28 |

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

  1. 动机

    • object detection task
      • requiring simultaneous recognition and localization
      • solely encoder performs not well
      • while encoder-decoder architectures are ineffective
    • propose SpineNet

      • scale-permuted intermediate features
      • cross-scale connections
      • searched by NAS on detection COCO
      • can transfer to classification tasks
      • 在轻量和重量back的一阶段网络中都涨点领先

  2. 论点

    • scale-decreasing backbone
      • throws away the spatial information by down-sampling
      • challenging to recover
      • 接一个轻量的FPN:
    • scale-permuted model
      • scales of features can increase/decrease anytime:retain the spacial information
      • connections go across scales:multi-scale fusion
      • searched by NAS
      • 是一个完整的FPN,不是encoder-decoder那种可分的形式
      • directly connect to classification and bounding box regression subnets
      • base on ResNet50
        • use bottleneck feature blocks
        • two inputs for each feature blocks
        • roughly the same computation
  3. 方法

    • formulation

      • overall architecture
        • stem:scale-decreased architecture
        • scale-permuted network
        • blocks in the stem network can be candidate inputs for the following scale-permuted network
      • scale-permuted network
        • building blocks:$B_k$
        • feature level:$L_3 - L_7$
        • output features:1x1 conv,$P_3 - P_7$
    • search space

      • scale-permuted network:

        • block只能从前往后connect
        • based on resNet blocks
        • channel 256 for $L_5, L_6, L_7$
      • cross-scale connections:

        • two input connections for each block

        • from lower ordering block / stem

        • resampling

          • narrow factor $\alpha$:1x1 conv
          • 上采样:interpolation
          • 下采样:3x3 s2 conv
          • element-wise add

      • block adjustment

        • intermediate blocks can adjust its scale level & type
        • level from {-1, 0, 1, 2}
        • select from bottleneck / residual block
    • family of models

      • R[N] - SP[M]:N feature layers in stem & M feature layers in scale-permuted layers
      • gradually shift from stem to SP
      • with size decreasing

    • spineNet family

      • basic:spineNet-49
      • spineNet-49S:channel数scaled down by 0.65
      • spineNet-96:double the number of blocks
      • spineNet-143:repeat 3 times,fusion narrow factor $\alpha=1$
      • spineNet-190:repeat 4 times,fusion narrow factor $\alpha=1$,channel数scaled up by 1.3
  4. 实验

    • 在mid/heavy量级上,比resnet-family-FPN涨出两个点

    • 在light量级上,比mobileNet-family-FPN涨出一个点

guided anchoring

发表于 2021-01-27 |

原作者知乎reference:https://zhuanlan.zhihu.com/p/55854246

  • 不完全是anchor-free,因为还是有decision grid to choose from的,应该说是adaptive anchor instead of hand-picked
  • 为了特征和adaptive anchor对齐,引入deformable conv

Region Proposal by Guided Anchoring

  1. 动机

    • most methods
      • predefined anchors
      • do a uniformed dense prediction
    • our method
      • use sematic features to guide the anchoring
      • anchor size也是网络预测参数,compute from feature map
      • arbitrary aspect ratios
    • feature inconsistency
      • 不同的anchor loc都是对应feature map上某一个点
      • 变化的anchor size和固定的位置向量之间存在inconsistency
      • 引入feature adaption module
    • use high-quality proposals
      • GA-RPN提升了proposal的质量
      • 因此我们对proposal进入stage2的条件更严格
    • adopt in Fast R-CNN, Faster R-CNN and RetinaNet均涨点
      • RPN提升显著:9.1
      • MAP也有涨点:1.2-2.7
    • 还可以boosting trained models
      • boosting a two-stage detector by a fine-tuning schedule
  2. 论点

    • alignment & consistency

      • 我们用feature map的pixels作为anchor representations,那么anchor centers必须跟feature pixels保持align

      • 不同pixel的reception field必须跟对应的anchor size保持匹配

      • previous sliding window scheme对每个pixel都做一样的操作,用同样一组anchor,因此是align和consist的
      • previous progressively refining scheme对anchor的位置大小做了refinement,ignore the alignment & consistency issue,是不对的!!
    • disadvantage of predefined anchors

      • hard hyperparams
      • huge pos/neg imbalance & computation
    • we propose GA-RPN

      • learnable anchor shapes to mitigate the hand-picked issue
      • feature adaptation to solve the consistency issue
      • key concerns in this paper
        • learnable anchors
        • joint anchor distribution
        • alignment & consistency
        • high-quality proposals
  3. 方法

    • formulation

      • $p(x,y,w,h|I) = p(x,y|I)p(w,h|x,y,I)$
      • 将问题解耦成位置和尺寸的预测,首先anchor的loc服从full image的均匀分布,anchor的size建立在loc存在的基础上
      • two branches for loc & shape prediction
        • loc:binary classification,hxwx1
        • shape:location-dependent shapes,hxwx2
        • anchors:loc probabilities above a certain threshold & corresponding ‘most probable’ anchor shape
      • multi-scale
        • the anchor generation parameters are shared
      • feature adaptation module

        • adapts the feature according to the anchor shape

    • anchor location prediction

      • indicates the probability of an object’s center
      • 一层卷积:1x1 conv,channel1,sigmoid
      • transform back:each grid(i,j) corresponds to coords ((i+0.5)*s, (j+0.5)*s) on the origin map
      • filter out 90% of the regions
      • thus replace the ensuing conv layers by masked convs
      • ground truth
        • binary label map
        • each level:center region & ignore region & outside region,基于object center的方框
          • $\sigma_1=0.2,\sigma_2=0.5$:region box的长宽系数
          • ???用centerNet的heatmap会不会更好???
      • focal loss $L_{loc}$
    • anchor shape prediction
      • predicts the best shape for each location
      • best shape:a shape that lead to best iou with the nearest gt box
      • 一层卷积:1x1 conv,channel2,[-1,1]
      • transform layer:transform direct [-1,1] outputs to real box shape
        • $w = \sigma s e^{dw}$
        • $h = \sigma s e^{dh}$
        • s:stride
        • $\sigma$:经验参数,8 in experiments
      • set 9 pairs of (w,h) as RetinaNet,calculate the IoU of these sampled anchors with gt,take the max as target value
      • bounded iou loss:$L_{shape} = L_1(1-min(\frac{w}{w_g}, \frac{w_g}{w})) + L_1(1-min(\frac{h}{h_g}, \frac{h_g}{h}))$
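
      一个minimal sketch(非原论文实现),对应上面的指数变换和bounded iou loss,其中sigma、stride等超参只是示意:

      import torch
      import torch.nn.functional as F

      def decode_shape(dw, dh, stride, sigma=8.0):
          # w = sigma * s * e^dw, h = sigma * s * e^dh
          return sigma * stride * torch.exp(dw), sigma * stride * torch.exp(dh)

      def shape_loss(w, h, w_gt, h_gt):
          # L_shape = L1(1 - min(w/wg, wg/w)) + L1(1 - min(h/hg, hg/h))
          tw = 1 - torch.min(w / w_gt, w_gt / w)
          th = 1 - torch.min(h / h_gt, h_gt / h)
          zeros = torch.zeros_like(tw)
          return F.smooth_l1_loss(tw, zeros) + F.smooth_l1_loss(th, zeros)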
    • feature adaptation
      • intuition:the feature corresponding to different size of anchor shapes应该encode different content region
      • inputs:feature map & anchor shape
      • location-dependent transformation:3x3 deformable conv
      • deformable conv的offset是anchor shape得到的
      • outputs:adapted features
    • with adapted features
      • then perform further classification and bounding-box regression
    • training
      • jointly optimize:$L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg}$
      • $\lambda_1=0.2,\lambda_2=0.5$
      • each level of feature map should only target objects of a specific scale range:但是ASFF论文主张说这种arrange by scale的模式会引入前背景inconsistency??
    • High-quality Proposals
      • set a higher positive/negative threshold
      • use fewer samples

ASFF

发表于 2021-01-25 |

Learning Spatial Fusion for Single-Shot Object Detection

  1. 动机

    • inconsistency when fuse across different feature scales
    • propose ASFF
      • suppress the inconsistency
      • spatially filter conflictive information:想法应该跟SSE-block类似
    • build on yolov3

      • introduce a bag of tricks
      • anchor-free pipeline

  2. 论点
    • ssd is one of the first to generate pyramidal feature representations
      • deeper layers reuse the formers
      • bottom-up path
      • small instances suffer low acc because they contain insufficient semantic info
    • FPN use top-down path
      • shares rich semantics at all levels
      • improvement:more strengthening feature fusion
    • 在使用FPN时,通常不同scale的目标绑定到不同的level上面
      • inconsistency:其他level的feature map对应位置的信息则为背景
      • some methods set ignore region in adjacent features
  3. 方法

    • introduce advanced techniques
      • mixup
      • cosine learning rate schedule
      • sync-bn
      • an anchor-free branch to run jointly with anchor-based ones
      • L1 loss + IoU loss
    • fusion
      • 全连接而非adjacent merge:三个level的fuse map都来自三个level的feature map
      • 上采样:
        • 1x1 conv:对齐channel
        • upsamp with interpolation
      • 下采样:
        • s2:3x3 s2 conv
        • s4:maxpooling + 3x3 s2 conv
      • adaptive fusion
        • pixel level的reweight
        • shared across channels:hxwx1
        • 对来自三个level的feature map,resolution对齐以后,分别1x1conv,channel 1
        • norm the weights:softmax
        • 为啥能suppress inconsistency:对三个level同一位置的像素,学到“只激活一个、另外两个为0”的组合是绝对不harm的,相当于把上面ignore region的方法拓展成adaptive的
    • training
      • apply mixup on the classification pretraining of D53
      • turn off mixup augmentation for the last 30 epochs.
    • inference
      • the detection header at each level first predicts the shape of anchors???这个不太懂
    • ASFF & ASFF*
      • enhanced version of ASFF by integrating other lightweight modules
      • dropblock & RFB
  4. 实现

    # 以下代码来自原repo;conv_bn / conv_bias是repo里conv+bn、conv+bias的辅助构造函数
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASFF(nn.Module):
        def __init__(self, level, activate, rfb=False, vis=False):
            super(ASFF, self).__init__()
            self.level = level
            self.dim = [512, 256, 128]
            self.inter_dim = self.dim[self.level]
            if level == 0:
                self.stride_level_1 = conv_bn(256, self.inter_dim, kernel=3, stride=2)
                self.stride_level_2 = conv_bn(128, self.inter_dim, kernel=3, stride=2)
                self.expand = conv_bn(self.inter_dim, 512, kernel=3, stride=1)
            elif level == 1:
                self.compress_level_0 = conv_bn(512, self.inter_dim, kernel=1)
                self.stride_level_2 = conv_bn(128, self.inter_dim, kernel=3, stride=2)
                self.expand = conv_bn(self.inter_dim, 256, kernel=3, stride=1)
            elif level == 2:
                self.compress_level_0 = conv_bn(512, self.inter_dim, kernel=1, stride=1)
                self.compress_level_1 = conv_bn(256, self.inter_dim, kernel=1, stride=1)
                self.expand = conv_bn(self.inter_dim, 128, kernel=3, stride=1)
            compress_c = 8 if rfb else 16
            self.weight_level_0 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_level_1 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_level_2 = conv_bn(self.inter_dim, compress_c, 1, 1, 0)
            self.weight_levels = conv_bias(compress_c * 3, 3, kernel=1, stride=1, padding=0)
            self.vis = vis

        def forward(self, x_level_0, x_level_1, x_level_2):
            # 跟论文描述一样:上采样先1x1conv对齐,再upinterp,下采样3x3 s2 conv
            if self.level == 0:
                level_0_resized = x_level_0
                level_1_resized = self.stride_level_1(x_level_1)
                level_2_downsampled_inter = F.max_pool2d(x_level_2, 3, stride=2, padding=1)
                level_2_resized = self.stride_level_2(level_2_downsampled_inter)
            elif self.level == 1:
                level_0_compressed = self.compress_level_0(x_level_0)
                sh = torch.tensor(level_0_compressed.shape[-2:]) * 2
                level_0_resized = F.interpolate(level_0_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_1_resized = x_level_1
                level_2_resized = self.stride_level_2(x_level_2)
            elif self.level == 2:
                level_0_compressed = self.compress_level_0(x_level_0)
                sh = torch.tensor(level_0_compressed.shape[-2:]) * 4
                level_0_resized = F.interpolate(level_0_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_1_compressed = self.compress_level_1(x_level_1)
                sh = torch.tensor(level_1_compressed.shape[-2:]) * 2
                level_1_resized = F.interpolate(level_1_compressed, size=tuple(sh.tolist()), mode='nearest')
                level_2_resized = x_level_2
            # 这里得到的resized特征图不直接转换成一通道的weighting map,
            # 而是先1x1conv降维到8/16,然后concat,然后1x1 conv生成3通道的weighting map
            # weighting map相当于一个prediction head,所以是conv_bias_softmax,无bn
            level_0_weight_v = self.weight_level_0(level_0_resized)
            level_1_weight_v = self.weight_level_1(level_1_resized)
            level_2_weight_v = self.weight_level_2(level_2_resized)
            levels_weight_v = torch.cat((level_0_weight_v, level_1_weight_v, level_2_weight_v), 1)
            levels_weight = self.weight_levels(levels_weight_v)
            levels_weight = F.softmax(levels_weight, dim=1)

            # reweighting
            fused_out_reduced = level_0_resized * levels_weight[:, 0:1, :, :] + \
                level_1_resized * levels_weight[:, 1:2, :, :] + \
                level_2_resized * levels_weight[:, 2:, :, :]

            # 3x3的conv,是特征图平滑
            out = self.expand(fused_out_reduced)

            if self.vis:
                return out, levels_weight, fused_out_reduced.sum(dim=1)
            else:
                return out

VoVNet

发表于 2021-01-22 |

An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

  1. 动机

    • denseNet
      • dense path:diverse receptive fields
      • heavy memory cost & low efficiency
    • we propose a backbone
      • preserve the benefit of concatenation
      • improve denseNet efficiency
      • VoVNet comprised of One-Shot Aggregation (OSA)
    • apply to one/two stage object detection tasks
      • outperforms denseNet & resNet based ones
      • better small object detection performance
  2. 论点

    • main difference between resNet & denseNet
      • aggregation:summation & concatenation
        • summation would washed out the early features
        • concatenation last as it preserves
    • GPU parallel computation
      • computing utilization is maximized when operand tensor is larger
      • many 1x1 convs for reducing dimension
      • dense connections in intermediate layers are inducing the inefficiencies
    • VoVNet

      • hypothesize that the dense connections are redundant
      • OSA:aggregates intermediate features at once
      • test as object detection backbone:outperforms DenseNet & ResNet with better energy efficiency and speed
    • factors for efficiency

      • FLOPS and model sizes are indirect metrics
      • energy per image and frame per second are more practical
      • MAC:
        • memory accesses cost,$hw(c_i+c_o) + k^2 c_ic_o$
        • memory usage不止跟参数量有关,还跟特征图尺寸相关
        • MAC can be minimized when input channel size equals the output
      • FLOPs/s
        • splitting a large convolution operation into several fragmented smaller operations makes GPU computation inefficient as fewer computations are processed in parallel
        • 所以depthwise/bottleneck理论上降低了计算量FLOP,但是从GPU并行的角度efficiency降低,并没有显著提速:cause more sequential computations
        • 以时间为单位的FLOPs才是fair的
  3. 方法

    • hypothesize

      • dense connection makes similar between neighbor layers
      • redundant
    • OSA

      • dense connection:former features concats in every following features
      • one-shot connection:former features concats once in the last feature

      • 最开始跟dense block保持参数一致:一个block里面12个layers、channel 20;后来发现深层特征contributes less,所以换成浅层:5个layers、channel 43,发现有涨点,implies that building deep intermediate feature via dense connection is less effective than expected

      • in/out channel数相同

        • much less MAC:
          • denseNet40:3.7M
          • OSA:5layers,channel43,2.5M
          • 对于higher resolution的detection任务impies more fast and energy efficient
        • GPU efficiency
          • 不需要那好几十个1x1
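
      一个minimal sketch(非原repo实现)的OSA block:若干3x3 conv顺序堆叠,所有中间特征只在最后concat一次,再用1x1 conv聚合:

      import torch
      import torch.nn as nn

      class OSABlock(nn.Module):
          def __init__(self, in_ch, stage_ch, out_ch, n_layers=5):
              super().__init__()
              self.layers = nn.ModuleList()
              ch = in_ch
              for _ in range(n_layers):
                  self.layers.append(nn.Sequential(
                      nn.Conv2d(ch, stage_ch, 3, padding=1, bias=False),
                      nn.BatchNorm2d(stage_ch), nn.ReLU(inplace=True)))
                  ch = stage_ch
              self.concat_conv = nn.Conv2d(in_ch + n_layers * stage_ch, out_ch, 1)

          def forward(self, x):
              feats = [x]
              for layer in self.layers:
                  x = layer(x)
                  feats.append(x)
              return self.concat_conv(torch.cat(feats, dim=1))   # one-shot aggregation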
    • architecture

      • stem:3个3x3conv
      • downsamp:s2的maxpooling
      • stages:increasing channels enables more rich semantic high-level information,better feature representation
      • deeper:makes more modules in stage3/4

  4. 实验

    • one-stage:refineDet
    • two-stage:Mask-RCNN

GCN

发表于 2021-01-18 |

reference:https://mp.weixin.qq.com/s/SWQHgogAP164Kr082YkF4A

  1. 图

    • $G = (V,E)$:节点 & 边,连通图 & 孤立点
    • 邻接矩阵A:NxN,有向 & 无向
    • 度矩阵D:NxN对角矩阵,每个节点连接的节点
    • 特征矩阵X:NxF,每个1-dim F是每个节点的特征向量
  2. 特征学习

    • 可以类比CNN:对其邻域(kernel)内特征进行线性变换(w加权),然后求和,然后激活函数
    • $H^{k+1} = f(H^{k},A) = \sigma(AH^{k}W^{k})$
      • H:running updating 特征矩阵,NxFk
      • A:0-1邻接矩阵,NxN
      • W:权重,$F_k$x$F_{k+1}$
    • 权重所有节点共享
    • 节点的邻接节点可以看做感受野
    • 网络加深,感受野增大:节点的特征融合了更多节点的信息
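
    一个minimal sketch(示意用):$H^{k+1} = \sigma(AH^{k}W^{k})$的单层前向,A这里直接用带自环、已归一化的邻接矩阵代替:

      import numpy as np

      def gcn_layer(A_hat, H, W):
          """A_hat: 归一化邻接矩阵(N,N); H: 特征矩阵(N,F_in); W: 权重(F_in,F_out)"""
          return np.maximum(A_hat @ H @ W, 0)        # ReLU激活

      N, F_in, F_out = 4, 8, 16
      A_hat = np.eye(N)                              # 假设的邻接矩阵(仅示意)
      H = np.random.randn(N, F_in)
      W = np.random.randn(F_in, F_out) * 0.1
      H_next = gcn_layer(A_hat, H, W)                # (4, 16)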
  3. 图卷积

    • A中没有考虑自己的特征:添加自连接

      • A = A + I
    • 加法规则对度大的节点,特征会越来越大:归一化

      • 使得邻接矩阵每行和为1:左乘度矩阵的逆

      • 数学实质:求平均

      • one step further:不单对行做平均,对度较大的邻接节点也做punish

    • GCN网络

  4. 实现

    • weights:in x out,kaiming_uniform_initialize

    • bias:out,zero_initialize

    • activation:relu

    • A x H x W:左乘是稀疏矩阵乘法

    • 邻接矩阵的结构从输入开始就不变了,和每层的特征矩阵一起作为输入,传入GCN

    • 分类头:最后一层预测Nxn_class的特征向量,提取感兴趣节点F(n_class),然后softmax,对其分类

    • 归一化

      import numpy as np
      import scipy.sparse as sp

      # 对称归一化
      def normalize_adj(adj):
          """compute L = D^-0.5 * (A+I) * D^-0.5"""
          adj += sp.eye(adj.shape[0])
          degree = np.array(adj.sum(1))
          d_hat = sp.diags(np.power(degree, -0.5).flatten())
          norm_adj = d_hat.dot(adj).dot(d_hat)
          return norm_adj


      # 均值归一化
      def normalize_adj(adj):
          """compute L = D^-1 * (A+I)"""
          adj += sp.eye(adj.shape[0])
          degree = np.array(adj.sum(1))
          d_hat = sp.diags(np.power(degree, -1).flatten())
          norm_adj = d_hat.dot(adj)
          return norm_adj
  5. 应用场景

    [半监督分类GCN]:SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS,提出GCN

    [skin GCN]:Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks,体素,一个单独的基于图的相关性分支,给feature加权

    [Graph Attention]:Graph Attention Networks,图注意力网络

Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks

  1. 动机

    • 皮肤病:发病率高,experts少
    • differential diagnosis:鉴别诊断,就是从众多疾病类别中跳出正确类别
    • still challenging:timely and accurate
    • propose a DLS(deep learning system)
      • clinical images
      • multi-label classification
      • 80 conditions,覆盖病种
      • labels incompleteness:用GCN建模成Co-occurrence supervision,benefit top5
  2. 论点

    • google的DLS
      • 26种疾病
      • 建模成multi-class classification problem:非0即1的多标签表达破坏了类别间的correlation
    • our DLS:GCN-CNN
      • multi-label classification task over 80 conditions
      • incomplete image labels:GCN that characterizes label co-occurrence supervision
      • combine the classification network with the GCN
      • 数据量:136,462 clinical images
      • 精度:test on 12,378 user taken images,top-5 acc 93.6%
    • GCN
      • original application:
        • nodes classification,only a small subset of nodes had their labels available:半监督文本分类问题,只有一部分节点用于训练
        • the graph structure is contructed from data
      • ML-GCN:
        • multi-label classification task
        • correlation map(图结构)则是通过数据直接建立
        • 图节点是每个类别的semantic embeddings
  3. 方法

    • overview

      • 一个trainable的CNN,将图片转化成feature vector
      • 一个GCN branch:两层图卷积,都是order-1,图结构是基于训练集计算,无向图,encoding的是图像labels之间的dependency,用它 implicitly supervises the classification task
      • 然后两个feature vector相乘,给出最终结果

    • GCN branch

      • two graph convolutional (GC) layers
      • 一种estimated图结构:build co-occurence graph using only training data
        • node embed semantic meaning to labels
        • 边的值定义有点像类别间的相关性强度:$e_{ij} = 1(\frac{C(i,j)}{C(i)+C(j)} \geq t)$,分子是有两种标签的样本量,分母是各自样本量
      • 一种designed图结构:intial value是基于有经验的专家构建
      • node representation
        • graph branch的输入 label embedding
        • 用了BioSentVec,一个基于生物医学语料库训练的word bag
      • GCN
        • randomly initialize
        • GCN-0:dim 700
        • GCN-1:dim 1024
        • GCN-2:dim 2048
        • 最终得到(cls,2048)的node features
    • cls branch

      • input:downsized to 448x448
      • resnet101:执行到FC-2048,作为image features
      • 先训练300 epochs,lr 0.1,step decay
    • GCN-CNN

      • 先预训练resnet backbone,
      • 然后整体一起训练300 epochs,lr 0.0003,
      • image feature和node features通过dot product融合,得到(cls, )的cls vec,
  4. 实验

    • 图结构不能random initialization,会使结果变差
    • 基于数据集估计的graph initialization有显著提升

    • 基于专家设计的graph initialization有进一步提升,但是不明显,考虑到标注工作繁重不太推荐

SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

  1. reference

    • http://tkipf.github.io/graph-convolutional-networks/,官方博客
    • https://zhuanlan.zhihu.com/p/35630785,知乎笔记
  2. 论点

    • 场景
      • semi-supervised learning
      • on graph-structured data
      • 比如:在一个citation network,classifying nodes (such as documents),labels are only available for a small subset of nodes,任务的目标是对大部分未标记的节点预测类别
    • previous approach
      • Standard Approach
        • loss由两部分组成:单个节点的fitting error,和相邻节点的distance error
        • 基于一个假设:相邻节点间的label相似
        • 限制了模型的表达能力
      • Embedding-based Approach
        • 分两步进行:先学习节点的embedding,再基于embedding训练分类器
        • 不end-to-end,两个task分别执行,不能保证学到的embedding是适合第二个任务的
    • 思路
      • train on a supervised target for nodes with labels
      • 然后通过图的连通性,trainable adjacency matrix,传递梯度给unlabeled nodes
      • 使得全图得到监督信息
    • contributions
      • introduce a layer-wise propagation rule,使得神经网络能够operate on graph,实现end-to-end的图结构分类器
      • use this graph-based neural network model,训练一个semi-supervised classification of nodes的任务
  3. 方法

    • fast approximate convolutions on graphs

      • given:

        • layer input:$H^l$
        • layer output:$H^{l+1}$
        • kernel pattern:$A$,在卷积里面是fixed kxk 方格,在图里面就是自由度更高的邻接矩阵
        • kernel weights:$W$
      • general layer form:$H^{l+1}=f(H^l,A)$

      • inspiration:卷积其实是一种特殊的图,每个grid看作一个节点,每个节点都加上其邻居节点的信息,也就是:

        • W是在对grids加权
        • A是在对每个grids加上他的邻接节点

      • details in practice

        • 自环:保留自身节点信息,$\hat A=A+I$
        • 正则化:stabilize the scale,$H^{l+1}=\sigma(\hat D^{-\frac{1}{2}}\hat A\hat D^{-\frac{1}{2}}H^lW)$
        • 一个实验:只利用图的邻接矩阵,就能够学得效果不错

    • semi-supervised node classification

      • 思路就是在所有有标签节点上计算交叉熵loss
      • 模型结构
        • input:X,(b,N,D)
        • 两层图卷积
          • GCN1-relu:hidden F,(b,N,F)
          • GCN2-softmax:output Z,(b,N,cls)
        • 计算交叉熵
  4. code

    • torch/keras/tf官方都有:
      • https://github.com/tkipf/gcn,论文里给的tf这个链接
      • torch和keras的readme里面有说明,initialization scheme, dropout scheme, and dataset splits和tf版本不同,不是用来复现论文
      • python setup.py bdist_wheel
    • 数据集:Cora dataset,是一个图数据集,用于分类任务,数据集介绍https://blog.csdn.net/yeziand01/article/details/93374216
      • cora.content是所有论文的独自的信息,总共2708个样本,每一行都是论文编号+词向量1433-dim+论文类别
      • cora.cites是论文之间的引用记录,A to B的reflect pair,5429行,用于创建邻接矩阵

transformers

发表于 2021-01-18 |

startup

reference1:https://mp.weixin.qq.com/s/Rm899vLhmZ5eCjuy6mW_HA

reference2:https://zhuanlan.zhihu.com/p/308301901

  1. NLP & RNN

    • 文本涉及上下文关系
    • RNN时序串行,建立前后关系
    • 缺点:对超长依赖关系失效,不好并行化
  2. NLP & CNN

    • 文本是1维时间序列
    • 1D CNN,并行计算
    • 缺点:CNN擅长局部信息,卷积核尺寸和长距离依赖的balance
  3. NLP & transformer

    • 对流入的每个单词,建立其对词库的权重映射,权重代表attention
    • 自注意力机制
    • 建立长距离依赖

  4. put in CV

    • 插入类似的自注意力层
    • 完全抛弃卷积层,使用Transformers
  5. RNN & LSTM & GRU cell

    • 标准要素:输入x、输出y、隐层状态h

    • RNN

      • RNN cell每次接收一个当前输入$x_t$,和前一步的隐层输出$h_{t-1}$,然后产生一个新的隐层状态$h_t$,也是当前的输出$y_t$
      • formulation:$y_t, h_t = f(x_t, h_{t-1})$
      • same parameters for each time step:同一个cell每个time step的权重共享
      • 一个问题:梯度消失/爆炸

        • 考虑hidden states’ chain的简化形式:$h_t = \theta^t h_0$,一个sequence forward下去就是same weights multiplied over and over again
        • 另外tanh也是会让神经元梯度消失/爆炸

    • LSTM

      • key ingredient

        • cell:增加了一条cell state workflow,优化梯度流
        • gate:通过门结构删选携带信息,优化长距离关联

      • 可以看到LSTM的循环状态有两个:细胞状态$c_t$和隐层状态$h_t$,输出的$y_t$仍旧是$h_t$

    • GRU

      • LSTM的变体,仍旧是门结构,比LSTM结构简单,参数量小,据说更好训练

  6. papers

    [一个列了很多论文的主页] https://github.com/dk-liang/Awesome-Visual-Transformer

    [经典考古]

    ​ * [Seq2Seq 2014] Sequence to Sequence Learning with Neural Networks,Google,最早的encoder-decoder stacking LSTM用于机翻

    ​ * [self-attention/Transformer 2017] Transformer: Attention Is All You Need,Google,

    ​ * [bert 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,Google,NLP,输入single sentence/patched sentences,用Transformer encoder提取bidirectional cross sentence representation,用输出的第一个logit进行分类

    [综述]

    ​ * [综述2020] Efficient Transformers: A Survey,Google,

    ​ * [综述2021] Transformers in Vision: A Survey,迪拜,

    ​ * [综述2021] A Survey on Visual Transformer,华为,

    [classification]

    ​ * [ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE,Google,分类任务,用transformer的encoder替换CNN再加分类头,每个feature patch作为一个input embedding,channel dim是vector dim,可以看到跟bert基本一样,就是input sequence换成patch,后续基于它的提升有DeiT、LV-ViT

    ​ * [BotNet 2021] Bottleneck Transformers for Visual Recognition,Google,将CNN backbone最后几个stage替换成MSA

    ​ * [CvT 2021] CvT: Introducing Convolutions to Vision Transformers,微软,

    ​ * [Swin 2021] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,微软

    ​ * [PVT2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,跟swin一样也是multi-scale features

    [detection]

    ​ * [DeTR 2020] DeTR: End-to-End Object Detection with Transformers,Facebook,目标检测,CNN+transformer(en-de)+预测头,每个feature pixel作为一个input embedding,channel dim是vector dim

    ​ * [Deformable DETR]

    ​ * [Anchor DETR]

    ​ * 详见《det-transformers》

    [segmentation]

    ​ * [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,复旦,水,感觉就是把FCN的back换成transformer

    [Unet+Transformer]:

    ​ * [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation,英伟达,直接使用transformer encoder做unet encoder

    ​ * [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,encoder stream里面加transformer block

    ​ * [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation,大学,CNN feature和Transformer feature进行bifusion

    ​ * 详见《seg-transformers》

Sequence to Sequence

  1. [a keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

    • general case

      • extract the information of the entire input sequence
      • then start generate the output sequence
    • seq2seq model workflow

      • a (stacking of) RNN layer acts as encoder
        • processes the input sequence
        • returns its own internal state:不要RNN的outputs,只要internal states
        • encoder编码得到的东西叫Context Vector
      • a (stacking of) RNN layer acts as decoder
        • given previous characters of the target sequence
        • it is trained to predict the next characters of the target sequence
        • teacher forcing:
          • 输入是target sequence,训练目标是使模型输出offset by one timestep的target sequence
          • 也可以不teacher forcing:直接把预测作为next step的输入
        • Context Vector的同质性:每个step,decoder都读取一样的Context Vector作为initial_state
      • when inference
        • 第一步获取input sequence的state vectors
        • repeat
          • 给decoder输入input states和out sequence(begin with a 起始符)
          • 从prediction中拿到next character
        • append the character to the output sequence
        • until:得到end character / hit the character limit
    • implementation

      https://github.com/AmberzzZZ/transformer/blob/master/seq2seq.py

  2. one step further

    • 改进方向
      • bi-directional RNN:粗暴反转序列,有效涨点
      • attention:本质是将encoder的输出Context Vector加权
      • ConvS2S:还没看
    • 主要都是针对RNN的缺陷提出
  3. 动机

    • present a general end-to-end sequence learning approach
      • multi-layered LSTMs
      • encode the input seq to a fix-dim vector
      • decode the target seq from the fix-dim vector
    • LSTM did not have difficulty on long sentences
    • reversing the order of the words improved performance

  4. 方法

    • standard RNN

      • given a sequence $(x_1, x_2, …, x_T)$

      • iterating:

      • 如果输入、输出的长度事先已知且固定,一个RNN网络就能建模seq2seq model了

      • 如果输入、输出的长度不同、并且服从一些更复杂的关系?就得用两个RNN网络,一个将input seq映射成fixed-sized vector,另一个将vector映射成output seq,but long-term-dependency issue

    • LSTM

      • LSTM是始终带着全部seq的信息的,如上图那样
    • our actual model

      • use two LSTMs:encoder-decoder能够增加参数量
      • an LSTM with four layers:deeper
      • input sequence倒序:真正的句首更接近trans的句首,makes it easy for SGD to establish communication
    • training details

      • LSTM:4 layers,1000 cells
      • word-embedding:1000-dim,(input vocab 160,000, output vocab 80,000)
      • naive softmax
      • uniform initialization:(-0.08, 0.08)
      • SGD,lr=0.7,half by every half epoch,total 7.5 epochs
      • gradient norm [10, 25]
      • all sentences in a minibatch are roughly of the same length

Transformer: Attention Is All You Need

  1. 动机

    • sequence2sequence models
      • encoder + decoder
      • RNN / CNN + an attention path
    • we propose Transformer
      • base solely on attention mechanisms
      • more parallelizable and less training time
  2. 论点

    • sequence modeling
      • 主流:RNN,LSTM,gated
        • align the positions to computing time steps
        • sequential本质阻碍并行化
      • Attention mechanisms acts as a integral part
        • in previous work used in conjunction with the RNN
      • 为了并行化
        • some methods use CNN as basic building blocks
        • difficult to learn dependencies between distant positions
    • we propose Transformer
      • rely entirely on an attention mechanism
      • draw global dependencies
    • self-attention
      • relating different positions of a single sequence
      • to generate a overall representation of the sequence
  3. 方法

    • encoder-decoder

      • encoder:doc2emb
        • given an input sequence of symbol representation $(x_1, x_2, …, x_n)$
        • map to a sequence of continuous representations $(z_1, z_2, …, z_n)$,(embeddings)
      • decoder:hidden layers
        • given embeddings z
        • generate an output sequence $(y_1, y_2, …, y_m)$ one element at a time
        • the previous generated symbols are served as additional input when computing the current time step
    • Transformer Architecture

      • Transformer use

        • for both encoder and decoder
        • stacked self-attention and point-wise fully-connected layers

      • encoder

        • N=6 identical layers
        • each layer has 2 sub-layers
          • multi-head self-attention mechanism
          • postision-wise fully connected layer
        • residual
          • for two sub-layers independently
          • add & layer norm
        • d=512
      • decoder

        • N=6 identical layers
        • 3 sub-layers
          • [new] masked multi-head self-attention:combine了先验知识,output embedding只能基于在它之前的time-step的embedding计算
          • multi-head self-attention mechanism
          • postision-wise fully connected layer
        • residual
      • attention

        • reference:https://bbs.cvmart.net/articles/4032
        • step1:project embedding to query-key-value pairs
          • $Q = W_Q^{d\times d} A^{d\times N}$
          • $K = W_K^{d\times d} A^{d\times N}$
          • $V = W_V^{d\times d} A^{d\times N}$
        • step2:scaled dot-product attention
          • $A^{N\times N}=softmax(K^TQ/\sqrt{d})$
          • $B^{d\times N} = V^{d\times N}A^{N\times N}$
        • multi-head attention
          • 以上的step1&step2操作performs a single attention function
          • 事实上我们可以用多组projection得到多组$\{Q,K,V\}^h$,in parallel地执行attention运算,得到多组$\{B^{d\times N}\}^h$
          • concat & project
            • concat in d-dim:$B\in R^{(d\cdot h)\times N}$
            • linear project:$out = W^{d\times (d\cdot h)} B$
          • h=8
          • $d_{in}/h=64$:embedding的dim
          • $d_{out}=64$:query-key-value的dim
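
        一个minimal sketch(示意用),单组scaled dot-product attention;这里按更常见的(B, N, d)布局实现,与上文(d, N)的记号只是转置关系:

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            d = q.size(-1)
            attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, N, N)
            return attn @ v                                                # (B, N, d)

        # self-attention时Q、K、V来自同一输入的不同线性投影,这里简化为同一个张量
        q = k = v = torch.randn(2, 10, 64)
        out = scaled_dot_product_attention(q, k, v)                        # (2, 10, 64)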
      • positional encoding

        • 数学本质是一个hand-crafted的映射矩阵$W^P$和one-hot的编码向量$p$:

        • 用PE表示:$PE(pos, 2i) = sin(pos/10000^{2i/d}),\ PE(pos, 2i+1) = cos(pos/10000^{2i/d})$

          • pos是sequence x上的position
          • 2i和2i+1是embedding a上的idx
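
        一个minimal sketch(示意用)的sin/cos positional encoding矩阵:

        import numpy as np

        def positional_encoding(n_pos, d_model):
            pos = np.arange(n_pos)[:, None]                          # (n_pos, 1)
            i = np.arange(d_model)[None, :]                          # (1, d_model)
            angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
            pe = np.zeros((n_pos, d_model))
            pe[:, 0::2] = np.sin(angle[:, 0::2])                     # 偶数维用sin
            pe[:, 1::2] = np.cos(angle[:, 1::2])                     # 奇数维用cos
            return pe

        pe = positional_encoding(50, 512)                            # (50, 512)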
      • point-wise feed-forward network

        • fc-ReLU-fc
        • dim_fc=2048
        • dim_in & dim_out = 512
    • 运行过程

      • encoder是可以并行计算的

        • 输入是sequence embedding和positional embedding:$A\in R^{d*N}$
        • 经过repeated blocks
        • 输出是另外一个sequence:$B\in R^{d*N}$
        • self-attention:Q、K、V是一个东西
        • encoder的本质就是在解析自注意力:
        • 并行的全局两两比较,一步到位
          • RNN要by step
        • CNN要stack layers
  • decoder在训练阶段是可以并行的,在inference阶段by step

    • 输入是encoder的输出和上一个time-step decoder的输出embedding

    • 输出是当前time-step对应position的输出词的概率

    • 第一个attention layer是out embedding的self-attention:要实现像RNN一样依次解码出来,每个time step要用到上一个位置的输出作为输入——masking

      • given输入sequence是 <start> I have a cat,5个元素
      • 那么mask就是$R^{5*5}$的下三角矩阵
      • 输入embedding经过transformation变成Q、K、V三个矩阵

      • 仍旧是$A=K^TQ$计算attention

      • 这里有一些attention是非法的:位置靠前的query只能用到比它位置更靠前的query,因此要乘上mask矩阵(逐元素相乘):$A=M \odot A$

      • softmax:$A=softmax(A)$

      • scale:$B = VA$

      • concat & projection
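
      一个minimal sketch(示意用)的下三角mask,非法位置在softmax前置成-inf,和上面“乘上mask矩阵”等价地起到屏蔽作用:

      import torch
      import torch.nn.functional as F

      N = 5                                                   # e.g. "<start> I have a cat"
      mask = torch.tril(torch.ones(N, N))                     # 下三角,1表示可见
      scores = torch.randn(N, N)                              # K^T Q / sqrt(d)
      scores = scores.masked_fill(mask == 0, float('-inf'))   # 屏蔽非法attention
      attn = F.softmax(scores, dim=-1)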

        • 第二个attention layer是in & out sequence的注意力,其key和value来自encoder,query来自上一个decoder block的输出

  1. why self-attention

    • 衡量维度
      • total computational complexity per layer
      • amount of computation that can be parallelized
      • path-length between long-range dependencies
    • given input sequence with length N & dim $d_{in}$,output sequence with dim $d_{out}$
      • RNN need N sequential operations of $W\in R^{d_{in} \times d_{out}}$
      • CNN need N/k stacking layers of $d_{in} \times d_{out}$ sequence operations of $W\in R^{k\times k}$,generally是RNN的k倍
  2. training

    • optimizer:$Adam(lr, \beta_1=0.9, \beta_2=0.98, \epsilon=10^{-9})$
    • lrschedule:warmup by 4000 steps,then decay
    • dropout

      • residual dropout:对每个sub-layer的输出先dropout,再做residual add和layer norm
      • dropout to the sum of embeddings & PE for both encoder and decoder
      • drop_rate = 0.1
    • label smoothing:smooth_factor = 0.1

  3. 实验

    • A:vary the number of attention heads,发现多了少了都hurts
    • B:reduce the dim of attention key,发现hurts
    • C & D:大模型+dropout helps
    • E:learnable & sincos PE:nearly identical
    • 最后是big model的参数

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  1. 动机

    • BERT:Bidirectional Encoder Representations from Transformers
      • Bidirectional
      • Encoder
      • Representations
      • Transformers
    • workflow
      • pretrain bidirectional representations from unlabeled text
      • tune with one additional output layer to obtain the model
    • SOTA
      • GLUE score 80.5%
  2. 论点

    • pretraining is effective in NLP tasks
      • feature-based method:use task-specfic architectures,仅使用pretrained model的特征
      • fine-tuining method:直接fine-tune预训练模型
      • 两种方法在预训练阶段训练目标一致:use unidirectional language models to learn general language representations
      • reduce the need for many heavily-engineered task- specific architectures
    • current methods’ limitations
      • unidirectional:
        • limit the choice of architectures
        • 事实上token的上下文都很重要,不能只看上文
      • 简单的concat两个independent的L2R和R2L模型(biRNN)
        • independent
        • shallow concat
    • BERT
      • masked language model:在一个sequence中预测被遮挡的词
      • next sentence prediction:trains text-pair representations
  3. 方法

    • two steps

      • pre-training
        • unlabeled data
        • different pretraining tasks
      • fine-tuning
        • labeled data of the downstream tasks
        • fine-tune all the params
      • 两个阶段的模型,只有输出层不同

        • 例如问答模型
        • pretraining阶段,输入是两个sentence,输入的起始有一个CLS symbol,两个句子的分隔有一个SEP symbol
        • fine-tuning阶段,输入分别是问和答,【输出是啥?】

    • architecture

      • multi-layer bidirectional Transformer encoder

        • number of transfomer blocks L
        • hidden size H
        • number of self-attention heads A
        • FFN dim 4H
      • Bert base:L=12,H=768,A=12

      • Bert large:L=24,H=1024,A=16

      • input/output representations

        • a single sentence / two packed up sentence:
          • 拼接的sentence用特殊token SEP衔接
          • segment embedding:同时add a learned embedding to every token indicating who it belongs
        • use WordPiece embeddings with 30000 token vocabulary
        • 输入sequence的第一个token永远是一个特殊符号CLS,它对应的final state输出作为sentence整体的representation,用于分类任务
        • overall网络的input representation是通过将token embeddings拼接上上特殊符号,加上SE和PE得到

    • pre-training

      • two unsupervised tasks
        • Masked LM (MLM)
          • mask some percentage of the input tokens at random:15%
            • 80%的概率用MASK token替换
            • 10%的概率用random token替换
            • 10%的概率unchanged
          • then predict those masked tokens
          • the final hidden states corresponding to the masked tokens are fed into a softmax
          • 相比较于传统的left2right/right2left/concat模型
            • 既有前文又有后文
            • 只预测masked token,而不是全句预测
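
          一个minimal sketch(示意用)的15%采样 + 80/10/10替换策略,其中mask_id、vocab_size等都是假设的占位:

          import random

          def mask_tokens(tokens, mask_id, vocab_size, p=0.15):
              labels = [-100] * len(tokens)              # -100表示该位置不计loss
              for i in range(len(tokens)):
                  if random.random() < p:
                      labels[i] = tokens[i]              # 只在被选中的位置预测原token
                      r = random.random()
                      if r < 0.8:
                          tokens[i] = mask_id            # 80%: 替换成[MASK]
                      elif r < 0.9:
                          tokens[i] = random.randrange(vocab_size)   # 10%: 随机token
                      # 剩余10%: 保持不变
              return tokens, labels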
        • Next Sentence Prediction (NSP)
          • 对于relationship between sentences:
            • 例如question&answer,句子推断
            • not direatly captured by language modeling,模型直观学习的是token relationship
          • binarized next sentence prediction task
            • 选取sentence A&B:
              • 50%的概率是真的上下文(IsNext)
              • 50%的概率是random(NotNext)
            • 构成了一个二分类问题:仍旧用CLS token对应的hidden state C来预测
    • fine-tuning

      • BERT兼容many downstream tasks:single text or text pairs
      • 直接组好输入,end-to-end fine-tuning就行
      • 输出还是用CLS token对应的hidden state C来预测,接分类头

A Survey on Visual Transformer

  1. 动机

    • provide a comprehensive overview of the recent advances in visual transformers
    • discuss the potential directions for further improvement
    • develop timeline

  2. 按照应用场景分类

    • backbone:分类
    • high/mid-level vision:通常是语义相关的,检测/分割/姿态估计
    • low-level vision:对图像本身进行操作,超分/图像生成,目前应用较少
    • video processing

  3. revisiting transformer

    • key-concepts:sentence、embedding、positional encoding、encoder、decoder、self-attention layer、encoder-decoder attention layer、multi-head attention、feed-forward neural network

    • self-attention layer

      • input vector is transformed into 3 vectors
        • input vector is embedding+PE(pos,i):pos是word在sequence中的位置,i是PE-element在embedding vec中的位置
        • query vec q
        • key vec k
        • value vec v
        • $d_q = d_k = d_v = d_{model} = 512$
      • then calculate:$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
      • encoder-decoder attention layer
        • K和V是从encoder中拿到
        • Q是从前一层拿到
        • 计算是相似的
    • multi-head attention
      • 一个attention是一个softmax,对应了一对强相关,同时抑制了其他word的相关性
      • 考虑一个词往往与几个词强相关,这就需要多个attention
      • multi-head:different QKV matrices are used for different heads
      • given a input vector,the number of heads h
        • 先产生h个 pairs
        • $d_q=d_k=d_v=d_{model}/h=64$
        • 这h个pair,分别计算attention vector,得到h个[b,d]的context vector
        • concat along-d-axis and linear projection to final [b,d] vector
    • residual & layer-norm:layer-norm在residual-add以后
    • feed-forward network
      • fc-GeLU-fc
      • $d_h=2048$
    • final-layer in decoder
      • dense+softmax
      • $d_{words}=$ number of words in the vocabulary
    • when applied in CV tasks
      • most transformers adopt the original transformer’s encoder module
      • used as a feature selector
      • 相比较于CNN,能够capture long-distance characteristics,derive global information
      • 相比较于RNN,能够并行计算
    • 计算量
      • 首先是三个线性层:线性时间复杂度O(n),计算量与$d_{model}$成正比
      • 然后是self-attention层:QKV矩阵乘法运算,平方时间复杂度O(n^2)
      • multi-head的话,还有一个线性层:平方时间复杂度O(n^2)
  4. revisiting transformers for NLP

    • 最早期的RNN + attention:rnn的sequential本质影响了长距离/并行化/大模型
    • transformer的solely attention结构:解决以上问题,促进了large pre-trained models (PTMs) for NLP

    • BERT and its variants

      • are a series of PTMs built on the multi-layer transformer encoder architecture
      • pre-trained
        • Masked language modeling
        • Next sentence prediction
      • fine-tuned
        • add an output layer
    • Generative Pre-trained Transformer models (GPT)
      • are another type of PTMs based on the transformer decoder architecture
      • masked self-attention mechanisms
      • pre-trained
        • 与BERT最大的不同是有向性
  5. visual transformer

    • 【category1】: backbone for image classification

      • transformer的输入是tokens,在NLP里是embedding形式的分词序列,在CV里就是representing a certain semantic concept的visual token
        • visual token可以来自CNN的feature
        • 也可以直接来自image的小patch
      • purely use transformer来做image classification任务的模型有iGPT、ViT、DeiT

      • iGPT

        • pretraining stage + finetuning stage
        • pre-training stage
          • self-supervised:自监督,所以结果较差
          • given an unlabeled dataset
          • train the model by minimizing the -log(density),感觉是在force光栅排序正确
        • fine-tuning stage
          • average pool + fc + softmax
          • jointly train with L_gen & L_CE
      • ViT
        • pre-trained on large datasets
          • standard transformer’s encoder + MLP head
          • treats all patches equally
          • 有一个类似BERT class token的东西
            • 从训练的角度,gather knowledge of the entire class
            • inference的时候,只拿了这第一个logit用来做预测
        • fine-tuning
          • 换一个zero-initialized的MLP head
          • use higher resolution & 插值pe
      • DeiT
        • Data-efficient image transformer
        • better performance with
          • a more cautious training strategy
          • and a token-based distillation
    • 【category2】: High/Mid-level Vision

    • 【category3】: Low-level Vision

    • 【category4】: Video Processing

    • efficient transformer:瘦身&加速

      • Pruning and Decomposition
      • Knowledge Distillation
      • Quantization
      • Compact Architecture Design

ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

  1. 动机

    • attention in vision
      • either in conjunction with CNN
      • or replace certain part of a CNN
      • overall都还是CNN-based
    • use a pure transformer to sequence of image patches
    • verified on image classification tasks in supervised fashion
  2. 论点

    • transformer lack some inductive biases inherent to CNNs,所以在insufficient data上not generalize well
    • however large scale training trumps inductive bias,大数据集上ViT更好
    • naive application of self-attention
      • 建立pixel之间的两两关联:计算量太大了
      • 需要approximation:local/改变size
    • we use transformer
      • wih global self-attention
      • to full-sized images
  3. 方法

    • input 1D-embedding sequence

      • 将image $x\in R^{H\times W\times C}$ 展开成patches $\{x_p \in R^{P^2 \cdot C}\}$
      • thus sequence length $N=HW/P^2$
      • patch embedding:
        • use a trainable linear projection
        • fixed dimension size through-all
      • position embedding:
        • add to patch embedding
        • standard learnable 1D position embedding
      • prepended embedding:
        • 前置的learnable embedding $x_{class}$
        • similar to BERT’s class token
      • 以上三个embedding组合起来,作为输入sequence
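
      一个minimal sketch(非原论文实现)的patch embedding:用kernel=stride=P的conv等价实现“切patch + 线性投影”,再前置class token、加可学习的1D position embedding:

      import torch
      import torch.nn as nn

      class PatchEmbed(nn.Module):
          def __init__(self, img=224, patch=16, in_ch=3, dim=768):
              super().__init__()
              n = (img // patch) ** 2
              self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
              self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
              self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))

          def forward(self, x):
              x = self.proj(x).flatten(2).transpose(1, 2)          # (b, N, dim)
              cls = self.cls_token.expand(x.size(0), -1, -1)
              return torch.cat([cls, x], dim=1) + self.pos_embed   # (b, N+1, dim)

      tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))           # (1, 197, 768)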
    • transformer encoder

      • follow the original Transformer
      • 交替的MSA和MLP
      • layer norm LN
      • residual
      • GELU

    • hybrid architecture

      • input sequence也可以来源于CNN的feature maps
      • patch size可以是1x1
    • classification head

      • attached to $z_L^0$:是class token用来做预测
      • pre-training的时候是MLP
      • fine-tuning的时候换一个zero-initialized的single linear layer
    • workflow

      • typically先pre-train on large datasets
      • 再fine-tune to downstream tasks
      • fine-tune的时候替换一个zero-initialized的新线性分类头
      • when feeding images with higher resolution
        • keep the patch size
        • results in larger sequence length
        • 这时候pre-trained PE就no longer meaningful了
        • we therefore perform 2D interpolation基于它在原图上的位置
    • training details

      • Adam:$\beta_1=0.9,\beta_2=0.999$
      • batch size 4096
      • high weight decay 0.1
      • linear lr warmup & decay
    • fine-tuning details

      • SGDM
      • cosine LR
      • no weight decay
      • average 0.9999:对模型权重做Polyak averaging(EMA),factor 0.9999

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  1. 动机

    • use Transformer as visual tasks’ backbone
    • challenges of Transformer in vision domain
      • large variations of scales of the visual entities
      • high resolution of pixels
    • we propose hierarchical Transformer
      • shifted windows
      • self-attention in local windows
      • cross-window connection
    • verified on
      • classification:ImageNet top1 acc 86.4
      • detection:COCO box-MAP 58.7
      • segmentation:ADE20K
      • this paper主要介绍分类,检测是以swin作为backbone,用MaskRCNN等二阶段架构来训练的,分割是以swin作为backbone,用UperNet去训练的,具体模型配置official repo的readme里面有详细列表
  2. 论点

    • when transfer Transformer’s high performance in NLP domain to CV domain
      • differences between the two modalities
        • scale:NLP里面,word tokens serves as the basic element,但是CV里面,patch的形态大小都是可变的,previous methods里面,都是统一设定固定大小的patch token
        • resolution:主要问题就是self-attention的计算复杂度,是image size的平方
      • we propose Swin Transformer
        • hierarchial feature maps
        • linear computational complexity to image size
    • hierarchical
      • start from small patches
      • merge in deeper layers
      • 所以对不同尺度的特征patch进行了融合
    • linear complexity

      • compute self-attention locally in each window
      • the number of patches per window is fixed, while the number of windows is proportional to image size
      • hence the overall cost is linear (see the complexity formulas below)
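      • concretely, from the Swin paper, for an $h \times w$ patch feature map with channel dim $C$ and window size $M$:
        • $\Omega(MSA)=4hwC^2+2(hw)^2C$
        • $\Omega(W\text{-}MSA)=4hwC^2+2M^2hwC$
        • the former is quadratic in $hw$, the latter linear once $M$ is fixed (default $M=7$)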

    • shifted window approach

      • shifting the windows between consecutive layers builds bridges between neighboring windows
      • all query patches within a window share the same key set, which makes memory access hardware-friendly and gives windowed attention its latency advantage over sliding-window attention

    • previous attempts at applying Transformers to vision

      • self-attention based backbone architectures
        • 将部分/全部conv layers替换成self-attention
        • 模型主体架构还是ResNet
        • slightly better acc
        • larger latency caused by self-att
      • self-attention complement CNNs
        • 作为additional block,给到backbone/head,提供长距离信息
        • 有些检测/分割网络也开始用了transformer的encoder-decoder结构
      • transformer-based vision backbones
        • 主要就是ViT及其衍生品
        • ViT requires large-scale training sets
        • DeiT introduces training strategies
        • 但是还存在high resolution计算量的问题
  3. 方法

    • overview

      • Swin-T:tiny version
      • step 1 is patch partition:
        • split the RGB image into non-overlapping patches
        • each patch is a token, the basic element
        • feature input dim: with patch size 4x4, dim = 4x4x3 = 48
      • then a linear embedding layer
        • re-projects the raw feature to a specified dimension
        • specified dimension C: default = 96
      • next come the Swin Transformer blocks
        • the number of tokens is maintained
      • patch merging layers are responsible for reducing the number of tokens
        • the first patch merging layer concats every 2x2 group of neighboring patches: a 4C-dim vector each
        • followed by a linear re-projection layer
        • number of tokens (resolution): (H/4*W/4)/4 = H/8*W/8, the same progression as in a regular CNN
        • token dims: 2C
        • followed by another stack of Transformer blocks
        • together these form stage 2 (and likewise stage 3, stage 4); a merging sketch follows below
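
A rough PyTorch sketch of such a patch merging layer (the official repo works on flattened [B, L, C] tokens; this version keeps the [B, H, W, C] layout for readability):

```python
# Sketch (PyTorch) of a patch merging layer: concat each 2x2 neighborhood (C -> 4C),
# then a linear reduction to 2C, halving the token resolution.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: [B, H, W, C], H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # [B, H/2, W/2, 4C]
        return self.reduction(self.norm(x))                           # [B, H/2, W/2, 2C]

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))                    # -> [1, 28, 28, 192]
```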
    • Swin Transformer blocks

      • compared with the original Transformer block, the only change is replacing the standard MSA with a window-based MSA

      • 原始的attention:global computation leads to quadratic complexity

      • window-based attention:

        • attention的计算只发生在每个window内部
        • non-overlapping partition
        • 很显然lacks connections across windows
      • shifted window partitioning in successive blocks

        • two consecutive attention blocks

        • the first uses the regular window partitioning strategy: starting from the top-left corner, e.g. M=4, window size 4x4 (each window contains 4x4 patches)

        • the second layer’s windows are shifted by M/2 in each direction relative to the previous layer

        • this introduces connections between neighboring non-overlapping windows of the previous layer

        • efficient computation

          • naively, the shifted partition yields windows of unequal size, which is bad for parallel computation
          • the paper instead cyclic-shifts the feature map toward the top-left, keeps the regular partition, masks the attention between sub-windows that are not actually adjacent, and reverses the shift afterwards (see the sketch after this list)

      • relative position bias

        • local attention is computed inside each MxM window, so the input sequence length (time-steps) is $M^2$
        • Q、K、V $\in R ^ {M^2 \times d}$
        • $Attention(Q,K,V)=Softmax(QK^T/\sqrt{d}+B)V$
        • B is the local relative position bias; in 2D the relative offset along each axis lies in [-M+1, M-1]
        • we therefore parameterize a smaller bias matrix $\hat B\in R ^{(2M-1)\times(2M-1)}$
        • values in $B \in R ^ {M^2\times M^2}$ are taken from $\hat B$
        • the learnt relative position bias can be used to initialize the fine-tuned model
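
A PyTorch sketch of the pieces above: window partition, the cyclic shift used by the shifted-window blocks (the attention mask is omitted), and the relative-position index that gathers $B$ from the small table $\hat B$. Names and the 3-head example are illustrative assumptions:

```python
# Sketch (PyTorch) of window partition, cyclic shift, and the relative-position index
# into the (2M-1)^2 bias table; the masking of non-adjacent sub-windows is omitted.
import torch

def window_partition(x, M):                        # x: [B, H, W, C] -> [B*num_windows, M, M, C]
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def shift_then_partition(x, M):
    # roll by M//2 so the new windows straddle the previous window borders;
    # the real implementation masks attention between wrapped-around sub-windows
    shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
    return window_partition(shifted, M)

def relative_position_index(M):
    coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # [2, M, M]
    coords = coords.flatten(1)                                   # [2, M*M]
    rel = coords[:, :, None] - coords[:, None, :]                # [2, M*M, M*M], each axis in [-(M-1), M-1]
    rel = rel.permute(1, 2, 0) + (M - 1)                         # shift to [0, 2M-2]
    return rel[..., 0] * (2 * M - 1) + rel[..., 1]               # flat index into the (2M-1)^2 table

M, num_heads = 7, 3
idx = relative_position_index(M)                                 # [49, 49]
bias_table = torch.zeros((2 * M - 1) ** 2, num_heads)            # learnable \hat B
B = bias_table[idx.view(-1)].view(M * M, M * M, num_heads)       # the bias added inside the softmax
```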
    • Architecture variants

      • base model:Swin-B,参数量对标ViT-B

      • Swin-T:0.25x,对标ResNet-50 (DeiT-S)

      • Swin-S:0.5x,对标ResNet-101

      • Swin-L:2x

      • window size:M=7

      • query dim:d=32,(每个stage的input sequence dim逐渐x2,heads num逐渐x2)

      • MLP:expansion ratio=4

      • channel number C: the embedding dim of the first stage (doubled in each subsequent stage)

      • hypers:

        • drop_rate:0.0
        • drop_path_rate:0.1

      • acc

  4. official repo: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md

    • keras官方也出了一版:https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py

    • model zoo

      | model | resolution | C | num_layers | num_heads | window_size |
      | --- | --- | --- | --- | --- | --- |
      | Swin-T | 224 | 96 | {2,2,6,2} | {3,6,12,24} | 7 |
      | Swin-S | 224 | 96 | {2,2,18,2} | {3,6,12,24} | 7 |
      | Swin-B | 224/384 | 128 | {2,2,18,2} | {4,8,16,32} | 7/12 |
      | Swin-L | 224/384 | 192 | {2,2,18,2} | {6,12,24,48} | 7/12 |

    • models/build.py

      • SwinTransformer & SwinMLP: the former is the model from the paper, whose basic block is the Transformer MSA plus MLP layers; the latter drops MSA and instead models the relationships across neighboring windows with MLPs implemented as conv1d.

DETR: End-to-End Object Detection with Transformers

  1. 动机

    • new task formulation:a direct set prediction problem
    • main ingredients
      • a set-based global loss
      • a transformer encoder-decoder architecture
      • removal of hand-designed components like NMS & anchors
    • acc & run-time on par with Faster R-CNN on COCO
      • significantly better performance on large objects
      • lower performances on small objects
  2. 论点

    • modern detectors run object detection in an indirect way

      • regression and classification are performed on grid cells / anchors / proposals
      • performance is constrained by the NMS mechanism, the anchor design, and the target-anchor matching scheme
    • end-to-end approach

      • the transformer’s self-attention explicitly models all pairwise interactions between elements, which implicitly provides the de-duplication ability of NMS
      • bipartite matching: a set loss function that matches predicted and gt boxes one-to-one; predictions are decoded in parallel
      • DETR does not require any customized layers, thus can be reproduced easily
      • extends to segmentation: a simple segmentation head trained on top of a pre-trained DETR
    • set prediction:to predict a set of bounding boxes and the categories for each

      • basic:multilabel classification
      • detection task has near-duplicates issues
      • set prediction是postprocessing-free的,它的global inference schemes能够avoid redundancy
      • usual loss:bipartite match
    • object detection

      • set-based loss
        • modern detectors use non-unique assignment rules together with NMS
        • bipartite matching是target和pred一一对应
  3. 方法

    • overall

      • three main components
        • a CNN backbone
        • an encoder-decoder transformer
        • a simple FFN
    • backbone

      • conventional r50
      • input:$[H_0, W_0, 3]$
      • output:$[H,W,C], H=\frac{H_0}{32}, W=\frac{W_0}{32}, C=2048$
    • transformer encoder

      • reduce channel dim to $d$ with a 1x1 conv ($d=256$ in the base DETR config)
      • collapse the spatial dimensions: feature sequence [d, HW], each spatial pixel acts as one feature
      • fixed positional encodings:
        • added to the input of each attention layer
        • concretely they are added to Q and K (not to V) at every attention layer, rather than only once to the input embedding
    • transformer decoder

      • the decoder takes N embeddings of dim d as input
        • called object queries: the model always predicts a fixed number N of objects
        • because the decoder is also permutation-invariant (everything is shared), the N input embeddings must be distinct
        • they are learnt positional encodings
        • added to the input of each attention layer
      • decodes the N objects in parallel
    • prediction FFN

      • a 3-layer MLP with ReLU activations
      • box prediction: normalized center coords & height & width
      • class prediction:
        • an additional class label $\varnothing$ denotes “no object”
    • auxiliary losses

      • each decoder layer后面都接一个FFN prediction和Hungarian loss
      • shared FFN
      • an additional shared LN to norm the inputs of FFN
      • three components of the loss
        • class loss:CE loss
        • box loss
          • GIOU loss
          • L1 loss
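
A sketch of the bipartite matching step using SciPy’s Hungarian solver; the cost weights and the omission of the GIoU term are simplifying assumptions, not DETR’s exact configuration:

```python
# Sketch of set-based matching: build a cost matrix between the N predictions and the
# ground-truth boxes from class probability and an L1 box term, then solve it optimally.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    # pred_probs: [N, num_classes+1]; pred_boxes / gt_boxes: normalized (cx, cy, w, h)
    cost_cls = -pred_probs[:, gt_labels]                                  # [N, num_gt]
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_l1 * cost_l1                              # (+ GIoU term in the paper)
    pred_idx, gt_idx = linear_sum_assignment(cost)                        # one-to-one assignment
    return pred_idx, gt_idx                                               # unmatched preds -> "no object"

probs = np.random.rand(100, 92); probs /= probs.sum(-1, keepdims=True)
pi, gi = hungarian_match(probs, np.random.rand(100, 4), np.array([3, 17]), np.random.rand(2, 4))
```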
    • technical details

      • AdamW:
        • initial transformer lr = $10^{-4}$
        • initial backbone lr = $10^{-5}$
        • weight decay = $10^{-4}$
      • Xavier init
      • imagenet-pretrained resnet weights with frozen batchnorm layers:r50 & r101,DETR & DETR-R101
      • a variant:
        • increase feature resolution version
        • remove stage5’s stride and add a dilation
        • DETR-DC5 & DETR-DC5-R101
        • improve performance for small objects
        • overall 2x computation increase
      • augmentation
        • resize input
        • random crop:with 0.5 prob then resize
      • transformer default dropout 0.1
      • lr schedule
        • 300 epochs
        • drop by factor 10 after 200 epochs
      • 4 images per GPU,total batch 64
    • for the segmentation task: panoptic segmentation

      • add a mask head on top of the decoder outputs
      • compute multi-head attention among
        • decoder box predictions
        • encoder outputs
      • generate M attention heatmaps per object
      • add a FPN styled CNN to recover resolution
      • pixel-wise argmax

UNETR: Transformers for 3D Medical Image Segmentation

  1. 动机

    • U-Net style architectures for medical segmentation
      • the encoder learns global context
      • the decoder utilizes these representations to predict the semantic outputs
      • the locality of CNNs limits long-range spatial dependencies
    • our method
      • use a pure transformer as the encoder
      • learn sequence representations of the input volume
      • global
      • multi-scale
      • encoder directly connects to decoder with skip connections
  2. 论点

    • unet结构
      • encoder用来提取全图特征
      • decoder用来recover
      • skip connections用来补充spatial information that is lost during downsampling
      • localized receptive fields:
        • disadvantage in capturing multi-scale contextual information
        • 如不同尺寸的脑肿瘤
        • 缓和手段:atrous convs,still limited
    • transformer
      • self-attention mechanism in NLP
        • highlight the important features of word sequences
        • learn its long-range dependencies
      • in ViT
        • an image is represented as a patch embedding sequence
    • our method
      • formulation
        • 1D seq2seq problem
        • use embedded patches
      • the first completely transformer-based encoder
    • other unet- transformer methods
      • 2D (ours 3D)
      • employ only in the bottleneck (ours pure transformer)
      • CNN & transformer in separate streams and fuse
  3. 方法

    • overview

    • transformer encoder

      • input:1D sequence of input embeddings
      • given 3D volume $x \in R^{HWDC}$
      • divide it into flattened uniform non-overlapping patches $x_p\in R^{L \times (N^3 C)}$
        • $L=HWD/N^3$:the sequence length
        • $N^3$:the patch resolution (N voxels per side)
      • linear projection to a K-dim embedding space, giving $E \in R^{L \times K}$, which remains constant throughout the transformer
      • 1D learnable positional embedding $E_{pos} \in R^{L \times K}$
      • 12 self-att blocks:MSA + MLP
    • decoder &skip connections
      • take the outputs of encoder blocks {3, 6, 9, 12}
      • reshape each back to a 3D volume $[\frac{H}{N},\frac{W}{N},\frac{D}{N},K]$
      • consecutive 3x3x3 conv+BN+ReLU
      • bottleneck
        • deconv by 2 to increase resolution
        • then concat with the previous resized feature
        • then jointly apply consecutive convs
        • then upsample with deconv…
      • once back at the original resolution, apply consecutive convs, then a 1x1x1 conv + softmax
    • loss
      • dice loss
        • dice: compute dice for each class channel, then average over classes
        • loss = 1 - dice
      • ce loss
        • for each voxel, compute the cross-entropy, then average over all voxels (a combined sketch follows below)
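
A minimal PyTorch sketch of this Dice + CE combination (class-averaged soft Dice plus voxel-averaged cross-entropy); the exact weighting and smoothing in UNETR may differ:

```python
# Sketch (PyTorch): soft Dice averaged over class channels plus voxel-wise cross-entropy,
# for softmax probabilities p and one-hot targets y.
import torch

def dice_ce_loss(p, y, eps=1e-6):
    # p, y: [B, num_classes, H, W, D]
    dims = (0, 2, 3, 4)                                         # sum over batch + spatial dims
    dice = (2 * (p * y).sum(dims) + eps) / (p.sum(dims) + y.sum(dims) + eps)
    dice_loss = 1 - dice.mean()                                 # class-averaged
    ce_loss = -(y * torch.log(p.clamp_min(eps))).sum(1).mean()  # voxel-averaged CE
    return dice_loss + ce_loss

p = torch.softmax(torch.randn(1, 4, 16, 16, 16), dim=1)
y = torch.nn.functional.one_hot(torch.randint(0, 4, (1, 16, 16, 16)), 4).permute(0, 4, 1, 2, 3).float()
print(dice_ce_loss(p, y))
```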

pre-training & self-training

发表于 2021-01-17 |

[pre-training] Rethinking ImageNet Pre-training, He Kaiming: ImageNet pre-training does not really help accuracy, it only speeds up convergence; random initialization can reach no-worse results, provided the data is plentiful and the augmentation is strong. For small teams with little data this changes nothing, since the speedup is exactly what we want.

[pre-training & self-training] Rethinking Pre-training and Self-training, Google Brain: task-specific pseudo labels beat whatever labels pre-training produces, again provided you can pile on data; for small teams this is of limited use, and in the low-data regime pre-training remains the safe bet.

Both papers are essentially discussions of the value of ImageNet pre-training under cross-task transfer:

  • for classification, pre-trained weights are still worth using
  • Kaiming’s paper states a fact but offers little practical guidance
  • Google’s goes one step further: self-training is worth a try under realistic conditions

Rethinking Pre-training and Self-training

  1. 动机

    • given fact:ImageNet pre-training has limited impact on COCO object detection
    • investigate self-training to utilize the additional data
  2. 论点

    • common practice pre-training
      • supervised pre-training
        • 首先要求数据有标签
        • pre-train the backbone on ImageNet as a classification task
      • 弱监督学习
        • with pseudo/noisy label
        • kaiming:Exploring the limits of weakly supervised pretraining
      • self-supervised pre-training
        • 无标签的海量数据
        • 构造学习目标:autoencoder,contrastive,…
        • https://zhuanlan.zhihu.com/p/108906502
    • self-training paradigm on COCO
      • train an object detection model on COCO
      • generate pseudo labels on ImageNet
      • both labeled data are combined to train a new model
      • 基本基于noisy student的方法
    • observations
      • with stronger data augmentation, pre-training hurts the accuracy, but helps in self-training
      • both supervised and self-supervised pre-training methods fails
      • the benefit of pre-training does not cancel out the gain by self-training
      • flexible about unlabeled data sources, model architectures and computer vision tasks
  3. 方法

    • data augmentation
      • vary the strength of data augmentation as 4 levels
    • pre-training

      • efficientNet-B7
      • AutoAugment weights & noisy student weights

    • self-training

      • noisy student scheme
      • 实验发现self-training with this standard loss function can be unstable
      • implement a loss normalization technique
    • experimental settings
      • object detection
        • COCO dataset for supervised learning
        • unlabeled ImageNet and OpenImages dataset for self-training:score thresh 0.5 to generate pesudo labels
        • retinaNet & spineNet
        • batch:half supervised half pesudo
      • semantic segmentation
        • PASCAL VOC 2012 for supervised learning
        • augmented PASCAL & COCO & ImageNet for self-training:score thresh 0.5 to generate pesudo masks & multi-scale
        • NAS-FPN
  4. 实验

    • pre-training

      • Pre-training hurts performance when stronger data augmentation is used (perhaps because it sharpens the mismatch between the datasets?)
      • More labeled data diminishes the value of pre-training (our own datasets are usually a small fraction of ImageNet’s size, so in theory it should not harm?)
      • self-supervised pre-training hurts in the same way once augmentation is strengthened

    • self-training

      • Self-training helps in high data/strong augmentation regimes, even when pre-training hurts:不同的augment level,self-training对最终结果都有加成

      • Self-training works across dataset sizes and is additive to pre-training:不同的数据量,也都有加成,但是low data regime下enjoys the biggest gain

    • discussion

      • weak performance of pre-training is that pre-training is not aware of the task of interest and can fail to adapt
      • jointly training also helps:address the mismatch between two dataset
      • noisy labeling is worse than targeted pseudo labeling
    • overall conclusion: with small sample sizes, pre-training still adds value and self-training improves it further; with plenty of data, go straight to self-training

Rethinking ImageNet Pre-training

  1. 动机
    • thinking random initialization & pre-training
    • ImageNet pre-training
      • speed up
      • but not necessarily improving
    • random initialization
      • can achieve no worse result
      • robust to data size, models, tasks and metrics
    • rethink current paradigm of ‘pre- training and fine-tuning’
  2. 论点
    • no fundamental obstacle preventing us from training from scratch
      • if use normalization techniques appropriately
      • if train sufficiently long
    • pre-training
      • speed up
      • when fine-tuning on small dataset new hyper-parameters must be selected to avoid overfitting
      • localization-sensitive task benefits limited from pre-training
      • aimed at communities that don’t have enough data or computational resources
  3. 方法
    • normalization
      • form
        • normalized parameter initialization
        • normalization layers
      • BN layers makes training from scratch difficult
        • small batch size degrade the acc of BN
        • fine-tuning可以freeze BN
        • alternatives
          • GN:对batch size不敏感
          • syncBN
      • with appropriately normalized initialization可以train from scratch VGG这种不用BN层的
    • convergence
      • a pre-trained model has learned low-level features that do not need to be re-learned during fine-tuning
      • random-initial training need more iterations to learn both low-level and semantic features
  4. 实验
    • investigate maskRCNN
      • 替换BN:GN/sync-BN
      • learning rate:
        • training longer for the first (large) learning rate is useful
        • but training for longer on small learning rates often leads to overfitting
    • from 10k COCO images upwards, train-from-scratch results can catch up with pre-training results, as long as training runs long enough
    • at 1k and 3.5k COCO images, convergence is no worse, but validation results are somewhat worse: strong overfitting due to lack of data
    • PASCAL results are also a bit worse, because it has fewer instances and categories and is not directly comparable to the same number of COCO images: fewer instances and categories have a similar negative impact as insufficient training data

long-tailed

发表于 2021-01-11 |

[bag of tricks] Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks: the conclusion is a two-stage recipe; the combination of input mixup + CAM-based DRS + mixup-muted fine-tuning works best

[balanced-meta softmax] Balanced Meta-Softmax for Long-Tailed Visual Recognition:商汤

[eql] Equalization Loss for Long-Tailed Object Recognition

[eql2] Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection

[Class Rectification Loss] Imbalanced Deep Learning by Minority Class Incremental Rectification: proposes CRL so that the model can learn the boundaries of sparsely distributed minority classes, counteracting the dominance of the majority classes

Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks

  1. 动机

    • to give a detailed experimental guideline of common tricks
    • to obtain the effective combinations of these tricks
    • propose a novel data augmentation approach
  2. 论点

    • long-tailed datasets
      • poor accuray on the under-presented minority
      • long-tailed CIFAR:
        • 指数型衰减
        • imbalance factor:50/100
        • test set unchanged
      • ImageNet-LT
        • sampling the origin set follow the pareto distribution
        • test set is balanced
      • iNaturalist
        • extremely imbalanced real world dataset
        • fine-grained problem
    • different learning paradigms
      • metric learning
      • meta learning
      • knowledge transfer
      • suffer from high sensitivity to hyper-parameters
    • training tricks
      • re-weighting
      • re-sample
      • mixup
      • two-stage training
      • different tricks might hurt each other
      • propose a novel data augmentation approach based on CAM:generate images with transferred foreground and unchanged background
  3. 方法

    • start from baseline

    • re-weighting

      • baseline:CE
      • re-weighting methods:
        • cost-sensitive CE: weight each class linearly by sample count, $\frac{n_c}{n_{min}}$
        • focal loss: up-weight hard samples
        • class-balanced loss:
          • effective number rather than raw sample count $n_c$
          • hyperparameter $\beta$ and weighting factor $\frac{1-\beta}{1-\beta^{n_c}}$
        • effective on CIFAR-10, but no longer on CIFAR-100 (a weighting sketch follows below)
          • direct application throughout the training procedure is not a proper choice
          • especially when the number of classes grows and the imbalance worsens
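
A small sketch of the class-balanced weighting described above (effective-number weights plugged into a weighted CE); the final normalization is an illustrative choice:

```python
# Sketch (PyTorch) of class-balanced re-weighting via the effective number of samples:
# weight_c = (1 - beta) / (1 - beta^{n_c}), then plugged into a weighted cross-entropy.
import torch
import torch.nn.functional as F

def class_balanced_weights(n_per_class, beta=0.9999):
    n = torch.as_tensor(n_per_class, dtype=torch.float)
    w = (1.0 - beta) / (1.0 - beta ** n)
    return w / w.sum() * len(n)          # normalize so the weights average to 1 (illustrative)

counts = [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]   # a long-tailed CIFAR-10-like split
loss = F.cross_entropy(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                       weight=class_balanced_weights(counts))
```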
    • re-sampling

      • re-sampling methods
        • over-sampling:
          • 随机复制minority
          • might leads to overfitting
        • under-sampling
          • 随机去掉一些majority
          • be preferable to over-sampling
        • 有规律地sampling
          • 大体都是imbalanced向着lighter imbalanced向着balanced推动
        • artificial sampling methods
          • create artificial samples
          • sample based on gradients and features
          • likely to introduce noisy data
      • 观察到提升效果不明显
    • mixup

      • input mixup: can be further improved if we remove the mixup in the last several epochs
      • manifold mixup: applied on only one layer
      • the two mixup variants look comparable at first, but input mixup turns out to be slightly better

        • fine-tuning a few extra epochs with input mixup removed improves results further, whereas removing manifold mixup makes them worse

    • two-stage training

      • imbalanced training + balanced fine-tuning
      • vanilla training schedule on the imbalanced data
        • learn the features first
      • fine-tune on balanced subsets
        • then adjust for recognition accuracy
        • deferred re-balancing by re-sampling (DRS): propose CAM-based sampling
        • deferred re-balancing by re-weighting (DRW)
      • proposed CAM-based sampling
        • plain DRS can only replicate or remove samples
        • for each sampled image, apply the trained model & its ground-truth label to generate a CAM
        • use the mean value of the heatmap as a threshold to separate foreground from background
        • apply transformations to the foreground only
          • horizontal flipping
          • translation
          • rotating
          • scaling
      • re-sampling during fine-tuning is found to be better than re-sampling from the start
      • the proposed CAM-based sampling beats other sampling schemes, with CAM-based balance-sampling the best
      • ImageTrans balance-sampling applies the same transforms but without CAM foreground/background separation; it is worse than CAM-based, which shows the CAM step matters

      • re-weighting during fine-tuning is likewise better than re-weighting from the start

      • among the re-weighting options, CSCE (linear weighting by sample count) is the best

      • overall, DRS results are slightly better than DRW

    • trick combinations

      • in the two-stage setting, CAM-based DRS is slightly better than DRW, and using both together does not further improve
      • when mixup is added on top, input mixup is somewhat better than manifold mixup
      • the conclusion: input mixup + CAM-based DRS + mixup-muted fine-tuning, applying the tricks incrementally

Balanced Meta-Softmax for Long-Tailed Visual Recognition

  1. 动机

    • long-tailed:mismatch between training and testing distributions
    • softmax:biased gradient estimation under the long-tailed setup
    • propose
      • Balanced Softmax:an elegant unbiased extension of Softmax
      • apply a complementary Meta Sampler:optimal sample rate
    • classification & segmentation
  2. 论点

    • raw baseline:a model that minimizes empirical risk on long-tailed training datasets often underperforms on a class-balanced test set
    • most methods use re-sampling or re-weighting
      • to simulate a balanced dataset
      • may under-class the majority or have gradient issue
    • meta-learning
      • optimize the weight per sample
      • need a clean and unbiased dataset
    • decoupled training
      • 就是上面一篇论文中的两阶段,第一阶段先学表征,第二阶段调整分布fine-tuning
      • not adequate for datasets with extremely high imbalance factor
    • LDAM
      • Label-Distribution-Aware Margin Loss
      • larger generalization error bound for minority
      • suit for binary classification
    • we propose BALMS
      • Balanced Meta-Softmax
      • theoretically equivalent with generalization error bound
      • for datasets with high imbalance factors should combine Meta Sampler
  3. 方法

    • balanced softmax

      • biased: viewed through Bayes’ rule, the standard softmax implicitly assumes a uniform prior p(y); under a long-tailed label distribution this estimate is biased
      • re-weighting:

        • applied inside the softmax term
        • weighted linearly by the per-class sample count (a sketch follows below)

      • mathematically: we need to focus on minimizing the training loss of the tail classes
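
A sketch of Balanced Softmax as a logit adjustment: adding $\log n_j$ to the logits before the ordinary softmax CE reproduces the sample-count-weighted softmax above:

```python
# Sketch (PyTorch) of Balanced Softmax: shift each logit by the log class prior before CE,
# equivalent to softmax with n_j * e^{z_j} / sum_k n_k * e^{z_k}.
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits, targets, n_per_class):
    prior = torch.as_tensor(n_per_class, dtype=torch.float, device=logits.device)
    return F.cross_entropy(logits + prior.log(), targets)

counts = [5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50]
loss = balanced_softmax_ce(torch.randn(8, 10), torch.randint(0, 10, (8,)), counts)
```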

    • meta sampler

      • directly combining re-sampling and re-weighting may worsen performance
      • class-balanced re-sampling can suffer from an over-balance issue
    • combination procedure

      • with the current sampling distribution, compute the Balanced Softmax loss and keep a copy of the model after one gradient update
      • compute this temporary model’s CE on the meta set and update the distribution embedding by gradient: this evaluates how good the current distribution is and corrects it in the right direction
      • update the real model with the new distribution using the Balanced Softmax loss: the optimized distribution guides the model’s learning
  4. 实验

    • CE的结果呈现明显的长尾同分布趋势
    • CBS有缓解
    • BS更好
    • BS+CBS会over sample
    • BS+meta最好

Imbalanced Deep Learning by Minority Class Incremental Rectification

  1. 动机

    • significantly imbalanced training data
    • propose
      • batch-wise incremental minority class rectification model
      • Class Rectification Loss (CRL)
    • bring benefits to both minority and majority class boundary learning

  2. 论点

    • Most methods produce learning bias towards the majority classes
      • to eliminate bias
        • lifting the importance of minority classes:over-sampling can easily cause model overfitting,可能造成对小类别的过分关注,而对大类别不够重视,影响模型泛化能力
        • cost-sensitive learning:difficult to optimise
        • threshold-adjustment technique:given by experts
    • previous methods mainly investigate single-label binary-class with small imbalance ratio
    • real data
      • large ratio:power-law distributions
      • Subtle appearance discrepancy
    • hard sample mining
      • hard negatives are more informative than easy negatives as they violate a model class boundary
      • we only consider hard mining on the minority classes for efficiency
      • our batch-balancing hard mining strategy:eliminating exhaustive searching
    • LMLE
      • 唯一的竞品:考虑了data imbalance的细粒度分类
      • not end-to-end
      • global hard mining
      • computationally complex and expensive
  3. 方法

    • CRL overview

      • explicitly imposing structural discrimination of minority classes
      • batch-wise
      • operate on CE
      • focus on the minority classes only: the conventional CE loss can already model the majority classes well
    • limitations of CE

      • CE treat the individual samples and classes as equally important
      • the learned model is suboptimal
      • boundaries are biased towards majority classes
    • profile the class distribution for each class

      • hard mining
      • overview

    • minority class hard sample mining

      • selectively “borrowing” majority class samples from class decision boundary

      • to minority class’s perspective:mining both hard-positive and hard-negative samples

      • define minority class:selected in each mini-batch

      • Incremental refinement:

        • eliminates the LMLE’s drawback in assuming that local group structures of all classes can be estimated reliably by offline global clustering
        • mini-batch的data distribution和训练集不是完全一致的
      • steps

        • profile the minority and majority classes per label in each training mini-batch

          • for each sample,for each class $j$,for each pred class $k$,we have $h^j=[h_1^j, …, h_k^j, …, h_{n_cls}^j]$
          • sort $h_k^j$ in descent order,define the minority classes for each class with $C_{min}^j = \sum_{k\in C_{min}^j}h_k^j \leq \rho * n_{bs}$,with $\rho=0.5$
        • hard mining

          • hardness

            • score based:prediction score,class-level
            • feature based:feature distance,instance-level
          • class-level,for class c

            • hard-positives:same gt class,but low prediction
            • hard-negative:different gt class,with high prediction
          • instance-level,for each sample in class c

            • hard-positives:same gt class,large distance with current sample
            • hard-negative:different gt class,small distance with current sample
          • top-k mining

            • hard-positives:bottom-k scored on c/top-k distance on c
            • hard-negative:top-k scored on c/bottom-k distance on c
          • score-based yields superior to distance-based

    • CRL

      • final weighted loss:$L = \alpha L_{crl}+(1-\alpha)L_{ce}$,$\alpha=\eta\Omega_{imbalance}$
      • class imbalance measure $\Omega$:more weighting is assigned to more imbalanced labels
      • form
        • triplet loss:类内+类间
        • contrastive loss:类内
        • modelling the distribution relationship of positive and negative pairs:没看懂
  4. 总结

    In short, it just plugs a per-batch, changing definition of the minority classes into off-the-shelf metric learning; not much substance.

    Bottom line: big data — CE; small data — metric learning.

refineDet

发表于 2021-01-08 |

Not related to RefineNet in any way

RefineDet: Single-Shot Refinement Neural Network for Object Detection

  1. 动机

    • inherit the merits of both two-stage and one-stage:accuracy and efficiency
    • single-shot
    • multi-task
    • refineDet
      • anchor refinement module (ARM)
      • object detection module (ODM)
      • transfer connection block (TCB)
  2. 论点

    • three advantages that two-stage superior than one-stage
      • RPN:handle class imbalance
      • two step regress:coarse to refine
      • two-stage features: the RPN task and the regression task each get their own features
    • mimic the two-stage detector’s RPN by first discarding the large number of negative boxes for the classifier, but as parallel multi-task branches rather than two sequential stages
    • decouple the one-stage detector’s objectness and box regression tasks, connecting the two via the transfer block
    • ARM
      • remove negative anchors to reduce search space for the classifier
      • coarsely adjust the locations and sizes of anchors to provide better initialization for regression
    • ODM
      • further improve the regression
      • predict multi labels
    • TCB

      • transfer the features in the ARM to handle the more challenging tasks in the ODM

  3. 方法

    • Transfer Connection Block

      • 没什么新的东西,上采样用了deconv,conv-relu,element-wise add

    • Two-Step Cascaded Regression

      • fisrt step ARM prediction
        • for each cell,for each predefined anchor boxes,predict 4 offsets and 2 scores
        • obtain refined anchor boxes
      • second step ODM prediction
        • with justified feature map,with refined anchor boxes
        • generate accurate boxes offset to refined boxes and multi-class scores,c+4
    • Negative Anchor Filtering

      • reject well-classified negative anchors
      • if the negative confidence is larger than 0.99,discard it in training the ODM
      • the ODM receives all predicted positives plus the mined hard negatives (filtering sketch below)
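
A tiny sketch of that filtering rule (illustrative tensor shapes; the real implementation also feeds the refined anchor coordinates and TCB features into the ODM):

```python
# Sketch (PyTorch) of negative anchor filtering: ARM anchors whose background score
# exceeds 0.99 are dropped before computing the ODM targets and losses.
import torch

def filter_refined_anchors(arm_cls, refined_anchors, theta=0.99):
    # arm_cls: [N, 2] softmax over (background, object); refined_anchors: [N, 4]
    keep = arm_cls[:, 0] <= theta               # discard well-classified easy negatives
    return refined_anchors[keep], keep

arm_cls = torch.softmax(torch.randn(6000, 2), dim=-1)
kept_anchors, keep_mask = filter_refined_anchors(arm_cls, torch.rand(6000, 4))
```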
    • Training and Inference details

      • back:VGG16 & resnet101
        • fc6 & fc7变成两个conv
        • different feature scales
        • L2 norm
        • two extra convolution layers and one extra residual block
      • 4 feature strides
        • each level:1 scale & 3 ratios
        • ensures that different scales of anchors have the same tiling density on the image
      • matching
        • each GT box is matched to the anchor box with the highest overlap score
        • each anchor box is then matched to the best-overlapping gt box with IoU greater than 0.5
        • which effectively turns what would normally be the ignore region into positives
      • Hard Negative Mining
        • select negative anchor boxes with top loss values
        • n & p ratio:3:1
      • Loss Function
        • ARM loss
          • binary class: computed on positive samples only???
          • box: computed on positive samples only
        • ODM loss
          • pass on the refined anchors whose negative confidence is below the threshold
          • multi-class: computed on the balanced positive and negative samples
          • box: computed on positive samples only
        • when there are no positives, both losses are 0: a purely negative image contributes nothing??