Less is More



R-CNN Series

Posted on 2020-01-08

综述

  1. papers

    [R-CNN] R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation

    [SPP] SPP-net: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

    [Fast R-CNN] Fast R-CNN: Fast Region-based Convolutional Network

    [Faster R-CNN] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

    [Mask R-CNN] Mask R-CNN

    [FPN] FPN: Feature Pyramid Networks for Object Detection

    [Cascade R-CNN] Cascade R-CNN: Delving into High Quality Object Detection

R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation

  1. 动机

    • localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data
      • apply CNN to region proposals: R-CNN represents ‘Regions with CNN features’
      • supervised pre-training
  2. 论点

    • model as a regression problem: not fare well in practice
    • build a sliding-window detector: have to maintain high spatial resolution
    • what we do: our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs
    • conventional solution to training a large CNN is ‘using unsupervised pre-training, followed by supervised fine-tuning’
    • what we do: ‘supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL)’
    • we also demonstrate: a simple bounding box regression method significantly reduces mislocalizations
    • R-CNN operates on regions: it is natural to extend it to the task of semantic segmentation
  3. 要素

    • category-independent region proposals
    • a large convolutional neural network that extracts a fixed-length feature vector from each region
    • a set of class-specific linear SVMs

  4. 方法

    • Region proposals: we use selective search

    • Feature extraction: we use Krizhevsky CNN, 227*227 RGB input, 5 convs, 2 fcs, 4096 output

      • we first dilate the tight bounding box (padding=16)
      • then warp the bounding box to the required size (anisotropic scaling)

    • Test-time detection:

      • we score each extracted feature vector using the SVM trained for each class
      • we apply a greedy non-maximum suppression (for each class independently)
      • the boxes that survive NMS are kept as the detections for that class (a minimal NMS sketch is given below)
      • (they are then refined by bounding-box regression)
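      A minimal sketch of the per-class greedy NMS step in plain numpy; the (x1, y1, x2, y2) box layout and the helper itself are illustrative assumptions, not the paper's code (0.3 is a typical per-class threshold):

      import numpy as np

      def nms(boxes, scores, iou_thresh=0.3):
          """Greedy non-maximum suppression for one class.
          boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) SVM scores."""
          x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
          areas = (x2 - x1 + 1) * (y2 - y1 + 1)
          order = scores.argsort()[::-1]          # highest score first
          keep = []
          while order.size > 0:
              i = order[0]
              keep.append(i)
              # IoU of the top box with all remaining boxes
              xx1 = np.maximum(x1[i], x1[order[1:]])
              yy1 = np.maximum(y1[i], y1[order[1:]])
              xx2 = np.minimum(x2[i], x2[order[1:]])
              yy2 = np.minimum(y2[i], y2[order[1:]])
              w = np.maximum(0.0, xx2 - xx1 + 1)
              h = np.maximum(0.0, yy2 - yy1 + 1)
              iou = w * h / (areas[i] + areas[order[1:]] - w * h)
              order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
          return keep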
    • Supervised pre-training: pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations

    • Domain-specific fine-tuning:

      • continue SGD training of the CNN using only warped region proposals from VOC
      • replace the 1000-way classification layer with a randomly initialized 21-way layer (20 VOC classes plus background)
      • class label: all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives, else negatives
      • 1/10th of the initial pre-training rate
      • uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128
    • Object category classifiers:

      • considering a binary classifier for a specific class
      • class label: take IoU overlap threshold <0.3 as negatives, take only regions tightly enclosing the object as positives
      • take the ground-truth bounding boxes for each class as positives
    • unexplained:

      • the positive and negative examples are defined differently in CNN fine-tuning versus SVM training

        The CNN overfits easily and needs a large amount of training data, so during CNN fine-tuning the constraint on the bounding-box location is loose (any proposal with IoU greater than 0.5 is labeled positive). The SVMs are suited to training with few samples, so the IoU requirement on their training data is much stricter: a proposal is labeled with an object class only when its bounding box encloses the whole object (essentially only the ground-truth boxes).

      • it’s necessary to train detection classifiers rather than simply use outputs of the fine-tuned CNN

        The previous answer also explains why an SVM is still used even though the CNN head is already a classifier: with the positive/negative definitions above, the softmax output of the fine-tuned CNN gives lower accuracy than the SVMs.

  5. 分析

    • learned features:

      • compute the units’ activations on a large set of held-out region proposals
      • sort from the highest to low
      • perform non-maximum suppression
      • display the top-scoring regions

    • Ablation studies:

      • without fine-tuning: features from fc7 generalize worse than features from fc6, indicating that most of the CNN’s representational power comes from its convolutional layers
      • with fine-tuning: The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, suggests that pool features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them
    • Detection error analysis:

      • more of our errors result from poor localization rather than confusion
      • CNN features are much more discriminative than HOG
      • Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification (the crude IoU-thresholded, binary labels cannot express how good or bad a localization is)
    • Bounding box regression:

      • a linear regression model use the pool5 features for a selective search region proposal as input
      • the outputs are a scale and a translation in the x and y directions (the standard box-regression targets)
      • training samples: the proposals assigned to this class whose overlap with the ground truth is greater than 0.6
    • Semantic segmentation:

      • three strategies for computing features:
        • ‘full’ ignores the region’s shape; two regions with different shapes might have very similar bounding boxes (the information is insufficient)
        • ‘fg’ slightly outperforms full, indicating that the masked region shape provides a stronger signal
        • ‘full+fg’ achieves the best result, indicating that the context provided by the full features is highly informative even given the fg features (both shape and context matter)

SPP-net: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

  1. 动机:

    • propose a new pooling strategy, “spatial pyramid pooling”
    • can generate a fixed-length representation regardless of image size/scale
    • also robust to object deformations
  2. 论点:

    • existing CNNs require a fixed-size input
      • reduce accuracy for sub-images of an arbitrary size/scale (need cropping/warping)
      • cropped region lost content, while warped content generates unwanted distortion
      • overlooks the issues involving scales
    • convolutional layers do not require a fixed image size, while the fully-connected layers need a fixed-size/length input by their definition
    • by introducing the SPP layer
      • between the last convolutional layer and the first fully-connected layer
      • pools the features and generates fixed-length outputs
    • Spatial pyramid pooling
      • partitions the image into divisions from finer to coarser levels, and aggregates local features in them
      • generates fixed-length output
      • uses multi-level spatial bins (robust to object deformations)
      • can run at variable scales
      • also allows varying sizes or scales during training:
        • train the network with different input size at different epoch
        • increases scale-invariance
        • reduces over-fitting
      • in object detection
        • run the convolutional layers only once on the entire image
        • then extract features by SPP-net on the feature maps
        • speedup
        • accuracy
  3. 方法:

      • Convolutional Layers and Feature Maps

        • the outputs of the convolutional layers are known as feature maps
        • feature maps involve not only the strength of the responses (the strength of activation), but also their spatial positions (the receptive field)

      • The Spatial Pyramid Pooling Layer

        • it can maintain spatial information by pooling in local spatial bins
        • the spatial bins have sizes proportional to the image size(k-level: 1*1, 2*2, …, k*k)
        • we can resize the input image to any scale, which is important for the accuracy
        • the coarsest pyramid level has a single bin that covers the entire image, which is in fact a “global pooling” operation

        • for a feature map of size $a×a$ and a pyramid level with $n×n$ bins, each bin is pooled with a window of size $\lceil a/n \rceil$ and a stride of $\lfloor a/n \rfloor$ (a small sketch follows)
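        A minimal numpy sketch of the pooling just described; the pyramid levels (1, 2, 3, 6) are the configuration reported in the paper, the helper itself is an illustrative assumption:

        import numpy as np

        def spp_level(fmap, n):
            """Max-pool a (C, a, a) feature map into n*n bins.
            Window = ceil(a/n), stride = floor(a/n), as in the SPP formulation above."""
            C, a, _ = fmap.shape
            win, stride = int(np.ceil(a / n)), int(np.floor(a / n))
            out = np.zeros((C, n, n))
            for i in range(n):
                for j in range(n):
                    r0, c0 = i * stride, j * stride
                    out[:, i, j] = fmap[:, r0:r0 + win, c0:c0 + win].max(axis=(1, 2))
            return out

        def spp(fmap, levels=(1, 2, 3, 6)):
            # concatenate all levels into one fixed-length vector, regardless of a
            return np.concatenate([spp_level(fmap, n).reshape(-1) for n in levels])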

    • Training the Network

      • Single-size training: fixed-size input (224×224) cropped from images, with cropping used for data augmentation
      • Multi-size training: rather than cropping, we resize the aforementioned 224×224 region to 180×180, then train two fixed-size networks that share parameters, alternating between the two sizes epoch by epoch
  4. 分析

    • 50 bins vs. 30 bins: the gain of multi-level pooling is not simply due to more parameters, it is because the multi-level pooling is robust to the variance in object deformations and spatial layout
      • multi-size vs. single-size: multi results are more or less better than the single-size version
      • full vs. crop: shows the importance of maintaining the complete content
  5. SPP-NET FOR OBJECT DETECTION

    • We extract the feature maps from the entire image only once

    • we apply the spatial pyramid pooling on each candidate window of the feature maps

    • These representations are provided to the fully-connected layers of the network

    • SVM samples: We use the ground-truth windows to generate the positive samples, use the samples with IOU<30% as the negative samples

    • multi-scale feature extraction:

      • We resize the image at {480, 576, 688, 864, 1200}, and compute the feature maps of conv5 for each scale.
        • we choose a single scale s ∈ S such that the scaled candidate window has a number of pixels closest to 224×224.
        • And we use the corresponding feature map to compute the feature for this window
        • this is roughly equivalent to resizing the window to 224×224
    • fine-tuning:

      • Since our features are pooled from the conv5 feature maps from windows of any sizes
        • for simplicity we only fine-tune the fully-connected layers
    • Mapping a Window to Feature Maps

      • we project the corner point of a window onto a pixel in the feature maps, such that this corner point in the image domain is closest to the center of the receptive field of that feature map pixel.

        That is, determine the two corner points of the window on the original image (top-left and bottom-right) and map each to a point on the feature map, such that the center of the receptive field of the mapped point $(x^{'}, y^{'})$ is as close as possible to $(x, y)$ on the original image (a small sketch follows).

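      A small sketch of this corner projection, assuming a total stride S of 16 and the floor/ceil offsets described in the SPP-net appendix:

      import math

      def project_window(x1, y1, x2, y2, S=16):
          """Map an image-domain window to feature-map coordinates:
          left/top corners use floor(x/S)+1, right/bottom corners use ceil(x/S)-1."""
          fx1 = math.floor(x1 / S) + 1
          fy1 = math.floor(y1 / S) + 1
          fx2 = math.ceil(x2 / S) - 1
          fy2 = math.ceil(y2 / S) - 1
          return fx1, fy1, fx2, fy2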
Fast R-CNN: Fast Region-based Convolutional Network

  1. 动机

    • improve training and testing speed
    • increase detection accuracy
  2. 论点

    • current approaches train models in multi-stage pipelines that are slow and inelegant
      • R-CNN & SPPnet: CNN+SVM+bounding-box regression
      • disk storage: features are written to disk
      • SPPnet: can only fine-tune the fc layers, which limits the accuracy of very deep networks
    • task complexity:
      • numerous candidate proposals
      • rough localization proposals must be refined
    • We propose:
      • a single-stage training algorithm
      • multi-task: jointly learns to classify object proposals and refine their spatial locations
  3. 要素

    • input: an entire image and a set of object proposals
    • convs
    • a region of interest (RoI) pooling layer: extracts a fixed-length feature vector from the feature map
    • fcs that finally branch into two sibling output layers
    • multi-outputs:

      • one produces softmax probability over K+1 classes
      • one outputs four bounding-box regression offsets per class

  4. 方法

    • RoI pooling

      • an RoI is a rectangular window inside a conv feature map, which can be defined by (r, c, h, w)
      • the RoI pooling layer converts the features inside any valid RoI into a small feature map with a fixed size H × W
      • it is a special case of SPPnet when there is only one pyramid level (pooling window size = h/H * w/W)
    • Initializing from pre-trained networks

      • the last max pooling layer is replaced by a RoI pooling layer
      • the last fully connected layer and softmax is replaced by the two sibling layers + respective heads (softmax & regressor)
      • modified to take two inputs
    • Fine-tuning for detection

      • why SPPnet is unable to update weights below the spatial pyramid pooling layer:

        • the original paper attributes it to the feature vectors coming from images of different sizes, which is not the main reason

        • each feature vector's receptive field on the original image is usually very large (close to the whole image), so the forward pass is expensive

        • and the forward-pass results cannot be reused across different images (when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained)

      • We propose:

        • takes advantage of feature sharing

        • mini-batches are sampled hierarchically: N images and R/N RoIs from each image

        • RoIs from the same image share computation and memory in the forward and backward passes

        • jointly optimize the two tasks

          each RoI is labeled with a ground-truth class $u$ and a ground-truth bounding-box regression target $v$

          the network outputs are K+1 probability $p=(p_0,…p_k)$ and K b-box regression offsets $t^k=(t_x^k, t_y^k, t_w^k,t_h^k)$

          $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda[u \ge 1]L_{loc}(t^u, v)$

          $L_{cls}(p, u) = -\log p_u$

          $L_{loc}(t^u, v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L_1}(t_i^u - v_i)$, with $\text{smooth}_{L_1}(x) = 0.5x^2$ if $|x| < 1$ and $|x| - 0.5$ otherwise

          the authors note that the smooth L1 form is less sensitive to outliers than an L2 loss (a short code sketch follows this block)

        • class label: take $IoU\geq0.5$ as a foreground object, take negatives with $IoU \in [0.1,0.5)$

          The lower threshold of 0.1 appears to act as a heuristic for hard example mining
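        A short numpy sketch of the smooth L1 regression loss described above (illustrative, not the original implementation):

        import numpy as np

        def smooth_l1(x):
            """Elementwise smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
            absx = np.abs(x)
            return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

        def fast_rcnn_loc_loss(t_u, v):
            # sum of smooth L1 over the 4 regression targets (x, y, w, h)
            return smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()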

    • Truncated SVD for faster detection

      • Large fully connected layers are easily accelerated by compressing them with truncated SVD: $W \approx U \Sigma_t V^T$
        • the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them
        • the first layer uses the weight matrix $\Sigma_t V^T$ (and no biases)
        • the second uses $U$ (with the original biases)
        • (a small factorization sketch follows)
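      A small numpy sketch of the factorization: one fc layer with weight matrix W is split into two layers using the top-t singular values (t is a compression parameter chosen by the user):

      import numpy as np

      def split_fc_with_svd(W, b, t):
          """Factor one fc layer y = W x + b into two: y ~= U_t (Sigma_t V_t^T x) + b.
          W: (u, v) weight matrix, b: (u,) bias, t: number of singular values kept."""
          U, S, Vt = np.linalg.svd(W, full_matrices=False)
          W1 = np.diag(S[:t]) @ Vt[:t]   # first layer: Sigma_t V_t^T, no biases
          W2 = U[:, :t]                  # second layer: U_t, keeps the original biases
          return W1, W2, b

      # the parameter count drops from u*v to t*(u + v) when t is small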

  5. 分析

    • Fast R-CNN vs. SPPnet: even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP
    • Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP
    • deep vs. small networks:
      • for very deep networks fine-tuning the conv layers is important
      • in the smaller networks (S and M) we find that conv1 is generic and task independent
      • all Fast R-CNN results in this paper using models L fine-tune layers conv3_1 and up
      • all experiments with models S and M fine-tune layers conv2 and up
    • multi-task training vs. stage-wise: it has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet)
    • single-scale vs. multi-scale:
      • single-scale detection performs almost as well as multi-scale detection
      • deep ConvNets are adept at directly learning scale invariance
      • single-scale processing offers the best tradeoff between speed and accuracy, thus we choose single-scale
    • softmax vs. SVM:
      • “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches
      • softmax introduces competition, while SVMs are one-vs-rest

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

  1. 动机

    • shares the convolutional features
    • merge the system using the concept of “attention” mechanisms
    • sharing convolutions across proposals —-> across tasks
    • translation-Invariant & scale/ratio-Invariant
  2. 论点

    • proposals are now the test-time computational bottleneck in state-of-the-art detection systems
    • the region proposal methods are generally implemented on the CPU
    • we observe that the convolutional feature maps used by region-based detectors, like Fast R- CNN, can also be used for generating region proposals
  3. 要素

    • RPN: On top of the convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid
    • anchor: serves as references at multiple scales and aspect ratios
    • unify RPN and Fast R-CNN detector: we propose a training scheme that alternately fine-tuning the region proposal task and the object detection task

  4. 方法

    4.1 Region Proposal Networks

    • This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively)

    • conv: an n × n sliding window

    • feature: 256-d for ZF(5 convs backbone) and 512-d for VGG(13 convs backbone)

    • two sibling fully-connected layers + respective output layer

    • anchors

      • predict multiple region proposals: denoted as k
      • the reg head has 4k outputs, the cls head has 2k outputs
      • the k proposals are parameterized relative to k reference boxes, the anchors (a generation sketch follows this list)
      • an anchor box is centered at the sliding window in question, and is associated with a scale and aspect ratio
      • for a convolutional feature map of a size W × H , that is WHk anchors in total
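      A small numpy sketch of anchor generation; the scales (128, 256, 512) and aspect ratios (0.5, 1, 2) follow the paper's default configuration, the helper itself is an illustration:

      import numpy as np

      def generate_anchors(feat_h, feat_w, stride=16,
                           scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
          """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2),
          centered on each sliding-window position of the feature map."""
          base = []
          for s in scales:
              for r in ratios:
                  w = s * np.sqrt(1.0 / r)   # keep area ~ s^2 while varying aspect ratio
                  h = s * np.sqrt(r)
                  base.append([-w / 2, -h / 2, w / 2, h / 2])
          base = np.array(base)                          # (k, 4) base anchors
          ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
          centers = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4) * stride
          return (centers + base).reshape(-1, 4)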
    • class label

      • positives1: the anchors with the highest IoU with a ground-truth box
      • positives2: the anchors that has an IoU higher than 0.7 with any ground-truth box
      • negatives: non-positive anchors if their IoU is lower than 0.3 for all ground-truth boxes
      • the left: do not contribute
      • ignored: all cross-boundary anchors
    • Loss function

      • similar multi-task loss as fast-RCNN, with a normalization term

      • with $x,y,w,h$ denoting the box’s center coordinates and its width and height, the regression branch outputs $t_i$: $t_x = (x - x_a)/w_a$, $t_y = (y - y_a)/h_a$, $t_w = \log(w/w_a)$, $t_h = \log(h/h_a)$, where $(x_a, y_a, w_a, h_a)$ is the corresponding anchor (an encoding sketch follows this block)

    • mini-batch: sampled the positive and negative anchors from a single image with the ratio of 1:1
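      A short numpy sketch of this box encoding, with boxes and anchors given as (x1, y1, x2, y2); an illustrative helper, not the original code:

      import numpy as np

      def encode_boxes(boxes, anchors):
          """Return t = (tx, ty, tw, th) per the parameterization above."""
          def to_cxcywh(b):
              w, h = b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]
              return b[:, 0] + 0.5 * w, b[:, 1] + 0.5 * h, w, h
          x, y, w, h = to_cxcywh(boxes)
          xa, ya, wa, ha = to_cxcywh(anchors)
          return np.stack([(x - xa) / wa, (y - ya) / ha,
                           np.log(w / wa), np.log(h / ha)], axis=1)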

      4.2 the unified network

    • Alternating training

      • ImageNet-pre-trained model, fine-tuning end-to-end for the region proposal task
      • ImageNet-pre-trained model, using the RPN proposals, fine-tuning end-to-end for the detection task
      • fixed detection network convs, fine-tuning the unique layers for region proposal
      • fixed detection network convs, fine-tuning the unique layers for detection
    • Approximate joint training

      • multi-task loss
      • approximate

      4.3 at training time

    • the total stride is 16 (input size / feature map size)

    • for a typical 1000 × 600 image, there will be roughly 20000 (60*40*9) anchors in total
    • we ignore all cross-boundary anchors, there will be about 6000 anchors per image left for training

      4.4 at testing time

    • we use NMS(iou_thresh=0.7), that leaves 2000 proposals per image

    • then we use the top-N ranked proposal regions for detection
  5. 分析

Mask R-CNN

  1. 动机

    • instance segmentation:
      • detects objects while simultaneously generating instance mask
      • note that this is no longer just object detection
    • easy to generalize to other tasks:
      • instance segmentation
      • bounding-box object detection
      • person keypoint detection
  2. 论点

    • challenging:
      • requires the correct detection of objects
      • requires precisely segmentation of instances
    • a simple, flexible, and fast system can surpass all
      • adding a branch for predicting segmentation on Faster-RCNN
      • in parallel with the existing branch for classification and regression
      • the mask branch is a small FCN applied to each RoI
    • Faster R-CNN was not designed for pixel-to-pixel alignment
    • we propose RoIAlign to preserve exact spatial locations
    • FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification
    • we predict a binary mask for each class independently, decouple mask(mask branch) and class(cls branch)

    • other combining methods are multi-stage

    • our method is based on parallel prediction
    • FCIS also run the system in parallel but exhibits systematic errors on overlapping instances and creates spurious edges
    • segmentation-first strategies attempt to cut the pixels of the same category into different instances
    • Mask R-CNN is based on an instance-first strategy
  3. 要素

    • a mask branch with $Km^2$-dims outputs for each RoI, m denotes the resolution, K denotes the number of classes
    • a per-pixel sigmoid with binary cross-entropy is key for good instance segmentation results: $L_{mask} = [y>0]\frac{1}{m^2}\sum \text{BCE}$
    • RoI features that are well aligned to the per-pixel input
  4. 方法

    • RoIAlign

      • Quantizations in RoIPool: (1) RoI to feature map $[x/16]$; (2) feature map to spatial bins $[a/b]$; $[]$ denotes roundings
      • These quantizations introduce misalignments
      • We use bilinear interpolation to avoid quantization

        • sample several points in the spatial bins
        • computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map
        • aggregate the results of sampling points (using max or average)
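        A minimal numpy sketch of the bilinear sampling + aggregation inside one RoIAlign bin (the 2x2 sampling grid and the average aggregation are assumptions):

        import numpy as np

        def bilinear_sample(fmap, y, x):
            """Bilinearly interpolate a (H, W) feature map at a continuous (y, x)."""
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
            ly, lx = y - y0, x - x0
            return (fmap[y0, x0] * (1 - ly) * (1 - lx) + fmap[y0, x1] * (1 - ly) * lx +
                    fmap[y1, x0] * ly * (1 - lx) + fmap[y1, x1] * ly * lx)

        def roi_align_bin(fmap, y0, x0, y1, x1, samples=2):
            """Average samples^2 regularly spaced bilinear samples inside one bin
            (no rounding of the bin boundaries, unlike RoIPool)."""
            ys = y0 + (np.arange(samples) + 0.5) * (y1 - y0) / samples
            xs = x0 + (np.arange(samples) + 0.5) * (x1 - x0) / samples
            return np.mean([bilinear_sample(fmap, y, x) for y in ys for x in xs])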

    • Architecture

      • backbone: using a ResNet-FPN backbone for feature extraction gives excellent gains in both accuracy and speed
      • head: use previous heads in ResNet/FPN(res5 contained in head/backbone)

    • Implementation Details

      • positives: RoIs with IoU at least 0.5, otherwise negative
      • loss: $L_{mask}$ (average binary cross-entropy) is defined only on positive RoIs
      • mini-batch: 2 images, N RoIs
      • at training time: parallel computation for 3 branches
      • at test time:
        • serial computation
        • proposals -> box prediction -> NMS -> run mask branch on the highest scoring 100 detection boxes
        • it speeds up inference and improves accuracy
        • the $28*28$ floating-number mask output is resized to the RoI size, and binarized at a threshold of 0.5
  5. 分析

    • on overlapping instances: FCIS+++ exhibits systematic artifacts
    • architecture: it benefits from deeper networks (50 vs. 101) and advanced designs including FPN and ResNeXt
    • FCN vs. MLP for mask branch
    • Human Pose Estimation
      • We model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types
      • the training target is a one-hot $m*m$ binary mask where only a single pixel is labeled as foreground
      • use the cross-entropy loss over the $m^2$ locations
      • We found that a relatively high resolution output ($56*56$, compared to $28*28$ for masks) is required for keypoint-level localization accuracy

FPN: Feature Pyramid Networks for Object Detection

  1. 动机

    • for object detection in multi-scale
    • struct feature pyramids with marginal extra cost
    • practical and accurate
    • leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales
  2. 论点

    • single scale offers a good trade-off between accuracy and speed while multi-scale still performs better, especially for small objects
    • featurized image pyramids form the basis solution for multi-scale
    • ConvNets are proved robust to variance in scale and thus facilitate recognition from features computed on a single input scale
    • SSD uses the natural feature hierarchy generated by the ConvNet, which introduces large semantic gaps caused by the different depths
      • high-level features are low-resolution but semantically strong
      • low-level features have lower-level semantics, but their activations are more accurately localized as they are subsampled fewer times
    • thus we propose FPN:

      • combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections
      • has rich semantics at all levels
      • built from a single scale
      • can be easily extended to mask proposals
      • can be trained end-to- end with all scales

    • similar top-down architectures make predictions only on the finest resolution map, whereas FPN predicts at all levels

  3. 要素

    • takes a single-scale image of an arbitrary size as input
    • outputs proportionally sized feature maps at multiple levels
    • structure
      • a bottom-up pathway: the feed-forward computation of the backbone ConvNet
      • a top-down pathway and lateral connection:
        • upsampling the spatially coarser, but semantically stronger, feature maps from higher pyramid levels
        • then enhance with features from the bottom-up pathway via lateral connections
        • a $3*3$ conv is appended on each merged map to reduce the aliasing effect of upsampling
        • classifiers/regressors are shared among all levels, so all levels use fixed 256-channel convs
        • upsampling uses nearest neighbor interpolation
        • low-level features undergo a $1*1$ conv to reduce channel dimensions
        • the merge operation is an element-wise addition (a small sketch follows this list)
    • adopt the method in RPN & Fast-RCNN for demonstration
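    A small numpy sketch of one top-down merge step (nearest-neighbor upsampling, a 1x1 projection to 256 channels, element-wise addition); the shapes and the einsum-based "1x1 conv" are illustrative assumptions:

    import numpy as np

    def upsample2x_nearest(x):
        """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
        return x.repeat(2, axis=1).repeat(2, axis=2)

    def fpn_merge(top_down, lateral, proj_1x1):
        """One top-down step: upsample the coarser 256-channel map, project the
        bottom-up map to 256 channels with a 1x1 conv (a per-pixel matmul here), add.
        top_down: (256, H, W); lateral: (C, 2H, 2W); proj_1x1: (256, C)."""
        lateral_256 = np.einsum("oc,chw->ohw", proj_1x1, lateral)
        return upsample2x_nearest(top_down) + lateral_256

    # after the sum, a 3x3 conv (not shown) reduces the aliasing effect of upsampling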
  4. 方法

    • RPN

      • original design:
        • backbone Convs -> single-scale feature map -> dense 3×3 sliding windows -> head ($3*3$ convs + 2 sibling $1*1$ conv branches)
        • for regressor: multi-scale anchors(e.g. 3 scales 3 ratios -> 9 anchors)
      • new design:
        • adapt FPN -> multi-scale feature map -> sharing heads
        • for regressor: set single-scale anchor for each level respectively (e.g. 5 level 3 ratios -> 15 anchors)
      • sharing heads:
        • vs. not sharing: similar accuracy
        • indicates all levels of FPN share similar semantic levels (contrasted with naturally feature hierarchy of CNNs)
    • Fast R-CNN

      • original design: take the ROI feature map from the output of last conv layer

      • new design: take the specific level of ROI feature map based on ROI area

        • with a $w*h$ RoI on the input image, the RoI is assigned to level $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$, where $k_0$ refers to the target level on which an RoI with $w×h=224^2$ should be mapped (a small sketch follows this list)

        • the smaller the ROI area, the lower the level k, the finer the resolution of the feature map
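      A one-line sketch of the level assignment above; the clamping to the available levels P2-P5 is an added assumption:

      import math

      def fpn_roi_level(w, h, k0=4, k_min=2, k_max=5):
          """Assign an RoI of size w*h (measured on the input image) to pyramid level k."""
          k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
          return max(k_min, min(k_max, k))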

  5. 分析

    • RPN
      • use or not FPN: boost on small objects
      • use or not top-down pathway: semantic gaps
      • use or not lateral connection: locations
      • use or not multi-levels feature maps:
        • using P2 alone leads to more anchors
        • more anchors are not sufficient to improve accuracy
    • Fast R-CNN
      • using P2 alone is marginally worse than that of using all pyramid levels
      • we argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales
    • Faster R-CNN

      • sharing features improves accuracy by a small margin
      • but reduces the testing time
    • Segmentation Proposals

      • use a fully convolutional setup for both training and inference
      • apply a small 5×5 MLP to predict 14×14 masks

Derived application:

  1. 动机
    • 3D volume detection and segmentation
    • ROI / full scan
    • LUNA16:lung nodules size evaluation
  2. 论点
    • variety among nodules & similarity among non-nodules
  3. 方法
    • use overlapping sliding windows
    • use focal loss to improve the classification results
    • use IoU loss to improve the mask results
    • use heavy augmentation

Cascade R-CNN: Delving into High Quality Object Detection

  1. 动机

    • a detector trained with a low IoU threshold usually produces noisy detections (the low-quality-box issue)
    • but simply raising the IoU threshold does not work either
      • positives become scarce, which leads to overfitting
      • inference-time mismatch: training sees only high-quality boxes, but at test time boxes of every quality appear
    • we propose Cascade R-CNN
      • multi-stage object detection architecture
      • consists of a sequence of detectors trained with increasing IoU thresholds
      • trained stage by stage
    • surpass all single-model on COCO
  2. 论点

    • object detection has two main tasks
      • recognition problem: foreground/background & object class
      • localization problem: bounding box
      • loose requirement for positives
        • a low IoU threshold (0.5) is required to define positives/negatives: loose
        • noisy bounding boxes: close false positives
    • quality
      • the IoU of a box with the ground truth defines the quality of that hypothesis
      • the IoU threshold used to train a detector defines the quality of that detector
      • a detector's quality and the quality of its input proposals are coupled: a single detector is optimal only for one specific quality level of hypotheses
    • Cascade R-CNN
      • multi-stage extension of R-CNN
      • sequentially more selective against close false positives
  3. 方法

    • formulation

      • first stage:
        • proposal network H0
        • applied to entire image
      • second stage
        • region-of-interest detection sub-network H1 (detection head)
        • run on proposals
      • C and B denote the classification score and the bounding box regression
      • we focus on modeling the second stage
    • bounding box regression (targeting regression quality)

      • an image patch $x$
      • a bounding box $b = (b_x, b_y, b_w, b_h)$
      • use a regressor $f(x,b)$ to fit the target $g$
        • use L1 loss $L_{loc}(f(x_i,b_i), g_i)$
        • computed on relative offsets with std normalization
        • invariant to scale and location
        • results in only minor adjustments on $b$, which is why the regression loss is usually much smaller than the cls loss
      • Iterative BBox
        • a single regression step is not sufficient for accurate localization
        • so N identical regression heads are simply chained
        • but the same problem remains: a regressor is performance-optimal only for proposals of one specific quality level, while the box distribution changes drastically after every iteration
        • so there is basically no further gain after two iterations
    • detection quality (targeting classification quality)

      • an image patch x
      • M foreground classes and 1 background
      • use a classifier $h(x)$ to learn the target class label among M+1
        • use CE $L_{cls}(h(x_i),y_i)$
        • the class label is determined by the IoU threshold: if the image patch overlaps a ground-truth box with IoU above the threshold, the patch takes that box's class label, otherwise it is background
        • the IoU thresh defines the quality of a detector
      • challenging
        • if the threshold is raised, the positives contain less background (high-quality foreground), but there are far fewer of them
        • if the threshold is lowered, there are more foreground samples, but their content is more diversified and it becomes harder to reject close false positives
        • so a classifier faces different problems under different IoU thresholds, and at inference time it is very difficult to perform uniformly well over all IoU levels
      • Integral loss
        • train several classifiers targeted at different IoU levels, then ensemble them at inference
        • this still does not solve the overfitting of the high-IoU classifier caused by its small sample size, and at inference the high-quality classifier still has to process proposals of every (mostly low) quality
    • Cascade R-CNN

      • Cascaded Bounding Box Regression
        • cascade specialized regressors
        • differs from Iterative BBox
          • Iterative BBox is a post-processing trick: a single regressor optimized on boxes at the 0.5 quality level is applied to the inference proposals over and over
          • Cascade R-CNN is a resampling method: several different regressors are cascaded, and training and testing follow the same procedure on the same distributions
      • Cascaded Detection

        • resampling manner
          • keep the positives of the previous stage
          • while dropping some outliers
        • implemented by gradually raising the IoU threshold of each stage (a small sketch of the cascade follows)
        • the loss is still the classification loss over all proposals plus the regression loss over the proposals labeled as foreground

        • outliers

        • proposal quality
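      A rough sketch of the cascade at the interface level; the per-stage IoU thresholds (0.5, 0.6, 0.7) are the values used in the paper, while the `heads` objects and their classify/regress methods are hypothetical placeholders:

      def cascade_rcnn_forward(features, proposals, heads, iou_thresholds=(0.5, 0.6, 0.7)):
          """Each stage classifies and regresses the boxes produced by the previous
          stage; training labels at stage t are assigned with iou_thresholds[t].
          `heads` is a list of per-stage objects with classify()/regress() (hypothetical)."""
          boxes = proposals
          outputs = []
          for head, thr in zip(heads, iou_thresholds):
              scores = head.classify(features, boxes)   # C_t in the formulation above
              boxes = head.regress(features, boxes)     # B_t feeds the next stage (resampling)
              outputs.append((scores, boxes, thr))
          return outputs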

CNN Visualization Series

Posted on 2020-01-03

1. Visualizing and Understanding Convolutional Networks

  1. 动机

    • give insight into the internal operation and behavior of the complex models
    • then one can design better models
    • reveal which parts of the scene in image are important for classification
    • explore the generalization ability of the model to other datasets
  2. 论点

    • most visualizing methods are limited to the 1st layer, where projections to pixel space are possible

    • we propose a method that can project high-level feature maps back to the pixel space

    • some methods give some insight into invariances based on a simple quadratic approximation
    • our approach, by contrast, provides a non-parametric view of invariance

    • some methods associate patches that are responsible for strong activations at higher layers
    • in our approach they are not just crops of input images, but rather top-down projections that reveal structures
  3. 方法

    3.1 Deconvnet: use deconvnet to project the feature activations back to the input pixel space

    • To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer
    • Then successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity of the layer beneath until the input pixel space is reached
    • 【Unpooling】using switches
    • 【Rectification】the convnet uses relu to ensure always positive, same for back projection
    • 【Filtering】transposed conv
    • Due to unpooling, the reconstruction obtained from a single activation resembles a small piece of the original input image

      3.2 CNN model

      3.3 visualization among layers

    • for each layer, we take the top9 strongest activation across the validation data

    • calculate the back projection separately
    • alongside we provide the corresponding image patches

      3.4 visualization during training

    • randomly choose several strongest activation of a given feature map

    • lower layers converge fast, higher layers conversely

      3.5 visualizing the Feature Invariance

    • 5 sample images being translated, rotated and scaled by varying degrees

    • Small transformations have a dramatic effect in the first layer of the model (compare columns c2 & c3)
    • the network is stable to translations and scalings, but not invariant to rotation

      3.6 architecture selection

    • old architecture (stride 4, filter size 11): The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. The 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. (This relates to the checkerboard artifacts of deconv mentioned in the V-Net notes; a larger stride makes them more obvious.)

    • smaller stride & smaller filter(stride2, filterSize7):more coverage of mid frequencies, no aliasing, no dead feature

      3.7 occlusion sensitivity

    • occluding a key part of the object drastically degrades the classification result

    • in the second and third examples the strongest responses are on the text and the face respectively, yet these are not the parts that determine the true class
  4. 理解

    4.1 Overall, the features the network learns are discriminative ones: the visualizations show that the extracted features ignore the background and pick out the key information. Layers 1 and 2 learn mostly low-level features such as color and edges; layer 3 becomes somewhat more complex and learns texture features, e.g. grid-like patterns; layer 4 captures more class-specific information such as dog heads; layer 5 shows stronger invariance and can carry whole-object information.

    4.2 During training the feature maps show sudden jumps. The lower layers barely change and converge easily, while the higher-layer features change a lot. This explains why the lower layers hardly change from the start of training (gradient attenuation); the higher layers change little during the first iterations but change strongly around iterations 40-50. So do not rush to judge the result while training; make sure the network has converged first.

    4.3 Under image translation, scaling and rotation, the first layer is very sensitive to the changes, whereas by layer 7 the response varies almost linearly.

2. Striving for Simplicity: The All Convolutional Net

  1. 动机

    • traditional pipeline: alternating convolution and max-pooling layers followed by a small number of fully connected layers
    • questioning the necessity of different components in the pipeline, max-pooling layer to be specified
    • to analyze the network we introduce a new variant of the “deconvolution approach” for visualizing features
  2. 论点

    • two major improving directions based on traditional pipeline
      • using more complex activation functions
      • building multiple conv modules
    • we study the most simple architecture we could conceive
      • a homogeneous network solely consisting of convolutional layers
      • without the need for complicated activation functions, any response normalization or max-pooling
      • reaches state of the art performance
  3. 方法

    • replace the pooling layers with standard convolutional layers with stride two

      • the spatial dimensionality reduction performed by pooling makes covering larger parts of the input in higher layers possible
      • which is crucial for achieving good performance with CNNs
    • make use of small convolutional layers

      • greatly reduce the number of parameters in a network and thus serve as a form of regularization
      • if the topmost convolutional layer covers a portion of the image large enough to recognize its content then fully connected layers can also be replaced by simple 1-by-1 convolutions
    • the overall architecture consists only of convolutional layers with rectified linear non-linearities and an averaging + softmax layer to produce predictions

      • Strided-CNN-C: pooling is removed and the stride of the preceding conv is increased
      • ConvPool-CNN-C: a dense conv is placed, to show the effect of increasing parameters
      • All-CNN-C: max-pooling is replaced by conv
      • when pooling is replaced by an additional convolution layer with stride 2, performance stabilizes and even improves
      • small 3 × 3 convolutions stacked after each other seem to be enough to achieve the best performance
    • guided backpropagation

      • the paper above proposed ‘deconvnet’, which we observe that it does not always work well without max-pooling layers
      • For higher layers of our network the method of Zeiler and Fergus fails to produce sharp, recognizable image structure
      • Our architecture does not include max-pooling, thus we can ’deconvolve’ without switches, i.e. not conditioning on an input image
      • In order to obtain a reconstruction conditioned on an input image from our network without pooling layers, we combine the simple backward pass and the deconvnet

      • Interestingly, the very first layer of the network does not learn the usual Gabor filters, but higher layers do

3. Cam: Learning Deep Features for Discriminative Localization

  1. 动机

    • we found that CNNs actually behave as object detectors despite no supervision on the location
    • this ability is lost when fully-connected layers are used for classification
    • we found that the advantages of global average pooling layers are beyond simply acting as a regularizer
    • it makes it easily to localize the discriminative image regions despite not being trained for them
  2. 论点

    2.1 Weakly-supervised object localization

    • previous methods are not trained end-to-end and require multiple forward passes
    • Our approach is trained end-to-end and can localize objects in a single forward pass

      2.2 Visualizing CNNs

    • previous methods only analyze the convolutional layers, ignoring the fully connected thereby painting an incomplete picture of the full story

    • we are able to understand our network from the beginning to the end
  3. 方法

    3.1 Class Activation Mapping

    • A class activation map for a particular category indicates the discriminative image regions used by the network to identify that category
    • the network architecture: convs—-gap—-fc+softmax
    • we can identify the importance of the image regions by projecting back the weights of the output layer on to the convolutional feature maps
    • by simply upsampling the class activation map to the size of the input image we can identify the image regions most relevant to the particular category
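    A minimal numpy sketch of computing a CAM from the last conv activations and the fc weights that follow GAP (the array shapes are assumptions for illustration):

    import numpy as np

    def class_activation_map(conv_feats, fc_weights, class_idx):
        """CAM for one class: weighted sum of the last conv feature maps.
        conv_feats: (C, H, W) activations before GAP; fc_weights: (num_classes, C)."""
        cam = np.tensordot(fc_weights[class_idx], conv_feats, axes=(0, 0))  # (H, W)
        cam = np.maximum(cam, 0)           # keep only positive evidence
        cam = cam / (cam.max() + 1e-8)     # normalize to [0, 1]
        # upsampling to the input image size (e.g. with an image resize) is omitted here
        return cam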

      3.2 Weakly-supervised Object Localization

    • our technique does not adversely impact the classification performance when learning to localize

    • we found that the localization ability of the networks improved when the last convolutional layer before GAP had a higher spatial resolution, thus we removed several convolutional layers from the origin networks
    • overall we find that the classification performance is largely preserved for our GAP networks compared with the origin fc structure
    • our CAM approach significantly outperforms the backpropagation approach on generating bounding box
    • low mapping resolution prevents the network from obtaining accurate localizations

      3.3 Visualizing Class-Specific Units

    • the convolutional units of various layers of CNNs act as visual concept detectors, identifying low-level concepts like textures or materials up to high-level concepts like objects or scenes

    • Deeper into the network, the units become increasingly discriminative
    • given the fully-connected layers in many networks, it can be difficult to identify the importance of different units for identifying different categories

4. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

5. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

6. 综述

  1. GAP

    First, a recap of GAP: it was proposed in NiN mainly to address the problems of fully connected layers, which have too many parameters, are hard to train and overfit easily.

    For most classification tasks, performance does not drop just because GAP reduces the features to C values: the GAP output sits on top of a stack of non-linear layers, so these C features can be viewed as strong features selected from the k×k×C activations through non-linear transformations.

  2. heatmap

    step1. After the image passes through the conv network we obtain the final feature maps; the classification weights ($w_{k,n}$) attached to them in the fully connected layer differ across classes.

    step2. Obtain the weight of each feature map via back-propagation.

    step3. Multiply each feature map by its weight, take the mean over the channel dimension, apply ReLU, and normalize.

    • ReLU keeps only the values where $wx$ is positive: positive responses are the features that support the current class, whereas negative responses pull down $\sum wx$, i.e. lower the confidence for that class.
    • Without ReLU, the localization map would show the features of all classes instead of only one class.

      step4. Resize the map to the original image size so it can be overlaid for display.

  3. CAM

    CAM requires the model to use a GAP layer.

    CAM selects the node with the largest softmax value and back-propagates from it, taking the gradients at the GAP layer as the feature-map weights; each GAP node corresponds to one feature map.

  4. Grad-CAM

    Grad-CAM does not constrain the model structure.

    Grad-CAM selects the node with the largest softmax value and back-propagates from it, computes the gradients with respect to the last convolutional layer, and uses the mean gradient of each feature map as that feature map's weight.
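    A minimal numpy sketch of the Grad-CAM weighting described above, assuming the activations and their gradients have already been extracted from the network:

    import numpy as np

    def grad_cam(conv_feats, grads):
        """Grad-CAM: channel weights are the spatial mean of the gradients of the
        top class score w.r.t. the last conv layer; conv_feats/grads: (C, H, W)."""
        weights = grads.mean(axis=(1, 2))                     # (C,)
        cam = np.tensordot(weights, conv_feats, axes=(0, 0))  # weighted sum over channels
        cam = np.maximum(cam, 0)
        return cam / (cam.max() + 1e-8)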

NiN: network in network

Posted on 2019-12-25

Network In Network

  1. 动机

    • enhance model discriminability (obtain better feature descriptions): propose mlpconv
    • less prone to overfitting:propose global average pooling
  2. 论点

    comparison 1:

    • conventional CNN uses linear filter, which implicitly makes the assumption that the latent concepts are linearly separable.
    • traditional CNN is a stack of [linear filters + nonlinear activation / linear + maxpooling + nonlinear]: this raises the question of the ordering of activation and pooling. For average pooling the two orders give different results; activating first loses some information, so one should pool first and then activate. For max pooling the two orders give the same result, but pooling (downsampling) first reduces the amount of activation computation, so the conclusion is again to pool first and then activate. In practice, however, many networks still put the ReLU right after the conv with pooling afterwards, which is more interpretable as cross feature map pooling
    • mlpconv layer can be regarded as a highly nonlinear function(filter-fc-activation-fc-activation-fc-activation…)

      comparison 2:

    • maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space【QUESTION HERE】

    • mlpconv layer is a universal function approximator instead of a convex function approximator

      comparison 3:

    • fully connected layers are prone to overfitting and heavily depend on dropout regularization

    • global average pooling is more meaningful and interpretable, moreover it itself is a structural regularizer【QUESTION HERE】
  3. 方法

    • use mlpconv layer to replace conventional GLM(linear filters)
    • use global average pooling to replace traditional fully connected layers
    • the overall structure is a stack of mlpconv layers, on top of which lie the global average pooling and the objective cost layer
    • Sub-sampling layers can be added in between the mlpconv as in CNN
    • dropout is applied on the outputs of all but the last mlpconv layers for regularization
    • another regularizer applied is weight decay

  4. 细节

    • preprocessing:global contrast normalization and ZCA whitening

    • augmentation:translation and horizontal flipping

    • GAP for conventional CNN:CNN+FC+DROPOUT < CNN+GAP < CNN+FC

      • gap is effective as a regularizer
      • slightly worse than the dropout regularizer result for some reason
    • confidence maps

      • explicitly enforce feature maps in the last mlpconv layer of NIN to be confidence maps of the categories by means of global average pooling: NiN feeds the GAP output directly into the output layer, so the feature map of each class can be roughly regarded as that class's confidence map
      • the strongest activations appear roughly at the same region of the object in the original image: high-response regions of the feature maps correspond closely to the object regions in the original image
      • this motivates the possibility of performing object detection via NIN
    • architecture: in practice the micro multilayer perceptron is implemented with 1x1 convs; the added perceptron acts like a parameterized pooling over multiple feature maps, whose result is passed on and pooled (with parameters) again in the next layer. This cascaded cross-channel parametric pooling gives the network a more complex representational ability (a small sketch follows)
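    A minimal Keras sketch of one mlpconv block followed by GAP; the layer sizes and the CIFAR-like input shape are assumptions for illustration, not the paper's exact configuration:

    from tensorflow.keras import layers, models

    def mlpconv_block(x, filters):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 1, activation="relu")(x)   # 1x1 convs act as the "micro MLP"
        x = layers.Conv2D(filters, 1, activation="relu")(x)
        return x

    inputs = layers.Input((32, 32, 3))
    x = mlpconv_block(inputs, 192)
    x = layers.MaxPooling2D()(x)
    x = mlpconv_block(x, 10)                  # last mlpconv outputs one map per class
    x = layers.GlobalAveragePooling2D()(x)    # each pooled value ~ a class confidence
    outputs = layers.Softmax()(x)
    model = models.Model(inputs, outputs)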

  5. 总结

    1. mlpconv:stronger local reception unit
    2. gap:regularizer & bring confidence maps

unet & vnet

Posted on 2019-12-05

U-NET: Convolutional Networks for Biomedical Image Segmentation

  1. 动机:

    • train from very few images
    • outperforms more precisely on segmentation tasks
    • fast
  2. 要素:

    • encoder: a contracting path to capture context
    • decoder: a symmetric expanding path that enables precise localization
    • implementation: pooling operators & upsampling operators
  3. 论点:

    • when we talk about deep convolutional networks:

      • larger and deeper
      • millions of parameters
      • millions of training samples
    • representative method: run a sliding window and predict each pixel's label based on its patch

    • drawbacks:
      • redundant computation on overlapping patches
      • big patch: more max-pooling layers, which reduce the localization accuracy
      • small patch: less involvement of context
    • mentioned but not further explained: cascaded structures
  4. 方法:

    1. In order to localize, high resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.

      Understanding: deep feature maps have large receptive fields and carry global information; upsampling them provides localization information, while the feature maps brought over laterally carry local detail. The two 3*3 conv blocks then integrate the two kinds of information and output a more precise representation.

    2. In the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers.

      Understanding: probably exactly what it says; keeping a large number of feature channels in the upsampling layers means keeping more context information.

    3. we use excessive data augmentation.

  5. 细节:

    1. contracting path:

      • typical CNN:blocks of [2 3*3 unpadded convs+ReLU+2*2 stride2 maxpooling]
      • At each downsampling step we double the number of feature channels
    2. expansive path:

      • upsampling:
        • 2*2 up-conv that halves the number of channels
        • concatenation with the correspondingly cropped feature map from the contracting path
        • 2 [3x3 conv+ReLU]
      • final layer:use a 1*1 conv to map the feature vectors to class vectors

    3. train:

      • prefer larger input size to larger batch size
      • sgd with 0.99 momentum so that the previously seen samples dominate the optimization
    4. loss:softmax & cross entropy

    5. unbalanced weight:

      • pre-compute the weight map based on the frequency of pixels of each class
      • add the weight for a certain element to force the learning emphasis:e.g. the small separation borders
      • initialization:Gaussian distribution
    6. data augmentation:

      • deformations
      • “Drop-out layers at the end of the contracting path perform further implicit data augmentation”
    7. metrics:“warping error”, the “Rand error” and the “pixel error” for EM segmentation challenge and average IOU for ISBI cell tracking challenge

    8. prediction:

      With the architecture as described in the paper, the input and output sizes differ: border information is lost through the valid-padding convolutions.

      So if we want to predict the segmentation inside the yellow box, a larger crop (the blue box) must be fed as input; at the image border, the missing context is filled in by mirroring.

      The chain of reasoning:

      • because of memory limits, the input is not the whole image but an image patch,
      • to keep context and make the prediction more accurate, a border of context is added around each patch (the region of actual interest is the yellow box)
      • during training, the conv layers use valid padding to avoid computation on the overlap
      • therefore the network's output covers exactly the part we actually care about
      • if the training images are not that huge, one can instead feed the whole image, use same padding, and predict the full mask directly (a small sketch of same-padding U-Net blocks follows)
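      A minimal Keras sketch of a same-padding U-Net down/up block pair; the channel counts and layer choices are assumptions, not the paper's exact configuration:

      from tensorflow.keras import layers

      def down_block(x, filters):
          x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
          x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
          skip = x
          return layers.MaxPooling2D(2)(x), skip

      def up_block(x, skip, filters):
          x = layers.UpSampling2D(2)(x)
          x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)  # halve channels
          x = layers.Concatenate()([skip, x])   # lateral connection from the contracting path
          x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
          x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
          return x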
  6. 总结:

    • train from very few images —-> data augmentation
    • fast —-> full convolution layers
    • precise —-> global?

V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

  1. 动机

    • entire 3D volume
    • imbalance between the number of foreground and background voxels:dice coefficient
    • limited data:apply random non-linear transformations and histogram matching
    • fast and accurate
  2. 论点:

    • early approaches based on patches
      • local context
      • challenging modalities
      • efficiency issues
    • fully convolutional networks
      • 2D so far
    • imbalance issue:the anatomy of interest occupies only a very small region of the scan thus predictions are strongly biased towards the background.
      • re-weighting
      • dice coefficient claims to be better that above
  3. 要素:

    • a compression path
    • a decompression path

  4. 方法:

    • compression:

      • adding residual connections speeds up convergence
      • resolution is reduced by [2*2*2 conv with stride 2], which, compared with max pooling, saves the memory that back-propagation would need for the switch maps
      • double the number of feature maps as we reduce their resolution
      • PReLU
    • decompression:

      • horizontal connections:1) gather fine grained detail that would be otherwise lost in the compression path 2) improve the convergence time
      • residual conv: blocks of [5*5*5 conv with stride 1] that extract features and keep enlarging the receptive field

      • up-conv:expands the spatial support of the lower resolution feature maps

      • last layer:run [1*1*1conv with 2 channel+softmax] to obtain the voxelwise probabilistic segmentations of the foreground and background

    • dice coefficient: $D = \frac{2\sum_i p_i g_i}{\sum_i p_i^2 + \sum_i g_i^2} \in [0,1]$, which we aim to maximise; $p_i$ and $g_i$ are voxels of two binary volumes (prediction and ground truth)

    • train:

      • input fix size 128 × 128 × 64 voxels and a spatial resolution of 1 × 1 × 1.5 millimeters
      • each mini-batch contains 2 volumes
      • online augmentation:
        • randomly deformation
        • vary the intensity distribution: the intensity histogram of a randomly chosen training volume is used as the target distribution of the current sample (histogram matching)
      • used a momentum of 0.99 and an initial learning rate of 0.0001 which decreases by one order of magnitude every 25K iterations
    • metrics:

      • Dice coefficient
      • Hausdorff distance of the predicted delineation to the ground truth annotation
      • the score obtained on the challenge

dice loss & focal loss

  1. CE & BCE

    • CE: categorical_crossentropy, computed over all classes, which are mutually exclusive

      $CE = -\sum_i y_i \log f_i(x)$, where $x$ is the input sample, $y_i$ is the ground-truth label of class $i$, and $f_i(x)$ is the corresponding model output.

      For classification, $y_i$ is one-hot and $f(x)$ is a 1-D vector; the result is a single number.

    • BCE: binary_crossentropy, computed per class

      $BCE_i = -[y_i \log f_i(x) + (1-y_i)\log(1-f_i(x))]$, where $i$ is the class index; the result is a vector of dimension $n_{class}$.

      Taking the mean over classes then gives a single number as the loss of one sample.

    • batch loss: average the loss over all samples in the batch.

    • From the formulas, CE is usually applied after a softmax; increasing one softmax output necessarily decreases the others, so the loss only needs to check that the predicted value of the correct class is pushed up. BCE is usually used with sigmoids; classes do not suppress each other, so the loss must both push up the probability of the true class and push down the probabilities of the classes the sample does not belong to (for softmax this second term is implicit, which is why CE does not need it).

    • Typical settings:
      • binary classification: a single output node, $f(x) \in (0,1)$; use sigmoid + BCE as the output configuration.
      • single-label multi-class: use softmax + CE; BCE also works.
      • multi-label multi-class: the outputs for the labels are independent, so the common configuration is sigmoid + BCE.
    • For segmentation, each output channel is the prediction map of one class; this can be treated either as a single-label multi-class problem across channels (softmax + CE) or as an independent per-channel binary classification (sigmoid + BCE). The U-Net paper used a weighted softmax + CE; the V-Net paper used dice loss.
  2. re-weighting(WCE)

    Built on CE/BCE, different samples are given different weights.

    The U-Net paper mentions building a weight map for the different classes based on pixel frequency.

    One implementation: given the per-class weight map, turn the CE into a weighted average.

    Another implementation: feed a per-sample weight map as an additional network input and multiply it onto the loss map when computing the CE (a small sketch follows).
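    A small Keras-backend sketch of this idea, taking the weight map as a closure argument; the function name and the normalization by the weight sum are assumptions:

    from tensorflow.keras import backend as K

    def weighted_bce(weight_map):
        """Pixel-wise weighted binary cross-entropy; weight_map has the same
        spatial shape as y_true (e.g. higher weights on thin separation borders)."""
        def loss(y_true, y_pred):
            y_pred = K.clip(y_pred, 1e-7, 1 - 1e-7)
            bce = -(y_true * K.log(y_pred) + (1 - y_true) * K.log(1 - y_pred))
            return K.sum(weight_map * bce) / (K.sum(weight_map) + 1e-7)
        return loss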

  3. focal loss

    Focal loss was proposed in object detection to deal with the severe imbalance between positive and negative samples.

    It is also a form of re-weighting, but compared with static re-weighting the weight of a hard sample is inferred by the network itself, through the weighting terms $\alpha$ and $(1-p_t)^{\gamma}$ in $FL(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$:

    • For the imbalance between classes (negatives usually far outnumber positives), the $\alpha$ term balances the weights of positive and negative samples.

    • For mining hard samples within a class, the $(1-p_t)^{\gamma}$ term adjusts the weights of easy versus hard samples: samples whose predicted probability is close to the true label (easy samples) have their weight decay faster, while samples with less accurate predictions (hard samples) keep a higher weight.

      Because a segmentation network outputs single/multi-channel maps, using focal loss directly gives very large loss values, so in practice:

      1. it is usually combined (with weights) with other losses
      2. the sum can be replaced by a mean
      3. it is not recommended at the very start of training; it can be added at a later stage to refine the model
      4. the formula contains a log, which can produce NaN, so the values inside the log must be clipped

      import tensorflow as tf
      import keras.backend as K

      def focal_loss(y_true, y_pred):
          gamma = 2.
          alpha = 0.25
          # score = alpha * y_true * K.pow(1 - y_pred, gamma) * K.log(y_pred) +           # this works when y_true==1
          #         (1 - alpha) * (1 - y_true) * K.pow(y_pred, gamma) * K.log(1 - y_pred) # this works when y_true==0
          pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
          pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
          # avoid nan
          pt_1 = K.clip(pt_1, 1e-3, .999)
          pt_0 = K.clip(pt_0, 1e-3, .999)
          score = -K.sum(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) - \
                  K.sum((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))
          return score
  4. dice loss

    Dice measures the similarity of two masks: $Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}$

    • the numerator is the TP term, so it only attends to the foreground
    • the denominator can use $|A|$ (element-wise sum) or the squared form $|A|^2$

    • gradient: "dice loss can sometimes be unreliable. For a softmax/log loss the gradient is, roughly speaking, $p - t$ (t the target, p the prediction), whereas for dice loss it has the form $2t^2/(p+t)^2$;

      if p and t are both very small, the gradient can change violently and training becomes difficult."

      In more detail: for a cross-entropy style loss $L=-(1-|t-p|)\log(1-|t-p|)$, differentiating gives $\frac{\partial L}{\partial p} \approx -\log(1-|t-p|)$, which behaves roughly like $t-p$; this gradient is clearly bounded, so optimization with a cross-entropy loss is fairly stable. The two forms of the dice term (plain and squared), $L=\frac{2pt}{p+t}$ or $L=\frac{2pt}{p^2+t^2}$, have derivatives $\frac{\partial L}{\partial p} = \frac{2t^2}{(p+t)^2}$ or $\frac{2t(t^2-p^2)}{(p^2+t^2)^2}$ respectively; the expressions are more involved, and when p and t are both small the gradient can become very large, which may make training unstable and the loss curve erratic.

The definition in the V-Net paper differs slightly in the denominator (see below). Benefits of the smoothing term:

  • avoids division by zero when both masks are empty
  • reduces overfitting

    import keras.backend as K

    def dice_coef(y_true, y_pred, smooth=1.):
        intersection = K.sum(y_true * y_pred, axis=[1, 2, 3])
        union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3])
        return K.mean((2. * intersection + smooth) / (union + smooth), axis=0)

    def dice_coef_loss(y_true, y_pred):
        return 1 - dice_coef(y_true, y_pred, smooth=1.)
  5. iou loss

    A derivative of dice loss, intersection over union:

    compared with dice, the denominator is the union $|A|+|B|-|A \cap B|$, i.e. one intersection less than the plain sum.

    • "Like dice loss, the training curve of IoU loss may be unreliable and training may be unstable; the curves are often less intuitive than those of a softmax loss, which usually decays smoothly."
  6. boundary loss

    Dice loss and IoU loss learn by matching region areas; we can also supervise the network by how well the boundaries match.

    Only the boundary pixels are evaluated: points that coincide with the GT boundary contribute 0, and points that do not are penalized according to their distance from the boundary.

  7. Hausdorff distance

    Measures the similarity between two point sets. Denote the point sets $A=\{a_1, a_2, …, a_p\}$ and $B=\{b_1, b_2, …, b_q\}$:

    $hd(A,B) = \max_{a \in A} \min_{b \in B} \|a-b\|$, $\quad HD(A,B) = \max(hd(A,B), hd(B,A))$

    HD(A,B) is the basic, bidirectional form of the Hausdorff distance.

    hd(A,B) is the one-sided distance: for every point in A, find its nearest point in B, then take the maximum of these nearest-neighbor distances over all a-b pairs.

    HD(A,B) takes the larger of the two one-sided distances and describes the largest degree of mismatch between the two point sets (a small scipy example follows).
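    A small example using scipy's directed Hausdorff distance; the random point sets are placeholders for, e.g., predicted and ground-truth surface voxels:

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    A = np.random.rand(100, 3)            # point set A
    B = np.random.rand(120, 3)            # point set B
    hd_ab = directed_hausdorff(A, B)[0]   # one-sided hd(A, B)
    hd_ba = directed_hausdorff(B, A)[0]   # one-sided hd(B, A)
    HD = max(hd_ab, hd_ba)                # symmetric Hausdorff distance HD(A, B)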

  8. mix loss

    • BCE + dice loss: helps when the data is reasonably balanced, but when the data is extremely imbalanced the cross-entropy term becomes far smaller than the dice term after a few epochs and the benefit is lost.
    • focal loss + dice loss: beware of the difference in magnitude between the two terms.
  9. MSE

    Keypoint detection sometimes also uses a segmentation framework; the ground truth is then a Gaussian heatmap. Dice is designed for binarized masks, so MSE can be used in this case.

  10. ohnm

    online hard negative mining (hard-example mining)

  11. Tversky loss

    A weighted dice loss. Dice weighs FP (precision, false positives) and FN (recall, false negatives) equally, but in medical images lesions are far fewer than background, so training tends to converge to high precision but low recall; Tversky loss shifts the loss towards penalizing FN more:

    import keras.backend as K

    def tversky_loss(y_true, y_pred):
        y_true_pos = K.flatten(y_true)
        y_pred_pos = K.flatten(y_pred)
        # TP
        true_pos = K.sum(y_true_pos * y_pred_pos)
        # FN
        false_neg = K.sum(y_true_pos * (1 - y_pred_pos))
        # FP
        false_pos = K.sum((1 - y_true_pos) * y_pred_pos)
        alpha = 0.7
        return 1 - (true_pos + K.epsilon()) / (true_pos + alpha * false_neg + (1 - alpha) * false_pos + K.epsilon())
  12. Lovasz hinge & Lovasz-Softmax loss

    A derivative of the IoU loss: the Jaccard loss only applies to discrete predictions, while the network outputs continuous values, and without some hyper-parameter to binarize the outputs it is not differentiable; the Lovász extension provides a differentiable surrogate.

    Not studied in depth here; just use the reference implementation: https://github.com/bermanmaxim/LovaszSoftmax

Some additional notes

  1. Improvements:

    1. dropout, batch normalization: according to the paper, U-Net only adds dropout after the deepest conv layers and BN is not mentioned, whereas the common-sense setup of following every conv layer with BN can replace dropout and usually improves performance.
    2. UpSampling2D, Conv2DTranspose: U-Net uses upsampling while V-Net uses deconv, but "DeConv will produce image with checkerboard effect, which can be revised by upsample and conv" (Reference).
    3. valid padding vs. same padding: the U-Net paper feeds image patches and uses valid padding during feature extraction, which loses border information.
    4. network blocks: U-Net's conv block is a pair of 3*3 convs and V-Net's differs slightly; blocks worth trying include ResNet/ResNeXt, DenseNet, DeepLab, etc.
    5. pretrained encoder: use an existing backbone as the feature extraction path and load pre-trained weights (Reference) to speed up training and reduce overfitting.
    6. SE blocks (Reference): re-weight the features of each channel.
    7. attention mechanisms:
    8. the summary of mainstream structural improvements cited from nn-UNet: "Just to provide some prominent examples: variations of encoder-decoder style architectures with skip connections, first introduced by the U-Net [12], include the introduction of residual connections [9], dense connections [6], attention mechanisms [10], additional loss layers [5], feature recalibration [13], and others [11]."
  2. Derivatives:

    1. TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation

    2. nnU-Net: Breaking the Spell on Successful Medical Image Segmentation

TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation

  1. 动机:

    • neural network initialized with pre-trained weights usually shows better performance than those trained from scratch on a small dataset.
    • keep the encoder-decoder structure while fully exploiting the advantages of transfer learning
  2. 论点:

    • load pretrained weights
    • pre-train on a huge dataset
  3. 方法:

    • replace the original encoder with VGG11 and load weights pre-trained on ImageNet
    • the deepest input (after maxpooling5): use a single conv of 512 channels that serves as the bottleneck central part of the network

    • upsampling is replaced with transposed convolutions (convTranspose)

    • loss function:IOU + BCE:

    • inference:choose a threshold 0.3, all pixel values below which are set to be zero

  4. 结论:

    1. converge faster
    2. better IOU

nnU-Net: Breaking the Spell on Successful Medical Image Segmentation

  1. 动机

    • many proposed methods fail to generalize: for segmentation there has been little architectural breakthrough in the years since U-Net; the more the structure is modified, the easier it is to overfit
    • relies on just a simple U-Net architecture embedded in a robust training scheme
    • automate necessary adaptations such as preprocessing, the exact patch size, batch size, and inference settings based on the properties of a given dataset: most of the real gains come from understanding the data and choosing suitable preprocessing, training methods and tricks for it
  2. 论点

    • the diversity and individual peculiarities of imaging datasets make it difficult to generalize
    • prominent modifications focus on architectural modifications, merely brushing over all the other hyperparameters
    • we propose: use the plain U-Net: nnUNet (no-new-Net)
      • a formalism for automatic adaptation to new datasets
      • automatically designs and executes a network training pipeline
      • without any manual fine-tuning
  3. 要素

    a segmentation task: $f_{\theta}(X) = \hat Y$, in this paper we seek for a $g(X,Y)=\theta$.

    First we distinguish two type of hyperparameters:

    • static params:in this case the network architecture and a robust training scheme
    • dynamic params:those that need to be changed in dependence of $X$ and $Y$

    Second we define $g$, a set of heuristic rules covering the entire process of the task:

    • preprocessing: resampling and normalization
    • training: loss, optimizer settings, data augmentation
    • inference: patch-based strategy, test-time-augmentation ensembling and model ensembling, etc.
    • post-processing: e.g. enforcing a single connected component
  4. 方法

    1. Preprocessing

      • Image Normalization:
        • CT: $x_{norm} = (x - \mu_{fg}) / \sigma_{fg}$, where the foreground statistics $\mu_{fg}, \sigma_{fg}$ are computed over the $[0.05, 0.95]$ quantile range of the foreground intensities (a small numpy sketch follows this preprocessing block)
        • not CT: $x_{norm} = (x - \mu) / \sigma$
      • Voxel Spacing:
        • for each axis chooses the median as the target spacing
        • image resampled with third order spline interpolation
        • z-axis using nearest neighbor interpolation if ‘anisotropic spacing’ occurs
        • mask resampled with third order spline interpolation
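
      A small numpy sketch of the normalization rules above (my paraphrase, not nnU-Net's code; the quantile range is as stated above, the epsilon is an assumption):

      import numpy as np

      def normalize_ct(volume, fg_mask):
          # statistics from foreground voxels only, clipped to the [0.05, 0.95] quantile range
          fg = volume[fg_mask]
          lo, hi = np.quantile(fg, [0.05, 0.95])
          fg = np.clip(fg, lo, hi)
          return (np.clip(volume, lo, hi) - fg.mean()) / (fg.std() + 1e-8)

      def normalize_other(volume):
          # non-CT modalities: plain per-volume z-score
          return (volume - volume.mean()) / (volume.std() + 1e-8)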
    2. Training Procedure

      • Network Architecture:

        • 3 independent model:a 2D U-Net, a 3D U-Net and a cascade of two 3D U-Net

        • padded convolutions:to achieve identical output and input shapes

        • instance normalization: "BN suits discriminative models such as image classifiers: it normalizes each batch so the data distribution stays consistent, and a discriminative model's output depends on the overall data distribution. But BN is sensitive to batch size, since the mean and variance are computed per batch; if the batch is too small they do not represent the whole distribution. IN suits generative models such as image style transfer: the generated result depends mainly on a single image instance, so normalizing over the whole batch is inappropriate; using Instance Normalization in style transfer not only speeds up convergence but also keeps the image instances independent of each other."

        • Leaky ReLUs

      • Network Hyperparameters:

        • sets the batch size, patch size and number of pooling operations for each axis based on the memory consumption
        • large patch sizes are favored over large batch sizes
        • pooling along each axis is done until the voxel size=4
        • start num of filters=30, double after each pooling
        • If the selected patch size covers less than 25% of the voxels, train the 3D U-Net cascade on a downsampled version of the training data to keep sufficient context
      • Network Training:

        • five-fold cross-validation
        • One epoch is defined as processing 250 batches
        • loss = dice loss + cross-entropy loss (a small sketch follows this block)
        • Adam(lr=3e-4, decay=3e-5)
        • lr reduction: multiply by 0.2 whenever the EMA of the training loss has not improved for 30 epochs
        • early stop: when the learning rate drops below $10^{-6}$ or 1000 epochs are exceeded
        • data augmentation: elastic deformations, random scaling and random rotations as well as gamma augmentation ($g(x,y)=f(x,y)^{\gamma}$)
        • keep transformations in 2D-plane if ‘anisotropic spacing’ occurs
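
        A minimal Keras-backend sketch of the Dice + cross-entropy training loss mentioned above, for one-hot targets of shape (batch, H, W, classes); my formulation, not nnU-Net's code:

        from tensorflow.keras import backend as K

        def dice_ce_loss(y_true, y_pred):
            eps = K.epsilon()
            axes = (0, 1, 2)
            intersection = K.sum(y_true * y_pred, axis=axes)
            denom = K.sum(y_true, axis=axes) + K.sum(y_pred, axis=axes)
            soft_dice = K.mean((2.0 * intersection + eps) / (denom + eps))
            ce = K.mean(K.categorical_crossentropy(y_true, y_pred))
            return (1.0 - soft_dice) + ce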
      • Inference

        • sliding window with half the patch size: this increases the weight of the predictions close to the center relative to the borders
        • ensemble:
          • U-Net configurations (2D, 3D and cascade)
          • furthermore uses the five models (five-fold cross-validation)
  5. Ablation studies

3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation

  1. 动机

    • learns from sparsely/full annotated volumetric images (user annotates some slices)
    • provides a dense 3D segmentation
  1. 要素

    • 3D operations
    • avoid bottlenecks and use batch normalization for faster convergence
    • on-the-fly elastic deformation
    • train from scratch
  2. 论点

    • neighboring slices show almost the same information
    • many biomedical applications generalize reasonably well because medical images comprise repetitive structures
    • thus we suggest dense-volume-segmentation-network that only requires some annotated 2D slices for training
    • scenarios

      • manually annotate a subset of slices, then train the network to produce a dense segmentation of the same volume
      • use sparsely annotated datasets as the training set, then apply the trained network to produce dense segmentations on new datasets

  3. 方法

    • Network Architecture

      • compression:2*3x3x3 convs(+BN)+relu+2x2x2 maxpooling

      • decompression:2x2x2 upconv+2*3x3x3 convs+relu

      • head:1x1x1 conv

      • concat shortcut connections

      • 【QUESTION】avoid bottlenecks by doubling the number of channels already before max pooling

        My understanding: this channel doubling is in contrast with the original unet, where the two convs of a stage have the same filter number; max pooling then loses part of the information, and since segmentation is a dense prediction the channel count is increased before pooling to reduce that loss.

        What exactly "avoid bottlenecks" means is less obvious.

        The paper says it follows "Rethinking the Inception Architecture for Computer Vision", the well-known Inception V3.

        It probably corresponds to design principle "1. Avoid representational bottlenecks, especially early in the network": from input to output, the feature-map size should shrink gradually while the number of feature maps grows gradually.

      • input:132x132x116 voxel tile
      • output:44x44x28
      • BN:before each ReLU
    • weighted softmax loss function: setting the weights of unlabeled pixels to zero makes it possible to learn from only the labelled ones and, hence, to generalize to the whole volume (is randomly zeroing the loss of some samples always enough to make a network generalize better?)

    • Data

      • manually annotated some orthogonal xy, xz, and yz slices
      • annotation slices were sampled uniformly
    • ran on down-sampled versions of the original resolution by factor of two

    • labels:0: “inside the tubule”; 1: “tubule”; 2: “background”, and 3: “unlabeled”.

    • Training

      • rotation, scaling and gray value augmentation
      • a smooth dense deformation field: random vectors from a normal distribution, B-spline interpolation
      • weighted cross-entropy loss: increase the weights "inside the tubule", reduce the weights for "background", set them to zero for "unlabeled" (a small sketch follows)
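
      A numpy sketch of such a weighted cross-entropy with zero weight for the "unlabeled" class (the weight values are illustrative assumptions, not from the paper):

      import numpy as np

      def weighted_ce(probs, labels, class_weights=(2.0, 1.0, 0.5, 0.0)):
          # probs: (N, C) softmax outputs; labels: (N,) ints
          # class ids as listed above: 0 inside-the-tubule, 1 tubule, 2 background, 3 unlabeled
          w = np.asarray(class_weights)[labels]
          nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-8)
          return (w * nll).sum() / (w.sum() + 1e-8)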

2.5D-UNet: Automatic Segmentation of Vestibular Schwannoma from T2-Weighted MRI by Deep Spatial Attention with Hardness-Weighted Loss

  1. 专业术语

    • Vestibular Schwannoma(VS) tumors:前庭神经鞘瘤
    • through-plane resolution:层厚
    • isotropic resolution:各向同性
    • anisotropic resolutions:各向异性
  2. 动机

    • tumor的精确自动分割

    • challenge

      • low contrast: hardness-weighted Dice loss function
      • small target region:attention module
      • low through-plane resolution:2.5D
  3. 论点

    • segment small structures from large image contexts
      • coarse-to-fine
      • attention map
      • Dice loss
      • our method
        • end-to-end supervision on the learning of attention map
        • voxel-level hardness- weighted Dice loss function
    • CNN
      • 2D CNNs ignore inter-slice correlation
      • 3D CNNs most applied to images with isotropic resolution requiring upsampling
      • to balance the physical receptive field (in terms of mm rather than voxels):memory rise
      • our method
        • high in-plane resolution & low through-plane resolution
        • 2.5D CNN combining 2D and 3D convolutions
        • use inter-slice features
        • more efficient than 3D CNNs
    • 数据
      • T2-weighted MR images of 245 patients with VS tumor
      • high in-plane resolution around 0.4 mm×0.4 mm,512x512
      • slice thickness and inter-slice spacing 1.5 mm,slice number 19 to 118
      • cropped cube size:100 mm×50 mm×50 mm
  4. 方法

    • architecture

      • five levels:L1、L2 use 2D,L3、L4、L5 use 3D
      • After the first two max-pooling layers, which downsample the feature maps only in 2D, the feature maps in L3 and the following levels have a near-isotropic 3D resolution.
      • start channels:16
      • conv block:conv-BN-pReLU
      • add a spatial attention module to each level of the decoder

    • spatial attention module

      • A spatial attention map can be seen as a single-channel image of attention coefficient
      • input:feature map with channel $N_l$
      • conv1+ReLU: channel $N_l/2$
      • conv2+Sigmoid:channel 1,outputs the attention map
      • multiplied the feature map with the attention map
      • a residual connection
      • explicit supervision
        • multi-scale attention loss
        • $L_{attention} = \frac{1}{L} \sum_{L} l(A_l, G_l^f)$
        • $A_l$ is the attention map at level $l$, and $G_l^f$ is the foreground ground-truth mask average-pooled to that level's resolution
    • Voxel-Level Hardness-Weighted Dice Loss

      • automatic hard voxel weighting: $w_i = \lambda \cdot |p_i - g_i| + (1-\lambda)$

      • $\lambda \in [0,1]$ controls the degree of hard voxel weighting (a sketch of the resulting weighted Dice follows below)

      • hardness-weighted Dice loss (HDL) :

      • total loss:
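
      A numpy sketch of how the voxel weights could enter a soft Dice (my reading of the description above, not the authors' released formula):

      import numpy as np

      def hardness_weighted_dice(p, g, lam=0.6, eps=1e-6):
          # p: predicted foreground probabilities, g: binary ground truth, same shape
          w = lam * np.abs(p - g) + (1.0 - lam)
          return 1.0 - (2.0 * np.sum(w * p * g) + eps) / (np.sum(w * (p + g)) + eps)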

Combining analysis of multi-parametric MR images into a convolutional neural network: Precise target delineation for vestibular schwannoma treatment planning

Only the abstract and one figure are available.

  • multi-parametric MR images:T1W、T2W、T1C
  • two-pathway U-Net model
    • kernel 3 × 3 × 1 and 1 × 1 × 3 respectively
    • to extract the in-plane and through-plane features of the anisotropic MR images
  • 结论
    • The proposed two-pathway U-Net model outperformed the single-pathway U-Net model when segmenting VS using anisotropic MR images.
    • multi-input (T1, T2) outperforms single-input

yolo系列

发表于 2019-11-28 |

综述

  1. [yolov1] Yolov1: You Only Look Once: Unified, Real-Time Object Detection
  2. [yolov2] Yolov2: YOLO9000: Better, Faster, Stronger
  3. [yolov3] Yolov3: An Incremental Improvement
  4. [yolov4] YOLOv4: Optimal Speed and Accuracy of Object Detection
  5. [poly-yolo] POLY-YOLO: HIGHER SPEED, MORE PRECISE DETECTION AND INSTANCE SEGMENTATION FOR YOLOV3
  6. [scaled-yolov4] Scaled-YOLOv4: Scaling Cross Stage Partial Network
  7. [yolov7] YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

0. review

  1. review0121: on the yolo loss

    In the Keras versions of the yolo loss I had read, the loss contains a BCE for classification, an l2/mse regression loss, and a regression loss for confidence, where the conf loss is modelled as a plain 0-1 classification problem implemented with BCE.

    In the original yolo loss, however, the objectness target is the IoU between the prediction and the gt; semantically it not only indicates whether the current cell contains an object, but also evaluates the current box prediction.

    • back-propagate the gradient
    • do not back-propagate the gradient

      The IoU is computed from xywh, so an IoU-based loss would send gradients through the box head when back-propagating. scaled_yolov4 truncates this gradient and uses the IoU only as a value, back-propagating only through the confidence branch (a small sketch follows below).

      Not truncating the gradient is also fine; it is equivalent to back-propagating one more IoU-style loss onto xywh.
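
      A TensorFlow sketch of the two options (illustrative; the iou tensor is assumed to be computed elsewhere from the decoded x, y, w, h and the matched gt boxes):

      import tensorflow as tf

      def objectness_loss(iou, obj_logits, stop_iou_gradient=True):
          # stop_iou_gradient=True: use the IoU purely as a target value (scaled_yolov4 style)
          # stop_iou_gradient=False: the target also back-propagates an IoU-like loss into xywh
          target = tf.stop_gradient(iou) if stop_iou_gradient else iou
          return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=obj_logits)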

  2. review1215:main features梳理

    • yolov1
    • yolov2
    • yolov3

      • giou
      • anchor mechanism
        • matching: 1-to-1 matching by max IoU, 1 anchor out of 9 (across all levels)
        • regression: wrap the logits with sigmoid and exp to get the xy and wh offsets
    • yolov4: bag of tricks, CSPDarknet53 + PAN-SPP + yolo head, mosaic, ciou, syncBN, diou_nms

      • anchor mechanism
        • matching: no longer unique; an anchor can match across levels, and any anchor whose IoU exceeds the threshold becomes a positive anchor
        • regression mechanism
    • scaled-yolov4: fully CSP-ized, a bidirectional FPN, and a family of models obtained by scaling up

      • the CSP blocks serve scaling up: with the yolov3 structure a single card could only fit a batch size of 8, flirting with OOM, so this paper CSP-izes the entire structure
      • the GPU-oriented YOLOv4-large family contains YOLOv4-P5, YOLOv4-P6 and YOLOv4-P7
      • anchor mechanism
        • matching: extended further compared with yolov4: across levels and across grid cells; any anchor whose width/height ratio to the gt is within the threshold is positive, and the two grid cells closest to the gt center are also taken as positives
        • regression: redefined, because with the cross-cell assignment the activation range of the offsets becomes [-0.5, 1.5]
      • loss
        • loss_box: giou
        • loss_obj: bce
        • loss_cls: bce

1. Yolov1: You Only Look Once: Unified, Real-Time Object Detection

  1. 动机:
    • end-to-end: 2 stages —-> 1 stage
    • real-time
  2. 论点:

    • past methods: complex pipelines, hard to optimize(trained separately)
      • DPM use a sliding window and a classifier to evaluate an object at various locations
      • R-CNN use region proposal and run classifier on the proposed boxes, then post-processing
    • in this paper: you only look once at an image
      • rebuild the framework as a single regression problem: single stands for you don’t have to run classifiers on each patch
      • straight from image pixels to bounding box coordinates and class probabilities: straight stands for you obtain the bounding box and the classification results side by side, comparing to the previous serial pipeline
  3. advantages:

    • fast & twice the mean average precision of other real-time systems
    • CNN sees the entire image thus encodes contextual information
    • generalize better
  4. disadvantage:

    • accuracy: “ it struggles to precisely localize some objects, especially small ones”
  5. 细节:

    • grid:

      Our system divides the input image into an S × S grid.

      If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

    • prediction:

      Each grid cell predicts B bounding boxes, confidence scores for these boxes , and C conditional class probabilities for each grid

      that is, an $S \times S \times (B \times 5 + C)$ tensor

      • We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
      • We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell so they are also bounded between 0 and 1.
    • at test time:

      We obtain the class-specific confidence for each individual box by multiplying the class probability and the box confidence (written out below):
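
      $$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{truth}_{pred} = \Pr(\text{Class}_i) \cdot \text{IOU}^{truth}_{pred}$$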

    • network:

      the convolutional layers extract features from the image

      while the fully connected layers predict the probabilities and coordinates

    • training:

      activation:use a linear activation function for the final layer and leaky rectified linear activation all the other layers

      optimization: use sum-squared error, although it does not perfectly align with the goal of maximizing average precision

      • it weights the localization error and the classification error equally, remedied with $\lambda_{coord}$

      • it weights grid cells that contain objects and those that do not equally, remedied with $\lambda_{noobj}$

      • it weights large boxes and small boxes equally, remedied by regressing the square roots of h & w instead of h & w directly

      loss:pick the box predictor has the highest current IOU with the ground truth per grid cell

      avoid overfitting: dropout & data augmentation

      • use dropout after the first connected layer

      • introduce random scaling and translations of up to 20% of the original image size for data augmentation

      • randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space for data augmentation

    • inference:

      multiple detections: some objects lie near the border of multiple cells and can be well localized by multiple cells. Non-maximal suppression proves critical, adding 2-3% in mAP.

  6. Limitations:

    • strong spatial constraints:decided by the settings of bounding boxes
    • softmax classification:can only have one class for each grid

      "This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds."

      “ It struggles to generalize to objects in new or unusual aspect ratios or configurations. “

    • coarse bounding box prediction:the architecture has multiple downsampling layers

    • the loss function treats errors the same in small bounding boxes versus large bounding boxes:

      The same error has much greater effect on a small box’s IOU than a big box.

      “Our main source of error is incorrect localizations. “

  7. Comparison:

    • mAP among real-time detectors and Less Than Real-Time detectors:less mAP than fast-rcnn but much faster
    • error analysis between yolo and fast-rcnn:greater localization error and less background false-positive
    • combination analysis:[fast-rcnn+yolo] defeats [fast-rcnn+fast-rcnn] since YOLO makes different kinds of mistakes with fast-rcnn
    • generalizability:RCNN degrades more because the Selective Search is tuned for natural images, change of dataset makes the proposals get worse. YOLO degrades less because it models the size and shape of objects, change of dataset varies less at object level but more at pixel level.

2. Yolov2: YOLO9000: Better, Faster, Stronger

  1. 动机:

    • run at varying sizes:offering an easy tradeoff between speed and accuracy
    • recognize a wide variety of objects :jointly train on object detection and classification, so that the model can predict objects that aren’t labelled in detection data
    • better performance but still fast
  2. 论点:

    • Current object detection datasets are limited compared to classification datasets
      • leverage the classification data to expand the scope of current detection system
      • joint training algorithm making the object detectors working on both detection and classification data
    • Better performance often hinges on larger networks or ensembling multiple models. However we want a more accurate detector that is still fast
    • YOLOv1’s shortcomings
      • more localization errors
      • low recall
  3. 要素:

    1. better

    2. faster

      • backbone
    3. stronger

      • uses labeled detection images to learn to precisely localize objects

      • uses classification images to increase its vocabulary and robustness

  4. 方法:

    1. better:

      1. batch normalization:convergence & regularization

        add batch normalization on all of the convolutional layers

        remove dropout from the model

      2. high resolution classifier:pretrain a hi-res classifier

        first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet

        then fine tune the resulting network on detection

      3. convolutional with anchor boxes:

        YOLOv1 predicts the bounding-box coordinates for each grid cell directly from the final fully-connected layer

        The RPN instead uses the last conv layer to predict bounding-box offsets and confidences relative to prior (anchor) boxes at every feature-map position

        "Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn"

        YOLOv2 removes the fully-connected layers and also uses anchor boxes to regress the bounding boxes

        eliminate one pooling layer to make the network output have higher resolution

        shrink the network input to 416×416 so the feature map has an odd size and thus a single center cell

        predict class and objectness for every anchor box(offset prediction) instead of nothing(direct location&scale prediction)

      4. dimension clustering:

        what we want are priors that lead to good IOU scores, thus the distance metric: $d(box, centroid) = 1 - IOU(box, centroid)$

      5. direct location prediction:

        YOLOv1 suffers from model instability when it predicts the (x, y) box locations directly

        The RPN predicts a (tx, ty) and obtains the (x, y) center coordinates indirectly; because this formulation is unconstrained, any anchor box can end up at any point in the image, so it also takes a long time to stabilize:

        borrow from the RPN: regressing a relative quantity is easier to learn than blindly regressing an absolute location (YOLOv1)

        borrow from YOLOv1: make the prediction cell-based, so the bounding box is confined to a limited region instead of flying all over the image (RPN)

        For each cell, YOLOv2 predicts 5 bounding boxes based on 5 prior anchor sizes; each bounding box has 5 values (a small decode sketch follows this list):

        • $t_x\ \&\ t_y$ regress the box location; a sigmoid bounds them to 0-1, and $b_x=\sigma(t_x)+c_x$, $b_y=\sigma(t_y)+c_y$ (in grid units, then divided by the grid size) give the box center normalized to the original image
        • $t_w\ \&\ t_h$ regress the box scale and are not bounded to 0-1; with the normalized prior sizes $p_w\ \&\ p_h$, $b_w=p_w e^{t_w}$, $b_h=p_h e^{t_h}$ give the box size normalized to the original image
        • $t_o$ regresses the objectness and is bounded to 0-1 by a sigmoid, since $Pr(object)\ \&\ IOU(b,object)$ are both values in 0-1; the IOU can be computed from the four values above, so the objectness can be decoupled
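
        A small numpy sketch of this decoding for a single cell (variable names are mine):

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def decode_yolov2(t, cell_xy, prior_wh, grid_size):
            # t = (tx, ty, tw, th, to); cell_xy = (cx, cy); prior_wh normalized to the image
            bx = (sigmoid(t[0]) + cell_xy[0]) / grid_size
            by = (sigmoid(t[1]) + cell_xy[1]) / grid_size
            bw = prior_wh[0] * np.exp(t[2])
            bh = prior_wh[1] * np.exp(t[3])
            return np.array([bx, by, bw, bh]), sigmoid(t[4])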

      6. fine-grained features:

        motivation: detecting small objects relies on finer-grained features

        cascade: Faster R-CNN and SSD both run their proposal networks on various feature maps of the network to get a range of resolutions

        【QUESTION】YOLOv2 simply adds a passthrough layer from an earlier layer at 26 × 26 resolution:

        later feature map → upsampling

        concatenate with the earlier feature map

        the detector runs on top of this expanded feature map

        predicts an $N \times N \times (3 \times (4+1+80))$ tensor for each scale

      7. multi-scale training:

        the model itself does not constrain the input size: it only uses convolutional and pooling layers, so it can be resized on the fly

        • forces the network to learn to predict well across a variety of input dimensions
        • the same network can predict detections at different resolutions
      8. loss: cited from the later yolov3 paper

        • use sum-of-squared-error loss for the box coordinates (x,y,w,h): the gradient is then $y_{true} - y_{pred}$
        • use logistic regression for the objectness score: it should be 1 if the bounding box prior overlaps a ground-truth object by more than any other bounding box prior
        • if a bounding box prior is not assigned to a ground-truth object it incurs no coordinate or class loss, only objectness loss
        • use binary cross-entropy loss for multilabel classification
  5. faster:

    1. darknet-19:

      YOLOv1 already compared its own backbone against VGG-16: the latter improves mAP but is time-consuming.

      YOLOv2's new backbone has fewer parameters and higher ImageNet accuracy than VGG-16.

    2. training for classification:

      • first train on ImageNet using 224*224
      • then fine-tuning on 448*448
    3. training for detection:

      • remove the last convolutional layer
      • add on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1×1 convolutional layer with the number of outputs we need for detection
      • add a passthrough from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
    4. stronger:

      jointly training: to be filled in later

      • build a hierarchical label tree
      • classification samples use only the cls loss, detection samples use the full detection loss
      • for a correctly classified classification sample, an assumed IoU of .3 is used to compute the objectness loss

3. Yolov3: An Incremental Improvement

  1. 动机:

    nothing like super interesting, just a bunch of small changes that make it better

  2. 方法:

    1. bounding box prediction:

      use anchor boxes and predicts offsets for each bounding box

      use sum of squared error loss for training

      predicts the objectness score for each bounding box using logistic regression

      one ground truth corresponds to one best box and one loss

    2. class prediction:

      use binary cross-entropy loss for multilabel classification

    3. 【NEW】prediction across scales:

      the detector:a few more convolutional layers following the feature map, the last of which predicts a 3-d(for 3 priors) tensor encoding bounding box, objectness, and class predictions

      expanded feature map:upsampling the deeper feature map by 2X and concatenating with the former features

      "With the new multi-scale predictions, YOLOv3 has better performance on small objects and comparatively worse performance on medium and larger size objects"

    4. 【NEW】feature extractor:

      darknet-53 !

    5. training:common skills

4. 一些补充

  1. metrics:mAP

    first proposed for PASCAL VOC; the detector output is a ranked list in which each item contains a box, a confidence and a class

    yolov3 mentions the "COCOs weird average mean AP metric"

    • IoU: the intersection-over-union between a predicted box and the ground truth, also known as the Jaccard index; it is used to judge whether each detection is correct. PASCAL VOC uses a threshold of 0.5 to decide whether a predicted box is a True Positive or a False Positive, while COCO recommends evaluating over several IoU thresholds.

    • confidence: by changing the confidence threshold, a predicted box flips between Positive and Negative.

    • precision & recall: precision = TP / (TP + FP), recall = TP / (TP + FN). Every part of the image we did not predict is a Negative, so counting True Negatives is awkward; we only need the False Negatives, i.e. the objects the model missed.

    • AP: different confidence thresholds yield different precision-recall pairs. To obtain the precision-recall curve, the predictions are first sorted by confidence in descending order; each threshold gives a different ranked output, and recall and precision are computed only over the predictions above that rank. Eleven recall levels are chosen ([0, 0.1, …, 0.9, 1.0]) and AP is defined as the average of the precision at these 11 recalls, which characterizes the whole precision-recall curve (the area under it). The precision at a given recall is obtained by interpolation (a numpy sketch follows this list):

    • mAP: this metric is computed differently in information retrieval and in object detection. For detection, AP is computed per class as above and mAP is the mean of the APs over all classes.
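
    A numpy sketch of the 11-point interpolated AP described above:

    import numpy as np

    def ap_11_point(recalls, precisions):
        # recalls / precisions: arrays over the ranked detections of one class
        ap = 0.0
        for r in np.linspace(0.0, 1.0, 11):
            mask = recalls >= r
            ap += (precisions[mask].max() if mask.any() else 0.0) / 11.0
        return ap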

  2. eval:

    1. yolo_head output: box_xy is the box center, a relative value in (0~1); box_wh is the box width/height, relative in (0~1); box_confidence is the objectness confidence of the box; box_class_probs are the class confidences
    2. yolo_correct_boxes: converts the relative center representation of a box into absolute [y_min, x_min, y_max, x_max] values
    3. yolo_boxes_and_scores: outputs all boxes predicted by the network
    4. yolo_eval: filters by score_threshold and max_boxes, applies per-class NMS, and produces the final output

4. YOLOv4: Optimal Speed and Accuracy of Object Detection

  1. 动机

    • Practical testing the tricks of improving CNN
    • some features
      • work for certain problems/dataset exclusively
      • applicable to the majority of models, tasks, and datasets
      • only increase the training cost [bag-of-freebies]
      • only increase the inference cost by a small amount but can significantly improve the accuracy [bag-of-specials]
    • Optimal Speed and Accuracy
  2. 论点

    • head:
      • predict classes and bounding boxes
      • one-stage head
        • YOLO, SSD, RetinaNet
        • anchor-free:CenterNet, CornerNet, FCOS
      • two-stage head
        • R-CNN series
        • anchor-free:RepPoints
    • neck:
      • collect feature maps from different stages
      • FPN, PAN, BiFPN, NAS-FPN
    • backbone:

      • pre-trained on ImageNet
      • VGG, ResNet, ResNeXt, DenseNet

    • Bag of freebies

      • data augmentation
        • pixel-wise adjustments
          • photometric distortions:brightness, contrast, hue, saturation, and noise
          • geometric distortions:random scaling, cropping, flipping, and rotating
        • object-wise
          • cut:
            • to image:CutOut
            • to featuremaps:DropOut, DropConnect, DropBlock
          • add:MixUp, CutMix, GAN
      • data imbalance for classification
        • two-stage:hard example mining
        • one-stage:focal loss, soft label
      • bounding box regression
        • MSE-regression:treat [x,y,w,h] as independent variables
        • IoU loss:consider the integrity & scale invariant
    • Bag of specials
      • enlarging receptive field:improved SPP, ASPP, RFB
      • introducing attention mechanism
        • channel-wise attention:SE, increase the inference time by about 10%
        • point-wise attention:Spatial Attention Module (SAM), does not affect the speed of inference
      • strengthening feature integration
        • channel-wise level:SFAM
        • point-wise level:ASFF
        • scale-wise level:BiFPN
      • activation function:A good activation function can make the gradient more efficiently propagated
      • post-processing: various NMS variants
  3. 方法

    • choose a backbone —- CSPDarknet53

      • Higher input network size (resolution) – for detecting multiple small-sized objects
      • More conv layers – for a higher receptive field to cover the increased size of input network
      • More parameters – for greater capacity of a model to detect multiple objects of different sizes in a single image
    • add the SPP block over the CSPDarknet53

      • significantly increases the receptive field
      • separates out the most significant context features
      • causes almost no reduction of the network operation speed
    • use PANet as the method of parameter aggregation

      • Modified PAN
      • replace shortcut connection of PAN to concatenation

    • use YOLOv3 (anchor based) head

      • the encoding/decoding method has changed!!!

      • it is explained in the code and the issues (https://github.com/WongKinYiu/ScaledYOLOv4/issues/90#), but not stated explicitly in the paper (a decode sketch follows this block)

        • given the predicted offsets $t_x,t_y,t_w,t_h$

        • xy offsets: relative to each grid center

        • wh ratios: relative to each grid's anchors
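
        A numpy sketch of that decoding (my reading of the linked issue; names are mine):

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def decode_v4(t, grid_xy, anchor_wh, stride):
            # offsets activate to [-0.5, 1.5] around the cell; wh is a squared ratio of the anchor
            xy = (sigmoid(t[:2]) * 2.0 - 0.5 + np.asarray(grid_xy)) * stride
            wh = (sigmoid(t[2:4]) * 2.0) ** 2 * np.asarray(anchor_wh)
            return np.concatenate([xy, wh])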

    • Mosaic data augmentation

      • mixes 4 training images
      • allows detection of objects outside their normal context
      • reduces the need for a large mini-batch size

    • Self-Adversarial Training (SAT) data augmentation

      • 1st stage alters images
      • 2nd stage train on the modified images
    • CmBN:a CBN modified version

    • modified SAM: from spatial-wise attention to point-wise attention

      • SAM here refers to the spatial attention mechanism of "An Empirical Study of Spatial Attention Mechanisms in Deep Networks"
      • another SAM is "Sharpness-Aware Minimization for Efficiently Improving Generalization", Google's sharpness-aware minimization for better generalization; do not confuse the two

  4. 实验

    • Influence of different features on Classifier training

      • Blurring and Swish give no improvement

    • Influence of different features on Detector training

      • IoU threshold, CmBN, cosine annealing scheduler and CIoU give improvements

POLY-YOLO: HIGHER SPEED, MORE PRECISE DETECTION AND INSTANCE SEGMENTATION FOR YOLOV3

  1. 动机

    • yoloV3’s weakness

      • rewritten labels
      • inefficient distribution of anchors
    • light backbone:

      • stairstep upsampling
    • single scale output
    • to extend instance segmentation
      • detect size-independent polygons defined on a polar grid
      • real-time processing
  2. 论点

    • yolov3

      • real-time
      • low precision cmp with RetinaNet, EfficientDet
        • low precision of the detection of big boxes
        • rewriting of labels by each-other due to the coarse resolution
    • this paper's solutions:

      • to fix yolo's precision issue: propose a brand-new feature decoder with a single output tensor that goes into a head with higher resolution
      • multi-scale feature fusion: utilize stairstep upscaling
      • instance segmentation: a bounding polygon defined on a polar grid
    • instance segmentation

      • two-stage:mask-rcnn
      • one-stage:
        • top-down:segmenting this object within a bounding box
        • bottom-up:start with clustering pixels
        • direct methods: need neither a bounding box nor clustered pixels, e.g. PolarMask
    • cmp with PolarMask

      • size-independent: objects at any scale, large or small, can be detected
      • dynamic number of vertices: the number of polygon vertices can vary
    • yolov3 issues

      • rewriting of labels:

        • if two objects fall into the same grid cell, only one of the ground-truth boxes is kept at that scale
        • the smaller the feature map, the larger the grid cell, and the more severe this problem becomes

      • imbalanced distribution of anchors across output scales

        • if the anchors are chosen badly, the anchor scales do not match the feature-map scales
        • most of the boxes will be captured by the middle output layer and the two other layers will be underused
        • as in the car example above, most car boxes are small, so the anchor shapes clustered for level0 and level1 are still small, yet level0 is a sparse grid
          • on one hand, the grid size and the anchor size do not match
          • on the other hand, the label-rewriting problem gets worse
        • conversely, predicting large objects on a dense grid is limited by the receptive field
        • one workaround is to first split the gt boxes into three groups by receptive field, cluster each group separately, and then match 1 anchor out of the 9
      • from the yolov3 paper: "YOLOv3 has relatively high $AP_{small}$ performance. However, it has comparatively worse performance on medium and larger size objects. More investigation is needed to get to the bottom of this."

        • small-object performance is better and medium/large-object performance worse mainly because the label-rewriting problem on the coarse grid suppresses part of the gt boxes.
  3. 方法

    • architecture

      • single output

      • higher resolution:stride4

      • handle all the anchors at once

      • cross-scale fusion

        • hypercolumn technique:add operation
        • stairstep interpolation:x2 x2 …

      • SE-blocks

      • reduced the number of convolutional filters to 75% in the feature extraction phase
    • bounding polygons

      • extend the box tuple:$b_i=\{b_i^{x^1},b_i^{y^1},b_i^{x^2},b_i^{y^2},V_i\}$
      • The center of a bounding box is used as the origin
      • polygon tuple:$v_{i,j}=\{\alpha_{i,j},\beta_{i,j},\gamma_{i,j}\}$
      • polar coordinates: distance & oriented angle; the distance is relative (to the anchor-box diagonal) and the angle is normalized to [0,1]
      • polar cell: a sector covering a fixed angle; if a sector contains no vertex, its confidence target is 0

      • general shape:

        • objects of the same shape at different scales have the same representation in polar coordinates
        • multiplying the distance by the anchor-box diagonal converts it back to the absolute scale
        • the two diagonal-corner predictions of the bounding box handle the scale estimation; the polygon only predicts the shape
        • sharing values should make the learning easier
    • mix loss

      • output: a*(4+1+3*n_vmax)
      • box center loss: bce
      • box wh loss: l2 loss
      • conf loss: bce with ignore mask
      • cls loss: bce
      • polygon loss: $\gamma\left(\log\left(\frac{\alpha}{d_{anchor}}\right)-\hat \alpha\right)^2 + \gamma\,\mathrm{bce}(\beta,\hat{\beta})+\mathrm{bce}(\gamma, \hat \gamma)$
      • auxiliary task learning:
        • the tasks boost each other
        • converge faster

Scaled-YOLOv4: Scaling Cross Stage Partial Network

  1. 动机

    • model scaling method
    • redesign yolov4 and propose yolov4-CSP
    • develop scaled yolov4
      • yolov4-tiny
      • yolov4-large
    • not many new technical details; mainly a major update of the network structure
  2. 论点

    • common technique changes depth & width of the backbone

    • recently there are NAS

    • model scaling

      • the input size, width and depth cause a square, linear and square increase in the amount of computation, respectively

      • after converting the structure to CSP versions, the parameter count and computation drop, accuracy improves and inference time shortens

      • detection accuracy depends heavily on the receptive field; the RF grows linearly with depth and multiplicatively with stride, so the input size and the number of stages are usually tuned together first, and depth and width are then adjusted to the compute budget

  3. 方法

    • backbone:CSPDarknet53

    • neck: CSP-PAN (about 40% less computation), SPP

    • yoloV4-tiny

    • yoloV4-large:P456

  4. 补充细节

    • box regression

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

  1. 精度

    • YOLOv7 beats all real-time detectors (models running above 30 FPS on a V100 GPU) with 56.8% AP
    • YOLOv7-E6 beats both the transformer-based (SWIN-L Cascade-Mask R-CNN) and the conv-based (ConvNeXt-XL Cascade-Mask R-CNN) two-stage models

  2. 论点

    • on the architecture side

      • CPU real-time detectors are mostly based on MobileNet, ShuffleNet or GhostNet

      • GPU real-time detectors mostly use ResNet, DarkNet or the CSPNet strategy

    • Real-time object detectors

      • mostly based on YOLO / FCOS, and they need
        1. a faster and stronger backbone
        2. more effective feature integration
        3. a more accurate detection head
        4. a more robust loss
        5. a more efficient label assignment method
        6. a more efficient training method
      • this paper focuses on items 4/5/6
    • Model re-parameterization

      • merge multiple computational modules into one at the inference stage
      • essentially an ensemble at the model/module level
      • model level
        • ema
        • k-fold ensemble
      • module level
        • merging linear layers
        • designing modules that can be merged (e.g. keeping the non-linearity outside the branches)
    • Model scaling

      • most NAS methods do not consider the correlation between the scaling factors
      • networks like darknet are in fact compound-scaled: when the network is deepened, the branches inside its blocks are widened at the same time, so the effect is not an isolated increase in depth
    • main contribution

      • the focus is on the training method
        • optimize the training process
        • training cost increases, but inference performance is unaffected
        • hence the name trainable bag-of-freebies
      • two issues & solutions
        • model re-parameterization
        • dynamic label assignment
      • propose extend & compound scaling
        • 40% fewer parameters & 50% less computation
        • faster & higher accuracy
  3. Architecture

    • Extended efficient layer aggregation networks

      • considered factors
        • the number of parameters
        • the amount of computation
        • the computational density
      • propose Extended-ELAN

        • group conv: at the same computation cost the channel count can be expanded (expand cardinality)
        • shuffle & merge: combine the features of different groups, continuously enhancing the learning ability without hurting the original gradient path (I do not quite understand why it has this effect)

    • Model scaling for concatenation-based models

      • most work on scaling up keeps the input/output widths of the internal blocks unchanged when the network is deepened, but for the concatenation-based architecture used here, deepening adds branches and therefore also changes the width, so a single scale-up factor cannot be discussed in isolation
      • compound scaling

        • when the depth of a computation block is increased, its output width changes as well
        • so the widths of the other parts must be adjusted in step
        • in order to maintain the optimal structure

  4. Trainable bag-of-freebies

    • Planned re-parameterized convolution: RepConvN

      • inspired by RepConv; RepVGG works well, but directly reusing RepConv on ResNet/DenseNet-style structures causes a significant accuracy drop

      • recall RepConv: a single 3x3 conv at inference corresponds to 2/3 branches during training (identity + 1x1 + 3x3)

      • the experiments in this paper show that the identity path destroys the residual structure of ResNet/DenseNet, so the modified RepConv removes the identity path (RepConvN)

    • Coarse for auxiliary and fine for lead loss: label assigner

      • deep supervision: add auxiliary branches to intermediate layers for loss guidance

        • the branch kept as the final output is called the lead head
        • the branch used only as an aid during training is called the auxiliary head

      • label assignment

        • soft label: e.g. use the IoU between the predicted box and the gt box as the objectness target, i.e. the label is adjusted according to the network's own predictions
        • one issue: how should labels be assigned and the loss computed for each head?
          • each head assigns its own labels
          • use the lead prediction to assign for both: the lead head has stronger learning capacity, so the labels it generates are more representative
        • this paper goes one step further
          • use the lead prediction to assign for both
          • and generate coarse-to-fine hierarchical labels to fit the different feature levels
          • the fine label is the soft label computed from the lead head as above
          • the coarse label allows more grids to be positives, i.e. the constraints are relaxed: forcing layers with weaker learning capacity to fit overly precise labels early on degrades the later layers, whereas encouraging more positives early and only filtering out low-quality boxes later works better
          • concretely, restrictions are put in the decoder so that the extra coarse positive grids cannot produce soft labels properly; this fine-coarse assignment is carried out online during training
    • Other trainable bag-of-freebies

      • BN: conv-bn-activation, the linear computation units are merged at test time (a fusion sketch follows)
      • Implicit knowledge in YOLOR: can also be merged at test time
      • EMA: the classic ensembling trick
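
      A numpy sketch of the generic conv-BN fusion at test time (textbook formula, not YOLOv7's code):

      import numpy as np

      def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
          # W: (out_c, ...) conv kernels, b: (out_c,) bias; BN stats/affine are per output channel
          scale = gamma / np.sqrt(var + eps)
          W_fused = W * scale.reshape(-1, *([1] * (W.ndim - 1)))
          b_fused = (b - mean) * scale + beta
          return W_fused, b_fused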

triplet-center-loss论文

发表于 2019-11-13 |

0. before reading

Combine:

  • triplet loss: models inter-class relations, but the computational complexity is high and hard samples are difficult to mine

  • center loss: models intra-class relations

  • TCL: simultaneously increases the compactness of intra-class data and the separability between classes

    The triplet only involves the sample, the center of its own class, and the center of the nearest other class, which avoids the complexity of constructing triplets and the difficulty of mining hard samples.

title:Triplet-Center Loss for Multi-View 3D Object Retrieval

  1. 动机:deep metric learning

    • the learned features using softmax loss are not discriminative enough in nature

    • although samples of the two classes are separated by the decision boundary elaborately, there exists significant intra-class variations

    • QUESTION1: so what? how does this affect the current task? The motivation is not argued sufficiently.

    • QUESTION2: overlap in a 2D projection does not imply overlap in the high-dimensional space; is this kind of illustration really meaningful?

    • ANSWER for the above: separability in the high-dimensional space does not guarantee separability after projecting onto a 2D plane, but conversely, if the data are highly separable on the 2D plane, they remain highly separable when mapped back to the high-dimensional space. So the plot can only show that the separability is good; it cannot prove that the original features are poorly separated (unless a distance in the high-dimensional space is defined).

  2. 应用场景:3D object retrieval

  3. 要素:

    • learns a center for each class
    • requires that the distances between samples and centers from the same class are smaller than those from different classes, in this way the samples are pulled closer to the corresponding center and meanwhile pushed away from the different centers
    • both the inter-class separability and the intra-class variations are considered
  4. 论点:

    • Compared with triplet loss, TCL avoids the complex construction of triplets and hard sample mining mechanism.

    • Compared with center loss, TCL not only considers reducing the intra-class variations but also enlarging the inter-class distances.

    • QUESTION:what about the comparison with [softmax loss + center loss]?

    • ANSWER for above:center-loss is actually representing for the joint loss [softmax loss + center loss].

      ‘’Since the class centers are updated at each iteration based on a mini-batch instead of the whole dataset, which can be very unstable, it has to be under the joint supervision of softmax loss during training. ‘’

  5. 本文做法:

    • the proposed TCL is used as the supervision loss
    • the softmax loss could be also combined in as an addition
  6. 细节:

    • TCL:

      The first term is the center-loss-style intra-class Euclidean distance; the second term is the distance between each sample and its nearest negative (other-class) center. A small sketch follows below.
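
      A numpy sketch matching my reading of the description above (not the authors' code): a hinge on (distance to own center) + margin - (distance to the nearest other center).

      import numpy as np

      def triplet_center_loss(feats, labels, centers, margin=5.0):
          # feats: (N, D); labels: (N,) ints; centers: (C, D)
          d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)  # (N, C)
          pos = d[np.arange(len(labels)), labels]
          d_other = d.copy()
          d_other[np.arange(len(labels)), labels] = np.inf
          return np.maximum(0.0, pos + margin - d_other.min(axis=1)).mean()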

    • ‘Unlike center loss, TCL can be used independently from softmax loss. However… ‘

      The authors explain that because the center layer is randomly initialized and updated batch by batch, the early stage of training is tricky, "while softmax loss could serve as a good guider for seeking better class centers".

    • The tuning section mentions "m is fixed to 5", which implies that the feature vectors are not normalized (in contrast, FaceNet normalizes the embeddings so that they all lie on a high-dimensional sphere).

    • Metrics: AUC and mAP. This is a retrieval task whose final product is the embedding: given a query, recall the top matches.

  7. reviews:

    • 个人理解:
      1. A softmax classifier only aims at making the data separable; the decision boundary and the feature vectors have no concrete geometric meaning. Deep metric learning introduces such a concrete, geometric meaning, and only on that basis does it make sense to discuss distances and centers.
      2. For closed-set classification (a finite, known set of classes), whether the decision boundary has a geometric description hardly matters. The main value of making the known classes as discriminative as possible is for unknown classes: given a sample of an unseen class, we want the model to detect it rather than force it into one of the known classes (as softmax does).
      3. TCL's biggest contribution is the idea of replacing samples with centers for the metric comparison, which alleviates the heavy computation of triplet loss; the latter is really hard to train in practice, a merciless GPU devourer.

dicomReader

发表于 2019-11-11 |
  1. read a dcm file

    import SimpleITK as sitk

    image = sitk.ReadImage(dcm_file)
    image_arr = sitk.GetArrayFromImage(image)
  2. read a dcm series

    series_IDs = sitk.ImageSeriesReader.GetGDCMSeriesIDs(series_path)

    nb_series = len(series_IDs)
    print(nb_series)

    # by default, take all slice paths of the first series
    dicom_names = sitk.ImageSeriesReader.GetGDCMSeriesFileNames(series_path)
    series_reader = sitk.ImageSeriesReader()
    series_reader.SetFileNames(dicom_names)
    image3D = series_reader.Execute()
  3. read a dcm case

    series_IDs = sitk.ImageSeriesReader.GetGDCMSeriesIDs(case_path)
    for series_id in series_IDs:
        dicom_names = sitk.ImageSeriesReader.GetGDCMSeriesFileNames(case_path, series_id)
        series_reader = sitk.ImageSeriesReader()
        series_reader.SetFileNames(dicom_names)
        image3D = series_reader.Execute()
  4. read tag

    # first obtain the image object as above
    Image_type = image.GetMetaData("0008|0008") if image.HasMetaData("0008|0008") else 'Nan'
  5. Encountered a series in which every slice has a different image size; running series_reader on it raises an error, because the reader allocates memory according to the first slice's size, so either catch the exception or read the slices one by one.

    reference: http://itk-users.7.n7.nabble.com/ITK-users-Reader-InvalidRequestedRegionError-td38608.html

c++ tricks in engineering

发表于 2019-11-06 |
  1. 数组传参

    Got burned by this many times in production code!

    In C/C++, when an array is passed to a function, what is passed is the address of its first element, and inside the function the parameter is just an ordinary pointer.

    So trying to compute the length of the passed array inside the called function does not work.

  2. vector传参

    Pass-by-value triggers copy construction; pass-by-reference / pass-by-pointer does not.

    The actual problem hit in production: a vector<cv::Mat> imgs object was constructed and passed into a function; inside the function a cv::Mat img was created and pushed into the vector; when the vector was read outside the function it was empty.

    Key points: 1. pass the vector by reference; 2. push a clone: imgs.push_back(img.clone())

    Also, a vector can be used as a function return value.

图像算法综述

发表于 2019-10-31 |
  1. 类别

    • 按照任务类型:度量学习(metric learning)和描述子学习(image descriptor learning)
    • 按照网络结构:pairwise的siamese结构、triplet的three branch结构、以及引入尺度信息的central-surround结构
    • 按照网络输出:特征向量(feature embedding)和单个概率值(pairwise similarity)
    • 按照损失函数:对比损失函数、交叉熵损失函数、triplet loss、hinge loss等,此外损失函数可以带有隐式的困难样本挖掘,例如pn-net中的softpn等,也可以是显示的困难挖掘。
  2. Plain网络

    This mainly refers to AlexNet/VGG-Net; the latter is used more often.

    The design of plain networks mainly follows these rules:

    (1) Layers producing feature maps of the same size use the same number of filters.

    (2) If the feature-map size is halved, the number of filters is doubled, so that every layer has the same time complexity (why is that??).

  3. 名词

    • receptive field: the size of the region in the original image that a pixel on a layer's output feature map maps back to; informally, how much of the input image can influence one output neuron of that layer.

      Receptive-field computation: iterate backward from the current layer toward the input; the required parameters are the kernel size and the stride.

      $cur\_RF = (N\_RF - 1) \times stride + kernel\_size$, where $cur\_RF$ is the receptive field of the current layer (starting from 1) and $N\_RF$, $kernel\_size$, $stride$ are the parameters of the previous layer in the recursion.

    • effective receptive field: not all pixels inside the receptive field contribute equally to the output; in many cases the influence is roughly Gaussian, so the effective receptive field is only a fraction of the theoretical one and decays quickly from the center to the border.

    • receptive-field size:

      • small receptive field: local, more precise location information
      • large receptive field: global, richer semantic information
  • inception module:下图为其中一种。

    意义:增加网络深度和宽度的同时,减少参数。结构中嵌入了多尺度信息,集成了多种不同感受野上的特征。

  • building block:左边这种,红色框框里面是一个block。

    几个相同的building block堆叠为一层conv。在第一个building Block块中,输出特征图的尺寸下降一半(第一个卷积stride=2),剩余的building Block块输入输出尺寸是一样的。

  • bottleneck:右边这种,蓝色框框block。字面意思,瓶颈,形容输入输出维度差距较大。

    第一个1*1负责降低维度,第二个1*1负责恢复维度,3*3层就处在一个输入/输出维度较小的瓶颈。

    左右两种结构时间复杂度相似。
    
    <img src="图像算法综述/block.png" width="30%;" />
    
    <img src="图像算法综述/ImageNet.png" width="110%;" />
    
    • top-1和top-5:top-1就是预测概率最大的类别,top-5则取最后预测概率的前五个,只要其中包含正确类别则认为预测正确。

      使用top-5主要是因为ImageNet中很多图片中其实是包含多个物体的。

    • accuracy、error rate、F1-score、sensitivity、specificity、precision、recall

      • accuracy:总体准确率
      • precision:从结果角度,单一类别准确率
      • recall:从输入角度,预测类别真实为1的准确率
      • P-R曲线:选用不同阈值,precision-recall围成的曲线
      • AP:平均精度,P-R曲线围住的面积
      • F1-score:对于某个分类,综合了Precision和Recall的一个判断指标,因为选用不同阈值,precision-recall会随之变化,F1-score用于选出最佳阈值。
      • sensitivity:=recall
      • specificity:预测类别真实为0的准确率

      reference:https://zhuanlan.zhihu.com/p/33273532

    • trade-off:

    • FLOPS:每秒浮点运算次数是每秒所执行的浮点运算次数的简称,被用来估算电脑效能。

    • ROC、AUC、MAP:

      • ROC:TPR和FPR围成的曲线
      • AUC:ROC围住的面积
      • mAP:所有类别AP的平均值
    • 梯度弥散:

    • “底层先收敛、高层再收敛”:

    • 特征图:卷积层通过线性滤波器进行线性卷积运算,然后再接个非线性激活函数,最终生成特征图。

    • TTA test time augmentation:测试时增强,为原始图像造出多个不同版本,包括不同区域裁剪和更改缩放程度等,并将它们输入到模型中;然后对多个版本进行计算得到平均输出,作为图像的最终输出分数。

    • pooling mode:

      • full mode:从filter和image刚开始相交开始卷积
      • same mode:当filter的中心和image的角重合时开始卷积,如果stride=1,那么输入输出尺寸相同
      • valid mode:当filter完全在image里面时开始卷积

    • 空间不变性:

      • 平移不变性:不管输入如何平移,系统产生完全相同的响应,比如图像分类任务,图像中的目标不管被移动到图片的哪个位置,得到的结果(标签)应该是相同的
      • 平移同变性(translation equivariance):系统在不同位置的工作原理相同,但它的响应随着目标位置的变化而变化,比如实例分割任务,目标如果被平移了,那么输出的实例掩码也相应变化
      • 局部连接:每个神经元没有必要对全局图像进行感知,只需要对局部进行感知,然后在更高层将局部的信息综合起来就得到了全局的信息
      • 权值共享:对于这个图像上的所有位置,我们都能使用同样的学习特征
      • 池化:通过消除非极大值,降低了上层的计算复杂度。最大池化返回感受野中的最大值,如果最大值被移动了,但是仍然在这个感受野中,那么池化层也仍然会输出相同的最大值。
      • 卷积和池化这两种操作共同提供了一些平移不变性,即使图像被平移,卷积保证仍然能检测到它的特征,池化则尽可能地保持一致的表达。
      • 同理,所谓的CNN的尺度、旋转不变性,也是由于pooling操作,引入的微小形变的鲁棒性。
    • 模型大小与参数量:float32是4个字节,因此模型大小字节数=参数量×4

  1. 训练技巧

    • 迁移学习:当数据集太小,无法用来训练一个足够好的神经网络,可以选择fine-tune一些预训练网络。使用时修改最后几层,降低学习率。

      keras中一些预训练权重下载地址:https://github.com/fchollet/deep-learning-models/releases/

    • K-fold交叉验证:

      1. 我们不能将全部数据集用于训练——这样就没有数据来测试模型性能了

      2. 将数据集分割为training set 和 test set,衡量结果取决于数据集划分,training set和全集之间存在bias,不同test下结果variety很大

      3. 交叉验证Cross-Validation:

        • 极端情况LOOCV:全集N,每次取一个做test,其他做train,重复N次,得到N个模型,并计算N个test做平均
        • K-fold:全集切分成k份,每次取一个做test,其他做train,重复k次~
        • 实验显示LOOCV和10-foldCV的结果很相近,后者计算成本明显减小
        • Bias-Variance Trade-Off:K越大,train set越接近全集,bias越小,但是每个train set之间相关性越大,而这种大相关性会导致最终的test error具有更大的Variance
  1. 分割

    • 实例分割&语义分割

      • instance segmentation:标记实例和语义, 不仅要分割出人这个类, 而且要分割出这个人是谁, 也就是具体的实例
      • semantic segmentation:只标记语义, 也就是说只分割出人这个类来

Segmentation

发表于 2019-08-22 |

idea:

  1. CT图一般是单通道灰度图像,假如我将128张CT图堆叠在一起(即128通道的图像),然后用2D卷积(会考虑通道数128),这样和直接用3D卷积会有结果上的差别吗?
  2. 3d网络可以结合图像层间信息,能够保证隔层图像Mask之间的一个变化连续性,效果会比2d好。层间距大的图像,在预处理中会有插值。
  3. 3d网络因为显存的限制,一种处理方式是裁成3d patch作为输入,导致其感受野有限,通常只能专注于细节和局部特征,适合作为第二级网络用于对细节做精优化。一种处理方式是降采样,分割精度下降。
  4. 2.5D网络。