Motivation
- embrace shorter connections between layers close to the input and layers close to the output
- the feature-maps of all preceding layers are used as inputs
- advantages
- alleviate the vanishing-gradient problem
- encourage feature reuse
- reduce the number of parameters
Arguments
- dense connectivity
- each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers
- feature reuse
- combine features by concatenation: the summation in ResNet may impede information flow in the network
- information preservation
- contrast with ResNet's identity shortcut (additive identity transformations): concatenation keeps preserved information and newly added information explicitly separate
- fewer parameters
- DenseNet layers are very narrow
- add only a small set of feature-maps to the “collective knowledge”
- gradient flow
- each layer has direct access to the gradients from the loss function
- dense connections have a regularizing effect, reducing overfitting on smaller training sets
Method
architecture (sketch after this list)
- dense blocks
- concat
- BN-ReLU-3x3 conv
- $x_l = H_l([x_0, x_1, \dots, x_{l-1}])$
- transition layers
- change the size of feature-maps between dense blocks
- BN-1x1 conv-2x2 avg pooling
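A minimal PyTorch sketch of these two building blocks, assuming the composite function and transition described above (class names and exact module wiring are my own illustration, not the authors' reference code): each dense layer applies BN-ReLU-3x3 conv to the channel-wise concatenation of all earlier feature-maps and emits its own small set of new maps; the transition layer downsamples between blocks.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """H_l: BN -> ReLU -> 3x3 conv applied to the concatenation [x_0, ..., x_{l-1}]."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, features):
        # features: list holding the block input and every earlier layer's output
        x = torch.cat(features, dim=1)            # dense connectivity = channel-wise concat
        return self.conv(self.relu(self.norm(x)))

class Transition(nn.Module):
    """Transition between dense blocks: BN -> 1x1 conv -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)         # a ReLU here is common in implementations
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.relu(self.norm(x))))
```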
growth rate k (worked example after this list)
- each $H_l$ produces $k$ feature-maps
- narrow: e.g., k = 12
- One can view the feature-maps as the global state of the network
- The growth rate regulates how much new information each layer contributes to the global state
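To make the growth rate concrete, a small channel-bookkeeping example (the numbers $k_0 = 24$ and $k = 12$ are illustrative): layer $l$ of a block reads $k_0 + k(l-1)$ input channels and contributes $k$ new ones, so the global state grows linearly with depth.

```python
# channel bookkeeping inside one dense block (illustrative numbers)
k0, k, num_layers = 24, 12, 6            # block input channels, growth rate, layers
for l in range(1, num_layers + 1):
    in_ch = k0 + k * (l - 1)             # layer l sees all preceding feature-maps
    print(f"layer {l}: {in_ch} channels in -> {k} channels out")
print(f"block output: {k0 + k * num_layers} channels")   # 24 + 6*12 = 96
```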
bottleneck (DenseNet-B)
- applied inside dense blocks
- a 1x1 conv reduces the number of input channels before the 3x3 conv
- number of bottleneck channels: 4k
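A sketch of the bottleneck version of $H_l$, with the same caveat that this is an illustration rather than the reference implementation: BN-ReLU-1x1 conv first reduces the concatenated input to 4k channels, then BN-ReLU-3x3 conv produces the k new feature-maps.

```python
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """DenseNet-B form of H_l: BN -> ReLU -> 1x1 conv (to 4k), then BN -> ReLU -> 3x3 conv (to k)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        bottleneck = 4 * growth_rate
        self.reduce = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, bottleneck, kernel_size=1, bias=False))
        self.conv = nn.Sequential(
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, features):
        x = torch.cat(features, dim=1)    # concatenate all preceding feature-maps
        return self.conv(self.reduce(x))
```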
compression (DenseNet-C)
- applied at transition layers
- reduce the number of feature-maps
- number of output channels: $\lfloor \theta m \rfloor$, where $m$ is the number of input feature-maps and $0 < \theta \le 1$ is the compression factor (DenseNet-C uses $\theta = 0.5$)
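The compressed transition as a short sketch (the helper name is mine; the $\lfloor \theta m \rfloor$ rule follows the description above):

```python
import math
import torch.nn as nn

def compressed_transition(in_channels, theta=0.5):
    """DenseNet-C transition: BN -> 1x1 conv to floor(theta * m) channels -> 2x2 avg pooling."""
    out_channels = math.floor(theta * in_channels)
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),            # a ReLU here is common in implementations
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2))
```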
structure configurations (channel-plan sketch after this list)
- 1st conv channels: the number of output channels of the initial convolution layer
- number of dense blocks
- L: network depth, i.e. the total number of layers (together with the number of blocks it fixes how many layers each dense block gets)
- k: growth rate
- B: bottleneck layers (1x1 conv down to 4k channels)
- C: compression at transition layers with $\theta = 0.5$
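Putting these configuration parameters together, a rough channel plan for a CIFAR-style DenseNet-BC such as L=100, k=12 with 3 blocks (the layers-per-block formula and the 2k-channel first conv are my reading of the BC setup; the printed numbers are only a sanity check):

```python
import math

def densenet_bc_channel_plan(depth=100, growth_rate=12, num_blocks=3, theta=0.5):
    """Channel bookkeeping for a DenseNet-BC-style configuration."""
    # with bottleneck layers each "layer" contains two convs, so a depth-L network
    # has (L - 4) // (2 * num_blocks) layers per dense block (assumed CIFAR setup)
    layers_per_block = (depth - 4) // (2 * num_blocks)
    channels = 2 * growth_rate                       # first conv: 2k output channels
    for b in range(num_blocks):
        channels += layers_per_block * growth_rate   # each layer adds k feature-maps
        print(f"block {b + 1}: {layers_per_block} layers -> {channels} channels")
        if b < num_blocks - 1:                       # compressed transition in between
            channels = math.floor(theta * channels)
            print(f"  transition -> {channels} channels")
    print(f"feature-maps entering the classifier: {channels}")

densenet_bc_channel_plan()
```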
Discussion
concatenation replaces summation:
- this seemingly small modification leads to substantially different behaviors of the two network architectures
- feature reuse: features can be accessed directly by any later layer
- parameter efficiency: with the same number of parameters DenseNet reaches higher test accuracy, and the same accuracy is reached with fewer parameters
deep supervision: the short connections give every layer direct supervision from the loss function, similar in spirit to deeply-supervised nets, which attach classifiers to every hidden layer
weight assignment (average absolute conv weights per source layer; sketch after this list)
- all layers spread their weights over many inputs (including the transition layers)
- the least weight is assigned to the outputs of the transition layers, indicating that they produce many redundant features, which is why compressing them (DenseNet-BC) works well
- overall there seems to be a concentration towards the final feature-maps, suggesting that more high-level features are produced late in the network
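A hedged sketch of how such a weight-assignment profile could be read off a trained layer (the slicing assumes the conv input is the channel-wise concatenation of earlier outputs, as in the DenseLayer sketch above; it is not the authors' exact evaluation code): for one target layer, average the absolute 3x3-conv weights over the channel slice contributed by each source layer.

```python
import torch

def source_weight_profile(conv_weight, source_channel_counts):
    """Average absolute weight a dense layer assigns to each preceding source.

    conv_weight: (out_channels, in_channels, kH, kW) tensor of the layer's 3x3 conv,
        where in_channels is the concatenation of all earlier feature-maps.
    source_channel_counts: channels contributed by each source, e.g. [k0, k, k, ...].
    """
    profile, start = [], 0
    for count in source_channel_counts:
        chunk = conv_weight[:, start:start + count]   # slice belonging to one source layer
        profile.append(chunk.abs().mean().item())
        start += count
    return profile

# e.g. the 4th layer of a block with k0 = 24, k = 12 reads sources of [24, 12, 12, 12] channels
weights = torch.randn(12, 60, 3, 3)
print(source_weight_profile(weights, [24, 12, 12, 12]))
```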