CNN Visualization系列

1. Visualizing and Understanding Convolutional Networks

动机
- give insight into the internal operation and behavior of the complex models
- then one can design better models
- reveal which parts of the scene in image are important for classification
- explore the generalization ability of the model to other datasets
论点
- most visualizing methods limited to the 1st layer where projections to pixel space are possible
- Our approach propose a method that could projects high level feature maps to the pixel space

* some methods give some insight into invariances basing on a simple quadratic approximation 
* Our approach, by contrast, provides a non-parametric view of invariance 



* some methods associate patches that responsible for strong activations at higher layers
* In our approach they are not just crops of input images, but rather top-down projections that reveal structures

方法

3.1 Deconvnet: use deconvnet to project the feature activations back to the input pixel space
- To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer
- Then successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity of the layer beneath until the input pixel space is reached
- 【Unpooling】using switches
- 【Rectification】the convnet uses relu to ensure always positive, same for back projection
- 【Filtering】transposed conv
- Due to unpooling, the reconstruction obtained from a single activation resembles a small piece of the original input image
  
  3.2 CNN model
  
  3.3 visualization among layers
- for each layer, we take the top9 strongest activation across the validation data
- calculate the back projection separately
- alongside we provide the corresponding image patches
  
  3.4 visualization during training
- randomly choose several strongest activation of a given feature map
- lower layers converge fast, higher layers conversely
  
  3.5 visualizing the Feature Invariance
- 5 sample images being translated, rotated and scaled by varying degrees
- Small transformations have a dramatic effect in the first layer of the model(c2 & c3对比)
- the network is stable to translations and scalings, but not invariant to rotation
  
  3.6 architecture selection
- old architecture(stride4, filterSize11)：The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. The 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. (这点可以参考之前vnet中提到的，deconv导致的棋盘格伪影，大stride会更明显)
- smaller stride & smaller filter(stride2, filterSize7)：more coverage of mid frequencies, no aliasing, no dead feature
  
  3.7
- 对于物体的关键部分遮挡之后会极大的影响分类结果
- 第二个和第三个例子中分别是文字和人脸的响应更高，但是却不是关键部分。
理解

4.1 总的来说，网络学习到的特征，是具有辨别性的特征，通过可视化就可以看到我们提取到的特征忽视了背景，而是把关键的信息给提取出来了。从layer 1、layer 2学习到的特征基本上是颜色、边缘等低层特征；layer 3则开始稍微变得复杂，学习到的是纹理特征，比如上面的一些网格纹理；layer 4学习到的则是较多的类别信息，比如狗头；layer 5对应着更强的不变性，可以包含物体的整体信息。。

4.2 在网络迭代的过程中，特征图出现了sudden jumps。低层在训练的过程中基本没啥变化，比较容易收敛，高层的特征学习则变化很大。这解释了低层网络的从训练开始，基本上没有太大的变化，因为梯度弥散。高层网络刚开始几次的迭代，变化不是很大，但是到了40~50的迭代的时候，变化很大，因此我们以后在训练网络的时候，不要着急看结果，看结果需要保证网络收敛。

4.3 图像的平移、缩放、旋转，可以看出第一层中对于图像变化非常敏感，第7层就接近于线性变化。

2. Striving for Simplicity: The All Convolutional Net

动机
- traditional pipeline: alternating convolution and max-pooling layers followed by a small number of fully connected layers
- questioning the necessity of different components in the pipeline, max-pooling layer to be specified
- to analyze the network we introduce a new variant of the “deconvolution approach” for visualizing features
论点
- two major improving directions based on traditional pipeline
  - using more complex activation functions
  - building multiple conv modules
- we study the most simple architecture we could conceive
  - a homogeneous network solely consisting of convolutional layers
  - without the need for complicated activation functions, any response normalization or max-pooling
  - reaches state of the art performance
方法
- replace the pooling layers with standard convolutional layers with stride two
  - the spatial dimensionality reduction performed by pooling makes covering larger parts of the input in higher layers possible
  - which is crucial for achieving good performance with CNNs
- make use of small convolutional layers
  - greatly reduce the number of parameters in a network and thus serve as a form of regularization
  - if the topmost convolutional layer covers a portion of the image large enough to recognize its content then fully connected layers can also be replaced by simple 1-by-1 convolutions
- the overall architecture consists only of convolutional layers with rectified linear non-linearities and an averaging + softmax layer to produce predictions
  - Strided-CNN-C: pooling is removed and the preceded conv stride is increase
  - ConvPool-CNN-C: a dense conv is placed, to show the effect of increasing parameters
  - All-CNN-C: max-pooling is replaced by conv
  - when pooling is replaced by an additional convolution layer with stride 2, performance stabilizes and even improves
  - small 3 × 3 convolutions stacked after each other seem to be enough to achieve the best performance
- guided backpropagation
  - the paper above proposed ‘deconvnet’, which we observe that it does not always work well without max-pooling layers
  - For higher layers of our network the method of Zeiler and Fergus fails to produce sharp, recognizable image structure
  - Our architecture does not include max-pooling, thus we can ’deconvolve’ without switches, i.e. not conditioning on an input image
  - In order to obtain a reconstruction conditioned on an input image from our network without pooling layers we to combine the simple backward pass and the deconvnet
  - Interestingly, the very first layer of the network does not learn the usual Gabor filters, but higher layers do

3. Cam: Learning Deep Features for Discriminative Localization

动机
- we found that CNNs actually behave as object detectors despite no supervision on the location
- this ability is lost when fully-connected layers are used for classification
- we found that the advantages of global average pooling layers are beyond simply acting as a regularizer
- it makes it easily to localize the discriminative image regions despite not being trained for them
论点

2.1 Weakly-supervised object localization
- previous methods are not trained end-to-end and require multiple forward passes
- Our approach is trained end-to-end and can localize objects in a single forward pass
  
  2.2 Visualizing CNNs
- previous methods only analyze the convolutional layers, ignoring the fully connected thereby painting an incomplete picture of the full story
- we are able to understand our network from the beginning to the end
方法

3.1 Class Activation Mapping
- A class activation map for a particular category indicates the discriminative image regions used by the network to identify that category
- the network architecture: convs—-gap—-fc+softmax
- we can identify the importance of the image regions by projecting back the weights of the output layer on to the convolutional feature maps
- by simply upsampling the class activation map to the size of the input image we can identify the image regions most relevant to the particular category
  
  3.2 Weakly-supervised Object Localization
- our technique does not adversely impact the classification performance when learning to localize
- we found that the localization ability of the networks improved when the last convolutional layer before GAP had a higher spatial resolution, thus we removed several convolutional layers from the origin networks
- overall we find that the classification performance is largely preserved for our GAP networks compared with the origin fc structure
- our CAM approach significantly outperforms the backpropagation approach on generating bounding box
- low mapping resolution prevents the network from obtaining accurate localizations
  
  3.3 Visualizing Class-Specific Units
- the convolutional units of various layers of CNNs act as visual concept detec- tors, identifying low-level concepts like textures or mate- rials, to high-level concepts like objects or scenes
- Deeper into the network, the units become increasingly discriminative
- given the fully-connected layers in many networks, it can be difficult to identify the importance of different units for identifying different categories

4. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

5. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

6. 综述

GAP

首先回顾一下GAP，NiN中提出了GAP，主要为了解决全连接层参数过多，不易训练且容易过拟合等问题。

对大多数分类任务来说不会因为做了gap让特征变少而让模型性能下降。因为GAP层是一个非线性操作层，这C个特征相当于是从kxkxC经过非线性变化选择出来的强特征。
heatmap

step1. 图像经过卷积网络后最后得到的特征图，在全连接层分类的权重（$w_{k,n}$）肯定不同，

step2. 利用反向传播求出每张特征图的权重，

step3. 用每张特征图乘以权重得到带权重的特征图，在第三维求均值，relu激活，归一化处理
- relu只保留wx大于0的值——我们正响应是对当前类别有用的特征，负响应会拉低$\sum wx$，即会降低当前类别的置信度
- 如果没有relu，定位图谱显示的不仅仅是某一类的特征。而是所有类别的特征。
  
  step4. 将特征图resize到原图尺寸，便于叠加显示
CAM

CAM要求必须使用GAP层，

CAM选择softmax层值最大的节点反向传播，求GAP层的梯度作为特征图的权重，每个GAP的节点对应一张特征图。
Grad-CAM

Grad-CAM不需要限制模型结构，

Grad-CAM选择softmax层值最大的节点反向传播，对最后一层卷积层求梯度，用每张特征图的梯度的均值作为该特征图的权重。