CNN Visualization Series

1. Visualizing and Understanding Convolutional Networks

  1. Motivation

    • give insight into the internal operation and behavior of the complex models
    • then one can design better models
    • reveal which parts of the scene in an image are important for classification
    • explore the generalization ability of the model to other datasets
  2. Arguments

    • most visualization methods are limited to the 1st layer, where projections to pixel space are possible

    • our approach projects high-level feature maps back to the pixel space

    • some methods give some insight into invariances, based on a simple quadratic approximation

    • our approach, by contrast, provides a non-parametric view of invariance

    • some methods associate patches that are responsible for strong activations at higher layers

    • in our approach the visualizations are not just crops of input images, but top-down projections that reveal the structures within each patch that stimulate a particular feature map
  3. Method

    3.1 Deconvnet: use deconvnet to project the feature activations back to the input pixel space

    • To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer
    • Then successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity of the layer beneath until the input pixel space is reached
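    • 【Unpooling】uses the switches (the locations of the maxima recorded during max-pooling) to put each value back at its original position
    • 【Rectification】the convnet uses ReLU so that feature maps are always positive; the back projection is passed through ReLU as well
    • 【Filtering】applies the transposed versions of the learned filters (transposed convolution)
    • Because the switches are specific to the input image, the reconstruction obtained from a single activation resembles a small piece of the original input image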
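The unpooling and rectification steps above can be sketched as follows. This is a minimal NumPy illustration on a single 2-D map with non-overlapping pooling, not the paper's implementation; the function names `max_pool_with_switches` and `unpool` are hypothetical, and the filtering step (transposed convolution) is left out.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling on a 2-D map; also returns the switches."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0
    # Rearrange so that every pooling region becomes one row of k*k values.
    blocks = (x.reshape(h // k, k, w // k, k)
               .transpose(0, 2, 1, 3)
               .reshape(h // k, w // k, k * k))
    switches = blocks.argmax(axis=-1)  # position of the maximum inside each region
    pooled = blocks.max(axis=-1)
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Put each pooled value back at the position recorded by its switch."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, switches] = pooled
    # Undo the rearrangement done in max_pool_with_switches.
    return (blocks.reshape(ph, pw, k, k)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * k, pw * k))

x = np.array([[1., 2., 0., 1.],
              [3., 0., 5., 1.],
              [0., 1., 2., 2.],
              [7., 0., 0., 4.]])
pooled, switches = max_pool_with_switches(x)
recon = np.maximum(unpool(pooled, switches), 0.0)  # (i) unpool, (ii) rectify
```

Every position that was not a maximum stays zero in `recon`, which is why the reconstruction keeps only the structure that actually drove the activation.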

    3.2 CNN model

    3.3 visualization among layers

    • for each feature map in a layer, we take the top 9 strongest activations across the validation data

    • project each of them back to pixel space separately
    • alongside we provide the corresponding image patches
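Selecting the top 9 activations can be sketched like this, assuming the responses of one feature map over the validation set are already stored in an array of shape `(num_images, H, W)`. The helper name `top_k_images` and the random data are illustrative only.

```python
import numpy as np

def top_k_images(activations, k=9):
    """activations: (num_images, H, W) responses of ONE feature map.

    Returns the indices of the k images with the strongest single activation
    (strongest first) and the (row, col) location of that activation.
    """
    n = activations.shape[0]
    flat = activations.reshape(n, -1)
    peak = flat.max(axis=1)                # strongest response per image
    order = np.argsort(-peak)[:k]          # image indices, strongest first
    locs = [np.unravel_index(flat[i].argmax(), activations.shape[1:])
            for i in order]
    return order, locs

rng = np.random.default_rng(0)
acts = rng.random((50, 6, 6))              # stand-in for real activations
acts[17, 2, 3] = 5.0                       # plant two obvious winners
acts[4, 0, 1] = 4.0
idx, locs = top_k_images(acts)
```

The returned locations tell you which activation to keep (zeroing all others) before the back projection, and which image patch to crop for the side-by-side view.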

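    3.4 visualization during training

    • for a randomly chosen subset of feature maps, visualize the strongest activation of each map at several points during training

    • lower layers converge quickly (within a few epochs), whereas higher layers take much longer (around 40 to 50 epochs)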

    3.5 visualizing the Feature Invariance

    • 5 sample images are translated, rotated and scaled by varying degrees

    • small transformations have a dramatic effect in the first layer of the model but a much smaller one at the top layer (compare c2 and c3)
    • the network output is stable under translation and scaling, but in general not invariant to rotation
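One way to quantify this kind of sensitivity is the Euclidean distance between the feature vectors of the original and the transformed image. The sketch below is a toy under strong assumptions: `low_level_features` and `high_level_features` are hypothetical stand-ins for an early and a late layer, not features of the actual network.

```python
import numpy as np

def low_level_features(img):
    # Stand-in for an early layer: horizontal edge responses, position-sensitive.
    return (img[:, 1:] - img[:, :-1]).ravel()

def high_level_features(img):
    # Stand-in for a late layer: global statistics that discard position.
    return np.array([img.mean(), img.std(), img.max()])

def feature_distance(features, img, transformed):
    """Euclidean distance between the features of two images."""
    return float(np.linalg.norm(features(img) - features(transformed)))

img = np.zeros((16, 16))
img[4:8, 4:8] = 1.0                        # a bright square
shifted = np.roll(img, shift=3, axis=1)    # translate 3 pixels to the right

d_low = feature_distance(low_level_features, img, shifted)
d_high = feature_distance(high_level_features, img, shifted)
```

With a real model, the two feature functions would be replaced by the activations of the first and the last feature layer, and the distance would be plotted against the amount of translation, scaling or rotation.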

    3.6 architecture selection

    • old architecture (stride 4, filter size 11): the first-layer filters are a mix of extremely high- and low-frequency information, with little coverage of the mid frequencies. The 2nd-layer visualization shows aliasing artifacts caused by the large stride of 4 used in the 1st-layer convolutions (compare the checkerboard artifacts caused by deconv mentioned in the earlier V-Net notes, which get worse with a large stride)

    • smaller stride and smaller filter (stride 2, filter size 7): more coverage of the mid frequencies, no aliasing, no dead features

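    3.7 occlusion sensitivity

    • occluding the key parts of an object drastically changes the classification result

    • in the second and third examples, text and a human face respectively produce the stronger responses, yet they are not the key parts for the class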
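An occlusion experiment of this kind can be sketched as a sliding gray square. The code assumes a callable `predict_prob(image)` that returns the probability of the true class; `toy_predict_prob` below is a hypothetical stand-in for a real classifier, and the patch size, stride and fill value are arbitrary choices.

```python
import numpy as np

def occlusion_map(image, predict_prob, patch=4, stride=4, fill=0.5):
    """Slide a gray square over `image` and record the class probability
    returned by `predict_prob` for every occluder position."""
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = fill   # gray out one square
            heat[i, j] = predict_prob(occluded)
    return heat

def toy_predict_prob(img):
    # Hypothetical classifier: its confidence depends only on one "key" region.
    return float(img[4:8, 8:12].mean())

image = np.ones((16, 16))
heat = occlusion_map(image, toy_predict_prob)
```

Low values in `heat` mark the occluder positions that hurt the prediction most, i.e. the regions the classifier relies on.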
  4. Takeaways

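    4.1 Overall, the features the network learns are discriminative: the visualizations show that the extracted features ignore the background and pick out the key information. Layers 1 and 2 mostly learn low-level features such as colors and edges; layer 3 becomes somewhat more complex and learns textures, such as mesh patterns; layer 4 learns more class-specific information, such as dog faces; layer 5 shows stronger invariance and can capture entire objects.

    4.2 Over the course of training, the feature visualizations show sudden jumps. The lower layers converge easily and then change very little, while the higher-layer features change a lot. One possible explanation for the near-static lower layers is vanishing gradients (the paper itself only reports that the lower layers converge within a few epochs). The higher layers change little during the first few epochs but change substantially around epochs 40 to 50. So when training a network, do not rush to judge the results; inspect them only once the network has converged.

    4.3 Under translation, scaling and rotation of the image, the first layer is very sensitive to the change, whereas at layer 7 the change in the features is close to linear.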

2. Striving for Simplicity: The All Convolutional Net

  1. Motivation

    • traditional pipeline: alternating convolution and max-pooling layers followed by a small number of fully connected layers
    • we question the necessity of the different components in this pipeline, max-pooling in particular
    • to analyze the network we introduce a new variant of the “deconvolution approach” for visualizing features
  2. Arguments

    • two major directions for improving on the traditional pipeline
      • using more complex activation functions
      • building multiple conv modules
    • we study the most simple architecture we could conceive
      • a homogeneous network solely consisting of convolutional layers
      • without the need for complicated activation functions, any response normalization or max-pooling
      • reaches state of the art performance
  3. Method

    • replace the pooling layers with standard convolutional layers with stride two

      • the spatial dimensionality reduction performed by pooling makes covering larger parts of the input in higher layers possible
      • which is crucial for achieving good performance with CNNs
    • make use of small convolutional layers

      • greatly reduce the number of parameters in a network and thus serve as a form of regularization
      • if the topmost convolutional layer covers a portion of the image large enough to recognize its content then fully connected layers can also be replaced by simple 1-by-1 convolutions
    • the overall architecture consists only of convolutional layers with rectified linear non-linearities and an averaging + softmax layer to produce predictions

      • Strided-CNN-C: pooling is removed and the stride of the preceding conv layer is increased
      • ConvPool-CNN-C: a dense conv layer is placed before each pooling layer, to show the effect of the additional parameters
      • All-CNN-C: max-pooling is replaced by a conv layer with stride 2
      • when pooling is replaced by an additional convolution layer with stride 2, performance stabilizes and even improves
      • small 3 × 3 convolutions stacked after each other seem to be enough to achieve the best performance
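The arithmetic behind swapping pooling for a strided convolution can be checked with a few lines. This is a sketch: the input size 32, the 3 x 3 window and the 96 channels are illustrative assumptions, and `out_size` and `conv_params` are hypothetical helper names.

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (n + 2 * pad - kernel) // stride + 1

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

n, channels = 32, 96                           # illustrative values
pooled = out_size(n, kernel=3, stride=2)       # 3x3 max-pooling, stride 2
strided = out_size(n, kernel=3, stride=2)      # 3x3 convolution, stride 2
extra = conv_params(3, channels, channels)     # parameters pooling did not have
```

Both layers shrink the map in exactly the same way, so the replacement keeps the coverage of the higher layers unchanged; the only difference is that the strided convolution brings learnable parameters. That is why ConvPool-CNN-C is needed as a control for parameter count.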
    • guided backpropagation

      • the paper above proposed the 'deconvnet', which we observe does not always work well without max-pooling layers
      • for the higher layers of our network, the method of Zeiler and Fergus fails to produce sharp, recognizable image structure
      • our architecture does not include max-pooling, so we can 'deconvolve' without switches, i.e. without conditioning on an input image
      • to obtain a reconstruction conditioned on an input image from our network without pooling layers, we propose to combine the simple backward pass and the deconvnet (guided backpropagation): the signal passes back through a ReLU only where both the forward input and the backward signal are positive

      • Interestingly, the very first layer of the network does not learn the usual Gabor filters, but higher layers do
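The three ways of passing a signal backwards through a ReLU (plain backpropagation, deconvnet, guided backpropagation) differ only in their gating, which a small sketch makes concrete. The function name `relu_backward` and the `mode` strings are illustrative, not from any library.

```python
import numpy as np

def relu_backward(forward_input, grad_out, mode):
    """Backward signal through one ReLU under three different rules."""
    if mode == "backprop":      # gate on the sign of the forward input
        return grad_out * (forward_input > 0)
    if mode == "deconvnet":     # gate on the sign of the backward signal
        return grad_out * (grad_out > 0)
    if mode == "guided":        # gate on both
        return grad_out * (forward_input > 0) * (grad_out > 0)
    raise ValueError(f"unknown mode: {mode}")

x = np.array([1.0, -1.0, 2.0, -3.0])   # inputs the ReLU saw in the forward pass
g = np.array([-2.0, 3.0, 4.0, -1.0])   # signal arriving from the layer above

bp = relu_backward(x, g, "backprop")
dc = relu_backward(x, g, "deconvnet")
gb = relu_backward(x, g, "guided")
```

Only the third entry survives under the guided rule, because it is the only position where the unit was active and the backward signal is positive. In a real framework this rule would typically be installed by overriding the ReLU gradient; how to do that depends on the framework.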

3. CAM: Learning Deep Features for Discriminative Localization

  1. Motivation

    • we found that CNNs actually behave as object detectors despite no supervision on the location
    • this ability is lost when fully-connected layers are used for classification
    • we found that the advantages of global average pooling layers are beyond simply acting as a regularizer
    • it makes it easy to localize the discriminative image regions despite not being trained for them
  2. Arguments

    2.1 Weakly-supervised object localization

    • previous methods are not trained end-to-end and require multiple forward passes
    • Our approach is trained end-to-end and can localize objects in a single forward pass

    2.2 Visualizing CNNs

    • previous methods only analyze the convolutional layers and ignore the fully connected layers, thereby painting an incomplete picture

    • we are able to understand our network from the beginning to the end
  3. Method

    3.1 Class Activation Mapping

    • A class activation map for a particular category indicates the discriminative image regions used by the network to identify that category
    • the network architecture: conv layers -> GAP -> fc + softmax
    • we can identify the importance of the image regions by projecting back the weights of the output layer on to the convolutional feature maps
    • by simply upsampling the class activation map to the size of the input image we can identify the image regions most relevant to the particular category
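A short derivation shows why projecting the output weights back onto the feature maps is meaningful. The symbols are chosen here for illustration: $f_k(x,y)$ is the activation of feature map $k$ at location $(x,y)$, $Z$ is the number of spatial locations, and $w_{k,n}$ is the fc weight from feature map $k$ to class $n$.

$$F_k = \frac{1}{Z}\sum_{x,y} f_k(x,y), \qquad S_n = \sum_k w_{k,n}\, F_k$$

$$M_n(x,y) = \sum_k w_{k,n}\, f_k(x,y) \quad\Longrightarrow\quad S_n = \frac{1}{Z}\sum_{x,y} M_n(x,y)$$

The class score $S_n$ is therefore the spatial average of the map $M_n$, so $M_n(x,y)$ measures how much location $(x,y)$ contributes to class $n$.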

    3.2 Weakly-supervised Object Localization

    • our technique does not adversely impact the classification performance when learning to localize

    • we found that the localization ability of the networks improved when the last convolutional layer before GAP had a higher spatial resolution, so we removed several convolutional layers from the original networks
    • overall, the classification performance of our GAP networks is largely preserved compared with the original fc structure
    • our CAM approach significantly outperforms the backpropagation approach at generating bounding boxes
    • low mapping resolution prevents the network from obtaining accurate localizations
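Turning a class activation map into a bounding box can be sketched as thresholding followed by a connected-component search. The threshold of 20% of the maximum and the choice of the largest connected region follow my reading of the CAM paper and should be treated as assumptions; `cam_to_bbox` is a hypothetical helper that expects a map with at least one positive value.

```python
from collections import deque

import numpy as np

def cam_to_bbox(cam, rel_threshold=0.2):
    """Bounding box (row_min, col_min, row_max, col_max) of the largest
    4-connected region where cam > rel_threshold * cam.max()."""
    mask = cam > rel_threshold * cam.max()
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    best = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        # Breadth-first search over one connected region.
        seen[r0, c0] = True
        queue, region = deque([(r0, c0)]), []
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        if len(region) > len(best):
            best = region
    rows, cols = zip(*best)
    return int(min(rows)), int(min(cols)), int(max(rows)), int(max(cols))

cam = np.zeros((8, 8))
cam[2:5, 3:7] = 1.0     # main response
cam[7, 0] = 0.9         # small isolated response, above the threshold
cam[0, 0] = 0.1         # weak response, below the threshold
box = cam_to_bbox(cam)
```

Because the box is computed on the map's own grid, a coarse map gives a coarse box, which is one way to see why a low mapping resolution limits localization accuracy.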

    3.3 Visualizing Class-Specific Units

    • the convolutional units of various layers of CNNs act as visual concept detectors, identifying low-level concepts like textures or materials, to high-level concepts like objects or scenes

    • Deeper into the network, the units become increasingly discriminative
    • given the fully-connected layers in many networks, it can be difficult to identify the importance of different units for identifying different categories

4. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

5. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

6. Summary

  1. GAP

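    First, a recap of GAP: it was proposed in NiN (Network in Network), mainly to address the problems of fully connected layers, which have too many parameters, are hard to train and overfit easily.

    For most classification tasks, the reduction in features caused by GAP does not hurt model performance. GAP itself is only a linear average, but the C features it outputs summarize the k x k x C activations produced by the nonlinear convolutional layers, so they act as strong, selected features.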
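The saving in parameters is easy to see with concrete numbers. The sizes below (a 7 x 7 x 512 feature map and 1000 classes) are illustrative assumptions, not values from these notes, and `global_average_pool` is a hypothetical helper.

```python
import numpy as np

def global_average_pool(features):
    """features: (k, k, C) -> (C,), one spatial average per channel."""
    return features.mean(axis=(0, 1))

k, C, num_classes = 7, 512, 1000                 # illustrative sizes
features = np.arange(k * k * C, dtype=float).reshape(k, k, C)
pooled = global_average_pool(features)

# Weights of the classifier if the k x k x C map is flattened first ...
fc_on_flatten = k * k * C * num_classes
# ... versus if GAP is applied first.
fc_on_gap = C * num_classes
```

With these sizes the classifier shrinks by a factor of k x k = 49, which is the source of the regularizing effect.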

  2. heatmap

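    step 1. The feature maps that come out of the convolutional network each have a different weight ($w_{k,n}$) in the fully connected classification layer.

    step 2. Use backpropagation to obtain the weight of each feature map.

    step 3. Multiply each feature map by its weight to get weighted feature maps, average over the third (channel) dimension, apply ReLU, then normalize.

    • ReLU keeps only the values where wx is greater than 0: positive responses are the features that are useful for the current class, while negative responses pull down $\sum wx$, i.e. they lower the confidence for the current class
    • without ReLU, the localization map shows not only the features of the one class but features of all classes

    step 4. Resize the heatmap to the size of the original image so that it can be overlaid for display.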
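Steps 3 and 4 can be sketched in NumPy as below, assuming the feature maps and their weights are already available. The function name `heatmap` is hypothetical, and the nearest-neighbour resize is a simplification; an image library's interpolation would normally be used instead.

```python
import numpy as np

def heatmap(feature_maps, weights, out_size):
    """feature_maps: (C, H, W); weights: (C,); out_size: (out_h, out_w)."""
    weighted = feature_maps * weights[:, None, None]   # weight every map
    cam = weighted.mean(axis=0)                        # average over channels
    cam = np.maximum(cam, 0.0)                         # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                          # normalize to [0, 1]
    # Nearest-neighbour resize to the requested size.
    rows = np.arange(out_size[0]) * cam.shape[0] // out_size[0]
    cols = np.arange(out_size[1]) * cam.shape[1] // out_size[1]
    return cam[np.ix_(rows, cols)]

feature_maps = np.array([[[1., 0.], [0., 0.]],
                         [[0., 2.], [0., 0.]],
                         [[0., 0.], [4., 0.]]])
weights = np.array([1.0, 0.5, -1.0])   # the third map counts against the class
hm = heatmap(feature_maps, weights, (4, 4))
```

The third feature map has a negative weight, so its strong response in the lower left is removed by the ReLU instead of showing up in the heatmap.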

  3. CAM

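    CAM requires the network to use a GAP layer.

    CAM takes the output node with the largest value and uses the weights connecting each GAP node to that node as the feature-map weights (this equals the gradient of that class score with respect to the GAP outputs). Each GAP node corresponds to one feature map.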

  4. Grad-CAM

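    Grad-CAM places no restriction on the model structure.

    Grad-CAM backpropagates from the output node with the largest value (the paper uses the class score before softmax), takes the gradient with respect to the feature maps of the last convolutional layer, and uses the spatial mean of each feature map's gradient as the weight of that feature map.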
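On a network that ends in GAP + fc, the Grad-CAM weights reduce to the CAM weights up to a constant factor, which can be checked numerically. The sketch below uses a tiny random "network" (only feature maps, GAP and a linear classifier) and computes the gradient by central differences; all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W, num_classes = 4, 3, 3, 5            # illustrative sizes
A = rng.random((K, H, W))                    # feature maps of the last conv layer
fc = rng.standard_normal((num_classes, K))   # fc weights applied after GAP

def class_scores(feature_maps):
    """GAP followed by the linear classifier (scores before softmax)."""
    return fc @ feature_maps.mean(axis=(1, 2))

c = int(class_scores(A).argmax())            # class with the largest score

# Gradient of the class-c score w.r.t. every feature-map entry.
grad = np.zeros_like(A)
eps = 1e-6
for idx in np.ndindex(A.shape):
    step = np.zeros_like(A)
    step[idx] = eps
    grad[idx] = (class_scores(A + step)[c] - class_scores(A - step)[c]) / (2 * eps)

alpha = grad.mean(axis=(1, 2))               # Grad-CAM weight of each feature map
grad_cam = np.maximum((alpha[:, None, None] * A).sum(axis=0), 0.0)
cam = np.maximum((fc[c][:, None, None] * A).sum(axis=0), 0.0)
```

Here `alpha` equals `fc[c] / (H * W)`, so after normalization the two maps are identical. Grad-CAM behaves like CAM on such networks while remaining applicable when there is no GAP layer.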