pseudo-3d

[3d resnet] Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition: true 3D, for comparison, classification

[C3d] Learning Spatiotemporal Features with 3D Convolutional Networks: true 3D, for comparison, classification

[Pseudo-3D resnet] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks: pseudo-3D, residual blocks, several ways of connecting the spatial (S) and temporal (T) convs, classification

[2.5d Unet] Automatic Segmentation of Vestibular Schwannoma from T2-Weighted MRI by Deep Spatial Attention with Hardness-Weighted Loss: patch input, 2D layers first and 3D layers after, designed for anisotropic resolution, segmentation

[two-pathway U-Net] Combining analysis of multi-parametric MR images into a convolutional neural network: Precise target delineation for vestibular schwannoma treatment planning: patch input, 3D network, separate convs in the xy plane and along z followed by concat, segmentation

[Projection-Based 2.5D U-net] Projection-Based 2.5D U-net Architecture for Fast Volumetric Segmentation: MIP, 2D network, segmentation, reconstruction

[New 2.5D Representation] A New 2.5D Representation for Lymph Node Detection using Random Sets of Deep Convolutional Neural Network Observations: axial, coronal and sagittal planes fed as three input channels, 2D network, detection

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

  1. Motivation

    • goal: learn spatio-temporal representations from video
    • the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand
    • new framework
      • 1x3x3 (spatial) & 3x1x1 (temporal) convs in place of a 3x3x3 conv
      • Pseudo-3D Residual Net which exploits all the variants of blocks
    • outperforms 3D CNN and frame-based 2D CNN
  2. Arguments

    • 3D CNN model size: makes it extremely difficult to train a very deep model
    • fine-tuning a 2D CNN works better than training a 3D CNN from scratch
    • RNN builds only the temporal connections on the high-level features, leaving the correlations in the low-level forms not fully exploited
    • we propose
      • 1x3x3 & 3x1x1 in parallel or cascaded
      • the 1x3x3 spatial conv can be initialized from a pretrained 2D 3x3 conv
      • a family of bottleneck building blocks: enhance the structural diversity
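
The model-size argument can be checked with simple arithmetic. The sketch below counts the weights of one full 3x3x3 conv against the 1x3x3 + 3x1x1 pair; the channel width of 64 and the helper name `conv3d_params` are illustrative assumptions, not values from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a bias-free 3D conv with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw


channels = 64  # hypothetical layer width, chosen only for illustration

full_3d = conv3d_params(channels, channels, 3, 3, 3)   # true 3D conv
spatial = conv3d_params(channels, channels, 1, 3, 3)   # S: 1x3x3
temporal = conv3d_params(channels, channels, 3, 1, 1)  # T: 3x1x1
pseudo_3d = spatial + temporal

# A 2D 3x3 conv has exactly as many weights as the 1x3x3 conv,
# which is why S can be initialized from a 2D model.
conv2d_3x3 = channels * channels * 3 * 3

print(full_3d, pseudo_3d, round(pseudo_3d / full_3d, 3))
```

For equal input and output widths the pair needs 12/27 of the weights of the full 3D kernel, independent of the channel count.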
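  3. Method

    • P3D Blocks

      • direct/indirect influence: whether S and T are cascaded (in series) or run in parallel
      • direct/indirect connection to the final output: whether the outputs of S and T are each added directly to the identity path

      • bottleneck:

        • a 1x1x1 conv is attached at both the head and the tail
        • the head narrows the channels, the tail widens them back
        • the head is followed by ReLU, the tail is not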
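
The two design questions above (series vs. parallel, and which outputs reach the identity sum) give the three block variants, which the paper names P3D-A, P3D-B and P3D-C. A minimal sketch of just the wiring, assuming `S` and `T` are arbitrary callables; the 1x1x1 bottleneck convs, BN and ReLU are deliberately left out, and the toy `S`/`T` below are placeholders rather than real convolutions.

```python
def p3d_a(x, S, T):
    """S and T in series; only T's output is added to the identity path."""
    return x + T(S(x))


def p3d_b(x, S, T):
    """S and T in parallel; both outputs are added to the identity path."""
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    """S feeds T, and S's output is also added to the identity path."""
    s = S(x)
    return x + s + T(s)


# Toy scalar stand-ins so the wiring can be inspected by hand.
S = lambda v: 2 * v
T = lambda v: v + 1

print(p3d_a(1, S, T), p3d_b(1, S, T), p3d_c(1, S, T))
```

With real tensors the same three functions apply unchanged as long as `S` and `T` preserve the tensor shape.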

    • Pseudo-3D ResNet

      • mixing blocks: cycle through P3D-A, P3D-B, P3D-C
      • better performance & small increase in model size

      • fine-tuning resnet50:

        • randomly cropped 224x224
        • freeze all BN except for the first one
        • add an extra dropout layer with 0.9 dropout rate
      • further fine-tuning P3D ResNet:
        • initialize from the ResNet-50 fine-tuned in the previous step
        • randomly cropped 16x160x160
        • horizontally flipped
        • mini-batch of 128 frames
    • future work

      • attention mechanism will be incorporated

Projection-Based 2.5D U-net Architecture for Fast Volumetric Segmentation

  1. Motivation

    • MIP (maximum intensity projection): 2D images containing information of the full 3D image

    • faster, less memory, accurate
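
As a reminder of what a MIP does, here is a minimal pure-Python sketch that projects a tiny volume along each coordinate axis. It only covers axis-aligned directions; projections at other angles would first need the volume to be rotated and resampled, which is omitted here. The function name `mip` and the `volume[z][y][x]` layout are assumptions made for this example.

```python
def mip(volume, axis):
    """Maximum intensity projection of a nested-list volume[z][y][x]."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # collapse z, result indexed [y][x]
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == 1:  # collapse y, result indexed [z][x]
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == 2:  # collapse x, result indexed [z][y]
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError("axis must be 0, 1 or 2")


volume = [
    [[0, 1], [2, 0]],  # z = 0
    [[5, 0], [0, 3]],  # z = 1
]

print(mip(volume, 0))  # brightest voxel along z for every (y, x)
```

Each projection keeps only the brightest voxel along its ray, so a single 2D image cannot tell where along the ray that voxel sat; several directions are needed to recover the 3D structure.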

  2. Method

    • 2D U-Net

      • MIP: $\alpha=36$
      • 3x3 conv, stride-2 pooling, transposed conv, concat, BN, ReLU
      • filters: begin with 32, end with 512
      • dropout: 0.5 in the deepest convolutional block and 0.2 in the second-deepest blocks

    • 3D U-Net

      • issues: overfitting & memory consumption
      • filters: begin with 4, end with 16
      • dropout: 0.5 in the deepest convolutional block and 0.4 in the second-deepest blocks
    • Projection-Based 2.5D U-net

      • 2D slices: the connection between slices is lost

      • 2D MIP: disappointing results

      • 3D volume: long training time

      • the proposed 2.5D U-net:

        • $M_{i}$: MIP, $p=12$

        • $U$: 2D U-Net as above

        • $F_p$: learnable filtration, a 1x3 conv for each projection, suppresses reconstruction artifacts

        • $R_p$: reconstruction operator

        • $T$: fine-tuning operator, shift & scale back to a 0-1 mask

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

  1. Motivation

    • 3D kernels tend to overfit
    • 3D CNNs are relatively shallow
    • propose 3D CNNs based on ResNets
      • better performance
      • does not overfit
      • deeper than C3D
  2. Arguments

    • two-stream architecture: consisting of RGB and optical flow streams, it is often used to represent spatio-temporal information
    • 3D CNNs: when trained on relatively small video datasets, they perform worse than 2D CNNs pretrained on large datasets
    • very deep 3D CNNs: not explored yet due to training difficulty
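  3. Method

    • Network Architecture

      • main difference from 2D ResNets: kernel dimensions (3D instead of 2D)
      • stem: stride 2 in the spatial dims (S), stride 1 in the temporal dim (T)
      • residual block: conv-BN-ReLU followed by a second conv, plus the identity shortcut
      • identity shortcuts: use zero-padding for increasing dimensions, to avoid increasing the number of parameters
      • stride-2 conv: conv3_1, conv4_1, conv5_1
      • input clips: 3x16x112x112 (channels x frames x height x width)
      • a large learning rate and a large batch size were important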
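
To see what the strides listed above do to a 16x112x112 clip, the sketch below traces the feature-map size stage by stage with the standard conv output formula. Several details are assumptions that are not in these notes: a 7x7x7 stem kernel with padding 3, a 3x3x3 stride-2 max pool after the stem (as in the usual 2D ResNet layout), and 3x3x3 kernels with padding 1 elsewhere.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_3d_resnet(frames=16, side=112):
    """Return (stage, T, H, W) tuples for the assumed layout."""
    t, s = frames, side
    shapes = []

    # Stem: stride 1 in time, stride 2 in space (kernel 7 is an assumption).
    t, s = conv_out(t, 7, 1, 3), conv_out(s, 7, 2, 3)
    shapes.append(("conv1", t, s, s))

    # Max pool with stride 2 on all axes (assumed, not stated in the notes).
    t, s = conv_out(t, 3, 2, 1), conv_out(s, 3, 2, 1)
    shapes.append(("pool1", t, s, s))

    # conv2_x keeps the resolution.
    shapes.append(("conv2_x", t, s, s))

    # conv3_1, conv4_1 and conv5_1 each downsample with stride 2.
    for name in ("conv3_x", "conv4_x", "conv5_x"):
        t, s = conv_out(t, 3, 2, 1), conv_out(s, 3, 2, 1)
        shapes.append((name, t, s, s))
    return shapes


for stage in trace_3d_resnet():
    print(stage)
```

Under these assumptions the temporal axis shrinks to a single step by conv5_x, so a 16-frame clip is about the shortest input this layout can take.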

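  4. Experiments

    • on small datasets, 3D ResNet-18 underperforms C3D because it overfits: the shallow architecture of C3D and pretraining on the Sports-1M dataset prevent C3D from overfitting
    • on the large dataset, 3D ResNet-34 outperforms C3D; C3D's validation accuracy is clearly higher than its training accuracy, which suggests it is too shallow and underfits, whereas ResNet-34 performs better and needs no pretraining
    • RGB-I3D achieved the best performance
      • 3D ResNet-34 is the deeper network
      • RGB-I3D used a larger batch size: large batch size is important to train good models with batch normalization
      • higher input resolution: 3x64x224x224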

Learning Spatiotemporal Features with 3D Convolutional Networks

  1. Motivation

    • generic
    • efficient
    • simple
    • 3D ConvNet with 3x3x3 conv & a simple linear classifier
  2. Arguments

    • 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets
    • 2D ConvNets lose temporal information of the input signal right after every convolution operation
    • a 2D conv treats stacked frames as input channels and sums over them in one step, so the time axis is collapsed and no temporal features are extracted after that layer
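
The difference between the two operations is easiest to see at a single pixel location, where spatial extent can be ignored. In the sketch below (an illustration, with made-up weights and frame values) the 2D conv over stacked frames returns one number, while the 3D conv slides a short kernel along time and returns one value per frame.

```python
def conv2d_over_frames(frames, weights):
    """2D conv with frames stacked as channels, at one pixel:
    a single weighted sum over all frames, so time disappears."""
    return sum(w * f for w, f in zip(weights, frames))


def conv3d_temporal(frames, kernel):
    """3D conv along time at one pixel: stride 1, zero padding,
    so the output has the same number of time steps as the input."""
    pad = len(kernel) // 2
    padded = [0] * pad + list(frames) + [0] * pad
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(frames))
    ]


frames = [1, 2, 3, 4]  # intensity of one pixel over four frames

collapsed = conv2d_over_frames(frames, [1, 1, 1, 1])
preserved = conv3d_temporal(frames, [1, 0, -1])

print(collapsed)  # one scalar
print(preserved)  # one value per time step
```

Because the 3D output still has a time axis, later layers can keep modelling motion; after the 2D conv there is nothing temporal left to model.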

  3. Method

    • basic network settings

      • 5 conv layers + 5 pooling layers + 2 fc layers + softmax
      • filters: [64,128,256,256,256]
      • fc dims: [2048,2048]
      • conv kernel: dx3x3, where d is the temporal kernel depth
      • pooling kernel: 2x2x2 with stride 2, except for the first layer
        • the intention is not to merge the temporal signal too early
        • also to satisfy the clip length of 16 frames
    • varying settings

      • temporal kernel depth
        • homogeneous: depth-1/3/5/7 throughout
        • varying: increasing (3-3-5-5-7) & decreasing (7-5-5-3-3)
      • depth-3 throughout performs the best

      • depth-1 is significantly worse

      • We also verify that 3D ConvNet consistently performs better than 2D ConvNet on a large-scale internal dataset
    • C3D

      • 8 conv layers + 5 pooling layers + 2 fc layers + softmax
      • homogeneous: 3x3x3 stride-1 conv throughout
      • pool1: 1x2x2 kernel size & stride, the rest 2x2x2
      • fc dims: 4096
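
The 1x2x2 first pooling layer is what makes five pooling stages fit a 16-frame clip. A small sketch of the temporal axis only, assuming non-overlapping pooling (stride equal to kernel) and stride-1 padded convs that leave the frame count unchanged; the helper name `temporal_lengths` is made up for this example.

```python
def temporal_lengths(frames, temporal_kernels):
    """Frame count after each pooling layer (stride == kernel, floor)."""
    lengths = [frames]
    for k in temporal_kernels:
        lengths.append(lengths[-1] // k)
    return lengths


# C3D: pool1 is 1x2x2, pool2..pool5 are 2x2x2.
c3d = temporal_lengths(16, [1, 2, 2, 2, 2])

# Hypothetical variant where pool1 is also 2x2x2.
all_222 = temporal_lengths(16, [2, 2, 2, 2, 2])

print(c3d)
print(all_222)
```

With pool1 at 1x2x2 the clip ends at exactly one temporal step after pool5; halving at every stage would run out of frames one layer too early.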

    • C3D video descriptor: fc6 activations + L2-norm

    • deconvolution visualization:

      • conv5b feature maps
      • starts by focusing on appearance in the first few frames
      • tracks the salient motion in the subsequent frames
    • compactness

      • PCA
      • compressing to 50-100 dims loses little accuracy
      • compressed to 10 dims, C3D still has the highest accuracy

      • projected to 2-dimensional space using t-SNE

        • C3D features are semantically separable compared to Imagenet
        • quantitatively observe that C3D is better than Imagenet

  4. Action Similarity Labeling

    • predicting action similarity
    • extract C3D features: prob, fc7, fc6, pool5 for each clip
    • L2 normalization
    • compute the 12 different distances for each feature: 48 in total (4 features x 12 distances)
    • linear SVM is trained on these 48-dim feature vectors
    • C3D significantly outperforms the others
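
The pair descriptor described above can be sketched as follows. These notes do not list the 12 distances, so the three used here (Euclidean, Manhattan, cosine) are only illustrative stand-ins, and the function names are hypothetical; with the full set of 12 the same loop would yield the 48-dim vector that is passed to the linear SVM.

```python
import math

FEATURES = ("prob", "fc7", "fc6", "pool5")


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both inputs are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


# Illustrative subset, NOT the 12 distances used in the benchmark.
DISTANCES = (euclidean, manhattan, cosine_distance)


def pair_vector(feats_a, feats_b):
    """Concatenate every distance for every feature type of a clip pair."""
    out = []
    for name in FEATURES:
        a = l2_normalize(feats_a[name])
        b = l2_normalize(feats_b[name])
        out.extend(dist(a, b) for dist in DISTANCES)
    return out


clip_a = {name: [3.0, 4.0] for name in FEATURES}
clip_b = {name: [4.0, 3.0] for name in FEATURES}

vec = pair_vector(clip_a, clip_b)
print(len(vec))  # len(FEATURES) * len(DISTANCES)
```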