[3D ResNet] Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition: true 3D, for comparison, classification
[C3D] Learning Spatiotemporal Features with 3D Convolutional Networks: true 3D, for comparison, classification
[Pseudo-3D ResNet] Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks: pseudo-3D, residual blocks, various ways of wiring S and T, classification
[2.5D U-Net] Automatic Segmentation of Vestibular Schwannoma from T2-Weighted MRI by Deep Spatial Attention with Hardness-Weighted Loss: patch input, 2D first then 3D, targets anisotropic volumes, segmentation
[two-pathway U-Net] Combining analysis of multi-parametric MR images into a convolutional neural network: Precise target delineation for vestibular schwannoma treatment planning: patch input, 3D network, separate convs on the xy and z planes then concat, segmentation
[Projection-Based 2.5D U-net] Projection-Based 2.5D U-net Architecture for Fast Volumetric Segmentation: MIP, 2D network, segmentation, reconstruction
[New 2.5D Representation] A New 2.5D Representation for Lymph Node Detection using Random Sets of Deep Convolutional Neural Network Observations: axial/coronal/sagittal planes as three input channels, 2D network, detection
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
Motivation
- learning spatio-temporal representations of videos
- the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand
- new framework: decompose 3x3x3 convs into 1x3x3 (spatial) & 3x1x1 (temporal)
- Pseudo-3D Residual Net, which exploits all the variants of blocks
- outperforms 3D CNN and frame-based 2D CNN
Arguments
- 3D CNN model size: making it extremely difficult to train a very deep model
- fine-tuning a 2D network beats training a 3D network from scratch
- RNNs build only the temporal connections on high-level features, leaving the correlations in low-level forms not fully exploited
- we propose
- 1x3x3 & 3x1x1, in parallel or cascaded
- the 3x3 spatial convs can be initialized from pretrained 2D convs
- a family of bottleneck building blocks: enhances structural diversity
Method
P3D Blocks
- direct/indirect influence: whether S and T are cascaded or run in parallel
- direct/indirect connection to the final output: whether the outputs of both S and T are added directly onto the identity path
bottleneck:
- a 1x1x1 conv at both the head and the tail
- the head narrows the channel dimension, the tail widens it back
- the head is followed by a ReLU, the tail is not (see the sketch below)
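A minimal PyTorch sketch of the cascaded P3D-A variant with this bottleneck; channel widths are illustrative and BatchNorm is omitted for brevity, so this is not the paper's exact layer list:

```python
import torch
import torch.nn as nn

class P3DBlockA(nn.Module):
    """P3D-A bottleneck: S and T cascaded, T's output feeds the residual path."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        # head 1x1x1 conv narrows the channel dimension (followed by ReLU)
        self.reduce = nn.Conv3d(channels, bottleneck_channels, kernel_size=1)
        # S: 1x3x3 spatial conv -- initializable from a pretrained 2D 3x3 conv
        self.spatial = nn.Conv3d(bottleneck_channels, bottleneck_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # T: 3x1x1 temporal conv
        self.temporal = nn.Conv3d(bottleneck_channels, bottleneck_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # tail 1x1x1 conv widens back (no ReLU before the residual addition)
        self.expand = nn.Conv3d(bottleneck_channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: (N, C, T, H, W)
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))   # cascaded: T directly follows S
        out = self.relu(self.temporal(out))
        out = self.expand(out)
        return self.relu(out + x)            # identity shortcut
```

For the other variants, P3D-B runs `spatial` and `temporal` in parallel on the reduced input and sums their outputs, while P3D-C returns `spatial(x) + temporal(spatial(x))` before the tail conv.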
Pseudo-3D ResNet
- mixing blocks: cycle through the A, B, C variants
better performance & only a small increase in model size
fine-tuning ResNet-50:
- randomly cropped to 224x224
- freeze all BN layers except the first one (see the sketch after this list)
- add an extra dropout layer with a 0.9 dropout rate
- further fine-tuning the P3D ResNet:
- initialize with the ResNet-50 from the previous step
- randomly cropped to 16x160x160
- horizontally flipped
- mini-batch of 128 frames
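One plausible way to express "freeze all BN except the first" in PyTorch; the traversal order and the eval-mode trick are assumptions about the recipe, not code from the paper:

```python
import torch.nn as nn

def freeze_bn_except_first(model: nn.Module):
    """Freeze every BatchNorm layer except the first one encountered."""
    first_seen = False
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm2d, nn.BatchNorm3d)):
            if not first_seen:
                first_seen = True            # keep the first BN trainable
                continue
            m.eval()                         # stop running-stat updates
            for p in m.parameters():
                p.requires_grad = False      # freeze the affine parameters
```

Note that `model.train()` puts the frozen BN layers back into training mode, so the function would need to be reapplied after every such call.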
future work
- an attention mechanism will be incorporated
Projection-Based 2.5D U-net Architecture for Fast Volumetric Segmentation
Motivation
MIP (maximum intensity projection): 2D images containing information of the full 3D image
faster, less memory, still accurate
Method
2D U-Net
- MIP: $\alpha=36$
- 3x3 convs, stride-2 pooling, transposed convs, concat, BN, ReLU
- filters: start at 32, end at 512
- dropout: 0.5 in the deepest convolutional block and 0.2 in the second-deepest blocks
3D U-Net
- overfitting & memory constraints
- filters: start at 4, end at 16
- dropout: 0.5 in the deepest convolutional block and 0.4 in the second-deepest blocks
Projection-Based 2.5D U-net
- 2D slices: inter-slice connections are lost
- 2D MIPs alone: disappointing results
- 3D volumes: long training time
the proposed 2.5D U-net (a sketch of the full chain follows this list):
- $M_{i}$: MIP, p=12
- $U$: a 2D U-Net as above
- $F_p$: learnable filtration, one 1x3 conv per projection, suppresses reconstruction artifacts
- $R_p$: reconstruction operator
- $T$: fine-tuning operator, shifts & scales back to a 0-1 mask
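A rough sketch of how the chain $M_i \to U \to F_p \to R_p \to T$ might compose, assuming SciPy rotations and an already-trained 2D U-Net `unet`; the in-plane rotation axis, the mean fusion inside $R_p$, and the min-max scaling plus threshold standing in for $T$ are assumptions, not the paper's exact operators:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.ndimage import rotate

P = 12  # number of projection angles (the p=12 above)

def mip(volume, angle_deg):
    """M_i: rotate the volume in-plane, then take the maximum intensity projection."""
    rotated = rotate(volume, angle_deg, axes=(1, 2), reshape=False, order=1)
    return rotated.max(axis=1)                       # collapse one spatial axis

def back_project(seg2d, depth, angle_deg):
    """R_p (one view): smear a 2D map back along the projection axis, undo the rotation."""
    smeared = np.repeat(seg2d[:, None, :], depth, axis=1)
    return rotate(smeared, -angle_deg, axes=(1, 2), reshape=False, order=1)

class Filtration(nn.Module):
    """F_p: one learnable 1x3 conv per projection to suppress reconstruction artifacts."""
    def __init__(self, num_projections):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(1, 1, (1, 3), padding=(0, 1))
                                   for _ in range(num_projections))

    def forward(self, seg_maps):                     # list of (H, W) tensors
        return [c(m[None, None])[0, 0] for c, m in zip(self.convs, seg_maps)]

def segment(volume, unet, filtration):
    angles = [i * 360.0 / P for i in range(P)]
    projections = [mip(volume, a) for a in angles]                      # M_i
    seg_maps = [unet(torch.as_tensor(p[None, None]).float())[0, 0]      # U
                for p in projections]
    filtered = filtration(seg_maps)                                     # F_p
    recon = [back_project(f.detach().numpy(), volume.shape[1], a)       # R_p
             for f, a in zip(filtered, angles)]
    fused = np.mean(recon, axis=0)
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)  # T
    return fused > 0.5
```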
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition
Motivation
- 3D kernels tend to overfit
- existing 3D CNNs are relatively shallow
- propose 3D CNNs based on ResNets
- better performance
- do not overfit
- deeper than C3D
Arguments
- two-stream architecture: RGB and optical-flow streams, often used to represent spatio-temporal information
- 3D CNNs trained on relatively small video datasets perform worse than 2D CNNs pretrained on large datasets
- very deep 3D CNNs: not explored yet due to training difficulty
Method
Network Architecture
- main difference: kernel dimensions
- stem: stride 2 spatially, stride 1 temporally
- residual block: conv-BN-ReLU, conv-BN, then add the identity (see the sketch after this list)
- identity shortcuts: use zero-padding for increasing dimensions, to avoid increasing the number of parameters
- stride-2 convs: conv3_1, conv4_1, conv5_1
- input clips: 3x16x112x112
- a large learning rate and batch size were important
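A sketch of this basic block in PyTorch with the parameter-free zero-padded shortcut; downsampling the shortcut with a stride-matched average pool is an assumed detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock3D(nn.Module):
    """3D ResNet basic block: conv-BN-ReLU, conv-BN, plus a zero-padded identity."""
    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(planes)
        self.stride = stride
        self.pad = planes - in_planes        # extra channels for the shortcut

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        shortcut = x
        if self.stride != 1:                 # match the downsampled size
            shortcut = F.avg_pool3d(shortcut, 1, stride=self.stride)
        if self.pad > 0:                     # zero-pad channels: no new parameters
            shortcut = F.pad(shortcut, (0, 0, 0, 0, 0, 0, 0, self.pad))
        return F.relu(out + shortcut)
```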
Experiments
- on small datasets, 3D-ResNet-18 is worse than C3D and overfits: the shallow architecture of C3D and its pretraining on the Sports-1M dataset prevent C3D from overfitting
- on large datasets, 3D-ResNet-34 beats C3D, and C3D's val acc is clearly higher than its train acc, i.e., it is too shallow and underfits; ResNet-34 performs better there, without needing pretraining
- RGB-I3D achieved the best performance
- even though 3D-ResNet-34 is the deeper model
- RGB-I3D used a larger batch size: a large batch size is important to train good models with batch normalization
- and higher resolution: 3x64x224x224
Learning Spatiotemporal Features with 3D Convolutional Networks
Motivation
- generic
- efficient
- simple
- 3d ConvNet with 3x3x3 conv & a simple linear classifier
Arguments
- 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets
- 2D ConvNets lose temporal information of the input signal right after every convolution operation
when frames are stacked as input channels, a 2D conv collapses the whole channel (temporal) axis into one weighted sum, so after a single convolution there is no temporal dimension left to extract features from (see the shape check below)
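A quick PyTorch shape check of this claim: folding frames into channels leaves a 2D conv nothing temporal to preserve, while a 3D conv keeps the T axis.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 112, 112)   # (N, C, T, H, W): a 16-frame RGB clip

# 2D conv: fold time into channels -> the temporal axis is summed out immediately
x2d = frames.reshape(1, 3 * 16, 112, 112)
conv2d = nn.Conv2d(48, 64, kernel_size=3, padding=1)
print(conv2d(x2d).shape)     # torch.Size([1, 64, 112, 112]) -- no T axis left

# 3D conv: the temporal axis survives, so later layers can still model motion
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(frames).shape)  # torch.Size([1, 64, 16, 112, 112])
```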
Method
basic network settings
- 5 conv layers + 5 pooling layers + 2 fc layers + softmax
- filters:[64,128,256,256,256]
- fc dims:[2048,2048]
- conv kernels: dx3x3, where d is the temporal depth being varied
- pooling kernels: 2x2x2 with stride 2, except for the first layer (1x2x2)
- with the intention of not merging the temporal signal too early
- and to accommodate the clip length of 16 frames
varying settings
- temporal kernel depth
- homogeneous: depth 1/3/5/7 throughout
- varying: increasing 3-3-5-5-7 & decreasing 7-5-5-3-3
depth-3 throughout performs the best
depth-1 is significantly worse
- 3D ConvNets are also verified to consistently perform better than 2D ConvNets on a large-scale internal dataset
C3D
- 8 conv layers + 5 pooling layers + 2 fc layers + softmax
- homogeneous: 3x3x3 stride-1 convs throughout
- pool1: 1x2x2 kernel size & stride, the rest 2x2x2
fc dims: 4096 each
C3D video descriptor: fc6 activations + L2-norm (a sketch of the network follows)
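A minimal PyTorch sketch consistent with the numbers above; the pool5 padding and the resulting 8192-dim fc6 input follow common C3D ports, and the 487-way head (Sports-1M) is an assumption:

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """C3D sketch: 8 convs + 5 pools + 2 fc layers + softmax head."""
    def __init__(self, num_classes=487):
        super().__init__()
        def conv(cin, cout):  # homogeneous 3x3x3 stride-1 conv + ReLU
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 64),
            nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)),  # pool1: keep T early
            conv(64, 128),
            nn.MaxPool3d(2, stride=2),                  # pool2..pool5: 2x2x2
            conv(128, 256), conv(256, 256),
            nn.MaxPool3d(2, stride=2),
            conv(256, 512), conv(512, 512),
            nn.MaxPool3d(2, stride=2),
            conv(512, 512), conv(512, 512),
            nn.MaxPool3d(2, stride=2, padding=(0, 1, 1)),
        )
        self.fc6 = nn.Linear(8192, 4096)   # L2-normalized fc6 = video descriptor
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (N, 3, 16, 112, 112)
        x = self.features(x).flatten(1)
        fc6 = self.relu(self.fc6(x))
        return self.fc8(self.relu(self.fc7(fc6)))
```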
deconvolution visualization:
- conv5b feature maps
- starts by focusing on appearance in the first few frames
- tracks the salient motion in the subsequent frames
compactness
- PCA
- compressing to 50-100 dims costs little accuracy
- even at 10 dims C3D still has the highest accuracy (a stand-in sketch follows)
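The compactness test reduces the 4096-dim fc6 descriptors with PCA and re-evaluates a linear classifier at each dimensionality; a stand-in sketch with random data in place of real descriptors:

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in for (num_clips, 4096) fc6 descriptors
features = np.random.randn(1000, 4096).astype(np.float32)

for dim in (10, 50, 100, 500):
    compact = PCA(n_components=dim).fit_transform(features)
    # train/evaluate the usual linear classifier on `compact` for each dim
    print(dim, compact.shape)
```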
projected to a 2-dimensional space using t-SNE
- C3D features are semantically separable compared to Imagenet features
qualitatively observe that C3D is better than Imagenet
Action Similarity Labeling
- predicting action similarity
- extract C3D features: prob, fc7, fc6, pool5 for each clip
- L2 normalization
- compute 12 different distances for each feature: 48 dims in total
- a linear SVM is trained on these 48-dim feature vectors
C3D significantly outperforms the others (a sketch of the pairwise pipeline follows)
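A sketch of that pairwise pipeline with scikit-learn; the paper's 12 ASLAN benchmark distances are replaced here by four common stand-ins (so the vectors are 16-dim rather than 48-dim):

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, cosine, euclidean
from sklearn.svm import LinearSVC

DISTANCES = [cosine, euclidean, cityblock, chebyshev]  # stand-ins for the 12

def l2norm(v):
    return v / (np.linalg.norm(v) + 1e-8)

def pair_vector(feats_a, feats_b):
    """Distances between two clips over the four features (prob, fc7, fc6, pool5)."""
    return np.array([d(l2norm(a), l2norm(b))
                     for a, b in zip(feats_a, feats_b)   # 4 features
                     for d in DISTANCES])                # x 4 distances

def train(pairs):
    """pairs: list of ((feats_a, feats_b), same_action_label) tuples."""
    X = np.stack([pair_vector(a, b) for (a, b), _ in pairs])
    y = np.array([label for _, label in pairs])
    return LinearSVC().fit(X, y)
```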