Less is More



SparseInst

发表于 2022-10-15 |

主题:query-based instance segmentation

Sparse Instance Activation for Real-Time Instance Segmentation

  1. 动机

    • previous instance segmentation methods rely heavily on detection results
    • this paper
      • use a sparse set of instance activation maps:稀疏set作为前景目标的ROI
      • aggregate the overall mask feature & instance-level feature
      • avoid NMS
    • 性能 & 精度
      • 40 FPS
      • 37.9AP on COCO
  2. 论点

    • 两阶段methods的limitations

      • dense anchors make redundant proposals伴随了heavy burden of computation
      • multi-scale further aggravate the issue
      • ROI-Align没法部署在终端/嵌入式设备上
      • 后处理排序/NMS也耗时
    • this paper

      • propose a sparse set called Instance Activation Maps(IAM)
        • motivated by CAM
        • 通过instance-aware activation maps对全局特征加权就可以获得instance-level feature
      • label assignment
        • use DETR's bipartite matching
        • avoid NMS
    • object representation对比

  3. 方法

    • overview
      • backbone:拿到x8/x16/x32的C3/C4/C5特征
      • encoder:feature pyramid,bottleneck用了PPM扩大感受野,然后bottom-up FPN,然后merge multi-scale feature to achieve single-level prediction
      • IAM-based decoder:
        • mask branch:provide mask features $M: D \times H \times W$
        • instance branch:provide instance activation maps $k: N\times H \times W$ to further acquire recognition kernels $z: N\times D$
    • IAM-based decoder
      • 首先两个分支都包含了 a stack of convs
        • 3x3 conv
        • channel = 256
        • stack=4
      • location-sensitive features
        • $H \times W \times 2$的归一化坐标值
        • concat到decoder的输入上
        • enhance instance的representation
      • instance activation maps
        • IAM用3x3conv+sigmoid:region proposal map
        • 进一步地,还有用GIAM,换成group=4的3x3conv,multiple activations per object
        • 点乘在输入上,得到instance feature,$N\times D$的vector
        • 然后是三个linear layer,分别得到classification/objectness score/mask kernel
      • IOU-aware objectness
        • 拟合pred mask和gt mask的IoU,用于评价分割模型的好坏
        • 主要因为proposal大多是负样本,这种不均衡分布会拉低cls分支的精度,导致cls score和mask分布存在misalignment
        • inference阶段,最终使用的fg prob是$p=\sqrt{cls\_prob \cdot obj\_score}$
      • mask head
        • given instance features $w_i \in R^{1\times D}$ and mask features $M\in R^{D \times H \times W}$
        • 直接用矩阵乘法:$m_i = w_i \cdot M$
        • 最后upsample到原始resolution
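      • 下面是一个IAM-based decoder前向过程的极简sketch,仅作示意:类名、维度与超参均为假设,省略了conv stack、坐标拼接、GIAM分组卷积等细节

        import torch
        import torch.nn as nn

        class NaiveIAMHead(nn.Module):
            # 假设instance branch和mask branch的特征分辨率一致
            def __init__(self, dim=256, num_classes=80, num_proposals=100, mask_dim=128):
                super().__init__()
                self.iam_conv = nn.Conv2d(dim, num_proposals, 3, padding=1)   # IAM:3x3 conv -> N张activation map
                self.cls_fc = nn.Linear(dim, num_classes)                     # classification
                self.obj_fc = nn.Linear(dim, 1)                               # IoU-aware objectness
                self.kernel_fc = nn.Linear(dim, mask_dim)                     # mask kernel

            def forward(self, inst_feat, mask_feat):
                # inst_feat: [B, D, H, W] instance branch特征;mask_feat: [B, Dm, H, W] mask branch特征
                B, D, H, W = inst_feat.shape
                iam = self.iam_conv(inst_feat).sigmoid().flatten(2)           # [B, N, HW]
                iam = iam / (iam.sum(dim=-1, keepdim=True) + 1e-6)            # 归一化成加权平均
                feat = torch.bmm(iam, inst_feat.flatten(2).transpose(1, 2))   # [B, N, D] instance features
                kernels = self.kernel_fc(feat)                                # [B, N, Dm]
                masks = torch.bmm(kernels, mask_feat.flatten(2)).view(B, -1, H, W)   # m_i = w_i . M
                # inference时前景分数 p = sqrt(cls_prob * obj_score)
                scores = (self.cls_fc(feat).sigmoid() * self.obj_fc(feat).sigmoid()).sqrt()
                return masks, scores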
    • matching loss
      • bipartite matching
        • dice-based cost
        • $C(i,k)=p_i[cls_k]^{1-\alpha} * dice(m_i,gt_k)^\alpha$
        • $\alpha = 0.8$
        • dice用原始形式:$dice(x,y)=\frac{2 \sum x*y}{\sum x^2 + \sum y^2}$
        • Hungarian algorithm:scipy(匹配过程见下方sketch)
      • weighted sum of
        • loss cls:focal loss
        • loss obj:bce
        • loss mask:bce + dice
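      • dice-based matching cost与匈牙利匹配的简单sketch(单张图,变量名为假设,alpha=0.8同上):

        import torch
        from scipy.optimize import linear_sum_assignment

        def dice_coef(pred, gt, eps=1e-6):
            # pred: [N, HW](sigmoid后),gt: [K, HW](0/1),返回[N, K]的dice矩阵
            inter = torch.einsum('nc,kc->nk', pred, gt)
            union = (pred ** 2).sum(-1)[:, None] + (gt ** 2).sum(-1)[None, :]
            return 2 * inter / (union + eps)

        def hungarian_match(cls_prob, pred_masks, gt_labels, gt_masks, alpha=0.8):
            # cls_prob: [N, num_classes],pred_masks: [N, HW],gt_labels: [K],gt_masks: [K, HW]
            cost = cls_prob[:, gt_labels] ** (1 - alpha) * dice_coef(pred_masks, gt_masks) ** alpha
            row, col = linear_sum_assignment((-cost).detach().cpu().numpy())   # cost越大越好,取负转为最小化
            return row, col   # 匹配上的 (prediction, gt) 下标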
  4. 实验

    • dataset

      • COCO 2017:118k train / 5k valid / 20k test
    • metric

      • AP:mask的average precision
      • FPS:frames per second,在Nvidia 2080 Ti上,没有使用TensorRT/FP16加速
    • training details

      • 8卡训练,总共64 images per-mini-batch
      • AdamW:lr=5e-5,wd=1e-4
      • train 270k iterations
      • learning rate:divided by 10 at 210k & 250k
      • backbone:用了ImageNet的预训练权重,同时frozenBN
      • data augmentation:random flip/scale/jitter,shorter side random from [416,640],longer side<=864
      • test/eval:use shorter size 640
      • loss weights:cls weight=2,dice weight=2,pixel bce weight=2,obj weight=1,后面实验发现提高pixel bce weight到5会有些精度gain
      • proposals:N=100
    • results

      • backbone主要是ResNet50

        • ResNet-d:bag-of-tricks paper里面的一个变种,resnet在stage起始阶段进行下采样,变种将下采样放在block里面,residual path用stride2的3x3 conv,identity path用avg pooling + 1x1 conv

        • ResNet-DCN:参考的是deformable conv net v2,将最后一个卷积替换成deformable conv

      • 在数据处理上,增加了random crop,以及large weight decay(0.05),为了和竞品对齐

      • ablation on coords / dconv

      • ablation on IAM:kernel size / n convs / activations / group conv

未命名

发表于 2022-08-15 |

[RVM 2021] Robust High-Resolution Video Matting with Temporal Guidance:字节,temporal(ConvGRU),multi-task

[SparseInst 2022] Sparse Instance Activation for Real-Time Instance Segmentation:自动化所,overview有点像DETR

Robust High-Resolution Video Matting with Temporal Guidance

  1. 动机
    • human video matting
      • 用于背景替换
      • 现有技术不稳定,会产生artifacts
    • performance
      • robust
      • real-time
        • 4K at 76 FPS and HD at 104 FPS on Nvidia GTX 1080Ti
      • high-resolution
    • use recurrent structure instead of frame by frame:用时序网络,分割质量更好
    • propose a novel training strategy:同时进行matting和segmentation两个任务,模型更鲁棒
  2. 论点
    • matting formulation recollect
      • $I = \alpha F + (1-\alpha)B$
    • matting methods
      • Trimap-based matting:最classical的,需要额外的先验,且通常不分类,只做前景语义分割
      • Background-based matting:不需要先验trimap了,改要先验的background map
      • Segmentation:就是binary的语义分割,人像前景效果还好,但是背景容易出现各种artifacts,比较不稳定
      • Auxiliary-free matting:不需要额外输入的架构,MODNet更关注肖像,this paper更关注目标人
      • Video matting:
        • MODNet用相邻帧图像的预测结果来相互压制伪影,本质上仍旧是逐帧独立(image-based)的处理
        • BGM用多帧图像作为多通道
      • Recurrent architecture:ConvLSTM/ConvGRU
      • High-resolution matting
        • Patch-based refinement:图像尺寸减小,以获取high resolution task的算力,但是
        • Deep guided filter:trainable,模块化,end-to-end将low-reso转换成high-reso
    • use temporal structure
      • temporal information boosts both quality and robustness
      • 这种overtime的背景变换使得模型对背景信息的学习更加鲁棒和精确
    • introduce a new training strategy
      • 大多数matting数据集都是合成的,包括在数据处理阶段也会做这种前景贴背景的操作,扩充样本量,这种图像太fake了,和实际场景有domain gap,泛化性差
      • 也有方法尝试先在segmentation任务上做预训练、用真实图像做对抗等方式去解决图像假的问题,这样的缺点是multi step
      • 同时训练matting & segmentation任务就一步到位了,没有额外的adaptation steps
  3. 方法
    • model architecture overview
      • encoder:编码individual frame’s features,mobileNetv3/resnet50
      • recurrent decoder:aggregates temporal information
      • a Deep Guided Filter module:high-resolution upsampling
    • Feature-Extraction Encoder
      • MobileNetV3-Large + LR-ASPP module
      • 最后一个block使用了空洞卷积
    • Recurrent Decoder
      • ConvGRU at multiple scales(ConvGRU cell的简单sketch见本小节末尾)
      • bottleneck block:x16 level上
        • 在LR-ASPP之后
        • 然后ConvGRU,with id path(split,一半通道用于id,一半通道用于GRU)
        • 然后bilinear 2x
      • Upsampling block:x8/x4/x2 level上
        • 每个resolution stage
        • 先merge(concat)前一个stage的feature
        • 然后avg pooling,conv-bn-relu,transfer the feature
        • 然后ConvGRU,with id path
        • 然后bilinear 2x
      • Output block:x1 level上
        • 去做一个final prediction
        • 先merge
        • 然后【conv3x3-bn-relu】x2
        • 然后conv1x1 head:1-channel alpha/3-channel fg/1-channel segmentation
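      • ConvGRU cell的极简实现sketch(接口与命名为假设;decoder各stage用它聚合时序信息,hidden state跨帧传递):

        import torch
        import torch.nn as nn

        class ConvGRUCell(nn.Module):
            def __init__(self, channels, kernel_size=3):
                super().__init__()
                p = kernel_size // 2
                self.gates = nn.Conv2d(channels * 2, channels * 2, kernel_size, padding=p)   # update/reset gates
                self.cand = nn.Conv2d(channels * 2, channels, kernel_size, padding=p)        # candidate hidden

            def forward(self, x, h=None):
                # x: [B, C, H, W] 当前帧特征;h: [B, C, H, W] 上一帧的hidden state
                if h is None:
                    h = torch.zeros_like(x)
                z, r = self.gates(torch.cat([x, h], dim=1)).sigmoid().chunk(2, dim=1)
                h_tilde = self.cand(torch.cat([x, r * h], dim=1)).tanh()
                h_new = (1 - z) * h + z * h_tilde
                return h_new   # 既作为该stage的输出特征,也作为下一帧的hidden state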
    • Deep Guided Filter Module
      • given high- resolution videos such as 4K and HD
      • 先下采样by a factor s
      • 然后输入网络
      • 最后网络的2个输出(alpha & fg)、网络output block的hidden feature、以及HR的原图这四个信息都给到DGF,to produce high-resolution的alpha和foreground
  4. 实验
    • training details
      • progressive learning:see longer sequences and higher resolution
      • loss:
        • matting loss(alpha / fg):L1 & pyramid Laplacian loss + additional temporal coherence loss
        • segmentation loss:BCE

Sparse Instance Activation for Real-Time Instance Segmentation

  1. 动机
    • fully convolutional real-time instance segmentation
    • former work的实例分割通常与目标检测绑定
      • dense anchors
      • fixed reception field by fixed anchors
      • multi-level prediction
      • ROI-Align对移动端/嵌入式设备不友好
      • NMS time-consuming
    • this paper
      • a sparse set of activation maps:类似detr的100个proposal
      • 基于attention map得到instance-level的features
      • 匈牙利算法来匹配proposed instance和gt,从而省略NMS,得到稀疏预测
      • 40 FPS and 37.9 AP on the COCO benchmark
      • repo:https://github.com/hustvl/SparseInst
  2. 论点
    • this paper
      • IAM:instance activation maps,sparse set,motivated by CAM
        • pixel-level:相比较于框里还有bg
        • 全局context & single-level
        • simple op:avoid ROI-Align/NMS这些不可避免的循环操作
        • bipartite的稀疏监督:inhibit the redundant predictions, thus avoiding NMS
      • recognition and segmentation:在IAM的instance feature基础上执行下游任务
    • overall structure
      • encoder:backbone + PPM,giving x8 fused features
      • decoder:multi-branch
        • instance branch:IAM,
        • mask branch:语义分割,
  3. 方法
    • IAM:Instance Activation Maps
      • 首先一个基本假设:encoder得到的feature是redundant
      • IAM的op
        • 一个id分支,传入原始feature,[b,h,w,d]
        • 一个feature selection分支(conv+sigmoid+norm),[b,h,w,N]
        • 两个分支做矩阵乘法,[b,N,d]:feature selection分支,给出了基于原始feature的N forms of spatial reweighting方案,作为最终的attention proposals
      • downstream task:recognition and segmentation
        • kernel
        • class
        • score

mediaPipe

发表于 2022-07-12 |
  1. requirements
  1. getting started

    • model zoo: https://google.github.io/mediapipe/solutions/models

模型压缩

发表于 2022-07-11 |

模型压缩常用方法

  1. 模型裁剪:网络中不需要的权重进行修剪,包括非结构化裁剪/结构化裁剪
  2. 模型量化:用uint8之类的低bit数域来映射和还原f32的浮点权重(量化/反量化的简单sketch见下文)
  3. 知识蒸馏:teacher student,利用soft label
  4. 网络结构设计:MobileNet、ShuffleNet
  1. 模型裁剪
    • 非结构化裁剪
      • 裁剪掉某些不重要的神经元实现
      • 裁剪力度较大,可以压缩几十倍
      • 需要定制化的软硬件支持
    • 结构化裁剪
      • channel、filter、shape的re-selection
      • 灵活部署
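  2. 模型量化
    • 非对称线性量化(f32 -> uint8)的概念sketch:用scale/zero_point做映射与还原(仅为示意代码,不对应任何具体框架的API)

        import numpy as np

        def quantize(w, num_bits=8):
            qmin, qmax = 0, 2 ** num_bits - 1
            scale = (w.max() - w.min()) / (qmax - qmin)
            zero_point = np.round(qmin - w.min() / scale)
            q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.uint8)
            return q, scale, zero_point

        def dequantize(q, scale, zero_point):
            return (q.astype(np.float32) - zero_point) * scale

        w = np.random.randn(1000).astype(np.float32)
        q, s, z = quantize(w)
        print(np.abs(dequantize(q, s, z) - w).max())   # 量化误差大约在一个scale以内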

self-supervised

发表于 2022-06-10 |
  • 自监督papers

    • MoCo系列:contrastive-based

      [2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning,kaiming

      [2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations,Google Brain,混进来是因为它improve based on MoCo v1,而MoCo v2/v3又都是基于它改进

      [2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning,kaiming

      [2021 MoCo v3] An Empirical Study of Training Self-Supervised Visual Transformers,kaiming

    • MAE:reconstruct-based

      [2021 MAE] Masked Autoencoders Are Scalable Vision Learners:恺明,将BERT的掩码自监督模式搬到图像领域,设计基于masked patches的图像重建任务

    • MIM:reconstruct-based

      [2021 SimMIM] SimMIM: A Simple Framework for Masked Image Modeling:微软,swin v2的scale up模型用了这个自监督方法来缓解data-hungry issue

      [2022 MIM] Revealing the Dark Secrets of Masked Image Modeling:微软,类似上一篇的展开实验part,

SimMIM: a Simple Framework for Masked Image Modeling

  1. 动机

    • propose a simple framework of MIM
      • without the need of special designs
      • simple designs revealed strong learning performance
    • major components
      • 使用较大patch size:random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pre-text task
      • 进行pixel-level的regression预测:predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs
      • 轻量的预测头:the prediction head can be as light as a linear layer, with no worse performance than heavier ones
    • proved on ImageNet
      • 普通模型ViT-B:pretraining + finetuning,ImageNet-1k,83.8% top-1
      • 大模型SwinV2-H:87.1% top-1
      • 超大模型SwinV2-G:MIM能够缓解data-hungry issue,使用更少量的数据将超大模型训练至SOTA
  2. 论点

    • 自监督里面的numerous pretext tasks

      • gray-scale image colorization:图像上色
      • jigsaw puzzle solving:打乱的patches重新排序
      • split-brain auto-encoding:图像分成两部分,两条分支,交叉预测对方
      • rotation prediction:given变换前后的图像X&Y,预测变换参数
      • learning to cluster:「特征聚类,将簇心标签赋给其成员,训练一个task,特征聚类」,迭代这个过程
    • Masked signal modeling

      • 图像和NLP的主要区别
        • images exhibit stronger locality:neighbor pixels就是high related的,language的词序则不存在必然的距离相关性
        • visual signals are raw and low-level:因此预测pixel-level的重建任务是否对high-level recognition task有增益?
        • the visual signal is continuous, and the text token is discrete
      • bridge the modality gaps through several special designs
        • converting continuous signals into color clusters
        • patch tokenization using an additional network
        • block-wise masking strategy to break short-range connections
    • this paper

      • propose a simple framework,无需上述复杂设计

      • Random masking:

        • patch-level的随机masking,适配vision transformer
        • 大一点的patch size(32) works for a wide range of masking ratio
        • 小的patch size(8) 需要masking ratio as high as 80% to perform well
        • NLP里面的masking ratio通常比较小,如0.15,我们认为是因为info-level不一致
      • 1-layer prediction head
        • extremely lightweight
        • 同时target resolution也不建议太大(12-96都可以好于192x192)
        • achieves slightly better transferring performance than heavy heads
        • 这个自监督的头预训练完了要丢弃的,所以越小越好,不要过多承担模型能力
      • pixel-level reconstruction use simple L1 loss
        • regression比较适配continuous signal
        • performs no worse than classification approaches
        • 【QUESTION】分类任务一般怎么设计:后面实验里面,把RGB灰度值分解成8/256个bin,然后分类
  3. 方法

    • A Masked Image Modeling Framework:4 major components

      • masking strategy
      • encoder architecture:ViT & Swin
      • prediction head
      • prediction target:either the raw pixels or a transformation
    • Masking Strategy

      • mask token
        • use a learnable mask token vector to replace each masked patch
        • token dimension 和 visible那部分patch embedding一致
      • Patch-aligned random masking
        • 就是以patch为单位随机masking
        • swin的patch size是随阶段增长的,从4到32,we adopt 32
        • for ViT we adopt 32
      • Other masking strategies:用了16/32

        • square:随机放置的大方框
        • block-wise:复杂设计的

    • Prediction Head

      • as light as a linear layer
      • 实验也尝试过2-layer MLP、an inverse Swin-T、an inverse Swin-B 这种逐渐heavy的
      • 上采样 if required:
        • ViT编码得到x16的feature maps
        • Swin编码得到x32的feature maps
        • 用一个1x1 conv / linear,将feature dim扩展到 patch_size × patch_size × 3,如swin-RGB就是32×32×3=3072
    • Prediction Targets

      • regression

        • 也可以考虑将ground-truth降采样到feature size
        • L1 loss:计算masked区域RGB像素的L1 loss,然后mean on pixels(masking + masked L1 loss的简单sketch见本节末尾)
        • 实验也尝试了L2 / smoothL1
      • Other prediction targets

        • previous approaches大多数将masked signals转化成clusters or classes,然后perform a classification task
        • Color clustering(iGPT):将巨型dataset的RGB values聚类成512个cluster,每个预测pixel is assigned to最邻近的cluster
        • Vision tokenization(BEiT):用一个pretrained discrete VAE network将image patches转化成token,并作为classification target
        • Channel-wise bin color discretization:每个颜色通道独立分类,灰度值离散化为8/256 bins
    • Evaluation protocols

      • 首先将模型在imagenet1k上finetuning,然后看分类精度
      • 或者其他down-stream tasks的指标来评估
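    • patch-aligned random masking + masked-pixel L1 loss的简单sketch(示意性实现,维度按上文假设:input 192、mask patch size 32、mask ratio 0.6,encoder与prediction head省略):

        import torch
        import torch.nn.functional as F

        def random_patch_mask(b, img_size=192, patch=32, ratio=0.6, device='cpu'):
            n = img_size // patch                                   # 每边的patch数
            num_mask = int(n * n * ratio)
            mask = torch.zeros(b, n * n, device=device)
            for i in range(b):
                idx = torch.randperm(n * n, device=device)[:num_mask]
                mask[i, idx] = 1.0                                  # 1表示被mask
            mask = mask.view(b, 1, n, n)
            return F.interpolate(mask, scale_factor=patch, mode='nearest')   # [B,1,H,W]

        def masked_l1_loss(pred, target, mask):
            # pred/target: [B,3,H,W],mask: [B,1,H,W];只在masked像素上计算L1,再对像素取平均
            loss = (pred - target).abs() * mask
            return loss.sum() / (mask.sum() * pred.shape[1] + 1e-6)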
  4. 实验

    • pre-training settings

      • swinB:input 192x192,window size=6
      • dataset:ImageNet-1K,a light data augmentation (random resize cropping/random flipping /color normalization)
      • AdamW:weight decay=0.05,beta=[0.9,0.999]
      • cosine LR scheduler:100 epochs (warmup 10 ep),baseLR=8e-4
      • batch size:2048
      • random masking:mask ratio=0.6,mask patch size=32
    • fine-tuning settings

      • AdamW、batch size、masking 参数一致
      • cosine LR:baseLR=5e-3
      • a stochastic depth rate:0.1
      • a layer-wise learning rate decay:0.9
      • strong data augmentation:RandAug,Mixup,Cutmix,label smoothing,random erasing
    • AvgDist

      • measures the averaged Euclidean distance of masked pixels to the nearest visible ones:被mask的像素到最近的visible像素的平均(空间)欧氏距离(计算方式见下方sketch)
      • mask ratio越大,AvgDist越大
      • mask patch size越大,AvgDist越大
      • AvgDist的值在[10,20]区间时,模型的精度最高
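      • AvgDist计算方式的简单sketch(个人理解:对patch级mask上采样到像素后,求masked像素到最近visible像素的平均欧氏距离):

        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def avg_dist(mask):
            # mask: [H, W],1表示被mask,0表示visible
            dist = distance_transform_edt(mask)    # 每个像素到最近的0元素(visible像素)的欧氏距离
            return dist[mask == 1].mean()

        mask = np.zeros((192, 192))
        mask[:, 96:] = 1                           # toy例子:右半边全部mask
        print(avg_dist(mask))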
    • 一些精度记录

      • SwinV2-H achieves 87.1% top-1 accuracy,在只使用ImageNet-1K数据集里面精度最佳
      • SwinV2-G借助了外部数据,但比google用的少,40× smaller,achieves strong performance
        • 84.0% top-1 on ImageNet-V2
        • 63.1/54.4 box/mask mAP on COCO object detection
        • 59.9 mIoU on ADE20K semantic segmentation
        • 86.8% top-1 acc on Kinetics-400 action recognition
  5. Visualization

    • 【20220614】目前初步实验下来,预训练的生成模型,生成的图片会呈现明显的棋盘格,因为每个x32的feature pixel代表了一个32x32的RGB patch,官方论文里面的图也很棋盘格,不知道该训练到啥程度算结束
    • What capability is learned?

      • random masking:the shape and texture of masked parts can be well recovered,以及unmasked区域会观察到显著棋盘格效应,因为这部分区域在训练过程中是不回传梯度的
      • masking most parts of a major object:can still predict an existence of object by the negligible clues
      • masking the full major object:the masked area will be inpainted with background textures
    • Prediction v.s. reconstruction

      • 比较了masked region recover和全图recover两个任务
      • 从重建结果上看,后者视觉效果更好一点(棋盘格没那么明显,因为是全局预测),但是精度则低了一个点:probably the model capacity is wasted at the recovery of the unmasked area which may not be that useful for fine-tuning
      • auto-encoders and masked image modeling两个方法都是重建任务,but they are built on different philosophies:

        • 前者是visible signal reconstruction
        • 后者是prediction of invisible signals
      • MIM也可以设计成全图重建,但这相当于融合了两个任务

        • prediction & reconstruction
        • 从finetuning精度上看two tasks are fundamentally different in their internal mechanisms,两个任务的内部机制不同,合起来做不会促进
        • the task to predict might be a more promising representation:看起来prediction任务学到的representation对下游任务更有用一些
      • 【个人理解重建任务更local一点,所以细节更好看,prediction任务更long-range一些,但是为什么有说对low-level/fine-grained downstream task更好呢?】

Revealing the Dark Secrets of Masked Image Modeling

  1. 动机

    • Masked image modeling (MIM) as pre-training
      • proved effective
      • but how it works remains unclear
    • we compare MIM with mainstream supervised models
      • through visualizations & experiments
      • to uncover the key representational differences
    • visualizations
      • 发现MIM brings locality inductive bias to all layers:信息流更不丢东西?
      • 相比之下supervised models在lower layers更关注局部信息,在higher layers则更关注全局信息
      • supervised models在last layers的attention head基本没啥差别(都是global semantic info),但是MIM在last layers仍旧能够keep diversity on attention heads
      • less diversity harms the fine-tuning performance
    • experiments
      • 发现MIM相比较于supervised models更胜任tasks with weak semantics / fine-grained tasks
      • 【猜测】image-level label 驱动 pixel-level study的效果更好?
  2. 论点

    • masked signal modeling

      • mask a portion of input signals and tries to predict them
      • 属于比较经典的recover-based自监督任务设计
      • language, vision, and speech场景都有在用
    • masked image modeling (MIM)

      • achieve very high fine-tuning accuracy
      • thus this paper wants a deeper understanding
      • we use SimMIM framework:就是基于ViT/swin-back+light-weight head重建pixel-level图像的任务

        • random masking with large patch size
        • a-linear-layer prediction head
        • predict raw RGB pixels use L1 loss

  3. Visualizations

    • attention weights have a clear meaning:每个token比重多大

    • 从三个方面来分析

      • averaged attention distance to measure whether it is local attention or global attention
      • entropy of attention distribution to measure whether it is focused attention or broad attention:这跟上面不一个意思吗
      • KL divergence of different attention heads to investigate that attention heads are attending different tokens or similar ones
    • Local Attention or Global Attention

      • 图像信息带有strong locality:neighbor pixels天然地highly correlated,所以才会有conv这样的带有local priors的设计,但是transformer结构有没有这种inductive bias就值得讨论了

      • computing averaged attention distance in each attention head of each layer

        • contrastive模型与supervised模型表现类似,lower layer focus locally,higher layers focus more globally
        • MIM模型每层的attention heads则表现的充满diversity,始终保有local & global pixels
        • 说明MIM brings locality inductive bias【不太理解】
    • Focused Attention or Broad Attention

      • averaging the entropy of each head’s attention distribution

        • contrastive模型与supervised模型表现类似,lower layer的一些attention heads有非常focused attention,大部分higher layers的attention heads则focus very broadly
        • MIM模型每层都很diverse,每层都兼顾了focused attention & broad attention
    • Diversity on Attention Heads

      • 看每个attention head关注的token是否相似

      • computing the KL divergence between different heads

        • contrastive模型与supervised模型表现类似,diversity逐渐变小,最后几层甚至都没了
        • losing diversity limits the capacity of the model:损害了模型表达能力
        • 去掉supervised模型的后面几层去进行下游任务精度会保持甚至提升,说明supervised pretrained model后面几层确实对下游任务有负面影响
    • Investigating the Representation Structures via CKA similarity

      • 前面都是在看同层不同attention heads,这里观察不同层的feature maps

      • via the CKA similarity between feature representations of different layers

        • MIM和contrastive模型表现类似,每层的feature representation structures高度相似
        • supervised模型则每层差异比较大
        • 给这些预训练模型加载权重的时候随机调换一些层进行下游任务,MIM只有轻微掉点,但是supervised会受影响更大
    • Experiments

      • on 3 types of downstream tasks

        • semantic understanding tasks:classification,Concept Generalization (CoG) & 12-dataset (K12) & iNaturalist-18 (iNat18)
        • geometric and motion tasks:pose/depth estimation/video tracking,COCO & CrowdPose & NYUv2
        • combined tasks:object detection,COCO
      • Semantic Understanding Tasks

        • 用了三个数据集,从ImageNet pretrained去transfer
        • settings
          • AdamW
          • cosine learning rate schedule
          • 100 epochs with 20 warm-up
          • input 224x224
          • DropPath
        • 发现ImageNet cover的类别supervised模型会好于MIM模型,没cover的类/fine-grained的类都是MIM精度更高,说明MIM的 representation power的transfer能力更强

      • Geometric and Motion Tasks

        • 主要测试目标定位能力,不太关注高级语义信息
        • 全面超越
      • Combined Task of Object Detection

        • COCO目标检测
        • Mask-RCNN framework
        • 也是clearly outperform
        • 然后观察到MIM模型的定位task收敛的faster and better,supervised模型则对分类能力更有用,也说明了MIM更专注geometric and motion tasks

FAN:full attention networks

发表于 2022-06-03 |

paper: https://arxiv.org/abs/2204.12451

repo: https://github.com/NVlabs/FAN

Understanding The Robustness in Vision Transformers

  1. 动机
    • ViT展现出了strong robustness(图像出现各类corruption的时候仍旧保持特征提取能力),但是缺少systematic understanding
    • this paper
      • examine the role of self-attention
      • further propose a family of fully attentional networks (FANs) that strengthen its capability
    • verified state-of-the-art on
      • ImageNet-1k:acc 87.1%
      • downstream tasks:semantic segmentation and object detection
  2. 论点
    • ViT针对non-local relations的建模被认为是the robustness against various corruptions的重要要素
    • 但是convNext用纯卷积网络也达到了相当的精度,raising the interest in the actual role of self-attention in robust generalization
    • our approach
      • 首先是观察:
        • 在执行分类任务的时候,网络天然地做出了目标的分割——self-attention promotes mid-level representations
        • 对每层ViT的输出tokens做spectral clustering,观察到了极大的特征值
        • 发现极大特征值的数量与input perturbation是存在相关性的——这两个指标在mid-level layers中会显著降低,从而保持robustness,同时 indicates the symbiosis of grouping
      • 然后进一步探究这个grouping phenomenon
        • 发现self-attention类似于information bottleneck (IB),这个暂时不理解
      • 最后是对transformer结构的改动
        • 原始的ViT block,用MSA(multi heads)去提取不同特征,然后用MLP去整合
        • 本文在融合的时候,加了一个channel attention

TokenLearner

发表于 2022-06-01 |

recollect:

[ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE,Google,开启了vision transformer的固定范式,都是切割patches作为tokens,这也对应了文本的词/字符切割,但是一个patch和一个词向量的信息量是不一样的(像素信息更低级)

[TokenLearner 2022] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? Google,使用更少数量的、能够挖掘重要信息的learnable tokens,

repo:https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner

unofficial keras repo:https://github.com/ariG23498/TokenLearner

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

  1. 动机

    • tokens
      • previous:densely sampled patches
      • ours:a handful of adpatively learned tokens
    • learn to mine important tokens in visual data
      • find a few import visual tokens
      • enable long range pair-wise attention
    • applicable to both image & video tasks
      • strong performance
      • computationally more efficient
    • comparable results are verified on classifications tasks
      • 与state-of-the-arts on ImageNet对比
      • video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD
  2. 论点

    • the main challenge of ViTs
      • require too many tokens:按照16x16的切割方式,512x512的图像也对应着1024个tokens
      • transformer block的computation和memory是基于token length平方型增长的
      • 因此限制了larger image/longer video
    • thus we propose TokenLearner
      • a learnable module
      • take image as input
      • generates a small set of tokens
      • idea很直接:找到图像的重点区域regions-of-importance,然后用重点区域生成token
      • 实验发现保留8-16个tokens(此前的transformer通常有200~500个tokens)就能够保持甚至提升精度,同时降低flops
  3. 方法

    • TokenLearner

      • formulation

        • a space-time tensor input $X$:$X \in R^{T \times H \times W \times C}$
        • a temporal slice $X_t$:$X_t \in R^{H \times W \times C}$
        • T是时间维度,如果是image的话T=1,HWC是常规的长/宽/通道数
        • for every time frame t,we learn to generate a series of tokens $Z_t$ from the input frame $X_t$:$Z_t=[z_i]_{i=1}^S$
        • thus we use a tokenizer function $A_i$:$z_i=A_i(X_t)$,adaptively selects important combination of pixels
        • 这样的function我们有S个,而且S远远小于HW,通常S=8
      • tokenizer function

        • implemented with a spatial attention mechanism
        • 首先生成一个spatial weight map (size HW1)
        • 然后乘在$X_t$上,得到an intermediate weighted tensor (size HWC)
        • 最后进行spatial维度的global averge pooling,将weighted maps转化成vector (size C)
        • 所有的resulting tokens are gathered to form the output $Z_t =[z_i]_{i=1}^S\in R^{S \times C}$
        • spatial attention的实现有两种
          • 本文v1.0使用了一层/多层卷积(channel=S)+sigmoid
          • 本文v1.1使用了一个MLP(dense-gelu-dense)
          • (这两个版本的参数量差距巨大啊)
      • 图:将$R^{HWC}$的input image稀疏映射到$R^{SC}$
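      • tokenizer function(v1.0,conv+sigmoid版本)的极简sketch(类名与默认参数为假设,多层conv等细节省略):

        import torch
        import torch.nn as nn

        class NaiveTokenLearner(nn.Module):
            def __init__(self, in_channels, num_tokens=8):
                super().__init__()
                self.attn = nn.Conv2d(in_channels, num_tokens, kernel_size=3, padding=1)

            def forward(self, x):
                # x: [B, C, H, W] 某一帧的特征
                a = self.attn(x).sigmoid().flatten(2)              # [B, S, HW] S张spatial weight map
                a = a / (a.sum(dim=-1, keepdim=True) + 1e-6)       # 归一化,等价于weighted global average pooling
                z = torch.bmm(a, x.flatten(2).transpose(1, 2))     # [B, S, C],得到S个tokens
                return z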

    • TokenFuser

      • after the Transformer layers,此时的tensor flow还是$R^{SC}$

      • 引入TokenFuser

        • fuse information across tokens,融合所有token
        • remap the representation back to origin resolution,重映射
      • 首先做fuse:given tokens $Y\in R^{ST \times C}$,乘以一个learnable weight $M \in R^{ST \times ST}$,得到tensor $\in R^{ST \times C}$,可以理解为空间(或时空)关联

      • 然后做remap,对每个temporal slice $Y_t \in R^{SC}$:

        • $X_t^{j+1} = B(Y_t, X_t^j) = B_w Y_t + X_t^j = \beta_i(X_t^j)Y_t+X_t^j$
          • $X_t^j$是TokenLinear的残差输入,也就是原图HWC,等待被reweight的分支
          • $X_t^{j+1}$是模块输出
          • $Y_t^j$是TokenFuser的fuse这步的结果,对应图上transformer output SC
        • $\beta_i()$是个dense+sigmoid,作用在原图上,得到HWS的weight tensor $B_w$
        • 然后乘上Y得到HWC
        • 再加上这个残差
    • 整体架构

      • 整体计算流程

      • 两种模型结构(有/没有TokenFuser)

  4. 实验

    • settings

      • to be added
    • TokenFuser的ablation实验:整体有提升,模型越大提升越不明显

cython & flask

发表于 2022-05-20 |
  1. python的编译

    • pyc:

      • pyc是编译py之后生成的二进制文件,由python虚拟机来执行的

      • 在模块被加载时,.pyc文件比.py文件更快

      • 可直接反编译为源码

      • 生成方式1:

        • 生成pyc文件:python -m py_compile {file1,file2}.py / python -m compileall DIR
        • 删除py文件:find . -name "*.py" | xargs rm -rf
        • 删除__pycache__目录:find . -name "__pycache__" | xargs rm -rf
      • 生成方式2:

        # single file
        import py_compile
        py_compile.compile('$file_path')

        # dir
        import compileall
        compileall.compile_dir('$dir')
    • cython编译成动态库.so:

      • pyx文件由cython编译为.c文件,.c文件由C/C++编译器编译为.so动态库
      • 能起到代码保护的作用
      • 但是编译速度太慢了
      • 注意在编译的时候每个库里必须有一个__init__.py文件
      • 生成方式:
        • 创建setup.py
        • 运行python setup.py build_ext
        • 会在执行目录下,新增build文件夹,里面放有.so
  2. python的反编译

    • pyc:
      • 需要安装第三方库:pip install uncompyle6
      • uncompyle6 -o . *.pyc
  3. Flask(https://zhuanlan.zhihu.com/p/32202156)

    • 一些概念

      • API:应用程序接口,由服务器(Server)提供,一般是web server(网络服务器),使得外部机器通过API读取、编辑网站数据,通俗来讲API是一套协议,规定了与外界的沟通方式——如何发送请求和接受响应
      • HTTP动词(GET、POST、PUT、DELETE):它们分别对应四种基本操作:GET用来获取资源,POST用来新建资源(也可以用于更新资源),PUT用来更新资源,DELETE用来删除资源
      • 装饰器:放在函数前面,相当于将函数对象传入这个wrapper方法中
    • startup:URL

      from flask import Flask

      app = Flask(__name__)

      @app.route('/')
      def index():
          return 'Index Page'

      @app.route('/hello')
      def hello():
          return 'Hello World'

      @app.route('/user/<string:username>')
      def show_user_profile(username):
          # show the user profile for that user
          return 'User %s' % username

      if __name__ == '__main__':
          # app.run()
          app.run(debug=True)
    * app = Flask(__name__): 新建一个Flask类的实例,这是一个WSGI应用(WSGI:Web服务器网关接口,Python Web Server Gateway Interface)

    * @app.route(URL): 用route装饰器告诉Flask什么样的URL能够触发我们的函数
           - @app.route('/')就代表默认根地址http://127.0.0.1:5000/

           - @app.route('/hello')则代表http://127.0.0.1:5000/hello

           - @app.route('/user/<string:username>')则带了传入变量以及它的变量转换器,e.g. http://127.0.0.1:5000/user/amber

                <img src="cython-flask/转换器.png" width="60%;" />

           - 同时可以看到,每执行一次访问,debug界面生成一次GET请求("GET /hello HTTP/1.1")以及服务器返回的内容(200),具体的解释在下面的章节

    * def index()/hello_world()/show_user_profile(username): 这些函数在生成指定URL时被调用,这些方法都叫做视图函数

    <img src="cython-flask/url.png" width="40%;" />   <img src="cython-flask/get.png" width="40%;" />

* HTTP方法

  * HTTP 方法,通过浏览器告知服务器,客户端想对请求的页面做什么

    * GET(方法)告知服务器:只获取页面上的信息并发给我,这是最常用的方法
    * POST(方法)告诉服务器:想在 URL 上发布新信息。并且服务器必须确保数据已存储且仅存储一次,这是 HTML 表单通常发送数据到服务器的方法
    * PUT(方法)类似 POST 但是服务器可能触发了存储过程多次,多次覆盖掉旧值。你可能会问这有什么用,当然这是有原因的。考虑到传输中连接可能会丢失,在这种情况下浏览器和服务器之间的系统可能安全地第二次接收请求,而不破坏其它东西。因为 POST 它只触发一次,所以用 POST 是不可能的

    * DELETE(方法)删除给定位置的信息

  * 默认情况下,路由只回应GET请求,如上图的GET info,在访问指定URL时候就被触发了,具体方法通过route()装饰器传递methods参数来指定

* 设计一个简单的应用todoList

  * 首先设计一个根URL:如http://[hostname]/todo/api/v1.0/

    * hostname:[ip:port_id]
    * todo:应用名称
    * api/v1.0:API版本

  * 第二步规划数据结构和动作

    * 动作

    | HTTP方法 |                       URL                       |         动作         |
    | :------: | :---------------------------------------------: | :------------------: |
    |   GET    |      http://[hostname]/todo/api/v1.0/tasks      |     检索任务清单     |
    |   GET    | http://[hostname]/todo/api/v1.0/tasks/[task_id] |     检索指定任务     |
    |   POST   |      http://[hostname]/todo/api/v1.0/tasks      |   创建一个新的任务   |
    |   PUT    | http://[hostname]/todo/api/v1.0/tasks/[task_id] | 更新一个已存在的任务 |
    |  DELETE  | http://[hostname]/todo/api/v1.0/tasks/[task_id] |     删除一个任务     |

    ​    可以看到我们通过指定HTTP method,可以在同一个URL上实现不同的请求

    * 任务的数据结构
        * id:唯一标识。整型。
        * title:简短的任务描述。字符串型。
        * description:完整的任务描述。文本型。
        * done:任务完成状态。布尔值型。

  * 第三步实现第一个方法:get_tasks()

    
    from flask import jsonify

    # tasks 是模块级的任务列表,每个元素为包含 id/title/description/done 的 dict
    @app.route('/todo/api/v1.0/tasks', methods=['GET'])
    def get_tasks():
        return jsonify({'tasks': tasks})

    通过访问http://127.0.0.1:5000/todo/api/v1.0/tasks可以查看task列表

  * 第四步实现【获取指定任务/删除任务】方法:get_task(task_id)/delete_task(task_id)
    @app.route('/todo/api/v1.0/tasks/<int:task_id>', methods=['GET'])
    def get_task(task_id):
        pass

    @app.route('/todo/api/v1.0/tasks/<int:task_id>', methods=['DELETE'])
    def delete_task(task_id):
        pass

    这两个方法比较类似,都是先查询,存在即执行操作,否则抛出404

  * 第五步实现【创建一个新的任务】方法:create_task()
    from flask import request, abort

    @app.route('/todo/api/v1.0/tasks', methods=['POST'])
    def create_task():
        if not request.json or 'title' not in request.json:
            abort(400)
        # new task
        task = {
            'id': tasks[-1]['id'] + 1,
            'title': request.json['title'],  # not blank
            'description': request.json.get('description', ""),
            'done': False
        }
        tasks.append(task)
        return jsonify({'task': task}), 201

    这里面出现了**request和状态码**,这部分内容放在下一节

  * 第六步实现【更新一个任务】方法:update_task(task_id)
    @app.route('/todo/api/v1.0/tasks/<int:task_id>', methods=['PUT'])
    def update_task(task_id):
        # request.json
        pass
* 配置服务器ip和端口号
  1. HTTP请求(https://www.jianshu.com/p/4456b0906708)

    • flask的工作,就是开启一个server,监听client端发出的请求,并作出响应,请求和响应都是以http request的方式

    • 服务器端就是flask开启的web server,客户端通过浏览器向server传递指令

    • 请求报文 request message

      • 请求报文由请求行(HTTP方法、URL、协议版本)、首部字段(header)、空行、请求数据(内容实体)组成

      • request对象

        • 假设请求的url是:http://helloflask.com/hello?name=Grey

          • http://:协议字符串,指定要使用的协议,这里是http协议
          • helloflask.com:服务器的地址(域名)
          • /hello:资源路径(route)
          • ?name=Grey:query string,查询字符串,查询字符串从?开始,以键值对的形式写出,多个键值对之间用&分隔
        • 属性和方法:

          • 上面出现的request.json,就是在获取json格式的报文
          • 本例中执行request.args.get('name', ''),就能获取到url请求中name的value——Grey
    • 路由匹配 & server处理

      • 当server能够匹配到http request给出的host/URL和methods的时候,就会调用对应的视图函数,否则得到404

      • 查看当前server定义的所有路由:

        • export FLASK_APP=todo
        • flask routes

        • 每个路由对应:端点(Endpoint)、HTTP方法(Methods)和URL规则(Rule),其中static是flask添加的特殊路由,用来访问静态文件

      • 生成响应

        • 在flask程序中,客户端发出的请求触发响应的视图函数,获取的返回值会作为响应的主体最后生成完整的响应,即响应报文

        • 视图函数可以返回最多由三个元素组成的元组:响应主体、状态码、首部字段

        • 默认的状态码为200

        • 首部字段可以为字典,或是两元素元组组成的列表

        • 一个普通的响应可以只返回主体内容,其余保持默认值(200,{})

        • 在abort()函数中传入状态码即可返回对应的错误响应,不需要return,abort之后的代码将不会被执行

        • Flask中可以调用make_response()方法将视图函数返回值转换为响应对象

          from flask import make_response

          @app.route('/foo')
          def foo():
              response = make_response('Hello, World!')
              response.mimetype = 'text/plain'
              return response
  • 响应报文 response message

    • 视图函数可以返回最多由三个元素组成的元组:响应主体、状态码、首部字段,这就是响应的主体

    • 响应的报文首部包含一些关于响应和服务器的信息,这些内容由Flask生成

    • 浏览器接收到响应后,会将响应主体解析并显示在浏览器窗口上

    • 响应报文主要由协议版本、状态码(status code)、原因短语(reason phrase)、响应首部和响应主体组成

    • 常见HTTP状态码

Matting

发表于 2022-05-05 |

papers:

[DIM-Matting 2017] Deep Image Matting,Matting网络始祖,trimap-based

[BGMv2 2021] Real-time high-resolution background matting,实现高分辨率图像的实时预测

[MODNet 2020] Is a Green Screen Really Necessary for Real-Time Portrait Matting,商汤,摒弃了辅助信息,直接实现Alpha预测

[PP-Matting 2022] High-Accuracy Natural Image Matting,百度,在MODNet基础上改进

[BiSeNet v2 2020] Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation,双encoder结构,一个用来显式地guide local feature,本身不是针对matting任务,但是其他matting paper引用了它

[GCA Matting 2020] Natural Image Matting via Guided Contextual Attention,introduce GCA block来做local guidance

[animal matting 2022] Bridging Composite and Real: Towards End-to-end Deep Image Matting,毛发抠图

Bridging Composite and Real: Towards End-to-end Deep Image Matting

  1. introduction

    • 毛发的分割通常需要trimap引导

    • 思路还是常规的decompose,into two parallel sub-tasks:

      • high-level semantic segmentation
      • low-level details matting
    • propose GFM

      • Glance and Focus Matting network (GFM)

      • a shared encoder and two separate decoders

      • learn both tasks in a collaborative manner

    • 分析了composed image和real-world image的差异

      • a carefully designed composition route RSSN
      • 提供了2000张high-resolution的动物图和10,000张portrait图
  2. related work

    • methods

      • 串行:先global segmentation生成trimap,然后做local matting,串行时间效率差,而且错误的语义不会被修正,两个网络通常分别训练,存在mismatch
      • 并行:添加了一个global guidance做辅助分支,通常用coarse matte做学习目标,但是用一个网络去学全图fg/transition/bg区域的matte是一个比较困难的事,因为不同区域的表征、语义差异太大了
      • this paper 并行且多任务:并行执行segmentation和matting两个任务

        • global rough segmentation
        • details matting

    • datasets

      • ORI-Track:之前的数据只提供前景和matte/low-resolution portraits和不太准确的matte
      • COMP-Track:
        • 之前是通过在公开数据集如COCO上叠加前景数据构造数据集的
        • 这种合成图像存在composition artifacts
          • 因为和前景存在resolution,sharpness,noise,illumination差异
          • 可能存在salient object
      • this paper 提出了一个large-scale high-resolution clean background dataset (BG-20k)

  3. method

    • overview

      • a segmentation stage:先识别salient rough foreground / background
      • a matting stage:focus on the transition areas to distinguish details from the background
      • collaboration

    • shared encoder

      • 输入是single image
      • 5 blocks:E0-E4,s2-s32
      • DenseNet-121/ResNet-34/ResNet-101
    • Glance Decoder (GD)

      • a large receptive field to learn high-level semantics
      • PPM
      • 镜像地stack 5 blocks:D4-D0
        • each of which consists of three sequential 3 × 3 convolutional layers and an upsampling layer
        • 每个stage的decoder block还接受一个ele-wise sum的PPM输入
      • loss:2/3-channel的CE
    • Focus Decoder (FD)

      • aims at low-level structural features in transition areas

      • bridge block (BB)

        • three dilated convolutional layers
        • leverage local context in different receptive fields
        • E4和BB的feature concat起来作为stage5 feature
      • 镜像地stack 5 blocks:D4-D0

        • UNet style
        • 额外的来自encoder的shortcut,保留fine details
      • loss

        • alpha prediction loss:absolute difference

        • Laplacian loss:L1 loss

        • 仅关注the unknown transition areas

    • Representation of Semantic and Transition Areas

      • 模式1:GFM-TT
        • 3-class trimap T:对ground truth alpha做dilation和erosion(kernel size 25)得到,作为segmentation gt(生成方式见本小节末尾的sketch)
        • use the ground truth alpha matte来定义the transition area
      • 模式2:GFM-FT
        • 2-class foreground F:将ground truth alpha erode with a kernel size of 50做F
        • 用(gt alpha>0) - F 作为transition area
      • 模式3:GFM-BT
        • 2-class foreground G:将ground truth alpha dilate with a kernel size of 50做G
        • 用G-(gt alpha>0) 作为transition area
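      • 由gt alpha生成GFM-TT模式trimap的简单sketch(kernel size 25同上文,函数名为假设):

        import cv2
        import numpy as np

        def alpha_to_trimap(alpha, ksize=25):
            # alpha: [H, W],取值0~1
            fg = (alpha > 0).astype(np.uint8)
            kernel = np.ones((ksize, ksize), np.uint8)
            dilated = cv2.dilate(fg, kernel)
            eroded = cv2.erode(fg, kernel)
            trimap = np.full(alpha.shape, 128, dtype=np.uint8)   # 默认transition
            trimap[eroded == 1] = 255                            # foreground
            trimap[dilated == 0] = 0                             # background
            return trimap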
    • Collaborative Matting (CM)

      • to generate the final alpha prediction

      • CM的不同模式

        • GFM-TT模式下:CM把GD在transition area的预测换成FD的预测
        • GFM-FT模式下:CM把GD和FD的结果相加
        • GFM-BT模式下:CM用GD-FD的结果作为最终结果
      • loss

        • alpha-prediction loss:absolute diff
        • Laplacian loss:L1
        • composition loss:absolute diff

  4. RSSN

BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

  1. introduction

    • propose Bilateral Segmentation Network (BiSeNet V2)
      • treat spatial details and categorical semantics separately
        • a Detail Branch:wide channels and shallow layers
        • a Semantic Branch:narrow channels and deep layers
      • design a Guided Aggregation Layer
        • enhance mutual connections
        • 得到fused feature
    • performance

      • on Cityscapes test:with 2048x1024 input,72.6% miou,156 FPS,on NVIDIA GeForce GTX 1080 Ti
    • different backbone

      • dilation backbone如deeplab,用空洞卷积扩大receptive field,同时保留high resolution,计算量大
      • encoder-decoder backbone如unet,skip connection这部分会引入memory access cost,影响latency
      • 本文的Bilateral Segmentation backbone,两个pathway,achieve high accuracy and high efficiency simultaneously

  2. method

    • overview

    • Detail Branch

      • 负责学spatial details
      • wide channels,shallow layers,small strides
      • 没有residual path,避免memory access cost
      • three stages,基本类似VGG
        • 每个stage有两层conv-bn-relu
        • 每个stage的第一个conv的stride是2
        • 输出x8的feature map
    • Semantic Branch

      • 负责学high-level semantics

      • channel数和Detail Branch的channel数有一个比例factor(<1)

      • 可以用任何的lightweight convolutional model

      • large strides,large receptive field

      • stem block

        • 迅速下采样到x4尺度,用了两种下采样manner
        • 然后concat + fuse conv

      • GE block

        • 基本就是MobileNet V2的block
        • 但是在resolution/channel变化时,identity path加入dwconv-conv来做align,而不是去掉id path
          • 2个stride2的DWconv
          • 1个separable conv
        • 同时在residual path中,expand layer的1x1conv换成了3x3conv,因为cudnn里面3x3conv更友好,expand操作放在了DWconv里面

      • 最后的CE block用global average pooling to embed the global contextual response

        • GAP的1x1xC的context vector直接broadcast add到x32 feature上

    • Bilateral Guided Aggregation

      • fuse manners

        • simple sum/concat
        • well designed operations,本文用了bidirectional aggregation method
      • BGA

        • 用semantic branch的响应值来guide Detail Branch

    • Booster Training Strategy

      • 每个semantic stage的feature上加了auxiliary segmentation head
      • 每个seg head都是3x3conv-bn-relu + 1x1 conv,然后上采样到原始尺度

PP-Matting: High-Accuracy Natural Image Matting

  1. 动机

    • 抠图:aka. 抠前景,与segmentation的区别是在细节上更精准
    • 现有技术的缺陷:
      • trimap-based methods:需要提供trimap辅助信息
      • trimap-free methods:与trimap-based方法在抠图质量上有较大差距
    • 本文方法
      • trimap-free architecture(a two-branch architecture)
        • applies a high-resolution detail branch (HRDB):high-resolution保留local细节信息
        • a semantic context branch (SCB):分割分支负责全局语义分割
      • high-accuracy
      • natural image matting
    • test datasets
      • Composition-1k
      • Distinctions-646
  2. 论点

    • task formulation
      • an image $I \in R^{H\times W\times 3}$ 可以看作前景和背景的linear combination
        • foreground image $F \in R^{H\times W\times 3}$
        • background image $B \in R^{H\times W\times 3}$
        • $I^i = \alpha^i F^i + (1-\alpha^i)B^i$
        • $\alpha^i$:foreground opacity
      • 可以看作是alpha mattes estimation problem
      • 【没理解】The image matting problem is highly ill-posed with 7 values to be solved, but only 3 values are known for each pixel in a given RGB image.
        • ill-posed problem:不适定问题,
        • well-posed problem:适定问题,需满足三个条件,否则为不适定问题
          • a solution exists 解必须存在
          • the solution is unique 解必须唯一
          • the solution’s behavior changes continuously with the initial conditions 解连续变化,不会发生跳变,即必须稳定
        • GAN、图像超分辨率等任务,都不满足‘解唯一’(感觉生成系都不满足)—— In most cases, there are several possible output images corresponding to a given input image and the problem can be seen as a task of selecting the most proper one from all the possible outputs.
    • methods
      • trimap-based
        • trimap:a rough segmentation of the image into three parts: foreground, background and transition (regions with unknown opacity)
        • 作为image的辅助信息,并行输入
        • not feasible in video
      • trimap-free
        • multi-stage approaches:将trimap作为中间结果,串起两个任务,会有累积误差
        • end-to-end approaches:一个网络直接出结果
    • our method
      • high-resolution detail branch (HRDB):keep high resolution instead of encoder-decoder,fine-grained details
      • semantic context branch (SCB) :segmentation subtask, foreground-background ambiguity
      • fuse:give the final alpha matte
  3. 方法

    • overview of network architecture

      • two branches:SCB & HRDB
      • shared encoder
      • PPM to strengthen semantic context
      • guidance flow to merge

    • shared encoder

      • need both high-resolution details and high-level abstract semantics
      • HRNet48 pre-trained on ImageNet
    • Semantic Context Branch(SCB)

      • 用来生成trimap(fg/bg/transition)
      • 5 blocks
      • each block:3xConvBNReLU+bilinear upsample
      • 32x 下采样的特征图作为分割分支的输入,加一个PPM,
      • 输出是semantic map,分割目标是3分类的segmentation mask,也就是trimap
    • High-Resolution Detail Branch(HRDB)

      • 3 residual blocks + 1 conv
      • 2x和4x 的特征图上采样到原始resolution,然后combine,作为分支输入
      • SCB的中间结果作为guidance flow,也融合进HRD分支, to obtain semantic context
      • 输出是detail map,focuses on details representation in the transition region
    • Guidance Flow

      • Gate Convolutional Layers (GCL) 用来生成guidance map $g \in R^{H \times W}$
        • $g = \sigma (C_{1 \times 1} (s||d))$
        • semanic map和detail map先concat,然后conv-bn-relu & conv-sigmoid
      • merge guidance flow和 original flow 得到最终的merged flow $\hat d$
        • $\hat d = (d \odot g + d)^T w $
        • detail map和guidance map先做element-wise的点乘,作为辅助信息
        • 然后叠加detail map
        • 最后进行channel-wise的re-weight
      • 用semantic map的1、3、5 block的输出进行guidance
    • Loss Function

      • 3个losses

      • [1] semantic loss $L_s$:pixel-level的3分类CE

      • [2] detail loss $L_d$:是alpha loss $L_{\alpha}$和gradient loss $L_{grad}$的sum,而且只计算transition region

        • the alpha-prediction loss:是 groud truth alpha(下标g)和 predict alpha(下标d)的absolute difference
        • the gradient loss:是像素gradient magnitude的差值
        • $\epsilon=1e-6$

      • [3] fusion loss $L_f$:包含alpha loss $L_{\alpha}$、gradient loss $L_{grad}$、composition loss $L_{comp}$ based on the final alpha matte

        • the composition loss:是ground truth RGB value与predicted RGB value的差值(简单sketch见本节末尾)
        • predict Image $I_p$是用predicted alpha matte对ground truth foreground & background的加权得到
        • alpha loss和gradient loss的算法与上面保持一致,但是alpha matte的值是不同的,一个是detail map的结果,一个是fusion map的结果

      • the final weighted loss:$\lambda_1=\lambda_2=\lambda_3=1.0$
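      • composition loss的简单sketch(变量名为假设;用predicted alpha对gt前景/背景合成,再与原图做带epsilon的absolute difference):

        import torch

        def composition_loss(alpha_pred, fg, bg, img, eps=1e-6):
            # alpha_pred: [B,1,H,W];fg/bg/img: [B,3,H,W]
            comp = alpha_pred * fg + (1 - alpha_pred) * bg
            return torch.sqrt((comp - img) ** 2 + eps ** 2).mean()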

  4. 实验

    • Datasets
      • Distinctions-646:训练集包含596个前景目标及ground truth alpha mattes,测试集包含50个
      • Adobe Composition-1k:431/50
      • 在训练中每个前景会被添加进100张背景中,测试是20张
    • Implementation Details
      • input images:random crop into [512,640,800] /pad into [512,],augmented by random [distortion, blurring, horizontal flipping]
      • SGD:momentum=0.9,weight decay=4e-5
      • lr:初始0.01,300k iteration的时候pow by 0.9
      • batchsize = 4
      • conduct on a single Tesla V100 GPU
    • Evaluation metrics:the lower,the better
      • the sum of absolute differences (SAD)
      • mean squared error (MSE)
      • gradient (Grad)
      • connectivity (Conn)

MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition

  1. 动机

    • existing methods
      • either require auxiliary inputs
      • or involve multiple stages
    • thus we propose MODNet (matting objective decomposition network)
      • light-weight
      • real-time
      • end-to-end
      • single-input
      • portrait matting
    • two novel techniques
      • an Efficient Atrous Spatial Pyramid Pooling (e-ASPP) module:fuse multi-scale
      • a self-supervised sub-objectives consistency (SOC) strategy:domain shift problem
    • test device:under 512x512, 67 FPS, GTX 1080Ti GPU
    • dataset:a carefully designed photographic portrait matting (PPM-100) benchmark & Adobe Matting Dataset
  2. 论点

    • Portrait Matting Approaches

      • trimap-based:有一个pre-defined trimap作为先验
      • multi-stage:先用一个semantic network生成一个pseudo trimap,然后refine成alpha matte,数据集有限,suffer from the domain shift problem,对real-world data的泛化性不好
      • 本文方法能够同时自动化完成matting task的子任务:背景提取 & 前景语义,同时用子任务之间的consistency做自监督,提升模型泛化性

    • Image Matting Formulation

      • an alpha matte predicting task:$I^i = \alpha^i F^i + (1-\alpha^i)B^i$
      • ill-posed explanation:上面这个公式,等式右边的参数全是未知的,3通道像素值也就是3+3+1=7个未知数per pixel
      • 所以通常才需要trimap辅助信息
        • 提供0/0.5/1三种alpha构成的mask
        • absolute foreground (α = 1),
        • absolute background (α = 0)
        • unknown area (α = 0.5)
        • 这样任务就简化为,基于known 0/1 region的像素信息,需要预测unknown region的alpha probability
      • matting任务heavily rely on low-level features
      • trimap-free matting
        • a semantic estimation step will then be needed to locate the foreground
      • ASPP
        • proved to boost the performance notably in segmentation-based tasks
        • 本文改进了an efficient variant of ASPP
      • Consistency Constraint
        • Consistency supervision用于semi-/self-supervised
        • MODNet imposes consistency among various sub-objectives within a model
  3. 方法

    • overview

      • divide into three parts
        • semantic estimation:前背景,x16 seg map,用coarse alpha监督
        • detail prediction:边界细节,x4 boundary map,用boundary region的alpha监督
        • semantic-detail fusion:信息融合,得到最终的alpha matte prediction
      • Architecture

    • Semantic Estimation

      • the low-resolution branch S

      • use an encoder MobileNetV2 to predict a coarse mask:16x downsamp

      • seg head:1x1conv + sigmoid

      • ground truth也是粗糙版:ground truth matte也做16x下采样+blur,去除了一些类似发丝的fine feature,专心提取前景整体

      • Efficient ASPP (e-ASPP):

        • 标准的ASPP能解决分割前景有洞的情况,但是huge computation
        • 空洞卷积多尺度提取+常规卷积融合——modify it in three steps:
          • 空洞卷积分解成depth-wise conv和point-wise conv
          • 调换point-wise和fuse conv的顺序
          • fuse conv也替换为更cheaper的depth-wise conv
      • L2 loss

        • $s_p$:predict alpha
        • $G(\alpha_g)$:粗糙化以后的ground truth alpha matte

    • Detail Prediction

      • the high-resolution branch D

      • 用原始图像I、Semantic Branch的输出S(I)、以及Semantic Branch的中间结果(low-level features) 作为输入

      • Branch D 超级轻量

        • 层数少:12个conv
        • 通道数少:64 maximum
        • 没有保留原始解析率:第一层就下采样到4x,最后两层再复原,impact is negligible owing to skip link
      • L1 loss:focus on transition region only

        • $m_d$:boundary mask,对$\alpha_g$先膨胀再腐蚀,提取transition area作为boundary mask
        • $d_p$:D(I, S(I)),branch输出
        • $\alpha_g$:ground truth alpha matte

    • Semantic-Detail Fusion

      • the fusion branch F

      • combine semantics and details

        • upsample S(I)
        • then concat S(I) & D(I, S(I))
        • 然后是convs+sigmoid,得到final predict matte
      • L1 loss + compositional loss

        • L1 loss:全图的alpha matte的L1 loss
        • $L_c$: the absolute difference between input image I and the composited image,跟PP-matting公式10一样,用均方根~

          • the composited image $I_p$ 用gt的fg和bg以及预测的alpha计算
          • loss是L2 loss

        • $\alpha_p$:final prediction

    • train end-to-end through the sum of losses above

      • $\lambda_s = \lambda_a = 1$
      • $\lambda_d = 10$

  4. SOC for Real-World Data

    • portrait matting requires excellent labeling in the hair area,通常在摄影网站上找虚化背景的图标注的

    • 常规的data aug是用背景替换的方式,但是和real-world data还是存在domain gap,模型通常过拟合训练集

    • utilize sub-objectives consistency (SOC) to adapt MODNet to unseen data distributions

      • MODNet的3个子任务在无标签数据上should have consistent outputs

      • given unlabeled Image I,预测$s, d, \alpha$

      • enforce the semantics in $\alpha$ to be consistent with $s$

        • 还是用supervised training里面的L2 loss
        • gt alpha替换成pred alpha
      • enforce the details in $\alpha$ to be consistent with $d$

        • 还是用supervised training里面的L1 loss in transition region
        • gt alpha替换成pred alpha
      • 两项consistency loss加起来

      • extra regularization term防止alpha虚化使得detail loss丧失细节信息

咏柳皮肤病paper

发表于 2022-04-19 |

A Deep Learning Based Framework for Diagnosing Multiple Skin Diseases in a Clinical Environment

  1. 摘要

    • a novel framework based on deep learning
      • backbone:eff-b4
      • output layer改成14个neuron
      • 每个layer group后面接一个auxiliary classifier
      • 用t-SNE对image feature可视化
    • a dataset that represents the real clinical environment
      • 13,603 专家标注的皮肤镜图片
      • 14类(扁平苔藓LP,红斑狼疮Rosa,疣VW,痤疮AV,瘢痕KAHS,湿疹和皮炎EAD,皮肤纤维瘤DF,脂溢性皮炎SD,脂溢性角化SK,黑素细胞痣MN,血管瘤Hem,银屑病Pso,暗红色斑PWS,基底细胞癌BCC)
    • 精度
      • overall acc:0.948
      • sensitivity:0.934
      • specificity:0.950
      • AUC:0.985
      • 与280个权威专家比赛:showed a comparable performance level in an 8-class diagnostic task
  2. 方法

    • previous work不太适用于实际场景:不符合亚洲人发病率
    • database
      • this paper调研了北大医学部皮肤病科的database
      • from October 2016 to April 2020
      • 由同一个技师使用皮肤镜,对着病灶从不同角度,连续拍摄多张
      • 2个5年以上经验的专家,结合患者病史,临床表现,皮肤镜特征打标签
      • 2人意见不同的时候通过咨询第3人达成一致
      • 劣质数据(图像质量、病史不完整、病灶在黏膜/指甲)被排除
      • 提取了14 most frequently encountered 常见病,13,603 clinical images from 2,538 patient cases
    • 网络
      • eff-b4,gradually unfroze
      • 7 auxiliary classifiers + 1 final classifiers,element-wise summation
      • 无语子
    • Comparison with dermatologists
      • 280个专家,
      • 用独立的测试集:consists of 200 cases with a clinical image and a dermoscopic image,8类(MN, SK, BCC, EAD, SD, Pso, VW and Rosa),每类25
      • 模型只用皮肤镜图片
      • 都是8选1
  3. 一些精度

    • metrics

    • dataset

    • 皮肤镜图像示例

    • 总体精度

    • 混淆矩阵
