[RVM 2021] Robust High-Resolution Video Matting with Temporal Guidance: ByteDance, temporal (ConvGRU), multi-task
[SparseInst 2022] Sparse Instance Activation for Real-Time Instance Segmentation: Institute of Automation (CAS), overview is somewhat like DETR
Robust High-Resolution Video Matting with Temporal Guidance
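- Motivation
- human video matting
- used for background replacement
- existing techniques are unstable and produce artifacts
- performance
- robust
- real-time
- 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti
- high-resolution
- use a recurrent structure instead of frame-by-frame processing: a temporal network gives better segmentation quality
- propose a novel training strategy: train the matting and segmentation tasks simultaneously, which makes the model more robust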
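- Arguments
- matting formulation recap: a frame $I$ is a blend of foreground $F$ and background $B$, weighted by the alpha matte $\alpha$
- $I = \alpha F + (1-\alpha)B$
- matting methods
- Trimap-based matting: the most classical approach; needs an extra prior (the trimap) and is usually class-agnostic, only separating out the foreground
- Background-based matting: drops the trimap prior and instead requires a background image as the prior
- Segmentation: plain binary semantic segmentation; the human foreground is acceptable, but the background is prone to all kinds of artifacts and is rather unstable
- Auxiliary-free matting: architectures that need no extra input; MODNet focuses on portraits, while this paper focuses on the whole target person
- Video matting:
- MODNet uses the predictions of neighbouring frames to suppress each other's artifacts, but in essence it is still image-independent (per-frame)
- BGM stacks several frames as multiple input channels
- Recurrent architecture: ConvLSTM/ConvGRU
- High-resolution matting
- Patch-based refinement: shrinks the image size to fit the compute budget of the high-resolution task, but …
- Deep guided filter: trainable, modular, converts low-res to high-res end-to-end
- use temporal structure
- temporal information boosts both quality and robustness
- seeing the background change over time lets the model learn background information more robustly and accurately
- introduce a new training strategy
- most matting datasets are synthetic, and data processing also pastes foregrounds onto backgrounds to enlarge the sample set; such images look too fake, leaving a domain gap to real scenes and poor generalization
- some methods try to fix the fake-image problem by pretraining on segmentation first or by adversarial training with real images; the drawback is that they are multi-step
- training matting & segmentation simultaneously does it in one step, with no extra adaptation steps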
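The compositing equation $I = \alpha F + (1-\alpha)B$ from the formulation recap is also what background replacement computes once the network has predicted $\alpha$ and $F$. A minimal NumPy sketch, assuming float images in [0, 1]; the function name `composite` and the toy values are illustrative only:

```python
import numpy as np

def composite(alpha, fg, bg):
    """Blend per pixel: I = alpha * F + (1 - alpha) * B.

    alpha: [H, W, 1] matte in [0, 1]; fg, bg: [H, W, 3] float images.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * fg + (1.0 - alpha) * bg

# Toy 1x3 frame: fully foreground, half transparent, fully background.
alpha = np.array([[[1.0], [0.5], [0.0]]])
fg = np.full((1, 3, 3), 0.8)  # predicted foreground color
bg = np.full((1, 3, 3), 0.2)  # replacement background
frame = composite(alpha, fg, bg)
```

The half-transparent pixel ends up halfway between the two colors, which is why a hard binary segmentation mask (alpha restricted to 0 or 1) tends to show artifacts along soft edges such as hair.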
- Method
- model architecture overview
- encoder: encodes each individual frame's features, MobileNetV3/ResNet50
- recurrent decoder: aggregates temporal information
- a Deep Guided Filter module: high-resolution upsampling
- Feature-Extraction Encoder
- MobileNetV3-Large + LR-ASPP module
- the last block uses dilated convolution
- Recurrent Decoder
- ConvGRU at multiple scales
- bottleneck block: at the x16 level (1/16 scale)
- placed after LR-ASPP
- followed by a ConvGRU with an identity path (channels are split: half pass through unchanged, half go through the GRU)
- then bilinear 2x upsampling
- Upsampling block: at the x8/x4/x2 levels
- at each resolution stage
- first merge (concat) the feature from the previous stage
- then avg pooling and conv-bn-relu to transform the feature
- then ConvGRU with the identity path
- then bilinear 2x upsampling
- Output block: at the x1 level
- makes the final prediction
- first merge
- then [conv3x3-bn-relu] x2
- then conv1x1 heads: 1-channel alpha / 3-channel fg / 1-channel segmentation
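- Deep Guided Filter Module
- given high-resolution videos such as 4K and HD
- first downsample by a factor s
- then feed the result to the network
- finally, four inputs go to the DGF: the network's two outputs (alpha & fg), the hidden feature of the output block, and the original high-resolution frame; it produces the high-resolution alpha and foreground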
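The "ConvGRU with an identity path" step of the recurrent decoder can be sketched in plain NumPy. This is a rough sketch under assumptions, not the authors' implementation: the class name `SplitConvGRU`, the 3x3 kernel size, the random weights, and the exact gate layout (a standard GRU update) are illustrative choices.

```python
import numpy as np

def conv3x3(x, w, b):
    """Same-padded 3x3 conv. x: [C_in, H, W], w: [C_out, C_in, 3, 3], b: [C_out]."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Contribution of kernel tap (dy, dx), summed over input channels.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out + b[:, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SplitConvGRU:
    """Half of the channels bypass the cell (identity path), half go through a ConvGRU."""

    def __init__(self, channels, rng):
        assert channels % 2 == 0
        self.c = c = channels // 2
        # Gates see concat(input half, hidden state) = 2c channels.
        self.w_zr = rng.normal(0.0, 0.1, (2 * c, 2 * c, 3, 3))
        self.b_zr = np.zeros(2 * c)
        self.w_cand = rng.normal(0.0, 0.1, (c, 2 * c, 3, 3))
        self.b_cand = np.zeros(c)

    def step(self, x, h):
        c = self.c
        x_id, x_gru = x[:c], x[c:]          # split channels
        if h is None:                        # first frame: empty memory
            h = np.zeros_like(x_gru)
        zr = sigmoid(conv3x3(np.concatenate([x_gru, h]), self.w_zr, self.b_zr))
        z, r = zr[:c], zr[c:]                # update gate, reset gate
        cand = np.tanh(conv3x3(np.concatenate([x_gru, r * h]), self.w_cand, self.b_cand))
        h_new = z * h + (1.0 - z) * cand     # blend old memory and candidate
        return np.concatenate([x_id, h_new]), h_new

rng = np.random.default_rng(0)
cell = SplitConvGRU(channels=8, rng=rng)
frames = rng.normal(size=(3, 8, 6, 6))       # T=3 frames of [C, H, W] features
h = None
for x in frames:
    out, h = cell.step(x, h)                 # h carries temporal information forward
```

Only the GRU half depends on earlier frames; the identity half keeps the current frame's features untouched, and the split also halves the channels the recurrent cell has to process.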
- Experiments
- training details
- progressive learning: the model sees progressively longer sequences and higher resolutions
- loss:
- matting loss (alpha / fg): L1 & pyramid Laplacian loss + an additional temporal coherence loss
- segmentation loss: BCE
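To make the loss terms above concrete, here is a small NumPy sketch of the L1 term, a temporal coherence term, and BCE. The temporal coherence form used here (mean squared mismatch of frame-to-frame differences) is an assumption about its general shape; the pyramid Laplacian term and any loss weights are left out, and all function names are illustrative.

```python
import numpy as np

def l1_loss(pred, target):
    return np.abs(pred - target).mean()

def temporal_coherence_loss(pred, target):
    """Penalize mismatch in frame-to-frame change. pred, target: [T, H, W]."""
    d_pred = pred[1:] - pred[:-1]
    d_target = target[1:] - target[:-1]
    return ((d_pred - d_target) ** 2).mean()

def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for the segmentation head; prob in (0, 1)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob)).mean()

rng = np.random.default_rng(0)
target = rng.uniform(size=(4, 8, 8))                      # T=4 ground-truth alpha frames
steady = target + 0.1                                     # constant error on every frame
flicker = target + np.array([0.1, -0.1, 0.1, -0.1])[:, None, None]

l1_steady, l1_flicker = l1_loss(steady, target), l1_loss(flicker, target)
tc_steady = temporal_coherence_loss(steady, target)
tc_flicker = temporal_coherence_loss(flicker, target)
```

Both predictions have the same L1 error (0.1), but only the flickering one is penalized by the temporal term, which is the behaviour a per-frame loss cannot capture.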
Sparse Instance Activation for Real-Time Instance Segmentation
- Motivation
- fully convolutional real-time instance segmentation
- in prior work, instance segmentation is usually tied to object detection
- dense anchors
- receptive field fixed by the fixed anchors
- multi-level prediction
- ROI-Align is unfriendly to mobile/embedded devices
- NMS is time-consuming
- this paper
- a sparse set of activation maps: similar to DETR's 100 proposals
- instance-level features are obtained from the attention maps
- the Hungarian algorithm matches proposed instances to the ground truth, which removes NMS and yields sparse predictions
- 40 FPS and 37.9 AP on the COCO benchmark
- repo: https://github.com/hustvl/SparseInst
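The one-to-one matching between proposed instances and ground truth can be illustrated with a toy example. Note the swap: this sketch does a brute-force search over assignments instead of running the Hungarian algorithm, which gives the same optimum on tiny inputs but does not scale; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be used. The cost values and the function name `match_predictions` are made up for illustration.

```python
from itertools import permutations

def match_predictions(cost):
    """Minimum-cost one-to-one assignment by exhaustive search.

    cost[i][j] is the cost of assigning prediction i to ground-truth j,
    with at least as many predictions as ground-truth instances.
    Returns (list of (prediction, ground_truth) pairs, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return [(p, g) for g, p in enumerate(best)], best_total

# 3 predictions, 2 ground-truth instances (lower cost = better match).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.5],
]
pairs, total = match_predictions(cost)
matched = {p for p, _ in pairs}
unmatched = [p for p in range(len(cost)) if p not in matched]
```

Each ground-truth instance is supervised by exactly one prediction, and the leftover prediction (index 2 here) is trained as "no object". Penalizing duplicates this way during training is what lets inference skip NMS.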
- Arguments
- this paper
- IAM: instance activation maps, a sparse set, motivated by CAM
- pixel-level: in contrast, a box still contains background
- global context & single-level
- simple ops: avoids ROI-Align/NMS, whose loop operations cannot be avoided
- bipartite sparse supervision: inhibits redundant predictions, thus avoiding NMS
- recognition and segmentation: downstream tasks are performed on top of the IAM instance features
- overall structure
- encoder: backbone + PPM, giving x8 fused features
- decoder: multi-branch
- instance branch: IAM
- mask branch: semantic segmentation
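- Method
- IAM: Instance Activation Maps
- a basic assumption first: the feature produced by the encoder is redundant
- the IAM op
- an identity branch that passes the original feature through, [b,h,w,d]
- a feature selection branch (conv+sigmoid+norm), [b,h,w,N]
- the two branches are matrix-multiplied to give [b,N,d]: the feature selection branch provides N forms of spatial reweighting of the original feature, which serve as the final attention proposals
- downstream task: recognition and segmentation
- kernel
- class
- score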
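The IAM op and the three heads can be sketched with the shapes listed above. Assumptions, all labeled here and in the comments: a per-pixel linear projection `w_select` stands in for the conv of the feature selection branch, the heads are plain linear layers with random weights, and applying each instance kernel to a mask-branch feature map to get mask logits is my reading of the "kernel" head rather than something these notes spell out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iam_instance_features(feat, w_select):
    """feat: [b, h, w, d]; w_select: [d, N]. Returns ([b, N, d], [b, h*w, N])."""
    b, h, w, d = feat.shape
    flat = feat.reshape(b, h * w, d)              # identity branch, flattened over space
    iam = sigmoid(flat @ w_select)                # selection branch: N activation maps
    iam = iam / iam.sum(axis=1, keepdims=True)    # norm: each map sums to 1 over space
    inst = np.einsum("bpn,bpd->bnd", iam, flat)   # N spatially reweighted feature vectors
    return inst, iam

rng = np.random.default_rng(0)
b, h, w, d, n_inst, n_cls = 2, 8, 8, 16, 5, 3
feat = rng.normal(size=(b, h, w, d))
inst, iam = iam_instance_features(feat, rng.normal(size=(d, n_inst)))

# Hypothetical linear heads on top of the instance features.
class_logits = inst @ rng.normal(size=(d, n_cls))    # class:  [b, N, n_cls]
score = sigmoid(inst @ rng.normal(size=(d, 1)))      # score:  [b, N, 1]
kernel = inst @ rng.normal(size=(d, d))              # kernel: [b, N, d]

# Assumed use of the kernel: dot product with per-pixel mask-branch features.
mask_feat = rng.normal(size=(b, h, w, d))
mask_logits = np.einsum("bnd,bhwd->bnhw", kernel, mask_feat)
```

Every step is a dense matrix product over a fixed number N of instances, so there is no per-box cropping or looping of the kind ROI-Align and NMS require.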