Matting

papers:

[DIM-Matting 2017] Deep Image Matting, the progenitor of deep matting networks, trimap-based

[BGMv2 2021] Real-Time High-Resolution Background Matting, real-time prediction on high-resolution images

[MODNet 2020] Is a Green Screen Really Necessary for Real-Time Portrait Matting, SenseTime, drops auxiliary inputs and predicts alpha directly

[PP-Matting 2022] High-Accuracy Natural Image Matting, Baidu, builds on and improves MODNet

[BiSeNet v2 2020] Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation, a two-branch encoder design where one branch explicitly guides local features; not a matting method itself, but cited by other matting papers

[GCA Matting 2020] Natural Image Matting via Guided Contextual Attention, introduces the GCA block for local guidance

[animal matting 2022] Bridging Composite and Real: Towards End-to-end Deep Image Matting, matting of fur and hair

Bridging Composite and Real: Towards End-to-end Deep Image Matting

  1. introduction

    • segmenting hair and fur usually requires trimap guidance

    • the idea is the usual decomposition into two parallel sub-tasks:

      • high-level semantic segmentation
      • low-level details matting
    • propose GFM

      • Glance and Focus Matting network (GFM)

      • a shared encoder and two separate decoders

      • learn both tasks in a collaborative manner

    • analyzes the gap between composited and real-world images

      • a carefully designed composition route, RSSN
      • provides 2,000 high-resolution animal images and 10,000 portrait images
  2. related work

    • methods

      • sequential: global segmentation first generates a trimap, then local matting follows; the serial pipeline is slow, semantic errors are never corrected downstream, and the two networks are usually trained separately, causing a mismatch
      • parallel: adds a global-guidance auxiliary branch, usually supervised with a coarse matte; but having one network learn the matte over the whole image's fg/transition/bg regions is hard, since the representations and semantics differ greatly across regions
      • this paper, parallel and multi-task: runs the segmentation and matting tasks in parallel

        • global rough segmentation
        • details matting

    • datasets

      • ORI-Track: earlier datasets provide only foregrounds with mattes, or low-resolution portraits with rather inaccurate mattes
      • COMP-Track:
        • datasets used to be constructed by pasting foregrounds onto public datasets such as COCO
        • such composited images suffer from composition artifacts
          • because they differ from the foreground in resolution, sharpness, noise, and illumination
          • the background may also contain salient objects
      • this paper proposes a large-scale, high-resolution clean background dataset (BG-20k)

  3. method

    • overview

      • a segmentation stage: first recognize the salient rough foreground / background
      • a matting stage:focus on the transition areas to distinguish details from the background
      • collaboration

    • shared encoder

      • the input is a single image
      • 5 blocks: E0-E4, output strides 2 to 32
      • DenseNet-121/ResNet-34/ResNet-101
    • Glance Decoder (GD)

      • a large receptive field to learn high-level semantics
      • PPM
      • mirrors the encoder with 5 stacked blocks: D4-D0
        • each of which consists of three sequential 3 × 3 convolutional layers and an upsampling layer
        • each stage's decoder block also receives the PPM output via element-wise sum
      • loss: 2-/3-channel cross-entropy
    • Focus Decoder (FD)

      • aims at low-level structural features in transition areas

      • bridge block (BB)

        • three dilated convolutional layers
        • leverage local context in different receptive fields
        • the E4 and BB features are concatenated as the stage-5 feature
      • mirrors the encoder with 5 stacked blocks: D4-D0

        • UNet style
        • extra shortcuts from the encoder preserve fine details
      • loss

        • alpha prediction loss: absolute difference

        • Laplacian loss: L1 loss

        • both computed only over the unknown transition areas (a Laplacian-loss sketch follows below)
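
A minimal PyTorch sketch of one common Laplacian pyramid loss formulation. The 5-level unweighted pyramid and the binomial blur kernel are assumptions (these notes do not record the paper's exact settings), and the restriction to the transition region via a mask is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def gauss_kernel(channels, device):
    # 5x5 binomial kernel (Gaussian-like), one copy per channel for depth-wise conv
    k = torch.tensor([1., 4., 6., 4., 1.], device=device)
    k = torch.outer(k, k)
    k = k / k.sum()
    return k.expand(channels, 1, 5, 5).contiguous()

def laplacian_pyramid(x, levels=5):
    # decompose x into band-pass (Laplacian) levels plus a low-pass residual
    kernel = gauss_kernel(x.shape[1], x.device)
    pyramid, cur = [], x
    for _ in range(levels):
        blurred = F.conv2d(F.pad(cur, (2, 2, 2, 2), mode='reflect'),
                           kernel, groups=cur.shape[1])
        down = F.avg_pool2d(blurred, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyramid.append(cur - up)   # high-frequency detail at this scale
        cur = down
    pyramid.append(cur)
    return pyramid

def laplacian_loss(pred, gt, levels=5):
    # L1 between corresponding pyramid levels of prediction and ground truth
    return sum(F.l1_loss(p, g) for p, g in
               zip(laplacian_pyramid(pred, levels), laplacian_pyramid(gt, levels)))
```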

    • Representation of Semantic and Transition Areas

      • mode 1: GFM-TT
        • 3-class trimap T: the segmentation GT is obtained by dilating and eroding the ground truth alpha with a kernel size of 25
        • the transition area is defined from the ground truth alpha matte
      • mode 2: GFM-FT
        • 2-class foreground F: erode the ground truth alpha with a kernel size of 50 to obtain F
        • transition area = (gt alpha > 0) - F
      • mode 3: GFM-BT
        • 2-class foreground G: dilate the ground truth alpha with a kernel size of 50 to obtain G
        • transition area = G - (gt alpha > 0) (see the mask-generation sketch after this list)
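
A minimal sketch of these three target constructions with OpenCV morphology, assuming `alpha` is a float ground-truth matte in [0, 1]; the square kernel shape and the exact binarization thresholds are assumptions.

```python
import cv2
import numpy as np

def gfm_targets(alpha, mode='TT'):
    """Build segmentation GT and transition mask from a GT alpha in [0, 1]."""
    fg = (alpha >= 1.0).astype(np.uint8)        # definite foreground (alpha == 1)
    known = (alpha > 0).astype(np.uint8)        # anything with alpha > 0
    if mode == 'TT':                            # 3-class trimap, kernel 25
        k = np.ones((25, 25), np.uint8)
        trimap = np.ones_like(fg)               # 1 = transition by default
        trimap[cv2.dilate(known, k) == 0] = 0   # 0 = definite background
        trimap[cv2.erode(fg, k) == 1] = 2       # 2 = definite foreground
        return trimap, (trimap == 1).astype(np.uint8)
    k = np.ones((50, 50), np.uint8)
    if mode == 'FT':                            # 2-class F, kernel 50
        F = cv2.erode(fg, k)
        return F, known - F                     # F is a subset of known: no underflow
    if mode == 'BT':                            # 2-class G, kernel 50
        G = cv2.dilate(known, k)
        return G, G - known                     # G is a superset of known: no underflow
```
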
    • Collaborative Matting (CM)

      • to generate the final alpha prediction

      • CM modes (see the fusion sketch after this section)

        • in GFM-TT mode: CM replaces GD's prediction inside the transition area with FD's prediction
        • in GFM-FT mode: CM adds the GD and FD outputs
        • in GFM-BT mode: CM uses GD - FD as the final result
      • loss

        • alpha-prediction loss: absolute difference
        • Laplacian loss: L1
        • composition loss: absolute difference
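
A minimal sketch of the three fusion modes, assuming glance outputs are per-class probabilities with background as class 0 and the positive (fg or dilated-fg) class last; these index conventions are assumptions, not the paper's.

```python
import torch

def collaborative_matting(mode, glance_pred, focus_pred):
    # glance_pred: (B, C, H, W) class probabilities, C=3 (TT) or C=2 (FT/BT)
    # focus_pred:  (B, 1, H, W) alpha predicted for the transition area
    cls = glance_pred.argmax(dim=1, keepdim=True)
    if mode == 'TT':                    # classes assumed 0=bg, 1=transition, 2=fg
        alpha = (cls == 2).float()      # fg -> 1, bg -> 0
        alpha = torch.where(cls == 1, focus_pred, alpha)  # transition -> FD output
    elif mode == 'FT':                  # classes assumed 0=bg, 1=eroded fg F
        alpha = (cls == 1).float() + focus_pred           # F + transition alpha
    elif mode == 'BT':                  # classes assumed 0=bg, 1=dilated fg G
        alpha = (cls == 1).float() - focus_pred           # G - residual
    return alpha.clamp(0, 1)
```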

  4. RSSN

BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

  1. introduction

    • propose Bilateral Segmentation Network (BiSeNet V2)
      • treat spatial details and categorical semantics separately
        • a Detail Branch: wide channels and shallow layers
        • a Semantic Branch: narrow channels and deep layers
      • design a Guided Aggregation Layer
        • enhance mutual connections
        • to obtain the fused feature
    • performance

      • on the Cityscapes test set: with 2048x1024 input, 72.6% mIoU at 156 FPS on an NVIDIA GeForce GTX 1080 Ti
    • different backbone

      • dilation backbones (e.g., DeepLab): dilated convolutions enlarge the receptive field while keeping high resolution, but the computation cost is heavy
      • encoder-decoder backbones (e.g., UNet): the skip connections introduce memory access cost, hurting latency
      • this paper's Bilateral Segmentation backbone: two pathways that achieve high accuracy and high efficiency simultaneously

  2. method

    • overview

    • Detail Branch

      • responsible for learning spatial details
      • wide channels, shallow layers, small strides
      • no residual paths, avoiding memory access cost
      • three stages, essentially VGG-like
        • each stage stacks conv-bn-relu layers
        • the first conv of each stage has stride 2
        • outputs a 1/8-resolution feature map
    • Semantic Branch

      • responsible for learning high-level semantics

      • its channel count is a fixed fraction (factor < 1) of the Detail Branch's

      • any lightweight convolutional model can be used

      • large strides, large receptive field

      • stem block

        • quickly downsamples to 1/4 scale, using two different downsampling manners
        • then concat + a fusing conv (see the sketch below)
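
A minimal PyTorch sketch of such a two-path stem block; the channel widths and the 1x1/3x3 layout follow common BiSeNet V2 implementations but are assumptions here.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True))

class StemBlock(nn.Module):
    # downsample to 1/4 scale quickly via two parallel manners, then concat + fuse
    def __init__(self, c_in=3, c_out=16):
        super().__init__()
        self.conv = ConvBNReLU(c_in, c_out, 3, 2)          # 1/2
        self.branch_conv = nn.Sequential(                  # manner 1: strided convs
            ConvBNReLU(c_out, c_out // 2, 1, 1),
            ConvBNReLU(c_out // 2, c_out, 3, 2))           # 1/4
        self.branch_pool = nn.MaxPool2d(3, 2, 1)           # manner 2: max pooling, 1/4
        self.fuse = ConvBNReLU(2 * c_out, c_out, 3, 1)     # fusing conv after concat

    def forward(self, x):
        x = self.conv(x)
        return self.fuse(torch.cat([self.branch_conv(x), self.branch_pool(x)], dim=1))
```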

      • GE block

        • essentially a MobileNet V2 inverted-bottleneck block
        • but when resolution/channels change, instead of dropping the identity path, a dwconv-conv pair is inserted to align it
          • a stride-2 3x3 depth-wise conv to align the resolution
          • a 1x1 conv to align the channels (together a separable conv)
        • meanwhile, in the residual path, the expand layer's 1x1 conv is replaced with a 3x3 conv (3x3 convs are better optimized in cuDNN), and the channel expansion is moved into the depth-wise conv (see the sketch below)
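
A minimal sketch of the stride-2 GE block as described above, reusing the ConvBNReLU helper from the stem sketch; the expansion factor of 6 and the exact BN placement are assumptions.

```python
import torch.nn as nn

class GEBlock(nn.Module):
    # Gather-and-Expansion block, stride-2 variant (reuses ConvBNReLU from above)
    def __init__(self, c_in, c_out, expand=6):
        super().__init__()
        c_mid = c_in * expand
        self.main = nn.Sequential(
            ConvBNReLU(c_in, c_in, 3, 1),     # 3x3 conv instead of a 1x1 expand conv
            nn.Conv2d(c_in, c_mid, 3, 2, 1, groups=c_in, bias=False),  # dwconv does the expansion
            nn.BatchNorm2d(c_mid),
            nn.Conv2d(c_mid, c_mid, 3, 1, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out))
        self.shortcut = nn.Sequential(        # identity path aligned, not dropped
            nn.Conv2d(c_in, c_in, 3, 2, 1, groups=c_in, bias=False),   # stride-2 dwconv: resolution
            nn.BatchNorm2d(c_in),
            nn.Conv2d(c_in, c_out, 1, bias=False),                     # 1x1 conv: channels
            nn.BatchNorm2d(c_out))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))
```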

      • the final CE (Context Embedding) block uses global average pooling to embed the global contextual response

        • the 1x1xC context vector from GAP is broadcast-added onto the 1/32 feature

    • Bilateral Guided Aggregation

      • fuse manners

        • simple sum/concat
        • well-designed operations; this paper uses a bidirectional aggregation method
      • BGA

        • uses the Semantic Branch responses to guide the Detail Branch features (see the sketch below)
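
A minimal sketch of the bidirectional guided aggregation: each branch produces a sigmoid attention map that gates the other branch before the two are summed. The channel count, the exact conv/pool layout, and the upsampling choices follow common implementations and are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BGALayer(nn.Module):
    # d: detail feature (B, C, H/8, W/8); s: semantic feature (B, C, H/32, W/32)
    def __init__(self, c=128):
        super().__init__()
        self.detail_dw = nn.Sequential(    # detail signal kept at 1/8
            nn.Conv2d(c, c, 3, 1, 1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False))
        self.detail_down = nn.Sequential(  # detail signal brought down to 1/32
            nn.Conv2d(c, c, 3, 2, 1, bias=False), nn.BatchNorm2d(c),
            nn.AvgPool2d(3, 2, 1))
        self.sem_conv = nn.Sequential(     # semantic gate, upsampled to 1/8
            nn.Conv2d(c, c, 3, 1, 1, bias=False), nn.BatchNorm2d(c))
        self.sem_dw = nn.Sequential(       # semantic gate at 1/32
            nn.Conv2d(c, c, 3, 1, 1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False))
        self.out_conv = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, 1, bias=False), nn.BatchNorm2d(c))

    def forward(self, d, s):
        up = lambda x: F.interpolate(x, size=d.shape[-2:], mode='bilinear',
                                     align_corners=False)
        left = self.detail_dw(d) * torch.sigmoid(up(self.sem_conv(s)))   # semantics gate details
        right = self.detail_down(d) * torch.sigmoid(self.sem_dw(s))      # details gate semantics
        return self.out_conv(left + up(right))
```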

    • Booster Training Strategy

      • an auxiliary segmentation head is attached to the feature of each Semantic Branch stage
      • each seg head is a 3x3 conv-bn-relu + 1x1 conv, then upsampled to the original scale (see the sketch below)
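
A minimal sketch of one such auxiliary head; since it is a training-time booster, the heads can be discarded at inference. The intermediate channel width is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    # auxiliary head: 3x3 conv-bn-relu + 1x1 conv, upsampled to the input scale
    def __init__(self, c_in, c_mid, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, n_classes, 1))

    def forward(self, x, out_size):
        return F.interpolate(self.conv(x), size=out_size, mode='bilinear',
                             align_corners=False)
```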

PP-Matting: High-Accuracy Natural Image Matting

  1. motivation

    • matting: i.e., foreground extraction; it differs from segmentation in being more precise on details
    • limitations of existing methods:
      • trimap-based methods: require a trimap as auxiliary input
      • trimap-free methods: a sizeable quality gap remains compared with trimap-based methods
    • this paper's method
      • trimap-free architecture (a two-branch architecture)
        • applies a high-resolution detail branch (HRDB): keeps high resolution to preserve local detail
        • a semantic context branch (SCB): a segmentation branch in charge of global semantic segmentation
      • high-accuracy
      • natural image matting
    • test datasets
      • Composition-1k
      • Distinctions-646
  2. arguments

    • task formulation
      • an image $I \in R^{H\times W\times 3}$ can be viewed as a linear combination of foreground and background
        • foreground image $F \in R^{H\times W\times 3}$
        • background image $B \in R^{H\times W\times 3}$
        • $I^i = \alpha^i F^i + (1-\alpha^i)B^i$
        • $\alpha^i$: foreground opacity
      • so matting can be viewed as an alpha matte estimation problem (a one-pixel numeric example follows this list)
      • the image matting problem is highly ill-posed: 7 values must be solved per pixel (3 for $F$, 3 for $B$, 1 for $\alpha$), but only the 3 RGB values of $I$ are known
        • ill-posed problem
        • well-posed problem: must satisfy three conditions, otherwise the problem is ill-posed
          • a solution exists
          • the solution is unique
          • the solution's behavior changes continuously with the initial conditions (stable, no jumps)
        • tasks such as GANs and image super-resolution violate uniqueness (arguably all generative tasks do): "In most cases, there are several possible output images corresponding to a given input image and the problem can be seen as a task of selecting the most proper one from all the possible outputs."
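
A one-pixel numeric illustration of the composition equation (values chosen arbitrarily):

```python
import numpy as np

# I = alpha * F + (1 - alpha) * B on a single pixel
F_px = np.array([0.9, 0.2, 0.1])   # foreground RGB (unknown in practice)
B_px = np.array([0.1, 0.1, 0.8])   # background RGB (unknown in practice)
alpha = 0.3                        # foreground opacity (unknown in practice)
I_px = alpha * F_px + (1 - alpha) * B_px
print(I_px)  # [0.34 0.13 0.59] -- the only 3 observed values,
             # versus 3 + 3 + 1 = 7 unknowns per pixel
```
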
    • methods
      • trimap-based
        • trimap: a rough segmentation of the image into three parts: foreground, background and transition (regions with unknown opacity)
        • fed in parallel with the image as auxiliary information
        • not feasible for video
      • trimap-free
        • multi-stage approaches: use the trimap as an intermediate result chaining two tasks, so errors accumulate
        • end-to-end approaches: a single network predicts the result directly
    • our method
      • high-resolution detail branch (HRDB): keeps high resolution instead of an encoder-decoder, for fine-grained details
      • semantic context branch (SCB): the segmentation subtask, resolving foreground-background ambiguity
      • fusion: produces the final alpha matte
  3. method

    • overview of network architecture

      • two branches: SCB & HRDB
      • shared encoder
      • PPM to strengthen semantic context
      • guidance flow to merge

    • shared encoder

      • need both high-resolution details and high-level abstract semantics
      • HRNet48 pre-trained on ImageNet
    • Semantic Context Branch(SCB)

      • 用来生成trimap(fg/bg/transition)
      • 5 blocks
      • each block:3xConvBNReLU+bilinear upsample
      • 32x 下采样的特征图作为分割分支的输入,加一个PPM,
      • 输出是semantic map,分割目标是3分类的segmentation mask,也就是trimap
    • High-Resolution Detail Branch(HRDB)

      • 3 residual blocks + 1 conv
      • the 2x and 4x feature maps are upsampled to the original resolution and then combined as the branch input
      • intermediate SCB results are also fused into the HRDB as guidance flow, to obtain semantic context
      • the output is the detail map, which focuses on detail representation in the transition region
    • Guidance Flow

      • Gate Convolutional Layers (GCL) generate the guidance map $g \in R^{H \times W}$
        • $g = \sigma (C_{1 \times 1} (s||d))$
        • the semantic map and detail map are first concatenated, then conv-bn-relu & conv-sigmoid
      • the guidance flow and the original flow are merged into the final merged flow $\hat d$
        • $\hat d = (d \odot g + d)^T w $
        • the detail map and guidance map are first multiplied element-wise, as auxiliary information
        • then the detail map is added back
        • finally a channel-wise re-weighting is applied
      • the outputs of SCB blocks 1, 3, and 5 are used for guidance (see the sketch after this list)
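
A minimal sketch of one GCL plus the merge step, assuming the learned channel re-weighting $w$ can be modeled as a 1x1 convolution; the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    # g = sigmoid(conv(s || d)); d_hat = reweight(d * g + d)
    def __init__(self, c_sem, c_det):
        super().__init__()
        self.gate = nn.Sequential(                  # conv-bn-relu & conv-sigmoid
            nn.Conv2d(c_sem + c_det, c_det, 1, bias=False),
            nn.BatchNorm2d(c_det),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_det, 1, 1),
            nn.Sigmoid())
        self.reweight = nn.Conv2d(c_det, c_det, 1)  # channel-wise re-weighting w

    def forward(self, s, d):
        g = self.gate(torch.cat([s, d], dim=1))     # guidance map, (B, 1, H, W)
        return self.reweight(d * g + d)             # merged flow d_hat
```
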
    • Loss Function

      • 3 losses (a loss sketch follows this subsection)

      • [1] semantic loss $L_s$: pixel-level 3-class cross-entropy

      • [2] detail loss $L_d$: the sum of the alpha loss $L_{\alpha}$ and the gradient loss $L_{grad}$, computed only in the transition region

        • the alpha-prediction loss: the absolute difference between the ground truth alpha (subscript g) and the predicted alpha (subscript d)
        • the gradient loss: the difference of per-pixel gradient magnitudes
        • $\epsilon=1e-6$

      • [3] fusion loss $L_f$: comprises the alpha loss $L_{\alpha}$, the gradient loss $L_{grad}$, and the composition loss $L_{comp}$, all based on the final alpha matte

        • the composition loss: the difference between the ground truth RGB values and the predicted RGB values
        • the predicted image $I_p$ is obtained by weighting the ground truth foreground & background with the predicted alpha matte
        • the alpha loss and gradient loss are computed as above, but on a different alpha matte: there it is the detail-map output, here the fusion-map output

      • the final weighted loss:$\lambda_1=\lambda_2=\lambda_3=1.0$
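
A minimal sketch of the alpha-prediction and composition losses in the Charbonnier-style absolute-difference form with $\epsilon = 10^{-6}$; the gradient loss (difference of gradient magnitudes) is omitted for brevity, and the masking convention is an assumption.

```python
import torch

EPS = 1e-6

def alpha_loss(pred, gt, mask=None):
    # smooth absolute difference; `mask` (same shape) optionally restricts it
    # to the transition region
    diff = torch.sqrt((pred - gt) ** 2 + EPS ** 2)
    return diff[mask > 0].mean() if mask is not None else diff.mean()

def composition_loss(pred_alpha, fg, bg, image):
    # re-composite with GT foreground/background and compare to the input image
    comp = pred_alpha * fg + (1 - pred_alpha) * bg
    return torch.sqrt((comp - image) ** 2 + EPS ** 2).mean()

# final objective: L = lambda1 * L_s + lambda2 * L_d + lambda3 * L_f, lambdas = 1.0
```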

  4. experiments

    • Datasets
      • Distinctions-646: the training set contains 596 foreground objects with ground truth alpha mattes, the test set 50
      • Adobe Composition-1k: 431/50
      • during training each foreground is composited onto 100 backgrounds; at test time, 20
    • Implementation Details
      • input images: random crop to [512, 640, 800] / pad to 512, augmented by random [distortion, blurring, horizontal flipping]
      • SGD: momentum=0.9, weight decay=4e-5
      • lr: initial 0.01, polynomial decay with power 0.9 over the 300k iterations
      • batchsize = 4
      • conducted on a single Tesla V100 GPU
    • Evaluation metrics: the lower, the better
      • the sum of absolute differences (SAD)
      • mean squared error (MSE)
      • gradient (Grad)
      • connectivity (Conn)

MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition

  1. motivation

    • existing methods
      • either require auxiliary inputs
      • or involve multiple stages
    • thus we propose MODNet (matting objective decomposition network)
      • light-weight
      • real-time
      • end-to-end
      • single-input
      • portrait matting
    • two novel techniques
      • an Efficient Atrous Spatial Pyramid Pooling (e-ASPP) module: fuses multi-scale features
      • a self-supervised sub-objectives consistency (SOC) strategy: addresses the domain shift problem
    • test speed: 67 FPS at 512x512 input on a GTX 1080Ti GPU
    • datasets: a carefully designed photographic portrait matting (PPM-100) benchmark & the Adobe Matting Dataset
  2. arguments

    • Portrait Matting Approaches

      • trimap-based: a pre-defined trimap serves as a prior
      • multi-stage: a semantic network first generates a pseudo trimap, which is then refined into an alpha matte; with limited datasets these methods suffer from the domain shift problem and generalize poorly to real-world data
      • this method automates both sub-tasks of matting at once, background separation & foreground semantics, and uses consistency between the sub-tasks as self-supervision to improve generalization

    • Image Matting Formulation

      • an alpha matte predicting task:$I^i = \alpha^i F^i + (1-\alpha^i)B^i$
      • ill-posed explanation: in the formula above, every term on the right-hand side is unknown; with 3-channel pixel values that is 3+3+1=7 unknowns per pixel
      • which is why trimap auxiliary input is usually required
        • provides a mask composed of three alpha values, 0/0.5/1
        • absolute foreground (α = 1),
        • absolute background (α = 0)
        • unknown area (α = 0.5)
        • the task is thus reduced to predicting the alpha probability of the unknown region from the pixel information of the known 0/1 regions
      • the matting task relies heavily on low-level features
      • trimap-free matting
        • a semantic estimation step will then be needed to locate the foreground
      • ASPP
        • proved to boost the performance notably in segmentation-based tasks
        • this paper designs an efficient variant of ASPP
      • Consistency Constraint
        • consistency supervision is widely used in semi-/self-supervised learning
        • MODNet imposes consistency among various sub-objectives within a model
  3. 方法

    • overview

      • divided into three parts
        • semantic estimation: foreground vs. background, a 1/16-scale seg map supervised by a coarse alpha
        • detail prediction: boundary details, a 1/4-scale boundary map supervised by the alpha in the boundary region
        • semantic-detail fusion: fuses the two, yielding the final alpha matte prediction
      • Architecture

    • Semantic Estimation

      • the low-resolution branch S

      • uses a MobileNetV2 encoder to predict a coarse mask at 1/16 resolution

      • seg head: 1x1 conv + sigmoid

      • the ground truth is also coarse: the GT matte is likewise 16x downsampled and blurred, removing fine features such as hair strands so the branch concentrates on the overall foreground

      • Efficient ASPP (e-ASPP):

        • standard ASPP fixes holes in the segmented foreground, but at huge computation cost
        • multi-scale extraction with dilated convs + fusion with regular convs; modified in three steps (see the sketch below):
          • split each dilated conv into a depth-wise conv and a point-wise conv
          • swap the order of the point-wise and fusion convs
          • also replace the fusion conv with a cheaper depth-wise conv
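
A minimal sketch of one plausible reading of these three modifications: depth-wise dilated branches, a cheap per-channel scale fusion (realized here as a grouped 1x1 conv), and the point-wise inter-channel conv moved after the fusion. The dilation rates, channel counts, and exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class EfficientASPP(nn.Module):
    def __init__(self, c, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([           # depth-wise dilated convs per scale
            nn.Sequential(
                nn.Conv2d(c, c, 3, padding=d, dilation=d, groups=c, bias=False),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True))
            for d in dilations])
        n = len(dilations)
        # grouped 1x1 conv: fuses the n scale responses of each channel separately
        self.scale_fuse = nn.Conv2d(c * n, c, 1, groups=c, bias=False)
        # point-wise conv for inter-channel fusion, applied after the scale fusion
        self.channel_fuse = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True))

    def forward(self, x):
        scales = torch.stack([b(x) for b in self.branches], dim=2)  # (B, C, n, H, W)
        scales = scales.flatten(1, 2)  # (B, C*n, H, W): each channel's scales adjacent
        return self.channel_fuse(self.scale_fuse(scales))
```
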
      • L2 loss

        • $s_p$: the predicted coarse semantics
        • $G(\alpha_g)$: the coarsened ground truth alpha matte

    • Detail Prediction

      • the high-resolution branch D

      • inputs: the original image I, the Semantic Branch output S(I), and intermediate (low-level) features from the Semantic Branch

      • Branch D is extremely lightweight

        • few layers: 12 convs
        • few channels: 64 maximum
        • does not keep the original resolution: the first layer downsamples to 1/4 and the last two layers restore it; the impact is negligible owing to the skip link
      • L1 loss: focuses on the transition region only

        • $m_d$: the boundary mask; $\alpha_g$ is dilated and eroded to extract the transition area as the boundary mask
        • $d_p$: D(I, S(I)), the branch output
        • $\alpha_g$: the ground truth alpha matte

    • Semantic-Detail Fusion

      • the fusion branch F

      • combine semantics and details

        • upsample S(I)
        • then concat S(I) & D(I, S(I))
        • then convs + sigmoid, giving the final predicted matte
      • L1 loss + compositional loss

        • L1 loss: over the alpha matte of the whole image
        • $L_c$: the absolute difference between the input image I and the composited image, same as PP-Matting Eq. 10, in the root-mean-square form

          • the composited image $I_p$ is computed from the GT fg and bg with the predicted alpha
          • the loss is an L2-style loss

        • $\alpha_p$: the final prediction

    • train end-to-end through the sum of losses above

      • $\lambda_s = \lambda_a = 1$
      • $\lambda_d = 10$

  4. SOC for Real-World Data

    • portrait matting requires excellent labeling in the hair area; the training images are usually photos with blurred backgrounds found on photography websites and then annotated

    • the usual data augmentation is background replacement, but a domain gap with real-world data remains and models tend to overfit the training set

    • utilize sub-objectives consistency (SOC) to adapt MODNet to unseen data distributions

      • MODNet's 3 sub-tasks should have consistent outputs on unlabeled data

      • given an unlabeled image I, predict $s, d, \alpha$

      • enforce the semantics in $\alpha$ to be consistent with $s$

        • reuse the L2 loss from supervised training
        • with the gt alpha replaced by the predicted alpha
      • enforce the details in $\alpha$ to be consistent with $d$

        • reuse the L1 loss in the transition region from supervised training
        • with the gt alpha replaced by the predicted alpha
      • the two consistency losses are summed (see the sketch below)

      • an extra regularization term prevents $\alpha$ from being blurred, which would make the detail term lose fine detail
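
A minimal sketch of the two SOC consistency terms (the extra regularization term is omitted). `coarsen` (16x downsample + blur) and `transition_mask` (dilate/erode) are hypothetical helpers mirroring the supervised pipeline, and the `detach` placement is an assumption.

```python
import torch
import torch.nn.functional as F

def soc_loss(modnet, image, coarsen, transition_mask):
    """Sub-objective consistency on an unlabeled image.

    `modnet` is assumed to return (semantic s, detail d, fused alpha).
    """
    s, d, alpha = modnet(image)
    # semantics in alpha should match s: same L2 loss as supervised training,
    # with the coarsened predicted alpha standing in for the GT alpha
    loss_sem = F.mse_loss(s, coarsen(alpha).detach())
    # details in alpha should match d: same L1 loss, transition region only
    m = transition_mask(alpha.detach())
    loss_det = (torch.abs(alpha - d) * m).sum() / (m.sum() + 1e-6)
    return loss_sem + loss_det
```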