U2Net

Posted on 2023-07-23

2020 paper for SOD (salient object detection)

SOD task: salient object detection predicts a binary mask, i.e. a class-agnostic foreground
draws heavily on the HED paper (Holistically-Nested Edge Detection)
https://github.com/xuebinqin/U-2-Net

U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection

abstract
- proposes ReSidual U-blocks (RSU)
  - mix contextual information from different scales
  - go deeper without significantly increasing the computational cost
  - no pretrained backbone; trained from scratch
- two versions
  - u2net: 176.3 MB, 30 FPS with 320x320x3 input on a GTX 1080Ti GPU
  - u2net tiny: 4.7 MB, 40 FPS
introduction
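- use existing backbones
  - backbones pretrained for image classification mostly retain semantic information and keep little local detail or global contrast information
  - pretrained backbones usually downsample the input by 4x in the first stem (stride-2 conv + max pooling), but high-resolution maps also matter for segmentation tasks
- multi-scale feature
  - 3x3 convs capture local features well
  - but enlarging the kernel size to obtain global features is not an option, since it significantly increases parameters and computation
  - this paper directly extracts multi-scale features stage by stage
- stacking UNets
  - in pose estimation, hourglass-style networks usually stack UNets sequentially as a cascaded refinement
  - this paper stacks them stage by stage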
method
- Residual U-blocks
  - goal: learn more global information already in the high-resolution shallow layers
    - inception-like blocks use different dilation rates, but consume a lot of memory
    - PPM first downsamples the features, runs regular convolutions, then upsamples, but the features recovered this way are degraded
  - RSU three components
    - an input convolution layer that extracts local features
    - a U-Net like symmetric encoder-decoder structure with height of L that extracts multi-scale contextual information
      - larger L leads to a deeper block, more pooling operations, a larger receptive field and richer features
    - a residual connection that adds the local and multi-scale features
  - comparison of plain conv block / residual block / dense block / inception block / RSU block
    - network structure
    - GFLOPs comparison
  - U2Net
    - overview
    - U2Net 3 parts
      - a 6 stage encoder
        
        en1 - en4: use RSU-7, RSU-6, RSU-5 and RSU-4; 7/6/5/4 is the depth
        
        en5 - en6: RSU-4F; "F" means that the RSU is a dilated version, where pooling and upsampling are replaced by dilated convolutions
      - a 5 stage decoder
        
        mirrors the encoder; the input of each stage is the concatenation of en-x and the upsampled output of the previous stage
        
        de5: also the dilated version, RSU-4F
        
        de4 - de1: RSU-4, RSU-5, RSU-6 and RSU-7
      - a saliency map fusion module attached with the decoder stages and the last encoder stage
        
        side6-1: generate six side output saliency probability maps from en6 and de5-1; the prediction head is 3x3 conv + sigmoid
        
        side fuse: upsample the logits from side6-1, concat, then 1x1 conv + sigmoid
  - supervision
    - the total loss sums the losses of side6-1 and side fuse
    - the loss of each saliency map is BCE
    - at inference time the fused map is the final output
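
The supervision described above can be sketched in a few lines. This is a minimal illustration, not the repository's training code: maps are flattened to plain lists, all terms are weighted equally (as the notes state), and the helper names `bce` and `u2net_loss` are made up for this example.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def u2net_loss(side_maps, fuse_map, target):
    """Sum of the BCE losses of the six side outputs plus the fused output."""
    return sum(bce(m, target) for m in side_maps) + bce(fuse_map, target)


# toy example: 2-pixel maps, every output predicts 0.5 everywhere
target = [1, 0]
side_maps = [[0.5, 0.5] for _ in range(6)]
fuse_map = [0.5, 0.5]
loss = u2net_loss(side_maps, fuse_map, target)
```

At inference only `fuse_map` would be kept; the side outputs exist to provide deep supervision during training.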

WinoGrad

Posted on 2023-07-05

FFT
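- via the FFT, convolving two sequences (polynomial multiplication) drops from the naive O(n^2) to O(n log n)
- performance depends entirely on the speed of the Fourier transform and on the convolution parameters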

WinoGrad
- replaces multiplications in the convolution with additions and moves part of those additions into an offline pre-processing of the weights, which reduces the total number of multiplications
- startup: F(2,3)
  - a 1D tensor $d=[d_0,d_1,d_2,d_3]$ and a sliding filter of kernel size 3, $g=[g_0,g_1,g_2]$
  - expanded computation: 6 multiplications and 4 additions
    $F(2,3) = \left( \begin{array}{ccc} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{array} \right) \left(\begin{array}{c} g_0\\ g_1\\ g_2 \end{array} \right)$
  - more general form: $mk$ multiplications and $m(k-1)$ additions
    $F(m,k) = \left( \begin{array}{cccc} x_0 & x_1 & \cdots & x_{k-1} \\ x_1 & x_2 & \cdots & x_k \\ \vdots & \vdots & & \vdots \\ x_{m-1} & x_{m} & \cdots & x_{m+k-2} \end{array} \right) \left(\begin{array}{c} g_0\\ g_1\\ \vdots \\ g_{k-1} \end{array} \right)$
  - the left matrix contains many repeated values; following the derivation at https://zhuanlan.zhihu.com/p/260109670 we get:
    $\left( \begin{array}{ccc} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{array} \right) \left(\begin{array}{c} g_0\\ g_1\\ g_2 \end{array} \right) = \left( \begin{array}{c} n_1+n_2+n_3 \\ n_2-n_3-n_4 \end{array} \right)$

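    * where:
        $$
        n_1 = (d_0-d_2)g_0 \\
        n_2 = (d_1+d_2)(g_0+g_1+g_2)/2\\
        n_3 = (d_2-d_1)(g_0-g_1+g_2)/2 \\
        n_4 = (d_1-d_3)g_2
        $$

        * the terms that depend only on g can be precomputed before inference
        * this leaves 4 multiplications, plus 4 additions on the input side and 4 more to combine the outputs

    * general form: $Y=A^T[(Gg)\odot(B^Td)]$

        * G is the kernel transform matrix, applied to g ahead of time
        * B is the input transform matrix
        * $\odot$ is the element-wise (Hadamard) product
        * A is the output transform matrix

    * general 2D form: $Y=A^T[(GgG^T)\odot(B^TdB)]A$

        * the 2D conv can be tiled into 2x3 1D filters; after merging each 1D F(2,3), the 2D filter becomes an F(2,3) again
        * so the number of multiplications is 4x4=16
        * the plain im2col implementation needs 4x9=36 multiplications

    * engineering implementation

        * step0: obtain the winograd matrices, https://github.com/andravin/wincnn
        * step1: take the G matrix and transform the kernel
        * step2: take the B matrix and transform the input
        * step3: compute the tile matrix M, $M=(GgG^T)\odot(B^TdB)$
        * step4: take the A matrix and compute the result Y, $Y=A^TMA$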
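
As a sanity check of the formulas above, here is a small NumPy sketch of F(2,3) and its 2D nesting F(2x2,3x3). The matrices are the standard ones for F(2,3) and reproduce $n_1 \dots n_4$ from the derivation; the function names are chosen for this example only, and "convolution" means the sliding dot product used in CNNs (no kernel flip).

```python
import numpy as np

# transform matrices for F(2,3)
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=np.float64)   # B^T, input transform
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=np.float64)        # kernel transform
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float64)  # A^T, output transform


def winograd_1d(d, g):
    """Y = A^T [(G g) * (B^T d)]: 4 multiplications for 2 outputs."""
    return AT @ ((G @ g) * (BT @ d))


def winograd_2d(d, g):
    """Y = A^T [(G g G^T) * (B^T d B)] A: 16 multiplications for a 2x2 output tile."""
    U = G @ g @ G.T      # transformed 3x3 kernel, can be precomputed offline
    V = BT @ d @ BT.T    # transformed 4x4 input tile
    return AT @ (U * V) @ AT.T


def direct_1d(d, g):
    """Reference: sliding dot product, 6 multiplications."""
    return np.array([d[i:i + 3] @ g for i in range(2)])


def direct_2d(d, g):
    """Reference: sliding 3x3 window over a 4x4 tile, 36 multiplications."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])
```

Calling `winograd_1d(np.array([1., 2., 3., 4.]), np.array([1., 0., -1.]))` returns the same two values as `direct_1d` on the same inputs.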

Deblurring via Stochastic Refinement

Posted on 2023-07-01

papers

[cvpr 2022] Deblurring via Stochastic Refinement

[SR3] SR3:Image Super-Resolution via Iterative Refinement

overview
- based on conditional diffusion models
- refines the output of a deterministic predictor
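regression-based methods
- usually a UNet that directly regresses the clean image
- trained on blur-clean pairs
- ill posed: If there are multiple possible clean images that correspond to the blurry input, the optimal reconstruction according to the given loss function will be an average of them
- results are usually smooth and fail to recover fine texture; my take is that this is due to the lack of noise sampling, since the restoration starts from a single deterministic latent vector
- UNet regression models the restoration task as a deterministic a-to-b regression, whereas in practice the result is not unique; it is closer to a generative task that draws different samples from a posterior distribution. Put bluntly, the earlier methods are AEs and the current one is a VAE
DDPM
- two core formulas
  - forward $q(x_t|x_0)$: given $x_0$, $x_t$ can be sampled at any time step
  - backward $q(x_{t-1}|x_t,x_0)$: given $x_t$, stochastically denoise a single diffusion step to get $x_{t-1}$
- objective
  - the learned $p_{\theta}(x_{t-1}|x_t)$ should be close to the true reverse diffusion step $q(x_{t-1}|x_t,x_0)$
  - this is done by maximizing a variational lower bound on $\log p_{\theta}(x)$
  - by reparameterizing x, the objective becomes predicting the noise
- Continuous noise level
  - allows us to sample from the model using a noise schedule α1:T different from the one used during training
  - the inference-time noise schedule may differ from the training one without retraining
- Conditional DPM
  - here the condition is the blurry image
  - it is concatenated with the input: x is the 3xHxW noisy/denoised input, y is the 3xHxW blurry prior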

method

overview
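- two networks
  - initial predictor g: provides a data-adaptive candidate for the clean image
  - denoiser network f: only needs to model the residual
- the benefit is that g can be large while f stays small
- when predicting the residual, $x_0$ to $x_T$ are also defined on the residual: $x_0$ in the objective becomes $x_0-g_{\theta}(y)$
sampling algorithm
- run g to get the initial estimate of the clean image, $x_{init}$
- sample the initial input $z_T$ from a normal distribution
- from time step T down to 1:
  - randomly sample a noise term
  - reverse step: obtain $z_{t-1}$
- finally add the initial estimate and the residual to get the final prediction

inference-time sampling

more steps with a lower noise level give images with better perceptual quality
fewer steps with a higher noise level give higher-fidelity images (lower distortion)
grid search over the noise schedule hyperparameters, i.e. DDIM's $\eta$ and $\sigma$
- two hyperparameters:
  - inference steps T from [10,20,30,50,100,200,300,500]
  - noise schedule $\alpha_T$

the residual model needs far less time to sample an image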

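import torch

# assumed hyperparameters; the original snippet leaves them undefined
timesteps = 500
beta1, beta2 = 1e-4, 0.02
device = "cuda" if torch.cuda.is_available() else "cpu"

# construct DDPM noise schedule
b_t = (beta2 - beta1) * torch.linspace(0, 1, timesteps + 1, device=device) + beta1
a_t = 1 - b_t
ab_t = torch.cumsum(a_t.log(), dim=0).exp()    # cumulative product of a_t, computed in log space
ab_t[0] = 1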


def denoise_add_noise(x, t, pred_noise, z=None):
    if z is None:
        z = torch.randn_like(x)
    noise = b_t.sqrt()[t] * z
    mean = (x - pred_noise * ((1 - a_t[t]) / (1 - ab_t[t]).sqrt())) / a_t[t].sqrt()
    return mean + noise     # return sampling p(xt-1|xt)


def denoise_ddim(x, t, t_prev, pred_noise):
    ab = ab_t[t]
    ab_prev = ab_t[t_prev]
    
    x0_pred = ab_prev.sqrt() / ab.sqrt() * (x - (1 - ab).sqrt() * pred_noise)
    dir_xt = (1 - ab_prev).sqrt() * pred_noise

    return x0_pred + dir_xt

A White Paper on Neural Network Quantization

Posted on 2023-06-30

A White Paper on Neural Network Quantization

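Qualcomm white paper
Quantization fundamentals
- basic block: y=Wx+b
- inference is usually bandwidth-bound; loading and storing 32-bit data accounts for most of the energy
- 16 processing elements (PE); using int8 relieves the data bottleneck
- the accumulator usually stays at 32 bit and is quantized to 8 bit after the additions
- fixed-point approximation of a float: $\hat x=s_x x_{int} \approx x$
- $\hat y = \hat b + \sum (s_w w_{int}) (s_x x_{int}) = \hat b + s_w s_x \sum(w_{int} x_{int})$
- common quantization schemes
  - Uniform affine quantization, also called asymmetric quantization
    - quantize: $x_{int} = clamp(round(\frac{x}{s})+z, 0, 2^{b}-1)$
    - dequantize: $x\approx \hat x=s(x_{int}-z)$
    - representable range after dequantization: $[-sz, s(2^b-1-z)]$; values outside it are clipped. A larger s widens the range and reduces the clipping error but increases the rounding error, so there is a trade-off between the two
  - Symmetric uniform quantization
    - drops the zero point
    - quantize (unsigned): $x_{int} = clamp(round(\frac{x}{s}), 0, 2^b-1)$; the signed variant clamps to $[-2^{b-1}, 2^{b-1}-1]$
    - dequantize: $\hat x=sx_{int}$
    - unsigned symmetric quantization suits one-sided activations such as relu outputs
    - signed symmetric quantization suits data distributed roughly symmetrically around zero (e.g. conv-bn)
  - Power-of-two quantizer
    - a special case of symmetric quantization: the zero point is still 0 and the scale factor is restricted to a power of two
    - improves hardware efficiency
    - but the restricted scale factor s can complicate the trade-off between clipping error and rounding error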
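- quantization granularity
  - per tensor: one set of quantization parameters per tensor (weights: $s_w, z_w$; activations: $s_x, z_x$)
  - per channel:
    - for weights, each output channel can have its own quantizer; since the basic block accumulates row by row, this does not hurt efficiency
    - activations are usually not quantized per channel: if every x had its own s and z, the scale could no longer be factored out of $\sum wx$
    - in practice, per-channel quantization of a convolution computes one scale and zero-point per kernel
  - per group: even finer than per channel; finer groups usually improve quantization accuracy at some extra compute cost, and most fixed-point accumulators currently do not support this
- quantization simulation
  - simulate quantized behaviour on the training hardware
    - used for QAT
    - used to test how the network behaves on a quantized device
    - on-device: tensors and weights are int8
    - in simulation: tensors and weights are floats; a quantizer attached to conv simulates weight quantization and a quantizer attached to act simulates activation quantization
    - the quantizer implements the quantization function defined above
    - its input and output are both floats, but the output is restricted to the quantization range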
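- quantizing common layers
  - Batch normalization folding: the bn parameters are fused into the preceding linear layer
  - Activation function fusing:
    - normally the conv-bn output is dequantized and written to memory, then the act layer reads it back in, quantizes it again and processes it
    - relu, however, fuses easily with the conv-bn before dequantization: apply relu directly to the quantized values, which saves one round of memory access and quantization
    - for sigmoid/swish and similar functions, hardware provides dedicated implementations: Taylor-series approximation or lookup tables
  - Max pooling: no activation quantization is needed, since input and output share the same range
  - Average pooling: the mean of integers is not necessarily an integer, so a quantization step is needed after averaging. The same quantizer can be used for input and output, because averaging does not significantly change the range.
  - Element-wise addition: the quantization ranges of both inputs must match exactly. If they do not, extra care is needed for the computation to be correct, and there is no generally accepted solution. Adding a requantization step roughly simulates the added error or noise; alternatively, the quantizers of the inputs can be tied so that they share parameters, which removes that step but may require fine-tuning.
  - Concatenation: the two concatenated branches usually do not share quantization parameters, so their quantization ranges may not overlap and a requantization step may be needed. As with element-wise addition, the network can be fine-tuned so that the branches share quantization parameters.
  - activations usually use asymmetric quantization and weights symmetric quantization: weight distributions are usually centered near 0, and with $z_w=0$ the product WX has no run-time term $s_wz_ws_xx_{int}$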
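
To make the asymmetric quantize/dequantize formulas and the clipping vs. rounding trade-off concrete, here is a small pure-Python sketch. It assumes min-max range setting and unsigned b-bit integers; the function names are illustrative and not taken from the white paper or any library. Note that Python's `round` rounds halves to even, which differs slightly from round-half-up hardware behaviour.

```python
def quant_params(x_min, x_max, bits=8):
    """Min-max scale and zero point for unsigned asymmetric quantization."""
    qmax = 2 ** bits - 1
    # the range must contain 0 so that 0.0 is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / qmax
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, bits=8):
    """x_int = clamp(round(x / s) + z, 0, 2^b - 1)"""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """x_hat = s * (x_int - z)"""
    return scale * (q - zero_point)


scale, zp = quant_params(-1.0, 3.0)
inside = dequantize(quantize(1.234, scale, zp), scale, zp)   # rounding error only, at most scale / 2
outside = dequantize(quantize(10.0, scale, zp), scale, zp)   # clipped to the top of the range
```

Values inside the range are off by at most half a step, while `outside` lands at the upper end of $[-sz, s(2^b-1-z)]$, which is the clipping error described above.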
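PTQ
- data-free / small calibration set
QAT
- STE: straight-through estimator; backprop skips the quantizer by treating its gradient as 1. After the float parameters are updated, s and z of the quantizer are updated with min-max
- LSQ: learnable quantization parameters; differentiates with respect to s and z during training and updates them directly
- standard QAT pipeline
- start from a trained float model
- Cross-layer equalization (CLE): necessary for models with unbalanced weight distributions (e.g. MobileNet architectures).
- Add quantizers: choose the quantizers and insert quantization ops into the network. For a given hardware target, weights usually use symmetric quantizers and activations asymmetric ones; use per-channel quantization for weights if the hardware supports it
- Range estimation: set all quantization parameters with the layer-wise MSE criterion; for per-channel quantization, min-max can sometimes be preferable.
- Learnable Quantization Parameters: learn the quantization parameters directly, updating s and z with their own gradients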

tensorRT

Posted on 2023-06-07

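frameworks
- training frameworks: tensorflow/pytorch; the trained model is converted to pb/onnx
- inference/deployment framework: TensorRT accelerates inference by converting pb/onnx into a hardware-specific file format that runs fast on the target platform
- deployment pipeline: PyTorch - ONNX - ONNX Runtime/TensorRT
deployment difficulties
- dynamic shape
- custom operators
- compatibility between the computation graph (onnx/pb) and the inference engine
ONNX Runtime
- consumes ONNX directly
- multi-platform (Windows, Linux, MacOS, Web Browser, Android, iOS, etc.)
- supports GPU & CPU
- my current understanding: the session lives on the host (cpu) and keeps launching ops on the accelerator (GPU/NPU)
TensorRT
- pipeline
  - parse the onnx with OnnxParser
  - create builder, config, profile
  - serialize network into a plan
  - deserialize the plan to create the engine
  - create a context, similar to a session
  - run: memory transfers between host and device
- plugin
  - a plugin is inserted into the network as a .so, so it cannot be fused with other layers
Triton: https://mp.weixin.qq.com/s/jWZuNKpVM4k5aDe2JmB-Tg
- serving framework open-sourced by NVIDIA in 2018; schedules the inference engines produced by TensorRT and handles inference requests
- starts a server on a local port; pytritonclient sends the data to be predicted to that port for inference

yolo in practice: https://ost.51cto.com/posts/18986

image preprocessing: resize by the long side, padding, normalization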

cv::Mat input_image = cv::imread("dog.jpg");
cv::Mat resize_image;
const int model_width = 640;
const int model_height = 640;
const float ratio = std::min(model_width / (input_image.cols * 1.0f),
                              model_height / (input_image.rows * 1.0f));

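// size of the resized image (aspect ratio preserved)
const int border_width = static_cast<int>(input_image.cols * ratio);
const int border_height = static_cast<int>(input_image.rows * ratio);

// split the padding; the right/bottom side absorbs the odd pixel so the
// padded image is exactly model_width x model_height
const int x_offset = (model_width - border_width) / 2;
const int y_offset = (model_height - border_height) / 2;
cv::resize(input_image, resize_image, cv::Size(border_width, border_height));
cv::copyMakeBorder(resize_image, resize_image, y_offset,
                    model_height - border_height - y_offset, x_offset,
                    model_width - border_width - x_offset,
                    cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));

cv::cvtColor(resize_image, resize_image, cv::COLOR_BGR2RGB);

float *input_blob = new float[model_height * model_width * 3];     // CHW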
const int channels = resize_image.channels();
const int width = resize_image.cols;
const int height = resize_image.rows;
for (int c = 0; c < channels; c++) {
  for (int h = 0; h < height; h++) {
    for (int w = 0; w < width; w++) {
      input_blob[c * width * height + h * width + w] =
          resize_image.at<cv::Vec3b>(h, w)[c] / 255.0f;
    }
  }
}
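
The padding arithmetic of the preprocessing step is easy to get wrong when the leftover size is odd: using the same offset on both sides then yields a 639-pixel side instead of 640. The following Python sketch mirrors only the geometry of the C++ code (no image I/O); `letterbox_params` is a name invented for this example.

```python
def letterbox_params(src_w, src_h, dst_w=640, dst_h=640):
    """Return the resize ratio, the resized size and the (top, bottom, left, right) padding."""
    ratio = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = int(src_w * ratio), int(src_h * ratio)
    left = (dst_w - new_w) // 2
    top = (dst_h - new_h) // 2
    # the remaining pixels go to the right/bottom so the totals always match
    right = dst_w - new_w - left
    bottom = dst_h - new_h - top
    return ratio, (new_w, new_h), (top, bottom, left, right)


# a 1280x722 image is resized to 640x361, leaving an odd 279 rows of padding
ratio, (new_w, new_h), (top, bottom, left, right) = letterbox_params(1280, 722)
```

The same `ratio` and `(left, top)` offsets are needed later to map predicted boxes back to the original image.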

model serialization

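#include <cassert>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <cuda_runtime_api.h>

#include "NvInfer.h"
#include "NvOnnxParser.h"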


// logger
class MyLogger : public nvinfer1::ILogger {
public:
  explicit MyLogger(nvinfer1::ILogger::Severity severity =
                        nvinfer1::ILogger::Severity::kWARNING)
      : severity_(severity) {}

  void log(nvinfer1::ILogger::Severity severity,
           const char *msg) noexcept override {
    if (severity <= severity_) {
      std::cerr << msg << std::endl;
    }
  }
  nvinfer1::ILogger::Severity severity_;
};

// builder
MyLogger logger;
nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(logger);

// network
const uint32_t explicit_batch = 1U << static_cast<uint32_t>(
          nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
nvinfer1::INetworkDefinition *network = builder->createNetworkV2(explicit_batch);

// parse
const std::string model_path = "yolov5m.onnx";
nvonnxparser::IParser *parser = nvonnxparser::createParser(*network, logger);
parser->parseFromFile(model_path.c_str(),
    static_cast<int>(nvinfer1::ILogger::Severity::kERROR));
// print any parser errors
for (int32_t i = 0; i < parser->getNbErrors(); ++i) {
    std::cout << parser->getError(i)->desc() << std::endl;
}

// build config: mem/precision
nvinfer1::IBuilderConfig *config = builder->createBuilderConfig();
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1U << 25);
if (builder->platformHasFastFp16()) {
  config->setFlag(nvinfer1::BuilderFlag::kFP16);
}

// serialize: TensorRT runs its build-time optimizations here
nvinfer1::IHostMemory *serialized_model =
      builder->buildSerializedNetwork(*network, *config);

// save engine
std::stringstream engine_file_stream;
engine_file_stream.seekg(0, engine_file_stream.beg);
engine_file_stream.write(static_cast<const char *>(serialized_model->data()),
                        serialized_model->size());
const std::string engine_file_path = "yolov5m.engine";
std::ofstream out_file(engine_file_path, std::ios::binary);   // the plan is binary data
assert(out_file.is_open());
out_file << engine_file_stream.rdbuf();
out_file.close();

// once the engine file is built, the builder/network/config/parser above are no longer needed
delete config;
delete parser;
delete network;
delete builder;

model deserialization

// runtime deserialize from a serialize object
nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(
    serialized_model->data(), serialized_model->size());

delete serialized_model;
delete runtime;

// or load engine file
const std::string engine_file_path = "yolov5m.engine";
std::stringstream engine_file_stream;
engine_file_stream.seekg(0, engine_file_stream.beg);
std::ifstream ifs(engine_file_path, std::ios::binary);   // read the plan as binary data
engine_file_stream << ifs.rdbuf();
ifs.close();

engine_file_stream.seekg(0, std::ios::end);
const int model_size = engine_file_stream.tellg();
engine_file_stream.seekg(0, std::ios::beg);
void *model_mem = malloc(model_size);
engine_file_stream.read(static_cast<char *>(model_mem), model_size);

nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(model_mem, model_size);

delete runtime;
free(model_mem);

model inference

// the context manages inference
nvinfer1::IExecutionContext *context = engine->createExecutionContext();

// mem prep
void *buffers[2];
// query the model input size and allocate GPU memory
nvinfer1::Dims input_dim = engine->getBindingDimensions(0);
int input_size = 1;
for (int j = 0; j < input_dim.nbDims; ++j) {
  input_size *= input_dim.d[j];
}
cudaMalloc(&buffers[0], input_size * sizeof(float));
// query the model output size and allocate GPU memory
nvinfer1::Dims output_dim = engine->getBindingDimensions(1);
int output_size = 1;
for (int j = 0; j < output_dim.nbDims; ++j) {
  output_size *= output_dim.d[j];
}
cudaMalloc(&buffers[1], output_size * sizeof(float));
// allocate matching CPU memory for the model output
float *output_buffer = new float[output_size]();

// cuda stream, enqueue
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(buffers[0], input_blob,input_size * sizeof(float),   // host to device
                  cudaMemcpyHostToDevice, stream);
context->enqueueV2(buffers, stream, nullptr);
cudaMemcpyAsync(output_buffer, buffers[1],output_size * sizeof(float),   // device to host
                  cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

PETR

Posted on 2023-06-06

prev knowledge

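DETR: DEtection TRansformer, from facebook; its main contribution is the query-based decoder architecture

deformable DETR: uses multi-scale deformable attn, from sensetime; its main contribution is deformable attn, which lowers the computational complexity and speeds up convergence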

PETR: Position Embedding Transformation for Multi-View 3D Object Detection

introduction
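- encodes 3D coordinates into a PE and aggregates it into the image features, yielding 3D position-aware features
- Comparison of DETR, DETR3D, PETR: the difference lies in the PE
  - DETR targets the single-image setting and only uses a 2D PE
  - DETR3D handles the multi-camera setting; its 3D ref points in world coords are projected into each camera's coords through the camera parameters
  - PETR uses a PE built directly from 3D world coords
overview
- a 3D coordinates generator: lifts the 2D image into a 3D meshgrid, then projects the meshgrid to the world coordinate system with the camera parameters
- a 3D position encoder: takes the 2D features and the 3D grid coords and injects the 3D coordinate information into the 2D features, producing 3D position-aware features
- a query generator: does this still need a PE?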
method
- 3D Coordinates Generator
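  - given meshgrid (H,W,D)
  - each meshgrid point has camera-3D coordinates $(ud,vd,d,1)$, where (u,v) are the 2D image coordinates and d is the depth
  - world 3D coord: $p_w = K^{-1} p_c$
  - finally x, y and z are normalized separately
- 3D Position Encoder
  - given 2D features: N maps of shape HWC, first reduced in channels by a 1x1 conv
  - given 3D coords: N maps of shape HW3, first turned into embeddings by an MLP
  - once the channels match, the two are added and fed to the decoder as context
- Query Generator
  - the learnable queries are initialized as anchor points in 3D world space and also encoded into embeddings by an MLP
- Decoder
  - standard decoder from DETR
  - query heads
    - classification: focal loss
    - regression: predicts offsets relative to the anchor query, L1 loss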
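
The 3D coordinates generator step ($p_c=(ud,vd,d,1)$, $p_w=K^{-1}p_c$) can be sketched with NumPy as below. This is an illustration under assumptions: `K` is taken to be a 4x4 matrix that maps homogeneous world points to camera-frustum points, the toy intrinsics are made up, and the per-axis normalization is omitted.

```python
import numpy as np


def frustum_to_world(u, v, depths, K):
    """Lift pixel (u, v) at each depth d to world coordinates via p_w = K^{-1} p_c."""
    d = np.asarray(depths, dtype=np.float64)
    # one homogeneous frustum point (u*d, v*d, d, 1) per depth bin, shape (D, 4)
    p_c = np.stack([u * d, v * d, d, np.ones_like(d)], axis=-1)
    p_w = p_c @ np.linalg.inv(K).T
    return p_w[:, :3]


# toy projection: focal length 100, principal point (50, 40), identity extrinsics
K = np.array([[100.0, 0.0, 50.0, 0.0],
              [0.0, 100.0, 40.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
points = frustum_to_world(70.0, 80.0, [5.0, 10.0], K)
```

Each pixel yields D world points, one per depth bin, which is why the position encoder receives a coordinate volume rather than a single point per pixel.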

LLM finetune

Posted on 2023-06-06

prev knowledge

LLM：take a sequence of words as input and recursively generate next words
- GPT
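- LLaMA: meta's large language model
  - PyLLaMA: community project that needs less GPU memory than the official release, down to 13G
  - llama-int8: int8 variant that only needs 7.12GB of initial GPU memory, at the cost of slower model loading and inference
  - unofficial startup guide: https://github.com/soulteary/llama-docker-playground
finetuning
- RLHF: reinforcement learning of a language model from human feedback
  - pay annotators to write answers (demonstrations) for prompts, then finetune a GPT3 on them (supervised)
  - have several models answer each prompt, let humans rank the answers, and train a reward model to score them (supervised)
  - train the finetuned GPT3 above with reinforcement learning, with a reward function based on the reward model's output
- alpaca: Stanford Alpaca, LLaMA-7B finetuned on 52k instruction-following examples generated with OpenAI's text-davinci-003
- Alpaca-LoRA: builds on alpaca with Low-rank adaptation to cut the compute further (about 5 hours on a 4090); LoRA can also be used to finetune stable diffusion. It freezes the original weights, adds a bypass branch next to the model to capture the task, trains only that branch, and at inference adds the trained parameters onto the original weights
- LIMA stands for "Less is More for Alignment": given a strong pretrained model, a small number of curated examples (about 1,000) is enough to reach high-quality results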
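
The LoRA bypass mentioned above (freeze the weight, train a low-rank side branch, merge it at inference) can be sketched with NumPy. Shapes, initialization and names are illustrative assumptions rather than the Alpaca-LoRA code; zero-initializing `B` makes the bypass a no-op before training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection, starts at zero


def forward_with_bypass(x, W, A, B, scale=1.0):
    """Training-time forward pass: frozen path plus the low-rank bypass."""
    return x @ W.T + scale * (x @ A.T) @ B.T


def merge(W, A, B, scale=1.0):
    """Inference-time weight: the bypass is folded into W, so no extra latency."""
    return W + scale * B @ A


trainable = A.size + B.size   # 48 parameters instead of W.size == 128
```

Only `A` and `B` receive gradients during finetuning, which is where the memory and compute savings come from.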

BEV

Posted on 2023-06-05

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

introduction
- BEVFormer：lookup and aggregate
  - spatial cross-attn：Transformer-based
  - temporal self-attn：with Temporal structure
- the BEV features can support
  - 3D perception tasks such as 3D object detection
  - map segmentation
- Camera-based 3D Perception
  - DETR3D projects learnable 3D queries in 2D images
  - BEV：transform image features into BEV features and predict 3D bounding boxes from the top-down view
method
- Overview
  - the body of the model is 6 stacked encoder layers, each with three parts
    - the running Q is a set of grid-shaped BEV queries
    - first, temporal self-attention with the BEV feature of the previous timestep
    - then spatial cross-attention with the encoded multi-camera features
    - finally an FFN that outputs the BEV feature at the current t
- BEV Queries
  - given the HW BEV plane, the grid-shaped $Q\in R^{C\times H\times W}$ holds one query per grid cell
  - the center of the BEV feature is the ego vehicle
  - add PE
- Spatial Cross-Attention
  - deformable attention: query only interacts with its regions of interest across camera views
  - unlike the original deformAttn, the sampling is not learnable: each query is first lifted into a bin query, 3D reference points are sampled from the bin query and then projected to the 2D views
    - given grid scale s: real-world loc $x'=(x-\frac{W}{2})s$, $y'=(y-\frac{H}{2})s$
    - define a set of anchor heights $\{z_i'\}_{i=1}^{N_{ref}}$, i.e. the heights of the sampling points
    - so every grid query has $N_{ref}$ sampling choices
    - finally the world coords are projected onto the 2D maps with the camera parameters
- Temporal Self-Attention
  - first align the center points
  - then apply deformable attn
  - unlike the original deformAttn, the learnable sampling is not computed from the query alone but from the concatenation of the query and the aligned former query
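
For the reference points used by spatial cross-attention, a tiny pure-Python sketch of the grid-to-world mapping follows. The grid size, scale and anchor heights are example values chosen here, not settings taken from the paper, and the camera projection step is left out.

```python
def bev_reference_points(x, y, W, H, s, anchor_heights):
    """World-space 3D reference points (x', y', z') for the BEV query at grid cell (x, y)."""
    xw = (x - W / 2) * s    # x' = (x - W/2) * s
    yw = (y - H / 2) * s    # y' = (y - H/2) * s
    return [(xw, yw, z) for z in anchor_heights]


# example: 200x200 grid, 0.5 m per cell, 4 anchor heights
heights = [-1.0, 0.0, 1.0, 2.0]
center = bev_reference_points(100, 100, 200, 200, 0.5, heights)
corner = bev_reference_points(0, 200, 200, 200, 0.5, heights)
```

Each of the returned points would then be projected into every camera view, and only the views in which the projection lands inside the image contribute to that query.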
Applications
- given BEV feature CHW
  - 3D Det based on DETR
    - the image features are replaced by the single-scale BEV feature
    - predicts 3D bounding boxes and velocity
    - the 3D box reg loss is L1 only
  - map segmentation based on Panoptic SegFormer
    - essentially the same as the mask2former decoder
    - N learnable queries go through cross-/self-attn, followed by semantic aggregation

FQ-ViT

Posted on 2023-06-02

prev knowledge

deployment techniques
- quantization
- pruning
- distillation
QAT & PTQ
- QAT: Quantization-Aware Training; trains a quantized model to match the accuracy of the float model, which requires retraining
- PTQ: Post-Training Quantization; reduces the quantization error by determining the value range of each quantized layer
  - without finetuning: round (floor/ceil) the calibrated quant weights directly
  - finetuning: e.g. Adaround, where the calibrated quant weights adaptively learn whether to add +1 or +0
LayerNorm

PTQ

given the original input $x$
quantization
- symmetric: $x_q = clip(round(\frac{x}{s}), -2^{b-1}, 2^{b-1}-1)$
- asymmetric: $x_q = clip(round(\frac{x}{s})+zp, -2^{b-1}, 2^{b-1}-1)$
dequantization
- symmetric: $\hat x=x_q s$
- asymmetric: $\hat x=(x_q-zp) s$
- ideally $\hat x \approx x$
matmul
- symmetric: $\hat Y = (x_q w_q)s_xs_w$
- asymmetric: $\hat Y = (x_q-zp_x)s_x(w_q-zp_w)s_w=s_xs_w(x_qw_q +zp_xzp_w - zp_wx_q-zp_xw_q)$
  - the two terms containing $x_q$ are dynamic
  - one more matrix multiplication than symmetric quantization
CUDA programming
- the GPU is an external device attached to the CPU, which acts as the host
  - prepare data, image.cuda()
  - launch kernel; the gpu performs the actual op computation
  - return results, result.cpu()
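- memory structure
  - pyramid hierarchy
  - register: thread-private mem
  - local memory: backup storage for register spills
  - shared memory: among a block
  - global memory: device memory
- Cuda Core
  - standard floating-point unit, one operation per cycle
  - process in parallel
- Tensor Core
  - optimized for specific ops, such as 4x4 GEMM
  - switching from float16 to int8 doubles the compute per cycle
- Fake Quant
  - the Q of X is usually per tensor, the Q of W per channel
  - the Q of W can be computed offline
  - the DQ + float conv-relu part can be fused into a QConvRelu whose input and output are both int8, so it can run on Tensor Cores for acceleration
- FP16 & Int8
  - with a small bs the data-movement bottleneck has not been reached yet
  - int8 only pays off noticeably at large bs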

ViT

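PTQ method1
- quantize the matmuls: QK, attnV, mlp
- similarity loss: Pearson correlation coefficient, $sim(Y, \hat Y)$
- ranking loss: supervises the relative order within attn
- alternating search
  - first compute the calibration of X and W
  - fix the calibration of X, sample calibrations of W, keep the best $s_w$
  - fix the calibration of W, sample calibrations of X, keep the best $s_x$
  - alternate
PTQ method2
- LN
  - BN vs. LN: BN stores per-channel statistics, whereas LN computes per-tensor statistics dynamically, so it cannot be folded into conv and must be quantized separately
  - the LN input has a wide value range: either outliers are lost, or the range is widened and small values lose quant precision
  - per-channel quantization would address this, but the value ranges recovered at DQ are inconsistent, so floating point is still needed to avoid overflow
  - expressing the channel scale as a power of 2 lets the rescale be replaced by a bit shift
- Softmax
  - log2 enlarges the quant range of small values
  - i-exp turns softmax into integer arithmetic, so the preceding quantized result can be fed in directly: $softmax(s \cdot x_q)$
  - DQ after log2 quantization can be viewed as
    - $x_q=-log_2(x)$
    - $\hat x=2^{-x_q}$, so the attnV matmul can be done with bit shifts

pipeline

model to Qmodel: trace the static graph nodes

qconfig configuration

build qmodel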

import torch
from sparsebit.quantization import QuantModel, parse_qconfig
from model import resnet18

model = resnet18(num_classes=10)
model.load_state_dict(torch.load('pretrained.pth'))

# eval float model
# ...

# build quant model
qconfig_file = "qconfig.yaml"                 # W8A8
qconfig = parse_qconfig(qconfig_file)
qmodel = QuantModel(model, config=qconfig)

calibration: compute qparams

PTQ

# Set calibration
qmodel.prepare_calibration()
# Forward Calibrate
calibration_size = 256
cur_size = 0
if torch.cuda.is_available():
    qmodel.cuda()
for data,target in trainloader:
    if torch.cuda.is_available():
        data,target = data.cuda(),target.cuda()
    res = qmodel(data)
    cur_size += data.shape[0]
    if cur_size >= calibration_size:
        break
qmodel.calc_qparams()

QAT training

qmodel.init_QAT()    # API call that initializes QAT
qmodel.set_lastmodule_wbit(bit=8)    # separately set the weight bit width of the last layer
print(qmodel.model)    # the printed model shows the quantization scale and zero point of every layer's weights and activations

# train like a float model; fake quantization is handled inside QuantModel
# Q/DQ nodes simulate quantization loss and add it to the training loss during fine-tuning

    * obtain the quant model and export onnx

# Set Quantization
qmodel.set_quant(w_quant=True,a_quant=True)
correct = 0
total = 0
qmodel.eval()
with torch.no_grad():
    for data in testloader:
        image,labels = data
        if torch.cuda.is_available():
            image,labels= image.cuda(),labels.cuda()
        outputs = qmodel(image)
        _,predicted = torch.max(outputs.data,1)
        total+=labels.size(0)
        correct += (predicted == labels).sum().item()
acc1 = 100 * correct / total
print(f'Accuracy of the Quant Model on the 10000 test images: {acc1} %')

# export onnx
qmodel.export_onnx(torch.randn(1, 3, 224, 224), name="qresnet20.onnx")



9. homework

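    * accuracy drop after PTQ for resnet18, vgg16 and mobileNetv2 on imagenet

        * Resnet18<vgg16<mobileNetv2
        * mobilenetv2 drops the most mainly because of its depthwise separable convolutions: the dynamic ranges of different Depthwise Conv output channels differ a lot, so per-tensor quantization introduces large quantization error and a severe accuracy loss; per-channel quantization mitigates it.

    * when the calibration-set images are replaced with standard Gaussian noise and the calibration-set size is 1, 10 or 100, why is the accuracy not zero or very low?

        * the Normalization step in image preprocessing makes the data approximately N(0,1), similar to random Gaussian noise

    * moving average calibration with alpha from [0.5, 0.9, 0.99]

        * global MinMax calibration beats moving average MinMax, because with a limited number of steps the two are not close

        * the larger alpha is, the worse the result

    * latency of int8 vs. float16 models

        * their speed on GPU does not differ much; the main advantage of int8 is data transfer: at the same bandwidth, twice as many Int8 values are transferred as FP16
        * with a small BatchSize the bandwidth is not fully used, so Int8 and FP16 Throughput on GPU are close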

    * QAT resnet18 with 4w4f & 2w4f

FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer

introduction
- difficulties in quantizing ViT
  - serious inter-channel variation in LayerNorm inputs: some channels reach 40x the mean; ordinary quantization methods cannot handle this fluctuation, which leads to large quantization error
  - extreme non-uniform distribution in attention maps: most units of the attn map lie in 0-0.01 and only a few high-attention values are close to 1; the very small values lose a lot of precision when quantized
- propose Fully Quantized Vision Transformer (FQ-ViT)
  - PTF: Power-of-Two Factor; reduces the quantization error of LN at the same compute cost as layer-wise quantization
  - LIS: Log-Int-Softmax; provides higher quantization resolution for small values
  - simplify 4-bit quantization by the BitShift operator
method
- quantization preliminary
  - quantization $Q(X|b)$ maps a floating-point value $X\in R$ to the quantized set $q$
  - b is the bit width, and the quantized set q is
    - signed: $\{-2^{b-1}, \dots, 2^{b-1}-1\}$
    - unsigned: $\{0,1,2,\dots, 2^b-1\}$
  - the most common quantizers Q are uniform and log2
    - Uniform Quantization
      - $Q(X|b) = clip(round(\frac{X}{s})+zp,0,2^b-1)$
      - the two quantization parameters are usually determined by the distribution of X, given lower bound $l$ & upper bound $u$
        
        scale $s=\frac{u-l}{2^b-1}$
        
        zero-point $zp=clip(round(-\frac{l}{s}), 0, 2^b-1)$
    - Log2 Quantization
      - non-linear mapping
      - $Q(X|b) = sign(X)\cdot clip(round(-log_2\frac{|X|}{max(|X|)}), 0, 2^b-1)$
      - enlarges the quantization region of small values: large X in [0.5,1] maps to a log2 range of [0,1], while small X in [0,0.5] maps to [1,inf); log2 thus gives small values more resolution, although extremely small values still lose precision
  - this paper fully quantizes ViT
    - uniform MinMax quantization is used for Conv, Linear and MatMul
    - following new proposed methods are used for LayerNorm and Softmax
- Power-of-Two Factor for LayerNorm Quantization
  - LayerNorm: $LN(X)=\frac{X-\mu_X}{\sqrt{\sigma_X^2 + \epsilon}} \cdot \gamma + \beta$
    - dynamic computing of mean & variance $\mu_X$ and $\sigma_X^2$
    - re-affine with the trained $\gamma$ and $\beta$
    - LN normalizes dynamically, so it cannot be merged into the preceding linear layer
    - look into the LN inputs: there is a serious inter-channel variation; the overall input range is wide and the per-channel min/max differ a lot
    - the extreme inter-channel variation causes large quant error under layer-wise quantization
      - group-wise quant and channel-wise quant: introduce more mean and var computation
      - Power-of-Two Factor (PTF): only introduces a channel-wise factor
  - $X_Q = Q(X|b) = clip(round(\frac{X}{2^\alpha s})+zp, 0, 2^b-1)$
    - $s=\frac{max(X)-min(X)}{2^b-1} / 2^K$
    - $zp=clip(round(-\frac{min(X)}{2^K s}),0,2^b-1)$
    - the new per-channel factor $\alpha_c= \arg\min_{\alpha_c \in \{0,1,\dots,K\}} ||X_c - round(\frac{X_c}{2^{\alpha_c} s}) \cdot 2^{\alpha_c} s||_2$
    - hyperparam K: default K=3, adjustable as needed; it controls the range of the scaling factor and thus the precision range that is kept. If $X_c$ varies a lot, the corresponding $\alpha_c$ should be larger
  - $X_Q$ is the quantized input of LN, so the mean & var of LN can also be computed in the integer domain from it
    - first undo the zero point and the per-channel factor: $\hat X_Q = (X_Q - zp) << \alpha$
    - the input is then recovered as $X \approx s \cdot \hat X_Q$
    - $\mu (X) \approx \mu (s \hat X_Q) = s \cdot \mu(\hat X_Q)$
    - $\sigma (X) \approx \sigma (s \hat X_Q) = s \cdot \sigma (\hat X_Q)$
    - this gives the pre-quantization LN output from integer-domain statistics: $Y=\gamma s \frac{\hat X_Q - \mu(\hat X_Q)}{\sqrt {s^2 \sigma^2(\hat X_Q) + \epsilon}} + \beta$
  - quantizing the LN layer
    - given quant params $s_{out}$ & $zp_{out}$
    - $Y_Q = \frac{Y}{s_{out}} + zp_{out} = \frac{\gamma s }{s_{out} \sqrt {s^2 \sigma^2(\hat X_Q)+\epsilon}} \hat X_Q + \frac{\beta\sqrt {s^2 \sigma^2(\hat X_Q)+\epsilon} - \gamma s \mu(\hat X_Q)}{s_{out} \sqrt {s^2 \sigma^2(\hat X_Q)+\epsilon}} + zp_{out} = A \hat X_Q + B + zp_{out}$
    - A is approximated as
      - given target bit width b
      - $N_1=b-1-\lfloor log_2|A| \rfloor$
      - $N_2= \lfloor |A| 2^{N_1} \rfloor$
      - $A \approx sign(A) \cdot \frac{N_2}{2^{N_1}}$
    - final quantized inference: $Y_Q=\frac{sign(A) \cdot {N_2} \hat X_Q + B \cdot 2^{N_1}}{2^{N_1}} + zp_{out}$
- Log-Int-Softmax for Softmax Quantization
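  - look into the attn map: a distribution centering at a fairly small value, with a few outliers near 1
  - log2 quant
    - first, softmax already normalizes its output
    - second, the following attn*V_Q matmul can also be implemented with bit shifts
  - $Attn_Q = Q(Attn|b) = clip(round(-log_2{Attn}), 0, 2^b-1)$
  - after quant matmul
    - $Attn \cdot V_Q = 2^{-Attn_Q} V_Q = V_Q >> Attn_Q$
    - a right shift loses precision, so it is rewritten as a left shift: $V_Q>>Attn_Q = \frac{V_Q << (N-Attn_Q)}{2^N}$, $N=2^b-1$
  - optimizing the softmax computation: LIS, Log-Int-Softmax
    - uses i-exp, a polynomial approximation of the exponential function
    - $exp(s X_Q) \approx s' \cdot i\text{-}exp(X_Q)$
    - Log-Int-Softmax: $LIS(s \cdot X_Q) = N- log_2 \frac{\sum i\text{-}exp(X_Q)}{i\text{-}exp(X_Q)}$
      - log2 of an integer: find the first 1 in the bit pattern, then accumulate the values after it
      - i-exp is the approximation below
        
        softmax(X) is first shifted: $\hat X=X-max(X)$, so the input becomes non-positive
        
        decompose exp: $exp(\hat X) = 2^{-z}exp(p) = exp(p)>>z$
        
        $z=\lfloor \frac{-\hat X}{ln2} \rfloor$
        
        $p = \hat X + z \cdot ln2 \in (-ln2, 0]$
        
        approximate exp(p): $L(p) = 0.3585(p+1.353)^2 + 0.344 \approx exp(p)$
        
        $exp(\hat X) \approx L(p)>>z$
        
        normalizing $exp(\hat X)$ gives the integer softmax
      - attn is quantized to int4 to save computation
  - comparison
    - without quantizing softmax, the QK output must be dequantized, moved to the cpu for floating-point computation, quantized again and moved back to the GPU/NPU
    - the proposed method stays on the GPU and in the integer domain throughout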
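
The exponential decomposition behind i-exp can be checked numerically. The sketch below works in floating point to show the range reduction and the quality of the polynomial; the real kernel does the same steps on integers with a right shift. The function names are chosen for this example.

```python
import math

LN2 = math.log(2.0)


def poly_exp(p):
    """Second-order polynomial L(p), approximating exp(p) for p in (-ln2, 0]."""
    return 0.3585 * (p + 1.353) ** 2 + 0.344


def approx_exp(x_hat):
    """exp(x_hat) for x_hat <= 0, computed as L(p) * 2**(-z)."""
    z = math.floor(-x_hat / LN2)      # number of halvings
    p = x_hat + z * LN2               # remainder in (-ln2, 0]
    return poly_exp(p) / (2 ** z)     # the integer version shifts right by z


def approx_softmax(xs):
    """Softmax built on approx_exp, after subtracting the maximum."""
    m = max(xs)
    exps = [approx_exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = approx_softmax([2.0, 1.0, -3.0])
```

Because the remainder `p` always falls into a narrow interval, a single fixed quadratic is accurate enough, and the factor `2**(-z)` is what becomes a bit shift in integer arithmetic.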

GPU Computing

Posted on 2023-05-19

background

GPU Computing：the principal motivation for using the GPU is the prospect of high performance at a relatively low cost
CUDA：CUDA provides a flexible programming model and C-like language for implementing data-parallel algorithms on the GPU
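FLOPS: floating-point operations per second
- https://www.zhihu.com/zvideo/1421421171497304064?playTime=463.3
- compute intensity
- memory latency
  - our GNPU runs at 0.75V/900M, so one clock cycle is roughly 1ns and a softmax that takes 100k cycles has a compute latency of about 0.1ms
  - chip size: chips are kept small to shorten the distance the current travels and thus achieve lower latency
  - memory bandwidth: e.g. with a bandwidth of 131G/sec and a latency of 89ns, at most 11659 bytes can be transferred within one latency window; if the program only requests an f64 x and an f64 y, i.e. 16 bytes, the memory utilization is 0.14%
- threads
  - break the task into highly concurrent pieces, e.g. by decomposing a matrix multiplication
  - GPUs are designed around supporting a huge number of threads with very fast thread switching
- reg & cache
  - GPUs keep data flowing efficiently by providing a large number of registers
  - storage hierarchy: reg/cache/shared mem/global mem
- Cuda Core & Tensor Core
  - both are basic compute units
  - Cuda Cores execute scalar floating-point/integer instructions; Tensor Cores execute small matrix multiply-accumulate operations, typically in reduced precision (FP16, INT8, ...)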
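
The bandwidth example in the memory latency notes is a short piece of arithmetic, reproduced below with the quoted figures (131 GB/s, 89 ns, two f64 operands). The function names are made up for this sketch.

```python
def bytes_in_flight(bandwidth_bytes_per_s, latency_s):
    """Bytes the memory system could deliver during one latency window."""
    return bandwidth_bytes_per_s * latency_s


def utilization(bytes_requested, bandwidth_bytes_per_s, latency_s):
    """Fraction of the deliverable bytes that the program actually asked for."""
    return bytes_requested / bytes_in_flight(bandwidth_bytes_per_s, latency_s)


window = bytes_in_flight(131e9, 89e-9)       # about 11659 bytes
used = utilization(2 * 8, 131e9, 89e-9)      # two f64 values = 16 bytes
percent = round(used * 100, 2)               # 0.14
```

Keeping the memory system busy therefore requires thousands of bytes in flight at once, which is the motivation for running very many threads concurrently.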
CUDA kernel：https://www.youtube.com/watch?v=jYCxVirq4d0
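- the code of a cuda kernel describes what one thread does
- but it is organized at two levels: (1) the thread block, whose threads share memory, and (2) the launching grid made of thread blocks. The GPU memory hierarchy mirrors this: each thread's private mem is its local registers, each block has shared memory, and tensor.cuda() places data in global memory
- a kernel executes a grid
- threads must sync when using shared memory: the block only finishes once all of its threads have finished
- the degree of parallelism depends on the operator:
  - if c[i]=f(c[i-1]), e.g. a cumulative sum, it is hard to parallelize
  - if it is ele-wise, it parallelizes very well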
// CUDA kernel for C = A + B; each thread handles one element
__global__ void VecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

CPU host

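code running on the cpu launches the CUDA kernel
and specifies/computes the number of threads and blocks

// A, B and C must point to device memory
void VecAddCPU(float* A, float* B, float* C, int n) {
    int threads_per_block = 512;
    // integer ceiling division, so a partially filled last block is still launched
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    VecAddKernel<<<nblocks, threads_per_block>>>(A, B, C, n);
}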
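
The block-count arithmetic and the `i < n` guard can be checked without a GPU. The Python sketch below emulates the launch with two nested loops (block index, thread index); it is a sequential stand-in for illustration, not how CUDA schedules work. With integer operands, `n / threads_per_block` truncates before any `ceil` is applied, so ceiling division has to be done on integers.

```python
def num_blocks(n, threads_per_block):
    """Ceiling division on integers: enough blocks to cover all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_vec_add(a, b, threads_per_block=512):
    """Sequentially emulate the grid of VecAddKernel over Python lists."""
    n = len(a)
    c = [0.0] * n
    for block_idx in range(num_blocks(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = threads_per_block * block_idx + thread_idx   # global thread index
            if i < n:                                        # guard for the partial last block
                c[i] = a[i] + b[i]
    return c


a = [float(k) for k in range(1000)]
b = [1.0] * 1000
c = emulate_vec_add(a, b)
```

For n = 1000 and 512 threads per block, truncating division would launch a single block and leave the last 488 elements untouched, while `num_blocks` launches two.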

CPU & GPU

#include <stdio.h>

// runs on the cpu
int main(void)
{
    printf("Hello World from CPU\n");
    return 0;
}


// runs on the gpu
__global__ void HelloFromGPU(void)
{
    printf("Hello from GPU\n");
}
// launched from the cpu
int main(void)
{
    printf("Hello from CPU\n");
    HelloFromGPU<<<1, 5>>>();
    cudaDeviceReset();
    return 0;
}

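__global__: qualifies HelloFromGPU as a kernel function; it is called asynchronously from the host, executes on the device, and returns void
__device__: the function is called from the device and executes on the device, for GPU-internal calls
__host__: the function is called from the host and executes on the host, i.e. the default CPU behaviour, so this qualifier can be omitted
<<<...>>>: marks a kernel call from host to device; the arguments inside are the execution configuration, and in the example <<<1, 5>>> launches one block of 5 threads, so the kernel body runs 5 times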

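Streaming Multiprocessors (SM)
- The GPU architecture achieves hardware parallelism by replicating this building block
  - One GPU contains multiple SMs, and each SM supports many concurrently executing threads
  - CUDA uses Single-Instruction Multiple-Thread (SIMT) to manage and execute the many threads on the GPU
    - Two-level thread hierarchy
    - All threads of one kernel form a thread grid, and all of them share the global memory space
    - A grid consists of multiple thread blocks; a block contains a group of threads, and threads in the same block cooperate through synchronization and shared memory
    - Threads in different blocks cannot cooperate
  - When the host launches a kernel, the thread blocks of its grid are distributed to available SMs. Multiple threads execute concurrently on an SM, and multiple thread blocks can run concurrently on one SM; when a thread block terminates, a new thread block can start on the freed SM
- SM core components
  - cores
  - shared memory / L1 cache
  - register file
  - load/store units
  - special function units
  - warp scheduler
- Parallelism
  - Task parallelism: split a task into subtasks executed by multiple threads; grid-level concurrency is task parallelism
  - Data parallelism: each thread processes its own piece of data while performing the same computation; thread-level parallelism is data parallelism
- Warps
  - CUDA uses the SIMT architecture to manage and execute threads: the threads of a block are partitioned into groups of 32, and each group is called a warp
  - The warp size, warpSize, is a built-in variable in CUDA
  - The warp is the basic execution unit of the GPU
  - Once a thread block is scheduled onto an SM, its threads are further partitioned into warps; all threads in a warp execute the same instruction, and each thread has its own instruction address counter and register state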
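The warp partitioning described above can be illustrated with a short sketch. It assumes the warp size of 32 stated in the notes; `split_into_warps` is a hypothetical helper, not a CUDA API.

```python
WARP_SIZE = 32  # warp size stated in the notes above


def split_into_warps(block_size, warp_size=WARP_SIZE):
    """Return the range of thread indices covered by each warp of one block."""
    warps = []
    for start in range(0, block_size, warp_size):
        end = min(start + warp_size, block_size)
        warps.append(range(start, end))
    return warps


warps = split_into_warps(80)
for warp_id, lanes in enumerate(warps):
    print(f"warp {warp_id}: threads {lanes.start}..{lanes.stop - 1} ({len(lanes)} active)")
```

A block of 80 threads needs three warps, and the last one has only 16 active threads, which is one reason block sizes are usually chosen as multiples of 32.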
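Programmable memory types in CUDA
- overview
  - Registers
  - Local Memory
  - Shared Memory
  - Constant Memory
  - Texture Memory
  - Global Memory
- Private to a thread: Registers & Local Memory
- Shared by the threads of one block: Shared Memory
- The rest are shared globally
- Registers
  - Variables declared in a kernel function without any other qualifier, such as the thread index variable i in VectorAddGPU
  - Hold thread-private variables that the kernel accesses frequently
  - Variables have the same lifetime as the kernel function
  - Registers are the fastest memory space on the GPU
  - Register spilling: once a kernel uses more registers than the hardware limit allows, the excess is placed in local memory instead
- Shared memory
  - Variables qualified with __shared__ are stored in shared memory
  - Their lifetime is that of the thread block
  - Threads cooperate with one another through data in shared memory
  - When threads read data written by other threads of the block, they must synchronize first by calling: void __syncthreads()
- Constant memory
  - Constant variables are qualified with __constant__
  - They must be declared at global scope, outside all kernel functions
  - Suited to the case where all threads of a warp read from the same memory address
- Global memory
  - The largest and highest-latency memory space on the GPU; its scope and lifetime are global
  - Accessible from any SM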
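GPU caches
- Cache:
  - A cache sits between the CPU and main memory and holds copies of frequently accessed data from memory
  - The CPU looks for data or instructions in this order: first the L1 cache, then the L2 cache, and finally main memory
- overview
  - L1 cache
  - L2 cache
  - read-only constant cache
  - read-only texture cache
- Each SM has its own L1 cache; all SMs share one L2 cache
- The L1 and L2 caches store data from local memory and global memory, including spilled registers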
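Host memory: https://mp.weixin.qq.com/s?__biz=MzkwNjE2ODMyMQ==&mid=2247484370&idx=1&sn=a23ddab4c1f7bff4f221f7ae78048d0f&chksm=c0edd2bcf79a5baa9889b06e721f2c70d62bcc489490e8bb3fdc3a74e4ad77803c447351645f&scene=178&cur_album_id=2132879019699240961#rd
- Pageable memory: host data lives in pages that the OS may move or swap out, so the device cannot safely access it directly; for a transfer, CUDA first allocates a temporary page-locked buffer and stages the data through it
- Pinned memory: CUDA can allocate page-locked host memory directly (cudaMallocHost); it needs no intermediate staging step and can be accessed by the device directly, so it can be read and written at high bandwidth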
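CUDA Stream
- A CUDA stream is a sequence of asynchronous operations; operations within the same stream run on the GPU strictly in order
- Launching several kernels in several streams at once gives grid-level concurrency, i.e. task parallelism
- By explicitly creating multiple streams for data copies, kernel computation and so on, different CUDA operations can overlap
- Create an explicit stream:
  cudaStream_t stream;
  cudaStreamCreate(&stream);
- Destroy a stream:
  cudaError_t cudaStreamDestroy(cudaStream_t stream);
- Copy data asynchronously:
  cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);
- Launch a kernel in a given stream:
  kernel_name<<<grid, block, sharedMemSize, stream>>>(...);
- Consider matrix multiplication
  - If the task is two matrix multiplications, each one can be bound to its own stream
  - Each stream computes asynchronously, and the results are then synchronized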
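TensorRT
- NVIDIA TensorRT is a high-performance neural network inference engine for deploying deep learning applications in production, built to maximize inference throughput and efficiency
- TensorRT provides the following features
  - Quantization
    - QAT: Quantization Aware Training, i.e. quantization during training, done by inserting fake int8 quantization nodes into floating-point training
    - PTQ: Post-Training Quantization, i.e. quantization after training: run some sample images to analyse the output distribution of each activation layer, then add scales to the existing model to quantize it. TensorRT's PTQ prioritizes speed, and the precision of an individual layer cannot be controlled
  - Kernel Auto Tuning: a convolution, for instance, has several possible implementations, and TensorRT picks the most efficient one for the actual deployment device, tensor shapes, batch size and other parameters
  - Elimination of Redundant Layers and Operations:
    - Reading and writing tensors hits a memory bandwidth bottleneck, so layer fusion speeds things up
    - For ops like concat, a target output buffer can be reserved and the inputs written straight into it; this is speed-up through layer elimination
    - Another case is branches with shared structure but different weights: TensorRT merges them into one wide layer, similar in principle to group convolution, so a single layer inference produces the results of 3 paths; this is also layer fusion, making better use of the GPU's compute capacity
  - Dynamic Tensor Memory
- Usage flow
  - given a trained model
  - use TensorRT to parse a trained model and apply the optimizations above to obtain a plan file. A plan file contains not only the serialized graph but also the inference schedule (which kernels to use, in what order, and so on); it can be thought of as the TensorRT graph
  - load and deserialize a saved plan file to create a TensorRT engine object and run inference
- TensorRT tooling: https://github.com/NVIDIA/TensorRT/tree/master/tools
  - ONNX GraphSurgeon: edit ONNX models, e.g. add or prune nodes, change names or dimensions
  - PyTorch-Quantization: inserts simulated quantization ops during PyTorch training or inference, and supports exporting the quantization-trained model to ONNX and TRT
- cookbook: https://github.com/NVIDIA/trt-samples-for-hackathon-cn/tree/master/cookbook
- workflow
  - Use the TensorRT interface built into the deep learning framework; unsupported operators fall back to the original framework
  - [Recommended] Export to ONNX, then import into TensorRT; unsupported operators: custom node
  - Build the network with the TensorRT API; unsupported operators: write a plugin (CUDA C++)
- Command-line tools
  - trtexec: builds a serialized plan from ONNX, runs inference on a plan, and shows related information
  - polygraphy: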
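Jetson
- NVIDIA integrates TensorRT, cuDNN, CUDA and related components into the Jetson platform and delivers them to developers on demand through the all-in-one JetPack software package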

pytorch custom cuda op

custom add：https://mp.weixin.qq.com/s?__biz=MzkwNjE2ODMyMQ==&mid=2247484261&idx=1&sn=d7b500fb8e81da96e4828b36d64a749f&chksm=c0edd20bf79a5b1d54e6557be02552c8ab16d7225202bb87e3a89bbd203d0b3fcd643f4be6a6&scene=178&cur_album_id=2132879019699240961#rd

cpu code

#include <iostream>

void VectorAddCPU(const float *const a, const float *const b, float *const c,
                  const int n) {
  for (int i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}

int main(void) {
  // alloc memory for host
  const size_t size = 1024;
  float *ha = new float[size]();
  float *hb = new float[size]();
  float *hc = new float[size]();
  for (int i = 0; i < size; ++i) {
    ha[i] = i;
    hb[i] = size - i;
  }

  VectorAddCPU(ha, hb, hc, size);

  delete[] ha;
  delete[] hb;
  delete[] hc;

  return 0;
}

gpu code

#include <cuda_runtime.h>
#include <iostream>

__global__ void VectorAddGPU(const float *const a, const float *const b,
                             float *const c, const int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    c[i] = a[i] + b[i];
  }
}

int main(void) {
  // allocate host (CPU) memory
  const size_t size = 1024;
  const size_t bytes = size * sizeof(float);  // CUDA memory APIs take sizes in bytes
  float *ha = new float[size]();
  float *hb = new float[size]();
  float *hc = new float[size]();
  for (int i = 0; i < size; ++i) {
    ha[i] = i;
    hb[i] = size - i;
  }

  // allocate device (GPU) memory and copy the inputs over
  float *da = nullptr;
  float *db = nullptr;
  float *dc = nullptr;
  cudaMalloc((void **)&da, bytes);
  cudaMalloc((void **)&db, bytes);
  cudaMalloc((void **)&dc, bytes);
  cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dc, hc, bytes, cudaMemcpyHostToDevice);

  // kernel config
  const int thread_per_block = 256;
  const int block_per_grid = (size + thread_per_block - 1) / thread_per_block;

  VectorAddGPU<<<block_per_grid, thread_per_block>>>(da, db, dc, size);

  // copy the result from GPU back to CPU
  cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

  // free GPU memory
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);

  // free CPU memory
  delete[] ha;
  delete[] hb;
  delete[] hc;
    
  return 0;
}

deformable conv
- torchvision already provides an official version: https://pytorch.org/vision/stable/_modules/torchvision/ops/deform_conv.html
- Community implementation: https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0
- Here too it is the weighted bilinear part that needs a CUDA kernel implementation
- src contains the CUDA files
- functions contains the PyTorch function wrappers, used like torch.nn.functional.conv2d
- modules contains the PyTorch module wrappers, used like torch.nn.Conv2d
MSDeformAttn
- The forward & backward kernels are in ops/src/cuda/ms_deform_attn_cuda.cu
- They are bound in ops/src/vision.cpp
- They are compiled via setup.py
- The torch op is wrapped as MSDeformAttnFunction in ops/functions, which defines the PyTorch op's forward & backward as static methods

python bindings

Calling C or C++ from Python, and passing data from Python to C or C++
ctypes loads a dll/so directly
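
As a minimal sketch of the ctypes route, the snippet below loads the C math library and calls `sqrt` from it. It assumes a POSIX system (Linux or macOS) where `libm` can be located or is already linked into the Python process; on Windows a different library name would be needed.

```python
import ctypes
import ctypes.util

# Locate the C math library; fall back to the symbols of the current process.
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path) if libm_path else ctypes.CDLL(None)

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

root = libm.sqrt(2.0)
print(root)
```

Declaring `argtypes` and `restype` matters: without them ctypes assumes int arguments and an int return value, and a `double` function would return garbage.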

PyBind11

The Python binding generated by PyBind11 is a complete Python module that can be imported and used directly

step1: write the function on the C++ side

// cmult.h
float cmult(int int_param, float float_param);

// cmult.cpp
#include <stdio.h>
#include "cmult.h"

float cmult(int int_param, float float_param) {
    float return_value = int_param * float_param;
    printf("    In cmult : int: %d float %.1f returning  %.1f\n", int_param,
            float_param, return_value);
    return return_value;
}

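step2: write the binding in a pybind module
// pybind11_wrapper.cpp
#include <pybind11/pybind11.h>
#include <cmult.h>

PYBIND11_MODULE(cmult_example, m) {
    m.def("cmult", &cmult, "A function that multiplies two numbers");
}
The PYBIND11_MODULE macro is the entry point the Python interpreter uses when importing the extension module; when the Python function cmult is called, the C++ function at the bound address &cmult is invoked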

step3: build with setup.py and install the Python module

# setup.py
from pathlib import Path
from pybind11.setup_helpers import Pybind11Extension, build_ext
from setuptools import setup


example_module = Pybind11Extension(
    'cmult_example',
    [str(fname) for fname in Path('src').glob('*.cpp')],
    include_dirs=['./'],
    extra_compile_args=['-O3']
)

setup(
    name='cmult_example',
    description='pybind11+setup.py example',
    ext_modules=[example_module],
    cmdclass={"build_ext": build_ext},
)

Install: python setup.py install, which adds a Python module named cmult_example

step4: import the module and call cmult from a Python script

from cmult_example import cmult

a = 10
b = 1.25
print(cmult(a, b))

Cython

pytorch-quantization

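tensor quant and fake tensor quant: given a tensor x, pytorch_quantization.tensor_quant.tensor_quant(x) returns the quantized tensor (int values) and its scale, while pytorch_quantization.tensor_quant.fake_tensor_quant(x) returns the fake-quantized values (float), i.e. the QDQ result, which can be compared with the original tensor to compute the quantization error
QuantDescriptor: the quantization configuration
TensorQuantizer: instantiates a quantizer from a quantization configuration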
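
To show what quantize and quantize-dequantize (QDQ) mean numerically, here is a plain-Python sketch of symmetric per-tensor int8 quantization. `toy_tensor_quant` and `toy_fake_tensor_quant` are hypothetical stand-ins written for this note; they are not the pytorch_quantization implementation, and the choice of scale = 127 / amax is an assumption of this sketch.

```python
def toy_tensor_quant(x, num_bits=8):
    """Symmetric per-tensor quantization: returns (integer values, scale)."""
    bound = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    amax = max(abs(v) for v in x)              # largest magnitude in the tensor
    scale = bound / amax if amax > 0 else 1.0
    q = [max(-bound, min(bound, round(v * scale))) for v in x]
    return q, scale


def toy_fake_tensor_quant(x, num_bits=8):
    """Quantize then dequantize (QDQ): floats snapped onto the integer grid."""
    q, scale = toy_tensor_quant(x, num_bits)
    return [v / scale for v in q]


x = [0.5, -1.0, 0.25, 0.9]
q, scale = toy_tensor_quant(x)
x_qdq = toy_fake_tensor_quant(x)
quant_error = max(abs(a - b) for a, b in zip(x, x_qdq))
print(q, scale, quant_error)
```

The quantization error of each element is bounded by half a quantization step, 0.5 / scale, which is the quantity one would compare against when inspecting QDQ outputs.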

quant_nn: instantiates a quant layer

given fc1 = nn.Linear(in_features, out_features, bias=True)

quant_fc1 = quant_nn.Linear(
    in_features, out_features, bias=True,
    quant_desc_input=tensor_quant.QUANT_DESC_8BIT_PER_TENSOR,
    quant_desc_weight=tensor_quant.QUANT_DESC_8BIT_LINEAR_WEIGHT_PER_ROW)

c++ basics recall
- cmake
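- compilation principles
- heap and stack
- L1/L2 cache
- deep vs. shallow copy, pass by value vs. pass by reference
- overloading: functions share a name and are distinguished automatically by their parameter lists
- classes and objects
  - class: an abstraction that defines attributes; members are private by default, protected members are accessible only within the class and its subclasses, public members are accessible from anywhere
  - object: an instance of a class, with all the variables and attributes of the class
- inheritance
  - class derive_cls : public parent_cls1, public parent_cls2
  - polymorphism: re-implement (override) a parent class method to provide different behaviour
Autonomous driving basics
- bev: BEVFormer
- 3D det: PETR
- multimodal: lidar+camera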

import numpy as np

# raw conv forward
def conv2d(X, W, B, stride, pad, dilation=1):
    # X: [b,c,h,w]
    # W: [k1,k2,in,out]
    # B: [out]

    # pad: zeros are appended on the bottom/right only, pad = (pad_h, pad_w)
    pad_h, pad_w = pad
    X_pad = np.pad(X, [[0,0],[0,0],[0,pad_h],[0,pad_w]])
    b, _, h_pad, w_pad = X_pad.shape

    # out shape (h_pad/w_pad already include the padding)
    s, d = stride, dilation
    k1, k2, _, out_channels = W.shape
    k1, k2 = k1+(k1-1)*(d-1), k2+(k2-1)*(d-1)     # effective (dilated) kernel extent
    out_rows, out_cols = (h_pad - k1) // s + 1, (w_pad - k2) // s + 1

    # traverse spatial positions
    Y = np.zeros((b,out_channels,out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            i0,i1 = i*s, i*s + k1
            j0,j1 = j*s, j*s + k2
            patch = X_pad[:,:,i0:i1:d,j0:j1:d]                     # [b,in,k1,k2]
            Y[:,:,i,j] = np.einsum('bchw,hwco->bo', patch, W) + B  # [b,out]

    return Y


# raw conv backwards
def conv2d_bp(X_pad, W, B, Y_grads, stride, pad, dilation=1):
    # X_pad: [b,c,h_pad,w_pad]
    # W: [k1,k2,in,out]
    # Y_grads: [b,out,h_out,w_out]
    # return: X_grads, W_grads, B_grads

    b, _, out_rows, out_cols = Y_grads.shape
    s, d = stride, dilation
    k1, k2, _, out_channels = W.shape
    k1, k2 = k1+(k1-1)*(d-1), k2+(k2-1)*(d-1)     # effective (dilated) kernel extent

    # compute grads
    X_grads = np.zeros_like(X_pad)
    W_grads = np.zeros_like(W)
    B_grads = np.zeros_like(B)
    for i in range(out_rows):
        for j in range(out_cols):
            i0,i1 = i*s, i*s + k1
            j0,j1 = j*s, j*s + k2
            kernel = Y_grads[:,:,i,j]                # dL/dY, [b,out]
            patch = X_pad[:,:,i0:i1:d, j0:j1:d]      # [b,in,k1,k2]
            B_grads += kernel.sum(axis=0)            # dL/dY * dY/dB, summed over the batch
            W_grads += np.einsum('bo,bchw->hwco', kernel, patch)                    # dL/dY * X
            X_grads[:,:,i0:i1:d, j0:j1:d] += np.einsum('bo,hwco->bchw', kernel, W)  # dL/dY * W

    # remove pad
    pad_h, pad_w = pad
    b, _, h_pad, w_pad = X_pad.shape
    X_grads = X_grads[:,:,:h_pad-pad_h, :w_pad-pad_w]

    return X_grads, W_grads, B_grads

im2col & col2im
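- In conv2d, each step of the inner loop computes the dot product between a patch [b,in,k,k] and the filters [out,in,k,k]
- im2col turns the loop into a single matrix multiplication
  - The filters are reshaped to [out, in*k*k], and X is unfolded into overlapping patches [in*k*k, b*n_rows*n_cols]
  - The result has shape [out, b*n_rows*n_cols]
- col2im converts the matrix result back
  - [b, out, n_rows, n_cols]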
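The im2col idea above can be sketched in plain Python with nested lists. This is a simplified illustration under stated assumptions: a single image (no batch dimension), no padding or dilation, and filters laid out as [out][in][k][k]; `im2col` and `conv2d_im2col` are names chosen for this sketch.

```python
def im2col(x, k, stride=1):
    """x: [c][h][w] nested lists -> matrix of shape [c*k*k][n_rows*n_cols]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    n_rows = (h - k) // stride + 1
    n_cols = (w - k) // stride + 1
    cols = []
    for ch in range(c):
        for di in range(k):
            for dj in range(k):
                # one matrix row per (channel, kernel offset), one column per output position
                cols.append([x[ch][i * stride + di][j * stride + dj]
                             for i in range(n_rows) for j in range(n_cols)])
    return cols, n_rows, n_cols


def conv2d_im2col(x, filters, stride=1):
    """filters: [out][c][k][k]; returns the output as [out][n_rows][n_cols]."""
    k = len(filters[0][0])
    cols, n_rows, n_cols = im2col(x, k, stride)
    # flatten each filter into a row of length c*k*k, in the same order as the im2col rows
    w_flat = [[v for ch in f for row in ch for v in row] for f in filters]
    # matrix multiplication: [out, c*k*k] @ [c*k*k, n_rows*n_cols]
    y_flat = [[sum(wv * col[p] for wv, col in zip(w_row, cols))
               for p in range(n_rows * n_cols)] for w_row in w_flat]
    # reshape the matrix result back into spatial maps
    return [[r[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)] for r in y_flat]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
filters = [[[[1, 1], [1, 1]]]]            # 1 output channel, 2x2 box filter
y = conv2d_im2col(x, filters)
print(y)
```

The loop over output positions has disappeared into the column dimension of the unfolded matrix, which is what lets real implementations hand the work to a single optimized GEMM call.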
BN/LN/GN/IN
sgd
adamw
regularization

bit hacks：https://zhuanlan.zhihu.com/p/37014715

Check odd/even: AND with 1
def isOdd(n):
    return n & 1

Swap two numbers:

def swap1(a,b):
    a = a + b       # a+b
    b = a - b       # a
    a = a - b       # b
    return a, b

def swap2(a,b):
    a = a^b
    b = a^b         # b=(a^b)^b=a^(b^b)=a^0=a
    a = a^b         # a=(a^b)^a=b^(a^a)=b^0=b
    return a, b

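Reverse a bit sequence: pop bits off with a right shift and push them on with a left shift

def inverse(n):
    result = 0
    while n:
        tmp = n & 1
        n >>= 1
        result = (result << 1) | tmp   # shift first, then append the popped bit
    return result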

Modulo a power of two

# d=1<<s, {1,2,4,...}
def residual(n, d):
    return n&(d-1)

Compute log2: find the highest set bit

def log2(n):
    cnt = 0
    while n>>1:
        n >>= 1
        cnt += 1
    return cnt

Compute log10: change of base, log10(x) = log2(x) / log2(10), with 1/log2(10) ≈ 1233/4096

def log10(n):
    PowersOf10 = [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]
    t = (log2(n)+1) * 1233 >> 12       # estimate, may be one too large
    return t - (n < PowersOf10[t])     # correct the estimate against the table

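Sign of a number (negative / zero / positive)

sign = -(v < 0)             # -1 if v is negative, otherwise 0
sign = (v > 0) - (v < 0)    # -1, 0 or +1

Check whether two numbers have opposite signs

sign = ((x ^ y) < 0)   # true if exactly one of them is negative

Check whether a number is a power of 2

sign = n > 0 and (n & (n-1)) == 0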

Count the number of 1 bits

# cost is proportional to the bit width of n
def cnt_1(n):
    cnt = 0
    while n:
        tmp = n&1
        cnt += tmp
        n >>= 1
    return cnt

# cost is proportional to the number of 1s in n
def cnt_1(n):
    cnt = 0
    while n:
        cnt += 1
        n &= n-1       # clear the lowest set bit
    return cnt

Compute square root
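
One common bit-wise approach, shown here as a minimal sketch, builds the integer square root one bit at a time from the highest candidate bit downwards. It assumes a non-negative integer input and returns the floor of the square root; `isqrt_bits` is an illustrative name.

```python
def isqrt_bits(n):
    """Floor of sqrt(n) for a non-negative int, built one bit at a time."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 0
    # start from a bit at least as high as the top bit of sqrt(n)
    bit = 1 << ((n.bit_length() + 1) // 2)
    while bit:
        candidate = result | bit
        if candidate * candidate <= n:   # keep the bit if the square still fits
            result = candidate
        bit >>= 1
    return result


print(isqrt_bits(16), isqrt_bits(17), isqrt_bits(99))
```

Each iteration decides one bit of the answer, so the loop runs about half as many times as n has bits.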

amber.zhang
