torch-note

Common library functions

1.1 torch.flatten(input, start_dim=0, end_dim=-1): flattens the dimensions from start_dim through end_dim (inclusive) into a single dimension
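A small sketch of how start_dim and end_dim select which dimensions get merged. The tensor shape (2, 3, 4) is only an illustrative assumption.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)

flat_all = torch.flatten(x)                 # defaults merge every dimension
flat_tail = torch.flatten(x, start_dim=1)   # keep dim 0, merge dims 1 through -1
flat_head = torch.flatten(x, 0, 1)          # merge only dims 0 and 1

print(flat_all.shape, flat_tail.shape, flat_head.shape)
```

The start_dim=1 form is the one typically used before a fully connected layer, since it keeps the batch dimension intact.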

1.2 [einops][https://ggcgarciae.github.io/einops/2-einops-for-deep-learning/].rearrange(element, pattern): very powerful; a readable pattern string describes how the tensor's axes are reordered, split, or merged
torch.cuda.amp

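Automatic mixed precision: FloatTensor & HalfTensor
- Installation: ships with PyTorch (1.6 and later), so no separate package is needed
- Usage: run the forward pass under autocast and scale the loss with GradScaler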
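A minimal sketch of one mixed-precision training step, assuming a toy linear model and random data (both made up for illustration). On a machine without CUDA, autocast and the scaler are disabled, so the same code runs as an ordinary float32 step.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"                        # only enable AMP when a GPU is present
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = torch.nn.Linear(8, 2).to(device)          # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(4, 8, device=device)
y = torch.randn(4, 2, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
    loss = torch.nn.functional.mse_loss(model(x), y)   # forward pass in mixed precision

scaler.scale(loss).backward()   # scale the loss so small fp16 gradients do not underflow
scaler.step(optimizer)          # unscales gradients, then calls optimizer.step()
scaler.update()                 # adjusts the scale factor for the next iteration
```

Newer PyTorch releases also expose the scaler as torch.amp.GradScaler; the torch.cuda.amp spelling is kept here to match the heading above.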
torch.jit.script

Converts a model from a pure-Python program into a TorchScript program that can run independently of Python
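A short sketch of the round trip: script a model, serialize it, and load it back without the original Python class. The two-layer model and the in-memory buffer (standing in for a file path such as a .pt file) are assumptions for illustration.

```python
import io
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU()).eval()
scripted = torch.jit.script(model)      # compile the module to TorchScript

buffer = io.BytesIO()                   # stand-in for a file on disk
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)       # loading does not need the Python model definition

x = torch.randn(3, 4)
same = torch.allclose(model(x), restored(x))
print(same)
```

The saved archive is what a C++ program would load through libtorch, which is the main reason to script a model in the first place.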
[torch.nn.DataParallel & DistributedDataParallel][https://blog.csdn.net/kuweicai/article/details/120516410]
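- DP and DDP both implement data-parallel training; the main differences are:
  - DP is single-process and multi-threaded; DDP is multi-process (with inter-process communication)
  - DP works only on a single machine; DDP works on a single machine or across machines
  - DDP generally trains faster than DP
  - In DP, one main GPU communicates and synchronizes one-to-many with the other GPUs and is responsible throughout for scattering the split data and replicating the model to them, so communication time grows in proportion to the number of GPUs. In DDP the workers communicate in a ring; each process is given its own data loading and model construction from the start, so the amount of data and information each GPU receives is constant and the communication cost is constant ([reference][https://blog.csdn.net/qiumokucao/article/details/120179961])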
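- Using DP
  - Simple single-machine multi-GPU setup: the forward pass runs on all GPUs, the results are gathered on the main GPU, the model update happens only on the main GPU, and the updated model is then redistributed to the other GPUs
  - Low GPU utilization
  - Only requires wrapping the single model
  net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])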
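- [Using DDP][https://zhuanlan.zhihu.com/p/107139605]
  - Distributing the model weights after every batch is too costly. Instead, the gradients are synchronized across all GPUs and each process applies the same update itself (every replica must start from the same weights, e.g. by using the same seed): repeating the same computation on every GPU is faster than distributing weights over I/O
  - More involved to set up; the core configuration parameters are:
    - group: the process group; by default there is a single group
    - world size: the total number of processes. With the usual one process per GPU this is the total GPU count across all machines (machines × GPUs per machine); on a single machine it equals the number of GPUs
    - rank: the global index of a process, from 0 to world size - 1; on a single machine with one process per GPU it matches the GPU index
    - local_rank: the index of a process within its own machine, normally used as the GPU index on that machine
  - Two ways to wrap the code
    - spawn
    - launch: what you usually see is torch.distributed.launch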
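  - torch.distributed.launch
    - Usage: python3 -m torch.distributed.launch [launcher options] single_training_script.py [training script args]
      [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
      [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
      [--master_port MASTER_PORT] [--use_env] [-m] [--no_python]

      * -h/--help: show help
      * --nnodes NNODES: number of nodes
      * --node_rank NODE_RANK: rank of the current node
      * --nproc_per_node NPROC_PER_NODE: number of processes to start on each node, normally the number of GPUs per node
      * --master_addr MASTER_ADDR: IP address or host name of node 0; 127.0.0.1 for single-machine multi-GPU
      * --master_port MASTER_PORT: a free port on node 0, used for communication between nodes
      * --use_env: pass the local rank through the LOCAL_RANK environment variable instead of a --local_rank command-line argument
      * -m: like python -m; if single_training_script.py is packaged as a Python module, launch it with -m
      * --no_python: run the training script directly without prepending python; rarely needed

    * Show help: python -m torch.distributed.launch --help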
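The configuration parameters above can be sketched in a training script as follows. This is a minimal illustration, not a full recipe: it assumes the CPU-friendly gloo backend, a toy linear model, and a hypothetical free_port helper. When the launcher starts the script it sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment; the defaults used here let the same file also run as a single process.

```python
import os
import socket

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def free_port() -> str:
    """Hypothetical helper: ask the OS for an unused local port."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return str(s.getsockname()[1])


# Values normally provided by the launcher; defaults describe a 1-process job.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", free_port())
rank = int(os.environ.get("RANK", 0))              # global process index
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # process index on this machine

dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

model = DDP(torch.nn.Linear(4, 2))                 # toy model kept on CPU (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()                                    # gradients are all-reduced across processes here
optimizer.step()                                   # every process applies the same update

group_size = dist.get_world_size()
dist.destroy_process_group()
```

On GPUs one would typically switch the backend to nccl, call torch.cuda.set_device(local_rank), and pass device_ids=[local_rank] to DDP.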

Specifying GPUs
- In code
  os.environ['CUDA_VISIBLE_DEVICES'] = '0'
- On the command line when running a script/file
  CUDA_VISIBLE_DEVICES=0,1 python3 train.py
  CUDA_VISIBLE_DEVICES=0,1 sh run.sh
- Inside a shell script
  source bashrc
  export CUDA_VISIBLE_DEVICES=gpu_ids && python3 train.py # two commands
  CUDA_VISIBLE_DEVICES=gpu_ids python3 train.py # one command
- Priority: code > command line > script
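============================== separator ================================
- With .cuda()
  model.cuda(gpu_id)                  # an integer index selects exactly one GPU
  model.cuda('cuda:' + str(gpu_id))   # a device string also names exactly one GPU
  # a comma-separated string such as 'cuda:1,2' is not a valid device; for several GPUs, wrap the model in torch.nn.DataParallel(model, device_ids=[1, 2])
- With torch.cuda.set_device()
  torch.cuda.set_device(gpu_id)                 # sets the current default GPU (a single device)
  torch.cuda.set_device('cuda:' + str(gpu_id))  # the string form also names a single device
- Priority: .cuda() > torch.cuda.set_device()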
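============================== separator ================================
- The two kinds of settings, above and below the first separator, stack on top of each other:
  # run shell
  CUDA_VISIBLE_DEVICES=2,3,4,5 python3 train.py
  
  # inside the code
  model.cuda(1)
  loss.cuda(1)
  tensor.cuda(1)
  - The code then runs on physical GPU 3: GPUs 2, 3, 4, 5 are first exposed through CUDA_VISIBLE_DEVICES and renumbered internally as 0, 1, 2, 3, and the code then picks internal index 1, which is physical GPU 3
- Recommended: os.environ['CUDA_VISIBLE_DEVICES'] = '0', which behaves most predictably as long as it is set before CUDA is initialized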
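The renumbering described above can be written out as a tiny worked example. physical_gpu is a hypothetical helper introduced only for illustration, and it assumes CUDA_VISIBLE_DEVICES holds plain integer ids (the variable can also hold device UUIDs, which this sketch ignores).

```python
def physical_gpu(visible_devices: str, logical_index: int) -> int:
    """Map an in-process CUDA index back to the physical GPU id."""
    physical_ids = [int(i) for i in visible_devices.split(",")]
    return physical_ids[logical_index]


# CUDA_VISIBLE_DEVICES=2,3,4,5 combined with model.cuda(1)
print(physical_gpu("2,3,4,5", 1))  # → 3
```

Inside the process, torch.cuda.device_count() would report 4 in this situation, and index 0 would refer to physical GPU 2.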

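Random seeds

To make each training run reproducible, fix the torch random seed at the start of the program, and fix the numpy random seed as well

import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

torch.backends.cudnn.deterministic = True     # make cuDNN use deterministic convolution algorithms
torch.backends.cudnn.benchmark = False        # disable the autotuner, which may pick a different algorithm per run; use together with the line above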
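The settings above are often bundled into one helper. seed_everything is a hypothetical name, and seeding Python's own random module is an addition not listed in the notes above.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Fix the seeds of Python, NumPy, and PyTorch (CPU and all GPUs)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)            # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))  # → True
```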

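Multi-GPU synchronized BN
- By default each GPU computes the mean and standard deviation independently from its own data
  - because in the original tasks the per-GPU mini-batch was already large enough
  - and synchronizing the statistics costs communication time
  - [QUESTION] The running averages all end up approximating the sample mean anyway, so does this only affect convergence speed early in training, or does it directly affect accuracy? [One explanation] The whole dataset is split and distributed across the GPUs, so with multiple GPUs there is no guarantee that any single GPU sees the full set, which may lead each GPU to overfit its own share of the data
- Synchronized BN computes the mean and standard deviation from the data on all GPUs together and computes global gradients in the backward pass; it helps detection tasks considerably
sync_bn = torch.nn.SyncBatchNorm(num_features, eps=1e-05, momentum=0.1, affine=True,
                                 track_running_stats=True)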
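Rather than constructing SyncBatchNorm layers by hand, an existing model can usually be converted in one call. A minimal sketch, assuming a small made-up conv block; only the conversion is shown, because running the converted layer in training mode needs an initialized process group and GPUs.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Replace every BatchNorm layer in the model with SyncBatchNorm.
sync_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
converted = isinstance(sync_model[1], torch.nn.SyncBatchNorm)
print(converted)  # → True
```

The converted model is then wrapped in DistributedDataParallel as usual.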

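Tensors

Basic attributes

tensor = torch.randn(3,4,5)
print(tensor.type())  # data type, e.g. torch.FloatTensor
print(tensor.size())  # shape of the tensor, a torch.Size (a tuple subclass)
print(tensor.dim())   # number of dimensions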

Naming axes instead of using axis indices

# Tensor[N, C, H, W]
images = torch.randn(32, 3, 56, 56)
images.sum(dim=1)
images.select(dim=1, index=0)

# PyTorch 1.3 and later
NCHW = ['N', 'C', 'H', 'W']
images = torch.randn(32, 3, 56, 56, names=NCHW)
images.sum('C')
images.select('C', index=0)
# names can also be given like this
tensor = torch.rand(3,4,1,2,names=('C', 'N', 'H', 'W'))
# align_to conveniently reorders the dimensions by name
tensor = tensor.align_to('N', 'C', 'H', 'W')

Data type conversion

# set the default type; float32 (FloatTensor) is generally much faster than float64 (DoubleTensor), especially on GPUs
torch.set_default_tensor_type(torch.FloatTensor)

# type conversion
tensor = tensor.cuda()       # CUDA tensors are used only for GPU computation and cannot be mixed with tensors on other devices
tensor = tensor.cpu()        # CPU tensors convert freely to and from ndarray / PIL.Image
tensor = tensor.float()
tensor = tensor.long()

# ndarray
ndarray = tensor.cpu().numpy()
tensor = torch.from_numpy(ndarray).float()          # from_numpy shares memory with ndarray; .float() copies only if the dtype changes
tensor = torch.from_numpy(ndarray.copy()).float()   # copy first when the memory must not be shared or ndarray has negative strides

# PIL.Image
image = PIL.Image.fromarray(torch.clamp(tensor*255, min=0, max=255).byte().permute(1,2,0).cpu().numpy())   # byte()=uint8, char()=int8, [C,H,W]->[H,W,C]
image = torchvision.transforms.functional.to_pil_image(tensor)
tensor = torch.from_numpy(np.asarray(PIL.Image.open(path))).permute(2,0,1).float() / 255    # float32 in the range 0-1, [C,H,W]
tensor = torchvision.transforms.functional.to_tensor(PIL.Image.open(path))

# scalar
value = torch.rand(1).item()
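The two from_numpy lines above differ in whether the tensor shares memory with the array. A small sketch with a made-up three-element array shows the effect.

```python
import numpy as np
import torch

array = np.zeros(3, dtype=np.float32)
shared = torch.from_numpy(array)              # same underlying buffer as array
independent = torch.from_numpy(array.copy())  # separate buffer

array[0] = 7.0                                # write through the NumPy side
print(shared[0].item(), independent[0].item())  # → 7.0 0.0

# A reversed view has negative strides, which from_numpy rejects,
# so the explicit copy is required here.
flipped = torch.from_numpy(array[::-1].copy())
print(flipped.tolist())  # → [0.0, 0.0, 7.0]
```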

Basic tensor operations

# negative strides: PyTorch does not support negative-step slicing such as tensor[::-1]; use an index tensor instead
tensor = tensor[:,:,:,torch.arange(tensor.size(3) - 1, -1, -1).long()]   # [N,C,H,W] horizontal flip

# copying tensors
tensor.clone()                # new memory, still in computation graph
tensor.detach()               # shared memory, not in computation graph
tensor.detach().clone()       # new memory, not in computation graph

# comparing tensors
torch.allclose(tensor1, tensor2)  # float tensor
torch.equal(tensor1, tensor2)     # int tensor

# matrix multiplication
# Matrix multiplication: (m*n) * (n*p) -> (m*p).
result = torch.mm(tensor1, tensor2)
# Batch matrix multiplication: (b*m*n) * (b*n*p) -> (b*m*p)
result = torch.bmm(tensor1, tensor2)
# Element-wise multiplication: (m*n) * (m*n) -> (m*n)
result = tensor1 * tensor2
result = torch.mul(tensor1, tensor2)
# matmul, the general-purpose product: for inputs with two or more dimensions it always multiplies the last two dimensions as matrices and broadcasts the leading dimensions
a = torch.ones(2,1,3,4)
b = torch.ones(5,4,2)
c = torch.matmul(a,b)    # torch.Size([2,5,3,2])
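Two of the points above can be checked directly. This sketch assumes small made-up tensors; it compares the index-based flip with torch.flip (a built-in alternative not mentioned above) and shows that detach() shares memory while detach().clone() does not.

```python
import torch

x = torch.arange(24, dtype=torch.float32).reshape(1, 2, 3, 4)   # [N,C,H,W]

# Horizontal flip: index tensor versus the built-in torch.flip.
reverse = torch.arange(x.size(3) - 1, -1, -1)
flipped_index = x[:, :, :, reverse]
flipped_builtin = torch.flip(x, dims=[3])
print(torch.equal(flipped_index, flipped_builtin))  # → True

# detach() shares storage with the source; detach().clone() owns new storage.
w = torch.ones(3, requires_grad=True)
shared = w.detach()
copied = w.detach().clone()
shared[0] = 5.0                        # in-place write is visible through w
print(w[0].item(), copied[0].item())   # → 5.0 1.0
```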

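Datasets: Dataset, DataLoader

torch.utils.data.Dataset: a Dataset can be thought of as a list. The caller passes it an index, and the dataset is responsible for reading, transforming, and preprocessing the corresponding file and returning an (input_x, target_y) pair. The basic structure is: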

class CustomDataset(torch.utils.data.Dataset):

    def __init__(self):
        # TODO
        # 1. Initialize file path or list of file names.
        pass
    def __getitem__(self, index):
        # TODO
        # 1. Read one data from file (e.g. using numpy.fromfile, PIL.Image.open).
        # 2. Preprocess the data (e.g. torchvision.Transform).
        # 3. Return a data pair (e.g. image and label).
        # Note that step 1 reads one sample, i.e. a single item, not a batch.
        pass
    def __len__(self):
        # You should change 0 to the total size of your dataset.
        return 0

# len(dataset) is the number of samples
# len(dataloader) is the number of batches (steps) per epoch

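torch.utils.data.DataLoader: the DataLoader is the layer that actually feeds the model. It assembles samples into batches and controls the sampling strategy, workers, shuffling, and related settings. It is instantiated with the following parameters:
- dataset (Dataset): the dataset to load from
- batch_size (int, optional): number of samples per batch
- shuffle (bool, optional): reshuffle the data at the start of every epoch
- sampler (Sampler, optional): custom strategy for drawing samples from the dataset; if specified, shuffle must be False
- batch_sampler (Sampler, optional): like sampler, but yields the indices of one whole batch at a time; once it is specified, batch_size, shuffle, sampler, and drop_last can no longer be specified (they are mutually exclusive)
- num_workers (int, optional): number of subprocesses used for data loading; 0 means the data is loaded in the main process (default: 0)
- collate_fn (callable, optional): function that merges a list of samples into a mini-batch
- pin_memory (bool, optional): if True, the data loader copies tensors into CUDA pinned memory before returning them
- drop_last (bool, optional): concerns the last, incomplete batch. If True, with batch_size 64 and 100 samples per epoch, the trailing 36 samples are dropped during training. If False (default), training proceeds normally and the last batch is simply smaller
- timeout (numeric, optional): if positive, the maximum time to wait for a batch from the worker processes; if it is exceeded, loading fails with an error. The value should always be greater than or equal to 0 (default: 0)
- worker_init_fn (callable, optional): per-worker initialization function. If not None, it is called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading (default: None)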
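A concrete sketch of the Dataset skeleton plus the drop_last arithmetic described above. SquaresDataset is a made-up toy dataset; the sizes 100 and 64 are taken from the drop_last explanation.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Toy dataset (hypothetical): sample i is the pair (i, i**2)."""

    def __init__(self, size: int):
        self.size = size

    def __getitem__(self, index):
        x = torch.tensor([float(index)])   # one sample, not a batch
        return x, x ** 2

    def __len__(self):
        return self.size


dataset = SquaresDataset(100)
keep_last = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=False)
drop_last = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)

print(len(dataset), len(keep_last), len(drop_last))   # → 100 2 1
batch_sizes = [x.shape[0] for x, _ in keep_last]
print(batch_sizes)                                    # → [64, 36]
```

The default collate_fn stacks the per-sample tensors, so each batch of inputs has shape [batch_size, 1] here.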

Samplers

All samplers inherit from torch.utils.data.Sampler

from torch.utils.data import Sampler

class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """
    # yields indices in sequential order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))     # this is where samplers mainly differ

    def __len__(self):
        return len(self.data_source)

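Built-in samplers
- SequentialSampler(data_source): samples in order; data_source can be a Dataset; returns an iterator over indices
- RandomSampler(data_source, replacement=False, num_samples=None): random sampling, with or without replacement, of the given number of samples
- SubsetRandomSampler(indices): samples the given indices in random order without replacement; a convenient shorthand when only a shuffled subset (or the shuffled full set) is needed
- WeightedRandomSampler(weights, num_samples, replacement=True): also a variant of random sampling, where each sample carries a weight
- BatchSampler(sampler, batch_size, drop_last): wraps any of the samplers above and yields batches of indices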
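A short sketch of two of the samplers listed above. The index range and the weights are made up for illustration; the zero weight shows that such a sample is never drawn.

```python
from torch.utils.data import BatchSampler, SequentialSampler, WeightedRandomSampler

# BatchSampler groups the indices produced by another sampler into lists.
batches = list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
print(batches)  # → [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# WeightedRandomSampler draws indices in proportion to their weights.
weights = [0.0, 1.0, 3.0]              # index 0 can never be drawn
sampler = WeightedRandomSampler(weights, num_samples=20, replacement=True)
drawn = list(sampler)
print(0 in drawn)  # → False
```

Either object can be handed to a DataLoader: the first through batch_sampler, the second through sampler.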

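Models

nn.Module: the base class to inherit from when defining a model. It encapsulates higher-level functionality such as train/eval switching and gradient back-propagation, and is roughly the counterpart of the Model class in Keras

Custom layers, models, and losses all inherit from this class

Iterating over all sub-layers of a model: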

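for layer in model.modules():               # yields the model itself and every sub-layer
    if isinstance(layer, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(layer.weight, mode='fan_out',
                                      nonlinearity='relu')

conv_model = torch.nn.Sequential()          # container that collects the conv layers
for name, layer in model.named_modules():   # yields (name, sub-layer) pairs
    if isinstance(layer, torch.nn.Conv2d):
        conv_model.add_module(name.replace('.', '_'), layer)   # module names must not contain '.'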

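Call model.train() or model.eval() before model(x) to switch the network mode

nn.ModuleList: a list to which any nn.Module subclass can be added; its members are automatically registered as sub-modules of the network, so their parameters show up in model.parameters(). Modules kept in a plain Python list are not registered, because nn.Module only registers attributes that are themselves Modules or Parameters
- the execution order of the modules is determined by the forward function
- one module can be called several times in forward, and its parameters are shared across the calls
nn.Sequential: goes one step further and already implements forward internally, so defining it is implementing it; the layers must be defined in execution order
nn.Xxx & nn.functional.xxx: e.g. nn.Conv2d and nn.functional.conv2d, analogous to keras.layers.Conv2D and tf.nn.conv2d. One is a wrapped layer that must be instantiated before use; the other is a functional interface that is called directly but needs its parameters (such as the weights) passed in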
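The difference between nn.ModuleList and a plain Python list can be seen by counting registered parameters. The two classes below are hypothetical toy models that differ only in how they store their layers.

```python
from torch import nn


class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(2)]   # not registered

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


registered = sum(p.numel() for p in WithModuleList().parameters())
unregistered = sum(p.numel() for p in WithPlainList().parameters())
print(registered, unregistered)  # → 40 0
```

With the plain list, an optimizer built from model.parameters() would silently train nothing, and .cuda() or state_dict() would skip those layers as well.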

Model parameter count: torch.numel

total_parameters = sum(torch.numel(p) for p in model.parameters())
trained_parameters = sum(torch.numel(p) for p in model.parameters() if p.requires_grad)

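Model parameters:

model.parameters()     # generator
model.state_dict()     # dict

model.load_state_dict(torch.load('model.pth'), strict=False)

# parameter count: torch.numel
sum_parameters = sum(torch.numel(parameter) for parameter in model.parameters())

# floating-point operations in GFLOPs; this assumes the layer defines its own flops() method, since nn.Module does not provide one
model.layers[0].flops() / 1e9

A special case: BN layers. Calling .parameters() shows only two parameters for a BN layer (weight and bias), but the layer also holds running_mean and running_var. Strictly speaking these are not network parameters but running statistics stored as buffers, so they appear in state_dict() instead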
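The BN special case can be inspected directly. A minimal sketch on a standalone BatchNorm2d layer with an arbitrary channel count of 3:

```python
import torch

bn = torch.nn.BatchNorm2d(3)

param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
state_keys = list(bn.state_dict().keys())

print(param_names)   # → ['weight', 'bias']
print(buffer_names)  # → ['running_mean', 'running_var', 'num_batches_tracked']
print(state_keys)
```

Buffers are saved and loaded with the model and moved by .cuda(), but an optimizer never updates them.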

Fine-tune the fully connected layer with a larger learning rate and the convolutional layers with a smaller one

model = torchvision.models.resnet18(pretrained=True)
finetuned_parameters = list(map(id, model.fc.parameters()))
conv_parameters = (p for p in model.parameters() if id(p) not in finetuned_parameters)
parameters = [{'params': conv_parameters, 'lr': 1e-3}, 
              {'params': model.fc.parameters()}]
optimizer = torch.optim.SGD(parameters, lr=1e-2, momentum=0.9, weight_decay=1e-4)
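To check which learning rate each group ends up with, the same pattern can be tried on a tiny model that needs no pretrained weights. TinyNet is a hypothetical stand-in for resnet18, used here only so the sketch runs without a download.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for resnet18: one conv layer and one fc layer."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.fc = nn.Linear(4, 2)


model = TinyNet()
fc_ids = {id(p) for p in model.fc.parameters()}
backbone = [p for p in model.parameters() if id(p) not in fc_ids]

optimizer = torch.optim.SGD(
    [{'params': backbone, 'lr': 1e-3},       # explicit, smaller learning rate
     {'params': model.fc.parameters()}],     # falls back to the default lr below
    lr=1e-2, momentum=0.9, weight_decay=1e-4)

lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)  # → [0.001, 0.01]
```

Options that a group does not set itself (here momentum and weight_decay for both groups, and lr for the fc group) are taken from the keyword arguments passed to the optimizer.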

pytorch-summary