GPT papers: OpenAI's trilogy of general-purpose pretrained models
- [2018 GPT-1] Improving Language Understanding by Generative Pre-Training: transformer-based, pre-training + task-specific fine-tuning; the inputs of every task are converted into ordered token sequences, so no task-specific architecture is needed
- [2019 GPT-2] Language Models are Unsupervised Multitask Learners: minor architectural tweaks on top of GPT-1, plus unsupervised training on a huge dataset (WebText)
- [2020 GPT-3] Language Models are Few-Shot Learners: scaling up LMs yields strong few-shot performance without fine-tuning
For comparison: BERT has about 300M parameters
GPT-1: Improving Language Understanding by Generative Pre-Training
Motivation
- NLP tasks
- textual entailment
- question answering
- semantic similarity assessment
- document classification
- labeled data is scarce, while unlabeled corpora are abundant
- large gains can be realized by
- generative pre-training of a language model on a diverse unlabeled corpus: an unsupervised general model that learns universal representations
- discriminative fine-tuning on each specific task: a supervised task-specific model that adapts to a wide range of tasks
- a general task-agnostic model can outperform discriminatively trained models
- use task-aware input transformations
Arguments
- learning from raw text to alleviate the dependence on supervised learning is still challenging:
- it is unclear which optimization objective to choose: language modeling / machine translation / discourse coherence
- no consensus on the most effective way to transfer: task-specific changes to the model architecture / auxiliary learning objectives / learning schemes
- two-stage training procedure
- pretrain + fine-tuning
- use Transformer: better handling of long-range dependencies
- task-specific input adaptations: process structured inputs into ordered token sequences
- evaluate on
- natural language inference
- question answering
- semantic similarity
- text classification
Method
overview
- architecture:transformer decoder
- training objectives
- unsupervised: text prediction, i.e. predicting the next tokens from the preceding context
- supervised: task classifier, classifying the whole sequence
Unsupervised pre-training
- given unsupervised corpus of tokens $U = \{u_1, \dots, u_n\}$
- context window size $k$
- use standard language modeling objective: $L_1(U) = \sum_i \log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)$
- use multi-layer Transformer decoder
- input:$h_0 = UW_e + W_p$
- attention blocks: $h_l = transformer\_block(h_{l-1}), \forall l\in[1,n]$
- output: $P(u)=softmax(h_n W_e^T)$ (see the sketch below)
- use SGD
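A minimal PyTorch sketch of these equations, assuming a hypothetical `MiniGPT` module (not the released implementation): token embeddings $W_e$ plus learned position embeddings $W_p$, a stack of masked self-attention blocks, output logits tied to $W_e$, and the next-token objective $L_1$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGPT(nn.Module):
    """Minimal decoder-only LM following the GPT-1 equations (illustrative only)."""
    def __init__(self, vocab_size, ctx_len, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)            # token embedding matrix W_e
        self.W_p = nn.Parameter(torch.zeros(ctx_len, d_model))  # learned position embeddings W_p
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                       activation="gelu", batch_first=True)
            for _ in range(n_layers)])                           # masked self-attention blocks

    def forward(self, idx):                                      # idx: (batch, seq) token ids
        T = idx.size(1)
        h = self.W_e(idx) + self.W_p[:T]                         # h_0 = U W_e + W_p
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        for block in self.blocks:                                # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)
        return h, h @ self.W_e.weight.T                          # hidden states, logits (P(u) = softmax(h_n W_e^T))

def lm_loss(model, idx):
    """L1(U): next-token cross-entropy over the context window."""
    _, logits = model(idx[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), idx[:, 1:].reshape(-1))
```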
Supervised fine-tuning
- given a labeled dataset $C$ whose instances consist of an input token sequence $x^1,\dots,x^m$ and a label $y$
- use the final transformer block's activation at the last token, $h_l^m$
- fed into a linear + softmax output layer: $P(y|x^1,\dots,x^m)=softmax(h_l^m W_y)$
- the supervised objective maximizes the label likelihood: $L_2(C) = \sum_{(x,y)} \log P(y|x^1,\dots,x^m)$
- experiments show that adding the unsupervised LM loss as an auxiliary objective, $L_3(C) = L_2(C) + \lambda \cdot L_1(C)$, helps learning: it improves generalization and speeds up convergence (sketch below)
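A sketch of that joint fine-tuning loss, reusing the hidden states and LM logits returned by the `MiniGPT` sketch above; `W_y` is the only newly added parameter matrix, and $\lambda = 0.5$ is the value reported in the paper.

```python
import torch.nn.functional as F

def finetune_loss(hidden, lm_logits, idx, labels, W_y, lam=0.5):
    """GPT-1-style fine-tuning objective L3(C) = L2(C) + lambda * L1(C) (sketch).

    hidden    : (batch, seq, d_model) final-block activations of the transformed input
    lm_logits : (batch, seq, vocab)   language-model logits for the same sequence
    idx       : (batch, seq)          input token ids, reused as shifted LM targets
    labels    : (batch,)              task labels y
    W_y       : nn.Linear(d_model, n_classes), the only task-specific parameters added
    """
    h_m = hidden[:, -1]                                   # h_l^m: activation at the last token
    L2 = F.cross_entropy(W_y(h_m), labels)                # supervised loss: -log P(y | x^1..x^m)
    L1 = F.cross_entropy(                                 # auxiliary LM loss on the same sequence
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        idx[:, 1:].reshape(-1))
    return L2 + lam * L1                                  # joint objective L3
```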
Task-specific input transformations
- certain tasks have structured inputs, e.g. question-answer pairs or (document, question, answer) triplets
- we convert them into ordered sequences
- Textual entailment: concatenate the premise and the hypothesis into one sequence
- Similarity tasks: the two texts have no inherent order, so each pair becomes two sequences with the order swapped; the two final hidden states $h_l^m$ are added element-wise and then fed into the output layer
- Question Answering and Commonsense Reasoning: given a document $z$, a question $q$, and possible answers $\{a_k\}$, the context $[z; q]$ is concatenated with each answer $a_k$; each concatenation is processed independently by the model, and a softmax over the candidates produces a distribution over all possible answers
- all concatenations use a delimiter token ($)
- every sequence is wrapped with randomly initialized start and end tokens (see the sketch below)
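A toy illustration of these input transformations; the token names `<s>` and `<e>` are placeholders for the randomly initialized start/end embeddings, and inputs are assumed to be already-tokenized lists of symbols.

```python
# Placeholder special symbols; in GPT-1 the start, delimiter ($) and end tokens
# are extra embeddings learned during fine-tuning.
START, DELIM, END = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    # one sequence: [start; premise; $; hypothesis; end]
    return [START, *premise, DELIM, *hypothesis, END]

def similarity_inputs(text_a, text_b):
    # no natural ordering: produce both orderings; the two final hidden states
    # h_l^m are summed before the linear output layer
    return ([START, *text_a, DELIM, *text_b, END],
            [START, *text_b, DELIM, *text_a, END])

def qa_inputs(document, question, answers):
    # one sequence per candidate answer; each is processed independently and a
    # softmax over the per-candidate scores gives the answer distribution
    context = [*document, *question]
    return [[START, *context, DELIM, *answer, END] for answer in answers]
```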
Experiments
GPT-2: Language Models are Unsupervised Multitask Learners
Motivation
- more general models which can perform many tasks
- train language model
- without explicit supervision
- trained on a new dataset of millions of webpages called WebText
- outperforms several baselines
- GPT-2:a 1.5B parameter Transformer
Arguments
- Machine learning systems are sensitive to slight changes of
- data distribution
- task specification
- ‘narrow experts’
- lack of generalization, since single-task training is done on single-domain datasets
- methods
- multitask training: still immature
- pretraining + finetuning: still requires supervised training
- this paper
- connect the two lines above
- perform down-stream tasks in a zero-shot setting
Method
the naturally sequential form of language admits the general formulation $p(output|input)$
a task-specific system additionally requires the probabilistic framework to condition on the task to be performed, $p(output|input, task)$; the task can be specified at several levels
- architectural level:task-specific encoders/decoders
- algorithmic level:like MAML
- or, more flexibly, specify the task in language itself: write everything as sequences (see the sketch after this list)
- translation:(translate to french, english text, french text)
- comprehension:(answer the question, document, question, answer)
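A toy sketch of this sequence formulation (the exact prompt strings are illustrative, not taken from the paper): the task specification, input, and output all live in one token stream, so a single language model can be conditioned on the task purely through text.

```python
def as_sequence(task, **fields):
    """Render (task, input, output) as a single text sequence for a language model."""
    if task == "translation":
        # (translate to french, english text, french text)
        return f"translate to french: {fields['english']} => {fields['french']}"
    if task == "comprehension":
        # (answer the question, document, question, answer)
        return f"{fields['document']}\nQ: {fields['question']}\nA: {fields['answer']}"
    raise ValueError(f"unknown task: {task}")

# At inference time the output slot is simply left empty and the LM completes it:
prompt = "translate to french: cheese =>"
```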
training dataset
- huge numbers of documents can be obtained by web crawling, but with significant data quality issues
- subsets of external documents that are similar to the target dataset can give improvements
- the paper therefore defines a mechanism for collecting text with a quality signal: outbound links from Reddit posts, with Wikipedia pages removed (a common source of evaluation data)
input representation
word-level language model VS byte-level language model
- word-level performs better
- but it is limited by a fixed vocabulary (out-of-vocabulary words)
Byte Pair Encoding (BPE)
combines the empirical benefits of word-level LMs with the generality of byte-level approaches
concretely, GPT-2 runs BPE over raw byte sequences (256 base symbols, so no unknown tokens) and prevents merges across character categories except for spaces, avoiding wasteful variants like "dog.", "dog!", "dog?" (toy sketch below)
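A toy character-level BPE merge loop to illustrate the mechanism; GPT-2's tokenizer follows the same procedure but starts from the 256 byte values and adds the category-based merge restriction mentioned above (this sketch includes neither).

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus (character-level illustration)."""
    vocab = Counter(tuple(w) for w in words)          # word -> frequency, word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():              # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair becomes a new symbol
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():              # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], 3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```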
model
Transformer-based, with a few modifications to the GPT-1 model:
- layer normalization was moved to the input of each sub-block
- additional layer normalization was added after the final self-attention block
- initialization of the residual path: with $N$ residual layers in total, the weights of the residual layers are scaled by $\frac{1}{\sqrt{N}}$ at initialization (sketch below)
- context size:1024
- batch size:512
residual block
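A sketch of one such pre-LN residual block, with the $1/\sqrt{N}$ rescaling applied to the projections that feed the residual stream; which weights exactly get rescaled is an interpretation here, since the paper only says "the weights of residual layers".

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """GPT-2-style residual block sketch: LayerNorm at the input of each sub-block."""
    def __init__(self, d_model, n_heads, n_layers_total):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        # rescale residual-path weights by 1/sqrt(N), N = number of residual layers
        scale = 1.0 / math.sqrt(n_layers_total)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, causal_mask):
        a = self.ln1(x)                                  # pre-LN before self-attention
        x = x + self.attn(a, a, a, attn_mask=causal_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))                    # pre-LN before the MLP
        return x
```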
Experiments
GPT-3: Language Models are Few-Shot Learners
- Motivation
- zero-shot: the pretraining + finetuning scheme still requires a task-specific finetuning dataset
- scale-up:scaling up language models greatly improves general few-shot performance