GRAPH ATTENTION NETWORKS

official repo: https://github.com/PetarV-/GAT

reference: https://zhuanlan.zhihu.com/p/34232818

  • Inductive learning: first learn a pattern from the training samples, then use that pattern to make predictions on the test samples (i.e. go from the specific to the general, then from the general back to the specific); typical examples include Bayesian models.
  • Transductive learning: observe specific training samples and make predictions directly for specific test samples (from the specific to the specific); typical examples include k-nearest neighbors and SVM.

GRAPH ATTENTION NETWORKS

  1. Motivation

    • task: node classification
    • builds on GCN by introducing masked self-attentional layers
    • specifies different weights to different nodes in a neighborhood; intuitively this feels like replacing the adjacency matrix with an attention matrix?
  2. Arguments

    • properties of the attention architecture
      • parallelizable, computationally efficient
      • can be applied to graph nodes with different degrees (though an adjacency matrix handles that too)
      • directly applicable to inductive learning problems (as opposed to the semi-supervised transductive setting of the original GCN?)
      • the last two points feel a bit forced
    • GCN
      • avoids costly matrix operations
      • but relies on a fixed graph structure, so it cannot be applied directly to other graphs
  3. Methods

    • graph attentional layer

      • input: node features

        • $N$ nodes, each with an $F$-dimensional representation
        • $h=\{\overrightarrow {h_1},\overrightarrow {h_2},\dots,\overrightarrow {h_N} \}$, $\overrightarrow {h_i} \in R^F$
      • output: a new set of node features

        • $h^{\prime}=\{\overrightarrow {h_1^{\prime}},\overrightarrow {h_2^{\prime}},\dots,\overrightarrow {h_N^{\prime}} \}$, $\overrightarrow {h_i^{\prime}} \in R^{F^{\prime}}$
      • a weight matrix

        • $W \in R^{F^{\prime} \times F}$, so that $W\overrightarrow {h_i} \in R^{F^{\prime}}$
        • applied to every node
      • then self-attention

        • compute attention coefficients: $e_{ij} = a(W\overrightarrow {h_i},W\overrightarrow {h_j})$

          • the attention mechanism $a$ is a single-layer feedforward neural network followed by LeakyReLU (negative slope 0.2)
          • it is parameterized by a weight vector $\overrightarrow{a} \in R^{2F^{\prime}}$
        • softmax normalization over the neighborhood: $\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i}\exp(e_{ik})}$

        • overall expression

          • concatenate the two transformed feature vectors
          • then a fully-connected layer + LeakyReLU
          • then softmax: $\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\overrightarrow{a}^{T}[W\overrightarrow {h_i} \Vert W\overrightarrow {h_j}]\right)\right)}{\sum_{k \in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(\overrightarrow{a}^{T}[W\overrightarrow {h_i} \Vert W\overrightarrow {h_k}]\right)\right)}$

        • $e_{ij}$ expresses the importance of node $j$'s features to node $i$

        • masked attention: injects the graph structure by computing the importance only for nodes in node $i$'s neighborhood

        • neighborhood: the first-order neighbors of $i$ (including $i$ itself)

        • weighted sum + nonlinearity: $\overrightarrow {h_i^{\prime}} = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W \overrightarrow {h_j}\right)$

        • multi-head attention:

          • there are $K$ independent sets of trainable weights, so a node has $K$ sets of attention coefficients over its neighborhood
          • each set of weights produces its own new node feature (weighted sum + nonlinear unit); the $K$ results are combined by concat or average as the final output
          • concat (hidden layers): $\overrightarrow {h_i^{\prime}} = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \overrightarrow {h_j}\right)$

          • if it is the last MHA layer of the network, average first, then apply the nonlinearity: $\overrightarrow {h_i^{\prime}} = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \overrightarrow {h_j}\right)$

        • overall: shared linear transform, masked attention coefficients, softmax, weighted sum, repeated per head (a minimal code sketch follows below)
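
A minimal PyTorch sketch of the layer described above, using a dense (N, N) adjacency mask. The names GATHead / GATLayer and the toy tensors are mine for illustration; this is not the official TensorFlow implementation linked at the top.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GATHead(nn.Module):
    """One attention head: returns sum_j alpha_ij * W h_j (no nonlinearity)."""

    def __init__(self, in_dim, out_dim, slope=0.2):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared transform, F -> F'
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a in R^{2F'}
        self.leaky_relu = nn.LeakyReLU(slope)             # LeakyReLU(0.2) as in the paper

    def forward(self, h, adj):
        # h: (N, F) node features; adj: (N, N) 0/1 mask with self-loops
        Wh = self.W(h)                                     # (N, F')
        N = Wh.size(0)
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every pair (i, j)
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)
        # masked attention: non-neighbors get -inf, so softmax assigns them weight 0
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(e, dim=1)                        # alpha_ij, each row sums to 1
        return alpha @ Wh                                  # (N, F') weighted sum


class GATLayer(nn.Module):
    """K heads; concat for hidden layers, average for the final layer."""

    def __init__(self, in_dim, out_dim, num_heads, concat=True):
        super().__init__()
        self.heads = nn.ModuleList([GATHead(in_dim, out_dim) for _ in range(num_heads)])
        self.concat = concat

    def forward(self, h, adj):
        outs = [head(h, adj) for head in self.heads]
        if self.concat:
            # hidden layers: nonlinearity per head, then concatenate -> (N, K*F')
            return torch.cat([F.elu(o) for o in outs], dim=-1)
        # final layer: average the heads; the task nonlinearity (e.g. softmax for
        # classification) is applied afterwards, typically inside the loss
        return torch.stack(outs).mean(dim=0)


# toy usage: 5 nodes, 8 input features, 2 heads with 4 hidden units each
h = torch.randn(5, 8)
adj = torch.eye(5)                  # self-loops; set 1s for the actual edges
adj[0, 1] = adj[1, 0] = 1.0
out = GATLayer(8, 4, num_heads=2, concat=True)(h, adj)    # shape (5, 8)
```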

      • comparisons to related work

        • the proposed GAT layer directly addresses several issues that were present in prior approaches

          • computationally efficient: the paper says it is more efficient than (dense) matrix operations; I don't quite get this
          • assigns different importance to nodes within a neighborhood; doesn't a GCN with a trainable adjacency matrix have the same property? not sure
          • supports directed graphs
          • enables inductive learning: it can be applied directly to inductive problems, i.e. to graph structures never seen during training; not sure why this works (presumably because the learned parameters $W$ and $\overrightarrow{a}$ are shared across all edges and their sizes do not depend on $N$ or on a fixed adjacency matrix)
        • mathematically, attention and the adjacency matrix are both essentially ways of modelling graph edges (contrast sketched below)

          • adjacency-trainable GCN: as in the DAG paper, the adjacency matrix itself is a trainable $(N, N)$ variable whose entries are updated during training
          • GAT: the attention mechanism instead introduces an extra learnable weight vector $\overrightarrow{a} \in R^{2F^{\prime}}$ shared by all edges (on top of the shared $W$), updated during training
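
To make the contrast concrete, a rough sketch; AdjTrainableGCNLayer is a hypothetical name for the adjacency-as-parameter variant noted above, and the row-softmax normalization is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn


class AdjTrainableGCNLayer(nn.Module):
    # Edge weights live in a free (N, N) parameter, so the layer is tied to one
    # particular graph with exactly N nodes (transductive only).
    def __init__(self, num_nodes, in_dim, out_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(num_nodes, num_nodes))  # learned edges, (N, N)
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h):                                  # h: (N, F)
        return torch.relu(torch.softmax(self.adj, dim=1) @ self.W(h))


# In GAT (sketched earlier), the edge weights alpha_ij are computed from the node
# features using shared parameters W (F' x F) and a (2F'), whose sizes do not
# depend on N; the same trained layer can therefore be run on graphs (and graph
# sizes) never seen during training, i.e. the inductive setting.
```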