sec2. Encoder-Decoder Structure

In general, the input data first needs to be mapped into a latent space that is easier to learn from.
  1. Embedding
      • Tokenization: split the input sequence into individual tokens.
        • For text data, a tokenizer splits the text sequence into a sequence of tokens; for image data, the image is split into patches (i.e., the Patch Embedding input preprocessing in ViT).
          For text data, the workflow of this step is sketched in the tokenization example after this list.
        • Primary Embedding: map the tokenized sequence (already converted to token IDs) into the latent space, producing feature vectors; see the embedding-and-fusion sketch after this list.
        • Position Encoding & Fusion
          • Visualization of the positional encoding: note that even and odd dimensions are encoded differently (sin for even indices, cos for odd indices), as in the implementation below.
            import torch
            from torch import nn

            class PositionalEncoding(nn.Module):
                """Compute the sinusoidal positional encoding."""

                def __init__(self, d_model, max_len, device):
                    """
                    Constructor of the sinusoid encoding class.
                    :param d_model: dimension of the model
                    :param max_len: max sequence length
                    :param device: hardware device setting
                    """
                    super(PositionalEncoding, self).__init__()

                    # same size as the input matrix (so it can be added to the input)
                    self.encoding = torch.zeros(max_len, d_model, device=device)
                    self.encoding.requires_grad = False  # fixed encoding, no gradient needed

                    pos = torch.arange(0, max_len, device=device)
                    pos = pos.float().unsqueeze(dim=1)  # 1D => 2D, one row per position

                    # 'i' indexes pairs of embedding dimensions; step=2 yields the even indices (2 * i)
                    _2i = torch.arange(0, d_model, step=2, device=device).float()

                    # even dimensions use sin, odd dimensions use cos
                    self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
                    self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))

                def forward(self, x):
                    # self.encoding : [max_len, d_model], e.g. [512, 512]
                    batch_size, seq_len = x.size()  # x : [batch_size, seq_len], e.g. [128, 30]

                    # return the first seq_len rows; they are added to tok_emb : [128, 30, 512]
                    return self.encoding[:seq_len, :]  # [seq_len, d_model], e.g. [30, 512]
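      To make the Tokenization step concrete, here is a minimal sketch for text data, assuming a toy whitespace tokenizer and a vocabulary built on the fly; real systems use trained subword tokenizers (e.g. BPE or WordPiece), and the corpus, vocab, and encode helper below are illustrative placeholders.

      # Minimal tokenization sketch (illustrative only): whitespace split + toy vocabulary.
      def tokenize(text):
          # split the raw text sequence into individual tokens
          return text.lower().split()

      corpus = ["the cat sat on the mat", "the dog sat"]
      vocab = {"<pad>": 0, "<unk>": 1}
      for sentence in corpus:
          for tok in tokenize(sentence):
              vocab.setdefault(tok, len(vocab))

      def encode(text):
          # map each token to its integer ID (unknown tokens fall back to <unk>)
          return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

      print(encode("the cat sat"))  # -> [2, 3, 4]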
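      The Primary Embedding and the fusion with the positional encoding can be sketched as follows, reusing the PositionalEncoding class above; the vocabulary size, batch size, and sequence length are placeholder values, not taken from the notes.

      import torch
      from torch import nn

      vocab_size, d_model, max_len, device = 10000, 512, 512, "cpu"  # placeholder hyperparameters

      tok_embedding = nn.Embedding(vocab_size, d_model)              # Primary Embedding lookup table
      pos_encoding = PositionalEncoding(d_model, max_len, device)    # class defined above

      x = torch.randint(0, vocab_size, (128, 30))   # token IDs: [batch_size=128, seq_len=30]
      tok_emb = tok_embedding(x)                    # [128, 30, 512]
      pos_emb = pos_encoding(x)                     # [30, 512]

      # Fusion: broadcast-add the positional encoding onto every sequence in the batch
      x_emb = tok_emb + pos_emb                     # [128, 30, 512]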
  2. Encoder & Decoder
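      A minimal sketch of the encoder-decoder stack itself, built from PyTorch's nn.TransformerEncoder / nn.TransformerDecoder on top of the fused embeddings; the layer count, head count, and tensor shapes are placeholder assumptions, not values from the notes.

      import torch
      from torch import nn

      d_model, nhead, num_layers = 512, 8, 6   # placeholder hyperparameters

      # Encoder: a stack of self-attention + feed-forward layers
      enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
      encoder = nn.TransformerEncoder(enc_layer, num_layers)

      # Decoder: masked self-attention + cross-attention over the encoder output
      dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
      decoder = nn.TransformerDecoder(dec_layer, num_layers)

      src = torch.randn(128, 30, d_model)   # fused source embeddings: [batch, src_len, d_model]
      tgt = torch.randn(128, 20, d_model)   # fused target embeddings: [batch, tgt_len, d_model]

      memory = encoder(src)                 # [128, 30, 512]
      out = decoder(tgt, memory)            # [128, 20, 512]
      # (in practice a causal tgt_mask is also passed to the decoder during training)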