# Llama

# Key Innovations

  • Pre-Normalization: layer normalization is applied before each sub-layer instead of after it, which makes training more stable and helps mitigate vanishing gradients.
  • RMSNorm: root mean square normalization. Compared with Layer Normalization, it drops the mean-subtraction (re-centering) step and only rescales by the root mean square, which is cheaper to compute.
  • RoPE: Rotary Position Embedding. Unlike the fixed positional encoding of the original Transformer, it is a dynamic (relative) positional encoding. RoPE is not added to the word embeddings; instead it applies a rotation to the query and key vectors, injecting each token's position information (see the sketch after this list).
  • SwiGLU: the activation function, which combines the advantages of GLU and Swish.
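
A minimal sketch of how RoPE rotates the query and key vectors. The function names, tensor shapes, and the "rotate-half" pairing convention below are illustrative assumptions, not code from any particular library; interleaved-pair formulations of RoPE are also common.

```python
import torch

def rotate_half(x):
    # Split the last dimension in half and swap the halves with a sign flip:
    # (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, head_dim, base=10000.0):
    # One frequency per pair of dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Angle for every (position, frequency) pair: (seq_len, head_dim / 2)
    freqs = positions.float()[:, None] * inv_freq[None, :]
    # Duplicate so cos/sin cover the full head dimension
    emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
    cos, sin = emb.cos(), emb.sin()
    # Rotate q and k; note that v is left untouched, position information
    # enters the attention scores only through q and k.
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# Toy usage: a single head with seq_len=4 and head_dim=8
q = torch.randn(4, 8)
k = torch.randn(4, 8)
q_rot, k_rot = apply_rope(q, k, torch.arange(4), head_dim=8)
```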

# RMS Normalization

$$
x_i = \frac{x_i}{\mathrm{RMS}(X)}\, g_i, \qquad \mathrm{RMS}(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}
$$

```python
import torch
import torch.nn as nn

class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        # Learnable per-dimension gain g_i, initialized to 1
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        # Compute statistics in float32 for numerical stability
        hidden_states = hidden_states.to(torch.float32)
        # Mean of squares over the hidden dimension (no mean subtraction)
        variance = hidden_states.pow(2).mean(dim=-1, keepdim=True)
        # Divide by RMS; epsilon guards against division by zero
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```
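
A quick sanity check of the module above; the tensor sizes are arbitrary, chosen only for illustration.

```python
# Normalize a batch of hidden states of shape (batch, seq_len, hidden)
norm = LlamaRMSNorm(hidden_size=8)
x = torch.randn(2, 4, 8)
y = norm(x)
print(y.shape)  # torch.Size([2, 4, 8])
```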

# LlamaMLP
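
This section is empty in the source. Below is a minimal sketch of the Llama feed-forward block, assuming the common gate/up/down projection layout with the SwiGLU activation (SiLU applied to the gate branch); the names `gate_proj`, `up_proj`, and `down_proj` follow the convention used in open Llama implementations and are illustrative here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # Gate and up projections expand the hidden size; down projection contracts it
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU(gate(x)) acts as a learned gate on up(x)
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```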
