[llama3/llama/model.py][class TransformerBlock] def __init__, def forward

ma-kjh · August 28, 2024
from dataclasses import dataclass
from typing import Optional

import torch
from torch import nn


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000

    max_batch_size: int = 32
    max_seq_len: int = 2048
class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads  # args.n_heads == 32 above: the number of heads multi-head attention is split into
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads  # per-head dimension after the split
        self.attention = Attention(args)  # the Attention class is covered later
        self.feed_forward = FeedForward(  # likewise
            dim=args.dim,  # 4096-dimensional
            hidden_dim=4 * args.dim,  # temporarily widened inside the FFN to 4 * 4096 (quite a lot)
            multiple_of=args.multiple_of,  # not entirely sure what this does yet; "make SwiGLU hidden layer size multiple of large power of 2", 256 is passed in. Check later.
            ffn_dim_multiplier=args.ffn_dim_multiplier,  # Optional[float] = None
        )
        self.layer_id = layer_id  # layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)  # apply a norm called RMSNorm before attention
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)  # and another RMSNorm before the feed-forward network
    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,  # why set a start position in forward?? (it marks where this chunk starts in the sequence, used by Attention to index the KV cache)
        freqs_cis: torch.Tensor,  # what is this.. (precomputed rotary position embedding frequencies)
        mask: Optional[torch.Tensor],  # Optional
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)  # first pass x through attention_norm, then through attention, then add the residual connection
        out = h + self.feed_forward(self.ffn_norm(h))  # the resulting features go through ffn_norm and the feed-forward network, again with a residual connection
        return out
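
A note on the multiple_of question in the comment above: FeedForward does not use hidden_dim = 4 * dim as-is. It first shrinks it for SwiGLU and then rounds it up to a multiple of multiple_of. The sketch below mirrors my reading of that computation in llama/model.py, so treat the exact constants as an assumption rather than a verbatim quote:

def ffn_hidden_dim(dim: int, multiple_of: int, ffn_dim_multiplier: Optional[float] = None) -> int:
    # roughly what FeedForward.__init__ does with the hidden_dim it receives
    hidden_dim = 4 * dim                  # the value TransformerBlock passes in
    hidden_dim = int(2 * hidden_dim / 3)  # shrink: SwiGLU uses three weight matrices instead of two
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # round up to the nearest multiple of multiple_of (256 here)
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

print(ffn_hidden_dim(4096, 256))  # 11008 for the default dim=4096 settings above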
        
        
        

Introduction to RMSNorm in LLaMA

  • RMSNorm (Root Mean Square Normalization) is a normalization technique used in the architecture of large language models like LLaMA (Large Language Model Meta AI).
  • RMSNorm is a variant of Layer Normalization (LayerNorm) and is designed to normalize the activations of a neural network layer, helping to stabilize training and improve the performance of deep learning models.

Why RMSNorm?

  • Traditional normalization methods like LayerNorm normalize the activations of a neural network by computing the mean and variance for each layer's activations.
  • However, RMSNorm simplifies this process by using only the Root Mean Square (RMS) of the activations, which can reduce computational overhead while still effectively normalizing the activations.

Mathematical Formulation of RMSNorm

Let's break down the formulation of RMSNorm step by step.

1. Activation Input

Consider the input to a neural network layer as a vector $\mathbf{h} \in \mathbb{R}^d$, where $d$ is the dimensionality of the layer (i.e., the number of neurons in that layer).

2. Root Mean Square (RMS) Calculation

The first step in RMSNorm is to calculate the Root Mean Square (RMS) of the activations:

$\text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} h_i^2}$

Here:

  • $h_i$ represents the $i$-th element of the activation vector $\mathbf{h}$.
  • $d$ is the dimensionality of the vector (number of neurons in the layer).
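
As a quick sanity check, the RMS above is a one-liner in PyTorch (the tensor below is a hypothetical activation vector, not taken from the model):

import torch

h = torch.randn(4096)              # hypothetical activation vector, d = 4096
rms = torch.sqrt((h ** 2).mean())  # RMS(h) = sqrt(1/d * sum_i h_i^2)
print(rms)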

3. Normalization Step

Next, RMSNorm normalizes the input activation vector $\mathbf{h}$ by dividing each element by the RMS value:

$\mathbf{\hat{h}} = \frac{\mathbf{h}}{\text{RMS}(\mathbf{h})}$

Where:

  • $\mathbf{\hat{h}}$ is the normalized activation vector.

4. Scaling with Learnable Parameter

To give the model the ability to adjust the scale of the normalized activations, RMSNorm includes a learnable scaling parameter $\gamma$:

$\mathbf{h}_{\text{norm}} = \gamma \cdot \mathbf{\hat{h}}$

Where:

  • $\gamma$ is a learnable parameter that adjusts the scale of the normalized activations.

This allows the model to adapt the normalization process during training.
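
Putting the three steps together, a minimal RMSNorm module looks like the sketch below: RMS over the last (feature) dimension, a learnable scale $\gamma$, and a small eps added inside the square root for numerical stability. It is written to match the RMSNorm(args.dim, eps=args.norm_eps) usage in the TransformerBlock code above, but it is a sketch rather than a verbatim copy of llama/model.py:

import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps                               # small constant for numerical stability
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale gamma, initialized to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / RMS(x), computed over the last (feature) dimension for each position
        x_hat = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x_hat                   # gamma * x_hat; note there is no beta/shift term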

Comparison with LayerNorm

  • LayerNorm computes both the mean and variance of the activations and normalizes them based on these statistics.
  • RMSNorm simplifies this by using only the RMS of the activations, which reduces computational complexity and can lead to faster training.

Advantages of RMSNorm

  1. Efficiency: By avoiding the computation of the mean and variance, RMSNorm can be more computationally efficient than LayerNorm.
  2. Stability: RMSNorm provides similar benefits in stabilizing the training of deep models, preventing issues like exploding or vanishing gradients.
  3. Flexibility: The learnable scaling parameter $\gamma$ allows the model to fine-tune the normalization, which can lead to better performance.

Application in LLaMA

In the LLaMA architecture, RMSNorm is used to normalize the activations of various layers within the model, such as the feedforward layers and attention mechanisms. The use of RMSNorm contributes to the stability and efficiency of training large-scale language models like LLaMA, enabling them to learn from vast amounts of text data effectively.

Conclusion

RMSNorm is a streamlined and effective normalization technique that plays a crucial role in the architecture of modern large language models like LLaMA. By focusing on the RMS of activations, RMSNorm simplifies the normalization process while maintaining the benefits of stability and efficiency, making it well-suited for deep learning models with many layers.

The main difference between Layer Normalization (LayerNorm) and RMSNorm lies in how they normalize the activations of a neural network layer.

1. Layer Normalization (LayerNorm)

LayerNorm normalizes the activations by computing both the mean and variance of the activations across all neurons in a layer. It then uses these statistics to standardize the activations, followed by scaling and shifting with learnable parameters.

Formulation:

Given an input activation vector $\mathbf{h}$ with $d$ dimensions (neurons in the layer):

  1. Mean:

    $\mu = \frac{1}{d} \sum_{i=1}^{d} h_i$
  2. Variance:

    $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (h_i - \mu)^2$
  3. Normalization:

    $\mathbf{\hat{h}} = \frac{\mathbf{h} - \mu}{\sqrt{\sigma^2 + \epsilon}}$

    where $\epsilon$ is a small constant to avoid division by zero.

  4. Scaling and Shifting:

    $\mathbf{h}_{\text{norm}} = \gamma \cdot \mathbf{\hat{h}} + \beta$

    where $\gamma$ and $\beta$ are learnable parameters that scale and shift the normalized output.
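
For reference, the four steps above reproduce PyTorch's built-in nn.LayerNorm when $\gamma = 1$ and $\beta = 0$. A small, self-contained check (the tensor is hypothetical):

import torch
from torch import nn

torch.manual_seed(0)
d, eps = 6, 1e-5
h = torch.randn(d)

# manual LayerNorm with gamma = 1, beta = 0, following steps 1-4 above
mu = h.mean()
var = h.var(unbiased=False)                    # biased variance, matching the formula
h_manual = (h - mu) / torch.sqrt(var + eps)

# built-in LayerNorm; its affine parameters initialize to gamma = 1, beta = 0
layer_norm = nn.LayerNorm(d, eps=eps)
print(torch.allclose(h_manual, layer_norm(h), atol=1e-6))  # True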

2. RMSNorm

RMSNorm simplifies the normalization process by only using the Root Mean Square (RMS) of the activations, without computing the mean and variance. It then normalizes the activations based on the RMS value and applies a learnable scaling parameter, but it does not include a shifting operation.

Formulation:

Given an input activation vector $\mathbf{h}$ with $d$ dimensions:

  1. RMS:

    $\text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} h_i^2}$
  2. Normalization:

    $\mathbf{\hat{h}} = \frac{\mathbf{h}}{\text{RMS}(\mathbf{h})}$
  3. Scaling:

    $\mathbf{h}_{\text{norm}} = \gamma \cdot \mathbf{\hat{h}}$

    where $\gamma$ is a learnable scaling parameter.

Key Differences:

  1. Normalization Statistic:

    • LayerNorm uses both mean and variance for normalization.
    • RMSNorm uses only the RMS of the activations.
  2. Computation Complexity:

    • LayerNorm requires computing both the mean and variance, making it slightly more computationally intensive.
    • RMSNorm simplifies this by only requiring the calculation of the RMS, reducing computational overhead.
  3. Shifting Operation:

    • LayerNorm includes both scaling and shifting (using learnable parameters $\gamma$ and $\beta$).
    • RMSNorm includes only scaling (using $\gamma$), with no shifting operation.
  4. Practical Implications:

    • LayerNorm is more general and commonly used across various types of neural networks, particularly where the shifting operation might be beneficial.
    • RMSNorm is more streamlined and can be more efficient, particularly in large models like LLaMA, where computational efficiency is crucial.

Summary:

The core difference is that LayerNorm normalizes using both mean and variance, whereas RMSNorm simplifies this by normalizing based solely on the RMS of the activations. This simplification makes RMSNorm more computationally efficient while still providing effective normalization.
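
To make the difference concrete, here is a small comparison with $\gamma$, $\beta$, and $\epsilon$ omitted; on zero-mean inputs the two coincide, while on inputs with a non-zero mean only LayerNorm recenters them (the values come from a hypothetical random tensor):

import torch

torch.manual_seed(0)
x = torch.randn(8) + 3.0                              # activations with a non-zero mean

x_ln = (x - x.mean()) / x.var(unbiased=False).sqrt()  # LayerNorm: subtract mean, divide by std
x_rms = x / x.pow(2).mean().sqrt()                    # RMSNorm: divide by RMS only

print(round(x_ln.mean().item(), 6))   # ~0.0: LayerNorm recenters the activations
print(round(x_rms.mean().item(), 6))  # > 0: RMSNorm preserves the (rescaled) mean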
