```python
from dataclasses import dataclass
from typing import Optional

import torch
from torch import nn


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000
    max_batch_size: int = 32
    max_seq_len: int = 2048
```
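As a quick, hypothetical usage sketch (the instantiation below is not part of the original code, and the 32000 vocabulary size is just an example value), the dataclass can be inspected to see the per-head dimensionality that TransformerBlock derives from it:

```python
args = ModelArgs(vocab_size=32000)  # vocab_size defaults to -1 as a placeholder and must be set explicitly
print(args.dim, args.n_heads, args.dim // args.n_heads)  # 4096 32 128  (the head_dim computed in TransformerBlock)
```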
```python
class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads  # args.n_heads == 32 above, i.e. multi-head attention splits the representation into 32 heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads  # dimensionality of each head after the split
        self.attention = Attention(args)  # covered later in the Attention class
        self.feed_forward = FeedForward(  # likewise
            dim=args.dim,  # model width of 4096
            hidden_dim=4 * args.dim,  # the hidden layer is temporarily widened to 4 * 4096 (quite a large expansion)
            multiple_of=args.multiple_of,  # rounds the SwiGLU hidden size up to a multiple of a large power of 2 (256 here); to be checked later
            ffn_dim_multiplier=args.ffn_dim_multiplier,  # Optional[float] = None
        )
        self.layer_id = layer_id  # layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)  # RMSNorm applied before attention
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)  # RMSNorm applied before the feed-forward network

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,  # why does forward take a start position? (it is the offset into the KV cache)
        freqs_cis: torch.Tensor,  # what is this? (precomputed rotary-embedding frequencies)
        mask: Optional[torch.Tensor],  # Optional
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)  # apply attention_norm to x, run attention, then add the residual connection
        out = h + self.feed_forward(self.ffn_norm(h))  # the result goes through ffn_norm and the feed-forward network, again with a residual connection
        return out
```
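To make the wiring of forward concrete, here is a minimal toy sketch of the same pre-norm residual pattern, using stock PyTorch modules in place of LLaMA's Attention, FeedForward, and RMSNorm (all covered later). It illustrates only the structure, not the actual implementation:

```python
import torch
from torch import nn

class ToyBlock(nn.Module):
    """Reproduces only the pre-norm + residual wiring of TransformerBlock.forward."""
    def __init__(self, dim: int):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.feed_forward = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attention_norm = nn.LayerNorm(dim)  # LLaMA uses RMSNorm here (see below)
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attention_norm(x)
        h = x + self.attention(a, a, a, need_weights=False)[0]  # normalize -> attend -> residual
        return h + self.feed_forward(self.ffn_norm(h))          # normalize -> FFN -> residual

x = torch.randn(2, 16, 64)    # (batch, seq_len, dim)
print(ToyBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```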
Let's break down the formulation of RMSNorm step by step.
Consider the input to a neural network layer as a vector \( \mathbf{a} \in \mathbb{R}^{d} \), where \( d \) is the dimensionality of the layer (i.e., the number of neurons in that layer).
The first step in RMSNorm is to calculate the Root Mean Square (RMS) of the activations:
\[ \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} a_i^2} \]
Here:
- \( a_i \) is the \( i \)-th activation in the layer.
- \( d \) is the number of neurons in the layer.
Next, RMSNorm normalizes the input activation vector by dividing each element by the RMS value:
\[ \bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} \]
Where:
- \( \bar{a}_i \) is the normalized activation.
To give the model the ability to adjust the scale of the normalized activations, RMSNorm includes a learnable scaling parameter \( g_i \):
\[ y_i = g_i \, \bar{a}_i = g_i \, \frac{a_i}{\text{RMS}(\mathbf{a})} \]
Where:
- \( g_i \) is a learnable gain (scale) parameter, one per dimension.
This allows the model to adapt the normalization process during training.
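Putting the three steps together, a minimal PyTorch sketch of an RMSNorm module looks like the following; a small eps inside the square root (matching norm_eps in ModelArgs) is added for numerical stability:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g, one entry per dimension

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # divide by RMS(x) = sqrt(mean(x^2)); eps guards against division by zero
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * self._norm(x.float()).type_as(x)

y = RMSNorm(4096, eps=1e-5)(torch.randn(2, 8, 4096))  # normalizes over the last (feature) dimension
```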
In the LLaMA architecture, RMSNorm is used to normalize the activations of various layers within the model, such as the feedforward layers and attention mechanisms. The use of RMSNorm contributes to the stability and efficiency of training large-scale language models like LLaMA, enabling them to learn from vast amounts of text data effectively.
RMSNorm is a streamlined and effective normalization technique that plays a crucial role in the architecture of modern large language models like LLaMA. By focusing on the RMS of activations, RMSNorm simplifies the normalization process while maintaining the benefits of stability and efficiency, making it well-suited for deep learning models with many layers.
The main difference between Layer Normalization (LayerNorm) and RMSNorm lies in how they normalize the activations of a neural network layer.
LayerNorm normalizes the activations by computing both the mean and variance of the activations across all neurons in a layer. It then uses these statistics to standardize the activations, followed by scaling and shifting with learnable parameters.
Formulation:
Given an input activation vector \( \mathbf{h} \in \mathbb{R}^{d} \) with \( d \) dimensions (neurons in the layer):
Mean:
\[ \mu = \frac{1}{d} \sum_{i=1}^{d} h_i \]
Variance:
\[ \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (h_i - \mu)^2 \]
Normalization:
\[ \hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \]
where \( \epsilon \) is a small constant to avoid division by zero.
Scaling and Shifting:
\[ y_i = \gamma_i \hat{h}_i + \beta_i \]
where \( \gamma \) and \( \beta \) are learnable parameters that scale and shift the normalized output.
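The computation above can be checked directly against torch.nn.LayerNorm (a short sketch; with default initialization the built-in gamma is 1 and beta is 0, so its output matches the manual version):

```python
import torch
from torch import nn

d = 8
h = torch.randn(4, d)  # a batch of 4 activation vectors

mu = h.mean(dim=-1, keepdim=True)
var = h.var(dim=-1, unbiased=False, keepdim=True)
h_hat = (h - mu) / torch.sqrt(var + 1e-5)  # manual LayerNorm with gamma = 1, beta = 0

print(torch.allclose(h_hat, nn.LayerNorm(d, eps=1e-5)(h), atol=1e-6))  # True
```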
RMSNorm simplifies the normalization process by only using the Root Mean Square (RMS) of the activations, without computing the mean and variance. It then normalizes the activations based on the RMS value and applies a learnable scaling parameter, but it does not include a shifting operation.
Formulation:
Given an input activation vector \( \mathbf{h} \in \mathbb{R}^{d} \) with \( d \) dimensions:
RMS:
\[ \text{RMS}(\mathbf{h}) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} h_i^2} \]
Normalization:
\[ \hat{h}_i = \frac{h_i}{\text{RMS}(\mathbf{h})} \]
Scaling:
\[ y_i = g_i \hat{h}_i \]
where \( g \) is a learnable scaling parameter.
Normalization Statistic: LayerNorm normalizes with both the mean and the variance of the activations; RMSNorm uses only the RMS.
Computation Complexity: RMSNorm is slightly cheaper, since it skips computing the mean and the centering step.
Shifting Operation: LayerNorm applies a learnable shift \( \beta \) after normalization; RMSNorm has only a learnable scale \( g \) and no shift.
Practical Implications: In practice RMSNorm trains about as well as LayerNorm while being simpler and marginally faster, which is one reason LLaMA adopts it.
The core difference is that LayerNorm normalizes using both mean and variance, whereas RMSNorm simplifies this by normalizing based solely on the RMS of the activations. This simplification makes RMSNorm more computationally efficient while still providing effective normalization.
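A quick numerical check of that difference (a sketch reusing the RMSNorm module from the section above, with both modules at their default initialization): the two outputs generally differ because RMSNorm skips the centering step, but they coincide on zero-mean inputs, where the variance equals the mean of squares:

```python
import torch
from torch import nn

d = 8
x = torch.randn(4, d)
ln = nn.LayerNorm(d, eps=1e-5)
rms = RMSNorm(d, eps=1e-5)  # the sketch defined in the RMSNorm section above

print(torch.allclose(ln(x), rms(x)))  # False: RMSNorm does not subtract the mean

x0 = x - x.mean(dim=-1, keepdim=True)              # re-center the input
print(torch.allclose(ln(x0), rms(x0), atol=1e-5))  # True: with zero mean, var(x0) == mean(x0**2)
```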