This post summarizes how to inspect the parameters of a HuggingFace model and how to freeze them, using a MarianMT model as the example. Two cases are covered: loading the model with AutoModelForSeq2SeqLM and loading it directly with MarianMTModel.
AutoModelForSeq2SeqLM
First, load the model with AutoModelForSeq2SeqLM. Passing "Helsinki-NLP/opus-mt-{src}-{tgt}" to from_pretrained() loads the corresponding pretrained model.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model
MarianMTModel(
(model): MarianModel(
(shared): Embedding(58101, 512, padding_idx=58100)
(encoder): MarianEncoder(
(embed_tokens): Embedding(58101, 512, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
(layers): ModuleList(
(0-5): 6 x MarianEncoderLayer(
(self_attn): MarianAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): SiLUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): MarianDecoder(
(embed_tokens): Embedding(58101, 512, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
...
)
)
)
(lm_head): Linear(in_features=512, out_features=58101, bias=False)
)
model.get_encoder().parameters()
Let's check the parameter values of the loaded model as shown below. Since the model was loaded with from_pretrained(), these parameters are the weights of an already-trained model. The requires_grad flag is shown along with the parameter values; we will use this flag later when freezing parameters.
for param in model.get_encoder().parameters():
    print(param)
Parameter containing:
tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
requires_grad=True)
Parameter containing:
tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8219, 0.8020, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9364, 0.9581, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0620, 0.7982, 0.6589, ..., 0.9984, 0.9985, 0.9986],
[ 0.8733, 0.9498, -0.2097, ..., 0.9984, 0.9985, 0.9986],
[ 0.8818, 0.2840, -0.9094, ..., 0.9984, 0.9985, 0.9986]])
Parameter containing:
tensor([[-0.0432, 0.0531, 0.0199, ..., 0.0515, -0.0586, -0.0680],
[ 0.0343, -0.0319, -0.0264, ..., -0.0077, 0.0399, 0.0289],
[ 0.0046, -0.0075, 0.0938, ..., -0.0922, -0.1382, -0.0110],
...,
[-0.0362, 0.0137, 0.0551, ..., -0.0105, 0.0095, -0.0624],
[-0.0201, 0.0875, 0.0610, ..., 0.0493, -0.1188, 0.0916],
[-0.0032, 0.0506, -0.0118, ..., 0.0734, -0.0367, 0.0189]],
...
1.4817e-01, 1.2603e-02, 2.1145e-01, -8.7563e-02, -3.2081e-02,
-3.3186e-01, -3.1934e-01, -7.9800e-02, -5.7997e-02, 6.0249e-02,
-4.1444e-02, -1.4424e-01, 3.3969e-03, -7.0272e-03, -3.2986e-02,
-8.2371e-03, 9.0344e-03], requires_grad=True)
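Besides printing the raw values, it is often useful to check how many parameters the model has in total and how many of them are trainable. A minimal sketch (count_parameters is just an illustrative helper name):

def count_parameters(model):
    # count all parameters vs. parameters that will receive gradient updates
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_parameters(model)
print(f"total: {total}, trainable: {trainable}")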
model.state_dict()
Another way to check the model's parameter values is the state_dict() method.
model.state_dict()
OrderedDict([('final_logits_bias', tensor([[ 2.6369, -2.2144, -0.2626, ..., -1.4751, -1.4528, 0.0000]])), ('model.shared.weight', tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])), ('model.encoder.embed_tokens.weight', tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])), ('model.encoder.embed_positions.weight', tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8219, 0.8020, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9364, 0.9581, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0620, 0.7982, 0.6589, ..., 0.9984, 0.9985, 0.9986],
[ 0.8733, 0.9498, -0.2097, ..., 0.9984, 0.9985, 0.9986],
[ 0.8818, 0.2840, -0.9094, ..., 0.9984, 0.9985, 0.9986]])), ('model.encoder.layers.0.self_attn.k_proj.weight', tensor([[-0.0432, 0.0531, 0.0199, ..., 0.0515, -0.0586, -0.0680],
[ 0.0343, -0.0319, -0.0264, ..., -0.0077, 0.0399, 0.0289],
[ 0.0046, -0.0075, 0.0938, ..., -0.0922, -0.1382, -0.0110],
...,
[-0.0362, 0.0137, 0.0551, ..., -0.0105, 0.0095, -0.0624],
[-0.0201, 0.0875, 0.0610, ..., 0.0493, -0.1188, 0.0916],
[-0.0032, 0.0506, -0.0118, ..., 0.0734, -0.0367, 0.0189]])), ('model.encoder.layers.0.self_attn.k_proj.bias', tensor([ 8.8546e-04, 1.2788e-01, -1.6511e-02, 3.0796e-02, -5.0024e-02,
...
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]))])
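Since the full state_dict() output is quite long, listing only the parameter names and shapes is often more convenient. A minimal sketch:

for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))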
configuration
Now let's load the model a different way and check again. This time, create a configuration with MarianConfig() and build the model with MarianMTModel().
from transformers import MarianConfig, MarianMTModel

config = MarianConfig()
model = MarianMTModel(config)
model
MarianMTModel(
(model): MarianModel(
(shared): Embedding(58101, 1024, padding_idx=58100)
(encoder): MarianEncoder(
(embed_tokens): Embedding(58101, 1024, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(1024, 1024)
(layers): ModuleList(
(0-11): 12 x MarianEncoderLayer(
(self_attn): MarianAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): MarianDecoder(
(embed_tokens): Embedding(58101, 1024, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(1024, 1024)
...
)
)
)
(lm_head): Linear(in_features=1024, out_features=58101, bias=False)
)
Checking this model's parameter values gives the following.
model.state_dict()
OrderedDict([('final_logits_bias', tensor([[0., 0., 0., ..., 0., 0., 0.]])),
('model.shared.weight',
tensor([[-0.0109, 0.0112, 0.0127, ..., 0.0365, -0.0166, -0.0262],
[ 0.0059, 0.0065, -0.0022, ..., 0.0184, 0.0402, -0.0057],
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])),
('model.encoder.embed_tokens.weight',
tensor([[-0.0109, 0.0112, 0.0127, ..., 0.0365, -0.0166, -0.0262],
[ 0.0059, 0.0065, -0.0022, ..., 0.0184, 0.0402, -0.0057],
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])),
('model.encoder.embed_positions.weight',
tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8317, 0.8219, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9236, 0.9364, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0176, -0.5887, -0.9995, ..., 0.9942, 0.9944, 0.9946],
[-0.8318, -0.9992, -0.5446, ..., 0.9942, 0.9944, 0.9946],
[-0.9165, -0.5208, 0.3790, ..., 0.9942, 0.9944, 0.9946]])),
...
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]))])
Here the parameter values are freshly initialized: this is an untrained model, so the parameters contain random values. Therefore, to use an already-trained pretrained model, it must be loaded with from_pretrained(...).
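Note that the Marian class itself can also load the trained weights directly; a minimal sketch, assuming the same checkpoint as above:

from transformers import MarianMTModel

# loads the trained weights, unlike MarianMTModel(MarianConfig())
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")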
Next, let's look at how to freeze the model's parameters.
Freezing a particular layer of a model means that the parameters of that layer are not optimized during training (back propagation). In other words, the parameters of a frozen layer are kept unchanged even while training runs. Parameter freezing is mainly used in transfer learning, in order to keep the feature information learned by the pretrained model intact.
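The mechanism behind freezing is simply requires_grad=False: a parameter with this flag receives no gradient during backpropagation, so the optimizer never changes it. A minimal toy sketch on a single nn.Linear layer, just to illustrate the mechanism:

import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
layer.weight.requires_grad = False  # freeze the weight only

out = layer(torch.randn(2, 4)).sum()
out.backward()
print(layer.weight.grad)  # None: no gradient is computed for the frozen weight
print(layer.bias.grad)    # a tensor: the bias is still trainable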
First, check the requires_grad value of each layer's parameters as follows.
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
for name, param in model.named_parameters():
    print(name, param.requires_grad)
model.shared.weight True
model.encoder.embed_positions.weight False
model.encoder.layers.0.self_attn.k_proj.weight True
model.encoder.layers.0.self_attn.k_proj.bias True
model.encoder.layers.0.self_attn.v_proj.weight True
model.encoder.layers.0.self_attn.v_proj.bias True
model.encoder.layers.0.self_attn.q_proj.weight True
model.encoder.layers.0.self_attn.q_proj.bias True
model.encoder.layers.0.self_attn.out_proj.weight True
model.encoder.layers.0.self_attn.out_proj.bias True
model.encoder.layers.0.self_attn_layer_norm.weight True
model.encoder.layers.0.self_attn_layer_norm.bias True
model.encoder.layers.0.fc1.weight True
model.encoder.layers.0.fc1.bias True
model.encoder.layers.0.fc2.weight True
model.encoder.layers.0.fc2.bias True
model.encoder.layers.0.final_layer_norm.weight True
model.encoder.layers.0.final_layer_norm.bias True
model.encoder.layers.1.self_attn.k_proj.weight True
model.encoder.layers.1.self_attn.k_proj.bias True
model.encoder.layers.1.self_attn.v_proj.weight True
model.encoder.layers.1.self_attn.v_proj.bias True
model.encoder.layers.1.self_attn.q_proj.weight True
model.encoder.layers.1.self_attn.q_proj.bias True
model.encoder.layers.1.self_attn.out_proj.weight True
...
model.decoder.layers.5.fc2.weight True
model.decoder.layers.5.fc2.bias True
model.decoder.layers.5.final_layer_norm.weight True
model.decoder.layers.5.final_layer_norm.bias True
The requires_grad value of every layer has now been checked. Next, we set requires_grad=False only for the encoder layers to freeze their parameters. See the code below.
for name, param in model.named_parameters():
    if name.split('.')[1] == "encoder":
        param.requires_grad = False
    print(name, param.requires_grad)
model.shared.weight True
model.encoder.embed_positions.weight False
model.encoder.layers.0.self_attn.k_proj.weight False
model.encoder.layers.0.self_attn.k_proj.bias False
model.encoder.layers.0.self_attn.v_proj.weight False
model.encoder.layers.0.self_attn.v_proj.bias False
model.encoder.layers.0.self_attn.q_proj.weight False
model.encoder.layers.0.self_attn.q_proj.bias False
model.encoder.layers.0.self_attn.out_proj.weight False
model.encoder.layers.0.self_attn.out_proj.bias False
model.encoder.layers.0.self_attn_layer_norm.weight False
model.encoder.layers.0.self_attn_layer_norm.bias False
model.encoder.layers.0.fc1.weight False
model.encoder.layers.0.fc1.bias False
model.encoder.layers.0.fc2.weight False
model.encoder.layers.0.fc2.bias False
model.encoder.layers.0.final_layer_norm.weight False
model.encoder.layers.0.final_layer_norm.bias False
model.encoder.layers.1.self_attn.k_proj.weight False
model.encoder.layers.1.self_attn.k_proj.bias False
model.encoder.layers.1.self_attn.v_proj.weight False
model.encoder.layers.1.self_attn.v_proj.bias False
model.encoder.layers.1.self_attn.q_proj.weight False
model.encoder.layers.1.self_attn.q_proj.bias False
model.encoder.layers.1.self_attn.out_proj.weight False
...
model.decoder.layers.5.fc2.weight True
model.decoder.layers.5.fc2.bias True
model.decoder.layers.5.final_layer_norm.weight True
model.decoder.layers.5.final_layer_norm.bias True
Now we can see that the encoder parameters are frozen.
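After freezing, the frozen parameters simply receive no gradient updates during training; optionally, only the still-trainable parameters can be handed to the optimizer. A minimal sketch (the optimizer choice and learning rate here are arbitrary):

import torch

# collect only the parameters that still require gradients
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-5)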
[Reference]
https://huggingface.co/docs/transformers/model_doc/marian
https://think-tech.tistory.com/61