This post summarizes how to inspect the parameters of a HuggingFace model and how to freeze them, using a MarianMT model as the example. Two cases are covered: loading the model with AutoModelForSeq2SeqLM and loading it directly with MarianMTModel.
AutoModelForSeq2SeqLM
First, load the model with AutoModelForSeq2SeqLM. Passing "Helsinki-NLP/opus-mt-{src}-{tgt}" to from_pretrained() loads the corresponding pretrained model.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model
MarianMTModel(
(model): MarianModel(
(shared): Embedding(58101, 512, padding_idx=58100)
(encoder): MarianEncoder(
(embed_tokens): Embedding(58101, 512, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
(layers): ModuleList(
(0-5): 6 x MarianEncoderLayer(
(self_attn): MarianAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): SiLUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): MarianDecoder(
(embed_tokens): Embedding(58101, 512, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
...
)
)
)
(lm_head): Linear(in_features=512, out_features=58101, bias=False)
)
model.get_encoder().parameters()
Let's check the parameter values of the loaded model as shown below. Since the model was loaded with from_pretrained(), these parameters are the weights of an already-trained model. The requires_grad flag is shown along with the parameter values; we will use this flag later when freezing parameters.
for param in model.get_encoder().parameters():
    print(param)
Parameter containing:
tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
requires_grad=True)
Parameter containing:
tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8219, 0.8020, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9364, 0.9581, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0620, 0.7982, 0.6589, ..., 0.9984, 0.9985, 0.9986],
[ 0.8733, 0.9498, -0.2097, ..., 0.9984, 0.9985, 0.9986],
[ 0.8818, 0.2840, -0.9094, ..., 0.9984, 0.9985, 0.9986]])
Parameter containing:
tensor([[-0.0432, 0.0531, 0.0199, ..., 0.0515, -0.0586, -0.0680],
[ 0.0343, -0.0319, -0.0264, ..., -0.0077, 0.0399, 0.0289],
[ 0.0046, -0.0075, 0.0938, ..., -0.0922, -0.1382, -0.0110],
...,
[-0.0362, 0.0137, 0.0551, ..., -0.0105, 0.0095, -0.0624],
[-0.0201, 0.0875, 0.0610, ..., 0.0493, -0.1188, 0.0916],
[-0.0032, 0.0506, -0.0118, ..., 0.0734, -0.0367, 0.0189]],
...
1.4817e-01, 1.2603e-02, 2.1145e-01, -8.7563e-02, -3.2081e-02,
-3.3186e-01, -3.1934e-01, -7.9800e-02, -5.7997e-02, 6.0249e-02,
-4.1444e-02, -1.4424e-01, 3.3969e-03, -7.0272e-03, -3.2986e-02,
-8.2371e-03, 9.0344e-03], requires_grad=True)
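Besides printing the raw values, it is often useful to check how many parameters the model has in total and how many of them are trainable. A minimal sketch (count_parameters is just an illustrative helper name):

def count_parameters(model):
    # count all parameters vs. parameters that will receive gradient updates
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_parameters(model)
print(f"total: {total}, trainable: {trainable}")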
model.state_dict()
Another way to check the model's parameter values is the state_dict() method.
model.state_dict()
OrderedDict([('final_logits_bias', tensor([[ 2.6369, -2.2144, -0.2626, ..., -1.4751, -1.4528, 0.0000]])), ('model.shared.weight', tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])), ('model.encoder.embed_tokens.weight', tensor([[ 0.0053, -0.0041, 0.0045, ..., -0.0420, -0.0391, -0.0400],
[ 0.0451, 0.0542, 0.0384, ..., -0.0205, -0.0519, 0.0105],
[-0.0088, 0.0169, 0.0082, ..., -0.0500, -0.0089, -0.0421],
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])), ('model.encoder.embed_positions.weight', tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8219, 0.8020, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9364, 0.9581, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0620, 0.7982, 0.6589, ..., 0.9984, 0.9985, 0.9986],
[ 0.8733, 0.9498, -0.2097, ..., 0.9984, 0.9985, 0.9986],
[ 0.8818, 0.2840, -0.9094, ..., 0.9984, 0.9985, 0.9986]])), ('model.encoder.layers.0.self_attn.k_proj.weight', tensor([[-0.0432, 0.0531, 0.0199, ..., 0.0515, -0.0586, -0.0680],
[ 0.0343, -0.0319, -0.0264, ..., -0.0077, 0.0399, 0.0289],
[ 0.0046, -0.0075, 0.0938, ..., -0.0922, -0.1382, -0.0110],
...,
[-0.0362, 0.0137, 0.0551, ..., -0.0105, 0.0095, -0.0624],
[-0.0201, 0.0875, 0.0610, ..., 0.0493, -0.1188, 0.0916],
[-0.0032, 0.0506, -0.0118, ..., 0.0734, -0.0367, 0.0189]])), ('model.encoder.layers.0.self_attn.k_proj.bias', tensor([ 8.8546e-04, 1.2788e-01, -1.6511e-02, 3.0796e-02, -5.0024e-02,
...
...,
[ 0.0247, 0.0379, 0.0113, ..., -0.0635, -0.0925, -0.0208],
[ 0.0239, 0.0394, 0.0120, ..., -0.0650, -0.0933, -0.0212],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]))])
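Since the full state_dict() output is quite long, listing only the parameter names and shapes is often more convenient. A minimal sketch:

for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))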
configuration
Now let's load the model a different way and check again. This time, create a configuration with MarianConfig() and build the model with MarianMTModel().
from transformers import MarianConfig, MarianMTModel

config = MarianConfig()
model = MarianMTModel(config)
model
MarianMTModel(
(model): MarianModel(
(shared): Embedding(58101, 1024, padding_idx=58100)
(encoder): MarianEncoder(
(embed_tokens): Embedding(58101, 1024, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(1024, 1024)
(layers): ModuleList(
(0-11): 12 x MarianEncoderLayer(
(self_attn): MarianAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): MarianDecoder(
(embed_tokens): Embedding(58101, 1024, padding_idx=58100)
(embed_positions): MarianSinusoidalPositionalEmbedding(1024, 1024)
...
)
)
)
(lm_head): Linear(in_features=1024, out_features=58101, bias=False)
)
Checking this model's parameter values gives the following.
model.state_dict()
OrderedDict([('final_logits_bias', tensor([[0., 0., 0., ..., 0., 0., 0.]])),
('model.shared.weight',
tensor([[-0.0109, 0.0112, 0.0127, ..., 0.0365, -0.0166, -0.0262],
[ 0.0059, 0.0065, -0.0022, ..., 0.0184, 0.0402, -0.0057],
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])),
('model.encoder.embed_tokens.weight',
tensor([[-0.0109, 0.0112, 0.0127, ..., 0.0365, -0.0166, -0.0262],
[ 0.0059, 0.0065, -0.0022, ..., 0.0184, 0.0402, -0.0057],
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]])),
('model.encoder.embed_positions.weight',
tensor([[ 0.0000, 0.0000, 0.0000, ..., 1.0000, 1.0000, 1.0000],
[ 0.8415, 0.8317, 0.8219, ..., 1.0000, 1.0000, 1.0000],
[ 0.9093, 0.9236, 0.9364, ..., 1.0000, 1.0000, 1.0000],
...,
[ 0.0176, -0.5887, -0.9995, ..., 0.9942, 0.9944, 0.9946],
[-0.8318, -0.9992, -0.5446, ..., 0.9942, 0.9944, 0.9946],
[-0.9165, -0.5208, 0.3790, ..., 0.9942, 0.9944, 0.9946]])),
...
[-0.0191, -0.0081, 0.0285, ..., 0.0140, 0.0031, -0.0248],
...,
[ 0.0059, 0.0062, -0.0087, ..., -0.0190, 0.0158, 0.0132],
[-0.0082, -0.0285, 0.0345, ..., 0.0032, -0.0302, -0.0160],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]))])
Here the parameter values are freshly initialized: this is an untrained model, so the parameters contain random values. Therefore, to use an already-trained pretrained model, it must be loaded with from_pretrained(...).
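Note that the Marian class itself can also load the trained weights directly; a minimal sketch, assuming the same checkpoint as above:

from transformers import MarianMTModel

# loads the trained weights, unlike MarianMTModel(MarianConfig())
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")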
Next, let's look at how to freeze the model's parameters.
Freezing a particular layer of a model means that the parameters of that layer are not optimized during training (back propagation). In other words, the parameters of a frozen layer are kept unchanged even while training runs. Parameter freezing is mainly used in transfer learning, in order to keep the feature information learned by the pretrained model intact.
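The mechanism behind freezing is simply requires_grad=False: a parameter with this flag receives no gradient during backpropagation, so the optimizer never changes it. A minimal toy sketch on a single nn.Linear layer, just to illustrate the mechanism:

import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
layer.weight.requires_grad = False  # freeze the weight only

out = layer(torch.randn(2, 4)).sum()
out.backward()
print(layer.weight.grad)  # None: no gradient is computed for the frozen weight
print(layer.bias.grad)    # a tensor: the bias is still trainable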
First, check the requires_grad value of each layer's parameters as follows.
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
for name, param in model.named_parameters():
    print(name, param.requires_grad)
model.shared.weight True
model.encoder.embed_positions.weight False
model.encoder.layers.0.self_attn.k_proj.weight True
model.encoder.layers.0.self_attn.k_proj.bias True
model.encoder.layers.0.self_attn.v_proj.weight True
model.encoder.layers.0.self_attn.v_proj.bias True
model.encoder.layers.0.self_attn.q_proj.weight True
model.encoder.layers.0.self_attn.q_proj.bias True
model.encoder.layers.0.self_attn.out_proj.weight True
model.encoder.layers.0.self_attn.out_proj.bias True
model.encoder.layers.0.self_attn_layer_norm.weight True
model.encoder.layers.0.self_attn_layer_norm.bias True
model.encoder.layers.0.fc1.weight True
model.encoder.layers.0.fc1.bias True
model.encoder.layers.0.fc2.weight True
model.encoder.layers.0.fc2.bias True
model.encoder.layers.0.final_layer_norm.weight True
model.encoder.layers.0.final_layer_norm.bias True
model.encoder.layers.1.self_attn.k_proj.weight True
model.encoder.layers.1.self_attn.k_proj.bias True
model.encoder.layers.1.self_attn.v_proj.weight True
model.encoder.layers.1.self_attn.v_proj.bias True
model.encoder.layers.1.self_attn.q_proj.weight True
model.encoder.layers.1.self_attn.q_proj.bias True
model.encoder.layers.1.self_attn.out_proj.weight True
...
model.decoder.layers.5.fc2.weight True
model.decoder.layers.5.fc2.bias True
model.decoder.layers.5.final_layer_norm.weight True
model.decoder.layers.5.final_layer_norm.bias True
The requires_grad value of every layer has now been checked. Next, we set requires_grad=False only for the encoder layers to freeze their parameters. See the code below.
for name, param in model.named_parameters():
    if name.split('.')[1] == "encoder":
        param.requires_grad = False
    print(name, param.requires_grad)
model.shared.weight True
model.encoder.embed_positions.weight False
model.encoder.layers.0.self_attn.k_proj.weight False
model.encoder.layers.0.self_attn.k_proj.bias False
model.encoder.layers.0.self_attn.v_proj.weight False
model.encoder.layers.0.self_attn.v_proj.bias False
model.encoder.layers.0.self_attn.q_proj.weight False
model.encoder.layers.0.self_attn.q_proj.bias False
model.encoder.layers.0.self_attn.out_proj.weight False
model.encoder.layers.0.self_attn.out_proj.bias False
model.encoder.layers.0.self_attn_layer_norm.weight False
model.encoder.layers.0.self_attn_layer_norm.bias False
model.encoder.layers.0.fc1.weight False
model.encoder.layers.0.fc1.bias False
model.encoder.layers.0.fc2.weight False
model.encoder.layers.0.fc2.bias False
model.encoder.layers.0.final_layer_norm.weight False
model.encoder.layers.0.final_layer_norm.bias False
model.encoder.layers.1.self_attn.k_proj.weight False
model.encoder.layers.1.self_attn.k_proj.bias False
model.encoder.layers.1.self_attn.v_proj.weight False
model.encoder.layers.1.self_attn.v_proj.bias False
model.encoder.layers.1.self_attn.q_proj.weight False
model.encoder.layers.1.self_attn.q_proj.bias False
model.encoder.layers.1.self_attn.out_proj.weight False
...
model.decoder.layers.5.fc2.weight True
model.decoder.layers.5.fc2.bias True
model.decoder.layers.5.final_layer_norm.weight True
model.decoder.layers.5.final_layer_norm.bias True
Now we can see that the encoder parameters are frozen.
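After freezing, the frozen parameters simply receive no gradient updates during training; optionally, only the still-trainable parameters can be handed to the optimizer. A minimal sketch (the optimizer choice and learning rate here are arbitrary):

import torch

# collect only the parameters that still require gradients
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-5)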
[Reference]
https://huggingface.co/docs/transformers/model_doc/marian
https://think-tech.tistory.com/61