Machine Translation with Hugging Face 🤗

AFL · June 25, 2023

This post summarizes what Hugging Face is and how to use it for translation.

Hugging Face?

Hugging Face provides a library with a wide range of transformer models (transformers.models) and training utilities (transformers.Trainer). When you work with transformer models, it saves you the effort of declaring layers and models yourself or implementing your own training script.

🤗 Transformers

🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using a pretrained model reduces compute costs, lowers your carbon footprint because you use fewer server resources, and saves the time and resources you would otherwise spend training a model from scratch.

🤗 Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX, so a model can be used flexibly across frameworks.


Translation with Hugging Face

Translation is a sequence-to-sequence task: the model takes one sequence (a sentence) and outputs another sequence. If you have a large enough corpus of text in two or more languages, you can train a new translation model from scratch. It is faster, however, to take an existing translation model and fine-tune it. The example below fine-tunes a Marian model.

1. Preparing data

Whether you fine-tune or train a model from scratch, you first need data. You can load a dataset from the Hugging Face Hub or use your own custom data. Here we load the KDE4 dataset.

KDE4 dataset

Download the data with load_dataset().

from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

To work with a different language pair, change the lang1 and lang2 codes. The languages available for this dataset are listed on its dataset card.

The downloaded raw_datasets object looks like this.

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

All 210,173 sentence pairs sit in a single split, so we need to split the data ourselves to get a separate validation set. The train_test_split() method does this.

split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets
DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
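As a rough illustration of where the 189,155 / 21,018 row counts above come from, here is a minimal pure-Python sketch of a seeded 90/10 split. The helper name split_indices is hypothetical and this is not how the datasets library implements train_test_split(); it only shows the idea of shuffling row indices reproducibly and cutting them at 90%.

```python
import math
import random


def split_indices(num_rows, train_fraction, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # the same seed gives the same split
    num_train = math.floor(num_rows * train_fraction)
    return indices[:num_train], indices[num_train:]


train_idx, valid_idx = split_indices(210173, train_fraction=0.9, seed=20)
print(len(train_idx), len(valid_idx))  # 189155 21018
```

Fixing the seed matters here: without it, every run would produce a different validation set and the scores would not be comparable between runs.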

Rename the 'test' key to 'validation' as follows.

split_datasets["validation"] = split_datasets.pop("test") 

Now let's look at a single element of the dataset:

split_datasets["train"][1]["translation"]
{'en': 'Default to expanded threads',
 'fr': 'Par défaut, développer les fils de discussion'}

Each element is a dictionary holding a pair of sentences, one in each language.

Processing the data

All text has to be converted into sets of token IDs before the model can make sense of it, so both the inputs and the targets must be tokenized. For that we first create a tokenizer object. As mentioned earlier, we use a pretrained Marian model. To work with another language pair, change the language codes in Helsinki-NLP/opus-mt-{src}-{tgt}. To use a different model altogether, set model_checkpoint to another model on the Hugging Face Hub or to a pretrained model you have saved locally.
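A tiny sketch of how the checkpoint name is assembled from the language codes. The helper marian_checkpoint is hypothetical (it is not part of any library), and it only builds the string: a checkpoint is not guaranteed to exist on the Hub for every language pair, so check the Hub before relying on the result.

```python
def marian_checkpoint(src, tgt):
    """Build a Helsinki-NLP Marian checkpoint name from two language codes."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(marian_checkpoint("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```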

from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

There is one thing to remember when preparing the data: make sure the tokenizer processes the target sentences in the output language. To do that, pass the targets to the tokenizer through the text_target argument.
Let's try it on a single sentence to see how it works.

en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

inputs = tokenizer(en_sentence, text_target=fr_sentence)
inputs
{'input_ids': [47591, 12, 9842, 19634, 9, 0], 'attention_mask': [1, 1, 1, 1, 1, 1], 'labels': [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0]}

In the output, input_ids holds the IDs of the input sentence (English) and labels holds the IDs of the target sentence (French). If you forget to tokenize the targets through text_target, they are tokenized by the input tokenizer, which gives poor results for a Marian model. Compare the two:

wrong_targets = tokenizer(fr_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"])) 
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))
['▁Par', '▁dé', 'f', 'aut', ',', '▁dé', 've', 'lop', 'per', '▁les', '▁fil', 's', '▁de', '▁discussion', '</s>']
['▁Par', '▁défaut', ',', '▁développer', '▁les', '▁fils', '▁de', '▁discussion', '</s>']

As the output shows, tokenizing a French sentence with the English tokenizer produces many more tokens (15 instead of 9 here), because that tokenizer does not know the French words.

inputs is a dictionary with the usual keys (input IDs, attention mask, and so on). The last step is to define the preprocessing function we will apply to our dataset.

max_length = 128


def preprocess_function(examples):
    # With batched=True, examples["translation"] is a list of {"en": ..., "fr": ...} dicts
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

The sentences we are dealing with are fairly short, so we use the same maximum length of 128 for both inputs and outputs.

Apply the preprocessing to every split of the dataset:

tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

The data is now preprocessed, and we are ready to fine-tune the pretrained model.


2. Fine-tuning the model with the Trainer API

We will train with Seq2SeqTrainer. It is a subclass of Trainer that handles evaluation properly by using the generate() method to predict outputs from the inputs.

First we need a model to fine-tune. We use the usual AutoModel API.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Data collation

We need a data collator to handle padding for dynamic batching. Here not only the inputs but also the labels have to be padded to the maximum length in the batch. In addition, the value used to pad the labels must be -100 rather than the tokenizer's padding token, so that the padded positions are ignored in the loss computation.

DataCollatorForSeq2Seq does all of this. Like DataCollatorWithPadding, it takes the tokenizer used to preprocess the inputs, but it also takes the model. The reason is that this data collator also prepares the decoder input IDs, which are a shifted version of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, DataCollatorForSeq2Seq needs to know the model object.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

To test it on a few samples, call it on a list of examples from the tokenized training set.

batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()
dict_keys(['attention_mask', 'input_ids', 'labels', 'decoder_input_ids'])

We can check that the labels have been padded to the maximum length of the batch using -100.

batch["labels"]
tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]])

Looking at the decoder input IDs, we can also see that they are a shifted version of the labels.

batch["decoder_input_ids"]
tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])
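To make the padding and the shift concrete, here is a pure-Python sketch that reproduces the two tensors printed above from the raw label lists. It is an illustration, not the library's implementation: pad_labels and shift_right are hypothetical helper names, and the sketch assumes that the decoder start token and the pad token share the ID 59513, which is what the decoder_input_ids output above suggests for this checkpoint.

```python
IGNORE_INDEX = -100       # label positions with this value are skipped by the loss
PAD_ID = 59513            # pad id visible in the decoder_input_ids printed above
DECODER_START_ID = 59513  # assumption: same id as the pad token for this model


def pad_labels(batch_labels, ignore_index=IGNORE_INDEX):
    """Pad every label list to the longest one in the batch with -100."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in batch_labels]


def shift_right(padded_labels, start_id=DECODER_START_ID, pad_id=PAD_ID,
                ignore_index=IGNORE_INDEX):
    """Prepend the start token, drop the last position, and turn -100 into pad."""
    shifted = []
    for labels in padded_labels:
        row = [start_id] + labels[:-1]
        shifted.append([pad_id if token == ignore_index else token for token in row])
    return shifted


labels = [
    [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0],
    [1211, 3, 49, 9409, 1211, 3, 29140, 817, 3124, 817, 550, 7032, 5821, 7907, 12649, 0],
]
padded = pad_labels(labels)
decoder_input_ids = shift_right(padded)
print(padded[0])
print(decoder_input_ids[0])
```

At position i the decoder therefore sees the label from position i - 1, which is what lets the model learn to predict each token from the ones before it.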

For comparison, here are the labels of the same two examples (indices 1 and 2) in the dataset.

for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])
[577, 5891, 2, 3184, 16, 2542, 5, 1710, 0]
[1211, 3, 49, 9409, 1211, 3, 29140, 817, 3124, 817, 550, 7032, 5821, 7907, 12649, 0]

This data_collator will be passed to the Seq2SeqTrainer. Next, let's look at the metric.

Metrics

What Seq2SeqTrainer adds to its superclass Trainer is the ability to use the generate() method during evaluation or prediction. During training, the model uses decoder_input_ids together with an attention mask that prevents it from looking at the tokens after the one it is trying to predict, which speeds up training. At inference time we cannot use these because there are no labels, so it is a good idea to evaluate the model under the same conditions, that is, with generate().
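To illustrate what generating without labels means, here is a toy sketch of greedy autoregressive decoding in pure Python. It is not how generate() is implemented (the real method supports beam search, sampling, caching and much more), and toy_scores is a made-up stand-in for a decoder; the point is only that each new token is chosen from the tokens generated so far, never from the reference translation.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_length):
    """Generate tokens one at a time, always picking the highest-scoring one."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)  # one score per vocabulary id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:  # stop as soon as end-of-sequence is produced
            break
    return tokens


def toy_scores(prefix, vocab_size=5):
    """Hypothetical stand-in for a decoder: favors the id after the last token."""
    favorite = (prefix[-1] + 1) % vocab_size
    return [1.0 if token_id == favorite else 0.0 for token_id in range(vocab_size)]


print(greedy_decode(toy_scores, start_id=1, eos_id=0, max_length=10))  # [1, 2, 3, 4, 0]
print(greedy_decode(toy_scores, start_id=1, eos_id=0, max_length=3))   # [1, 2, 3]
```

The second call shows the role of a maximum length: generation is cut off even though no end-of-sequence token was produced.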

The traditional metric for translation is the BLEU score, introduced in 2002 by Kishore Papineni et al. The BLEU score evaluates how close a translation is to its reference label. It does not measure how intelligible or grammatically correct the model output is; instead it uses statistical rules to check that the words in the generated output also appear in the target. It penalizes outputs that repeat a word the reference does not repeat, and it also penalizes outputs that are shorter than the reference.

One weakness of the BLEU score is that it expects text that is already tokenized, which makes it hard to compare scores between models that use different tokenizers. For that reason, the metric most commonly used to benchmark translation models today is SacreBLEU, which addresses this weakness by standardizing the tokenization step. To use it, we first need to install the SacreBLEU library.

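!python3 -m pip install sacrebleu evaluate

We can then load it with evaluate.load() from the 🤗 Evaluate library. (Older tutorials use load_metric() from datasets, which has been deprecated in favor of 🤗 Evaluate.)

import evaluate

metric = evaluate.load("sacrebleu")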

This metric takes texts as inputs and targets. It is designed to accept several acceptable targets, because there are often multiple valid translations of the same sentence. Since many NLP datasets provide several reference sentences per example, predictions should be a list of sentences, while references should be a list of lists of sentences.

Let's try an example.

predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.67, 54.54, 40.0, 33.33],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

This gives a BLEU score of 46.75, which is a fairly good result. By contrast, the two examples below show the poor scores you get for a prediction full of repetitions and for one that is too short.

predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.67, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)
{'score': 0.0,
 'counts': [2, 1, 0, 0],
 'totals': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 0.004086771438464067,
 'sys_len': 2,
 'ref_len': 13}

The higher the score, the better.
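The counts, totals and bp fields in the outputs above can be reproduced with a few lines of pure Python. The sketch below is a simplified single-reference BLEU, not SacreBLEU itself: the regex tokenizer is a stand-in for SacreBLEU's standardized tokenization, and there is no smoothing, so a prediction with zero matches for some n-gram order scores 0.0 here (SacreBLEU smooths those cases, which is why the "This This This This" example above still gets 1.68).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplification: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(pred_tokens, ref_tokens, n):
    """Count predicted n-grams, each capped at its count in the reference."""
    pred_counts, ref_counts = ngrams(pred_tokens, n), ngrams(ref_tokens, n)
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matches, sum(pred_counts.values())


def brevity_penalty(pred_len, ref_len):
    """Penalize predictions that are shorter than the reference."""
    if pred_len == 0:
        return 0.0
    return 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)


def sentence_bleu(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        matches, total = clipped_matches(pred, ref, n)
        if matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    geometric_mean = math.exp(sum(log_precisions) / max_n)
    return 100 * brevity_penalty(len(pred), len(ref)) * geometric_mean


reference = "This plugin allows you to automatically translate web pages between several languages."
prediction = "This plugin lets you translate web pages between several languages automatically."
print(round(sentence_bleu(prediction, reference), 2))  # 46.75
print(clipped_matches(tokenize("This This This This"), tokenize(reference), 1))  # (1, 4)
```

Clipping is what punishes repetition: "This" appears four times in the prediction but only once in the reference, so only one of the four unigrams counts as a match. The brevity penalty exp(1 - ref_len / sys_len) is what punishes short outputs.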

To turn the model outputs into text the metric can use, we call the tokenizer.batch_decode() method. For the labels we only have to get rid of all the -100s first by replacing them with the padding token ID (the tokenizer then skips padding tokens automatically when decoding).

import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing: the metric expects a list of references per prediction
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

Fine-tuning the model

To fine-tune the model, we first define the Seq2SeqTrainingArguments.

from transformers import Seq2SeqTrainingArguments 

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Apart from the usual hyperparameters (learning rate, number of epochs, batch size, weight decay), a few settings differ from the defaults:

  • We do not set up regular evaluation. Instead we evaluate once before training and once after.

  • We set fp16=True, which speeds up training on a GPU.

  • We set predict_with_generate=True, so that evaluation uses generate() as discussed above.

  • We set push_to_hub=True to upload the model to the Hub at the end of each epoch.

You can specify the full name of the repository you want to push to with hub_model_id. For example, to push the model to the huggingface-course organization, add hub_model_id="huggingface-course/marian-finetuned-kde4-en-to-fr" to the Seq2SeqTrainingArguments.

Finally, we pass everything to the Seq2SeqTrainer.

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Before training, we first look at the score the initial model gets, so that we can check later that fine-tuning has not made things worse. The command below takes a while.

trainer.evaluate(max_length=max_length)
{'eval_loss': 1.6964408159255981,
 'eval_bleu': 39.26865061007616,
 'eval_runtime': 965.8884,
 'eval_samples_per_second': 21.76,
 'eval_steps_per_second': 0.341}

A BLEU score of 39 is not bad. It shows that the model we picked is already good at translating English sentences into French.

Next comes the actual training, which also takes a long time.

trainer.train()

While training is running, the model is uploaded to the Hub in the background every time it is saved (here, once per epoch). This way you can resume training on another machine if you need to.

Once training is done, let's evaluate again.

trainer.evaluate(max_length=max_length)
{'eval_loss': 0.8558505773544312,
 'eval_bleu': 52.94161337775576,
 'eval_runtime': 714.2576,
 'eval_samples_per_second': 29.426,
 'eval_steps_per_second': 0.461,
 'epoch': 3.0}

That is an improvement of nearly 14 points (from 39.27 to 52.94)!

Finally, we call the push_to_hub() method to make sure the latest version of the model is uploaded. The Trainer also drafts a model card with all the evaluation results and uploads it. This model card contains metadata that helps the Model Hub pick the widget for the inference demo. Usually there is nothing to specify, because the right widget can be inferred from the model class, but in this case the same model class can be used for all kinds of sequence-to-sequence problems, so we state that it is a translation model.

trainer.push_to_hub(tags="translation", commit_message="Training complete")

The command returns the URL of the commit it just made, like this:

'https://huggingface.co/sgugger/marian-finetuned-kde4-en-to-fr/commit/3601d621e3baae2bc63d3311452535f8f58f6ef3'

You can now test the model with the inference widget on the Model Hub and share it with others. We have successfully fine-tuned a model on a translation task!

If you want to look more closely at the training loop, the next section shows how to do the same thing with 🤗 Accelerate.

3. A custom training loop

Now let's go through the full training loop, so that we can customize just the parts we need when we need to.

Preparing everything for training

First we set the format of the datasets to "torch", then build the DataLoaders from them.

from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

Next we reinstantiate the model, so that we start again from the pretrained weights instead of continuing the fine-tuning from the previous section.

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Then we create an optimizer.

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Once we have all of these objects, we can send them to the accelerator.prepare() method.

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that train_dataloader has been sent through accelerator.prepare(), we can use its length to compute the number of training steps. This has to be done after preparing the dataloader, because prepare() can change its length.

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
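As a back-of-the-envelope check of what this schedule does, here is a pure-Python sketch of a linear schedule with warmup. It is an illustration under assumptions, not the code behind get_scheduler(): linear_lr is a hypothetical helper, and the step count assumes a single process with a batch size of 8 and the 189,155 training rows from the split above.

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Ramp up linearly during warmup, then decay linearly to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# Assumption: one process, batch size 8, last partial batch kept.
steps_per_epoch = math.ceil(189155 / 8)
num_training_steps = 3 * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 23645 70935

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 2e-5, 0, num_training_steps))
```

With no warmup, the learning rate starts at 2e-5 on the first step and reaches zero exactly at the last one, which is why the total number of steps has to be known before the scheduler is created.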

To push our model to the Hub, we need to create a Repository object. If you are not logged in yet, log in to the Hugging Face Hub first. Then we derive the repository name from the model ID we want to give our model.

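from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name
'sgugger/marian-finetuned-kde4-en-to-fr-accelerate'

Then we clone that repository into a local folder. The training loop below saves to output_dir and pushes through repo.

output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"
repo = Repository(output_dir, clone_from=repo_name)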

Training loop

We are now ready to write the full training loop. To keep the evaluation part simple, we first define a postprocess() function that converts the predictions and labels into the lists of strings the metric expects.

def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

With that in place, here is the training loop itself.

from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
epoch 0, BLEU score: 53.47
epoch 1, BLEU score: 54.24
epoch 2, BLEU score: 54.44

Using the fine-tuned model

To use the fine-tuned model in a pipeline, we only need to specify the right model identifier.

from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")
[{'translation_text': 'Par défaut, développer les fils de discussion'}]

The pretrained model has adapted its predictions to the corpus it was fine-tuned on: instead of leaving the English word "threads" as it is, it now translates it into French.
Here is another example of domain adaptation.

translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)
[{'translation_text': "Impossible d'importer %1 en utilisant le module externe d'importation OFX. Ce fichier n'est pas le bon format."}]

Some of the code in this post may still contain mistakes that I have not fixed yet. For the exact code, see the references below.


[reference]

https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt#translation

https://wikidocs.net/166832
