[HF-Transformers] Trainer

ma-kjh Β· August 4, 2024
Trainer - a PyTorch optimized training loop

All models are a standard torch.nn.Module so you can use them in any typical training loop. While you can write your own training loop, πŸ€— Transformers provides a Trainer class for PyTorch, which contains the basic training loop and adds additional functionality for features like distributed training, mixed precision and more.
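
Features like mixed precision are typically switched on with a single flag on the TrainingArguments object introduced below, rather than by changing any loop code. A minimal sketch (it assumes a CUDA GPU with half-precision support; the variable name is just for illustration):

>>> from transformers import TrainingArguments

>>> # fp16 is a built-in TrainingArguments flag; assumes a GPU that supports half precision
>>> mixed_precision_args = TrainingArguments(output_dir="path/to/save/folder/", fp16=True)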

Depending on your task, you'll typically pass the following parameters to Trainer:

  1. You'll start with a PreTrainedModel or a torch.nn.Module:
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")
  2. TrainingArguments contains the model hyperparameters you can change, like
  • learning rate,
  • batch size,
  • and the number of epochs.

The default values are used if you don't specify any training arguments:

>>> from transformers import TrainingArguments

>>> training_args = TrainingArguments(
...     output_dir="path/to/save/folder/",
...     learning_rate=2e-5,
...     per_device_train_batch_size=8,
...     per_device_eval_batch_size=8,
...     num_train_epochs=2,
... )
  3. Load a preprocessing class like
  • a tokenizer
  • image processor
  • feature extractor
  • or processor.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
  4. Load a dataset:
>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes")  # doctest: +IGNORE_RESULT
  5. Create a function to tokenize the dataset:
>>> def tokenize_dataset(dataset):
...     return tokenizer(dataset["text"])

Then apply it over the entire dataset with map:

>>> dataset = dataset.map(tokenize_dataset, batched=True)
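
After map runs, each example carries the tokenizer outputs alongside the original columns; a quick way to check (the field names assume the rotten_tomatoes schema):

>>> dataset["train"][0].keys()  # expect keys like 'text', 'label', 'input_ids', 'attention_mask'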
  6. A DataCollatorWithPadding to create a batch of examples from your dataset:
>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Now gather all these classes in Trainer:

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=dataset["train"],
...     eval_dataset=dataset["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )  # doctest: +SKIP

When you're ready, call train() to start training:

>>> trainer.train()
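
Once training finishes, the same Trainer object can evaluate on the eval split and persist the model. A minimal sketch using standard Trainer methods:

>>> metrics = trainer.evaluate()  # runs a full pass over eval_dataset and returns the metrics dict
>>> trainer.save_model("path/to/save/folder/")  # writes the model weights and config to disk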

You can customize the training loop behavior by subclassing the methods inside Trainer. This allows you to customize features such as

  • the loss function,
  • optimizer,
  • and scheduler.

Take a look at the Trainer reference for which methods can be subclassed.
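
For example, the loss can be changed by subclassing Trainer and overriding compute_loss. The sketch below is only illustrative: the class name and class weights are made up, and **kwargs absorbs extra arguments that newer Trainer versions pass to compute_loss:

>>> import torch
>>> from transformers import Trainer

>>> class WeightedLossTrainer(Trainer):  # hypothetical subclass for illustration
...     def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
...         labels = inputs.pop("labels")
...         outputs = model(**inputs)
...         # Illustrative choice: weight the positive class twice as heavily as the negative one
...         loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0], device=model.device))
...         loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
...         return (loss, outputs) if return_outputs else loss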

The other way to customize the training loop is by using Callbacks. You can use callbacks to integrate with other libraries and inspect the training loop to report on progress or stop the training early. Callbacks do not modify anything in the training loop itself. To customize something like the loss function, you need to subclass the Trainer instead.
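
As a small sketch of the callback route, the class name below is made up, but TrainerCallback, its on_log hook, and Trainer.add_callback are real parts of the library:

>>> from transformers import TrainerCallback

>>> class PrintLossCallback(TrainerCallback):  # hypothetical callback for illustration
...     def on_log(self, args, state, control, logs=None, **kwargs):
...         # Inspect (but never modify) the metrics the Trainer just logged
...         if logs and "loss" in logs:
...             print(f"step {state.global_step}: loss = {logs['loss']}")

>>> trainer.add_callback(PrintLossCallback())

For stopping training early, transformers also provides a built-in EarlyStoppingCallback.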

What's next?

Now that you've completed the πŸ€— Transformers quick tour, check out our guides and learn how to do more specific things like

  • writing a custom model,
  • fine-tuning a model for a task,
  • and training a model with a script.