As a first experiment we will use the Trainer and train the model without any further modifications and a batch size of 4:
from transformers import TrainingArguments, Trainer, logging
logging.set_verbosity_error()
training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Time: 57.82
Samples/second: 8.86
GPU memory occupied: 14949 MB.
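If you are following along and still need the print_summary helper used above, a minimal sketch along these lines could reproduce the output (this assumes the nvidia-ml-py / pynvml package is installed and relies on the train_runtime and train_samples_per_second metrics reported by the Trainer; treat it as an illustration rather than the exact helper):
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    # Query the memory currently allocated on GPU 0 via NVML
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB.")

def print_summary(result):
    # result is the TrainOutput returned by trainer.train()
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()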
We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better final performance. So ideally we want to tune the batch size to our model's needs and not to the GPU's limitations. A simple trick to effectively train with a larger batch size is gradient accumulation.
The idea behind gradient accumulation is to calculate the gradients in smaller steps instead of computing them for the whole batch at once. We do this by iterating over smaller batches, performing a forward and backward pass through the model for each one and accumulating the gradients in the process. Once enough gradients have been accumulated, we run the model's optimization step. This way we can easily increase the overall batch size to numbers that would never fit into the GPU's memory. In turn, however, the added forward and backward passes can slow down training a bit.
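To make the mechanics concrete, here is a minimal sketch of gradient accumulation written out in plain PyTorch (the model, dataloader, optimizer, and accumulation_steps names are placeholders for illustration; the Trainer takes care of all of this internally):
accumulation_steps = 4  # number of small batches to accumulate before each update

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    # Scale the loss so the accumulated gradients match one large batch
    loss = outputs.loss / accumulation_steps
    loss.backward()  # gradients accumulate in the parameters' .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # optimization step with the accumulated gradients
        optimizer.zero_grad()  # reset gradients for the next accumulation cycle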
We can use gradient accumulation in the Trainer by simply adding the gradient_accumulation_steps argument to TrainingArguments. Let's see how it impacts the model's memory footprint:
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
Time: 66.03
Samples/second: 7.75
GPU memory occupied: 8681 MB.
We can see that the memory footprint was dramatically reduced at the cost of being only slightly slower than the vanilla run. Of course, this would change as you increase the number of accumulation steps. In general you want to max out the GPU usage as much as possible, so in our case the batch size of 4 was already pretty close to the GPU's limit. If we wanted to train with a batch size of 64 we should not use per_device_train_batch_size=1 and gradient_accumulation_steps=64 but instead per_device_train_batch_size=4 and gradient_accumulation_steps=16, which gives the same effective batch size while making better use of the available GPU resources.
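Concretely, the recommended setup from above would look like this (reusing the same default_args as before):
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch size: 4 * 16 = 64
    **default_args,
)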