[PyTorch]Distributed Data Parallel - DataDistributed

MA · July 24, 2022

Reference : https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#comparison-between-dataparallel-and-distributeddataparallel

To understand DDP (Distributed Data Parallel), we first need to understand the Distributed package.

PyTorch Distributed Overview

This section describes the torch.distributed package.

$\bullet$ Distributed Data-Parallel Training (DDP) is a widely adopted single-program multiple-data training paradigm. With DDP, the model is replicated on every process, and every model replica will be fed with a different set of input data samples. DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.

In other words, the same program runs in every process, and each process trains its copy of the model on a different portion of the data.
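As a rough illustration of the paradigm described above, here is a minimal sketch that wraps a model in DistributedDataParallel inside a single CPU process. The `gloo` backend, the world size of 1, the local address, and the helper names `_free_port` and `run_single_process_ddp` are all assumptions made so the example runs without GPUs; a real job would start one process per GPU.

```python
import os
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def _free_port():
    # Hypothetical helper: ask the OS for an unused local TCP port.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def run_single_process_ddp():
    # Assumption: one process (rank 0, world_size 1) on CPU with the gloo backend.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(_free_port())
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        model = DDP(nn.Linear(5, 2))      # replica owned by this process
        output = model(torch.randn(30, 5))
        output.sum().backward()           # DDP synchronizes gradients during backward
        return tuple(output.shape)
    finally:
        dist.destroy_process_group()
```

With more than one process, each rank would run this same function with its own `rank` value and its own slice of the data.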

Data Parallel Training

PyTorch provides several options for data-parallel training. For applications that grow gradually from simple to complex and from prototype to production, the common development trajectory is:

Use single-device training if the data and model can fit in one GPU, and training speed is not a concern.
That is, if training time does not matter and the model and data fit on one GPU, train on a single device (GPU).
Use single-machine multi-GPU DataParallel to make use of multiple GPUs on a single machine to speed up training with minimal code changes.
DataParallel lets a single machine use several of its GPUs at once.
Use single-machine multi-GPU DistributedDataParallel, if you would like to further speed up training and are willing to write a little more code to set it up.
With a little more setup code, this gives a further speedup.
Use multi-machine DistributedDataParallel and the launching script, if the application needs to scale across machine boundaries.
Use torch.distributed.elastic to launch distributed training if errors (e.g., out-of-memory) are expected or if resources can join and leave dynamically during training.
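The list above can be read as a small decision procedure. The sketch below is only one way to encode it; `suggest_strategy` and its parameters are hypothetical names introduced for illustration, and the function returns a label rather than configuring anything in PyTorch.

```python
def suggest_strategy(fits_on_one_gpu, speed_matters, n_machines=1,
                     accept_setup_code=False, elastic=False):
    """Hypothetical helper that mirrors the list above and returns a label."""
    if elastic:
        # Errors are expected, or workers may join and leave during training.
        return "torch.distributed.elastic"
    if n_machines > 1:
        # The job has to cross machine boundaries.
        return "multi-machine DistributedDataParallel"
    if fits_on_one_gpu and not speed_matters:
        return "single-device"
    if accept_setup_code:
        # A little more code, a further speedup.
        return "single-machine DistributedDataParallel"
    # Minimal code changes on one machine with several GPUs.
    return "single-machine DataParallel"
```

For example, `suggest_strategy(True, True)` returns `"single-machine DataParallel"`, which is the option the rest of this post walks through.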

Note
Data-parallel training also works with Automatic Mixed Precision (AMP)

torch.nn.DataParallel

DataParallel makes it very easy to use multiple GPUs on a single machine. It takes a one-line change.

device = torch.device("cuda:0")
model.to(device)

This moves the model to the GPU.

mytensor = my_tensor.to(device)

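Likewise, this puts the tensor on the GPU (cuda:0). Note that my_tensor.to(device) returns a new copy on the GPU rather than modifying my_tensor in place, so the result has to be assigned to a variable.

PyTorch uses only one GPU by default, but running on several GPUs is just as simple: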

model = nn.DataParallel(model)

Let's look at this in more detail.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

Device

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Dummy DataSet

class RandomDataset(Dataset):
    # Dataset of `length` random vectors, each of dimension `size`

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

Simple Model

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)  # the layer is an attribute, so call it via self
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output

Create Model and DataParallel

This is the most important part. If you have multiple GPUs, you can wrap the model in nn.DataParallel. After that, move the model to the GPU with model.to(device).

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

Run the Model

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    # print the size of the output tensor, not the output_size constant
    print("Outside: input size", input.size(),
          "output_size", output.size())

In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

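With no GPU or a single GPU, each batch of 30 inputs goes through the model whole and produces 30 outputs. The last batch holds only the 10 samples that remain.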
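The four batches in the log above (30, 30, 30, 10) follow from data_size = 100 and batch_size = 30, given that DataLoader keeps the final partial batch by default (drop_last=False). The sketch below reproduces that arithmetic in plain Python; `batch_lengths` is a hypothetical helper, not part of PyTorch.

```python
def batch_lengths(data_size, batch_size):
    """Sizes of the batches a loader yields when the partial last batch is kept."""
    full_batches, remainder = divmod(data_size, batch_size)
    lengths = [batch_size] * full_batches
    if remainder:
        lengths.append(remainder)  # leftover samples form a smaller final batch
    return lengths
```

Calling `batch_lengths(100, 30)` gives three full batches and one batch of 10, matching the "Outside" lines in the log.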

2 GPUs

# on 2 GPUs
Let's use 2 GPUs!
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

3 GPUs

Let's use 3 GPUs!
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

8 GPUs

Let's use 8 GPUs!
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
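The per-GPU sizes in the logs above are consistent with each batch being split along dimension 0 the way torch.chunk splits a tensor: every piece gets ceil(batch / n_gpus) rows, and the last piece takes whatever is left. The sketch below reproduces those numbers in plain Python. `chunk_sizes` is a hypothetical helper, and the splitting rule is an assumption inferred from the logs, not a statement about DataParallel's internals.

```python
import math


def chunk_sizes(batch, n_devices):
    """Split `batch` rows into pieces of ceil(batch / n_devices) rows each."""
    size = math.ceil(batch / n_devices)
    sizes = []
    remaining = batch
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes
```

Under this rule a batch of 30 on 8 GPUs becomes seven pieces of 4 and one piece of 2, while the final batch of 10 becomes only five pieces of 2. That would explain why the last step of the 8-GPU log shows five "In Model" lines instead of eight: three GPUs receive no data for that batch.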
