Why the First Run on the GPU Incurs an Initial Delay

양경모 · February 6, 2023

TL;DR

Q: Why does an initial delay occur when running on the GPU?

A: The CUDA runtime initializes the first time a runtime function is called, and that initialization includes setting up a CUDA context (the device-side analogue of a host process) and CUDA modules (the analogue of DLLs). That work is what shows up as the initial delay.

Content

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-general-purpose-parallel-computing-architecture

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.

CUDA exposes thread groups, shared memory, and barrier synchronization. With these, the programmer partitions a problem into smaller sub-problems and solves each one in parallel with threads; each thread block is then scheduled onto whichever GPU multiprocessor is available.

Kernels

CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.

A kernel is a C++ function that, when called, is executed N times in parallel by N different CUDA threads (hereafter, threads).
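
For instance, here is a minimal vector-addition sketch in the spirit of the guide's VecAdd example; the launch configuration <<<1, N>>> runs one block of N threads. Managed memory is used only to keep the sketch short (it is covered later in this post).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each of the N threads computes exactly one element of C.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged((void**)&A, N * sizeof(float));
    cudaMallocManaged((void**)&B, N * sizeof(float));
    cudaMallocManaged((void**)&C, N * sizeof(float));
    for (int i = 0; i < N; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }

    VecAdd<<<1, N>>>(A, B, C);   // one block of N threads
    cudaDeviceSynchronize();     // wait for the kernel to finish

    printf("C[10] = %f\n", C[10]);  // 10 + 20 = 30
    cudaFree(A); cudaFree(B); cudaFree(C);
}
```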

Thread Hierarchy

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The thread index, which uniquely identifies a thread within its block, is a 3-component vector, so threads can be addressed with a one-, two-, or three-dimensional index. This maps naturally onto computation over the elements of a vector, a matrix, or a volume.

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

A thread's ID within its block can be derived from the block's dimensions and its thread index.

The grid consists of multiple blocks of identical size, each containing the same number of threads laid out this way, as in the sketch below.
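
A small sketch of that layout; the grid and block sizes here are arbitrary choices for illustration:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// For a block of size (Dx, Dy), the flat thread ID of thread (x, y)
// is x + y * Dx, exactly as in the quoted formula.
__global__ void ShowIds()
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x;  // x + y * Dx
    if (tid == 0)  // print one line per block
        printf("block (%d,%d): %d threads, IDs 0..%d\n",
               (int)blockIdx.x, (int)blockIdx.y,
               (int)(blockDim.x * blockDim.y),
               (int)(blockDim.x * blockDim.y) - 1);
}

int main()
{
    dim3 grid(2, 2);   // four identical blocks
    dim3 block(8, 4);  // Dx = 8, Dy = 4: 32 threads per block
    ShowIds<<<grid, block>>>();
    cudaDeviceSynchronize();
}
```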

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.

For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.

Threads within a block can cooperate by sharing data through shared memory and by synchronizing at barriers.
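
As a sketch, the following kernel reverses an array within a single block; without the __syncthreads() barrier, a thread could read a shared-memory slot before the owning thread has written it:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes one element into shared memory, everyone waits at
// the barrier, then each thread reads a *different* element back out.
__global__ void ReverseInBlock(int* data, int n)
{
    __shared__ int tile[256];   // visible to every thread in this block
    int i = threadIdx.x;
    tile[i] = data[i];
    __syncthreads();            // all writes complete before any read below
    data[i] = tile[n - 1 - i];
}

int main()
{
    const int N = 256;          // must match the shared array size above
    int* d;
    cudaMallocManaged((void**)&d, N * sizeof(int));
    for (int i = 0; i < N; ++i) d[i] = i;

    ReverseInBlock<<<1, N>>>(d, N);
    cudaDeviceSynchronize();
    printf("d[0] = %d (was 0)\n", d[0]);   // prints 255
    cudaFree(d);
}
```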

Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 5. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

Each thread can access the memory spaces shown in the figure: its own private local memory, the block's shared memory, and global memory, which every thread can access.

Besides global memory there are two read-only memory spaces, constant and texture memory, each optimized for a different usage pattern.
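
For example, constant memory is declared with __constant__ and written from the host with cudaMemcpyToSymbol; a minimal sketch (the names kCoeffs and ApplyPoly are made up for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// A symbol in the read-only constant memory space, set from the host.
__constant__ float kCoeffs[4];

__global__ void ApplyPoly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];  // evaluate the cubic in Horner form
        y[i] = kCoeffs[0] + v * (kCoeffs[1] + v * (kCoeffs[2] + v * kCoeffs[3]));
    }
}

int main()
{
    float host[4] = {1.f, 2.f, 3.f, 4.f};
    cudaMemcpyToSymbol(kCoeffs, host, sizeof(host));  // host -> constant memory

    const int N = 1024;
    float *x, *y;
    cudaMallocManaged((void**)&x, N * sizeof(float));
    cudaMallocManaged((void**)&y, N * sizeof(float));
    for (int i = 0; i < N; ++i) x[i] = 1.0f;

    ApplyPoly<<<(N + 255) / 256, 256>>>(x, y, N);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);   // 1 + 2 + 3 + 4 = 10
    cudaFree(x); cudaFree(y);
}
```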

Heterogeneous Programming

As illustrated by Figure 6, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C++ program. This is the case, for example, when the kernels execute on a GPU and the rest of the C++ program executes on a CPU.

The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in Programming Interface). This includes device memory allocation and deallocation as well as data transfer between host and device memory.

Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single, coherent memory image with a common address space. This capability enables oversubscription of device memory and can greatly simplify the task of porting applications by eliminating the need to explicitly mirror data on host and device. See Unified Memory Programming for an introduction to Unified Memory.

The CUDA programming model assumes that CUDA threads execute on a device that is physically separate from the host running the C++ program.

It also assumes that host and device each maintain their own memory: host memory and device memory. The host program manages the global, constant, and texture memory spaces visible to kernels through CUDA runtime calls, which cover device memory allocation and deallocation as well as data transfer between host and device memory.

Beyond host and device memory there is unified memory: managed memory in a common address space that both host and device can access as a single coherent image.
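
A minimal sketch of unified memory, using cudaMallocManaged so the CPU and GPU touch the same pointer:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Scale(float* v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main()
{
    const int N = 1 << 20;
    float* v;
    // One pointer in one coherent address space, valid on both CPU and GPU.
    cudaMallocManaged((void**)&v, N * sizeof(float));

    for (int i = 0; i < N; ++i) v[i] = 1.0f;      // written by the CPU
    Scale<<<(N + 255) / 256, 256>>>(v, N, 3.0f);  // updated by the GPU
    cudaDeviceSynchronize();                      // finish before the CPU reads
    printf("v[0] = %f\n", v[0]);                  // 3.0
    cudaFree(v);
}
```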

Programming Interface

The core language extensions have been introduced in Programming Model. They allow programmers to define a kernel as a C++ function and use some new syntax to specify the grid and block dimension each time the function is called. A complete description of all extensions can be found in C++ Language Extensions. Any source file that contains some of these extensions must be compiled with nvcc as outlined in Compilation with NVCC.

The language extensions let you define a kernel as a C++ function; any source file containing these extensions must be compiled with nvcc.
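
For instance, assuming a hypothetical source file named vector_add.cu, compilation would look like `nvcc vector_add.cu -o vector_add`.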

The runtime is introduced in CUDA Runtime. It provides C and C++ functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc.

The CUDA runtime provides functions, executed on the host, for allocating and freeing device memory, transferring data between host and device memory, managing systems with multiple devices, and so on.
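
A sketch of the classic explicit pattern: allocate device memory, copy in, copy out, free:

```cpp
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    const int N = 256;
    float host[N];
    for (int i = 0; i < N; ++i) host[i] = (float)i;

    // Allocate device memory and copy host -> device.
    float* dev = nullptr;
    cudaMalloc((void**)&dev, N * sizeof(float));
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);

    // ... kernels would run here ...

    // Copy device -> host, then release the device allocation.
    float back[N];
    std::memset(back, 0, sizeof(back));
    cudaMemcpy(back, dev, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("back[42] = %f\n", back[42]);  // 42.0
}
```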

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts - the analogue of host processes for the device - and CUDA modules - the analogue of dynamically loaded libraries for the device. Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. As the runtime is interoperable with the driver API, most applications that need some driver API features can default to use the runtime API and only use the driver API where needed.

The CUDA runtime is built on top of the lower-level CUDA driver API, which exposes concepts such as CUDA contexts (the device analogue of host processes) and CUDA modules (the analogue of dynamically loaded libraries). Most applications never need this extra level of control; because the two APIs are interoperable, an application can default to the runtime API and drop down to the driver API only where needed.
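
A small interop sketch (link with -lcuda): a runtime call initializes the device's primary context, which the driver API can then observe. cudaFree(0) is a common warm-up idiom; exact binding behavior can vary by CUDA version, so treat this as illustrative:

```cpp
#include <cstdio>
#include <cuda.h>          // driver API
#include <cuda_runtime.h>  // runtime API

int main()
{
    // A do-nothing runtime call forces runtime initialization, which
    // creates and binds the device's primary context...
    cudaFree(0);

    // ...and the driver API can then see that context directly.
    CUcontext ctx = nullptr;
    cuCtxGetCurrent(&ctx);
    printf("current context after runtime init: %p\n", (void*)ctx);
}
```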

CUDA Runtime

There is no explicit initialization function for the runtime; it initializes the first time a runtime function is called (more specifically any function other than functions from the error handling and version management sections of the reference manual). One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime.

The runtime has no explicit initialization function; it initializes on the first runtime call. You need to keep this in mind when timing runtime calls and when interpreting the error code of that first call.
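
A sketch of how this shows up when timing: the first trivial runtime call absorbs context creation, while the second reflects the real call cost (again using the cudaFree(0) idiom):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Measure one call to f in milliseconds.
static double MsFor(void (*f)())
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // The first call pays for runtime/context initialization; the second does not.
    printf("first  cudaFree(0): %8.2f ms\n", MsFor([] { cudaFree(0); }));
    printf("second cudaFree(0): %8.2f ms\n", MsFor([] { cudaFree(0); }));
}
```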

The runtime creates a CUDA context for each device in the system (see Context for more details on CUDA contexts). This context is the primary context for this device and is initialized at the first runtime function which requires an active context on this device. It is shared among all the host threads of the application. As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory.

The runtime creates a CUDA context for each device. This becomes the device's primary context; it is initialized at the first runtime call that requires an active context on that device, and it is shared by all host threads of the application. As part of context creation, the device code is JIT-compiled if necessary and loaded into device memory.

CUDA Context

A CUDA context is analogous to a CPU process. All resources and actions performed within the driver API are encapsulated inside a CUDA context, and the system automatically cleans up these resources when the context is destroyed. Besides objects such as modules and texture or surface references, each context has its own distinct address space. As a result, CUdeviceptr values from different contexts reference different memory locations.

A CUDA context is analogous to a CPU process: all resources and actions performed through the driver API are encapsulated inside a context, and the system cleans those resources up when the context is destroyed. Each context also has its own address space, so CUdeviceptr values from different contexts refer to different memory locations.
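
A driver API sketch of that lifecycle (link with -lcuda): everything allocated inside the context is released when the context is destroyed:

```cpp
#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);                    // must precede any other driver API call

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Like a process, the context owns everything created inside it.
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr p;
    cuMemAlloc(&p, 1 << 20);      // 1 MiB owned by ctx

    // Destroying the context releases its resources, including p.
    cuCtxDestroy(ctx);
    printf("context destroyed; its allocations were cleaned up\n");
}
```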

CUDA Module

Modules are dynamically loadable packages of device code and data, akin to DLLs in Windows, that are output by nvcc (see Compilation with NVCC). The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.

A module is a dynamically loadable package of device code and data, akin to a DLL on Windows, produced by nvcc. Symbol names are scoped to the module, so independently written modules can interoperate in the same CUDA context.
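
A sketch of loading a module by hand through the driver API; the file kernel.ptx and the kernel name VecAdd are hypothetical, with the PTX produced separately, e.g. via `nvcc -ptx kernel.cu -o kernel.ptx`:

```cpp
#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // Load a compiled module from disk; symbols are looked up by name
    // at module scope, much like GetProcAddress on a DLL.
    CUmodule mod;
    if (cuModuleLoad(&mod, "kernel.ptx") != CUDA_SUCCESS) {
        printf("failed to load kernel.ptx\n");
        return 1;
    }
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "VecAdd");  // ready for cuLaunchKernel

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```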
