Setting Up a Deep Learning Environment

Yougurt_Man · April 14, 2024


The server needs matching versions of Python, cuDNN, CUDA, and TensorFlow installed.
For reference, Colab runs Python 3.10 and TensorFlow 2.12.0.
My Ubuntu 20.04 machine currently has Python 3.8.10:
```
david@david-XH58:~$ python3 --version
Python 3.8.10
```
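
The version check can also be scripted. The sketch below assumes the supported range is Python 3.8 to 3.11, taken from the Tested build configurations row for tensorflow-2.12.0 quoted later in this post; the function name `is_supported_python` is made up for illustration.

```python
import sys

# Assumption: range taken from the "Tested build configurations" row
# for tensorflow-2.12.0 (Python 3.8-3.11).
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 11)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the assumed supported range."""
    if version is None:
        version = sys.version_info[:2]
    return SUPPORTED_MIN <= tuple(version[:2]) <= SUPPORTED_MAX


if __name__ == "__main__":
    print(sys.version.split()[0], "supported:", is_supported_python())
```

On this machine, Python 3.8.10 sits at the lower edge of that range.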

First I installed the graphics driver. The command below shows that the recommended driver is nvidia-driver-535.

david@david-XH58:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C8Dsv00001558sd000055A1bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP107M [GeForce GTX 1050 Mobile]
manual_install: True
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
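
Picking the recommended entry out of that listing by eye is easy to get wrong, so here is a small sketch that does it in Python. It assumes the output keeps the `driver : <package> - ... recommended` layout shown above; `recommended_driver` is a hypothetical helper name.

```python
import subprocess


def recommended_driver(devices_output):
    """Return the package on the line tagged 'recommended', or None."""
    for line in devices_output.splitlines():
        fields = line.split()
        # Expected layout: driver : nvidia-driver-535 - distro non-free recommended
        if len(fields) >= 3 and fields[0] == "driver" and "recommended" in fields[3:]:
            return fields[2]
    return None


if __name__ == "__main__":
    # Requires the ubuntu-drivers tool, as used in this post.
    result = subprocess.run(
        ["ubuntu-drivers", "devices"], capture_output=True, text=True
    )
    print(recommended_driver(result.stdout))
```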

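The recommended driver can be installed manually, but there is a command that installs it automatically, so I used the command below and then rebooted. The recommended driver was installed correctly as Driver Version: 535.171.04.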
```
sudo ubuntu-drivers autoinstall
reboot
```
NVIDIA driver clean install
With this driver installed, the highest CUDA version it supports is 12.2.

Let's check which CUDA version is actually installed. Running nvcc --version shows that the installed CUDA version is 10.1.

david@david-XH58:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
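
To compare toolkit versions programmatically, the release number can be pulled out of this output. This is a minimal sketch that assumes the `release X.Y` wording shown above; `nvcc_release` is an illustrative name.

```python
import re


def nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc --version."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    line = "Cuda compilation tools, release 10.1, V10.1.243"
    print(nvcc_release(line))
```

Returning a tuple makes comparisons such as `nvcc_release(text) < (11, 8)` straightforward.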

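I decided it would be better to install a newer CUDA release, 12.0, so I reinstalled.
According to NVIDIA's official cuDNN installation page (Package Manager Installation), CUDA has to be installed before cuDNN. So I first set up the Network Repo with the commands below and then installed cudnn9-cuda-12. The cuDNN archive is linked below.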

NVIDIA cuDNN Archive
```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cudnn9-cuda-12
```
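This appears to install cuDNN 9.x, so I installed again. Download the toolkit manually from the CUDA Toolkit 12.0 Downloads page.

Since several CUDA versions are now mixed on the system, register the paths in the .bashrc file.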

export PATH="/usr/local/cuda-12.4/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH"
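
When several toolkits coexist, it helps to see which versioned CUDA directories the search paths actually name. The sketch below only inspects path strings and assumes the usual `/usr/local/cuda-X.Y` directory naming; an unversioned `/usr/local/cuda` entry is deliberately ignored. `cuda_versions_in` is a hypothetical helper.

```python
import os
import re


def cuda_versions_in(path_value):
    """List CUDA versions named in a PATH-style string, in search order."""
    versions = []
    for entry in path_value.split(os.pathsep):
        # Matches directories such as /usr/local/cuda-12.4/bin
        match = re.search(r"/cuda-(\d+(?:\.\d+)?)(?:/|$)", entry)
        if match and match.group(1) not in versions:
            versions.append(match.group(1))
    return versions


if __name__ == "__main__":
    for name in ("PATH", "LD_LIBRARY_PATH"):
        print(name, cuda_versions_in(os.environ.get(name, "")))
```

If either list contains more than one version, the first entry wins, which is why the order of the export lines matters.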

However, the driver version does not seem to be recognized.

david@david-XH58:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.54
david@david-XH58:~$
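
This error usually means the loaded kernel module and the user-space NVML library come from different driver releases. As a rough diagnostic sketch, the two versions can be compared as below. It assumes the kernel module version can be read from `/proc/driver/nvidia/version` on a line starting with `NVRM version:`; the function names are made up for illustration.

```python
import re
from pathlib import Path


def kernel_module_version(proc_text):
    """Pull the first dotted version number from the 'NVRM version:' line."""
    match = re.search(r"NVRM version:.*?(\d+\.\d+(?:\.\d+)?)", proc_text)
    return match.group(1) if match else None


def versions_agree(kernel_ver, library_ver):
    """Compare the leading components that both version strings provide."""
    k = kernel_ver.split(".")
    lib = library_ver.split(".")
    n = min(len(k), len(lib))
    return k[:n] == lib[:n]


if __name__ == "__main__":
    proc_file = Path("/proc/driver/nvidia/version")
    if proc_file.exists():
        loaded = kernel_module_version(proc_file.read_text())
        # "550.54" is the NVML library version from the error message above.
        print("loaded:", loaded, "agrees with NVML:", versions_agree(loaded or "", "550.54"))
```

A mismatch here typically goes away once the machine reboots into the newly installed driver.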

So I purged the leftover packages from the previously installed 535 driver.

sudo apt-get purge $(dpkg -l | grep '^rc' | grep nvidia | awk '{print $2}')
sudo apt-get autoremove

After a reboot, the driver came up again.

Next, install the downloaded cuDNN file.

sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb

Then register the key as follows.

sudo cp /var/cudnn-local-repo-ubuntu2004-8.9.7.29/cudnn-local-30472A84-keyring.gpg /usr/share/keyrings/
sudo apt-get update

After that, install cuDNN with the command below.

david@david-XH58:~/Downloads$ sudo apt-get install libcudnn8 libcudnn8-dev

Then I checked the cuDNN version → the installed cuDNN library is version 8.9.7.

david@david-XH58:/usr/local/cuda-12/include$ grep CUDNN_MAJOR /usr/include/cudnn_version.h -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

To summarize:

cuDNN version

Major version: 8
Minor version: 9
Patch level: 7

Complete version: 8.9.7 (computed CUDNN_VERSION: 8 * 1000 + 9 * 100 + 7 = 8907)
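
The computed value follows directly from the `CUDNN_VERSION` macro in the header. The sketch below reproduces that arithmetic from the header text; it assumes the three `#define` lines appear as in the grep output above, and `cudnn_version` is an illustrative name.

```python
import re


def cudnn_version(header_text):
    """Return ((major, minor, patch), CUDNN_VERSION) from cudnn_version.h text."""
    parts = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(rf"#define\s+CUDNN_{name}\s+(\d+)", header_text)
        if match is None:
            raise ValueError(f"CUDNN_{name} not found")
        parts[name] = int(match.group(1))
    major, minor, patch = parts["MAJOR"], parts["MINOR"], parts["PATCHLEVEL"]
    # Same formula as the macro: MAJOR * 1000 + MINOR * 100 + PATCHLEVEL
    return (major, minor, patch), major * 1000 + minor * 100 + patch


if __name__ == "__main__":
    header = (
        "#define CUDNN_MAJOR 8\n"
        "#define CUDNN_MINOR 9\n"
        "#define CUDNN_PATCHLEVEL 7\n"
    )
    print(cudnn_version(header))
```

For cuDNN 8.9.7 the macro evaluates to 8907.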

CUDA version

CUDA version: 12.4 (shown by both nvidia-smi and nvcc --version)

david@david-XH58:/usr/local/cuda-12/include$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

NVIDIA graphics driver information

Driver version: 550.54.15
GPU model: NVIDIA GeForce GTX 1050

GPU usage: almost idle (Memory Usage: 11MiB / 4096MiB)

david@david-XH58:/usr/local/cuda-12/include$ nvidia-smi
Sat Apr 13 21:15:41 2024

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050        Off |   00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8             N/A / ERR!  |      11MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Next, to install TensorFlow, I followed the official documentation and installed the GPU version.
```
# For GPU users
pip install tensorflow[and-cuda]
# For CPU users
pip install tensorflow
```

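Hmm, the installed CUDA version (12.4) does not seem to be compatible with TensorFlow 2.13.1. The message Could not find cuda drivers on your machine keeps appearing. The build info at the end of the session below shows that this TensorFlow binary was built against CUDA 11.8 and cuDNN 8.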

david@david-XH58:~$ python3
Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-04-13 22:28:19.152102: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-13 22:28:19.187685: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-13 22:28:19.188058: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-13 22:28:19.856671: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> tf.sysconfig.get_build_info() 
OrderedDict([('cpu_compiler', '/usr/lib/llvm-16/bin/clang'), ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '11.8'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])
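
One way to read this output is to compare the CUDA version TensorFlow was built with against the installed toolkit. The sketch below only compares major versions and treats a mismatch as a hint, not proof, of the cause; `cuda_major_matches` is a hypothetical helper, and the dictionary is assumed to have the `cuda_version` key shown above.

```python
def cuda_major_matches(build_info, toolkit_release):
    """Compare TensorFlow's build-time CUDA major version with a toolkit release string."""
    built = str(build_info.get("cuda_version", ""))
    if not built:
        return False  # CPU-only build or key missing
    return built.split(".")[0] == toolkit_release.split(".")[0]


if __name__ == "__main__":
    # Values copied from this post: build info says 11.8, nvcc says 12.4.
    info = {"cuda_version": "11.8", "cudnn_version": "8"}
    print(cuda_major_matches(info, "12.4"))
```

With TensorFlow installed, `tf.sysconfig.get_build_info()` can be passed in directly in place of the hand-written dictionary.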

So I will simply follow the Tested build configurations table and reinstall with the environment below.

| Version | Python version | Compiler | Build tools | cuDNN | CUDA |
| --- | --- | --- | --- | --- | --- |
| tensorflow-2.12.0 | 3.8-3.11 | GCC 9.3.1 | Bazel 5.3.0 | 8.6 | 11.8 |

I removed CUDA, installed CUDA Toolkit 11.8, and set the environment variables again.

export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

david@david-XH58:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Reference: https://tes-b.github.io/ubuntu/tensorflow-ubuntu_2/

Do I have to redo cuDNN as well? → The earlier cuDNN install targeted CUDA 12, so it most likely has to be redone for CUDA 11.
First, delete the existing CUDA 12 directory.
```
sudo rm -rf /usr/local/cuda-12
```
The old cuDNN header file still exists at /usr/include/cudnn_version.h, so I removed the old cuDNN libraries and header files with the commands below.
```
sudo rm /usr/lib/x86_64-linux-gnu/libcudnn*
sudo rm /usr/include/cudnn*.h
```

Download cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz, then extract it and copy the headers and libraries into the CUDA 11.8 directory.

tar -xvf cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz
cd cudnn-linux-x86_64-8.6.0.163_cuda11-archive
sudo cp include/cudnn*.h /usr/local/cuda-11.8/include/
sudo cp lib/libcudnn* /usr/local/cuda-11.8/lib64/
sudo chmod a+r /usr/local/cuda-11.8/include/cudnn*.h /usr/local/cuda-11.8/lib64/libcudnn*
sudo ldconfig

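However, ldconfig reported symbolic link errors, probably because plain cp copies the link targets as regular files instead of preserving the links. I listed the library files to identify the exact version and recreated the symbolic links to match.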

cd /usr/local/cuda-11.8/lib64/
ls libcudnn*
## The listing above shows version 8.6.0
sudo ln -sf libcudnn.so.8.6.0 libcudnn.so.8
sudo ln -sf libcudnn_cnn_train.so.8.6.0 libcudnn_cnn_train.so.8
sudo ln -sf libcudnn_adv_train.so.8.6.0 libcudnn_adv_train.so.8
sudo ln -sf libcudnn_cnn_infer.so.8.6.0 libcudnn_cnn_infer.so.8
sudo ln -sf libcudnn_adv_infer.so.8.6.0 libcudnn_adv_infer.so.8
sudo ln -sf libcudnn_ops_train.so.8.6.0 libcudnn_ops_train.so.8
sudo ln -sf libcudnn_ops_infer.so.8.6.0 libcudnn_ops_infer.so.8

Running ls -l libcudnn* confirms that each symbolic link now points to the right file.

cuDNN also appears to be installed as version 8.6.0.
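
The same check can be done without reading the `ls -l` listing line by line. This sketch maps every `libcudnn*.so.8` symlink in a directory to its target; it assumes the `.so.8` naming used above, and `cudnn_links` is an illustrative helper.

```python
import os
from pathlib import Path


def cudnn_links(lib_dir):
    """Map each libcudnn*.so.8 symlink in lib_dir to the file name it points at."""
    links = {}
    for path in sorted(Path(lib_dir).glob("libcudnn*.so.8")):
        if path.is_symlink():
            links[path.name] = os.readlink(path)
    return links


if __name__ == "__main__":
    # Directory used in this post; adjust if CUDA lives elsewhere.
    for name, target in cudnn_links("/usr/local/cuda-11.8/lib64").items():
        print(name, "->", target)
```

Any `.so.8` name that is missing from the result is either absent or a regular file rather than a link.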

david@david-XH58:/usr/local/cuda-11.8/include$ grep CUDNN_MAJOR cudnn_version.h  -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

I expected to have to reinstall TensorFlow as 2.12.0 as well. → Testing showed that the installed version is tensorflow-2.13.1 and the GPU is recognized, so there is no need to reinstall.

david@david-XH58:/usr/local/cuda-11.8/include$ python3
Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import  tensorflow as tf
2024-04-14 14:33:14.778861: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-14 14:33:19.608301: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.__version__)
2.13.1
>>> print("Available GPUs:", tf.config.list_physical_devices('GPU'))
2024-04-14 14:34:36.745739: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-14 14:34:38.549097: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-14 14:34:38.549865: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
Available GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

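I ran a quick test in Visual Studio Code, and the output looks fine. Memory usage, however, is 3555MiB / 4096MiB ⇒ 86.79%. Will training even work like this?
Fortunately, training ran fine and the speed was not bad either. I can probably just run it here instead of on Colab.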
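
The high memory figure is probably not a sign of trouble: by default TensorFlow reserves most of the GPU memory up front. If that becomes a problem on a 4 GB card, memory growth can be enabled so that memory is allocated on demand. The sketch below is one way to do that; the wrapper name `enable_memory_growth` is hypothetical, and returning 0 when TensorFlow is not importable is a convenience added for this example.

```python
def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand; return how many GPUs were configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed in this interpreter
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is first used, otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs configured:", enable_memory_growth())
```

Call it at the very top of the training script, before any tensors or models are created.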
