Setting up a Kubeflow environment on Ubuntu

Seungsoo Lee · July 20, 2023


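Note that this installation is carried out on Ubuntu.

1. Installing Docker

Before you start, check the installation guide in the official Docker documentation.

- Setting up the apt repository

First, update the apt package index and install the packages needed to use a repository over HTTPS.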

sudo apt-get update
sudo apt-get install ca-certificates curl gnupg

Next, add Docker's official GPG key.

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

Finally, set up the repository with the command below.

echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
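
The one-liner above packs two detected values into a single apt source line. As a rough sketch of what it produces, the helper below builds an equivalent line from those two inputs. The function name `docker_repo_line` is hypothetical (it is not part of Docker's tooling), and it only prints the line without touching the system.

```shell
# Hypothetical helper: print the apt source line for the Docker repository.
#   $1 = CPU architecture, as reported by `dpkg --print-architecture`
#   $2 = Ubuntu codename, i.e. VERSION_CODENAME from /etc/os-release
docker_repo_line() {
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' \
    "$1" "$2"
}

# Example with fixed values; on a real machine, pass the detected values instead.
docker_repo_line amd64 jammy
```

Seeing the line spelled out makes it easier to spot a wrong codename or architecture in /etc/apt/sources.list.d/docker.list when `apt-get update` later complains.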

- Installing Docker Engine

First, update the apt package index.

sudo apt-get update

Then install Docker Engine, containerd, and Docker Compose.

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Write the daemon.json file that configures the Docker daemon.

sudo vi /etc/docker/daemon.json
# Contents to write to daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
# Save and quit (:wq)

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo systemctl daemon-reload
sudo systemctl restart docker
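
If you would rather not edit the file by hand in vi, the same contents can be written non-interactively. This is only a sketch: the function name `print_daemon_json` is made up for illustration, and the JSON is copied from the daemon.json contents above.

```shell
# Hypothetical helper: print the daemon.json contents from this post to stdout.
print_daemon_json() {
  cat <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
}

# Usage (needs root, so it is left commented out here):
#   print_daemon_json | sudo tee /etc/docker/daemon.json > /dev/null
#   sudo systemctl restart docker
```

Piping through `sudo tee` avoids the common problem where `sudo echo ... > file` fails because the redirection runs as the unprivileged user.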

2. Installing nvidia-docker

Before you start, check the installation guide in the official nvidia-docker documentation.

- Setting up the apt repository and adding the GPG key

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

- Installing the nvidia-container-toolkit package

First, update the apt package index.

sudo apt-get update

Then install the nvidia-container-toolkit package.

sudo apt-get install -y nvidia-container-toolkit

- Configuring the Docker daemon to recognize the NVIDIA container runtime

sudo nvidia-ctk runtime configure --runtime=docker

- Restarting the Docker daemon after setting the default runtime

sudo systemctl restart docker

At this point, you can run a base CUDA container to test that the setup works:

sudo docker run --runtime=nvidia --rm nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

3. Installing Kubernetes

Before you start, check the installation guide in the official Kubernetes documentation.

- Setting up the apt repository

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/

First, update the apt package index.

sudo apt-get update
# These packages should already be installed from the earlier steps.
# sudo apt-get install -y apt-transport-https ca-certificates curl

Add the GPG key from Google Cloud.

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-archive-keyring.gpg

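Finally, set up the repository with the command below. This legacy apt.kubernetes.io repository was current when this post was written in July 2023; it has since been retired in favor of pkgs.k8s.io, so check the install-kubeadm page linked above for the current commands.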

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list

- Installing the kubelet, kubeadm, and kubectl packages

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

- Letting iptables firewall rules process bridged network traffic

sudo sysctl net.bridge.bridge-nf-call-iptables=1
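
The `sysctl` command above changes only the running kernel, so the setting is lost on reboot, and the key exists only while the `br_netfilter` kernel module is loaded. A minimal sketch for making both persistent follows. The function names are hypothetical, and the `k8s.conf` file names in the usage comments are just a convention; any file name under those directories should work.

```shell
# Hypothetical helpers: print the lines to persist, one file each.
print_k8s_modules() {
  # Kernel module that provides the net.bridge.* sysctl keys.
  printf 'br_netfilter\n'
}

print_k8s_sysctl() {
  # Same setting as the sysctl command above, in sysctl.d file syntax.
  printf 'net.bridge.bridge-nf-call-iptables = 1\n'
}

# Usage (needs root, so it is left commented out here):
#   print_k8s_modules | sudo tee /etc/modules-load.d/k8s.conf > /dev/null
#   sudo modprobe br_netfilter
#   print_k8s_sysctl | sudo tee /etc/sysctl.d/k8s.conf > /dev/null
#   sudo sysctl --system
```

`sysctl --system` reloads every file under /etc/sysctl.d, so the setting takes effect immediately as well as after the next boot.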

- Creating a cluster with kubeadm

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

kubeadm init --pod-network-cidr=172.16.0.0/16

I hit an error at this step.

root@gpu:/etc/docker# kubeadm init
[init] Using Kubernetes version: v1.27.3
[preflight] Running pre-flight checks
	[WARNING Swap]: swap is enabled; production deployments should disable swap unless testing the NodeSwap feature gate of the kubelet
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR CRI]: container runtime is not running: output: time="2023-07-19T12:09:44Z" level=fatal msg="validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

If you run into the same error, follow the steps below (https://forum.linuxfoundation.org/discussion/862825/kubeadm-init-error-cri-v1-runtime-api-is-not-implemented).
These commands must be run as root.

echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

apt remove containerd

apt update
apt install containerd.io

rm /etc/containerd/config.toml

systemctl restart containerd
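
One likely reason deleting config.toml helps (an assumption on my part, not something verified in this post) is that the config.toml shipped with Docker's containerd.io package lists `cri` under `disabled_plugins`, which leaves kubeadm without a CRI endpoint to talk to. The sketch below checks a config file for that setting before you delete it; `cri_disabled` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given containerd config file lists
# "cri" on its disabled_plugins line.
cri_disabled() {
  grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$1"
}

# Usage:
#   if cri_disabled /etc/containerd/config.toml; then
#     echo "CRI plugin is disabled in containerd"
#   fi
```

If the check succeeds, removing the file (as above) makes containerd fall back to its built-in defaults, where the CRI plugin is enabled.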

That seemed to almost fix it, but then another error appeared:

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.

Unfortunately, an error has occurred:
	timed out waiting for the condition

This error is likely caused by:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
	- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
	Once you have found the failing container, you can inspect its logs with:
	- 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

Since this was a different error, I followed this answer (https://stackoverflow.com/a/52196985) and ran:

sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
kubeadm reset
kubeadm init

After doing this, the problem was resolved.
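
To confirm that swap is really off, you can look at /proc/swaps, whose first line is a header and whose remaining lines list active swap areas. The helper below is a small sketch of that check; the name `swap_active` is hypothetical, and it takes the file path as an argument so it can also be tried on a sample file.

```shell
# Hypothetical helper: succeed if the given /proc/swaps-style file lists at
# least one swap area, i.e. it has any line after the header.
swap_active() {
  awk 'NR > 1 { found = 1 } END { exit !found }' "$1"
}

# Usage:
#   if swap_active /proc/swaps; then
#     echo "swap is still enabled"
#   else
#     echo "swap is off"
#   fi
```

Running this again after a reboot shows whether the /etc/fstab edit took effect, since `swapoff -a` alone lasts only until the next boot.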

Anyway, let's continue.

If you are a regular user rather than root, run these commands (they also appear in the kubeadm init output).

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

If you are the root user, run the following command instead.

export KUBECONFIG=/etc/kubernetes/admin.conf

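Allowing scheduling on the master node

By default the master (control-plane) node is tainted so that ordinary pods are not scheduled on it. On a single-node cluster, remove the taint. Kubernetes v1.27.3, the version shown in the log above, names this taint control-plane; the older node-role.kubernetes.io/master name applies only to releases before v1.25.

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

I ran into yet another error while running this command:
The connection to the server 192.168.45.26:6443 was refused - did you specify the right host or port?
This error appeared every time I ran a command. It seemed to be because the runtime endpoint was not configured, so I created and saved the following file.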

sudo vi /etc/crictl.yaml
# Enter the following, then save
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: true
# Save and quit (:wq)

After this, the error was resolved.

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml
