[K8S] How to Diagnose Pod Failures

bradley · February 4, 2023

Symptom


If a Pod stays in an Init:~ status, or its status shows an error (for example Init:Error or Init:CrashLoopBackOff), an Init Container has failed to start.


Root Cause Analysis


Kubernetes uses the Pod as its smallest deployable unit, and each Pod contains one or more Containers.
These Containers can be grouped into the following types.

  • Init Container : a Container that does its work at startup and then exits
  • Runtime Container : a Container that handles the actual workload
  • SideCar Container : a Container that plays a supporting role

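An Init Container is an initialization container that runs before the Pod's Runtime Containers start.
A Pod Status such as Init:1/2 means that 1 of the 2 Init Containers has completed successfully.
If an Init Container fails, the kubelet by default restarts it repeatedly until it succeeds (with restartPolicy: Never, the Pod is marked as failed instead).
In other words, if the Init:~ status persists and the Status indicates an error, you need to find out which Container in the Pod is failing.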
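The status values described above can be read mechanically. The following is a small sketch of a helper that interprets a STATUS value copied from `kubectl get pods`. The function name `init_progress` and its output wording are my own assumptions, not part of kubectl.

```shell
# Hypothetical helper: interpret a Pod STATUS value such as "Init:1/2".
init_progress() {
  status=$1
  case "$status" in
    Init:[0-9]*/[0-9]*)
      counts=${status#Init:}       # strip the "Init:" prefix, leaving "1/2"
      completed=${counts%%/*}      # part before the slash
      total=${counts##*/}          # part after the slash
      echo "$completed of $total init containers completed"
      ;;
    Init:*)
      # e.g. Init:Error or Init:CrashLoopBackOff
      echo "init container problem: ${status#Init:}"
      ;;
    *)
      echo "not in an init phase: $status"
      ;;
  esac
}
```

For example, `init_progress Init:CrashLoopBackOff` reports an init container problem, which is the cue to inspect the Pod with the steps that follow.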

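Checking the Pod Log

First, look at the log of the failing Pod.
The error message says that the scheduler Container is waiting to start and that the Pod is in the PodInitializing state. It appears to be waiting because an earlier step failed.
The Pod log alone, however, only gets you so far.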

kubectl logs <pod_id>

Inspecting the Pod Details

Let's look at the Pod in more detail.

kubectl describe pod <pod_id>

The command prints details like the following.

Name:             airflow-scheduler-5c9d5d7d69-qssbr
Namespace:        airflow
Priority:         0
Service Account:  airflow-scheduler
Node:             airflow-cluster-worker3/172.18.0.4
Start Time:       Fri, 03 Feb 2023 00:45:35 +0900
Labels:           component=scheduler
                  pod-template-hash=5c9d5d7d69
                  release=airflow
                  tier=airflow
Annotations:      checksum/airflow-config: 7c087ba34ba46da1bc27e008d659d87d9afe6d39dccd0b7ddcf7287caa66e105
                  checksum/extra-configmaps: 2e44e493035e2f6a255d08f8104087ff10d30aef6f63176f1b18f75f73295598
                  checksum/extra-secrets: bb91ef06ddc31c0c5a29973832163d8b0b597812a793ef911d33b622bc9d1655
                  checksum/metadata-secret: dcbb26b06a9d686bf5fedceff6d4024447053fded58a37271cdfef14f8c8c800
                  checksum/pgbouncer-config-secret: da52bd1edfe820f0ddfacdebb20a4cc6407d296ee45bcb500a6407e2261a5ba2
                  checksum/result-backend-secret: 74e3e99feee51248d44224665d60fab543dd6b25ba95f04e6fcb0e5758342056
                  cluster-autoscaler.kubernetes.io/safe-to-evict: true
Status:           Pending
IP:               10.244.1.4
IPs:
  IP:           10.244.1.4
Controlled By:  ReplicaSet/airflow-scheduler-5c9d5d7d69
Init Containers:
  wait-for-airflow-migrations:
    Container ID:  containerd://55cbaa8d8a05cf937488701aa144959c0997f2a7ae0983a003cb4580e431f612
    Image:         airflow-custom:1.0.0
    Image ID:      docker.io/library/import-2023-02-02@sha256:f3854eb3d766f2b7814942d41403e064d9b61674a76ea7f8945a8b42c77c1308
    Port:          <none>
    Host Port:     <none>
    Args:
      airflow
      db
      check-migrations
      --migration-wait-timeout=60
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 03 Feb 2023 14:18:50 +0900
      Finished:     Fri, 03 Feb 2023 14:19:49 +0900
    Ready:          True
    Restart Count:  1
    Environment Variables from:
      airflow-variables  ConfigMap  Optional: false
    Environment:
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>                      Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'>  Optional: false
      AIRFLOW__CELERY__BROKER_URL:          <set to the key 'connection' in secret 'airflow-broker-url'>                      Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4w45 (ro)
  git-sync-init:
    Container ID:   containerd://13a385b88ec5c5549f6f057c53e2758c9894606a7ab455457ed9c4c0a51a3683
    Image:          k8s.gcr.io/git-sync/git-sync:v3.4.0
    Image ID:       k8s.gcr.io/git-sync/git-sync@sha256:a470676e946f1060815f89dadad4f2c3e4f9d1ab36a46f4423e00f44170fc80c
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 03 Feb 2023 14:41:00 +0900
      Finished:     Fri, 03 Feb 2023 14:41:00 +0900
    Ready:          False
    Restart Count:  9
    Environment:
      GIT_SSH_KEY_FILE:            /etc/git-secret/ssh
      GIT_SYNC_SSH:                true
      GIT_KNOWN_HOSTS:             false
      GIT_SYNC_REV:                HEAD
      GIT_SYNC_BRANCH:             main
      GIT_SYNC_REPO:               ssh://git@github.com:jeongseok912/airflow_dags.git
      GIT_SYNC_DEPTH:              1
      GIT_SYNC_ROOT:               /git
      GIT_SYNC_DEST:               repo
      GIT_SYNC_ADD_USER:           true
      GIT_SYNC_WAIT:               60
      GIT_SYNC_MAX_SYNC_FAILURES:  0
      GIT_SYNC_ONE_TIME:           true
    Mounts:
      /etc/git-secret/ssh from git-sync-ssh-key (ro,path="gitSshKey")
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4w45 (ro)
Containers:
  scheduler:
    Container ID:
    Image:         airflow-custom:1.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      -c
      exec airflow scheduler
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       exec [sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname $(hostname)
] delay=10s timeout=20s period=60s #success=1 #failure=5
    Environment Variables from:
      airflow-variables  ConfigMap  Optional: false
    Environment:
      AIRFLOW__CORE__FERNET_KEY:            <set to the key 'fernet-key' in secret 'airflow-fernet-key'>                      Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:      <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:              <set to the key 'connection' in secret 'airflow-airflow-metadata'>                Optional: false
      AIRFLOW__WEBSERVER__SECRET_KEY:       <set to the key 'webserver-secret-key' in secret 'airflow-webserver-secret-key'>  Optional: false
      AIRFLOW__CELERY__BROKER_URL:          <set to the key 'connection' in secret 'airflow-broker-url'>                      Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/config/airflow_local_settings.py from config (ro,path="airflow_local_settings.py")
      /opt/airflow/dags from dags (ro)
      /opt/airflow/logs from logs (rw)
      /opt/airflow/pod_templates/pod_template_file.yaml from config (ro,path="pod_template_file.yaml")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4w45 (ro)
  git-sync:
    Container ID:
    Image:          k8s.gcr.io/git-sync/git-sync:v3.4.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      GIT_SSH_KEY_FILE:            /etc/git-secret/ssh
      GIT_SYNC_SSH:                true
      GIT_KNOWN_HOSTS:             false
      GIT_SYNC_REV:                HEAD
      GIT_SYNC_BRANCH:             main
      GIT_SYNC_REPO:               ssh://git@github.com:jeongseok912/airflow_dags.git
      GIT_SYNC_DEPTH:              1
      GIT_SYNC_ROOT:               /git
      GIT_SYNC_DEST:               repo
      GIT_SYNC_ADD_USER:           true
      GIT_SYNC_WAIT:               60
      GIT_SYNC_MAX_SYNC_FAILURES:  0
    Mounts:
      /etc/git-secret/ssh from git-sync-ssh-key (ro,path="gitSshKey")
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4w45 (ro)
  scheduler-log-groomer:
    Container ID:
    Image:         airflow-custom:1.0.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      /clean-logs
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      AIRFLOW__LOG_RETENTION_DAYS:  15
    Mounts:
      /opt/airflow/logs from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j4w45 (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      airflow-airflow-config
    Optional:  false
  dags:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  git-sync-ssh-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  airflow-ssh-git-secret
    Optional:    false
  logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-j4w45:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                  From     Message
  ----     ------          ----                 ----     -------
  Normal   Pulled          13h                  kubelet  Successfully pulled image "k8s.gcr.io/git-sync/git-sync:v3.4.0" in 1m4.951487006s
  Normal   Started         13h (x4 over 13h)    kubelet  Started container git-sync-init
  Normal   Created         13h (x5 over 13h)    kubelet  Created container git-sync-init
  Normal   Pulled          13h (x4 over 13h)    kubelet  Container image "k8s.gcr.io/git-sync/git-sync:v3.4.0" already present on machine
  Warning  BackOff         12h (x271 over 13h)  kubelet  Back-off restarting failed container
  Normal   Pulled          26m                  kubelet  Container image "airflow-custom:1.0.0" already present on machine
  Normal   SandboxChanged  26m                  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Created         26m                  kubelet  Created container wait-for-airflow-migrations
  Normal   Started         26m                  kubelet  Started container wait-for-airflow-migrations
  Normal   Started         24m (x4 over 25m)    kubelet  Started container git-sync-init
  Normal   Pulled          23m (x5 over 25m)    kubelet  Container image "k8s.gcr.io/git-sync/git-sync:v3.4.0" already present on machine
  Normal   Created         23m (x5 over 25m)    kubelet  Created container git-sync-init
  Warning  BackOff         68s (x111 over 25m)  kubelet  Back-off restarting failed container

The output has an Init Containers section and a Containers section, each listing the following Containers.

  • Init Containers : wait-for-airflow-migrations, git-sync-init
  • Containers : scheduler, git-sync, scheduler-log-groomer

Looking at the State and Reason of each Container, the git-sync-init Container shows Waiting/CrashLoopBackOff with Ready: False and Restart Count: 9.
Because that Init Container never completes, the scheduler Container remains in Waiting/PodInitializing.
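Scanning a long describe dump by eye is error-prone, so here is a rough sketch of a filter that lists every Container in the Waiting state together with its Reason. The function name `waiting_containers` is hypothetical, and the filter assumes the indentation layout of the output shown above (container names at two spaces, State at four, Reason at six), which may differ between kubectl versions.

```shell
# Hypothetical filter: read `kubectl describe pod` output on stdin and
# print "<container>: <reason>" for each container whose State is Waiting.
waiting_containers() {
  awk '
    # A line such as "  git-sync-init:" starts a new container (or volume) block.
    /^  [^ ][^ ]*:$/ { name = $1; sub(/:$/, "", name) }
    # "    State:" is the current state; "    Last State:" does not match here.
    /^    State:/ { waiting = ($2 == "Waiting") }
    # The first Reason after a Waiting state belongs to the current state.
    /^      Reason:/ && waiting { print name ": " $2; waiting = 0 }
  '
}
```

Used as `kubectl describe pod <pod_id> | waiting_containers`, it should single out git-sync-init with CrashLoopBackOff ahead of the Containers that are merely stuck in PodInitializing.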

Checking the Container Log of the Pod

Querying the log of the failing Container in the failing Pod reveals the cause of the error.

kubectl logs <pod_id> -c <container_id>
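A Container in CrashLoopBackOff may be between restarts when you query it, in which case the log of the previous instance is often the more useful one. The sketch below wraps the command above and adds kubectl's `--previous` flag. The wrapper name `container_logs` is hypothetical. It assumes the Pod is in the current namespace (add `-n <namespace>` otherwise).

```shell
# Hypothetical wrapper: show the log of one container in a Pod, first for
# the current instance and then for the previously terminated instance.
container_logs() {
  pod=$1
  container=$2
  echo "== current instance =="
  kubectl logs "$pod" -c "$container"
  echo "== previous instance =="
  # --previous prints the log of the last terminated instance, if one exists.
  kubectl logs "$pod" -c "$container" --previous
}
```

For the Pod in this post the call would look like `container_logs airflow-scheduler-5c9d5d7d69-qssbr git-sync-init`.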



Resolution


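Cloning the repo from GitHub over SSH worked, so the SSH key was registered correctly.

Something still looked suspicious.
When cloning over SSH you normally do not prefix the URL with ssh://, so could the ssh:// prefix that I had added, following the guide and the comments, be the problem?
Sure enough, it was.
After removing ssh:// and redeploying, the Pod Status returned to Running and the DAGs on GitHub synced normally.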
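A likely explanation, which the logs in this post do not confirm, is the URL syntax itself. Git documents two SSH forms: ssh://[user@]host[:port]/path and the scp-like [user@]host:path. In the ssh:// form a colon after the host introduces a port number, so ssh://git@github.com:jeongseok912/airflow_dags.git mixes the two. The sketch below classifies a URL along those lines. The function name and its output labels are my own.

```shell
# Hypothetical checker: classify a Git SSH remote URL by its syntax.
classify_git_ssh_url() {
  url=$1
  case "$url" in
    ssh://*)
      rest=${url#ssh://}        # e.g. git@github.com:owner/repo.git
      authority=${rest%%/*}     # everything before the first "/"
      case "$authority" in
        *:*)
          port=${authority##*:} # text after the last ":" should be a port
          case "$port" in
            *[!0-9]*) echo "malformed" ;;  # contains a non-digit, so not a port
            *)        echo "ssh-url" ;;
          esac
          ;;
        *) echo "ssh-url" ;;
      esac
      ;;
    *@*:*) echo "scp-like" ;;
    *)     echo "other" ;;
  esac
}
```

Under this check the original GIT_SYNC_REPO value is reported as malformed, while both git@github.com:jeongseok912/airflow_dags.git and ssh://git@github.com/jeongseok912/airflow_dags.git pass.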



References


Pod Status

https://kubernetes.io/ko/docs/tasks/debug/debug-application/debug-init-containers/


2 comments

April 12, 2023

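Thank you, this helped. I'm not sure it is the actual cause, but when I connected to a repo that uses master I had included "ssh://", and I ran into the problem after creating a new repo on main and setting it up. It was all the more confusing because "ssh://" had worked before.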
