ceph HEALTH_WARN n daemons have recently crashed

김건호 · January 26, 2023

0. Overview

0.1. Error situation

0.1.1. Abnormal restart counts of the ceph osd pods

The ceph osd pods were found to have restarted an abnormally large number of times; note the RESTARTS column for rook-ceph-osd-1 (1463) and rook-ceph-osd-3 (3172) in the listing below.
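Assuming the default rook-ceph namespace used by Rook (the namespace is not shown in the original output), the listing was presumably produced with something like:

kubectl get pods -n rook-ceph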

NAME                                                    READY   STATUS          RESTARTS
csi-cephfsplugin-fkqc5                                  3/3     Running         0
csi-cephfsplugin-h2r9z                                  3/3     Running         2
csi-cephfsplugin-provisioner-6c6bc95d4b-kjl8f           6/6     Running         16
csi-cephfsplugin-provisioner-6c6bc95d4b-rxk6d           6/6     Running         0
csi-cephfsplugin-qn4jp                                  3/3     Running         3
csi-rbdplugin-hknnk                                     3/3     Running         0
csi-rbdplugin-kk6m6                                     3/3     Running         3
csi-rbdplugin-provisioner-6ff4d774d4-6bwn5              6/6     Running         0
csi-rbdplugin-provisioner-6ff4d774d4-vttsx              6/6     Running         17
csi-rbdplugin-sq8ws                                     3/3     Running         1
rook-ceph-crashcollector-k8snode01ps-7bf76fb968-6r6j7   1/1     Running         0
rook-ceph-crashcollector-k8snode02ps-6564549d8f-vxxrj   1/1     Running         0
rook-ceph-crashcollector-k8snode03ps-7b5c88744-j97dh    1/1     Running         0
rook-ceph-mds-myfs-a-69cbd955cc-zg6mc                   1/1     Running         0
rook-ceph-mds-myfs-b-67b559dd77-56rlq                   1/1     Running         0
rook-ceph-mgr-a-5b98749d8-sx8hl                         1/1     Running         0
rook-ceph-mon-a-54dc9758cc-b74gc                        1/1     Running         1
rook-ceph-operator-6df54ddc6b-9zgwf                     1/1     Running         0
rook-ceph-osd-0-7cbb8c7b86-x27n5                        1/1     Running         0
rook-ceph-osd-1-6976f8fb4b-s8v8b                        1/1     Running         1463
rook-ceph-osd-2-74cf858d5c-94jtp                        1/1     Running         0
rook-ceph-osd-3-566dbbb44d-tk6xq                        1/1     Running         3172
rook-ceph-osd-prepare-devshic01ps-xpq5d                 0/1     Completed       0
rook-ceph-osd-prepare-devshic02ps-6n95c                 0/1     Completed       0
rook-ceph-tools-676879fd44-slsrn                        1/1     Running         0
rook-discover-2xbf7                                     1/1     Running         0
rook-discover-9hxqw                                     1/1     Running         1
rook-discover-ldvks                                     1/1     Running         0

0.1.2. Checking ceph status

kubectl exec -it -n rook-ceph rook-ceph-tools-676879fd44-slsrn -- /bin/bash
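The rook-ceph-tools pod is Rook's toolbox. The status check can also be run non-interactively against the same toolbox pod from the listing above (again assuming the rook-ceph namespace), for example:

kubectl exec -n rook-ceph rook-ceph-tools-676879fd44-slsrn -- ceph -s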

Inside the toolbox, the cluster status showed HEALTH_WARN.

# ceph -s
  cluster:
    id:     a7
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 1 daemons, quorum a (age 4d)
    mgr: a(active, since 2w)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 4d), 4 in (since 4d)

  task status:
    scrub status:
        mds.myfs-a: idle
        mds.myfs-b: idle

  data:
    pools:   4 pools, 97 pgs
    objects: 7.93k objects, 19 GiB
    usage:   42 GiB used, 3.9 TiB / 3.9 TiB avail
    pgs:     97 active+clean

  io:
    client:   16 KiB/s rd, 117 KiB/s wr, 3 op/s rd, 3 op/s wr

1. Resolving the error

Ceph's crash module collects information about daemon crashdumps and stores it in the Ceph cluster.
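A few standard ceph crash subcommands (run from the toolbox) help inspect what the module has collected; per the Ceph documentation, the "recently crashed" warning covers crashes from roughly the last two weeks by default (the mgr/crash/warn_recent_interval option):

ceph crash ls-new   # list only crash reports that have not been archived yet
ceph crash stat     # summary of saved crash reports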

1.1. Listing crashed daemons

# ceph crash ls
ID                                                                ENTITY  NEW
2023-01-02T00:31:54.551063Z_bfde5f0b-3510-4c38-831f-e43833fc3f79  mon.a    *
2023-01-20T04:20:42.360745Z_8e8f9804-2d7a-45f6-859f-cdd6bf4c7e28  mon.a    *
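Before archiving, a report can be inspected in full; for example, for the newer entry above:

# ceph crash info 2023-01-20T04:20:42.360745Z_8e8f9804-2d7a-45f6-859f-cdd6bf4c7e28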

1.2. Archiving the crash report

# ceph crash archive 2023-01-20T04:20:42.360745Z_8e8f9804-2d7a-45f6-859f-cdd6bf4c7e28
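Archiving only this entry is enough to clear the warning here, presumably because the January 2 report already falls outside the default two-week "recent" window. If many daemons had crashed, the crash module also offers bulk variants:

ceph crash archive-all   # mark every new crash report as archived
ceph crash prune 30      # discard saved reports older than 30 days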

1.3. Confirming the NEW flag is cleared

# ceph crash ls
ID                                                                ENTITY  NEW
2023-01-02T00:31:54.551063Z_bfde5f0b-3510-4c38-831f-e43833fc3f79  mon.a    *
2023-01-20T04:20:42.360745Z_8e8f9804-2d7a-45f6-859f-cdd6bf4c7e28  mon.a

1.4. Checking ceph health

# ceph health
HEALTH_OK

---

# ceph -s
  cluster:
    id:     a7
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 4d)
    mgr: a(active, since 2w)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 4d), 4 in (since 4d)

  task status:
    scrub status:
        mds.myfs-a: idle
        mds.myfs-b: idle

  data:
    pools:   4 pools, 97 pgs
    objects: 7.93k objects, 19 GiB
    usage:   42 GiB used, 3.9 TiB / 3.9 TiB avail
    pgs:     97 active+clean

  io:
    client:   17 KiB/s rd, 137 KiB/s wr, 3 op/s rd, 8 op/s wr
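As an alternative, if a crash is understood and you only want to silence the warning for a while instead of archiving it, the health check can be muted; assuming a Ceph release that names this check RECENT_CRASH (as current releases do):

ceph health mute RECENT_CRASH 4h   # hide the warning for four hours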

