Slack app management -> search for "webhook" -> select Incoming WebHooks -> click Add to Slack
-> select a channel -> save the settings
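Before wiring the URL into Alertmanager, the webhook can be tested by hand with curl; a minimal sketch, where the URL below is a placeholder for the one issued in the step above:

# Send a test payload straight to the incoming webhook (URL is a placeholder)
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Alertmanager webhook test"}' \
  https://hooks.slack.com/services/XXXX/XXXX/XXXX

Slack answers with a plain "ok" and the test message shows up in the selected channel. Next, configure alertmanager.yml: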
global:
  slack_api_url: "https://hooks.slack.com/services/~~~~~~~~"   # put the Slack incoming webhook URL here

route:
  receiver: 'Slack id'
  group_by: ['alertTest']
  group_wait: 30s
  group_interval: 5s
  repeat_interval: 3h
  routes:
    - receiver: 'Slack id'
      group_wait: 10s
      match_re:
        service: dev

receivers:
  - name: 'Slack id'
    slack_configs:
      - channel: "alert_test"   # Slack channel name or channel ID
        username: "Gouter De Roi"
        title: 'Emergency Emergency!!'                # notification title
        text: "summary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"   # notification body text

templates:
  - './slack.tmpl'   # Slack notification template file
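A quick way to catch indentation or key errors in this file before touching the container is amtool, which ships with Alertmanager; a sketch, where the in-container path is an assumption based on the stock prom/alertmanager image:

# Validate alertmanager.yml locally
amtool check-config alertmanager.yml
# Or inside the running container (path is an assumption for the default image)
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

The slack.tmpl file referenced in the templates section defines the custom title and message used for notifications: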
{{ define "__single_message_title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} @ {{ .Annotations.identifier }}{{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} @ {{ .Annotations.identifier }}{{ end }}{{ end }}
{{ define "custom_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}{{ template "__single_message_title" . }}{{ end }}{{ end }}
{{ define "custom_slack_message" }}
{{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}
{{ range .Alerts.Firing }}{{ .Annotations.description }}{{ end }}{{ range .Alerts.Resolved }}{{ .Annotations.description }}{{ end }}
{{ else }}
{{ if gt (len .Alerts.Firing) 0 }}
*Alerts Firing:*
{{ range .Alerts.Firing }}- {{ .Annotations.identifier }}: {{ .Annotations.description }}
{{ end }}{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
*Alerts Resolved:*
{{ range .Alerts.Resolved }}- {{ .Annotations.identifier }}: {{ .Annotations.description }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
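The receiver above hard-codes title and text, so the templates defined in slack.tmpl are not actually referenced yet. To use them, the slack_configs entry can point at the named templates instead; a sketch mirroring the receiver shown earlier:

receivers:
  - name: 'Slack id'
    slack_configs:
      - channel: "alert_test"
        username: "Gouter De Roi"
        title: '{{ template "custom_title" . }}'          # rendered by slack.tmpl
        text: '{{ template "custom_slack_message" . }}'   # rendered by slack.tmpl

Next, define the alert rules for Prometheus: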
groups:
  - name: alert.rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Endpoint {{ $labels.instance }}"
          identifier: "{{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostMemoryUnderMemoryPressure
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host memory under memory pressure (instance {{ $labels.instance }})"
          description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # The same rule using "node_filesystem_free_bytes" will fire when the disk fills for non-root users.
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
global:
  scrape_interval: 60s
  evaluation_interval: 60s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "alert_rules.yml"
...................
...................
................... (add the settings above below the existing contents of prometheus.yml)
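promtool can also validate the edited prometheus.yml, including that the referenced alert_rules.yml parses; a sketch:

# Validate prometheus.yml and the rule files it references
promtool check config prometheus.yml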
The five alert rules defined in alert_rules.yml:
1. InstanceDown
2. HostOutOfMemory
3. HostMemoryUnderMemoryPressure
4. HostOutOfDiskSpace
5. HostHighCpuLoad
docker restart alertmanager
docker restart prometheus
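Once both containers are back up, their health endpoints give a quick sanity check; a sketch assuming the default ports (9090 for Prometheus, 9093 for Alertmanager as configured above):

# Both components expose /-/healthy and /-/ready
curl -s localhost:9090/-/healthy
curl -s localhost:9093/-/healthy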
After the restart, the same five rules should appear on the Prometheus Alerts page:
1. InstanceDown
2. HostOutOfMemory
3. HostMemoryUnderMemoryPressure
4. HostOutOfDiskSpace
5. HostHighCpuLoad
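The loaded rule names can also be pulled from the Prometheus HTTP API; a sketch, assuming Prometheus listens on 9090 and jq is installed:

# List the rule names Prometheus has loaded
curl -s localhost:9090/api/v1/rules | jq -r '.data.groups[].rules[].name'

To trigger a real alert, stop node_exporter on a monitored host: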
systemctl stop node_exporter
curl localhost:9100/metrics   # check the metrics endpoint; it should no longer respond
Check that the InstanceDown alert arrives in the Slack channel.
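While waiting for the Slack message, the alert's progress can also be followed from the Prometheus API; a sketch:

# InstanceDown should appear as pending, then firing once the 1m "for" window passes
curl -s localhost:9090/api/v1/alerts | jq '.data.alerts[] | {name: .labels.alertname, state: .state}'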