Prometheus Alerts

Prometheus Guidebook

Jun 10, 2024

Contents

Overview
Architecture
Configuring Alerts
Configuring AlertManager
Example - Alert
  Writing a Prometheus Recording Rule
  Configuring a Prometheus Alert
Example - AlertManager
  Configuring Prometheus Rules
  Configuring Prometheus AlertManager
  Configuring Prometheus AlertManager Templates

This guidebook was written while following the Udemy course Prometheus | The Complete Hands-On for Monitoring & Alerting by A to Z Mentors.

For the guidebook's full table of contents and index, see the Prometheus Guidebook: Introduction page.

Overview

Prometheus lets you define conditions as PromQL expressions. It evaluates them continuously, and when a condition is met, it becomes an alert.
e.g.
Free node memory should not be less than 10%
CPU load should not be more than 95%
Alerting rules are written in YAML format.
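
The two example conditions above could be sketched as alerting rules like the following. This is only an illustration: the group and alert names are hypothetical, the metric names assume node_exporter running on Linux, and "CPU load" is interpreted here as average CPU utilization per instance.

```yaml
groups:
  - name: overview-examples        # hypothetical group name
    rules:
    # Fires when less than 10% of a node's memory is available.
    - alert: NodeMemoryLow         # hypothetical alert name
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

    # Fires when average CPU usage across a node's cores exceeds 95%.
    - alert: NodeCpuHigh           # hypothetical alert name
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
```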

Architecture

Prometheus
AlertManager
PagerDuty, Email, Slack

[Call flow]

Prometheus → AlertManager : Alerts fired
AlertManager → PagerDuty, Email, Slack : Notifications
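
For Prometheus to send fired alerts to AlertManager, its own configuration has to load the rule file and point at the AlertManager instance. A minimal sketch of that part of prometheus.yml, assuming the rules are saved as rules.yml and AlertManager runs locally on its default port (both are assumptions, not taken from the course):

```yaml
# prometheus.yml (excerpt)
rule_files:
  - rules.yml                      # assumed file name for the recording and alerting rules

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093       # assumed address; 9093 is AlertManager's default port
```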

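Configuring Alerts

Alerts can be defined in the same YAML file that holds the Prometheus recording rules.

Configuring AlertManager

AlertManager is a tool that receives the firing alerts from all Prometheus servers and converts them into notifications.

It can deduplicate and group alerts.
It routes them to the correct receiver integration, such as PagerDuty or OpsGenie.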

“AlertManager is a tool that takes the firing alerts from all Prometheus servers and converts them into notifications. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie.”
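
Grouping and routing are both configured on the route tree of the AlertManager configuration. The sketch below is a hedged illustration rather than the course's setup: the receiver names, the timing values, and the use of a severity label for routing are assumptions, and the matchers syntax assumes AlertManager v0.22 or later.

```yaml
route:
  receiver: default                # fallback receiver (hypothetical name)
  group_by: ['alertname', 'job']   # alerts sharing these labels go out as one notification
  group_wait: 30s                  # wait before sending the first notification for a new group
  group_interval: 5m               # wait before notifying about new alerts added to a group
  repeat_interval: 4h              # wait before re-sending a notification that is still firing
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall             # hypothetical receiver name

receivers:
  - name: default                  # no integrations configured; placeholder only
  - name: oncall                   # no integrations configured; placeholder only
```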

You can set it up from the resources listed under the AlertManager tab of Prometheus (Docs) | Installation.

Example - Alert

Writing a Prometheus Recording Rule
Configuring a Prometheus Alert

Writing a Prometheus Recording Rule

Write a Prometheus recording rule as follows.

groups:
  - name: my-rules
    rules:
    - record: job:node_cpu_seconds:avg_idle
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Configuring a Prometheus Alert

Configure a Prometheus alert as follows.

# ...

    - alert: NodeExporterDown
      expr: up{job="node_exporter"} == 0

Configuring an alert that fires only after its condition has held for a set period

# ...

    - alert: NodeExporterDown
      expr: up{job="node_exporter"} == 0
      for: 1m
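
One way to observe the effect of for is a rule unit test run with promtool test rules. The sketch below assumes the rules are saved as rules.yml next to the test file; the file name, the instance label value, and the sample series are illustrative assumptions.

```yaml
# rules_test.yml (hypothetical file name)
rule_files:
  - rules.yml                      # assumed name of the rule file containing NodeExporterDown

evaluation_interval: 15s

tests:
  - interval: 15s                  # spacing between the sample values below
    input_series:
      # The target is up for the first two samples, then down from 30s onward.
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # 30s after the target went down: still pending, so no firing alert is expected.
      - eval_time: 1m
        alertname: NodeExporterDown
        exp_alerts: []
      # The condition has now held for longer than 1m, so the alert is expected to fire.
      - eval_time: 2m
        alertname: NodeExporterDown
        exp_alerts:
          - exp_labels:
              job: node_exporter
              instance: localhost:9100
```

Without the for clause, the alert would be expected to fire at the first evaluation where up is 0.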

From here, you can add a few more options to strengthen the recording rules and alerts.

groups:
  - name: my-rules
    rules:
    - record: job:node_cpu_seconds:avg_idle
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

    - alert: NodeExporterDown
      expr: up{job="node_exporter"} == 0
      for: 1m

    - record: job:app_response_latency_seconds:rate1m
      expr: rate(app_response_latency_seconds_sum[1m]) / rate(app_response_latency_seconds_count[1m])

    - alert: AppLatencyAbove5sec
      expr: job:app_response_latency_seconds:rate1m >= 5
      for: 2m
      labels:
        severity: critical

    - alert: AppLatencyAbove2sec
      expr: 2 < job:app_response_latency_seconds:rate1m < 5
      for: 2m
      labels:
        severity: warning

Example - AlertManager

This example continues from Example - Alert.

Configuring Prometheus Rules

groups:
  - name: my-rules
    rules:
    - record: job:node_cpu_seconds:avg_idle
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

    - alert: NodeExporterDown
      expr: up{job="node_exporter"} == 0
      for: 1m

    - record: job:app_response_latency_seconds:rate1m
      expr: rate(app_response_latency_seconds_sum[1m]) / rate(app_response_latency_seconds_count[1m])

    - alert: AppLatencyAbove5sec
      expr: job:app_response_latency_seconds:rate1m >= 5
      for: 2m
      labels:
        severity: critical

    - alert: AppLatencyAbove2sec
      expr: 2 < job:app_response_latency_seconds:rate1m < 5
      for: 2m
      labels:
        severity: warning

Configuring Prometheus AlertManager

route:
  receiver: admin                  # every alert goes to the admin receiver

receivers:
- name: admin
  email_configs:
  - to: 'unchaptered@gmail.com'
    from: 'example@gmail.com'
    smarthost: smtp.gmail.com:587  # Gmail SMTP server (STARTTLS port)

    # SMTP credentials
    auth_username: 'example@example.com'
    auth_identity: 'example@example.com'
    auth_password: '<PASSWORD>'
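
Slack, which appears as a notification target in the architecture section, is configured as another receiver type. A minimal sketch, assuming an incoming-webhook URL is available; the receiver name, the channel, and the placeholder URL are hypothetical, and the route would need to reference this receiver for it to be used.

```yaml
receivers:
- name: slack-admin                # hypothetical receiver name
  slack_configs:
  - api_url: '<SLACK_WEBHOOK_URL>' # placeholder for an incoming-webhook URL
    channel: '#alerts'             # hypothetical channel
    send_resolved: true            # also notify when the alert clears
```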

Configuring Prometheus AlertManager Templates

In the Prometheus rules written above, update AppLatencyAbove5sec as follows.

# ...

    - alert: AppLatencyAbove5sec
      expr: job:app_response_latency_seconds:rate1m >= 5
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: 'Python app latency is over 5 seconds'
        description: 'app latency {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more tha
        app_link: 'http://localhost:8000'
# ...
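
The annotations set on the rule (summary, description, app_link) become available to AlertManager's notification templates. The following sketch shows one possible way to surface them in the email receiver. The subject and body layout are assumptions, and the SMTP settings are omitted here on the assumption that they are set as in the AlertManager configuration above.

```yaml
receivers:
- name: admin
  email_configs:
  - to: 'unchaptered@gmail.com'
    # ... smarthost, from, and auth settings as configured earlier
    headers:
      # Status is "firing" or "resolved"; CommonAnnotations holds annotations shared by all grouped alerts.
      Subject: '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
    # Plain-text body listing each alert's description and link.
    text: |-
      {{ range .Alerts }}
      {{ .Annotations.description }}
      Link: {{ .Annotations.app_link }}
      {{ end }}
```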