Prometheus Alerting in Practice: Routing Tree for Alerts

이민석
Jun 11, 2024

This Prometheus guidebook was written while following the Udemy course "Prometheus | The Complete Hands-On for Monitoring & Alerting" by A to Z Mentors.

For the full table of contents and index, see the Prometheus Guidebook introduction page.

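Overview

Earlier in this guidebook we configured Prometheus alerting to send an email when a specific condition occurs.
Once the number of alerts grows into the tens or hundreds, however, a single administrator ends up receiving and triaging an overwhelming volume of mail. We therefore need a way to route alerts to different recipients.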

Classifying Metrics

For this example, Prometheus metrics are divided into two broad groups.

  • Star Solution (e.g. Linux, Windows)

  • PEC Technologies (e.g. Python, Golang)

Classifying Alertmanager Receivers

Building on the metric groups above, alerts can be routed to the individual teams and roles below.

  • Star Solution (e.g. Linux, Windows)

    • Linux Team (e.g. Linux OS)

      • Linux Manager (e.g. Critical Metric)

      • Linux Team Lead (e.g. Warning Metric)

    • Windows Team (e.g. Windows OS)

      • Windows Manager (e.g. Critical Metric)

      • Windows Team Lead (e.g. Warning Metric)

  • PEC Technologies (e.g. Python, Golang)

    • Python Team (e.g. Python apps)

      • Python Manager (e.g. Critical Metric)

      • Python Team Lead (e.g. Warning Metric)

    • Golang Team (e.g. Golang apps)

      • Golang Manager (e.g. Critical Metric)

      • Golang Team Lead (e.g. Warning Metric)

Recording Rule : web-rules

  • File name: ./rules/web-rules.yaml

  • Example file

groups:
  - name: python-app-rules
    rules:

    # Python Application Alerts

    - record: job:app_request_latency_seconds:rate1m
      expr: rate(app_response_latency_seconds_sum[1m]) / rate(app_response_latency_seconds_count[1m])

    - alert: AppLatencyAbove2sec
      expr: 2 < job:app_request_latency_seconds:rate1m < 5
      for: 2m
      labels:
        severity: warning
        app_type: python
      annotations:
        summary: 'Python app latency is going high'
        description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
        app_link: 'http://localhost:8000/'

    - alert: AppLatencyAbove5sec
      expr: job:app_request_latency_seconds:rate1m >= 5
      for: 2m
      labels:
        severity: critical
        app_type: python
      annotations:
        summary: 'Python app latency is over 5 seconds.'
        description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
        app_link: 'http://localhost:8000/'


  - name: go-app-rules
    rules:

    # Go Application Alerts

    - record: job:go_app_request_latency_seconds:rate1m
      # Average latency = rate of the _sum series divided by rate of the _count series
      expr: rate(go_app_response_latency_seconds_sum[1m]) / rate(go_app_response_latency_seconds_count[1m])

    - alert: GoAppLatencyAbove2sec
      expr: 2 < job:go_app_request_latency_seconds:rate1m < 5
      for: 2m
      labels:
        severity: warning
        app_type: go
      annotations:
        summary: 'Go app latency is unusual'
        description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
        app_link: 'http://localhost:8000/'

    - alert: GoAppLatencyAbove5sec
      expr: job:go_app_request_latency_seconds:rate1m >= 5
      for: 2m
      labels:
        severity: critical
        app_type: go
      annotations:
        summary: 'Go app latency is over 5 seconds.'
        description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
        app_link: 'http://localhost:8000/'

Recording Rule : linux-rules

  • File name: ./rules/linux-rules.yaml

  • Example file

# stress tool to increase CPU usage
# stress -c 1 -v --timeout 100s

# stress-ng to increase memory usage
# stress-ng --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.9;}' < /proc/meminfo)k --vm-keep -m 1

# combined 
# stress-ng -c 1 -v --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k --vm-keep -m 1 --timeout 300s

# to increase disk usage space
# fallocate -l 30G file

groups:
  - name: linux-rules
    rules:

    - alert: NodeExporterDown
      expr: up{job="node_exporter"} == 0
      for: 2m
      labels:
        severity: critical
        app_type: linux
        category: server
      annotations:
        summary: "Node Exporter is down"
        description: "Node Exporter is down for more than 2 minutes"

    - record: job:node_memory_Mem_bytes:available
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      
    - alert: NodeMemoryUsageAbove60%
      expr: 60 < (100 - job:node_memory_Mem_bytes:available) < 75
      for: 2m
      labels:
        severity: warning
        app_type: linux
        category: memory
      annotations:
        summary: "Node memory usage is going high"
        description: "Node memory for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: NodeMemoryUsageAbove75%
      expr: (100 - job:node_memory_Mem_bytes:available) >= 75
      for: 2m
      labels:
        severity: critical
        app_type: linux
        category: memory
      annotations:
        summary: "Node memory usage is very HIGH"
        description: "Node memory for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: NodeCPUUsageHigh
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
      for: 2m
      labels:
        severity: critical
        app_type: linux
        category: cpu
      annotations:
        summary: "Node CPU usage is HIGH"
        description: "CPU load for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: NodeCPU_0_High
      expr: 100 - (avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle", cpu="0"}[1m])) * 100) > 80
      for: 2m
      labels:
        severity: critical
        app_type: linux
        category: cpu
      annotations:
        summary: "Node CPU_0 usage is HIGH"
        description: "CPU_0 load for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: NodeCPU_1_High
      expr: 100 - (avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle", cpu="1"}[1m])) * 100) > 80
      for: 2m
      labels:
        severity: critical
        app_type: linux
        category: cpu
      annotations:
        summary: "Node CPU_1 usage is HIGH"
        description: "CPU_1 load for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: NodeFreeDiskSpaceLess30%
      expr: (sum by (instance) (node_filesystem_free_bytes) / sum by (instance) (node_filesystem_size_bytes)) * 100 < 30
      for: 2m
      labels:
        severity: warning
        app_type: linux
        category: disk
      annotations:
        summary: "Node free disk space is running out"
        description: "Node disk is almost full (< 30% left)\n  Current free disk space is {{ $value }}%"

Recording Rule : window-rules

  • File name: ./rules/window-rules.yaml

  • Example file

groups:
  - name: windows-rules
    rules:

    - alert: WMIExporterDown
      expr: up{job="wmi_exporter"} == 0
      for: 2m
      labels:
        severity: critical
        app_type: windows
        category: server
      annotations:
        summary: "WMI Exporter is down"
        description: "WMI Exporter is down for more than 2 minutes"

    - record: job:wmi_physical_memory_bytes:free
      expr: (wmi_os_physical_memory_free_bytes / wmi_cs_physical_memory_bytes) * 100

    - alert: WindowsMemoryUsageAbove60%
      expr: 60 < (100 - job:wmi_physical_memory_bytes:free) < 75
      for: 2m
      labels:
        severity: warning
        app_type: windows
        category: memory
      annotations:
        summary: "Windows memory usage is going high"
        description: "Windows memory usage for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: WindowsMemoryUsageAbove75%
      expr: (100 - job:wmi_physical_memory_bytes:free) >= 75
      for: 2m
      labels:
        severity: critical
        app_type: windows
        category: memory
      annotations:
        summary: "Windows memory usage is HIGH"
        description: "Windows memory usage for instance {{ $labels.instance }} has reached {{ $value }}%"

    - alert: WindowsCPUUsageHigh
      expr: 100 - (avg by (instance) (rate(wmi_cpu_time_total{mode="idle"}[1m])) * 100) > 80
      for: 2m
      labels:
        severity: warning
        app_type: windows
        category: cpu
      annotations:
        summary: "Windows CPU usage is HIGH"
        description: "CPU load for instance {{ $labels.instance }} has reached {{ $value }}"

    - alert: WindowsDiskSpaceUsageAbove80%
      expr: 100 - ((wmi_logical_disk_free_bytes / wmi_logical_disk_size_bytes) * 100) > 80
      for: 2m
      labels:
        severity: error
        app_type: windows
        category: disk
      annotations:
        summary: "Windows disk space usage is HIGH"
        description: "Windows disk usage is more than 80% with value = {{ $value }}"

Editing prometheus.yaml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "rules/linux-rules.yaml"
  - "rules/window-rules.yaml"
  - "rules/web-rules.yaml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ['localhost:9090']

# ...
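
The alert rules above select targets by job name (up{job="node_exporter"} and up{job="wmi_exporter"}), so the elided part of scrape_configs needs jobs with exactly those names. A minimal sketch of what that section might look like; the target hosts are assumptions made for illustration (the ports are the exporters' usual defaults), not values taken from the course:

```yaml
scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ['localhost:9090']

  # Job name must match up{job="node_exporter"} in linux-rules.yaml.
  - job_name: "node_exporter"
    static_configs:
    - targets: ['localhost:9100']      # assumed host; 9100 is the node_exporter default port

  # Job name must match up{job="wmi_exporter"} in window-rules.yaml.
  - job_name: "wmi_exporter"
    static_configs:
    - targets: ['windows-host:9182']   # hypothetical host; 9182 is the exporter default port
```

If a job name here differs from the one used in the rule expression, the "exporter down" alert can never fire, because the up series it looks for does not exist.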

Basic Alertmanager configuration

route:
  receiver: admin

receivers:
- name: admin
  email_configs:
  - to: "unchaptered@gmail.com"
    from: "example@gmail.com"
    smarthost: smtp.gmail.com:587
    auth_username: "example@gmail.com"
    auth_identity: "example@gmail.com"
    auth_password: "example_password"

Star Solution, Route Tree

This is the Alertmanager route tree configuration for Star Solution (e.g. Linux, Windows).

global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'example_password'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
  - match_re:
      app_type: (linux|windows)
    # fallback receiver 
    receiver: ss-admin

receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'

- name: ss-admin
  email_configs:
  - to: 'example@gmail.com'
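
Since Alertmanager 0.22, match and match_re are deprecated in favor of a single matchers list. The fragment below is a sketch of the same Star Solution route written with matchers, assuming a version that supports it; the receivers and global sections stay as in the configuration above:

```yaml
route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions: =~ is a regular-expression match, anchored to the whole value
  - matchers:
    - app_type =~ "linux|windows"
    receiver: ss-admin
```

An exact match is written the same way with = instead of =~, for example severity = "critical".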

Star Solution - Linux & Windows Team, Route Tree

This is the Alertmanager route tree configuration for the individual Linux and Windows teams under the Star Solution route tree.

global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'example_password'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
  - match_re:
      app_type: (linux|windows)
    # fallback receiver 
    receiver: ss-admin
    routes:
    # Linux team
    - match:
        app_type: linux
      # fallback receiver
      receiver: linux-team-admin
      routes:
      - match:
          severity: critical
        receiver: linux-team-manager
      - match:
          severity: warning
        receiver: linux-team-lead

    # Windows team
    - match:
        app_type: windows
      # fallback receiver
      receiver: windows-team-admin
      routes:
      - match:
          severity: critical
        receiver: windows-team-manager
      - match:
          severity: warning
        receiver: windows-team-lead


receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'

- name: ss-admin
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-manager
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-manager
  email_configs:
  - to: 'example@gmail.com'

PEC Technologies, Route Tree

This is the Alertmanager route tree configuration for PEC Technologies (e.g. Python, Golang).

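global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'abcdefghijkl'

route:
  # fallback receiver
  receiver: admin
  routes:
    # PEC Technologies.
  - match_re:
      app_type: (python|go)
    receiver: pec-admin
    routes:
    # Python team
    - match:
        app_type: python
      receiver: python-team-admin
      routes:
      - match:
          severity: critical
        receiver: python-team-manager
      - match:
          severity: warning
        receiver: python-team-lead
    # Go team
    - match:
        app_type: go
      receiver: go-team-admin
      routes:
      - match:
          severity: critical
        receiver: go-team-manager
      - match:
          severity: warning
        receiver: go-team-lead

# Every receiver referenced in the route tree must be defined here.
receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'
- name: pec-admin
  email_configs:
  - to: 'example@gmail.com'
- name: python-team-admin
  email_configs:
  - to: 'example@gmail.com'
- name: python-team-manager
  email_configs:
  - to: 'example@gmail.com'
- name: python-team-lead
  email_configs:
  - to: 'example@gmail.com'
- name: go-team-admin
  email_configs:
  - to: 'example@gmail.com'
- name: go-team-manager
  email_configs:
  - to: 'example@gmail.com'
- name: go-team-lead
  email_configs:
  - to: 'example@gmail.com'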

Prometheus Routing Tree Editor

The combined configuration below can be pasted into the Prometheus Routing Tree Editor to visualize the whole tree.

global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'example_password'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
  - match_re:
      app_type: (linux|windows)
    # fallback receiver 
    receiver: ss-admin
    routes:
    # Linux team
    - match:
        app_type: linux
      # fallback receiver
      receiver: linux-team-admin
      routes:
      - match:
          severity: critical
        receiver: linux-team-manager
      - match:
          severity: warning
        receiver: linux-team-lead

    # Windows team
    - match:
        app_type: windows
      # fallback receiver
      receiver: windows-team-admin
      routes:
      - match:
          severity: critical
        receiver: windows-team-manager
      - match:
          severity: warning
        receiver: windows-team-lead

    # PEC Technologies.
  - match_re:
      app_type: (python|go)
    # fallback receiver 
    receiver: pec-admin
    routes:
    # Python team
    - match:
        app_type: python
      # fallback receiver
      receiver: python-team-admin
      routes:
      - match:
          severity: critical
        receiver: python-team-manager
      - match:
          severity: warning
        receiver: python-team-lead

    # Go team
    - match:
        app_type: go
      # fallback receiver
      receiver: go-team-admin
      routes:
      - match:
          severity: critical
        receiver: go-team-manager
      - match:
          severity: warning
        receiver: go-team-lead


receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'

- name: ss-admin
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: linux-team-manager
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: windows-team-manager
  email_configs:
  - to: 'example@gmail.com'

- name: pec-admin
  email_configs:
  - to: 'example@gmail.com'

- name: python-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: python-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: python-team-manager
  email_configs:
  - to: 'example@gmail.com'

- name: go-team-admin
  email_configs:
  - to: 'example@gmail.com'

- name: go-team-lead
  email_configs:
  - to: 'example@gmail.com'

- name: go-team-manager
  email_configs:
  - to: 'example@gmail.com'


 

Testing Prometheus and Alertmanager

  1. Start Prometheus.

    cd <Prometheus Dir>
    
    ./prometheus
  2. Start Alertmanager.

    cd <Prometheus AlertManager Dir>
    
    ./alertmanager
  3. Install stress-ng.
    To trigger the CPU and memory alerts defined above, such as the rule below, you can use a dedicated load-generation tool.

        - alert: NodeMemoryUsageAbove75%
          expr: (100 - job:node_memory_Mem_bytes:available) >= 75

    That tool is stress-ng, which can sharply raise CPU and memory usage on the server where it runs.

    sudo apt-get install stress-ng
  4. Use stress-ng to put load on CPU and memory.

    stress-ng -c 2 -v --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k --vm-keep -m 1 --timeout 300s
    1. -c 2: runs 2 CPU stressors.

    2. -v: enables verbose output.

    3. --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k:

      • The awk command computes 85% of the currently available memory.

      • It reads the MemAvailable value (in kB) from /proc/meminfo, multiplies it by 0.85, and passes the result to stress-ng.

      • For example, if 1,000,000 KB is available, 850,000 KB is used.

    4. --vm-keep: keeps the allocated memory mapped and keeps re-writing it instead of repeatedly freeing and reallocating it.

    5. -m 1: runs 1 virtual memory stressor.

    6. --timeout 300s: runs the test for 300 seconds (5 minutes).

  5. Use fallocate to allocate a file on the file system and consume disk space.

    fallocate -l 15G temp_file
    1. -l 15G: sets the size of the file to 15 GiB. The -l option specifies the file length.

    2. temp_file: the name of the file to create. Delete it after the test to free the space.

Throttling & Repetition

group_wait, group_interval, and repeat_interval limit how often notifications are sent.

  • group_wait

    How long to wait before sending the first notification for a new group of alerts, so that other alerts belonging to the same group can be batched into it.

    default = 30 seconds

  • group_interval

    How long to wait before sending a notification about new alerts added to a group for which an initial notification has already been sent.

    default = 5 minutes

  • repeat_interval

    How long to wait before re-sending a notification for an alert that has already been notified and is still firing.

    default = 4 hours
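
As a sketch of where these settings live: they are set on a route, and child routes inherit them unless they override them. The group_by labels and the shorter repeat_interval for critical alerts below are assumptions chosen for illustration; the other values simply restate the defaults. The global section is omitted.

```yaml
route:
  receiver: admin
  group_by: ['alertname', 'app_type']   # assumed grouping labels
  group_wait: 30s        # wait before the first notification of a new group
  group_interval: 5m     # wait before notifying about alerts added to the group
  repeat_interval: 4h    # wait before repeating a notification that was already sent
  routes:
  - match:
      severity: critical
    receiver: admin
    repeat_interval: 1h  # hypothetical override: remind more often for critical alerts

receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'
```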

Inhibit Rules

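Alertmanager provides inhibit rules, which suppress notifications for less severe alerts while a more severe alert is already firing. In the example below, a firing critical alert mutes warning alerts that carry the same app_type and category label values.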

# ...
   
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['app_type', 'category']

# ...
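
One caveat about the rule above: because equal lists only app_type and category, a critical alert on one server also mutes matching warnings on every other server. The fragment below is a possible variant, not the course's configuration. It adds instance to equal (an assumption that the relevant alerts carry an instance label) and uses the source_matchers / target_matchers form that replaces source_match / target_match from Alertmanager 0.22 onward:

```yaml
inhibit_rules:
- source_matchers:
  - severity = critical
  target_matchers:
  - severity = warning
  # instance is an added assumption: it limits the inhibition to alerts
  # coming from the same target.
  equal: ['app_type', 'category', 'instance']
```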

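Silence & Continue

A silence temporarily mutes specific alerts: notifications for alerts that match its matchers are suppressed for a fixed time window.
Silences are not part of the Alertmanager configuration file; they are created at runtime through the Alertmanager web UI, amtool, or the HTTP API. The fields of a silence look like this (written as YAML for readability only):

matchers:
  - name: alertname
    value: NodeCPUUsageHigh
    isRegex: false
startsAt: '2024-06-12T00:00:00Z'
endsAt: '2024-06-13T00:00:00Z'
createdBy: 'admin'
comment: 'Silencing high CPU usage alerts for maintenance'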

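By default, an alert stops at the first sibling route it matches. Setting continue: true on a route makes Alertmanager keep evaluating the following sibling routes even after that route has matched, so a single alert can be delivered to several receivers.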
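
A minimal sketch of continue in a route tree. The audit receiver is hypothetical and not part of the configurations above; the global SMTP section is omitted and assumed to be the same as before.

```yaml
route:
  receiver: admin
  routes:
  # Hypothetical route: copy every critical alert to an audit mailbox,
  # then keep evaluating the sibling routes below.
  - match:
      severity: critical
    receiver: audit
    continue: true
  # Without continue on the route above, critical Linux/Windows alerts
  # would stop there and never reach this route.
  - match_re:
      app_type: (linux|windows)
    receiver: ss-admin

receivers:
- name: admin
  email_configs:
  - to: 'example@gmail.com'
- name: audit
  email_configs:
  - to: 'example@gmail.com'
- name: ss-admin
  email_configs:
  - to: 'example@gmail.com'
```

With this tree, a critical alert with app_type linux is sent to both audit and ss-admin, while a warning alert with app_type linux is sent to ss-admin only.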
