This Prometheus guidebook was written after taking A to Z Metnros (Udemy) — Prometheus | The Complete Hands-On for Monitoring & Alerting.
For the full table of contents and index of this guidebook, see the Prometheus Guidebook — Introduction page.
Overview
We have already seen how to configure a Prometheus alert that sends an email in a specific situation.
However, once the number of alerts grows into the tens or hundreds, the administrator (or person on call) is flooded with email and has to triage all of it by hand. We therefore need a way to route alerts to different destinations.
Classifying Metrics
The Prometheus metrics in this setup can be split into two broad groups:
Star Solution (e.g. Linux, Windows)
PEC Technologies (e.g. Python, Golang)
Classifying AlertManager Routes
On top of the metric classification above, alerts can be routed further down to the individual development teams:
Star Solution (e.g. Linux, Windows)
  Linux Team (e.g. Linux OS)
    Linux Manager (e.g. Critical Metric)
    Linux Team Lead (e.g. Warning Metric)
  Windows Team (e.g. Windows OS)
    Windows Manager (e.g. Critical Metric)
    Windows Team Lead (e.g. Warning Metric)
PEC Technologies (e.g. Python, Golang)
  Python Team (e.g. Python App)
    Python Manager (e.g. Critical Metric)
    Python Team Lead (e.g. Warning Metric)
  Golang Team (e.g. Golang App)
    Golang Manager (e.g. Critical Metric)
    Golang Team Lead (e.g. Warning Metric)
Recording Rule : web-rules
Filename : ./rules/web-rules.yaml
Example file
groups:
  - name: python-app-rules
    rules:
      # Python Application Alerts
      - record: job:app_request_latency_seconds:rate1m
        expr: rate(app_response_latency_seconds_sum[1m]) / rate(app_response_latency_seconds_count[1m])
      - alert: AppLatencyAbove2sec
        expr: 2 < job:app_request_latency_seconds:rate1m < 5
        for: 2m
        labels:
          severity: warning
          app_type: python
        annotations:
          summary: 'Python app latency is going high'
          description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
          app_link: 'http://localhost:8000/'
      - alert: AppLatencyAbove5sec
        expr: job:app_request_latency_seconds:rate1m >= 5
        for: 2m
        labels:
          severity: critical
          app_type: python
        annotations:
          summary: 'Python app latency is over 5 seconds.'
          description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
          app_link: 'http://localhost:8000/'
  - name: go-app-rules
    rules:
      # Go Application Alerts
      # assumes the Go app exposes a summary with _sum/_count, mirroring the Python rule above
      - record: job:go_app_request_latency_seconds:rate1m
        expr: rate(go_app_response_latency_seconds_sum[1m]) / rate(go_app_response_latency_seconds_count[1m])
      - alert: GoAppLatencyAbove2sec
        expr: 2 < job:go_app_request_latency_seconds:rate1m < 5
        for: 2m
        labels:
          severity: warning
          app_type: go
        annotations:
          summary: 'Go app latency is unusual'
          description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
          app_link: 'http://localhost:8000/'
      - alert: GoAppLatencyAbove5sec
        expr: job:go_app_request_latency_seconds:rate1m >= 5
        for: 2m
        labels:
          severity: critical
          app_type: go
        annotations:
          summary: 'Go app latency is over 5 seconds.'
          description: 'App latency of instance {{ $labels.instance }} of job {{ $labels.job }} is {{ $value }} for more than 2 minutes.'
          app_link: 'http://localhost:8000/'
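Before wiring this file into Prometheus, it is worth validating it. A minimal check, assuming promtool sits next to the prometheus binary and the file lives at the path above:

# validate rule file syntax and expressions
./promtool check rules ./rules/web-rules.yaml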
Recording Rule : linux-rules
Filename : ./rules/linux-rules.yaml
Example file
# stress tool to increase CPU usage
# stress -c 1 -v -timeout 100s
# stress-ng to increase memory usage
# stress-ng --vm-bytes $(awk '/MemFree/{printf "%d\n", $2 * 0.9;}' < /proc/meminfo)k --vm-keep -m 1
# combined
# stress-ng -c 1 -v --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k --vm-keep -m 1 --timeout 300s
# to increase disk usage space
# fallocate -l 30G file
groups:
  - name: linux-rules
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
          app_type: linux
          category: server
        annotations:
          summary: "Node Exporter is down"
          description: "Node Exporter is down for more than 2 minutes"
      - record: job:node_memory_Mem_bytes:available
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      - alert: NodeMemoryUsageAbove60%
        expr: 60 < (100 - job:node_memory_Mem_bytes:available) < 75
        for: 2m
        labels:
          severity: warning
          app_type: linux
          category: memory
        annotations:
          summary: "Node memory usage is going high"
          description: "Node memory for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: NodeMemoryUsageAbove75%
        expr: (100 - job:node_memory_Mem_bytes:available) >= 75
        for: 2m
        labels:
          severity: critical
          app_type: linux
          category: memory
        annotations:
          summary: "Node memory usage is very HIGH"
          description: "Node memory for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: NodeCPUUsageHigh
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
        for: 2m
        labels:
          severity: critical
          app_type: linux
          category: cpu
        annotations:
          summary: "Node CPU usage is HIGH"
          description: "CPU load for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: NodeCPU_0_High
        expr: 100 - (avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle", cpu="0"}[1m])) * 100) > 80
        for: 2m
        labels:
          severity: critical
          app_type: linux
          category: cpu
        annotations:
          summary: "Node CPU_0 usage is HIGH"
          description: "CPU_0 load for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: NodeCPU_1_High
        expr: 100 - (avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle", cpu="1"}[1m])) * 100) > 80
        for: 2m
        labels:
          severity: critical
          app_type: linux
          category: cpu
        annotations:
          summary: "Node CPU_1 usage is HIGH"
          description: "CPU_1 load for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: NodeFreeDiskSpaceLess30%
        expr: (sum by (instance) (node_filesystem_free_bytes) / sum by (instance) (node_filesystem_size_bytes)) * 100 < 30
        for: 2m
        labels:
          severity: warning
          app_type: linux
          category: disk
        annotations:
          summary: "Node free disk space is running out"
          description: "Node disk is almost full (< 30% left)\n Current free disk space is {{ $value }}"
Recording Rule : window-rules
Filename : ./rules/window-rules.yaml
Example file
groups:
  - name: windows-rules
    rules:
      - alert: WMIExporterDown
        expr: up{job="wmi_exporter"} == 0
        for: 2m
        labels:
          severity: critical
          app_type: windows
          category: server
        annotations:
          summary: "WMI Exporter is down"
          description: "WMI Exporter is down for more than 2 minutes"
      - record: job:wmi_physical_memory_bytes:free
        expr: (wmi_os_physical_memory_free_bytes / wmi_cs_physical_memory_bytes) * 100
      - alert: WindowsMemoryUsageAbove60%
        expr: 60 < (100 - job:wmi_physical_memory_bytes:free) < 75
        for: 2m
        labels:
          severity: warning
          app_type: windows
          category: memory
        annotations:
          summary: "Windows memory usage is going high"
          description: "Windows memory usage for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: WindowsMemoryUsageAbove75%
        expr: (100 - job:wmi_physical_memory_bytes:free) >= 75
        for: 2m
        labels:
          severity: critical
          app_type: windows
          category: memory
        annotations:
          summary: "Windows memory usage is HIGH"
          description: "Windows memory usage for instance {{ $labels.instance }} has reached {{ $value }}%"
      - alert: WindowsCPUUsageHigh
        expr: 100 - (avg by (instance) (rate(wmi_cpu_time_total{mode="idle"}[1m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
          app_type: windows
          category: cpu
        annotations:
          summary: "Windows CPU usage is HIGH"
          description: "CPU load for instance {{ $labels.instance }} has reached {{ $value }}"
      - alert: WindowsDiskSpaceUsageAbove80%
        expr: 100 - ((wmi_logical_disk_free_bytes / wmi_logical_disk_size_bytes) * 100) > 80
        for: 2m
        labels:
          severity: error
          app_type: windows
          category: disk
        annotations:
          summary: "Windows disk space usage is HIGH"
          description: "Windows disk usage is more than 80% with value = {{ $value }}"
Editing the prometheus.yaml file
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "rules/linux-rules.yaml"
  - "rules/window-rules.yaml"
  - "rules/web-rules.yaml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ['localhost:9090']
  # ...
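After editing, the whole configuration (including the referenced rule files) can be validated before restarting Prometheus. A minimal sketch, assuming the file is named prometheus.yaml in the current directory:

./promtool check config prometheus.yaml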
Setting up a basic AlertManager
The simplest alertmanager.yml sends every alert to a single admin receiver:
route:
  receiver: admin

receivers:
  - name: admin
    email_configs:
      - to: "unchaptered@gmail.com"
        from: "example@gmail.com"
        smarthost: smtp.gmail.com:587
        auth_username: "example@gmail.com"
        auth_identity: "example@gmail.com"
        auth_password: "example_password"
Star Solution, Route Tree
This is the AlertManager route tree configuration for Star Solution (e.g. Linux, Windows).
global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'fqkvkumorgaqgkat'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
    - match_re:
        app_type: (linux|windows)
      # fallback receiver
      receiver: ss-admin

receivers:
  - name: admin
    email_configs:
      - to: 'example@gmail.com'
  - name: ss-admin
    email_configs:
      - to: 'example@gmail.com'
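To confirm that a labelled alert would land on the intended receiver, amtool can walk the routing tree for you. A sketch, assuming the configuration above is saved as alertmanager.yml:

# expected output: ss-admin
./amtool config routes test --config.file=alertmanager.yml app_type=linux
# a label that matches no child route falls back to: admin
./amtool config routes test --config.file=alertmanager.yml app_type=java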
Star Solution - Linux&Window Team, Route Tree
Star Solution, Route Tree 하위의 개별 Linux & Window Team을 위한 AlertManager Route Tree 설정법입니다.
global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'fqkvkumorgaqgkat'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
    - match_re:
        app_type: (linux|windows)
      # fallback receiver
      receiver: ss-admin
      routes:
        # Linux team
        - match:
            app_type: linux
          # fallback receiver
          receiver: linux-team-admin
          routes:
            - match:
                severity: critical
              receiver: linux-team-manager
            - match:
                severity: warning
              receiver: linux-team-lead
        # Windows team
        - match:
            app_type: windows
          # fallback receiver
          receiver: windows-team-admin
          routes:
            - match:
                severity: critical
              receiver: windows-team-manager
            - match:
                severity: warning
              receiver: windows-team-lead

receivers:
  - name: admin
    email_configs:
      - to: 'example@gmail.com'
  - name: ss-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-manager
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-manager
    email_configs:
      - to: 'example@gmail.com'
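You can also push a synthetic alert straight into AlertManager to watch the routing end to end. A rough sketch against the v2 API, assuming AlertManager is listening on localhost:9093 (the alertname TestLinuxAlert is made up for this test):

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestLinuxAlert","app_type":"linux","severity":"critical"},"annotations":{"summary":"routing test"}}]'
# with the tree above, this alert should be delivered to linux-team-manager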
PEC Technologies, Route Tree
This is the AlertManager route tree configuration for PEC Technologies (e.g. Python, Golang).
global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'abcdefghijkl'

route:
  receiver: admin
  routes:
    - match_re:
        app_type: (python|go)
      receiver: pec-admin
      routes:
        - match:
            app_type: python
          receiver: python-team-admin
          routes:
            - match:
                severity: critical
              receiver: python-team-manager
            - match:
                severity: warning
              receiver: python-team-lead
        - match:
            app_type: go
          receiver: go-team-admin
          routes:
            - match:
                severity: critical
              receiver: go-team-manager
            - match:
                severity: warning
              receiver: go-team-lead

# receivers omitted for brevity; see the full configuration below.
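If you prefer the terminal over a web tool, amtool can also print the resulting routing tree. A sketch, assuming the configuration is saved as alertmanager.yml:

./amtool config routes show --config.file=alertmanager.yml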
Prometheus Routing Tree Editor
You can visualize the routing tree with the Prometheus Routing Tree Editor (https://prometheus.io/webtools/alerting/routing-tree-editor/). The configuration below combines the Star Solution and PEC Technologies trees into a single file:
global:
  smtp_from: 'example@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_identity: 'example@gmail.com'
  smtp_auth_password: 'fqkvkumorgaqgkat'

route:
  # fallback receiver
  receiver: admin
  routes:
    # Star Solutions.
    - match_re:
        app_type: (linux|windows)
      # fallback receiver
      receiver: ss-admin
      routes:
        # Linux team
        - match:
            app_type: linux
          # fallback receiver
          receiver: linux-team-admin
          routes:
            - match:
                severity: critical
              receiver: linux-team-manager
            - match:
                severity: warning
              receiver: linux-team-lead
        # Windows team
        - match:
            app_type: windows
          # fallback receiver
          receiver: windows-team-admin
          routes:
            - match:
                severity: critical
              receiver: windows-team-manager
            - match:
                severity: warning
              receiver: windows-team-lead
    # PEC Technologies.
    - match_re:
        app_type: (python|go)
      # fallback receiver
      receiver: pec-admin
      routes:
        # Python team
        - match:
            app_type: python
          # fallback receiver
          receiver: python-team-admin
          routes:
            - match:
                severity: critical
              receiver: python-team-manager
            - match:
                severity: warning
              receiver: python-team-lead
        # Go team
        - match:
            app_type: go
          # fallback receiver
          receiver: go-team-admin
          routes:
            - match:
                severity: critical
              receiver: go-team-manager
            - match:
                severity: warning
              receiver: go-team-lead

receivers:
  - name: admin
    email_configs:
      - to: 'example@gmail.com'
  - name: ss-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: linux-team-manager
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: windows-team-manager
    email_configs:
      - to: 'example@gmail.com'
  - name: pec-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: python-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: python-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: python-team-manager
    email_configs:
      - to: 'example@gmail.com'
  - name: go-team-admin
    email_configs:
      - to: 'example@gmail.com'
  - name: go-team-lead
    email_configs:
      - to: 'example@gmail.com'
  - name: go-team-manager
    email_configs:
      - to: 'example@gmail.com'
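After changing alertmanager.yml, the running process still has to pick up the new routing tree. A sketch of two common options, using the ports from this guide:

# option 1: send SIGHUP to the running process to reload the configuration
kill -HUP $(pgrep alertmanager)

# option 2: ask AlertManager to reload over HTTP
curl -X POST http://localhost:9093/-/reload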
Testing Prometheus and AlertManager
Starting Prometheus
cd <Prometheus Dir>
./prometheus
Starting AlertManager
cd <Prometheus AlertManager Dir>
./alertmanager
Installing stress-ng
To exercise the CPU and memory alerts defined above (for example NodeMemoryUsageAbove75%, whose expression is (100 - job:node_memory_Mem_bytes:available) >= 75), we can use a dedicated load-generation tool.
That tool is stress-ng, which can drive up CPU and memory usage on a target server.
sudo apt-get install stress-ng
stress-ng can generate CPU, memory, and other kinds of load:
stress-ng -c 2 -v --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k --vm-keep -m 1 --timeout 300s
- -c 2 : runs 2 CPU stress workers.
- -v : enables verbose output.
- --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.85;}' < /proc/meminfo)k : uses awk to read the MemAvailable value from /proc/meminfo, computes 85% of the currently available memory, and passes that amount to stress-ng. For example, if 1,000,000 KB are available, 850,000 KB will be used.
- --vm-keep : keeps the allocated virtual memory mapped for the whole run instead of freeing and re-allocating it.
- -m 1 : runs 1 virtual-memory stress worker.
- --timeout 300s : runs the test for 300 seconds (5 minutes).
fallocate can be used to allocate a file on the filesystem (for example, to drive up disk usage):
fallocate -l 15G temp_file
- -l 15G : sets the size of the file to allocate to 15 GB; -l specifies the file length.
- temp_file : the name of the file to create; in this case a file named temp_file is created.
Throttling & Repretition
group_wait, group_interval을 이용해서 잦은 알람을 예방할 수 있습니다.
group_wait
How long to initially wait for the other alerts to send a notification for a group of alerts.
default = 30secondsgroup_interval
How long to wait before sending a notification about new alerts that are added to group of alerts for which an initial notification has already been sent.
default = 5 minutes
repaet_interval
How long wait before sending a notification if it has already sent a notification for that alert
default = 4 hours
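As a rough sketch, these three settings live on a route in alertmanager.yml; the values below are the documented defaults, and the group_by labels are simply an assumption matching this guide's app_type/severity scheme:

route:
  receiver: admin
  group_by: ['app_type', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h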
Inhibit Rules
AlertManager provides inhibit rules to keep less severe alerts from being sent while a related, more severe alert is already firing.
# ...
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['app_type', 'category']
# ...
Silence & Conitue
Silence는 특정 알림을 일시적으로 무시하도록 설정하는 기능입니다.
이는 특정 조건을 만족하는 알림에 대해 정해진 시간 동안 알림을 보내지 않도록 합니다.
silences:
  - matchers:
      - name: alertname
        value: HighCPUUsage
    startsAt: '2024-06-12T00:00:00Z'
    endsAt: '2024-06-13T00:00:00Z'
    createdBy: 'admin'
    comment: 'Silencing High CPU Usage alerts for maintenance'
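In practice, a silence like the one above is usually created with amtool (or the web UI) against a running AlertManager. A sketch, assuming AlertManager on localhost:9093:

./amtool silence add alertname=HighCPUUsage \
  --alertmanager.url=http://localhost:9093 \
  --duration=24h \
  --author=admin \
  --comment='Silencing High CPU Usage alerts for maintenance'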
The continue option controls whether route evaluation keeps going after a route has matched. By default, a matched route stops evaluation; with continue: true the following sibling routes are still evaluated, so a single alert can be delivered to multiple receivers and more complex notification setups become possible.
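A minimal sketch of continue, assuming a hypothetical admin-oncall receiver in addition to the receivers defined earlier: because the first child route sets continue: true, a critical Linux alert is sent to admin-oncall and then still falls through to the Linux team route.

route:
  receiver: admin
  routes:
    # matched first, but evaluation continues because of continue: true
    - match:
        severity: critical
      receiver: admin-oncall
      continue: true
    # also matched afterwards, so the alert is delivered a second time
    - match:
        app_type: linux
      receiver: linux-team-admin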