Prometheus Recording Rules

Prometheus Guidebook

Jun 09, 2024

Contents

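Overview
  Why Recording Rules Are Needed
  What Is a Recording Rule?
  How Recording Rules Work
  Benefits of Recording Rules
Prometheus Graceful Reload
  Using the SIGHUP Signal
  Sending a POST Request to the Reload Handler
Examples
  Writing a Prometheus Recording Rule
  Applying the Naming Convention to a Prometheus Recording Rule
  Applying Multiple Recording Rules
Best Practices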

This guidebook was written while following the Udemy course "Prometheus | The Complete Hands-On for Monitoring & Alerting" by A to Z Mentors.

For the full table of contents and index of the guidebook, see the "Prometheus Guidebook: Introduction" page.

Overview

Why Recording Rules Are Needed

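So far we have set up Prometheus and used PromQL to query and aggregate its data, and in doing so caught a glimpse of how capable it is:

How the Prometheus Node/WMI Exporter exposes metrics about its own host
How the Prometheus client libraries record metrics from Python/Golang applications
How the Prometheus server pulls in (scrapes) the recorded metrics
How PromQL queries and analyzes the collected metrics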

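However, once metrics are being collected from thousands of servers, queries start to lag. A PromQL expression that needs heavy computation lags even more, and the problem becomes serious when such a query has to run periodically.

Recording rules are the technique that addresses this.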

What Is a Recording Rule?

Recording rules let you precompute frequently needed or computationally expensive expressions and save the result as a new set of time series in Prometheus storage (the TSDB, on HDD/SSD).
Prometheus supports two types of rules, which are configured and then evaluated at regular intervals:
1. Recording rules
2. Alerting rules
Recording rules are written in YAML.

How Recording Rules Work

1. Define the rules in a rules.yaml file.
2. Prometheus loads the file and evaluates the rules at regular intervals.

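rules.yaml example

A rule pairs a PromQL expression with the name of the new time series that stores its result:

Expression: sum without (instance)(rate(prometheus_http_requests_total{job="prometheus"}[5m]))
Recorded as: job:prometheus_http_request:rate5m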
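
The two steps can be sketched as a small shell script that writes the example rule into rules.yaml and references that file from prometheus.yml through rule_files. Only the expression and the record name come from the example; the group name http-rules, the 15s evaluation interval, the scrape target, and the scratch directory are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: create a rule file and a Prometheus config that loads it.
# Assumptions: the group name "http-rules", the 15s interval, the scrape
# target, and the scratch directory are illustrative only.
workdir="$(mktemp -d)"

# Step 1: define the recording rule in rules.yaml.
cat > "$workdir/rules.yaml" <<'EOF'
groups:
  - name: http-rules
    rules:
      - record: job:prometheus_http_request:rate5m
        expr: sum without (instance)(rate(prometheus_http_requests_total{job="prometheus"}[5m]))
EOF

# Step 2: point prometheus.yml at the rule file so Prometheus loads it.
cat > "$workdir/prometheus.yml" <<'EOF'
global:
  evaluation_interval: 15s   # how often rules are evaluated
rule_files:
  - rules.yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
EOF

echo "wrote $workdir/rules.yaml and $workdir/prometheus.yml"
```

Once Prometheus runs with such a configuration, the recorded series can be queried by its name, job:prometheus_http_request:rate5m, like any other metric.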

Benefits of Recording Rules

Querying the precomputed result is much faster than executing the original expression.
Recording rules are especially helpful for dashboards, which have to query the same expression again every time they refresh.

Prometheus Graceful Reload

Until now, every change to the Prometheus configuration file meant restarting Prometheus.
A restart takes the Prometheus server down, which is a serious problem.

Prometheus can pick up the latest configuration without being stopped, in two ways:

Send the SIGHUP signal
Send a POST request to the reload handler

Using the SIGHUP Signal

Find the process ID (PID) of the running Prometheus server:

ps ax | grep prometheus

Then send SIGHUP to that PID:

kill -HUP <PROMETHEUS_PID>
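
The lookup and the signal can be combined into one helper. This is a minimal sketch, assuming pgrep is available and the server process is named exactly prometheus; the function name reload_by_signal is made up for this example.

```shell
#!/bin/sh
# Sketch: look up the Prometheus PID(s) by process name and send SIGHUP.
# Assumption: the process is named exactly "prometheus" (override via $1).
reload_by_signal() {
  name="${1:-prometheus}"
  # pgrep -x matches the process name exactly and prints the matching PIDs.
  pids="$(pgrep -x "$name")" || {
    echo "no running process named $name" >&2
    return 1
  }
  # SIGHUP asks Prometheus to reload its configuration and rule files.
  # $pids is left unquoted so several PIDs become separate arguments.
  kill -HUP $pids
}
```

After editing the configuration or a rule file, calling reload_by_signal applies the change without a restart; it returns a non-zero status if no matching process is found.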

Sending a POST Request to the Reload Handler

Instead of sending SIGHUP, you can trigger a reload through an HTTP endpoint built into Prometheus.

To use it, however, the reload feature must be enabled when Prometheus is started.

Start Prometheus with reload enabled

<~>/prometheus --web.enable-lifecycle

Call the reload API

curl -X POST http://localhost:9090/-/reload
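
For scripted use, the call can be wrapped so that HTTP errors are reported instead of silently ignored. This is a sketch, assuming Prometheus listens on localhost:9090 and was started with the lifecycle flag; the PROM_URL variable and both function names are made up for this example.

```shell
#!/bin/sh
# Sketch: trigger a reload over HTTP and fail loudly on errors.
# Assumption: PROM_URL is a hypothetical variable holding the server base URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the reload endpoint URL from a base URL (defaults to PROM_URL).
reload_url() {
  printf '%s/-/reload\n' "${1:-$PROM_URL}"
}

# POST to the reload handler.
# -f: exit non-zero on HTTP errors, -sS: quiet except for error messages.
reload_by_http() {
  curl -fsS -X POST "$(reload_url "$1")"
}
```

Calling reload_by_http with no argument targets the default address; passing a base URL such as http://prometheus.example:9090 (a placeholder host) targets another server.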

Examples

Writing a Prometheus Recording Rule

Create a file named myrules.yml with the following content.

groups:
  - name: my-rules
    rules:
    - record: <RECORD_NAME>
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

What name should replace <RECORD_NAME> in this file?

Applying the Naming Convention to a Prometheus Recording Rule

We have just written a new Prometheus recording rule.

How should it be named under the recommended naming convention?

Recording rule names should follow the general form level:metric:operations, e.g. job:node_cpu_seconds:avg_idle.
Level: the aggregation level of the metric and the labels of the rule output.
Metric: the name of the metric under evaluation, except that the _total suffix of a counter is dropped when rate() is applied (hence node_cpu_seconds in the example).
Operations: the list of operations that were applied to the metric under evaluation, newest operation first.
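
The level:metric:operations shape can be checked mechanically. The sketch below is a simple style check, assuming each of the three parts is a plain identifier; Prometheus itself does not enforce the convention, and the function name is_conventional_name is made up for this example.

```shell
#!/bin/sh
# Sketch: check that a rule name has the level:metric:operations shape.
# Assumption: each part is a plain identifier; this is a style check only.
is_conventional_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z_][a-zA-Z0-9_]*$'
}

# Try one name that follows the convention and one that does not.
for name in job:node_cpu_seconds:avg_idle node_cpu_idle_avg; do
  if is_conventional_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: does not match level:metric:operations"
  fi
done
```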

With the convention applied, the rule looks like this.

groups:
  - name: my-rules
    rules:
    - record: job:node_cpu_seconds:avg_idle
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Applying Multiple Recording Rules

A single rule file can contain multiple rule groups.

A single rule group can contain multiple rules.

groups:
  - name: my-rules
    rules:
    - record: job:node_cpu_seconds:avg_idle
      expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
    - record: job:node_cpu_seconds:avg_not_idle
      expr: avg without(cpu, mode)(rate(node_cpu_seconds_total{mode!="idle"}[5m]))

  - name: my-rule_new
    rules:
    - record: job:node_cpu_seconds:avg_not_idle_new
      expr: avg without(cpu, mode)(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
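
Before reloading, a rule file like this can be checked for syntax errors with promtool, the command-line tool distributed with Prometheus. This is a sketch, assuming promtool may or may not be on the PATH (the check is skipped if it is missing); the scratch file location is arbitrary and the file is a shortened two-group variant of the example.

```shell
#!/bin/sh
# Sketch: write a two-group rule file and validate it before reloading.
# Assumption: promtool may not be installed, so the check is optional.
rules_file="$(mktemp -d)/myrules.yml"

cat > "$rules_file" <<'EOF'
groups:
  - name: my-rules
    rules:
      - record: job:node_cpu_seconds:avg_idle
        expr: avg without(cpu)(rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - name: my-rule_new
    rules:
      - record: job:node_cpu_seconds:avg_not_idle_new
        expr: avg without(cpu, mode)(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
EOF

if command -v promtool >/dev/null 2>&1; then
  # Exits non-zero and reports the problem if the file is invalid.
  promtool check rules "$rules_file"
else
  echo "promtool not found; skipping validation of $rules_file"
fi
```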

Best Practices

Try to follow these four best practices when working with Prometheus recording rules.

1. Avoid recording rules over long range vectors: such queries tend to be expensive, and running them regularly can cause performance problems.
2. Use recording rules when metric data is stored long term (for months or years).
3. Define rules in separate groups based on the jobs they belong to.
4. Follow the naming convention when naming recording rules.