This Prometheus guidebook was written while taking the Udemy course A to Z Mentors — Prometheus | The Complete Hands-On for Monitoring & Alerting.
For the full table of contents and indexing, see the Prometheus Guidebook — Introduction page.
Overview
Collecting detailed metrics from an individual application requires changes at the application level. The libraries that support this are called client libraries.
Official and unofficial Prometheus client libraries are available for a wide range of languages.
Using client libraries, usually by adding two or three lines of code, you add the desired instrumentation to your code and define custom metrics to be exposed.
There are a number of client libraries available for all the major languages and runtimes.
The Prometheus project officially provides client libraries in Go, Java/Scala, Python, and Ruby.
Unofficial third-party client libraries: Bash, C, C++, PHP, and more.
Client libraries take care of all the bookkeeping and produce the Prometheus-format metrics.
Metric Types
Counter Type
Gauge Type
Summary Type
Histogram Type
Counter Type
A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase, or be reset to zero on restart.
Counters are mainly used to track how often a particular code path is executed.
e.g. use counters to represent the number of requests served, tasks completed, or errors
Counters have one main method:
inc()
which increases the counter value by one. Do not use counters to expose a value that can decrease,
e.g. temperature, the number of currently running processes, etc.
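A minimal, self-contained sketch of inc() using the Python client (the metric name here is illustrative, not from the course):

```python
# Hypothetical example: a standalone Counter, registered in a private
# CollectorRegistry so it does not clash with the default registry.
from prometheus_client import Counter, CollectorRegistry

registry = CollectorRegistry()
ERRORS = Counter('demo_errors', 'Total errors observed', registry=registry)

ERRORS.inc()    # increase by one (the main method)
ERRORS.inc(3)   # larger increments are allowed, but never negative ones

# Note the automatic '_total' suffix on the exposed sample name.
print(registry.get_sample_value('demo_errors_total'))  # 4.0
```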
Gauge Type
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
Gauges represent a snapshot of some current state.
e.g. used for measured values like temperature, current memory usage, or anything whose value can go both up and down.
Gauges have three main methods:
inc()
, dec()
, set()
which increase the value by one, decrease it by one, and set the gauge to an arbitrary value, respectively.
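A short sketch of the three gauge methods in the Python client (names are illustrative):

```python
# Hypothetical example: the three main Gauge methods.
from prometheus_client import Gauge, CollectorRegistry

registry = CollectorRegistry()
INPROGRESS = Gauge('demo_inprogress', 'Requests currently in progress',
                   registry=registry)

INPROGRESS.inc()    # +1
INPROGRESS.inc()    # +1
INPROGRESS.dec()    # -1 (allowed: gauges can go down)
print(registry.get_sample_value('demo_inprogress'))  # 1.0

INPROGRESS.set(42)  # jump to an arbitrary value
print(registry.get_sample_value('demo_inprogress'))  # 42.0
```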
Summary Type
A summary samples observations such as request durations (how long your application took to respond to a request), latencies, and request sizes.
Summaries track the size and number of events.
A summary has one primary method, observe(), to which we pass the size of the event.
A summary exposes multiple time series during a scrape:
The total sum (<base_name>_sum) of all observed values
The count (<base_name>_count) of events that have been observed.
Summary metrics may also include quantiles over a sliding time window.
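A sketch of observe() feeding a summary and the two series it exposes (names are illustrative). Note that, at the time of writing, the official Python client exposes only _count and _sum for summaries; quantiles are not implemented there.

```python
# Hypothetical example: each observe() updates both _count and _sum.
from prometheus_client import Summary, CollectorRegistry

registry = CollectorRegistry()
LATENCY = Summary('demo_latency_seconds', 'Request latency in seconds',
                  registry=registry)

for seconds in (0.2, 0.3, 0.5):
    LATENCY.observe(seconds)

print(registry.get_sample_value('demo_latency_seconds_count'))  # 3.0
print(registry.get_sample_value('demo_latency_seconds_sum'))    # ~1.0
```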
Histogram Type
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.
The instrumentation for histograms is the same as for summaries.
A histogram exposes multiple time series during a scrape:
Cumulative counters (<base_name>_bucket{le="..."}) for the observation buckets
The total sum (<base_name>_sum) of all observed values
The count (<base_name>_count) of events that have been observed.
The main purpose of using a histogram is calculating quantiles.
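The key detail is that buckets are cumulative: each le bucket counts every observation less than or equal to its boundary. A minimal sketch with the Python client (names and buckets are illustrative):

```python
# Hypothetical example: cumulative buckets in a Histogram.
from prometheus_client import Histogram, CollectorRegistry

registry = CollectorRegistry()
LATENCY = Histogram('demo_latency_seconds', 'Request latency in seconds',
                    buckets=[0.1, 0.5, 1.0], registry=registry)

LATENCY.observe(0.4)  # one request that took 0.4s

# Every bucket with le >= 0.4 counts the event; smaller buckets do not.
print(registry.get_sample_value('demo_latency_seconds_bucket', {'le': '0.1'}))   # 0.0
print(registry.get_sample_value('demo_latency_seconds_bucket', {'le': '0.5'}))   # 1.0
print(registry.get_sample_value('demo_latency_seconds_bucket', {'le': '+Inf'}))  # 1.0
```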
Metric Naming Convention
Metric names should start with a letter, and can be followed by any number of letters, numbers, and underscores.
Metrics must have unique names, and client libraries will report an error if you try to register the same metric twice in your application.
When exposing the time series for a Counter-type metric, a '_total' suffix is automatically added to the exposed metric name.
A metric should represent the same logical thing-being-measured across all label dimensions.
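A quick sketch of the duplicate-name rule with the Python client (the metric name is illustrative):

```python
# Hypothetical example: registering the same metric name twice raises an error.
from prometheus_client import Counter, CollectorRegistry

registry = CollectorRegistry()
Counter('demo_requests', 'First registration', registry=registry)

try:
    Counter('demo_requests', 'Second registration', registry=registry)
except ValueError as err:
    # The registry rejects the duplicated timeseries.
    print('rejected:', err)
```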
Examples
Python Application
Suppose we have the following Python application.
import http.server

APP_PORT = 8000

class HandleRequests(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(bytes("<html><head><title>FirstApplication</title></head><body style='color: #333; margin-top: 30px;'><center><h2>Welcome to our first Prometheus-Python application.</h2></center></body></html>", "utf-8"))
        self.wfile.close()

if __name__ == "__main__":
    server = http.server.HTTPServer(('localhost', APP_PORT), HandleRequests)
    server.serve_forever()
Prometheus Client
Now add the Prometheus client to this Python application, serving metrics on a separate port.
import http.server
from prometheus_client import start_http_server

APP_PORT = 8000
METRICS_PORT = 8001

# ...

if __name__ == "__main__":
    start_http_server(METRICS_PORT)
    # ...
Prometheus Server
Add a new job under scrape_configs on the already-running Prometheus server, as follows.
scrape_configs:
  # ...
  # ...
  - job_name: "prom_python_app"
    static_configs:
      - targets: ["localhost:8001"]
Prometheus Client - add new COUNTER
Now that scraping is ready, let's add a COUNTER to the Prometheus client.
import http.server
from prometheus_client import start_http_server, Counter

APP_PORT = 8000
METRICS_PORT = 8001

REQUEST_COUNTER = Counter('app_requests_count',
                          'total all http request count')

# ...
Then call the method that increments the COUNTER.
# ...
REQUEST_COUNTER = Counter('app_requests_count',
                          'total all http request count')

class HandleRequests(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUEST_COUNTER.inc()
        # ...
# ...
Prometheus Client - add new COUNTER with LABELS
To see COUNTER metrics broken down properly by request path, you can use LABELS.
# ...
REQUEST_COUNTER = Counter('app_requests_count',
                          'total all http request count',
                          ["app_name", "endpoint"])

class HandleRequests(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUEST_COUNTER.labels("prom_python_app", self.path).inc()
        # ...
# ...
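With these labels in place, the per-endpoint request rate can be queried in PromQL, for example (note the '_total' suffix the client adds to the exposed counter name):

```promql
sum by (endpoint) (rate(app_requests_count_total{app_name="prom_python_app"}[5m]))
```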
Prometheus Client - add new GAUGE
Here is a Python application configured to collect Gauge metrics.
Two gauges, REQUEST_INPROGRESS and REQUEST_LAST_SERVED, are defined below.
import http.server
import random
import time
from prometheus_client import start_http_server, Gauge

REQUEST_INPROGRESS = Gauge('app_requests_inprogress',
                           'number of application requests in progress') # ⛳️
REQUEST_LAST_SERVED = Gauge('app_last_served',
                            'Time the application was last served') # ⛳️

APP_PORT = 8000
METRICS_PORT = 8001

class HandleRequests(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUEST_INPROGRESS.inc() # ⛳️
        time.sleep(5) # ⛳️
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(bytes("<html><head><title>FirstApplication</title></head><body style='color: #333; margin-top: 30px;'><center><h2>Welcome to our first Prometheus-Python application.</h2></center></body></html>", "utf-8"))
        self.wfile.close()
        REQUEST_LAST_SERVED.set(time.time())
        REQUEST_INPROGRESS.dec() # ⛳️

if __name__ == "__main__":
    start_http_server(METRICS_PORT) # ⛳️
    server = http.server.HTTPServer(('localhost', APP_PORT), HandleRequests)
    server.serve_forever()
Of course, the code above can be written more concisely using the @REQUEST_INPROGRESS.track_inprogress() decorator and REQUEST_LAST_SERVED.set_to_current_time().
# ...
class HandleRequests(http.server.BaseHTTPRequestHandler):
    @REQUEST_INPROGRESS.track_inprogress()
    def do_GET(self):
        REQUEST_LAST_SERVED.set_to_current_time()
        # ...
Prometheus Client - add new SUMMARY
Here is a Python application configured to collect a Summary metric.
One summary, REQUEST_RESPOND_TIME, is defined below.
import http.server
import time
from prometheus_client import start_http_server, Summary

REQUEST_RESPOND_TIME = Summary('app_response_latency_seconds',
                               'Response latency in seconds')

APP_PORT = 8000
METRICS_PORT = 8001

class HandleRequests(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        start_time = time.time() # ⛳️
        time.sleep(5) # ⛳️
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(bytes("<html><head><title>FirstApplication</title></head><body style='color: #333; margin-top: 30px;'><center><h2>Welcome to our first Prometheus-Python application.</h2></center></body></html>", "utf-8"))
        self.wfile.close()
        end_time = time.time() # ⛳️
        taken_time = end_time - start_time # ⛳️
        REQUEST_RESPOND_TIME.observe(taken_time) # ⛳️

if __name__ == "__main__":
    start_http_server(METRICS_PORT)
    server = http.server.HTTPServer(('localhost', APP_PORT), HandleRequests)
    server.serve_forever()
Using this approach, you can see that two metric series are collected:
app_response_latency_seconds_count
app_response_latency_seconds_sum
You can run a PromQL query like the following against these metrics.
rate(app_response_latency_seconds_sum[5m])
  / rate(app_response_latency_seconds_count[5m])
Again, the code above can be written more simply using the @REQUEST_RESPOND_TIME.time() decorator.
# ...
class HandleRequests(http.server.BaseHTTPRequestHandler):
    @REQUEST_RESPOND_TIME.time() # ⛳️
    def do_GET(self):
        # ...
# ...
Prometheus Client - add new HISTOGRAM
Here is a Python application configured to collect a Histogram metric.
One histogram, REQUEST_RESPOND_TIME, is defined below.
import http.server
import time
from prometheus_client import start_http_server, Histogram

REQUEST_RESPOND_TIME = Histogram('app_response_latency_seconds',
                                 'Response latency in seconds') # ⛳️

APP_PORT = 8000
METRICS_PORT = 8001

class HandleRequests(http.server.BaseHTTPRequestHandler):
    @REQUEST_RESPOND_TIME.time() # ⛳️
    def do_GET(self):
        # ...
# ...
As in the SUMMARY example, the collected data is a duration.
A HISTOGRAM splits the duration into ranges, so cumulative, stepped series like the ones below are collected.
Each range is called a bucket, and the defaults are as follows.
app_response_latency_seconds_bucket{le="0.005"}
app_response_latency_seconds_bucket{le="0.01"}
app_response_latency_seconds_bucket{le="0.025"}
app_response_latency_seconds_bucket{le="0.05"}
app_response_latency_seconds_bucket{le="0.075"}
app_response_latency_seconds_bucket{le="0.1"}
app_response_latency_seconds_bucket{le="0.25"}
app_response_latency_seconds_bucket{le="0.5"}
app_response_latency_seconds_bucket{le="0.75"}
app_response_latency_seconds_bucket{le="1.0"}
app_response_latency_seconds_bucket{le="2.5"}
app_response_latency_seconds_bucket{le="5.0"}
app_response_latency_seconds_bucket{le="7.5"}
app_response_latency_seconds_bucket{le="10.0"}
app_response_latency_seconds_bucket{le="+Inf"}
app_response_latency_seconds_count 1.0
app_response_latency_seconds_sum 5.007552497001598
If you want custom buckets instead, you can configure them as shown below. Finer-grained buckets are more useful, but they also consume more resources.
So choose bucket boundaries appropriate to your application and the functionality it provides.
import http.server
import time
from prometheus_client import start_http_server, Histogram

REQUEST_RESPOND_TIME = Histogram('app_response_latency_seconds',
                                 'Response latency in seconds',
                                 buckets=[0.1, 0.5, 1, 2, 3, 4, 5, 10]) # ⛳️

APP_PORT = 8000
METRICS_PORT = 8001

class HandleRequests(http.server.BaseHTTPRequestHandler):
    @REQUEST_RESPOND_TIME.time() # ⛳️
    def do_GET(self):
        # ...
# ...
Since the bucket boundaries were changed above, the collected series change accordingly.
app_response_latency_seconds_bucket{le="0.1"} 0.0
app_response_latency_seconds_bucket{le="0.5"} 0.0
app_response_latency_seconds_bucket{le="1.0"} 0.0
app_response_latency_seconds_bucket{le="2.0"} 1.0
app_response_latency_seconds_bucket{le="3.0"} 1.0
app_response_latency_seconds_bucket{le="4.0"} 1.0
app_response_latency_seconds_bucket{le="5.0"} 1.0
app_response_latency_seconds_bucket{le="10.0"} 1.0
app_response_latency_seconds_bucket{le="+Inf"} 1.0
app_response_latency_seconds_count 1.0
app_response_latency_seconds_sum 1.007552497001598
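Given these bucket series, quantiles are computed on the Prometheus server side with histogram_quantile. For example, an estimated 95th-percentile latency over the last 5 minutes:

```promql
histogram_quantile(0.95, rate(app_response_latency_seconds_bucket[5m]))
```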