Effective Metric Collection
Key Metric Types
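- Counter Metrics: cumulative values that only increase, resetting to zero when the process restarts.
    # Example counter metric
    http_requests_total{status="200", handler="/api/v1"}
- Gauge Metrics: point-in-time values that can rise or fall.
    # Memory usage example
    process_resident_memory_bytes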
PromQL Best Practices
Rate Calculations
# Request rate over 5 minutes
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
Alert Configuration
Alert Rules Example
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High HTTP error rate
          description: "Error rate is {{ $value }}%"
Recording Rules
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum by (job) (http_inprogress_requests)
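The recorded series can then be referenced like any other metric. As a minimal sketch, an alert on it might look like the following. The alert name, the threshold of 100, and the severity are hypothetical values chosen for illustration and do not come from the rules above.

```yaml
groups:
  - name: example
    rules:
      - alert: HighInProgressRequests   # hypothetical alert name
        # Reads the precomputed series instead of re-aggregating on every evaluation
        expr: job:http_inprogress_requests:sum > 100   # threshold is an assumption
        for: 5m
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "Many in-flight HTTP requests for job {{ $labels.job }}"
```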
Retention and Storage
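- Storage Configuration
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    # Retention is commonly set with the command-line flags
    # --storage.tsdb.retention.time and --storage.tsdb.retention.size;
    # confirm that your Prometheus version accepts these keys in the config file.
    storage:
      tsdb:
        retention.time: 15d
        retention.size: 512GB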
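The same retention limits can be passed as Prometheus command-line flags, which may be the only supported route depending on the Prometheus version in use. Below is a hedged sketch of the flags as container arguments, assuming a Kubernetes-style container spec. The container name, image reference, and config file path are hypothetical.

```yaml
containers:
  - name: prometheus                 # hypothetical container name
    image: prom/prometheus           # hypothetical image reference
    args:
      - --config.file=/etc/prometheus/prometheus.yml   # assumed config path
      - --storage.tsdb.retention.time=15d              # drop data older than 15 days
      - --storage.tsdb.retention.size=512GB            # cap on-disk block size
```

When both limits are set, whichever is reached first triggers removal of the oldest data.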
Production Example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
    - port: metrics
      interval: 10s
      path: /metrics/critical
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'http_requests_total'
          action: keep
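A ServiceMonitor selects Kubernetes Service objects by label and refers to their ports by name, so the target Service needs a matching `app: api` label and a port named `metrics`. A minimal sketch of such a Service follows. The Service name, the pod selector, and port number 8080 are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api                # hypothetical Service name
  labels:
    app: api               # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: api               # assumed pod label
  ports:
    - name: metrics        # referenced by endpoints[].port in the ServiceMonitor
      port: 8080           # assumed port number
      targetPort: 8080
```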