Prometheus Monitoring: SRE Best Practices and Implementation

Effective Metric Collection

Key Metric Types

Counter Metrics

    # Example counter metric
    http_requests_total{status="200", handler="/api/v1"}

Gauge Metrics

    # Memory usage example
    process_resident_memory_bytes

PromQL Best Practices

Rate Calculations

    # Request rate over 5 minutes
    rate(http_requests_total[5m])

    # Error rate percentage
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) * 100

Alert Configuration

Alert Rules Example

    groups:
      - name: example
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) * 100 > 5
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: High HTTP error rate
              description: "Error rate is {{ $value }}%"

Recording Rules

    groups:
      - name: example
        rules:
          - record: job:http_inprogress_requests:sum
            expr: sum by (job) (http_inprogress_requests)

Retention and Storage

Storage Configuration

Scrape and evaluation intervals are set in prometheus.yml:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

In Prometheus 2.x, retention is set with command-line flags rather than in prometheus.yml:

    prometheus \
      --config.file=prometheus.yml \
      --storage.tsdb.retention.time=15d \
      --storage.tsdb.retention.size=512GB

Production Example

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: api-monitor
    spec:
      selector:
        matchLabels:
          app: api
      endpoints:
        - port: metrics
          interval: 30s
          path: /metrics
        - port: metrics
          interval: 10s
          path: /metrics/critical
          # Keep only http_requests_total from the critical endpoint
          metricRelabelings:
            - sourceLabels: [__name__]
              regex: 'http_requests_total'
              action: keep
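To make the rate and error-percentage queries above more concrete, here is a minimal Python sketch of the arithmetic behind them. It is a simplification, not how Prometheus implements rate(): it handles counter resets but skips the extrapolation to the window boundaries that the real function performs. The names simple_rate and error_percentage and the sample data are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# One scraped sample: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase across the samples.

    Simplified stand-in for PromQL rate(): any drop in value is treated
    as a counter reset, but no extrapolation is applied.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up. A lower value means the process
        # restarted, so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def error_percentage(errors: List[Sample], total: List[Sample]) -> float:
    """Share of requests that failed, as a percentage.

    Returns 0.0 when there is no traffic; this is an assumption made
    for the sketch (PromQL itself would yield NaN for 0 / 0).
    """
    total_rate = simple_rate(total)
    if total_rate == 0:
        return 0.0
    return simple_rate(errors) / total_rate * 100


if __name__ == "__main__":
    # Hypothetical 5-minute window scraped every 60s, with a restart at t=180.
    total = [(0, 1000), (60, 1600), (120, 2200), (180, 100), (240, 700), (300, 1300)]
    errors = [(0, 10), (60, 40), (120, 70), (180, 5), (240, 35), (300, 65)]
    print(f"request rate: {simple_rate(total):.2f}/s")
    print(f"error rate:   {error_percentage(errors, total):.2f}%")
```

With this sample data the error percentage comes out at roughly 5%, which sits right at the threshold of the HighErrorRate alert. The restart at t=180 shows why the queries use rate() on the counter rather than subtracting the first raw value from the last.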

1 min · Me