Monitoring and Observability Guide¶
Comprehensive guide to monitoring, metrics, logging, and alerting for the FastAPI HTTP/WebSocket application.
Table of Contents¶
- Overview
- Metrics Collection
- Grafana Dashboards
- Prometheus Alerts
- Log Aggregation
- Distributed Tracing
- Performance Monitoring
- Best Practices
Overview¶
Monitoring Stack¶
Metrics Flow:
graph TB
subgraph "Application Components"
FastAPI[FastAPI<br/>:8000]
Keycloak[Keycloak<br/>:9000]
Traefik[Traefik<br/>:8080]
end
FastAPI -->|/metrics| Prometheus
Keycloak -->|/metrics| Prometheus
Traefik -->|/metrics| Prometheus
Prometheus[Prometheus<br/>Metrics DB<br/>:9090]
Grafana[Grafana<br/>Visualization<br/>:3000]
Prometheus -->|Query| Grafana
style FastAPI fill:#f9f,stroke:#333,stroke-width:2px
style Keycloak fill:#fbf,stroke:#333,stroke-width:2px
style Traefik fill:#bbf,stroke:#333,stroke-width:2px
style Prometheus fill:#fbb,stroke:#333,stroke-width:2px
style Grafana fill:#bfb,stroke:#333,stroke-width:2px
Logs Flow:
graph TB
subgraph "Application Logs"
FastAPILogs[FastAPI<br/>JSON logs]
DockerLogs[Docker<br/>logs]
TraefikLogs[Traefik<br/>logs]
end
FastAPILogs -->|stdout| Alloy
DockerLogs -->|stdout| Alloy
TraefikLogs -->|stdout| Alloy
Alloy[Grafana Alloy<br/>Log Collector<br/>:12345]
Loki[Loki<br/>Log Aggregation<br/>:3100]
GrafanaLogs[Grafana<br/>Log Queries<br/>:3000]
Alloy -->|Push| Loki
Loki -->|Query| GrafanaLogs
style FastAPILogs fill:#f9f,stroke:#333,stroke-width:2px
style DockerLogs fill:#ddf,stroke:#333,stroke-width:2px
style TraefikLogs fill:#bbf,stroke:#333,stroke-width:2px
style Alloy fill:#ffd,stroke:#333,stroke-width:2px
style Loki fill:#dff,stroke:#333,stroke-width:2px
style GrafanaLogs fill:#bfb,stroke:#333,stroke-width:2px
Metrics Collection¶
Application Metrics¶
The application exposes Prometheus metrics at the /metrics endpoint; a minimal definition sketch follows the list of metric types below.
Key Metric Types:
- Counters: Cumulative values (requests, errors)
- Gauges: Point-in-time values (connections, queue size)
- Histograms: Distributions (latency, request size)
- Summaries: Quantiles (percentiles)
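As a rough illustration of how these types are wired up, here is a minimal prometheus_client sketch for a FastAPI app. The metric objects and the make_asgi_app mount are illustrative assumptions; the project's actual instrumentation module may define them differently.

```python
# Minimal sketch (illustrative, not the project's actual instrumentation module)
from fastapi import FastAPI
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

app = FastAPI()

# Counter: cumulative, only increases
HTTP_REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status_code"],
)

# Gauge: point-in-time value, can go up and down
IN_PROGRESS = Gauge(
    "http_requests_in_progress", "In-progress requests", ["method", "endpoint"]
)

# Histogram: bucketed observations for latency distributions
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "Request duration in seconds",
    ["method", "endpoint"],
)

# Expose every registered metric at /metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())

# Typical usage inside a middleware or handler
HTTP_REQUESTS.labels(method="GET", endpoint="/authors", status_code="200").inc()
```

Prometheus then scrapes /metrics on the interval defined in its scrape configuration.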
Available Metrics¶
HTTP Metrics¶
# Total HTTP requests by method, endpoint, status
http_requests_total{method="GET",endpoint="/authors",status_code="200"}
# Request duration histogram (seconds)
http_request_duration_seconds{method="POST",endpoint="/authors"}
# Percentiles (computed from histogram buckets at query time)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{method="GET",endpoint="/authors"}[5m]))
# In-progress requests
http_requests_in_progress{method="GET",endpoint="/authors"}
WebSocket Metrics¶
# Active WebSocket connections
ws_connections_active
# Total connections by status
ws_connections_total{status="accepted"}
ws_connections_total{status="rejected_auth"}
ws_connections_total{status="rejected_limit"}
# Messages received/sent
ws_messages_received_total
ws_messages_sent_total
# Message processing duration by handler
ws_message_processing_duration_seconds{pkg_id="1"}
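A hedged sketch of how a WebSocket handler could maintain these metrics (connection gauge, per-status counter, and per-handler processing histogram). The handler shape and the pkg_id field are assumptions about the application's message format.

```python
import time

from fastapi import WebSocket, WebSocketDisconnect
from prometheus_client import Counter, Gauge, Histogram

WS_ACTIVE = Gauge("ws_connections_active", "Active WebSocket connections")
WS_TOTAL = Counter("ws_connections_total", "WebSocket connections", ["status"])
WS_RECEIVED = Counter("ws_messages_received_total", "Messages received")
WS_PROCESSING = Histogram(
    "ws_message_processing_duration_seconds",
    "Message processing duration", ["pkg_id"],
)

async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    WS_TOTAL.labels(status="accepted").inc()
    WS_ACTIVE.inc()
    try:
        while True:
            message = await websocket.receive_json()
            WS_RECEIVED.inc()
            start = time.perf_counter()
            # ... dispatch to the handler registered for this pkg_id ...
            WS_PROCESSING.labels(pkg_id=str(message.get("pkg_id", 0))).observe(
                time.perf_counter() - start
            )
    except WebSocketDisconnect:
        pass
    finally:
        WS_ACTIVE.dec()
```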
Database Metrics¶
# Query duration by operation
db_query_duration_seconds{operation="select"}
# Active database connections
db_connections_active
# Database errors
db_errors_total{operation="insert",error_type="integrity_error"}
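The database metrics lend themselves to a small timing and error-counting wrapper around repository calls. A sketch, assuming an async SQLAlchemy session; the label values mirror the examples above.

```python
from prometheus_client import Counter, Histogram
from sqlalchemy.exc import IntegrityError

DB_QUERY_DURATION = Histogram(
    "db_query_duration_seconds", "Query duration", ["operation"]
)
DB_ERRORS = Counter("db_errors_total", "Database errors", ["operation", "error_type"])

async def insert_author(session, author):
    # .time() is a context manager that records the elapsed time on exit
    with DB_QUERY_DURATION.labels(operation="insert").time():
        try:
            session.add(author)
            await session.commit()
        except IntegrityError:
            DB_ERRORS.labels(operation="insert", error_type="integrity_error").inc()
            raise
```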
Rate Limiting Metrics¶
# Rate limit hits by type
rate_limit_hits_total{limit_type="http"}
rate_limit_hits_total{limit_type="websocket_connection"}
rate_limit_hits_total{limit_type="websocket_message"}
Authentication Metrics¶
# Auth attempts by status
auth_attempts_total{status="success"}
auth_attempts_total{status="failed"}
auth_attempts_total{status="token_expired"}
# Token validation
token_validation_total{status="valid"}
token_validation_total{status="invalid"}
Application Info¶
# Application version and environment
app_info{version="1.0.0",python_version="3.11.0",environment="production"}
Traefik Metrics¶
# Requests per service
traefik_service_requests_total{service="fastapi@docker"}
# Request duration
traefik_service_request_duration_seconds{service="fastapi@docker"}
# Backend server status
traefik_service_server_up{service="fastapi@docker"}
# Open connections
traefik_service_open_connections{service="fastapi@docker"}
Keycloak Metrics¶
# JVM heap memory
jvm_memory_used_bytes{area="heap"}
jvm_memory_max_bytes{area="heap"}
# Garbage collection
jvm_gc_pause_seconds_sum
jvm_gc_pause_seconds_count
# Thread count
jvm_threads_current
jvm_threads_peak
PostgreSQL Metrics¶
If using the PostgreSQL exporter:
# Database size
pg_database_size_bytes{datname="fastapi_prod"}
# Active connections
pg_stat_database_numbackends{datname="fastapi_prod"}
# Transactions per second
rate(pg_stat_database_xact_commit{datname="fastapi_prod"}[5m])
# Cache hit ratio
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)
Redis Metrics¶
If using the Redis exporter:
# Connected clients
redis_connected_clients
# Memory usage
redis_memory_used_bytes
redis_memory_max_bytes
# Commands per second
rate(redis_commands_processed_total[5m])
# Keyspace hits/misses
redis_keyspace_hits_total
redis_keyspace_misses_total
# Hit ratio
redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)
Grafana Dashboards¶
Existing Dashboards¶
The application includes pre-configured Grafana dashboards:
- FastAPI Metrics (fastapi-metrics.json)
  - Request rates and latency
  - WebSocket connections
  - Error rates
  - Rate limiting
- Traefik Metrics (traefik-metrics.json)
  - Request distribution
  - Backend health
  - Response times
  - Status codes
- Keycloak Metrics (keycloak-metrics.json)
  - JVM metrics
  - Memory usage
  - GC activity
  - Thread count
- Application Logs (application-logs.json)
  - Log volume
  - Error logs
  - HTTP requests
  - Rate limits
Accessing Dashboards¶
# Access Grafana
https://grafana.example.com
# Login via Keycloak (auto-redirect)
# Dashboards location
Home → Dashboards → Browse
# Or direct URLs
https://grafana.example.com/d/fastapi-metrics
https://grafana.example.com/d/traefik-metrics
https://grafana.example.com/d/keycloak-metrics
https://grafana.example.com/d/application-logs
Creating Custom Dashboards¶
Via UI:
1. Grafana → Dashboards → New Dashboard
2. Add Panel
3. Select Prometheus data source
4. Enter PromQL query
5. Configure visualization
6. Save dashboard
Via JSON (recommended for version control):
{
"dashboard": {
"title": "Custom Dashboard",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{method}} {{endpoint}}"
}
],
"type": "graph"
}
]
}
}
Save to docker/grafana/provisioning/dashboards/custom.json and set permissions to 644.
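For quick iteration before committing a file to the provisioning directory, a dashboard JSON can also be pushed through Grafana's HTTP API. A sketch, assuming a Grafana service-account token; the URL and file name are placeholders.

```python
import json

import httpx

GRAFANA_URL = "https://grafana.example.com"
TOKEN = "<service-account-token>"  # placeholder

with open("custom.json") as f:
    dashboard = json.load(f)["dashboard"]

response = httpx.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json={"dashboard": dashboard, "overwrite": True},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.json())  # includes the dashboard uid and url
```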
Prometheus Alerts¶
Alert Rules Configuration¶
Create docker/prometheus/alerts/application.yml:
groups:
- name: application
interval: 30s
rules:
# High Error Rate
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
component: fastapi
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
# Slow Response Time
- alert: SlowResponseTime
expr: |
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
) > 1.0
for: 5m
labels:
severity: warning
component: fastapi
annotations:
summary: "Slow response time (p99 > 1s)"
description: "99th percentile latency is {{ $value }}s"
# WebSocket Connection Limit
- alert: HighWebSocketConnections
expr: ws_connections_active > 1000
for: 5m
labels:
severity: warning
component: fastapi
annotations:
summary: "High number of WebSocket connections"
description: "{{ $value }} active connections (threshold: 1000)"
# Rate Limit Abuse
- alert: RateLimitAbuse
expr: rate(rate_limit_hits_total[5m]) > 100
for: 5m
labels:
severity: warning
component: fastapi
annotations:
summary: "High rate of rate limit hits"
description: "{{ $value }} rate limit hits per second"
# Database Connection Pool Exhaustion
- alert: DatabaseConnectionPoolExhausted
expr: db_connections_active / db_connections_max > 0.9
for: 5m
labels:
severity: critical
component: database
annotations:
summary: "Database connection pool nearly exhausted"
description: "{{ $value | humanizePercentage }} of connections in use"
- name: infrastructure
interval: 30s
rules:
# Service Down
- alert: ServiceDown
expr: up{job=~"fastapi|keycloak|traefik"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is down"
description: "Service has been down for 1 minute"
# Database Down
- alert: DatabaseDown
expr: up{job="postgres"} == 0
for: 1m
labels:
severity: critical
component: database
annotations:
summary: "PostgreSQL is down"
description: "Database has been unreachable for 1 minute"
# Redis Down
- alert: RedisDown
expr: up{job="redis"} == 0
for: 1m
labels:
severity: critical
component: redis
annotations:
summary: "Redis is down"
description: "Redis has been unreachable for 1 minute"
# High Memory Usage
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{name="hw-server"}
/ container_spec_memory_limit_bytes{name="hw-server"} > 0.9
for: 5m
labels:
severity: warning
component: fastapi
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value | humanizePercentage }}"
# High CPU Usage
- alert: HighCPUUsage
expr: |
rate(container_cpu_usage_seconds_total{name="hw-server"}[5m]) > 0.8
for: 5m
labels:
severity: warning
component: fastapi
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value | humanizePercentage }}"
- name: security
interval: 30s
rules:
# High Failed Login Rate
- alert: HighFailedLoginRate
expr: rate(auth_attempts_total{status="failed"}[5m]) > 10
for: 5m
labels:
severity: warning
component: security
annotations:
summary: "High rate of failed login attempts"
description: "{{ $value }} failed logins per second"
# Unauthorized Access Attempts
- alert: UnauthorizedAccessAttempts
expr: rate(http_requests_total{status_code="403"}[5m]) > 5
for: 5m
labels:
severity: warning
component: security
annotations:
summary: "High rate of unauthorized access attempts"
description: "{{ $value }} 403 responses per second"
Alert Manager Configuration¶
Create docker/prometheus/alertmanager.yml:
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'default'
email_configs:
- to: 'ops@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager@example.com'
auth_password: 'password'
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
channel: '#alerts'
title: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
description: '{{ .GroupLabels.alertname }}'
Testing Alerts¶
# Trigger high error rate alert
# (HighErrorRate watches 5xx responses; an unknown route only returns 404,
# so target an endpoint that fails server-side)
for i in {1..1000}; do
curl -X POST https://api.example.com/nonexistent
done
# Trigger slow response alert
# (Requires endpoint that sleeps)
# Check alert status
https://prometheus.example.com/alerts
# Check AlertManager
https://alertmanager.example.com
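Receiver routing can also be exercised without waiting for Prometheus to fire a rule, by posting a synthetic alert to AlertManager's v2 API. A sketch; the AlertManager URL and label values are placeholders.

```python
from datetime import datetime, timedelta, timezone

import httpx

ALERTMANAGER_URL = "https://alertmanager.example.com"

now = datetime.now(timezone.utc)
synthetic_alert = [{
    "labels": {"alertname": "SyntheticTest", "severity": "warning", "component": "fastapi"},
    "annotations": {"summary": "Synthetic alert to verify receiver routing"},
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]

response = httpx.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=synthetic_alert)
response.raise_for_status()  # HTTP 200 means the alert was accepted
```

The alert should then appear in the AlertManager UI and reach whichever receiver matches its severity label.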
Log Aggregation¶
Structured Logging¶
The application uses structured JSON logging (see app/logging.py).
Log Format:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"logger": "app.api.http.author",
"message": "Author created successfully",
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"endpoint": "/authors",
"method": "POST",
"status_code": 201,
"duration_ms": 45.2,
"environment": "production"
}
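A minimal sketch of a formatter that produces records in this shape; the project's actual app/logging.py may implement it differently (for example via a third-party JSON logging library).

```python
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    # Attributes present on every LogRecord; everything else came from `extra=`
    RESERVED = set(logging.makeLogRecord({}).__dict__) | {"message"}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra= (request_id, user_id, ...) are plain
        # attributes on the record; copy everything that is not reserved.
        for key, value in record.__dict__.items():
            if key not in self.RESERVED:
                payload[key] = value
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```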
LogQL Queries¶
Common Queries:
# Recent error logs
{service="shell"} | json | level="ERROR"
# Logs for specific user
{service="shell"} | json | user_id="user123"
# HTTP requests to specific endpoint
{service="shell"} | json | endpoint=~"/api/authors.*"
# Failed authentication attempts
{service="shell"} | json | logger=~"app.auth.*" |~ "(?i)(error|failed|invalid)"
# Rate limit violations
{service="shell"} | json |~ "(?i)(rate limit|too many requests)"
# WebSocket logs
{service="shell"} | json | logger=~"app.api.ws.*"
# Slow operations (> 100ms)
{service="shell"} | json | duration_ms > 100
# Correlate by request ID
{service="shell"} | json | request_id="550e8400-e29b-41d4-a716-446655440000"
# Error rate over time
rate({service="shell"} | json | level="ERROR"[5m])
# Top 10 error messages
topk(10, sum by (message) (count_over_time({service="shell"} | json | level="ERROR"[1h])))
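These queries can also be run programmatically against Loki's query_range API, for example from a smoke test or a report script. A sketch; the Loki URL is a placeholder for your deployment.

```python
import time

import httpx

LOKI_URL = "http://loki:3100"
query = '{service="shell"} | json | level="ERROR"'

end_ns = int(time.time() * 1e9)
start_ns = end_ns - int(3600 * 1e9)  # last hour

response = httpx.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": query, "start": start_ns, "end": end_ns, "limit": 100},
)
response.raise_for_status()

for stream in response.json()["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)
```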
Log Retention¶
Configure in docker/loki/loki-config.yml:
limits_config:
retention_period: 744h # 31 days
table_manager:
retention_deletes_enabled: true
retention_period: 744h
Distributed Tracing¶
Correlation ID Tracing (Built-in)¶
The application uses correlation IDs for distributed tracing without requiring OpenTelemetry. This provides equivalent functionality for monolithic services and simple microservices architectures.
How It Works:
- X-Correlation-ID Header: Automatically added to all requests (8-char UUID)
- Request Propagation: Correlation ID flows through entire request lifecycle
- Structured Logging: All logs include a request_id field
- Audit Logs: Database records include a request_id column
- Grafana Queries: Filter logs by correlation ID for request tracing
Architecture:
Client Request
│
├─> X-Correlation-ID: abc12345
│
v
┌─────────────────────────────────┐
│ CorrelationIDMiddleware │
│ - Extract/generate correlation │
│ - Store in request.state │
│ - Set context variable │
└─────────────┬───────────────────┘
│
├─> HTTP Handler
│ └─> logger.info("...", extra={"request_id": "abc12345"})
│
├─> WebSocket Handler
│ └─> RequestModel(req_id="abc12345")
│
├─> Database Query
│ └─> audit_log(request_id="abc12345")
│
└─> Response
└─> X-Correlation-ID: abc12345
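A hedged sketch of what such a middleware looks like with a context variable; the project's actual app/middlewares/correlation_id.py may differ in details.

```python
import uuid
from contextvars import ContextVar

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def get_correlation_id() -> str:
    return _correlation_id.get()

class CorrelationIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Reuse the caller's ID if provided, otherwise generate an 8-char one
        correlation_id = request.headers.get("X-Correlation-ID") or uuid.uuid4().hex[:8]
        _correlation_id.set(correlation_id)
        request.state.correlation_id = correlation_id
        response = await call_next(request)
        # Echo the ID back so clients and upstream proxies can log it
        response.headers["X-Correlation-ID"] = correlation_id
        return response
```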
Accessing Correlation ID:
from app.middlewares.correlation_id import get_correlation_id
# In any handler or middleware
correlation_id = get_correlation_id()
logger.info(f"Processing request {correlation_id}")
# Automatically included in structured logs
logger.info("User action", extra={
"user_id": "123",
"action": "create_author"
})
# Output: {"request_id": "abc12345", "user_id": "123", "action": "create_author", ...}
Tracing Request Flow in Grafana:
# 1. Find request by correlation ID
{service="shell"} | json | request_id="abc12345"
# 2. Trace complete request lifecycle
{service="shell"} | json | request_id="abc12345"
| line_format "{{.timestamp}} [{{.level}}] {{.logger}}: {{.message}}"
# 3. Filter by specific component
{service="shell"} | json | request_id="abc12345" | logger=~"app.api.*"
# 4. Show error logs only
{service="shell"} | json | request_id="abc12345" | level="ERROR"
# 5. Correlate with audit logs (PostgreSQL dashboard)
SELECT * FROM user_actions WHERE request_id = 'abc12345' ORDER BY timestamp;
Example: Tracing Failed Request
1. Find the error in Grafana using the error-log LogQL query shown earlier.
2. Extract the correlation ID from the error log entry:
{
"timestamp": "2025-01-29T10:15:30Z",
"level": "ERROR",
"request_id": "abc12345",
"message": "Author not found: id=999"
}
3. Trace the complete request flow with the correlation-ID query shown above. Example output:
10:15:29 [INFO] app.middlewares.correlation_id: Request received
10:15:29 [INFO] app.auth: User authenticated: user_id=u123
10:15:29 [INFO] app.api.http.author: GET /authors/999
10:15:29 [DEBUG] app.repositories.author: Query: SELECT * FROM authors WHERE id=999
10:15:30 [ERROR] app.api.http.author: Author not found: id=999
10:15:30 [INFO] app.middlewares.audit: Audit log created: outcome=error
4. Check audit log in PostgreSQL:
SELECT timestamp, username, action_type, resource, outcome, error_message
FROM user_actions
WHERE request_id = 'abc12345';
Cross-Service Tracing:
For microservices, propagate correlation ID via HTTP headers:
# Service A: extract the correlation ID and pass it downstream
import httpx

from app.middlewares.correlation_id import get_correlation_id

async def call_service_b():
    correlation_id = get_correlation_id()
    # Pass to the downstream service via the same header
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "http://service-b/api/resource",
            headers={"X-Correlation-ID": correlation_id},
        )
    return response

# Service B: receives the same correlation ID.
# CorrelationIDMiddleware extracts it automatically,
# so all logs in Service B carry the same request_id.
Correlation ID vs OpenTelemetry:
| Feature | Correlation ID | OpenTelemetry |
|---|---|---|
| Request tracing | ✅ Via logs | ✅ Via spans |
| Cross-service tracking | ✅ Via headers | ✅ Via context propagation |
| Timeline visualization | ❌ Logs only | ✅ Jaeger UI |
| Span-level timing | ❌ | ✅ |
| Implementation complexity | Low | High |
| Dependencies | None | Jaeger, OTLP exporter |
| Best for | Monolithic services | Microservices |
When to Use OpenTelemetry:
Consider OpenTelemetry if:
- Running complex microservices architecture (5+ services)
- Need span-level timing within handlers
- Want visualized trace graphs (Jaeger UI)
- Require standard instrumentation across polyglot services
OpenTelemetry Integration (Optional):
If you need OpenTelemetry for advanced tracing:
# app/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing():
    """Configure OpenTelemetry tracing and return a tracer (call once at startup)."""
    trace.set_tracer_provider(TracerProvider())
    jaeger_exporter = JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    )
    span_processor = BatchSpanProcessor(jaeger_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)
    return trace.get_tracer(__name__)

# Usage in a request handler (after setup_tracing() has run at startup)
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@router.post("/authors")
async def create_author(data: CreateAuthorInput):
    with tracer.start_as_current_span("create_author"):
        # Your code here
        pass
Best Practice: Start with correlation IDs. Add OpenTelemetry only when scaling to complex microservices.
Performance Monitoring¶
Key Performance Indicators (KPIs)¶
| Metric | Target | Critical |
|---|---|---|
| Response Time (p99) | < 500ms | > 1s |
| Error Rate | < 1% | > 5% |
| Availability | > 99.9% | < 99% |
| WebSocket Connections | < 5000 | > 10000 |
| Database Connections | < 80% | > 95% |
| CPU Usage | < 70% | > 90% |
| Memory Usage | < 80% | > 95% |
Performance Queries¶
# Average response time by endpoint
avg(rate(http_request_duration_seconds_sum[5m]))
by (endpoint)
/
avg(rate(http_request_duration_seconds_count[5m]))
by (endpoint)
# Request throughput (req/s)
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Apdex score (Application Performance Index)
# Target: 100ms, Tolerating: 400ms
(
sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
+ sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
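The KPI targets and the queries above can also be checked programmatically via the Prometheus HTTP API, for example from a CI smoke test. A sketch; the URL, expressions, and thresholds are placeholders to adapt.

```python
import httpx

PROMETHEUS_URL = "https://prometheus.example.com"

CHECKS = {
    # expression -> warning threshold
    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))': 0.5,
    'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))': 0.01,
}

def instant_query(expr: str) -> float:
    response = httpx.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    response.raise_for_status()
    result = response.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

for expr, threshold in CHECKS.items():
    value = instant_query(expr)
    status = "OK" if value < threshold else "BREACH"
    print(f"{status}: {expr} = {value:.4f} (target < {threshold})")
```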
Load Testing¶
Use tools like Locust or k6:
# locustfile.py
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def get_authors(self):
        self.client.get("/authors")

    @task(1)
    def create_author(self):
        self.client.post("/authors", json={
            "name": "Test Author",
            "bio": "Test bio"
        })
# Run load test
locust -f locustfile.py --host=https://api.example.com
Best Practices¶
Monitoring Checklist¶
- All services expose /metrics endpoint
- Prometheus scraping all targets
- Grafana dashboards configured
- Alert rules defined
- AlertManager configured with receivers
- Log aggregation working (Loki)
- Structured JSON logging enabled
- Retention policies configured
- Performance baselines established
- On-call rotation defined
Alert Best Practices¶
- Actionable: Every alert should require action
- Clear: Descriptions should explain what's wrong
- Prioritized: Use severity levels (critical, warning, info)
- Tested: Test alerts before deploying
- Documented: Runbooks for each alert
Dashboard Best Practices¶
- Overview first: Start with high-level metrics
- Drill-down: Link to detailed views
- Time range: Include time range selector
- Variables: Use template variables for filtering
- Auto-refresh: Enable for real-time monitoring