Troubleshooting Guide
This guide provides solutions to common issues encountered when deploying and operating the FastAPI HTTP/WebSocket application.
Table of Contents
- Deployment Issues
- Service Connectivity
- Authentication & Authorization
- Performance Issues
- Database Problems
- Redis Issues
- Traefik Routing
- Docker Container Issues
- WebSocket Connection Problems
- Rate Limiting Issues
- Log Analysis
- Emergency Procedures
Deployment Issues
Container Fails to Start
Symptoms:
- Container exits immediately after starting
- docker ps shows container not running
- Exit code non-zero
Diagnosis:
# Check container logs
docker logs hw-server
# Check exit code
docker inspect hw-server --format='{{.State.ExitCode}}'
# Check recent events
docker events --since 10m
Common Causes & Solutions:
- Missing Environment Variables:
Fix: Ensure all required variables are set in environment files.
- Port Already in Use:
Fix: Stop conflicting service or change port mapping.
- Volume Permission Issues:
Fix: See Volume Permission Issues under Docker Container Issues.
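A missing variable is easier to spot with an explicit check than by reading a stack trace in the container logs. The following is a minimal sketch of such a check. The names in `REQUIRED_VARS` are assumptions (only `DATABASE_URL` appears in this guide) and should be replaced with the variables the application actually requires.

```python
import os

# Hypothetical list: only DATABASE_URL is named in this guide; adjust to your deployment.
REQUIRED_VARS = ["DATABASE_URL", "REDIS_URL", "KEYCLOAK_URL"]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars(REQUIRED_VARS)` early in application startup and logging the result would make this failure mode visible in `docker logs hw-server`.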
Database Migration Failures
Symptoms:
- Migration command fails
- "Target database is not up to date" error
- Duplicate column/table errors
Diagnosis:
# Check current migration version
docker exec hw-server alembic current
# View migration history
docker exec hw-server alembic history
# Show head revisions (migrations are pending if this differs from `alembic current`)
docker exec hw-server alembic heads
Solutions:
- Database Out of Sync:
# Check which migrations are applied
docker exec hw-db psql -U prod_user -d fastapi_prod \
  -c "SELECT * FROM alembic_version;"
# Stamp database at current code version
docker exec hw-server alembic stamp head
# Or downgrade and re-apply
docker exec hw-server alembic downgrade -1
docker exec hw-server alembic upgrade head
- Migration Conflicts:
- Failed Partial Migration:
SSL Certificate Issues
Symptoms:
- "Certificate verify failed" errors
- HTTPS connections rejected
- Let's Encrypt challenge fails
Diagnosis:
# Check Traefik logs
docker logs hw-traefik | grep -i certificate
# Check certificate status
docker exec hw-traefik ls -la /letsencrypt/
# Test certificate
curl -vI https://api.example.com 2>&1 | grep -A 10 "SSL certificate"
Solutions:
- Let's Encrypt Rate Limiting:
  - Wait for the rate limit to reset (weekly limit: 50 certificates per registered domain)
  - Use staging environment for testing:
- DNS Not Propagated:
- Port 80 Not Accessible:
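To confirm which certificate is actually being served and how long it remains valid, the TLS handshake can be scripted with the standard library. This is a sketch only. The function names are illustrative, and the host passed in (for example `api.example.com` from the diagnosis step) is whatever domain the deployment uses.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining until a certificate's notAfter value, e.g. 'Jan  5 09:34:43 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port over TLS and return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Because the default context verifies the chain and host name, `fetch_not_after` raises `ssl.SSLCertVerificationError` for an invalid certificate, which is itself a useful signal when chasing "Certificate verify failed" errors.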
Service Connectivity
Cannot Connect to Application
Symptoms:
- "Connection refused" errors
- "No route to host"
- Timeout errors
Diagnosis:
# Check if service is running
docker ps | grep hw-server
# Check if port is listening
docker exec hw-server netstat -tulpn | grep 8000
# Check health status
curl http://localhost:8000/health
# Check Traefik routing
curl http://localhost:8080/api/http/routers
Solutions:
- Service Not Running:
- Network Issues:
- Firewall Blocking:
Inter-Service Communication Fails
Symptoms:
- Application cannot reach database
- Redis connection errors
- Keycloak unreachable
Diagnosis:
# Check all services are on same network
docker network inspect hw-network | jq '.[0].Containers'
# Test DNS resolution
docker exec hw-server nslookup hw-db
docker exec hw-server nslookup hw-redis
# Test port connectivity
docker exec hw-server nc -zv hw-db 5432
docker exec hw-server nc -zv hw-redis 6379
docker exec hw-server nc -zv hw-keycloak 8080
Solutions:
- Services Not on Same Network:
- Wrong Service Names:
- Restart All Services:
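The port checks from the diagnosis step can also be run from Python, which helps when `nc` is not installed in the application image. The sketch below assumes the host names and ports used in the commands above. The function names are illustrative.

```python
import socket

# Host names and ports taken from the diagnosis commands above.
SERVICES = {"hw-db": 5432, "hw-redis": 6379, "hw-keycloak": 8080}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepted a connection."""
    return {name: can_connect(name, port) for name, port in services.items()}
```

A `False` for every service usually points at the network or DNS, while a single `False` points at that one service.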
Authentication & Authorization
Keycloak Authentication Fails
Symptoms:
- "Invalid token" errors
- "Unauthorized" (401) responses
- "Token signature verification failed"
Diagnosis:
# Check Keycloak is running
docker logs hw-keycloak | tail -50
# Test Keycloak health
curl http://localhost:8080/health
# Verify token endpoint
curl http://localhost:8080/realms/production/.well-known/openid-configuration
# Check application logs for auth errors
docker logs hw-server | grep -i "auth\|token\|keycloak"
Solutions:
- Token Expired:
# Check token expiration settings in Keycloak
# Admin Console → Realm Settings → Tokens
# Access Token Lifespan: 5 minutes (default)
# Refresh Token Lifespan: 30 minutes (default)
# Get new token
curl -X POST http://localhost:8080/realms/production/protocol/openid-connect/token \
  -d "client_id=fastapi-app" \
  -d "client_secret=YOUR_SECRET" \
  -d "grant_type=password" \
  -d "username=user@example.com" \
  -d "password=password"
- Wrong Keycloak Configuration:
- Client Secret Mismatch:
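For the Token Expired case, it can help to look at the `exp` claim of the token the client is actually sending. The sketch below decodes the payload without verifying the signature, so it is suitable for debugging only and never for authorization decisions. The function names are illustrative.

```python
import base64
import json
import time


def decode_jwt_payload(token):
    """Decode a JWT's payload segment WITHOUT verifying the signature (debugging only)."""
    try:
        payload_b64 = token.split(".")[1]
    except IndexError:
        raise ValueError("not a JWT: expected dot-separated segments") from None
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_until_expiry(token, now=None):
    """Seconds until the token's `exp` claim; a negative value means it has expired."""
    current = time.time() if now is None else now
    return decode_jwt_payload(token)["exp"] - current
```

A negative result confirms expiry. A positive result alongside a 401 suggests looking at the realm configuration or the client secret instead.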
Permission Denied Errors
Symptoms:
- "Permission denied" (403) responses
- "Insufficient permissions" errors
- User cannot access expected endpoints
Diagnosis:
# Check user roles in Keycloak
# Admin Console → Users → <user> → Role Mappings
# Check handler code for required roles
# WebSocket: @pkg_router.register(PkgID.*, roles=["role-name"])
# HTTP: dependencies=[Depends(require_roles("role-name"))]
# Check application logs
docker logs hw-server | grep -i "permission\|rbac"
Solutions:
- User Missing Required Role:
- Check Handler Role Requirements:
# Example WebSocket handler
@pkg_router.register(
    PkgID.CREATE_AUTHOR,
    roles=["create-author", "admin"]  # Requires BOTH roles
)
# Example HTTP endpoint
@router.post(
    "/authors",
    dependencies=[Depends(require_roles("create-author", "admin"))]
)
# User must have ALL specified roles to access the endpoint
- Token Not Decoded Properly:
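Since a handler requires all of its listed roles, a 403 usually means the token lacks at least one of them. The sketch below illustrates that rule and reports which roles are missing. It is not the application's actual check, and `realm_roles` assumes roles live under Keycloak's `realm_access.roles` claim, which may not match a setup that uses client roles.

```python
def has_all_roles(user_roles, required_roles):
    """True only if every required role is present (the ALL-roles rule described above)."""
    return set(required_roles).issubset(user_roles)


def missing_roles(user_roles, required_roles):
    """Required roles the user lacks, in the order they were required."""
    held = set(user_roles)
    return [role for role in required_roles if role not in held]


def realm_roles(claims):
    """Realm roles from decoded token claims (assumes the `realm_access.roles` layout)."""
    return claims.get("realm_access", {}).get("roles", [])
```

For a user holding only `create-author`, `missing_roles(["create-author"], ["create-author", "admin"])` returns `["admin"]`, which is the role to add under Role Mappings.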
Performance Issues
Slow Response Times
Symptoms:
- API requests take > 1 second
- WebSocket messages delayed
- Timeout errors
Diagnosis:
# Check application metrics
curl http://localhost:8000/metrics | grep http_request_duration
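# Check database query performance
# (on PostgreSQL 13+ the columns are total_exec_time and mean_exec_time)
docker exec hw-db psql -U prod_user -d fastapi_prod \
  -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;"
# Check CPU/memory usage
docker stats hw-server hw-db hw-redis
# Check network latency (-c limits the number of pings so the command terminates)
docker exec hw-server ping -c 4 hw-db
# time runs on the host, so the result includes docker exec overhead
time docker exec hw-server nc -zv hw-db 5432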
Solutions:
- Database Query Optimization:
-- Enable pg_stat_statements (the module must also be listed in shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find slow queries (on PostgreSQL 13+, use total_exec_time and mean_exec_time)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Add indexes
CREATE INDEX idx_author_name ON author(name);
CREATE INDEX idx_book_author_id ON book(author_id);
- Increase Connection Pool:
- Scale Horizontally:
- Enable Caching:
# Add Redis caching for expensive queries
import json
from app.storage.redis import RRedis

async def get_popular_authors():
    # Assumes `redis` is an already-initialized async Redis client (e.g. obtained via RRedis)
    cache_key = "popular_authors"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)
    authors = await fetch_from_db()
    await redis.setex(cache_key, 300, json.dumps(authors))  # cache for 300 seconds
    return authors
High Memory Usage
Symptoms:
- OOM (Out of Memory) errors
- Container restarts
- Memory usage > 90%
Diagnosis:
# Check memory usage
docker stats hw-server --no-stream
# Check container memory limit
docker inspect hw-server | jq '.[0].HostConfig.Memory'
# Check memory as seen from inside the container (requires psutil in the image)
docker exec hw-server python -c "import psutil; print(psutil.virtual_memory())"
Solutions:
- Increase Memory Limit:
- Check for Memory Leaks:
- Reduce Workers:
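For the memory leak check, Python's built-in `tracemalloc` module can show which source lines account for growth between two points in time. The following is a minimal sketch, not the project's tooling. `top_allocation_growth` is a hypothetical helper that wraps a callable, whereas a long-running service would more likely take snapshots at intervals.

```python
import tracemalloc


def top_allocation_growth(workload, limit=5):
    """Run `workload()` and return the source lines whose allocated memory changed most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries sorted by absolute size change, largest first
    return after.compare_to(before, "lineno")[:limit]
```

Each returned entry exposes `size_diff` and `traceback`, so printing the entries shows the file and line responsible for the growth. Only memory that is still referenced after `workload()` returns will appear, which is the pattern a leak produces.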
Database Problems
Cannot Connect to Database
Symptoms:
- "could not connect to server" errors
- "FATAL: password authentication failed"
- "database does not exist"
Diagnosis:
# Check PostgreSQL is running
docker ps | grep hw-db
# Check PostgreSQL logs
docker logs hw-db | tail -50
# Test connection from application container
docker exec hw-server psql -h hw-db -U prod_user -d fastapi_prod -c "SELECT 1;"
# Check connection string
docker exec hw-server printenv DATABASE_URL
Solutions:
- Database Not Ready:
- Wrong Credentials:
# Verify credentials match
# .env.production: DATABASE_URL=postgresql://prod_user:PASSWORD@hw-db:5432/fastapi_prod
# docker/.pg_env: POSTGRES_USER=prod_user, POSTGRES_PASSWORD=PASSWORD
# Reset password if needed
docker exec hw-db psql -U postgres \
  -c "ALTER USER prod_user WITH PASSWORD 'new_password';"
- Database Does Not Exist:
Database Locks/Deadlocks
Symptoms:
- "deadlock detected" errors
- Queries hanging indefinitely
- "could not obtain lock" errors
Diagnosis:
-- Check active locks
SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted;
-- Check blocking queries
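SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;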
Solutions:
- Kill Blocking Query:
- Prevent Long Transactions:
- Set Statement Timeout:
Redis Issues
Cannot Connect to Redis
Symptoms:
- "Connection refused" errors
- "NOAUTH Authentication required"
- Rate limiting not working
Diagnosis:
# Check Redis is running
docker ps | grep hw-redis
# Test connection
docker exec hw-redis redis-cli ping
# Test from application container
docker exec hw-server redis-cli -h hw-redis ping
# Check Redis logs
docker logs hw-redis | tail -50
Solutions:
- Redis Not Running:
- Authentication Required:
- Wrong Redis DB:
Redis Memory Issues
Symptoms:
- "OOM command not allowed" errors
- Redis crashes
- High memory usage
Diagnosis:
# Check Redis memory usage
docker exec hw-redis redis-cli INFO memory
# Check max memory setting
docker exec hw-redis redis-cli CONFIG GET maxmemory
Solutions:
- Increase Max Memory:
- Configure Eviction Policy:
- Clear Unused Keys:
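The output of `redis-cli INFO memory` is a list of `key:value` lines, so how close Redis is to its limit can be computed rather than eyeballed. The sketch below parses that text. The function names are illustrative, and it relies only on the `used_memory` and `maxmemory` fields that Redis reports in that section.

```python
def parse_redis_info(text):
    """Parse `redis-cli INFO` output ('key:value' lines, '#' section headers) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def memory_usage_ratio(info):
    """used_memory / maxmemory, or None when maxmemory is 0 (no limit configured)."""
    max_memory = int(info.get("maxmemory", 0))
    if max_memory == 0:
        return None
    return int(info["used_memory"]) / max_memory
```

A ratio near 1.0 combined with `maxmemory_policy:noeviction` is the usual setup behind "OOM command not allowed" errors.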
Traefik Routing
404 Not Found Errors
Symptoms:
- Traefik returns 404 for valid endpoints
- "Service not found" errors
Diagnosis:
# Check Traefik dashboard
curl http://localhost:8080/api/http/routers | jq
# Check container labels
docker inspect hw-server | jq '.[0].Config.Labels'
# Check Traefik logs
docker logs hw-traefik | grep -i error
Solutions:
- Missing Labels:
- Restart Traefik:
- Check Service Discovery:
SSL/TLS Redirect Loop
Symptoms:
- Browser shows "too many redirects"
- Infinite redirect between HTTP and HTTPS
Solutions:
# docker-compose.yml - Ensure Traefik knows it's behind a proxy
services:
  hw-server:
    labels:
      - "traefik.http.middlewares.secure-headers.headers.sslproxyheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.fastapi.middlewares=secure-headers"
Docker Container Issues
Container Keeps Restarting
Symptoms:
- Container in restart loop
- docker ps shows "Restarting" status
Diagnosis:
# Check restart count
docker inspect hw-server | jq '.[0].RestartCount'
# Check last exit code
docker inspect hw-server | jq '.[0].State.ExitCode'
# View all logs (before restart)
docker logs hw-server --timestamps
Solutions:
- Application Crash:
- Health Check Failing:
- Resource Limits:
Volume Permission Issues
Symptoms:
- "Permission denied" errors when writing files
- Cannot create directories
Solutions:
# Fix volume ownership
docker exec --user root hw-server chown -R appuser:appuser /app
# Or set ownership on host
sudo chown -R 1000:1000 /path/to/volume
# Ensure user ID matches
docker exec hw-server id
# uid=1000(appuser) gid=1000(appuser)
WebSocket Connection Problems
WebSocket Connection Rejected
Symptoms:
- "Connection closed: 1006"
- "Connection closed: 1008 (policy violation)"
- Cannot establish WebSocket connection
Diagnosis:
# Check application logs
docker logs hw-server | grep -i websocket
# Test WebSocket connection (quote the URL so the shell does not interpret "?")
wscat -c "ws://localhost:8000/web?access_token=TOKEN"
# Check Traefik WebSocket configuration
curl http://localhost:8080/api/http/routers | jq '.[] | select(.name=="fastapi")'
Solutions:
- Missing Access Token:
- Connection Limit Reached:
- Traefik Not Forwarding WebSocket:
WebSocket Messages Not Received
Symptoms:
- Messages sent but no response
- Connection stays open but silent
Diagnosis:
# Check application logs for message processing
docker logs hw-server | grep "pkg_id\|req_id"
# Check rate limiting
docker logs hw-server | grep "rate limit"
# Test with wscat
wscat -c "ws://localhost:8000/web?access_token=TOKEN"
> {"pkg_id": 1, "req_id": "test-123", "data": {}}
Solutions:
- Invalid Message Format:
- Handler Not Registered:
- Rate Limit Hit:
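For the invalid message format case, a quick client-side check before sending can rule out malformed payloads. The sketch below infers the expected fields and types only from the `wscat` example in the diagnosis step (`pkg_id`, `req_id`, `data`). The application's real schema may be stricter or differ, so treat the field list as an assumption.

```python
import json


def validate_ws_message(raw):
    """Return a list of problems found in a raw WebSocket message; empty means it looks valid."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(message, dict):
        return ["message must be a JSON object"]
    problems = []
    # Expected types are inferred from the wscat example, not from the application schema.
    expected = {"pkg_id": int, "req_id": str, "data": dict}
    for field, expected_type in expected.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type) or isinstance(message[field], bool):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems
```

An empty list means the message at least has the shape shown in this guide. If the server still stays silent, the handler registration and rate limit checks are the next places to look.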
Rate Limiting Issues
False Positive Rate Limits
Symptoms:
- Users getting 429 errors incorrectly
- Rate limit triggers too quickly
Diagnosis:
# Check rate limit settings
docker exec hw-server printenv | grep RATE_LIMIT
# Check Redis rate limit keys (--scan iterates incrementally instead of blocking Redis like KEYS)
docker exec hw-redis redis-cli --scan --pattern "rate_limit:*"
# Check specific user's rate limit
docker exec hw-redis redis-cli GET "rate_limit:user:user123"
Solutions:
- Increase Rate Limits:
- Clear Rate Limit Keys:
- Exclude Specific Endpoints:
Log Analysis
Finding Errors in Logs
Common LogQL Queries:
# Recent errors
{service="shell"} | json | level="ERROR"
# Authentication failures
{service="shell"} | json | logger=~"app.auth.*" |~ "(?i)(failed|invalid|denied)"
# Database errors
{service="shell"} | json |~ "(?i)(database|postgres|sqlalchemy)" | level="ERROR"
# Slow queries (requires duration_ms field)
{service="shell"} | json | duration_ms > 1000
# WebSocket errors
{service="shell"} | json | logger=~"app.api.ws.*" | level="ERROR"
# Rate limit violations
{service="shell"} | json |~ "(?i)(rate limit|429|too many requests)"
# Specific user activity
{service="shell"} | json | user_id="user123"
# Specific endpoint
{service="shell"} | json | endpoint=~"/api/authors.*"
Analyzing Performance Issues
# HTTP request duration
{service="shell"} | json | logfmt | line_format "{{.method}} {{.endpoint}} {{.duration_ms}}ms"
# Database query performance
{service="shell"} | json |~ "(?i)query" | line_format "{{.message}} {{.duration_ms}}ms"
# Top error messages (counted over the last hour; adjust the range as needed)
topk(10, sum by (message) (count_over_time({service="shell"} | json | level="ERROR" [1h])))
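When Grafana or Loki is unavailable, the same "top error messages" view can be approximated directly from container logs. The sketch below assumes each log line is a JSON object with `level` and `message` fields, as the queries above imply. The function name is illustrative.

```python
import json
from collections import Counter


def top_error_messages(lines, limit=10):
    """Count `message` values of ERROR-level JSON log lines; non-JSON lines are skipped."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("level") == "ERROR":
            counts[record.get("message", "<no message>")] += 1
    return counts.most_common(limit)
```

Passing `sys.stdin` as `lines` allows piping, for example from `docker logs hw-server 2>&1`.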
Emergency Procedures
Application Down
Immediate Actions:
- Check service health:
- Restart failed services:
- Check recent logs:
- If restart fails, rollback:
Database Corruption
Immediate Actions:
- Stop application:
- Check database integrity:
- Restore from backup:
- Restart application:
Security Incident
Immediate Actions:
- Isolate affected services:
- Review audit logs:
- Block malicious IPs (if applicable):
- Rotate credentials:
Complete System Failure
Recovery Steps:
- Document current state:
- Stop all services:
- Restore from backups:
- Start services gradually:
- Verify system health:
Getting Help
If issues persist after trying these solutions:
- Check application logs in Grafana:
  - http://localhost:3000/d/application-logs
  - Filter by service, level, endpoint
- Review metrics in Prometheus:
  - http://localhost:9090
  - Check for anomalies
- Consult documentation:
  - Monitoring Guide
  - Backup/Recovery Guide
- Contact support:
  - GitHub Issues: https://github.com/acikabubo/fastapi-http-websocket/issues
  - Internal documentation: Confluence/Wiki
  - On-call rotation: PagerDuty