# EVE Distributed Tracing - Docker Compose Setup

A complete observability stack for the EVE distributed tracing system.
## Quick Start

```bash
# Start the entire stack
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the stack
docker-compose down

# Stop and remove all data
docker-compose down -v
```
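Once the stack is up, a quick way to verify that every container started cleanly:

```bash
# All services should show "Up" (or "healthy" where healthchecks are defined)
docker-compose ps
```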
## Services

The Docker Compose stack includes:

### Data Storage
- **PostgreSQL** (port 5432) - TimescaleDB for trace metadata
  - Database: `eve_traces`
  - User: `eve_user`
  - Password: `eve_password`
- **MinIO** (ports 9000, 9001) - S3-compatible storage for trace payloads
  - Console: http://localhost:9001
  - User: `minioadmin`
  - Password: `minioadmin`
  - Bucket: `eve-traces`
### Monitoring & Observability

- **Prometheus** (port 9090) - Metrics collection
  - UI: http://localhost:9090
  - Scrapes metrics from example-service every 15s
- **AlertManager** (port 9093) - Alert routing
  - UI: http://localhost:9093
  - 43 pre-configured alert rules
- **Grafana** (port 3000) - Dashboards and visualization
  - UI: http://localhost:3000
  - User: `admin`
  - Password: `admin`
  - 3 pre-loaded dashboards:
    - Operational (SRE/DevOps)
    - Business (Product metrics)
    - Compliance (GDPR/Privacy)
- **Loki** (port 3100) - Log aggregation
  - 30-day retention
  - Automatic compaction
- **Promtail** - Log shipping to Loki
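Most of these components expose a standard readiness endpoint, which makes for a quick smoke test after startup (these are the stock endpoints of each project, not EVE-specific):

```bash
# Readiness / health checks for the monitoring stack
curl http://localhost:9090/-/ready            # Prometheus
curl http://localhost:9093/-/ready            # AlertManager
curl http://localhost:3000/api/health         # Grafana
curl http://localhost:3100/ready              # Loki
curl http://localhost:9000/minio/health/live  # MinIO
```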
### Example Service

- **example-service** (ports 8080, 9091)
  - REST API: http://localhost:8080
  - Metrics: http://localhost:9091/metrics
  - Demonstrates EVE tracing features
## Access URLs
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| AlertManager | http://localhost:9093 | - |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin |
| Example Service | http://localhost:8080 | - |
| Example Service metrics | http://localhost:9091/metrics | - |
## Example Service API

The example service provides several endpoints to demonstrate tracing:

```bash
# Home / API documentation
curl http://localhost:8080/

# Create a normal workflow (fast, sampled at the 10% base rate)
curl -X POST http://localhost:8080/v1/api/workflow/create

# Create a slow workflow (>5s, always sampled)
curl -X POST http://localhost:8080/v1/api/workflow/slow

# Create a failing workflow (always sampled)
curl -X POST http://localhost:8080/v1/api/workflow/error

# Health check
curl http://localhost:8080/health

# View Prometheus metrics
curl http://localhost:9091/metrics
```
## Viewing Traces

### In Grafana

1. Open http://localhost:3000
2. Log in with `admin` / `admin`
3. Navigate to **Dashboards → EVE** folder
4. Open one of the 3 dashboards:
   - **Operational** - Request rate, errors, latency, queue metrics
   - **Business** - Workflow execution patterns, success rates
   - **Compliance** - GDPR metrics, PII detection, audit access
### In PostgreSQL

Connect to the database container:

```bash
docker exec -it eve-postgres psql -U eve_user -d eve_traces
```

Then query the trace tables and helper functions:

```sql
-- View the most recent traces
SELECT correlation_id, operation_id, service_id, action_type, action_status, duration_ms
FROM action_executions
ORDER BY started_at DESC
LIMIT 10;

-- Get a workflow trace
SELECT * FROM get_workflow_trace('wf-demo-001');

-- Get workflow statistics
SELECT * FROM get_workflow_stats('wf-demo-001');
```
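For one-off queries from the host, you can skip the interactive session and pass the SQL directly (table and credentials as above):

```bash
# Count executions by status without opening a psql shell
docker exec eve-postgres psql -U eve_user -d eve_traces \
  -c "SELECT action_status, count(*) FROM action_executions GROUP BY action_status;"
```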
### In Prometheus

1. Open http://localhost:9090
2. Go to the **Graph** tab
3. Try the example queries:

```promql
# Request rate
rate(eve_tracing_actions_total[5m])

# Error rate
rate(eve_tracing_action_errors_total[5m])

# p95 latency
histogram_quantile(0.95, rate(eve_tracing_action_duration_seconds_bucket[5m]))

# Current queue size
eve_tracing_exporter_queue_size

# Sampling decisions
rate(eve_tracing_sampling_decisions_total[5m])
```
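The same expressions work against Prometheus's HTTP query API, which is useful for scripting checks (metric name taken from the list above):

```bash
# Instant query for the current exporter queue size
curl -s 'http://localhost:9090/api/v1/query?query=eve_tracing_exporter_queue_size'
```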
### In Loki (Logs)

1. Open http://localhost:3000
2. Go to **Explore**
3. Select the Loki datasource
4. Try the example queries:

```logql
# All logs from example-service
{service="example-service"}

# Error logs only
{service="example-service", level="error"}

# Logs for a specific correlation_id
{service="example-service"} |= "wf-demo-001"

# Action execution logs
{service="example-service", event_type="action_execution"}
```
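The same LogQL queries can be run from a terminal with Grafana's `logcli` tool, if you have it installed locally (it is not part of this Compose stack):

```bash
# Tail example-service logs straight from Loki
logcli query --addr=http://localhost:3100 --tail '{service="example-service"}'
```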
## Generate Test Traffic

Create a script to generate continuous traffic:

```bash
#!/bin/bash
# generate-traffic.sh

while true; do
  # Normal requests (90% of traffic)
  for i in {1..9}; do
    curl -X POST http://localhost:8080/v1/api/workflow/create &
  done

  # Slow request (always sampled)
  curl -X POST http://localhost:8080/v1/api/workflow/slow &

  # Occasional error (always sampled)
  if [ $((RANDOM % 10)) -eq 0 ]; then
    curl -X POST http://localhost:8080/v1/api/workflow/error &
  fi

  sleep 1
done
```
Run it:

```bash
chmod +x generate-traffic.sh
./generate-traffic.sh
```
## Monitoring Alerts

AlertManager is configured with 43 alert rules across 3 categories:
### Operational Alerts (12 rules)
- Queue overflow (CRITICAL)
- Traces dropped (CRITICAL)
- High error rate (CRITICAL)
- Database failures (CRITICAL)
- S3 failures (CRITICAL)
- Queue high (WARNING)
- High latency (WARNING)
### Business Alerts (15 rules)
- Low success rate <90% (CRITICAL)
- SLO violation p95 >120s (CRITICAL)
- Degraded success <95% (WARNING)
- Slow workflows (WARNING)
- High throughput (INFO)
### Compliance Alerts (16 rules)
- Data residency violations (CRITICAL)
- Unredacted PII (CRITICAL)
- Erasure failures (CRITICAL)
- Low redaction rate (WARNING)
- Slow erasure (WARNING)
- Unauthorized access (WARNING)
View active alerts:

```bash
# AlertManager UI
open http://localhost:9093

# Or via API
curl http://localhost:9093/api/v2/alerts
```
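If you have `amtool` (distributed with AlertManager) installed, silences can be managed from the shell as well; the alert name below is illustrative rather than one of the 43 configured rules:

```bash
# Silence a hypothetical alert for two hours during maintenance
amtool --alertmanager.url=http://localhost:9093 \
  silence add alertname="HighErrorRate" --duration=2h --comment="planned maintenance"
```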
## Architecture

```
┌─────────────────┐
│ Example Service │──► Traces ──► PostgreSQL (metadata) ──┐
│   (Port 8080)   │                                       │
└────────┬────────┘                                       │
         │                                                │
         ├──► Metrics ──► Prometheus ─────────────────────┤
         │                    │                           │
         │                    ▼                           │
         │               AlertManager                     │
         │                                                │
         └──► Logs ──► Promtail ──► Loki ─────────────────┤
                                                          │
                                                          ▼
                                                       Grafana
                                                   (Dashboards)
```
## Configuration

### Tracing Features Enabled

- ✅ **Async Export**: 10,000 queue size, 100 batch size, 4 workers
- ✅ **Sampling**: 10% base rate; errors and slow traces always sampled
- ✅ **Metrics**: 23 Prometheus metrics
- ✅ **Logging**: JSON structured logs with correlation_id
- ✅ **Dashboards**: 3 Grafana dashboards with 44 panels
- ✅ **Alerts**: 43 AlertManager rules
### Sampling Configuration

```yaml
Base Rate: 10%               # Normal traffic sampled at 10%
Always Sample Errors: true   # Keep 100% of failed traces
Always Sample Slow: true     # Keep traces >5000ms
Slow Threshold: 5000ms       # Definition of "slow"
Deterministic: true          # Consistent sampling per correlation_id
```
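Deterministic sampling means the keep/drop decision is a pure function of the correlation_id, so every service in a call chain reaches the same verdict for the same trace. A minimal shell sketch of the idea (the hashing scheme here is illustrative, not EVE's actual implementation):

```bash
# Hash the correlation_id into 0-99 and keep the trace
# when it falls inside the 10% base-rate window.
correlation_id="wf-demo-001"
base_rate=10  # percent
bucket=$(( $(printf '%s' "$correlation_id" | cksum | cut -d' ' -f1) % 100 ))
if [ "$bucket" -lt "$base_rate" ]; then
  echo "sampled: $correlation_id"
else
  echo "dropped: $correlation_id"
fi
```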
### Storage Configuration

```yaml
PostgreSQL:
  Retention: Unlimited (configure via archival)
  Partitioning: Daily chunks (TimescaleDB)
  Indexes: 7 indexes for fast queries

MinIO (S3):
  Bucket: eve-traces
  Structure: eve-traces/{correlation_id}/{operation_id}/
  Lifecycle: Not configured (add via archival manager)

Loki:
  Retention: 30 days
  Compaction: Every 10 minutes
```
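Given the bucket layout above, any S3 client can browse stored payloads. For example, with the AWS CLI pointed at MinIO (credentials from the Services section):

```bash
# List trace payloads via the S3-compatible API
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
  aws --endpoint-url http://localhost:9000 s3 ls s3://eve-traces/ --recursive
```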
## Troubleshooting

### Services not starting

```bash
# Check service status
docker-compose ps

# View logs for a specific service
docker-compose logs postgres
docker-compose logs example-service

# Restart a service
docker-compose restart example-service
```
### Database connection errors

```bash
# Check that PostgreSQL is ready
docker exec eve-postgres pg_isready -U eve_user

# Connect manually
docker exec -it eve-postgres psql -U eve_user -d eve_traces

# View initialization logs
docker-compose logs postgres | grep "initialized successfully"
```
### MinIO not accessible

```bash
# Check MinIO health
curl http://localhost:9000/minio/health/live

# View MinIO logs
docker-compose logs minio

# Recreate the bucket
docker-compose restart minio-init
```
### No metrics in Prometheus

```bash
# Check Prometheus targets
open http://localhost:9090/targets

# Check the example service metrics endpoint
curl http://localhost:9091/metrics | grep eve_tracing

# Reload the Prometheus config (requires --web.enable-lifecycle on the server)
curl -X POST http://localhost:9090/-/reload
```
### No data in Grafana

```bash
# Check datasource configuration
open http://localhost:3000/datasources

# Test the Prometheus connection (use localhost from the host; the
# "prometheus" hostname only resolves inside the Compose network)
curl 'http://localhost:9090/api/v1/query?query=up'

# Check that the dashboards were loaded
ls -la grafana/dashboards/
```
## Performance Tuning

### For Production

Update `docker-compose.yml` with:

```yaml
example-service:
  environment:
    # Increase async export capacity
    ASYNC_EXPORT_QUEUE_SIZE: "50000"
    ASYNC_EXPORT_BATCH_SIZE: "500"
    ASYNC_EXPORT_WORKERS: "8"
    # Lower the sampling rate to reduce overhead
    SAMPLING_BASE_RATE: "0.01"  # 1% instead of 10%
```
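After editing, recreate just that one service so the new environment takes effect:

```bash
docker-compose up -d example-service
```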
### Database Optimization

```sql
-- Add to init.sql for production

-- Enable parallel query execution
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;

-- Increase shared buffers (about 25% of RAM)
ALTER SYSTEM SET shared_buffers = '256MB';

-- Reload the config (note: shared_buffers only takes effect after a
-- server restart; the reload covers the parallel worker setting)
SELECT pg_reload_conf();
```
## Cleanup

```bash
# Stop all services
docker-compose down

# Remove all data volumes
docker-compose down -v

# Remove all images
docker-compose down --rmi all

# Complete cleanup
docker-compose down -v --rmi all --remove-orphans
```
## Next Steps

- **Add More Services**: Add containerservice, workflowstorageservice, etc.
- **Enable Archival**: Configure S3 Glacier lifecycle policies
- **Add Dependency Mapping**: Build the service dependency graph
- **Configure Alerts**: Update AlertManager with email/Slack webhooks
- **Production Deployment**: Use Kubernetes or Docker Swarm
## Related Documentation

- **Main Tracing Documentation**: /home/opunix/eve/tracing/README.md
- **Grafana Dashboards**: /home/opunix/eve/grafana/README.md
- **AlertManager Setup**: /home/opunix/eve/grafana/alertmanager/README.md
- **Archival Guide**: /home/opunix/eve/tracing/ARCHIVAL.md
- **Dependency Mapping**: /home/opunix/eve/tracing/DEPENDENCIES.md
- **Sampling Guide**: /home/opunix/eve/tracing/SAMPLING.md