Metrics & Monitoring
Metrics collection and visualization for Semantic Router using Prometheus and Grafana.
1. Metrics & Endpoints
| Component | Endpoint | Notes |
|---|---|---|
| Router metrics | :9190/metrics | Prometheus format (flag: --metrics-port) |
| Router health | :8080/health | HTTP readiness/liveness |
| Envoy metrics (optional) | :19000/stats/prometheus | If Envoy is enabled |
Configuration location: tools/observability/
Dashboard: tools/observability/llm-router-dashboard.json
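The router metrics endpoint serves the Prometheus text exposition format, so a response can be inspected without a full Prometheus stack. The following is a minimal sketch of a parser for that format. The sample payload and its `model` label are hypothetical, chosen only for illustration; real output from :9190/metrics will differ.

```python
import re
from typing import Dict, Tuple

# Matches sample lines such as: name{label="v",other="w"} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Labels = Tuple[Tuple[str, str], ...]


def parse_metrics(text: str) -> Dict[Tuple[str, Labels], float]:
    """Parse Prometheus text exposition into {(name, sorted labels): value}."""
    samples: Dict[Tuple[str, Labels], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


# Hypothetical payload for illustration only.
SAMPLE = """\
# HELP llm_model_completion_tokens_total Tokens per model
# TYPE llm_model_completion_tokens_total counter
llm_model_completion_tokens_total{model="model-a"} 1200
llm_model_completion_tokens_total{model="model-b"} 300
"""

metrics = parse_metrics(SAMPLE)
for (name, labels), value in sorted(metrics.items()):
    print(name, dict(labels), value)
```

To use it against a running router, fetch the body of http://localhost:9190/metrics with any HTTP client and pass the decoded text to `parse_metrics`.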
2. Local Mode (Router on Host)
Run the router natively on the host and the observability stack in Docker.
Quick Start
# Start router
make run-router
# Start observability
make o11y-local
Access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
Verify targets:
# Check that Prometheus scrapes localhost:9190
# (open is the macOS command; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:9090/targets
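The same check can be scripted against the Prometheus HTTP API, which exposes target state at /api/v1/targets. This is a minimal sketch: `fetch_targets` needs a running Prometheus, and the sample payload below is hand-written to mimic the shape of a real response rather than captured from one.

```python
import json
import urllib.request


def fetch_targets(prometheus_url="http://localhost:9090"):
    """Fetch target state from the Prometheus HTTP API (requires a running Prometheus)."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (scrapeUrl, lastError) for every active target whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Hand-written sample shaped like an /api/v1/targets response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "http://localhost:9190/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

down = unhealthy_targets(sample)
print(down)
```

With the stack running, `unhealthy_targets(fetch_targets())` returns an empty list when every target is up.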
Stop:
make stop-observability
Configuration
All configs in tools/observability/:
- prometheus.yaml: Scrapes the target from the ROUTER_TARGET env var (default: localhost:9190)
- grafana-datasource.yaml: Points to localhost:9090
- grafana-dashboard.yaml: Dashboard provisioning
- llm-router-dashboard.json: Dashboard definition
Troubleshooting
| Issue | Fix |
|---|---|
| Target DOWN | Start router: make run-router |
| No metrics | Generate traffic, check :9190/metrics |
| Port conflict | Change port or stop conflicting service |
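For the port-conflict row, a quick way to see which of the ports used in this guide are already taken is to attempt a TCP connection to each. The sketch below covers the ports named above (9190 for router metrics, 9090 for Prometheus, 3000 for Grafana). It only reports whether something is listening; it does not identify the process.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def report(ports=(9190, 9090, 3000)):
    """Map each port to whether a listener was found on localhost."""
    return {port: port_in_use(port) for port in ports}


if __name__ == "__main__":
    for port, busy in report().items():
        print(f"{port}: {'in use' if busy else 'free'}")
```

Run it before starting the stack: any port reported as in use needs to be freed or remapped.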
3. Docker Compose Mode
All services run in Docker containers.
Quick Start
# Start full stack (includes observability)
docker compose -f deploy/docker-compose/docker-compose.yml up --build
# Or with testing profile
docker compose -f deploy/docker-compose/docker-compose.yml --profile testing up --build
Access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
Expected targets:
- semantic-router:9190
- envoy-proxy:19000 (optional)
Configuration
Same configs as local mode (tools/observability/), but:
- ROUTER_TARGET=semantic-router:9190
- PROMETHEUS_URL=prometheus:9090
- Uses semantic-network bridge network
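The difference between the two modes comes down to which value ROUTER_TARGET resolves to. Prometheus does not expand environment variables in its scrape configuration on its own, so something must render the value into the config before Prometheus starts. How the repository does this is not shown here. The sketch below assumes a simple template render, and the template text is hypothetical rather than a copy of the real prometheus.yaml.

```python
import os
from string import Template

# Hypothetical template; the real file in tools/observability/ may look different.
SCRAPE_TEMPLATE = Template("""\
scrape_configs:
  - job_name: semantic-router
    static_configs:
      - targets: ["$ROUTER_TARGET"]
""")


def render_scrape_config(env=None):
    """Render the scrape config, falling back to the local-mode default target."""
    env = os.environ if env is None else env
    target = env.get("ROUTER_TARGET", "localhost:9190")
    return SCRAPE_TEMPLATE.substitute(ROUTER_TARGET=target)


local_cfg = render_scrape_config({})
compose_cfg = render_scrape_config({"ROUTER_TARGET": "semantic-router:9190"})
print(compose_cfg)
```

In local mode the default points at the host process; in Compose mode the target is the service name on the shared bridge network.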
4. Kubernetes Mode
Production-ready Prometheus + Grafana for K8s clusters.
Namespace:
vllm-semantic-router-system
Components
| Component | Purpose | Location |
|---|---|---|
| Prometheus | Scrapes router metrics, 15d retention | deploy/kubernetes/observability/prometheus/ |
| Grafana | Dashboard visualization | deploy/kubernetes/observability/grafana/ |
| Ingress | Optional external access | deploy/kubernetes/observability/ingress.yaml |
Deploy
# Apply manifests
kubectl apply -k deploy/kubernetes/observability/
# Verify
kubectl get pods -n vllm-semantic-router-system
Access
Port-forward:
kubectl port-forward svc/prometheus 9090:9090 -n vllm-semantic-router-system
kubectl port-forward svc/grafana 3000:3000 -n vllm-semantic-router-system
Ingress: Customize ingress.yaml with your domain and TLS settings.
Key Configuration
Prometheus uses Kubernetes service discovery:
scrape_configs:
  - job_name: semantic-router
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [vllm-semantic-router-system]
Grafana credentials (change in production):
kubectl create secret generic grafana-admin \
  --namespace vllm-semantic-router-system \
  --from-literal=admin-user=admin \
  --from-literal=admin-password='your-password'
5. Key Metrics
| Metric | Type | Description |
|---|---|---|
| llm_category_classifications_count | counter | Category classifications |
| llm_model_completion_tokens_total | counter | Tokens per model |
| llm_model_routing_modifications_total | counter | Model routing changes |
| llm_model_completion_latency_seconds | histogram | Completion latency |
Example queries:
rate(llm_model_completion_tokens_total[5m])
histogram_quantile(0.95, rate(llm_model_completion_latency_seconds_bucket[5m]))
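To see what the second query computes, it helps to work through the arithmetic on a small histogram. The sketch below follows the linear interpolation that PromQL's histogram_quantile applies within a bucket. It assumes positive bucket bounds (reasonable for latency), and the bucket counts are made up. In the real query the inputs are per-second rates rather than raw counts, but scaling every bucket by the same factor does not change the result.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have upper bound +Inf. Bounds are assumed positive,
    so the first bucket is treated as starting at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for buckets le=0.1, 0.5, 1.0, +Inf.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.50, buckets)
p95 = histogram_quantile(0.95, buckets)
print(p50, p95)
```

Here the median lands inside the 0.1 to 0.5 bucket, while the 95th percentile falls into the +Inf bucket and is reported as the highest finite bound. That second case is a sign the bucket layout is too coarse for the observed latencies.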
6. Windowed Endpoint Metrics (Load Balancing)
Enhanced time-windowed metrics for endpoint and model performance tracking, useful for load balancing decisions.
Configuration
Enable windowed metrics in config.yaml:
observability:
  metrics:
    windowed_metrics:
      enabled: true
      time_windows: ["1m", "5m", "15m", "1h", "24h"]
      update_interval: "10s"
      endpoint_metrics: true
      queue_depth_estimation: true
      max_endpoints: 100
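Before deploying a changed configuration, it can be useful to sanity-check the window settings. The sketch below converts the duration strings to seconds and applies a few checks. These checks are illustrative assumptions, not the router's own validation rules, and the parser only handles the simple number-plus-unit form used above.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings like '10s', '5m', '24h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_windowed_config(cfg):
    """Return a list of problems found in a windowed_metrics mapping."""
    problems = []
    windows = [parse_duration(w) for w in cfg.get("time_windows", [])]
    # The "10s" fallback is an assumption made for this sketch.
    interval = parse_duration(cfg.get("update_interval", "10s"))
    if not windows:
        problems.append("no time_windows configured")
    elif interval > min(windows):
        problems.append("update_interval exceeds the shortest time window")
    if len(set(windows)) != len(windows):
        problems.append("duplicate time_windows")
    if cfg.get("max_endpoints", 0) <= 0:
        problems.append("max_endpoints must be positive")
    return problems


# The windowed_metrics block from above, written as a Python mapping.
config = {
    "enabled": True,
    "time_windows": ["1m", "5m", "15m", "1h", "24h"],
    "update_interval": "10s",
    "endpoint_metrics": True,
    "queue_depth_estimation": True,
    "max_endpoints": 100,
}

problems = check_windowed_config(config)
print(problems or "config looks consistent")
```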
Endpoint-Level Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| llm_endpoint_latency_windowed_seconds | gauge | endpoint, model, time_window | Average latency per time window |
| llm_endpoint_requests_windowed_total | gauge | endpoint, model, time_window | Request count per time window |
| llm_endpoint_tokens_windowed_total | gauge | endpoint, model, token_type, time_window | Token throughput per window |
| llm_endpoint_utilization_percentage | gauge | endpoint, time_window | Estimated utilization percentage |
| llm_endpoint_queue_depth_estimated | gauge | endpoint, model | Current estimated queue depth |
| llm_endpoint_error_rate_windowed | gauge | endpoint, model, time_window | Error rate per time window |
| llm_endpoint_latency_p50_windowed_seconds | gauge | endpoint, model, time_window | P50 latency per time window |
| llm_endpoint_latency_p95_windowed_seconds | gauge | endpoint, model, time_window | P95 latency per time window |
| llm_endpoint_latency_p99_windowed_seconds | gauge | endpoint, model, time_window | P99 latency per time window |
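Conceptually, each windowed gauge summarizes the requests observed during the trailing window. The sketch below shows one way such values could be derived, using a sliding window with nearest-rank percentiles. It illustrates the idea under that assumption and is not the router's actual implementation. The class and field names are hypothetical.

```python
import math
from collections import deque


class WindowedStats:
    """Sliding-window request statistics for one endpoint (illustrative only)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, latency_seconds, is_error), oldest first

    def record(self, timestamp, latency, is_error=False):
        """Add one request; timestamps are assumed to arrive in increasing order."""
        self.samples.append((timestamp, latency, is_error))

    def _evict(self, now):
        # Drop samples that have fallen out of the trailing window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def snapshot(self, now):
        """Return request count, average latency, P95 latency and error rate."""
        self._evict(now)
        latencies = sorted(s[1] for s in self.samples)
        count = len(latencies)
        if count == 0:
            return {"requests": 0, "avg": 0.0, "p95": 0.0, "error_rate": 0.0}
        errors = sum(1 for s in self.samples if s[2])
        p95_index = max(0, math.ceil(0.95 * count) - 1)  # nearest-rank percentile
        return {
            "requests": count,
            "avg": sum(latencies) / count,
            "p95": latencies[p95_index],
            "error_rate": errors / count,
        }


stats = WindowedStats(window_seconds=60)
stats.record(0, 1.0)
stats.record(30, 0.2)
stats.record(50, 0.4, is_error=True)
snap = stats.snapshot(now=70)  # the sample at t=0 is outside the 60s window
print(snap)
```

One instance per (endpoint, time window) pair would be needed to cover the label combinations in the table, which is why a cap such as max_endpoints matters for memory use.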
Example Queries
# Average latency for endpoint in last 5 minutes
llm_endpoint_latency_windowed_seconds{endpoint="10.0.0.1:8080", time_window="5m"}
# P95 latency comparison across endpoints
llm_endpoint_latency_p95_windowed_seconds{time_window="15m"}
# Token throughput per endpoint
llm_endpoint_tokens_windowed_total{token_type="completion", time_window="1h"}
# Current queue depth for load balancing decisions
llm_endpoint_queue_depth_estimated{endpoint="10.0.0.1:8080"}
# Error rate monitoring
llm_endpoint_error_rate_windowed{time_window="5m"} > 0.05
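These queries can also be run programmatically through the Prometheus instant-query endpoint, /api/v1/query. The sketch below builds the request URL and flattens an instant-vector response into a per-endpoint mapping. `run_query` needs a reachable Prometheus, and the sample response is hand-written with made-up values.

```python
import json
import urllib.parse
import urllib.request


def query_url(promql, base="http://localhost:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql, base="http://localhost:9090"):
    """Execute an instant query (requires a running Prometheus)."""
    with urllib.request.urlopen(query_url(promql, base), timeout=5) as resp:
        return json.load(resp)


def vector_by_label(payload, label):
    """Map one label's value to the sample value for an instant-vector result."""
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hand-written sample shaped like an instant-vector response; values are made up.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"endpoint": "10.0.0.1:8080"}, "value": [1700000000, "0.12"]},
            {"metric": {"endpoint": "10.0.0.2:8080"}, "value": [1700000000, "0.3"]},
        ],
    },
}

url = query_url('llm_endpoint_error_rate_windowed{time_window="5m"}')
error_rates = vector_by_label(sample, "endpoint")
print(url)
print(error_rates)
```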
Use Cases
- Load Balancing: Use queue depth and latency metrics to route requests to less loaded endpoints
- Performance Monitoring: Track P95/P99 latency trends across time windows
- Capacity Planning: Monitor utilization percentages to identify when to scale
- Alerting: Set alerts on error rates or latency spikes within specific time windows
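As an illustration of the load-balancing use case, the sketch below picks an endpoint from a set of metric snapshots. The scoring rule (queue depth plus weighted P95 latency, skipping endpoints above an error-rate threshold) is an assumption chosen for the example, not a policy the router prescribes. The 0.05 threshold mirrors the error-rate query above.

```python
from dataclasses import dataclass


@dataclass
class EndpointSnapshot:
    """Values read from the windowed gauges for one endpoint (hypothetical type)."""
    endpoint: str
    queue_depth: float   # llm_endpoint_queue_depth_estimated
    p95_latency: float   # llm_endpoint_latency_p95_windowed_seconds
    error_rate: float    # llm_endpoint_error_rate_windowed


def pick_endpoint(snapshots, max_error_rate=0.05, latency_weight=1.0):
    """Return the endpoint with the lowest load score, or None if there are none."""
    healthy = [s for s in snapshots if s.error_rate <= max_error_rate]
    candidates = healthy or list(snapshots)  # fall back if every endpoint is failing
    if not candidates:
        return None
    best = min(candidates, key=lambda s: s.queue_depth + latency_weight * s.p95_latency)
    return best.endpoint


snapshots = [
    EndpointSnapshot("10.0.0.1:8080", queue_depth=4, p95_latency=0.8, error_rate=0.01),
    EndpointSnapshot("10.0.0.2:8080", queue_depth=1, p95_latency=1.2, error_rate=0.02),
    EndpointSnapshot("10.0.0.3:8080", queue_depth=0, p95_latency=0.3, error_rate=0.20),
]

choice = pick_endpoint(snapshots)
print(choice)
```

The third endpoint has the lowest raw load but is excluded for its error rate, so the second one is chosen. Raising `latency_weight` shifts the preference toward faster endpoints over shorter queues.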
7. Troubleshooting
| Issue | Check | Fix |
|---|---|---|
| Target DOWN | Prometheus /targets | Verify router is running and exposing :9190/metrics |
| No metrics | Generate traffic | Send requests through router |
| Dashboard empty | Grafana datasource | Check Prometheus URL configuration |