Metrics & Monitoring

Metrics collection and visualization for Semantic Router using Prometheus and Grafana.


1. Metrics & Endpoints​

| Component | Endpoint | Notes |
|---|---|---|
| Router metrics | :9190/metrics | Prometheus format (flag: --metrics-port) |
| Router health | :8080/health | HTTP readiness/liveness |
| Envoy metrics (optional) | :19000/stats/prometheus | If Envoy is enabled |

Configuration location: tools/observability/
Dashboard: tools/observability/llm-router-dashboard.json
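
The :9190/metrics endpoint serves the standard Prometheus text exposition format. As a rough sketch of what a scraper sees, here is a minimal parser over an illustrative payload (the sample lines are made up for demonstration, not actual router output):

```python
# Minimal parser for the Prometheus text exposition format served at
# :9190/metrics. The sample payload below is illustrative only.
def parse_metrics(text):
    """Return {metric_name: [(labels_str, value), ...]} from exposition text."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = name_part, ""
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

sample = """\
# HELP llm_model_completion_tokens_total Tokens per model
# TYPE llm_model_completion_tokens_total counter
llm_model_completion_tokens_total{model="model-a"} 1234
llm_model_completion_tokens_total{model="model-b"} 567
"""
parsed = parse_metrics(sample)
print(parsed["llm_model_completion_tokens_total"])
```

In practice you would let Prometheus do the scraping; a parser like this is mainly useful for quick curl-and-inspect debugging.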


2. Local Mode (Router on Host)​

Run the router natively on the host, with observability running in Docker.

Quick Start​

# Start router
make run-router

# Start observability
make o11y-local

Access:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000

Verify targets:

# Check Prometheus scrapes localhost:9190
open http://localhost:9090/targets

Stop:

make stop-observability

Configuration​

All configs in tools/observability/:

  • prometheus.yaml - Scrapes the target from the ROUTER_TARGET env var (default: localhost:9190)
  • grafana-datasource.yaml - Points to localhost:9090
  • grafana-dashboard.yaml - Dashboard provisioning
  • llm-router-dashboard.json - Dashboard definition

Troubleshooting​

| Issue | Fix |
|---|---|
| Target DOWN | Start the router: make run-router |
| No metrics | Generate traffic, check :9190/metrics |
| Port conflict | Change the port or stop the conflicting service |

3. Docker Compose Mode​

All services in Docker containers.

Quick Start​

# Start full stack (includes observability)
docker compose -f deploy/docker-compose/docker-compose.yml up --build

# Or with testing profile
docker compose -f deploy/docker-compose/docker-compose.yml --profile testing up --build

Access:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000

Expected targets:

  • semantic-router:9190
  • envoy-proxy:19000 (optional)

Configuration​

Same configs as local mode (tools/observability/), but:

  • ROUTER_TARGET=semantic-router:9190
  • PROMETHEUS_URL=prometheus:9090
  • Uses semantic-network bridge network

4. Kubernetes Mode​

Production-ready Prometheus + Grafana for K8s clusters.

Namespace: vllm-semantic-router-system

Components​

| Component | Purpose | Location |
|---|---|---|
| Prometheus | Scrapes router metrics, 15d retention | deploy/kubernetes/observability/prometheus/ |
| Grafana | Dashboard visualization | deploy/kubernetes/observability/grafana/ |
| Ingress | Optional external access | deploy/kubernetes/observability/ingress.yaml |

Deploy​

# Apply manifests
kubectl apply -k deploy/kubernetes/observability/

# Verify
kubectl get pods -n vllm-semantic-router-system

Access​

Port-forward:

kubectl port-forward svc/prometheus 9090:9090 -n vllm-semantic-router-system
kubectl port-forward svc/grafana 3000:3000 -n vllm-semantic-router-system

Ingress: customize ingress.yaml with your domain and TLS settings.

Key Configuration​

Prometheus uses Kubernetes service discovery:

scrape_configs:
  - job_name: semantic-router
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [vllm-semantic-router-system]

Grafana credentials (change in production):

kubectl create secret generic grafana-admin \
  --namespace vllm-semantic-router-system \
  --from-literal=admin-user=admin \
  --from-literal=admin-password='your-password'

5. Key Metrics​

| Metric | Type | Description |
|---|---|---|
| llm_category_classifications_count | counter | Category classifications |
| llm_model_completion_tokens_total | counter | Tokens per model |
| llm_model_routing_modifications_total | counter | Model routing changes |
| llm_model_completion_latency_seconds | histogram | Completion latency |

Example queries:

rate(llm_model_completion_tokens_total[5m])
histogram_quantile(0.95, rate(llm_model_completion_latency_seconds_bucket[5m]))
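
The histogram_quantile query above estimates a quantile by linearly interpolating within cumulative histogram buckets. A simplified sketch of that interpolation, with made-up bucket boundaries and counts:

```python
# Simplified sketch of the interpolation behind histogram_quantile(),
# applied to cumulative buckets such as those produced by
# llm_model_completion_latency_seconds_bucket. Bucket data is illustrative.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (le_upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    lower = 0.0       # lower bound of the current bucket
    prev_count = 0.0  # cumulative count below the current bucket
    for le, count in buckets:
        if count >= rank:
            # Linearly interpolate the quantile's position in this bucket.
            width = le - lower
            return lower + width * (rank - prev_count) / (count - prev_count)
        lower, prev_count = le, count
    return buckets[-1][0]

buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]  # (le, cumulative count)
p95 = histogram_quantile(0.95, buckets)
print(p95)  # → 0.75
```

This is why quantile accuracy depends on bucket layout: the estimate can only be as precise as the bucket the quantile falls into.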

6. Windowed Endpoint Metrics (Load Balancing)​

Enhanced time-windowed metrics for endpoint and model performance tracking, useful for load balancing decisions.

Configuration​

Enable windowed metrics in config.yaml:

observability:
  metrics:
    windowed_metrics:
      enabled: true
      time_windows: ["1m", "5m", "15m", "1h", "24h"]
      update_interval: "10s"
      endpoint_metrics: true
      queue_depth_estimation: true
      max_endpoints: 100
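
Conceptually, a windowed gauge like a "5m" average latency keeps recent samples and drops those older than the window. A minimal sketch of that idea (the router's actual implementation is not shown here; this is an assumption for illustration):

```python
from collections import deque

# Sketch of a time-windowed average, the idea behind gauges like
# llm_endpoint_latency_windowed_seconds. Illustrative only; not the
# router's actual implementation.
class WindowedAverage:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value, now):
        self.samples.append((now, value))

    def average(self, now):
        # Evict samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        return sum(v for _, v in self.samples) / len(self.samples)

w = WindowedAverage(window_seconds=300)  # a "5m" window
w.record(0.2, now=0)
w.record(0.4, now=100)
w.record(0.6, now=400)       # by now=400, the now=0 sample has aged out
print(w.average(now=400))    # → 0.5
```

A periodic task on the update_interval cadence would recompute such averages and export them as gauges.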

Endpoint-Level Metrics​

| Metric | Type | Labels | Description |
|---|---|---|---|
| llm_endpoint_latency_windowed_seconds | gauge | endpoint, model, time_window | Average latency per time window |
| llm_endpoint_requests_windowed_total | gauge | endpoint, model, time_window | Request count per time window |
| llm_endpoint_tokens_windowed_total | gauge | endpoint, model, token_type, time_window | Token throughput per window |
| llm_endpoint_utilization_percentage | gauge | endpoint, time_window | Estimated utilization percentage |
| llm_endpoint_queue_depth_estimated | gauge | endpoint, model | Current estimated queue depth |
| llm_endpoint_error_rate_windowed | gauge | endpoint, model, time_window | Error rate per time window |
| llm_endpoint_latency_p50_windowed_seconds | gauge | endpoint, model, time_window | P50 latency per time window |
| llm_endpoint_latency_p95_windowed_seconds | gauge | endpoint, model, time_window | P95 latency per time window |
| llm_endpoint_latency_p99_windowed_seconds | gauge | endpoint, model, time_window | P99 latency per time window |

Example Queries​

# Average latency for endpoint in last 5 minutes
llm_endpoint_latency_windowed_seconds{endpoint="10.0.0.1:8080", time_window="5m"}

# P95 latency comparison across endpoints
llm_endpoint_latency_p95_windowed_seconds{time_window="15m"}

# Token throughput per endpoint
llm_endpoint_tokens_windowed_total{token_type="completion", time_window="1h"}

# Current queue depth for load balancing decisions
llm_endpoint_queue_depth_estimated{endpoint="10.0.0.1:8080"}

# Error rate monitoring
llm_endpoint_error_rate_windowed{time_window="5m"} > 0.05

Use Cases​

  1. Load Balancing: Use queue depth and latency metrics to route requests to less loaded endpoints
  2. Performance Monitoring: Track P95/P99 latency trends across time windows
  3. Capacity Planning: Monitor utilization percentages to identify when to scale
  4. Alerting: Set alerts on error rates or latency spikes within specific time windows
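
For use case 1, a load balancer might combine queue depth and tail latency into a single score and route to the lowest-scoring endpoint. A sketch under assumptions (the weighting is arbitrary and the endpoint addresses are illustrative):

```python
# Sketch of a least-loaded endpoint picker using the windowed metrics
# above. The 10.0x latency weight is an arbitrary illustrative choice.
def pick_endpoint(endpoints):
    """endpoints: {address: {"queue_depth": float, "p95_latency": float}}."""
    def score(addr):
        m = endpoints[addr]
        # Lower is better: penalize queued requests and slow tail latency.
        return m["queue_depth"] + 10.0 * m["p95_latency"]
    return min(endpoints, key=score)

endpoints = {
    "10.0.0.1:8080": {"queue_depth": 4, "p95_latency": 0.8},  # score 12.0
    "10.0.0.2:8080": {"queue_depth": 1, "p95_latency": 0.5},  # score 6.0
}
print(pick_endpoint(endpoints))  # → 10.0.0.2:8080
```

The metric values would come from querying llm_endpoint_queue_depth_estimated and llm_endpoint_latency_p95_windowed_seconds; tune the weights for your workload.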

7. Troubleshooting​

| Issue | Check | Fix |
|---|---|---|
| Target DOWN | Prometheus /targets | Verify the router is running and exposing :9190/metrics |
| No metrics | Generate traffic | Send requests through the router |
| Dashboard empty | Grafana datasource | Check the Prometheus URL configuration |