Using the Grafana Dashboard
You have the monitoring stack running. You understand what each metric means. Now it’s time to actually use the dashboard to monitor your Edera deployment.
The “Edera Workload Monitoring” dashboard has 18 panels organized into 6 sections. This isn’t a generic Kubernetes dashboard—it’s purpose-built for the three-layer Edera architecture. Every panel answers a specific operational question.
Let’s walk through the dashboard section by section and learn how to interpret what you’re seeing.
Accessing the Dashboard
Step 1: Log into Grafana
Navigate to your Grafana instance:
# Get the LoadBalancer URL
kubectl get svc grafana -n edera-monitoring -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
# Or use port-forward for local access
kubectl port-forward -n edera-monitoring svc/grafana 3000:3000
Log in with:
- Username: admin
- Password: feeltheteal (or your custom password)
Step 2: Open the Edera Dashboard
The dashboard should auto-load as the default. If not:
- Click the menu icon (☰) in the top-left
- Navigate to “Dashboards”
- Select “Edera Workload Monitoring”
Or access directly at:
http://<GRAFANA_URL>:3000/d/edera-workload-monitoring/edera-workload-monitoring
Dashboard Overview
The dashboard is organized into these sections:
┌──────────────────────────────────────────────────────────┐
│ Section 1: Overview │
│ - Ready Zones | Total Zones | Health Status | Timeline │
├──────────────────────────────────────────────────────────┤
│ Section 2: Zone CPU Metrics │
│ - Per-core CPU usage | Average CPU per zone │
├──────────────────────────────────────────────────────────┤
│ Section 3: Zone Memory Metrics │
│ - Memory usage (bytes) | Memory usage (%) │
├──────────────────────────────────────────────────────────┤
│ Section 4: Hypervisor Metrics │
│ - CPU rate | Memory allocation | vCPUs online │
├──────────────────────────────────────────────────────────┤
│ Section 5: Host (Dom0) Metrics │
│ - Per-core CPU | CPU by mode | Memory breakdown │
├──────────────────────────────────────────────────────────┤
│ Section 6: Health & Kubernetes │
│ - Health checks | Namespace distribution | Summary │
└──────────────────────────────────────────────────────────┘
Let’s dive into each section.
Section 1: Overview
This is your 30-second health check. Four panels that immediately tell you if things are normal or on fire.
Panel 1: Ready Zones
Type: Stat (large number)
Query: sum(zones{state="ready"})
What you see: A single number showing zones in ready state.
Interpretation:
- Green and stable: Normal operation
- Fluctuating rapidly: Zones are churning (starting/stopping frequently)
- Dropping to zero: Critical issue—all zones are failing
Action:
- If this drops unexpectedly, immediately check Section 6 (Health Status)
- Correlate with Kubernetes events:
kubectl get events -n <namespace>
Panel 2: Total Zones
Type: Stat (large number)
Query: sum(zones)
What you see: Total count of all zones regardless of state.
Interpretation:
- Should match your expected workload count
- Significantly higher than “Ready Zones” means zones stuck in other states
Action:
- Compare to Ready Zones
- If the total is much higher than ready, check for zones stuck in “starting” or “error” states (a per-state breakdown query follows below)
- Query Prometheus:
zones{state!="ready"}
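For a per-state breakdown (useful when the gap between total and ready is large), you can group the same metric by state in the Prometheus UI or Grafana Explore; this assumes the state label shown in the queries above:
# Count of zones in each non-ready state
sum by (state) (zones{state!="ready"})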
Panel 3: Health Status
Type: Stat (indicator)
Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m])) > 0.95
What you see: A binary indicator (usually green checkmark or red X).
Interpretation:
- Green (>95% success): System healthy
- Red (<95% success): Degraded health checks
Action:
- If red, drill into specific services with:
rate(health_check_total{status="failed"}[5m])
- Check logs:
kubectl logs -n edera-system -l app=edera
Panel 4: Zones by State (Timeline)
Type: Time series graph
Query: zones (all states)
What you see: Colored lines showing zone counts over time, broken down by state.
Interpretation:
- Flat lines: Stable cluster
- Spiky “starting” line: Frequent pod churn (possibly autoscaling or restarts)
- Persistent “error” line: Zones failing to start
Action:
- Zoom in on spikes to correlate with deployments or scaling events
- If “error” state persists, identify the zone:
zones{state="error"}
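If you prefer to pull the same information from a terminal, the Prometheus HTTP API can be queried directly; this sketch assumes the prometheus service and edera-monitoring namespace used elsewhere in this module:
# Port-forward Prometheus locally (as in the Troubleshooting section)
kubectl port-forward -n edera-monitoring svc/prometheus 9090:9090 &
sleep 2  # give the port-forward a moment to establish
# Ask the HTTP API which zones are currently in the error state
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=zones{state="error"}'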
Section 2: Zone CPU Metrics
This section shows CPU usage for individual zones. Remember: each zone can have multiple vCPUs.
Panel 5: Zone CPU Usage by CPU Core
Type: Time series (multi-line)
Query: zone_cpu_usage_percent
What you see: One line per vCPU per zone. With many zones, this looks like a rainbow of lines.
Interpretation:
- Lines clustered low (0-30%): Zones are under-utilized
- Lines near 100%: Zones are CPU-bound
- Uneven usage within a zone: Workload isn’t parallelizing across vCPUs
How to use:
- Filter by zone: add {zone_id="<id>"} to focus on one zone
- Look for sustained 100% usage; that zone needs more CPU
- Look for asymmetric usage—workload might not be multi-threaded
Action:
- If a zone is consistently maxed: increase vCPU allocation or scale horizontally
- If usage is low across zones: consider reducing vCPU to save costs
Panel 6: Average CPU Usage per Zone
Type: Time series
Query: avg by (zone_id, k8s_pod) (zone_cpu_usage_percent)
What you see: One line per zone showing average CPU across all vCPUs.
Interpretation:
- Easier to read than per-core view
- Shows which zones are CPU-intensive
How to use:
- Identify top CPU consumers
- Correlate with k8s_pod labels to identify workload type
- Track CPU usage over time to detect trends (a steady climb can indicate a leak or gradually growing load)
Action:
- Click a line to see zone_id and k8s_pod
- Use this to right-size resource requests in Kubernetes manifests
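To surface the top consumers directly (for example in Grafana Explore or a custom panel), a topk over the same average works well; the labels follow the panel query above:
# Top 5 zones by average CPU across their vCPUs
topk(5, avg by (zone_id, k8s_pod) (zone_cpu_usage_percent))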
Section 3: Zone Memory Metrics
Memory is less flexible than CPU—zones have fixed allocations. These panels show how close zones are to their limits.
Panel 7: Zone Memory Usage
Type: Time series
Queries:
- zone_memory_used_bytes
- zone_memory_total_bytes
What you see: Two sets of lines—used (typically lower) and total (typically higher).
Interpretation:
- Used approaching total: Zone is at memory capacity
- Large gap: Zone has plenty of free memory
- Used > total: Impossible—indicates a metric collection issue
How to use:
- Filter by zone to see specific usage
- Watch for used memory steadily increasing (memory leak)
Action:
- If used approaches total, increase zone memory allocation
- If used is consistently low, decrease allocation to save resources
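To turn the “watch for steadily increasing usage” advice into a query, predict_linear can project whether a zone will exhaust its allocation. This is a sketch using the metric names above; tune the lookback window and horizon to your workloads, and add an on()/ignoring() matcher if the two series carry different label sets in your install:
# Zones whose memory usage, extrapolated from the last hour of growth,
# would exceed their allocation within 4 hours (14400 seconds)
predict_linear(zone_memory_used_bytes[1h], 14400) > zone_memory_total_bytes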
Panel 8: Zone Memory Usage %
Type: Gauge (percentage)
Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
What you see: Percentage bars for each zone.
Interpretation:
- Green (<70%): Healthy
- Yellow (70-85%): Monitor closely
- Red (>85%): Approaching OOM (out-of-memory)
Action:
- Any zone above 85%: immediate action required
- Check if the zone needs more memory or if there’s a memory leak
- Query the pod:
kubectl describe pod <pod-name> -n <namespace>
Section 4: Hypervisor Metrics
This is Xen’s view of the world. These panels show how the hypervisor is managing resources.
Panel 9: Hypervisor CPU Usage Rate
Type: Time series
Query: rate(hypervisor_cpu_usage_seconds_total[5m])
What you see: CPU time consumption rate from Xen’s perspective.
Interpretation:
- A value of 1.0 means one full CPU core is being used
- This is cumulative across all vCPUs for a zone
How to use:
- Compare to zone CPU metrics to validate consistency
- Detect CPU time theft or unexpected usage
- A zone with 2 vCPUs fully utilized should show ~2.0 here
Action:
- Discrepancies between zone and hypervisor CPU indicate measurement issues
- Use this to validate that zones are getting the CPU time they request
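One way to do this cross-check in a single view is to plot both layers per zone and confirm they track each other; this assumes both metric families carry a zone_id label, which may differ in your installation:
# Hypervisor view, scaled to percent-of-one-core per zone
sum by (zone_id) (rate(hypervisor_cpu_usage_seconds_total[5m])) * 100
# Zone view: per-vCPU percentages summed per zone; the two should roughly track
sum by (zone_id) (zone_cpu_usage_percent)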
Panel 10: Hypervisor Memory Allocation
Type: Time series
Queries:
- hypervisor_memory_total_bytes
- hypervisor_memory_outstanding_bytes
What you see: Total allocated and “outstanding” (owed) memory.
Interpretation:
- Total: What Xen has allocated to zones
- Outstanding: Memory allocated but not yet provided (ballooning)
Why it matters:
- Outstanding memory > 0 indicates memory pressure
- Xen is trying to give zones memory but the host doesn’t have enough
- Persistent outstanding memory means the node is over-subscribed
Action:
- If outstanding is consistently >20% of total: reduce zone density or add nodes
- Check host memory (Section 5) to confirm capacity
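The “>20% of total” guidance translates directly into a query you can pin to a panel or alert on. Sums are used so the division is label-safe; they aggregate across everything Prometheus scrapes, so group by a node label if you want per-node results:
# Returns a value only when outstanding memory exceeds 20% of total
sum(hypervisor_memory_outstanding_bytes) / sum(hypervisor_memory_total_bytes) > 0.2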
Panel 11: Hypervisor vCPUs Online per Zone
Type: Gauge
Query: hypervisor_vcpus_online
What you see: Number of vCPUs online for each zone.
Interpretation:
- Should match the zone’s requested CPU count
- Mismatch indicates configuration or scheduling issue
Action:
- If vCPUs < expected: check zone creation logs
- If vCPUs > expected: possible configuration drift
Section 5: Host (Dom0) Metrics
This is the foundation. The physical node running everything.
Panel 12: Host CPU Usage per Core
Type: Time series
Query: host_cpu_usage_percent
What you see: One line per physical CPU core on the node.
Interpretation:
- All cores low (<50%): Node has CPU headroom
- All cores high (>80%): Node is saturated
- Some cores pinned at 100%: Potential CPU pinning or NUMA issues
Action:
- If all cores are saturated: add more nodes or reduce zone density
- If usage is uneven: check CPU affinity settings
Panel 13: Host CPU Time by Mode
Type: Stacked area chart
Query: rate(host_cpu_usage_seconds_total[5m])
What you see: CPU time breakdown by mode (user, system, idle, iowait, etc.).
Interpretation:
- High user time: Application workload (normal)
- High system time: Kernel/hypervisor overhead (could indicate issues)
- High iowait: Disk I/O bottleneck
- High idle: Node is under-utilized
Action:
- High iowait: investigate disk performance (EBS throttling?)
- High system time: possible hypervisor overhead or kernel issues
- High idle: opportunity to increase zone density
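To quantify one of these modes, for example iowait, compute its share of total CPU time; this assumes the metric carries the mode label described above:
# Fraction of host CPU time spent in iowait over the last 5 minutes
sum(rate(host_cpu_usage_seconds_total{mode="iowait"}[5m]))
  / sum(rate(host_cpu_usage_seconds_total[5m]))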
Panel 14: Host Memory Usage
Type: Time series
Queries:
- host_memory_used_bytes
- host_memory_free_bytes
- host_memory_total_bytes
What you see: Memory breakdown for the host node.
Interpretation:
- Free memory shrinking: Node approaching capacity
- Used + free ≠ total: Some memory is buffers/cache (normal in Linux)
Action:
- If free memory < 10%: stop provisioning new zones on this node
- Check hypervisor outstanding memory (Panel 10) for correlation
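The “free memory < 10%” rule of thumb maps to a query you can alert on, using the metrics listed above (assumes the two series share labels):
# Hosts with less than 10% free memory
(host_memory_free_bytes / host_memory_total_bytes) * 100 < 10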
Panel 15: Host Memory Usage %
Type: Gauge
Query: (host_memory_used_bytes / host_memory_total_bytes) * 100
What you see: Single percentage bar for host memory.
Interpretation:
- <80%: Healthy
- 80-90%: Monitor closely
- >90%: At capacity—no room for more zones
Action:
- Above 90%: don’t schedule more zones on this node
- Correlate with zone memory to understand what’s consuming memory
Section 6: Health & Kubernetes
This section ties everything together with health checks and Kubernetes context.
Panel 16: Health Check Rates by Service
Type: Time series
Query: rate(health_check_total[5m])
What you see: Success and failure rates for each Edera service.
Interpretation:
- Steady success rate: Services are healthy
- Spikes in failures: Transient issues
- Sustained failures: Service degradation
Action:
- Drill into failed checks:
health_check_total{status="failed"}
- Correlate with service logs
Panel 17: Zone Distribution by Kubernetes Namespace
Type: Pie chart or table
Query: count by (k8s_namespace) (zone_cpu_usage_percent)
What you see: How zones are distributed across namespaces.
Interpretation:
- Shows resource allocation per team/environment
- Useful for chargeback or capacity planning
Action:
- Use to identify which teams are consuming most resources
- Balance resource allocation across namespaces
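For chargeback-style views, the same label can drive per-namespace aggregations. The CPU query below follows the panel query above; the memory variant assumes zone_memory_used_bytes carries the same k8s_namespace label:
# Total zone CPU (sum of per-vCPU percentages) per namespace
sum by (k8s_namespace) (zone_cpu_usage_percent)
# Total zone memory used per namespace (assumes the label is present on this metric)
sum by (k8s_namespace) (zone_memory_used_bytes)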
Panel 18: Resource Summary Table
Type: Table
Query: Multiple metrics aggregated
What you see: Comprehensive table with zone_id, k8s_pod, CPU, memory, state.
Interpretation:
- This is your “everything at once” view
- Sortable by any column
How to use:
- Sort by CPU to find top consumers
- Sort by memory % to find zones near limits
- Filter by namespace or state
Action:
- Export to CSV for reporting
- Use as a checklist during incident response
Dashboard Time Range and Refresh
Default time range: Last 1 hour
Default refresh: 30 seconds
Adjusting the time range:
- Click the time picker in top-right
- Select preset ranges (5m, 15m, 1h, 6h, 24h, 7d)
- Or set a custom range
Use cases:
- Last 5 minutes: Real-time troubleshooting
- Last 1 hour: Recent trends
- Last 24 hours: Daily patterns
- Last 7 days: Weekly trends and capacity planning
Refresh rate:
- 30s default balances freshness with load
- Increase to 10s for active incidents
- Decrease to 1m for historical analysis
Practical Workflows
Workflow 1: Daily Health Check (2 minutes)
- Open the dashboard
- Check Section 1 (Overview):
- Ready Zones count normal?
- Health status green?
- Scroll through sections looking for red/yellow indicators
- Done—everything’s healthy
Workflow 2: Investigating High CPU (5 minutes)
- Section 1: Notice spikes in zone activity
- Section 2: Identify which zones have high CPU
- Sort by highest CPU usage
- Note the k8s_pod and k8s_namespace labels
- Check Kubernetes:
kubectl describe pod <pod> -n <namespace>
- Decide: scale up (more vCPUs) or scale out (more replicas)?
Workflow 3: Capacity Planning (10 minutes)
- Set time range to “Last 7 days”
- Section 5: Check host CPU and memory trends
- Are we approaching capacity?
- Section 2 & 3: Check zone resource utilization
- Are zones over-provisioned?
- Calculate headroom:
- Host capacity - (sum of zone allocations) = available for new zones
- Decision: add nodes, reduce zone sizes, or optimize workloads
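The headroom calculation above can be approximated with the metrics from Sections 3 and 5. This is a cluster-wide sketch; scope both sums by a node label for per-node headroom, and adjust if your label sets differ:
# Memory available for new zones: host capacity minus current zone allocations
sum(host_memory_total_bytes) - sum(zone_memory_total_bytes)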
Workflow 4: Incident Response (During an outage)
- Set refresh to 10s, time range to “Last 15 minutes”
- Section 1: What changed? Zone count dropped? Health check failed?
- Section 6: Which service failed health checks?
- Section 5: Host issues? (CPU/memory/disk?)
- Section 4: Hypervisor issues? (outstanding memory?)
- Section 2 & 3: Zone-level issues? (OOM? CPU starvation?)
- Correlate with logs and Kubernetes events
- Fix root cause and watch metrics return to normal
Creating Custom Dashboards
The default dashboard is comprehensive, but you may want custom views.
Creating a new dashboard:
- In Grafana, click “+” → “Dashboard”
- Click “Add visualization”
- Select “Prometheus” as the data source
- Enter a PromQL query (examples below)
- Customize visualization type and settings
- Save the panel and dashboard
Useful custom panels:
Top 10 zones by memory usage:
topk(10, (zone_memory_used_bytes / zone_memory_total_bytes) * 100)
Zones by Kubernetes namespace (bar chart):
count(zone_cpu_usage_percent) by (k8s_namespace)
Alert panel for zones >90% memory:
(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90
Cluster-wide CPU usage:
sum(zone_cpu_usage_percent) / count(zone_cpu_usage_percent)
Nodes with low free memory:
(host_memory_free_bytes / host_memory_total_bytes) * 100 < 20
Setting Up Alerts
Dashboards are for humans. Alerts are for automation.
Creating an alert in Grafana:
- Edit a panel (or create a new one)
- Go to the “Alert” tab
- Click “Create alert rule from this panel”
- Set conditions (e.g., “WHEN avg() OF query(A) IS ABOVE 90”)
- Configure evaluation interval and pending period
- Set notification channel (Slack, PagerDuty, email)
- Save
Recommended alerts:
Zone memory critical:
- Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
- Condition: > 90 for 5 minutes
- Severity: Critical
Host memory critical:
- Query: (host_memory_used_bytes / host_memory_total_bytes) * 100
- Condition: > 90 for 5 minutes
- Severity: Critical
Health check failures:
- Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))
- Condition: < 0.95 for 5 minutes
- Severity: Warning
Zones stuck in error state:
- Query: zones{state="error"}
- Condition: > 0 for 2 minutes
- Severity: Warning
CPU saturation:
- Query: avg(host_cpu_usage_percent{mode="idle"})
- Condition: < 10 for 10 minutes
- Severity: Warning
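If you manage alerts in Prometheus/Alertmanager rather than Grafana, the first rule above might be expressed as a Prometheus alerting rule along these lines; the group and rule names are illustrative, and the zone_id label is assumed to be present as in the panel queries:
groups:
  - name: edera-zone-alerts
    rules:
      - alert: ZoneMemoryCritical
        # Same expression and thresholds as the Grafana rule above
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Zone {{ $labels.zone_id }} memory usage above 90% for 5 minutes"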
Troubleshooting Common Issues
Issue: Dashboard shows “No data”
Possible causes:
- Prometheus isn’t scraping metrics
- Time range is wrong
- Metrics query is incorrect
Debug steps:
# Check Prometheus targets
kubectl port-forward -n edera-monitoring svc/prometheus 9090:9090
# Navigate to http://localhost:9090/targets
# All nodes should be "UP"
# Test a simple query in Prometheus UI
zones{state="ready"}
# If this returns data, the issue is in Grafana
# If not, Prometheus isn't collecting metrics
Issue: Only some nodes showing metrics
Possible causes:
- Edera not running on some nodes
- Firewall blocking port 3035
- Node labels preventing scraping
Debug steps:
# Check if Edera is running on all nodes
kubectl get nodes
kubectl get pods -A -o wide | grep edera
# Test metrics endpoint directly
NODE_IP=<node-ip>
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://$NODE_IP:3035/metrics
# If this fails, Edera isn't exposing metrics
Issue: Dashboard is slow or timing out
Possible causes:
- Too many zones (high cardinality)
- Long time range with small interval
- Inefficient queries
Solutions:
- Reduce time range (use last 1 hour instead of last 7 days)
- Increase refresh interval (1m instead of 10s)
- Use recording rules to pre-compute expensive queries (see the sketch after this list)
- Increase Prometheus resources (CPU/memory)
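A minimal recording-rule sketch for the zone memory percentage used by several panels and alerts above; the rule name and grouping are placeholders, and the file would be loaded through your Prometheus rule_files configuration:
groups:
  - name: edera-recording-rules
    interval: 30s
    rules:
      # Pre-computed zone memory percentage; panels can query the recorded
      # series instead of re-evaluating the division on every refresh
      - record: zone:memory_usage:percent
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100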
Issue: Metrics are delayed
Possible causes:
- Scrape interval too long
- Prometheus overloaded
- Network latency
Debug steps:
# Check Prometheus scrape duration
# In Prometheus UI, query:
scrape_duration_seconds{job="edera"}
# Values >5s indicate scraping is slow
# Consider reducing scrape interval or increasing Prometheus resources
Next Steps
You’ve now learned to:
- Deploy a production-ready Prometheus and Grafana stack
- Understand the three-layer metric hierarchy
- Navigate the 18-panel dashboard
- Interpret CPU, memory, and health metrics
- Troubleshoot issues using metrics
- Create custom dashboards and alerts
Operational best practices:
- Check the dashboard daily - 2-minute health check
- Set up critical alerts - Don’t rely on manual monitoring
- Review trends weekly - Capacity planning and optimization
- Correlate with Kubernetes - Metrics + events + logs = full picture
- Iterate on thresholds - Tune alerts to reduce noise
Further learning:
- Prometheus documentation: https://prometheus.io/docs/
- Grafana documentation: https://grafana.com/docs/
- PromQL tutorial: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Recording rules: Pre-compute expensive queries for better performance
- Alertmanager: Advanced alert routing, grouping, and silencing
Congratulations! You’ve completed Module 6. You now have a comprehensive monitoring solution for your Edera deployment and the knowledge to use it effectively.
Your Edera journey continues:
- Module 1-3: Understand the “why” and “what” of Edera
- Module 4: Get Edera running (installation and first deployment)
- Module 5: Observability and troubleshooting
- Module 6 (this module): Production operations and monitoring ✅
You’re now equipped to run Edera in production with full visibility into your microVM workloads.
