Using the Grafana Dashboard


You have the monitoring stack running. You understand what each metric means. Now it’s time to actually use the dashboard to monitor your Edera deployment.

The “Edera Workload Monitoring” dashboard has 18 panels organized into 6 sections. This isn’t a generic Kubernetes dashboard—it’s purpose-built for the three-layer Edera architecture. Every panel answers a specific operational question.

Let’s walk through the dashboard section by section and learn how to interpret what you’re seeing.

Accessing the Dashboard

Step 1: Log into Grafana

Navigate to your Grafana instance:

# Get the LoadBalancer URL
kubectl get svc grafana -n edera-monitoring -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Or use port-forward for local access
kubectl port-forward -n edera-monitoring svc/grafana 3000:3000

Login with:

  • Username: admin
  • Password: feeltheteal (or your custom password)
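If the admin password was generated at install time rather than set explicitly, it usually lives in a Kubernetes secret. A minimal sketch for recovering it, assuming the secret is named grafana with an admin-password key (both names depend on how your stack was installed):

# Read the Grafana admin password from its secret (names may differ)
kubectl get secret grafana -n edera-monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo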

Step 2: Open the Edera Dashboard

The dashboard should auto-load as the default. If not:

  1. Click the menu icon (☰) in the top-left
  2. Navigate to “Dashboards”
  3. Select “Edera Workload Monitoring”

Or access directly at:

http://<GRAFANA_URL>:3000/d/edera-workload-monitoring/edera-workload-monitoring

Dashboard Overview

The dashboard is organized into these sections:

┌──────────────────────────────────────────────────────────┐
│ Section 1: Overview                                      │
│ - Ready Zones | Total Zones | Health Status | Timeline   │
├──────────────────────────────────────────────────────────┤
│ Section 2: Zone CPU Metrics                              │
│ - Per-core CPU usage | Average CPU per zone              │
├──────────────────────────────────────────────────────────┤
│ Section 3: Zone Memory Metrics                           │
│ - Memory usage (bytes) | Memory usage (%)                │
├──────────────────────────────────────────────────────────┤
│ Section 4: Hypervisor Metrics                            │
│ - CPU rate | Memory allocation | vCPUs online            │
├──────────────────────────────────────────────────────────┤
│ Section 5: Host (Dom0) Metrics                           │
│ - Per-core CPU | CPU by mode | Memory breakdown          │
├──────────────────────────────────────────────────────────┤
│ Section 6: Health & Kubernetes                           │
│ - Health checks | Namespace distribution | Summary       │
└──────────────────────────────────────────────────────────┘

Let’s dive into each section.

Section 1: Overview

This is your 30-second health check. Four panels that immediately tell you if things are normal or on fire.

Panel 1: Ready Zones

Type: Stat (large number)
Query: sum(zones{state="ready"})

What you see: A single number showing zones in ready state.

Interpretation:

  • Green and stable: Normal operation
  • Fluctuating rapidly: Zones are churning (starting/stopping frequently)
  • Dropping to zero: Critical issue—all zones are failing

Action:

  • If this drops unexpectedly, immediately check Section 6 (Health Status)
  • Correlate with Kubernetes events: kubectl get events -n <namespace>
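If the number does drop, recent Kubernetes events usually explain why. Two variations of that events command that are often useful (standard kubectl, nothing Edera-specific):

# Most recent events across all namespaces, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Only warnings in the namespace you care about
kubectl get events -n <namespace> --field-selector type=Warning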

Panel 2: Total Zones

Type: Stat (large number)
Query: sum(zones)

What you see: Total count of all zones regardless of state.

Interpretation:

  • Should match your expected workload count
  • Significantly higher than “Ready Zones” means zones stuck in other states

Action:

  • Compare to Ready Zones
  • If the total is much higher than ready, check for zones stuck in “starting” or “error” states
  • Query Prometheus: zones{state!="ready"}
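If you prefer the terminal to the Prometheus UI, the same check works against the Prometheus HTTP API. A sketch, assuming the port-forward from the troubleshooting section (svc/prometheus on localhost:9090) is active:

# List every zone that is not in the ready state
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=zones{state!="ready"}'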

Panel 3: Health Status

Type: Stat (indicator)
Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m])) > 0.95

What you see: A binary indicator (usually green checkmark or red X).

Interpretation:

  • Green (>95% success): System healthy
  • Red (<95% success): Degraded health checks

Action:

  • If red, drill into specific services with: rate(health_check_total{status="failed"}[5m])
  • Check logs: kubectl logs -n edera-system -l app=edera

Panel 4: Zones by State (Timeline)

Type: Time series graph
Query: zones (all states)

What you see: Colored lines showing zone counts over time, broken down by state.

Interpretation:

  • Flat lines: Stable cluster
  • Spiky “starting” line: Frequent pod churn (possibly autoscaling or restarts)
  • Persistent “error” line: Zones failing to start

Action:

  • Zoom in on spikes to correlate with deployments or scaling events
  • If “error” state persists, identify the zone: zones{state="error"}

Section 2: Zone CPU Metrics

This section shows CPU usage for individual zones. Remember: each zone can have multiple vCPUs.

Panel 5: Zone CPU Usage by CPU Core

Type: Time series (multi-line)
Query: zone_cpu_usage_percent

What you see: One line per vCPU per zone. With many zones, this looks like a rainbow of lines.

Interpretation:

  • Lines clustered low (0-30%): Zones are under-utilized
  • Lines near 100%: Zones are CPU-bound
  • Uneven usage within a zone: Workload isn’t parallelizing across vCPUs

How to use:

  • Filter by zone: Add {zone_id="<id>"} to focus on one zone
  • Look for sustained 100% usage—that zone needs more CPU
  • Look for asymmetric usage—workload might not be multi-threaded

Action:

  • If a zone is consistently maxed: increase vCPU allocation or scale horizontally
  • If usage is low across zones: consider reducing vCPU to save costs
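To separate sustained saturation from momentary spikes, range-averaged variants of the panel query help. A sketch you can paste into the panel or the Explore view (the 30-minute window and 90% threshold are illustrative, not Edera defaults):

vCPUs averaging above 90% over the last 30 minutes:

avg_over_time(zone_cpu_usage_percent[30m]) > 90

Top 5 zones by current average CPU across their vCPUs:

topk(5, avg by (zone_id, k8s_pod) (zone_cpu_usage_percent))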

Panel 6: Average CPU Usage per Zone

Type: Time series
Query: avg by (zone_id, k8s_pod) (zone_cpu_usage_percent)

What you see: One line per zone showing average CPU across all vCPUs.

Interpretation:

  • Easier to read than per-core view
  • Shows which zones are CPU-intensive

How to use:

  • Identify top CPU consumers
  • Correlate with k8s_pod labels to identify workload type
  • Track CPU usage over time to detect trends (memory leaks often cause CPU spikes)

Action:

  • Click a line to see zone_id and k8s_pod
  • Use this to right-size resource requests in Kubernetes manifests
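For right-sizing, a percentile over a longer window is usually more representative than the live average. A sketch (the 7-day window and 95th percentile are illustrative choices):

95th-percentile CPU of each zone's busiest vCPU over the last 7 days:

max by (zone_id, k8s_pod) (quantile_over_time(0.95, zone_cpu_usage_percent[7d]))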

Section 3: Zone Memory Metrics

Memory is less flexible than CPU—zones have fixed allocations. These panels show how close zones are to their limits.

Panel 7: Zone Memory Usage

Type: Time series
Queries:

  • zone_memory_used_bytes
  • zone_memory_total_bytes

What you see: Two sets of lines: used memory, which varies with the workload, and total memory, which stays flat at each zone's allocation.

Interpretation:

  • Used approaching total: Zone is at memory capacity
  • Large gap: Zone has plenty of free memory
  • Used > total: Impossible—indicates a metric collection issue

How to use:

  • Filter by zone to see specific usage
  • Watch for used memory steadily increasing (memory leak)

Action:

  • If used approaches total, increase zone memory allocation
  • If used is consistently low, decrease allocation to save resources
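A steady climb in used memory is easier to catch if you extrapolate the trend instead of waiting for the gauge to turn red. A sketch using predict_linear, assuming the used and total series carry matching zone labels (the 1-hour lookback and 4-hour horizon are arbitrary starting points):

Zones projected to exceed their allocation within 4 hours (14400 seconds):

predict_linear(zone_memory_used_bytes[1h], 14400) > zone_memory_total_bytes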

Panel 8: Zone Memory Usage %

Type: Gauge (percentage)
Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100

What you see: Percentage bars for each zone.

Interpretation:

  • Green (<70%): Healthy
  • Yellow (70-85%): Monitor closely
  • Red (>85%): Approaching OOM (out-of-memory)

Action:

  • Any zone above 85%: immediate action required
  • Check if the zone needs more memory or if there’s a memory leak
  • Query the pod: kubectl describe pod <pod-name> -n <namespace>

Section 4: Hypervisor Metrics

This is Xen’s view of the world. These panels show how the hypervisor is managing resources.

Panel 9: Hypervisor CPU Usage Rate

Type: Time series
Query: rate(hypervisor_cpu_usage_seconds_total[5m])

What you see: CPU time consumption rate from Xen’s perspective.

Interpretation:

  • A value of 1.0 means one full CPU core is being used
  • This is cumulative across all vCPUs for a zone

How to use:

  • Compare to zone CPU metrics to validate consistency
  • Detect CPU time theft or unexpected usage
  • A zone with 2 vCPUs fully utilized should show ~2.0 here

Action:

  • Discrepancies between zone and hypervisor CPU indicate measurement issues
  • Use this to validate that zones are getting the CPU time they request
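That cross-check can be expressed directly in PromQL. A sketch that assumes both metrics expose a matching zone_id label (adjust the grouping labels to whatever your deployment actually exports):

Hypervisor-reported CPU per zone, converted to core-percent:

sum by (zone_id) (rate(hypervisor_cpu_usage_seconds_total[5m])) * 100

Zone-reported CPU summed across vCPUs, for comparison:

sum by (zone_id) (zone_cpu_usage_percent)

Plotted together, the two series should track each other; a persistent gap points to a measurement or scheduling problem rather than workload behavior.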

Panel 10: Hypervisor Memory Allocation

Type: Time series
Queries:

  • hypervisor_memory_total_bytes
  • hypervisor_memory_outstanding_bytes

What you see: Total allocated and “outstanding” (owed) memory.

Interpretation:

  • Total: What Xen has allocated to zones
  • Outstanding: Memory allocated but not yet provided (ballooning)

Why it matters:

  • Outstanding memory > 0 indicates memory pressure
  • Xen is trying to give zones memory but the host doesn’t have enough
  • Persistent outstanding memory means the node is over-subscribed

Action:

  • If outstanding is consistently >20% of total: reduce zone density or add nodes
  • Check host memory (Section 5) to confirm capacity
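The 20% rule of thumb translates directly into a query you can graph or alert on. A sketch that aggregates per scrape target, since the exact label set on these metrics may vary by deployment:

Outstanding memory above 20% of total on any node:

sum by (instance) (hypervisor_memory_outstanding_bytes)
  / sum by (instance) (hypervisor_memory_total_bytes) > 0.2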

Panel 11: Hypervisor vCPUs Online per Zone

Type: Gauge
Query: hypervisor_vcpus_online

What you see: Number of vCPUs online for each zone.

Interpretation:

  • Should match the zone’s requested CPU count
  • Mismatch indicates configuration or scheduling issue

Action:

  • If vCPUs < expected: check zone creation logs
  • If vCPUs > expected: possible configuration drift

Section 5: Host (Dom0) Metrics

This is the foundation. The physical node running everything.

Panel 12: Host CPU Usage per Core

Type: Time series
Query: host_cpu_usage_percent

What you see: One line per physical CPU core on the node.

Interpretation:

  • All cores low (<50%): Node has CPU headroom
  • All cores high (>80%): Node is saturated
  • Some cores pinned at 100%: Potential CPU pinning or NUMA issues

Action:

  • If all cores are saturated: add more nodes or reduce zone density
  • If usage is uneven: check CPU affinity settings

Panel 13: Host CPU Time by Mode

Type: Stacked area chart
Query: rate(host_cpu_usage_seconds_total[5m])

What you see: CPU time breakdown by mode (user, system, idle, iowait, etc.).

Interpretation:

  • High user time: Application workload (normal)
  • High system time: Kernel/hypervisor overhead (could indicate issues)
  • High iowait: Disk I/O bottleneck
  • High idle: Node is under-utilized

Action:

  • High iowait: investigate disk performance (EBS throttling?)
  • High system time: possible hypervisor overhead or kernel issues
  • High idle: opportunity to increase zone density
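To isolate a single mode from the stacked view, filter the same counter by its mode label. A sketch, assuming the label values follow the usual naming (user, system, idle, iowait):

I/O wait per core over the last 5 minutes:

rate(host_cpu_usage_seconds_total{mode="iowait"}[5m])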

Panel 14: Host Memory Usage

Type: Time series
Queries:

  • host_memory_used_bytes
  • host_memory_free_bytes
  • host_memory_total_bytes

What you see: Memory breakdown for the host node.

Interpretation:

  • Free memory shrinking: Node approaching capacity
  • Used + free ≠ total: Some memory is buffers/cache (normal in Linux)

Action:

  • If free memory < 10%: stop provisioning new zones on this node
  • Check hypervisor outstanding memory (Panel 10) for correlation

Panel 15: Host Memory Usage %

Type: Gauge
Query: (host_memory_used_bytes / host_memory_total_bytes) * 100

What you see: Single percentage bar for host memory.

Interpretation:

  • <80%: Healthy
  • 80-90%: Monitor closely
  • >90%: At capacity—no room for more zones

Action:

  • Above 90%: don’t schedule more zones on this node
  • Correlate with zone memory to understand what’s consuming memory

Section 6: Health & Kubernetes

This section ties everything together with health checks and Kubernetes context.

Panel 16: Health Check Rates by Service

Type: Time series
Query: rate(health_check_total[5m])

What you see: Success and failure rates for each Edera service.

Interpretation:

  • Steady success rate: Services are healthy
  • Spikes in failures: Transient issues
  • Sustained failures: Service degradation

Action:

  • Drill into failed checks: health_check_total{status="failed"}
  • Correlate with service logs

Panel 17: Zone Distribution by Kubernetes Namespace

Type: Pie chart or table
Query: count by (k8s_namespace) (zone_cpu_usage_percent)

What you see: How zones are distributed across namespaces.

Interpretation:

  • Shows resource allocation per team/environment
  • Useful for chargeback or capacity planning

Action:

  • Use to identify which teams are consuming most resources
  • Balance resource allocation across namespaces

Panel 18: Resource Summary Table

Type: Table
Query: Multiple metrics aggregated

What you see: Comprehensive table with zone_id, k8s_pod, CPU, memory, state.

Interpretation:

  • This is your “everything at once” view
  • Sortable by any column

How to use:

  • Sort by CPU to find top consumers
  • Sort by memory % to find zones near limits
  • Filter by namespace or state

Action:

  • Export to CSV for reporting
  • Use as a checklist during incident response

Dashboard Time Range and Refresh

Default time range: Last 1 hour
Default refresh: 30 seconds

Adjusting the time range:

  • Click the time picker in top-right
  • Select preset ranges (5m, 15m, 1h, 6h, 24h, 7d)
  • Or set a custom range

Use cases:

  • Last 5 minutes: Real-time troubleshooting
  • Last 1 hour: Recent trends
  • Last 24 hours: Daily patterns
  • Last 7 days: Weekly trends and capacity planning

Refresh rate:

  • 30s default balances freshness with load
  • Speed up to 10s during active incidents
  • Slow down to 1m for historical analysis

Practical Workflows

Workflow 1: Daily Health Check (2 minutes)

  1. Open the dashboard
  2. Check Section 1 (Overview):
    • Ready Zones count normal?
    • Health status green?
  3. Scroll through sections looking for red/yellow indicators
  4. Done—everything’s healthy
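The same spot check can be scripted if you would rather not open a browser. A sketch against the Prometheus HTTP API, assuming the port-forward from the troubleshooting section is active:

# How many zones are ready right now?
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(zones{state="ready"})'

# Health check success ratio over the last 5 minutes (expect > 0.95)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))'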

Workflow 2: Investigating High CPU (5 minutes)

  1. Section 1: Notice spikes in zone activity
  2. Section 2: Identify which zones have high CPU
    • Sort by highest CPU usage
  3. Note the k8s_pod and k8s_namespace labels
  4. Check Kubernetes: kubectl describe pod <pod> -n <namespace>
  5. Decide: scale up (more vCPUs) or scale out (more replicas)?

Workflow 3: Capacity Planning (10 minutes)

  1. Set time range to “Last 7 days”
  2. Section 5: Check host CPU and memory trends
    • Are we approaching capacity?
  3. Section 2 & 3: Check zone resource utilization
    • Are zones over-provisioned?
  4. Calculate headroom:
    • Host capacity - (sum of zone allocations) = available for new zones
  5. Decision: add nodes, reduce zone sizes, or optimize workloads
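The headroom arithmetic in step 4 can be computed straight from the metrics. A sketch; it ignores Dom0's own memory use, so treat the result as an upper bound:

Memory not yet allocated to zones, cluster-wide (bytes):

sum(host_memory_total_bytes) - sum(zone_memory_total_bytes)

Grouping both sides by instance gives the same headroom per node, assuming both metric families carry that label.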

Workflow 4: Incident Response (During an outage)

  1. Set refresh to 10s, time range to “Last 15 minutes”
  2. Section 1: What changed? Zone count dropped? Health check failed?
  3. Section 6: Which service failed health checks?
  4. Section 5: Host issues? (CPU/memory/disk?)
  5. Section 4: Hypervisor issues? (outstanding memory?)
  6. Section 2 & 3: Zone-level issues? (OOM? CPU starvation?)
  7. Correlate with logs and Kubernetes events
  8. Fix root cause and watch metrics return to normal

Creating Custom Dashboards

The default dashboard is comprehensive, but you may want custom views.

Creating a new dashboard:

  1. In Grafana, click “+” → “Dashboard”
  2. Click “Add visualization”
  3. Select “Prometheus” as the data source
  4. Enter a PromQL query (examples below)
  5. Customize visualization type and settings
  6. Save the panel and dashboard

Useful custom panels:

Top 10 zones by memory usage:

topk(10, (zone_memory_used_bytes / zone_memory_total_bytes) * 100)

Zones by Kubernetes namespace (bar chart):

count(zone_cpu_usage_percent) by (k8s_namespace)

Alert panel for zones >90% memory:

(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90

Cluster-wide CPU usage:

sum(zone_cpu_usage_percent) / count(zone_cpu_usage_percent)

Nodes with low free memory:

(host_memory_free_bytes / host_memory_total_bytes) * 100 < 20

Setting Up Alerts

Dashboards are for humans. Alerts are for automation.

Creating an alert in Grafana:

  1. Edit a panel (or create a new one)
  2. Go to the “Alert” tab
  3. Click “Create alert rule from this panel”
  4. Set conditions (e.g., “WHEN avg() OF query(A) IS ABOVE 90”)
  5. Configure evaluation interval and pending period
  6. Set notification channel (Slack, PagerDuty, email)
  7. Save

Recommended alerts:

Zone memory critical:

  • Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
  • Condition: > 90 for 5 minutes
  • Severity: Critical

Host memory critical:

  • Query: (host_memory_used_bytes / host_memory_total_bytes) * 100
  • Condition: > 90 for 5 minutes
  • Severity: Critical

Health check failures:

  • Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))
  • Condition: < 0.95 for 5 minutes
  • Severity: Warning

Zones stuck in error state:

  • Query: zones{state="error"}
  • Condition: > 0 for 2 minutes
  • Severity: Warning

CPU saturation:

  • Query: avg(host_cpu_usage_percent{mode="idle"})
  • Condition: < 10 for 10 minutes
  • Severity: Warning
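Before wiring up notification channels, it is worth confirming that each alert expression actually returns data. A sketch using the zone-memory rule above, run against the Prometheus API (assumes the port-forward from the troubleshooting section):

# Zones currently above 90% memory; an empty result means none breach the threshold
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90'

Temporarily lowering the threshold is an easy way to confirm the rule fires end to end before you rely on it.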

Troubleshooting Common Issues

Issue: Dashboard shows “No data”

Possible causes:

  • Prometheus isn’t scraping metrics
  • Time range is wrong
  • Metrics query is incorrect

Debug steps:

# Check Prometheus targets
kubectl port-forward -n edera-monitoring svc/prometheus 9090:9090
# Navigate to http://localhost:9090/targets
# All nodes should be "UP"

# Test a simple query in Prometheus UI
zones{state="ready"}

# If this returns data, the issue is in Grafana
# If not, Prometheus isn't collecting metrics

Issue: Only some nodes showing metrics

Possible causes:

  • Edera not running on some nodes
  • Firewall blocking port 3035
  • Node labels preventing scraping

Debug steps:

# Check if Edera is running on all nodes
kubectl get nodes
kubectl get pods -A -o wide | grep edera

# Test metrics endpoint directly
NODE_IP=<node-ip>
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://$NODE_IP:3035/metrics

# If this fails, Edera isn't exposing metrics

Issue: Dashboard is slow or timing out

Possible causes:

  • Too many zones (high cardinality)
  • Long time range with small interval
  • Inefficient queries

Solutions:

  • Reduce time range (use last 1 hour instead of last 7 days)
  • Increase refresh interval (1m instead of 10s)
  • Use recording rules to pre-compute expensive queries
  • Increase Prometheus resources (CPU/memory)
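Recording rules are configured on the Prometheus side rather than in Grafana. A minimal sketch of one for the zone-memory percentage, assuming you can add a rules file to your Prometheus configuration (the group name, rule name, and file name are illustrative):

# Write a rules file; how it gets loaded depends on your Prometheus setup
# (ConfigMap, operator CRD, or a file on disk)
cat <<'EOF' > edera-recording-rules.yaml
groups:
  - name: edera-zones
    rules:
      - record: zone:memory_usage:percent
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
EOF

Dashboards and alerts can then query zone:memory_usage:percent instead of re-evaluating the expression on every refresh.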

Issue: Metrics are delayed

Possible causes:

  • Scrape interval too long
  • Prometheus overloaded
  • Network latency

Debug steps:

# Check Prometheus scrape duration
# In Prometheus UI, query:
scrape_duration_seconds{job="edera"}

# Values >5s indicate scraping is slow
# Consider reducing scrape interval or increasing Prometheus resources

Next Steps

You’ve now learned to:

  • Deploy a production-ready Prometheus and Grafana stack
  • Understand the three-layer metric hierarchy
  • Navigate the 18-panel dashboard
  • Interpret CPU, memory, and health metrics
  • Troubleshoot issues using metrics
  • Create custom dashboards and alerts

Operational best practices:

  1. Check the dashboard daily - 2-minute health check
  2. Set up critical alerts - Don’t rely on manual monitoring
  3. Review trends weekly - Capacity planning and optimization
  4. Correlate with Kubernetes - Metrics + events + logs = full picture
  5. Iterate on thresholds - Tune alerts to reduce noise

Congratulations! You’ve completed Module 6. You now have a comprehensive monitoring solution for your Edera deployment and the knowledge to use it effectively.

Your Edera journey continues:

  • Module 1-3: Understand the “why” and “what” of Edera
  • Module 4: Get Edera running (installation and first deployment)
  • Module 5: Observability and troubleshooting
  • Module 6 (this module): Production operations and monitoring ✅

You’re now equipped to run Edera in production with full visibility into your microVM workloads.
