Using the Grafana Dashboard


You have the monitoring stack running. You understand what each metric means. Now it’s time to actually use the dashboard to monitor your Edera deployment.

The “Edera Workload Monitoring” dashboard has 18 panels organized into 6 sections. This isn’t a generic Kubernetes dashboard—it’s purpose-built for the three-layer Edera architecture. Every panel answers a specific operational question.

Let’s walk through the dashboard section by section and learn how to interpret what you’re seeing.

Accessing the Dashboard

Step 1: Log into Grafana

Navigate to your Grafana instance:

# Get the LoadBalancer URL
kubectl get svc grafana -n edera-monitoring -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Or use port-forward for local access
kubectl port-forward -n edera-monitoring svc/grafana 3000:3000

Login with:

  • Username: admin
  • Password: feeltheteal (or your custom password)
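If the admin password was generated at install time rather than set explicitly, it usually lives in a Kubernetes secret. A minimal sketch for recovering it, assuming the secret is named grafana with an admin-password key (both names depend on how your stack was installed):

# Read the Grafana admin password from its secret (names may differ)
kubectl get secret grafana -n edera-monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo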

Step 2: Open the Edera Dashboard

The dashboard should auto-load as the default. If not:

  1. Click the menu icon (☰) in the top-left
  2. Navigate to “Dashboards”
  3. Select “Edera Workload Monitoring”

Or access directly at:

http://<GRAFANA_URL>:3000/d/edera-workload-monitoring/edera-workload-monitoring

Dashboard Overview

The dashboard is organized into these sections:

┌──────────────────────────────────────────────────────────┐
│ Section 1: Overview                                      │
│ - Ready Zones | Total Zones | Health Status | Timeline   │
├──────────────────────────────────────────────────────────┤
│ Section 2: Zone CPU Metrics                              │
│ - Per-core CPU usage | Average CPU per zone              │
├──────────────────────────────────────────────────────────┤
│ Section 3: Zone Memory Metrics                           │
│ - Memory usage (bytes) | Memory usage (%)                │
├──────────────────────────────────────────────────────────┤
│ Section 4: Hypervisor Metrics                            │
│ - CPU rate | Memory allocation | vCPUs online            │
├──────────────────────────────────────────────────────────┤
│ Section 5: Host (Dom0) Metrics                           │
│ - Per-core CPU | CPU by mode | Memory breakdown          │
├──────────────────────────────────────────────────────────┤
│ Section 6: Health & Kubernetes                           │
│ - Health checks | Namespace distribution | Summary       │
└──────────────────────────────────────────────────────────┘

Let’s dive into each section.

Section 1: Overview

This is your 30-second health check. Four panels that immediately tell you if things are normal or on fire.

Panel 1: Ready Zones

Type: Stat (large number)
Query: sum(zones{state="ready"})

What you see: A single number showing zones in ready state.

Interpretation:

  • Green and stable: Normal operation
  • Fluctuating rapidly: Zones are churning (starting/stopping frequently)
  • Dropping to zero: Critical issue—all zones are failing

Action:

  • If this drops unexpectedly, immediately check Section 6 (Health Status)
  • Correlate with Kubernetes events: kubectl get events -n <namespace>
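If the number does drop, recent Kubernetes events usually explain why. Two variations of that events command that are often useful (standard kubectl, nothing Edera-specific):

# Most recent events across all namespaces, newest last
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Only warnings in the namespace you care about
kubectl get events -n <namespace> --field-selector type=Warning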

Panel 2: Total Zones

Type: Stat (large number)
Query: sum(zones)

What you see: Total count of all zones regardless of state.

Interpretation:

  • Should match your expected workload count
  • Significantly higher than “Ready Zones” means zones stuck in other states

Action:

  • Compare to Ready Zones
  • If the total is much higher than ready, check for zones stuck in “starting” or “error” states
  • Query Prometheus: zones{state!="ready"}
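If you prefer the terminal to the Prometheus UI, the same check works against the Prometheus HTTP API. A sketch, assuming the port-forward from the troubleshooting section (svc/prometheus on localhost:9090) is active:

# List every zone that is not in the ready state
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=zones{state!="ready"}'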

Panel 3: Health Status

Type: Stat (indicator)
Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m])) > 0.95

What you see: A binary indicator (usually green checkmark or red X).

Interpretation:

  • Green (>95% success): System healthy
  • Red (<95% success): Degraded health checks

Action:

  • If red, drill into specific services with: rate(health_check_total{status="failed"}[5m])
  • Check logs: kubectl logs -n edera-system -l app=edera

Panel 4: Zones by State (Timeline)

Type: Time series graph
Query: zones (all states)

What you see: Colored lines showing zone counts over time, broken down by state.

Interpretation:

  • Flat lines: Stable cluster
  • Spiky “starting” line: Frequent pod churn (possibly autoscaling or restarts)
  • Persistent “error” line: Zones failing to start

Action:

  • Zoom in on spikes to correlate with deployments or scaling events
  • If “error” state persists, identify the zone: zones{state="error"}

Section 2: Zone CPU Metrics

This section shows CPU usage for individual zones. Remember: each zone can have multiple vCPUs.

Panel 5: Zone CPU Usage by CPU Core

Type: Time series (multi-line)
Query: zone_cpu_usage_percent

What you see: One line per vCPU per zone. With many zones, this looks like a rainbow of lines.

Interpretation:

  • Lines clustered low (0-30%): Zones are under-utilized
  • Lines near 100%: Zones are CPU-bound
  • Uneven usage within a zone: Workload isn’t parallelizing across vCPUs

How to use:

  • Filter by zone: Add {zone_id="<id>"} to focus on one zone
  • Look for sustained 100% usage—that zone needs more CPU
  • Look for asymmetric usage—workload might not be multi-threaded

Action:

  • If a zone is consistently maxed: increase vCPU allocation or scale horizontally
  • If usage is low across zones: consider reducing vCPU to save costs
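To separate sustained saturation from momentary spikes, range-averaged variants of the panel query help. A sketch you can paste into the panel or the Explore view (the 30-minute window and 90% threshold are illustrative, not Edera defaults):

vCPUs averaging above 90% over the last 30 minutes:

avg_over_time(zone_cpu_usage_percent[30m]) > 90

Top 5 zones by current average CPU across their vCPUs:

topk(5, avg by (zone_id, k8s_pod) (zone_cpu_usage_percent))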

Panel 6: Average CPU Usage per Zone

Type: Time series
Query: avg by (zone_id, k8s_pod) (zone_cpu_usage_percent)

What you see: One line per zone showing average CPU across all vCPUs.

Interpretation:

  • Easier to read than per-core view
  • Shows which zones are CPU-intensive

How to use:

  • Identify top CPU consumers
  • Correlate with k8s_pod labels to identify workload type
  • Track CPU usage over time to detect trends (memory leaks often cause CPU spikes)

Action:

  • Click a line to see zone_id and k8s_pod
  • Use this to right-size resource requests in Kubernetes manifests
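For right-sizing, a percentile over a longer window is usually more representative than the live average. A sketch (the 7-day window and 95th percentile are illustrative choices):

95th-percentile CPU of each zone's busiest vCPU over the last 7 days:

max by (zone_id, k8s_pod) (quantile_over_time(0.95, zone_cpu_usage_percent[7d]))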

Section 3: Zone Memory Metrics

Memory is less flexible than CPU—zones have fixed allocations. These panels show how close zones are to their limits.

Panel 7: Zone Memory Usage

Type: Time series
Queries:

  • zone_memory_used_bytes
  • zone_memory_total_bytes

What you see: Two sets of lines: used memory, which varies with the workload, and total memory, which stays flat at each zone's allocation.

Interpretation:

  • Used approaching total: Zone is at memory capacity
  • Large gap: Zone has plenty of free memory
  • Used > total: Impossible—indicates a metric collection issue

How to use:

  • Filter by zone to see specific usage
  • Watch for used memory steadily increasing (memory leak)

Action:

  • If used approaches total, increase zone memory allocation
  • If used is consistently low, decrease allocation to save resources
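A steady climb in used memory is easier to catch if you extrapolate the trend instead of waiting for the gauge to turn red. A sketch using predict_linear, assuming the used and total series carry matching zone labels (the 1-hour lookback and 4-hour horizon are arbitrary starting points):

Zones projected to exceed their allocation within 4 hours (14400 seconds):

predict_linear(zone_memory_used_bytes[1h], 14400) > zone_memory_total_bytes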

Panel 8: Zone Memory Usage %

Type: Gauge (percentage)
Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100

What you see: Percentage bars for each zone.

Interpretation:

  • Green (<70%): Healthy
  • Yellow (70-85%): Monitor closely
  • Red (>85%): Approaching OOM (out-of-memory)

Action:

  • Any zone above 85%: immediate action required
  • Check if the zone needs more memory or if there’s a memory leak
  • Query the pod: kubectl describe pod <pod-name> -n <namespace>

Section 4: Hypervisor Metrics

This is Xen’s view of the world. These panels show how the hypervisor is managing resources.

Panel 9: Hypervisor CPU Usage Rate

Type: Time series
Query: rate(hypervisor_cpu_usage_seconds_total[5m])

What you see: CPU time consumption rate from Xen’s perspective.

Interpretation:

  • A value of 1.0 means one full CPU core is being used
  • This is cumulative across all vCPUs for a zone

How to use:

  • Compare to zone CPU metrics to validate consistency
  • Detect CPU time theft or unexpected usage
  • A zone with 2 vCPUs fully utilized should show ~2.0 here

Action:

  • Discrepancies between zone and hypervisor CPU indicate measurement issues
  • Use this to validate that zones are getting the CPU time they request
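That cross-check can be expressed directly in PromQL. A sketch that assumes both metrics expose a matching zone_id label (adjust the grouping labels to whatever your deployment actually exports):

Hypervisor-reported CPU per zone, converted to core-percent:

sum by (zone_id) (rate(hypervisor_cpu_usage_seconds_total[5m])) * 100

Zone-reported CPU summed across vCPUs, for comparison:

sum by (zone_id) (zone_cpu_usage_percent)

Plotted together, the two series should track each other; a persistent gap points to a measurement or scheduling problem rather than workload behavior.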

Panel 10: Hypervisor Memory Allocation

Type: Time series
Queries:

  • hypervisor_memory_total_bytes
  • hypervisor_memory_outstanding_bytes

What you see: Total allocated and “outstanding” (owed) memory.

Interpretation:

  • Total: What Xen has allocated to zones
  • Outstanding: Memory allocated but not yet provided (ballooning)

Why it matters:

  • Outstanding memory > 0 indicates memory pressure
  • Xen is trying to give zones memory but the host doesn’t have enough
  • Persistent outstanding memory means the node is over-subscribed

Action:

  • If outstanding is consistently >20% of total: reduce zone density or add nodes
  • Check host memory (Section 5) to confirm capacity
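The 20% rule of thumb translates directly into a query you can graph or alert on. A sketch that aggregates per scrape target, since the exact label set on these metrics may vary by deployment:

Outstanding memory above 20% of total on any node:

sum by (instance) (hypervisor_memory_outstanding_bytes)
  / sum by (instance) (hypervisor_memory_total_bytes) > 0.2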

Panel 11: Hypervisor vCPUs Online per Zone

Type: Gauge
Query: hypervisor_vcpus_online

What you see: Number of vCPUs online for each zone.

Interpretation:

  • Should match the zone’s requested CPU count
  • Mismatch indicates configuration or scheduling issue

Action:

  • If vCPUs < expected: check zone creation logs
  • If vCPUs > expected: possible configuration drift

Section 5: Host (Dom0) Metrics

This is the foundation. The physical node running everything.

Panel 12: Host CPU Usage per Core

Type: Time series
Query: host_cpu_usage_percent

What you see: One line per physical CPU core on the node.

Interpretation:

  • All cores low (<50%): Node has CPU headroom
  • All cores high (>80%): Node is saturated
  • Some cores pinned at 100%: Potential CPU pinning or NUMA issues

Action:

  • If all cores are saturated: add more nodes or reduce zone density
  • If usage is uneven: check CPU affinity settings

Panel 13: Host CPU Time by Mode

Type: Stacked area chart
Query: rate(host_cpu_usage_seconds_total[5m])

What you see: CPU time breakdown by mode (user, system, idle, iowait, etc.).

Interpretation:

  • High user time: Application workload (normal)
  • High system time: Kernel/hypervisor overhead (could indicate issues)
  • High iowait: Disk I/O bottleneck
  • High idle: Node is under-utilized

Action:

  • High iowait: investigate disk performance (EBS throttling?)
  • High system time: possible hypervisor overhead or kernel issues
  • High idle: opportunity to increase zone density
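To isolate a single mode from the stacked view, filter the same counter by its mode label. A sketch, assuming the label values follow the usual naming (user, system, idle, iowait):

I/O wait per core over the last 5 minutes:

rate(host_cpu_usage_seconds_total{mode="iowait"}[5m])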

Panel 14: Host Memory Usage

Type: Time series
Queries:

  • host_memory_used_bytes
  • host_memory_free_bytes
  • host_memory_total_bytes

What you see: Memory breakdown for the host node.

Interpretation:

  • Free memory shrinking: Node approaching capacity
  • Used + free ≠ total: Some memory is buffers/cache (normal in Linux)

Action:

  • If free memory < 10%: stop provisioning new zones on this node
  • Check hypervisor outstanding memory (Panel 10) for correlation

Panel 15: Host Memory Usage %

Type: Gauge
Query: (host_memory_used_bytes / host_memory_total_bytes) * 100

What you see: Single percentage bar for host memory.

Interpretation:

  • <80%: Healthy
  • 80-90%: Monitor closely
  • >90%: At capacity—no room for more zones

Action:

  • Above 90%: don’t schedule more zones on this node
  • Correlate with zone memory to understand what’s consuming memory

Section 6: Health & Kubernetes

This section ties everything together with health checks and Kubernetes context.

Panel 16: Health Check Rates by Service

Type: Time series
Query: rate(health_check_total[5m])

What you see: Success and failure rates for each Edera service.

Interpretation:

  • Steady success rate: Services are healthy
  • Spikes in failures: Transient issues
  • Sustained failures: Service degradation

Action:

  • Drill into failed checks: health_check_total{status="failed"}
  • Correlate with service logs

Panel 17: Zone Distribution by Kubernetes Namespace

Type: Pie chart or table
Query: count by (k8s_namespace) (zone_cpu_usage_percent)

What you see: How zones are distributed across namespaces.

Interpretation:

  • Shows resource allocation per team/environment
  • Useful for chargeback or capacity planning

Action:

  • Use to identify which teams are consuming most resources
  • Balance resource allocation across namespaces

Panel 18: Resource Summary Table

Type: Table
Query: Multiple metrics aggregated

What you see: Comprehensive table with zone_id, k8s_pod, CPU, memory, state.

Interpretation:

  • This is your “everything at once” view
  • Sortable by any column

How to use:

  • Sort by CPU to find top consumers
  • Sort by memory % to find zones near limits
  • Filter by namespace or state

Action:

  • Export to CSV for reporting
  • Use as a checklist during incident response

Dashboard Time Range and Refresh

Default time range: Last 1 hour
Default refresh: 30 seconds

Adjusting the time range:

  • Click the time picker in top-right
  • Select preset ranges (5m, 15m, 1h, 6h, 24h, 7d)
  • Or set a custom range

Use cases:

  • Last 5 minutes: Real-time troubleshooting
  • Last 1 hour: Recent trends
  • Last 24 hours: Daily patterns
  • Last 7 days: Weekly trends and capacity planning

Refresh rate:

  • 30s default balances freshness with load
  • Speed up to 10s during active incidents
  • Slow down to 1m for historical analysis

Practical Workflows

Workflow 1: Daily Health Check (2 minutes)

  1. Open the dashboard
  2. Check Section 1 (Overview):
    • Ready Zones count normal?
    • Health status green?
  3. Scroll through sections looking for red/yellow indicators
  4. Done—everything’s healthy
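The same spot check can be scripted if you would rather not open a browser. A sketch against the Prometheus HTTP API, assuming the port-forward from the troubleshooting section is active:

# How many zones are ready right now?
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(zones{state="ready"})'

# Health check success ratio over the last 5 minutes (expect > 0.95)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))'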

Workflow 2: Investigating High CPU (5 minutes)

  1. Section 1: Notice spikes in zone activity
  2. Section 2: Identify which zones have high CPU
    • Sort by highest CPU usage
  3. Note the k8s_pod and k8s_namespace labels
  4. Check Kubernetes: kubectl describe pod <pod> -n <namespace>
  5. Decide: scale up (more vCPUs) or scale out (more replicas)?

Workflow 3: Capacity Planning (10 minutes)

  1. Set time range to “Last 7 days”
  2. Section 5: Check host CPU and memory trends
    • Are we approaching capacity?
  3. Section 2 & 3: Check zone resource utilization
    • Are zones over-provisioned?
  4. Calculate headroom:
    • Host capacity - (sum of zone allocations) = available for new zones
  5. Decision: add nodes, reduce zone sizes, or optimize workloads
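The headroom arithmetic in step 4 can be computed straight from the metrics. A sketch; it ignores Dom0's own memory use, so treat the result as an upper bound:

Memory not yet allocated to zones, cluster-wide (bytes):

sum(host_memory_total_bytes) - sum(zone_memory_total_bytes)

Grouping both sides by instance gives the same headroom per node, assuming both metric families carry that label.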

Workflow 4: Incident Response (During an outage)

  1. Set refresh to 10s, time range to “Last 15 minutes”
  2. Section 1: What changed? Zone count dropped? Health check failed?
  3. Section 6: Which service failed health checks?
  4. Section 5: Host issues? (CPU/memory/disk?)
  5. Section 4: Hypervisor issues? (outstanding memory?)
  6. Section 2 & 3: Zone-level issues? (OOM? CPU starvation?)
  7. Correlate with logs and Kubernetes events
  8. Fix root cause and watch metrics return to normal

Creating Custom Dashboards

The default dashboard is comprehensive, but you may want custom views.

Creating a new dashboard:

  1. In Grafana, click “+” → “Dashboard”
  2. Click “Add visualization”
  3. Select “Prometheus” as the data source
  4. Enter a PromQL query (examples below)
  5. Customize visualization type and settings
  6. Save the panel and dashboard

Useful custom panels:

Top 10 zones by memory usage:

topk(10, (zone_memory_used_bytes / zone_memory_total_bytes) * 100)

Zones by Kubernetes namespace (bar chart):

count(zone_cpu_usage_percent) by (k8s_namespace)

Alert panel for zones >90% memory:

(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90

Cluster-wide CPU usage:

sum(zone_cpu_usage_percent) / count(zone_cpu_usage_percent)

Nodes with low free memory:

(host_memory_free_bytes / host_memory_total_bytes) * 100 < 20

Setting Up Alerts

Dashboards are for humans. Alerts are for automation.

Creating an alert in Grafana:

  1. Edit a panel (or create a new one)
  2. Go to the “Alert” tab
  3. Click “Create alert rule from this panel”
  4. Set conditions (e.g., “WHEN avg() OF query(A) IS ABOVE 90”)
  5. Configure evaluation interval and pending period
  6. Set notification channel (Slack, PagerDuty, email)
  7. Save

Recommended alerts:

Zone memory critical:

  • Query: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
  • Condition: > 90 for 5 minutes
  • Severity: Critical

Host memory critical:

  • Query: (host_memory_used_bytes / host_memory_total_bytes) * 100
  • Condition: > 90 for 5 minutes
  • Severity: Critical

Health check failures:

  • Query: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))
  • Condition: < 0.95 for 5 minutes
  • Severity: Warning

Zones stuck in error state:

  • Query: zones{state="error"}
  • Condition: > 0 for 2 minutes
  • Severity: Warning

CPU saturation:

  • Query: avg(host_cpu_usage_percent{mode="idle"})
  • Condition: < 10 for 10 minutes
  • Severity: Warning
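Before wiring up notification channels, it is worth confirming that each alert expression actually returns data. A sketch using the zone-memory rule above, run against the Prometheus API (assumes the port-forward from the troubleshooting section):

# Zones currently above 90% memory; an empty result means none breach the threshold
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 90'

Temporarily lowering the threshold is an easy way to confirm the rule fires end to end before you rely on it.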

Troubleshooting Common Issues

Issue: Dashboard shows “No data”

Possible causes:

  • Prometheus isn’t scraping metrics
  • Time range is wrong
  • Metrics query is incorrect

Debug steps:

# Check Prometheus targets
kubectl port-forward -n edera-monitoring svc/prometheus 9090:9090
# Navigate to http://localhost:9090/targets
# All nodes should be "UP"

# Test a simple query in Prometheus UI
zones{state="ready"}

# If this returns data, the issue is in Grafana
# If not, Prometheus isn't collecting metrics

Issue: Only some nodes showing metrics

Possible causes:

  • Edera not running on some nodes
  • Firewall blocking port 3035
  • Node labels preventing scraping

Debug steps:

# Check if Edera is running on all nodes
kubectl get nodes
kubectl get pods -A -o wide | grep edera

# Test metrics endpoint directly
NODE_IP=<node-ip>
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://$NODE_IP:3035/metrics

# If this fails, Edera isn't exposing metrics

Issue: Dashboard is slow or timing out

Possible causes:

  • Too many zones (high cardinality)
  • Long time range with small interval
  • Inefficient queries

Solutions:

  • Reduce time range (use last 1 hour instead of last 7 days)
  • Increase refresh interval (1m instead of 10s)
  • Use recording rules to pre-compute expensive queries
  • Increase Prometheus resources (CPU/memory)
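Recording rules are configured on the Prometheus side rather than in Grafana. A minimal sketch of one for the zone-memory percentage, assuming you can add a rules file to your Prometheus configuration (the group name, rule name, and file name are illustrative):

# Write a rules file; how it gets loaded depends on your Prometheus setup
# (ConfigMap, operator CRD, or a file on disk)
cat <<'EOF' > edera-recording-rules.yaml
groups:
  - name: edera-zones
    rules:
      - record: zone:memory_usage:percent
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100
EOF

Dashboards and alerts can then query zone:memory_usage:percent instead of re-evaluating the expression on every refresh.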

Issue: Metrics are delayed

Possible causes:

  • Scrape interval too long
  • Prometheus overloaded
  • Network latency

Debug steps:

# Check Prometheus scrape duration
# In Prometheus UI, query:
scrape_duration_seconds{job="edera"}

# Values >5s indicate scraping is slow
# Consider reducing scrape interval or increasing Prometheus resources

Next Steps

You’ve now learned to:

  • Deploy a production-ready Prometheus and Grafana stack
  • Understand the three-layer metric hierarchy
  • Navigate the 18-panel dashboard
  • Interpret CPU, memory, and health metrics
  • Troubleshoot issues using metrics
  • Create custom dashboards and alerts

Operational best practices:

  1. Check the dashboard daily - 2-minute health check
  2. Set up critical alerts - Don’t rely on manual monitoring
  3. Review trends weekly - Capacity planning and optimization
  4. Correlate with Kubernetes - Metrics + events + logs = full picture
  5. Iterate on thresholds - Tune alerts to reduce noise

Congratulations! You’ve completed Module 6. You now have a comprehensive monitoring solution for your Edera deployment and the knowledge to use it effectively.

Your Edera journey continues:

  • Module 1-3: Understand the “why” and “what” of Edera
  • Module 4: Get Edera running (installation and first deployment)
  • Module 5: Observability and troubleshooting
  • Module 6 (this module): Production operations and monitoring ✅

You’re now equipped to run Edera in production with full visibility into your microVM workloads.
