Understanding Edera Metrics
Traditional container monitoring is flat: you have a node, you have containers, you measure CPU and memory. Done.
Edera’s architecture is hierarchical. Every workload runs in a microVM (zone) managed by a hypervisor (Xen) on a host node (Dom0). To properly understand what’s happening, you need visibility at all three layers:
┌────────────────────────────────────────────┐
│ Host Layer (Dom0)                          │
│  - Physical node resources                 │
│  - Total CPU, memory, disk                 │
│  - System-level health                     │
│                                            │
│ ┌──────────────────────────────────────┐   │
│ │ Hypervisor Layer (Xen)               │   │
│ │  - Resource allocation to zones      │   │
│ │  - vCPU scheduling                   │   │
│ │  - Memory ballooning                 │   │
│ │                                      │   │
│ │ ┌──────────────┐  ┌──────────────┐   │   │
│ │ │ Zone Layer   │  │ Zone Layer   │   │   │
│ │ │              │  │              │   │   │
│ │ │ - Per-zone   │  │ - Per-zone   │   │   │
│ │ │   CPU usage  │  │   CPU usage  │   │   │
│ │ │ - Per-zone   │  │ - Per-zone   │   │   │
│ │ │   memory     │  │   memory     │   │   │
│ │ └──────────────┘  └──────────────┘   │   │
│ └──────────────────────────────────────┘   │
└────────────────────────────────────────────┘

This isn’t just academic. Each layer tells you something different:
- Zone metrics answer: “Is this workload healthy and within resource limits?”
- Hypervisor metrics answer: “How is Xen managing resource allocation?”
- Host metrics answer: “Does this node have capacity for more zones?”
Let’s break down what metrics are available at each layer and what they actually mean.
Metric Naming Convention
All Edera metrics follow a consistent naming pattern:
- zone_* - Zone (microVM) level metrics
- hypervisor_* - Xen hypervisor metrics
- host_* - Dom0 (host) system metrics
- health_* - Health check metrics
Metrics are exposed in Prometheus format at http://NODE_IP:3035/metrics.
Zone-Level Metrics
Zone metrics track individual microVM resource consumption. Each zone represents one container workload running in isolation.
zone_cpu_usage_percent
Type: Gauge
Labels: zone_id, cpu, k8s_pod, k8s_namespace
CPU usage percentage for each vCPU in a zone.
Example:
zone_cpu_usage_percent{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", cpu="0", k8s_pod="nginx-abc123", k8s_namespace="default"} 45.3
zone_cpu_usage_percent{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", cpu="1", k8s_pod="nginx-abc123", k8s_namespace="default"} 12.7

What it means: Zone a1b2c3d4 is using 45.3% of vCPU 0 and 12.7% of vCPU 1.
Why it matters:
- Identify CPU-bound workloads
- Detect unbalanced CPU allocation (some cores maxed, others idle)
- Right-size zone vCPU allocation
Alert threshold: Consider alerting if any vCPU consistently exceeds 90%.
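If you manage alerts as Prometheus rule files, a minimal sketch of that threshold might look like the following. The rule name, severity label, and 10-minute duration are illustrative choices, not Edera defaults:

groups:
  - name: edera-zone-cpu          # hypothetical group name
    rules:
      - alert: ZoneVcpuSaturated
        expr: zone_cpu_usage_percent > 90
        for: 10m                  # "consistently" interpreted here as 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "vCPU {{ $labels.cpu }} in zone {{ $labels.zone_id }} ({{ $labels.k8s_pod }}) above 90%"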
zone_memory_used_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Current memory usage in bytes for a zone.
Example:
zone_memory_used_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 536870912

What it means: Zone is using 512 MiB (536870912 / 1024 / 1024).
zone_memory_total_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Total memory allocated to a zone.
Example:
zone_memory_total_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 1073741824

What it means: Zone has 1 GiB total memory available.
Combined usage:
(zone_memory_used_bytes / zone_memory_total_bytes) * 100

This gives you memory usage percentage. If this approaches 100%, the zone is at memory capacity.
Why it matters:
- Detect memory pressure before OOM kills occur
- Right-size zone memory allocation
- Identify memory leaks
Alert threshold: Alert if memory usage exceeds 85% for more than 5 minutes.
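A corresponding Prometheus alerting rule for the 85%-for-5-minutes threshold could be sketched as follows (rule and group names are illustrative):

groups:
  - name: edera-zone-memory       # hypothetical group name
    rules:
      - alert: ZoneMemoryPressure
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Zone {{ $labels.zone_id }} ({{ $labels.k8s_pod }}) above 85% memory for 5 minutes"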
zones
Type: Gauge
Labels: state
Count of zones in each state.
Example:
zones{state="ready"} 47
zones{state="starting"} 2
zones{state="stopping"} 1
zones{state="error"} 0What it means: You have 47 zones in ready state, 2 starting, 1 stopping, 0 in error.
Why it matters:
- Overall cluster health at a glance
- Track zone lifecycle (starting/stopping should be transient)
- Alert on persistent error states
Alert threshold: Alert if zones{state="error"} > 0 for more than 1 minute.
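Expressed as a Prometheus rule, that threshold might look like this (names are illustrative):

groups:
  - name: edera-zone-state        # hypothetical group name
    rules:
      - alert: ZonesInErrorState
        expr: zones{state="error"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} zone(s) have been in error state for over 1 minute"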
Hypervisor-Level Metrics
Hypervisor metrics expose Xen’s internal state. This is where you see how resources are being managed at the virtualization layer.
hypervisor_cpu_usage_seconds_total
Type: Counter
Labels: zone_id, k8s_pod, k8s_namespace
Cumulative CPU time consumed by a zone, as measured by Xen.
Example:
hypervisor_cpu_usage_seconds_total{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 3742.56

What it means: This zone has consumed 3742.56 CPU-seconds since creation.
Use with rate():
rate(hypervisor_cpu_usage_seconds_total[5m])

This gives you the CPU time rate over the last 5 minutes. A value of 1.0 means one full CPU core is being used.
Why it matters:
- Accurate CPU accounting from the hypervisor perspective
- Detect CPU time theft or unexpected usage
- Validate zone CPU limits
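For example, to see how many full CPU cores each pod is consuming according to Xen's accounting, you can aggregate the rate per pod using the labels shown above:

sum by (k8s_namespace, k8s_pod) (rate(hypervisor_cpu_usage_seconds_total[5m]))

A result of 2.0 for a pod means it is burning the equivalent of two full cores.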
hypervisor_memory_total_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Total memory allocated to a zone from Xen’s perspective.
Example:
hypervisor_memory_total_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 1073741824

What it means: Xen has allocated 1 GiB to this zone.
hypervisor_memory_outstanding_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Memory that has been allocated but not yet fully provided (memory ballooning).
Example:
hypervisor_memory_outstanding_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 104857600

What it means: 100 MiB is “owed” to the zone but not yet provided.
Why it matters:
- Xen uses memory ballooning to dynamically adjust memory allocation
- High outstanding memory indicates memory pressure on the host
- If outstanding memory is persistently high, the node may be over-subscribed
Alert threshold: Alert if outstanding memory exceeds 20% of total for extended periods.
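As a PromQL expression, that ratio check could look like this (20% written as a fraction):

(hypervisor_memory_outstanding_bytes / hypervisor_memory_total_bytes) > 0.20

Wrap it in an alerting rule with a for: duration of your choosing to capture “extended periods”.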
hypervisor_vcpus_online
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Number of vCPUs currently online for a zone.
Example:
hypervisor_vcpus_online{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 2

What it means: This zone has 2 vCPUs online.
Why it matters:
- Verify zones have the expected number of CPUs
- Detect CPU hotplug events
- Troubleshoot zones not utilizing all allocated vCPUs
Host-Level Metrics
Host metrics measure the underlying physical node (Dom0). This is the foundation everything else runs on.
host_cpu_usage_percent
Type: Gauge
Labels: cpu, mode
CPU usage percentage for each physical core, broken down by mode.
Example:
host_cpu_usage_percent{cpu="0", mode="user"} 34.2
host_cpu_usage_percent{cpu="0", mode="system"} 12.5
host_cpu_usage_percent{cpu="0", mode="idle"} 52.1
host_cpu_usage_percent{cpu="0", mode="iowait"} 1.2What it means: Physical CPU 0 is spending 34.2% in user mode, 12.5% in system/kernel mode, 52.1% idle, 1.2% waiting for I/O.
Why it matters:
- Identify system-level bottlenecks
- High iowait indicates disk I/O bottlenecks
- High system mode suggests kernel/hypervisor overhead
- Low idle means the node is saturated
Alert threshold: Alert if idle < 10% across all cores for more than 5 minutes.
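A sketch of that alert as a Prometheus rule, using the average idle percentage across cores (one reasonable reading of “across all cores”; names are illustrative):

groups:
  - name: edera-host-cpu          # hypothetical group name
    rules:
      - alert: HostCpuSaturated
        expr: avg(host_cpu_usage_percent{mode="idle"}) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host CPU idle below 10% for 5 minutes"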
host_cpu_usage_seconds_total
Type: Counter
Labels: cpu, mode
Cumulative CPU time in seconds for each mode.
Example:
host_cpu_usage_seconds_total{cpu="0", mode="user"} 123456.78
host_cpu_usage_seconds_total{cpu="0", mode="system"} 45678.90Use with rate():
rate(host_cpu_usage_seconds_total{mode="system"}[5m])

This shows the rate of system CPU time, useful for detecting kernel overhead spikes.
host_memory_used_bytes
Type: Gauge
Total memory used on the host.
Example:
host_memory_used_bytes 12884901888

What it means: Host is using 12 GiB of RAM.
host_memory_free_bytes
Type: Gauge
Free memory available on the host.
Example:
host_memory_free_bytes 4294967296

What it means: Host has 4 GiB free.
host_memory_total_bytes
Type: Gauge
Total physical memory on the host.
Example:
host_memory_total_bytes 17179869184

What it means: Host has 16 GiB total memory.
Combined usage:
(host_memory_used_bytes / host_memory_total_bytes) * 100

This gives host memory usage percentage.
Why it matters:
- Determine if the node has capacity for more zones
- Detect memory pressure at the host level
- Plan for node scaling
Alert threshold: Alert if host memory usage exceeds 90%.
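Sketched as a Prometheus rule (the name and 5-minute duration are illustrative; the text only specifies the 90% threshold):

groups:
  - name: edera-host-memory       # hypothetical group name
    rules:
      - alert: HostMemoryHigh
        expr: (host_memory_used_bytes / host_memory_total_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host memory usage above 90%"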
Health Check Metrics
Health metrics track the overall system health from Edera’s perspective.
health_check_total
Type: Counter
Labels: status, service
Count of health checks by status (success/failed) and service.
Example:
health_check_total{status="success", service="zone_manager"} 14523
health_check_total{status="failed", service="zone_manager"} 7What it means: The zone manager has had 14,523 successful health checks and 7 failures.
Use with rate():
rate(health_check_total{status="success"}[5m]) / rate(health_check_total[5m])This gives you the health check success rate over 5 minutes.
Why it matters:
- Early warning of system degradation
- Track service availability
- Correlate health check failures with other metrics
Alert threshold: Alert if success rate drops below 95%.
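Using the success-rate query shown above, one possible alerting rule (names and the 5-minute window are illustrative):

groups:
  - name: edera-health            # hypothetical group name
    rules:
      - alert: HealthCheckDegraded
        expr: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m])) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edera health check success rate below 95%"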
Label Enrichment
All zone-level metrics include Kubernetes labels:
- zone_id - Unique zone identifier (UUID)
- k8s_pod - Kubernetes pod name
- k8s_namespace - Kubernetes namespace
This lets you correlate Edera metrics with Kubernetes resources. You can group by namespace to see per-team resource usage, or filter by pod to troubleshoot specific workloads.
Example query:
sum(zone_cpu_usage_percent) by (k8s_namespace)

This shows total CPU usage across all zones, grouped by Kubernetes namespace.
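The same grouping works for any zone-level metric. For example, per-namespace memory consumption:

sum(zone_memory_used_bytes) by (k8s_namespace)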
Metric Collection Frequency
Edera exposes metrics in real-time, but Prometheus scrapes them at intervals:
- Default scrape interval: 10 seconds
- Evaluation interval: 15 seconds
This means:
- Metrics are updated every 10 seconds
- Alerting rules are evaluated every 15 seconds
- You can use rate() and increase() functions with windows as small as 30-60 seconds
For large clusters (100+ nodes), consider increasing scrape intervals to reduce load:
- 30 seconds for production monitoring
- 60 seconds for historical/trend analysis
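If you manage the Prometheus configuration yourself, the interval is set per scrape job. A minimal sketch using Kubernetes node discovery might look like this; the job name and relabeling are assumptions, so adjust them to however your Prometheus actually discovers Edera nodes:

scrape_configs:
  - job_name: edera               # hypothetical job name
    scrape_interval: 30s          # raised from the 10s default for large clusters
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Point the scrape at the Edera metrics port on each node's internal IP.
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: __address__
        replacement: "$1:3035"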
Understanding the Data Flow
Here’s how metrics flow from Edera to your dashboard:
1. Edera exposes metrics
↓
http://NODE_IP:3035/metrics
(Prometheus text format)
2. Prometheus scrapes
↓
Every 10s, Kubernetes service discovery
finds nodes and scrapes metrics
3. Prometheus stores
↓
Time-series database with labels
(30-day retention by default)
4. Grafana queries
↓
PromQL queries fetch data
from Prometheus API
5. Dashboard displays
↓
18 panels visualize metrics
across all three layers

Common Metric Queries
Here are some useful PromQL queries for day-to-day operations:
Total zones running:
sum(zones{state="ready"})Average CPU usage per zone:
avg by (zone_id, k8s_pod) (zone_cpu_usage_percent)Zones using >80% memory:
(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 80Top 5 zones by CPU:
topk(5, avg by (zone_id, k8s_pod) (zone_cpu_usage_percent))Host CPU idle time:
avg(host_cpu_usage_percent{mode="idle"})Memory available for new zones:
host_memory_free_bytes - (sum(hypervisor_memory_total_bytes) - host_memory_used_bytes)Health check success rate (last 5 minutes):
sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))Metrics You Won’t See (And Why)
Coming from traditional container monitoring, you might expect certain metrics that aren’t present in Edera:
Container-level process metrics: Each zone is an isolated VM with its own kernel. Edera doesn’t expose per-process metrics from inside zones—that’s the zone’s responsibility. Use in-zone monitoring (like Prometheus node_exporter running inside the workload) if you need process-level detail.
Network traffic metrics: Currently not exposed. Network monitoring should be done at the Kubernetes CNI layer or with tools like Cilium/Hubble.
Disk I/O metrics: Not currently exposed. Use host-level monitoring tools or cloud provider metrics (CloudWatch for EBS).
Why the limitation? Edera’s metrics focus on the virtualization layer: resource allocation, isolation health, and system state. Application-level metrics belong inside the zone or at the orchestration layer.
Metric Retention and Storage
- Prometheus retention: 30 days by default (configurable)
- Storage: 50 GB persistent volume
- Cardinality: Depends on cluster size
Cardinality estimate:
- 10 metrics per zone × 3 labels = ~30 time series per zone
- 100 zones = ~3,000 time series
- 1,000 zones = ~30,000 time series
For large clusters, consider:
- Increasing the scrape interval (e.g., 60 seconds)
- Using remote write to long-term storage (Thanos, Cortex, Mimir)
- Implementing recording rules for expensive queries
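Recording rules precompute expensive aggregations at write time. A sketch for the per-namespace CPU query used earlier (the group name is hypothetical; the rule names follow the conventional level:metric:operation pattern but are otherwise arbitrary):

groups:
  - name: edera-recording-rules   # hypothetical group name
    rules:
      - record: namespace:zone_cpu_usage_percent:sum
        expr: sum(zone_cpu_usage_percent) by (k8s_namespace)
      - record: zone:memory_usage:ratio
        expr: zone_memory_used_bytes / zone_memory_total_bytes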
Now you understand what every metric means and why it matters. You know the three-layer hierarchy and how to query data at each level.
Next: Let’s put this knowledge to use by navigating the Grafana dashboard →
