Understanding Edera Metrics
Traditional container monitoring is flat: you have a node, you have containers, you measure CPU and memory. Done.
Edera’s architecture is hierarchical. Every workload runs in a microVM (zone) managed by a hypervisor (Xen) on a host node (Dom0). To properly understand what’s happening, you need visibility at all three layers:
┌────────────────────────────────────────────┐
│ Host Layer (Dom0)                          │
│  - Physical node resources                 │
│  - Total CPU, memory, disk                 │
│  - System-level health                     │
│                                            │
│ ┌──────────────────────────────────────┐   │
│ │ Hypervisor Layer (Xen)               │   │
│ │  - Resource allocation to zones      │   │
│ │  - vCPU scheduling                   │   │
│ │  - Memory ballooning                 │   │
│ │                                      │   │
│ │ ┌──────────────┐  ┌──────────────┐   │   │
│ │ │ Zone Layer   │  │ Zone Layer   │   │   │
│ │ │              │  │              │   │   │
│ │ │ - Per-zone   │  │ - Per-zone   │   │   │
│ │ │   CPU usage  │  │   CPU usage  │   │   │
│ │ │ - Per-zone   │  │ - Per-zone   │   │   │
│ │ │   memory     │  │   memory     │   │   │
│ │ └──────────────┘  └──────────────┘   │   │
│ └──────────────────────────────────────┘   │
└────────────────────────────────────────────┘

This isn’t just academic. Each layer tells you something different:
- Zone metrics answer: “Is this workload healthy and within resource limits?”
- Hypervisor metrics answer: “How is Xen managing resource allocation?”
- Host metrics answer: “Does this node have capacity for more zones?”
Let’s break down what metrics are available at each layer and what they actually mean.
Metric Naming Convention
All Edera metrics follow a consistent naming pattern:
- zone_* - Zone (microVM) level metrics
- hypervisor_* - Xen hypervisor metrics
- host_* - Dom0 (host) system metrics
- health_* - Health check metrics
Metrics are exposed in Prometheus format at http://NODE_IP:3035/metrics.
Zone-Level Metrics
Zone metrics track individual microVM resource consumption. Each zone represents one container workload running in isolation.
zone_cpu_usage_percent
Type: Gauge
Labels: zone_id, cpu, k8s_pod, k8s_namespace
CPU usage percentage for each vCPU in a zone.
Example:
zone_cpu_usage_percent{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", cpu="0", k8s_pod="nginx-abc123", k8s_namespace="default"} 45.3
zone_cpu_usage_percent{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", cpu="1", k8s_pod="nginx-abc123", k8s_namespace="default"} 12.7

What it means: Zone a1b2c3d4 is using 45.3% of vCPU 0 and 12.7% of vCPU 1.
Why it matters:
- Identify CPU-bound workloads
- Detect unbalanced CPU allocation (some cores maxed, others idle)
- Right-size zone vCPU allocation
Alert threshold: Consider alerting if any vCPU consistently exceeds 90%.
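If you manage alerts as Prometheus rule files, a minimal sketch of that threshold might look like the following. The rule name, severity label, and 10-minute duration are illustrative choices, not Edera defaults:

groups:
  - name: edera-zone-cpu          # hypothetical group name
    rules:
      - alert: ZoneVcpuSaturated
        expr: zone_cpu_usage_percent > 90
        for: 10m                  # "consistently" interpreted here as 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "vCPU {{ $labels.cpu }} in zone {{ $labels.zone_id }} ({{ $labels.k8s_pod }}) above 90%"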
zone_memory_used_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Current memory usage in bytes for a zone.
Example:
zone_memory_used_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 536870912

What it means: Zone is using 512 MiB (536870912 / 1024 / 1024).
zone_memory_total_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Total memory allocated to a zone.
Example:
zone_memory_total_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 1073741824

What it means: Zone has 1 GiB total memory available.
Combined usage:
(zone_memory_used_bytes / zone_memory_total_bytes) * 100

This gives you memory usage percentage. If this approaches 100%, the zone is at memory capacity.
Why it matters:
- Detect memory pressure before OOM kills occur
- Right-size zone memory allocation
- Identify memory leaks
Alert threshold: Alert if memory usage exceeds 85% for more than 5 minutes.
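A corresponding Prometheus alerting rule for the 85%-for-5-minutes threshold could be sketched as follows (rule and group names are illustrative):

groups:
  - name: edera-zone-memory       # hypothetical group name
    rules:
      - alert: ZoneMemoryPressure
        expr: (zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Zone {{ $labels.zone_id }} ({{ $labels.k8s_pod }}) above 85% memory for 5 minutes"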
zones
Type: Gauge
Labels: state
Count of zones in each state.
Example:
zones{state="ready"} 47
zones{state="starting"} 2
zones{state="stopping"} 1
zones{state="error"} 0What it means: You have 47 zones in ready state, 2 starting, 1 stopping, 0 in error.
Why it matters:
- Overall cluster health at a glance
- Track zone lifecycle (starting/stopping should be transient)
- Alert on persistent error states
Alert threshold: Alert if zones{state="error"} > 0 for more than 1 minute.
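Expressed as a Prometheus rule, that threshold might look like this (names are illustrative):

groups:
  - name: edera-zone-state        # hypothetical group name
    rules:
      - alert: ZonesInErrorState
        expr: zones{state="error"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} zone(s) have been in error state for over 1 minute"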
Hypervisor-Level Metrics
Hypervisor metrics expose Xen’s internal state. This is where you see how resources are being managed at the virtualization layer.
hypervisor_cpu_usage_seconds_total
Type: Counter
Labels: zone_id, k8s_pod, k8s_namespace
Cumulative CPU time consumed by a zone, as measured by Xen.
Example:
hypervisor_cpu_usage_seconds_total{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 3742.56

What it means: This zone has consumed 3742.56 CPU-seconds since creation.
Use with rate():
rate(hypervisor_cpu_usage_seconds_total[5m])

This gives you the CPU time rate over the last 5 minutes. A value of 1.0 means one full CPU core is being used.
Why it matters:
- Accurate CPU accounting from the hypervisor perspective
- Detect CPU time theft or unexpected usage
- Validate zone CPU limits
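For example, to see how many full CPU cores each pod is consuming according to Xen's accounting, you can aggregate the rate per pod using the labels shown above:

sum by (k8s_namespace, k8s_pod) (rate(hypervisor_cpu_usage_seconds_total[5m]))

A result of 2.0 for a pod means it is burning the equivalent of two full cores.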
hypervisor_memory_total_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Total memory allocated to a zone from Xen’s perspective.
Example:
hypervisor_memory_total_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 1073741824

What it means: Xen has allocated 1 GiB to this zone.
hypervisor_memory_outstanding_bytes
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Memory that has been allocated but not yet fully provided (memory ballooning).
Example:
hypervisor_memory_outstanding_bytes{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 104857600

What it means: 100 MiB is “owed” to the zone but not yet provided.
Why it matters:
- Xen uses memory ballooning to dynamically adjust memory allocation
- High outstanding memory indicates memory pressure on the host
- If outstanding memory is persistently high, the node may be over-subscribed
Alert threshold: Alert if outstanding memory exceeds 20% of total for extended periods.
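As a PromQL expression, that ratio check could look like this (20% written as a fraction):

(hypervisor_memory_outstanding_bytes / hypervisor_memory_total_bytes) > 0.20

Wrap it in an alerting rule with a for: duration of your choosing to capture “extended periods”.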
hypervisor_vcpus_online
Type: Gauge
Labels: zone_id, k8s_pod, k8s_namespace
Number of vCPUs currently online for a zone.
Example:
hypervisor_vcpus_online{zone_id="a1b2c3d4-1234-5678-abcd-ef1234567890", k8s_pod="nginx-abc123", k8s_namespace="default"} 2

What it means: This zone has 2 vCPUs online.
Why it matters:
- Verify zones have the expected number of CPUs
- Detect CPU hotplug events
- Troubleshoot zones not utilizing all allocated vCPUs
Host-Level Metrics
Host metrics measure the underlying physical node (Dom0). This is the foundation everything else runs on.
host_cpu_usage_percent
Type: Gauge
Labels: cpu, mode
CPU usage percentage for each physical core, broken down by mode.
Example:
host_cpu_usage_percent{cpu="0", mode="user"} 34.2
host_cpu_usage_percent{cpu="0", mode="system"} 12.5
host_cpu_usage_percent{cpu="0", mode="idle"} 52.1
host_cpu_usage_percent{cpu="0", mode="iowait"} 1.2What it means: Physical CPU 0 is spending 34.2% in user mode, 12.5% in system/kernel mode, 52.1% idle, 1.2% waiting for I/O.
Why it matters:
- Identify system-level bottlenecks
- High iowait indicates disk I/O bottlenecks
- High system mode suggests kernel/hypervisor overhead
- Low idle means the node is saturated
Alert threshold: Alert if idle < 10% across all cores for more than 5 minutes.
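A sketch of that alert as a Prometheus rule, using the average idle percentage across cores (one reasonable reading of “across all cores”; names are illustrative):

groups:
  - name: edera-host-cpu          # hypothetical group name
    rules:
      - alert: HostCpuSaturated
        expr: avg(host_cpu_usage_percent{mode="idle"}) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host CPU idle below 10% for 5 minutes"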
host_cpu_usage_seconds_total
Type: Counter
Labels: cpu, mode
Cumulative CPU time in seconds for each mode.
Example:
host_cpu_usage_seconds_total{cpu="0", mode="user"} 123456.78
host_cpu_usage_seconds_total{cpu="0", mode="system"} 45678.90Use with rate():
rate(host_cpu_usage_seconds_total{mode="system"}[5m])

This shows the rate of system CPU time, useful for detecting kernel overhead spikes.
host_memory_used_bytes
Type: Gauge
Total memory used on the host.
Example:
host_memory_used_bytes 12884901888

What it means: Host is using 12 GiB of RAM.
host_memory_free_bytes
Type: Gauge
Free memory available on the host.
Example:
host_memory_free_bytes 4294967296

What it means: Host has 4 GiB free.
host_memory_total_bytes
Type: Gauge
Total physical memory on the host.
Example:
host_memory_total_bytes 17179869184

What it means: Host has 16 GiB total memory.
Combined usage:
(host_memory_used_bytes / host_memory_total_bytes) * 100

This gives host memory usage percentage.
Why it matters:
- Determine if the node has capacity for more zones
- Detect memory pressure at the host level
- Plan for node scaling
Alert threshold: Alert if host memory usage exceeds 90%.
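Sketched as a Prometheus rule (the name and 5-minute duration are illustrative; the text only specifies the 90% threshold):

groups:
  - name: edera-host-memory       # hypothetical group name
    rules:
      - alert: HostMemoryHigh
        expr: (host_memory_used_bytes / host_memory_total_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host memory usage above 90%"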
Health Check Metrics
Health metrics track the overall system health from Edera’s perspective.
health_check_total
Type: Counter
Labels: status, service
Count of health checks by status (success/failed) and service.
Example:
health_check_total{status="success", service="zone_manager"} 14523
health_check_total{status="failed", service="zone_manager"} 7What it means: The zone manager has had 14,523 successful health checks and 7 failures.
Use with rate():
rate(health_check_total{status="success"}[5m]) / rate(health_check_total[5m])This gives you the health check success rate over 5 minutes.
Why it matters:
- Early warning of system degradation
- Track service availability
- Correlate health check failures with other metrics
Alert threshold: Alert if success rate drops below 95%.
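Using the success-rate query shown above, one possible alerting rule (names and the 5-minute window are illustrative):

groups:
  - name: edera-health            # hypothetical group name
    rules:
      - alert: HealthCheckDegraded
        expr: sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m])) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Edera health check success rate below 95%"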
Label Enrichment
All zone-level metrics include Kubernetes labels:
- zone_id - Unique zone identifier (UUID)
- k8s_pod - Kubernetes pod name
- k8s_namespace - Kubernetes namespace
This lets you correlate Edera metrics with Kubernetes resources. You can group by namespace to see per-team resource usage, or filter by pod to troubleshoot specific workloads.
Example query:
sum(zone_cpu_usage_percent) by (k8s_namespace)

This shows total CPU usage across all zones, grouped by Kubernetes namespace.
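The same grouping works for any zone-level metric. For example, per-namespace memory consumption:

sum(zone_memory_used_bytes) by (k8s_namespace)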
Metric Collection Frequency
Edera exposes metrics in real-time, but Prometheus scrapes them at intervals:
- Default scrape interval: 10 seconds
- Evaluation interval: 15 seconds
This means:
- Metrics are updated every 10 seconds
- Alerting rules are evaluated every 15 seconds
- You can use rate() and increase() functions with windows as small as 30-60 seconds
For large clusters (100+ nodes), consider increasing scrape intervals to reduce load:
- 30 seconds for production monitoring
- 60 seconds for historical/trend analysis
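If you manage the Prometheus configuration yourself, the interval is set per scrape job. A minimal sketch using Kubernetes node discovery might look like this; the job name and relabeling are assumptions, so adjust them to however your Prometheus actually discovers Edera nodes:

scrape_configs:
  - job_name: edera               # hypothetical job name
    scrape_interval: 30s          # raised from the 10s default for large clusters
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Point the scrape at the Edera metrics port on each node's internal IP.
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: __address__
        replacement: "$1:3035"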
Understanding the Data Flow
Here’s how metrics flow from Edera to your dashboard:
1. Edera exposes metrics
↓
http://NODE_IP:3035/metrics
(Prometheus text format)
2. Prometheus scrapes
↓
Every 10s, Kubernetes service discovery
finds nodes and scrapes metrics
3. Prometheus stores
↓
Time-series database with labels
(30-day retention by default)
4. Grafana queries
↓
PromQL queries fetch data
from Prometheus API
5. Dashboard displays
↓
18 panels visualize metrics
across all three layers

Common Metric Queries
Here are some useful PromQL queries for day-to-day operations:
Total zones running:
sum(zones{state="ready"})Average CPU usage per zone:
avg by (zone_id, k8s_pod) (zone_cpu_usage_percent)Zones using >80% memory:
(zone_memory_used_bytes / zone_memory_total_bytes) * 100 > 80Top 5 zones by CPU:
topk(5, avg by (zone_id, k8s_pod) (zone_cpu_usage_percent))Host CPU idle time:
avg(host_cpu_usage_percent{mode="idle"})Memory available for new zones:
host_memory_free_bytes - (sum(hypervisor_memory_total_bytes) - host_memory_used_bytes)Health check success rate (last 5 minutes):
sum(rate(health_check_total{status="success"}[5m])) / sum(rate(health_check_total[5m]))Metrics You Won’t See (And Why)
Coming from traditional container monitoring, you might expect certain metrics that aren’t present in Edera:
Container-level process metrics: Each zone is an isolated VM with its own kernel. Edera doesn’t expose per-process metrics from inside zones—that’s the zone’s responsibility. Use in-zone monitoring (like Prometheus node_exporter running inside the workload) if you need process-level detail.
Network traffic metrics: Currently not exposed. Network monitoring should be done at the Kubernetes CNI layer or with tools like Cilium/Hubble.
Disk I/O metrics: Not currently exposed. Use host-level monitoring tools or cloud provider metrics (CloudWatch for EBS).
Why the limitation? Edera’s metrics focus on the virtualization layer: resource allocation, isolation health, and system state. Application-level metrics belong inside the zone or at the orchestration layer.
Metric Retention and Storage
- Prometheus retention: 30 days by default (configurable)
- Storage: 50 GB persistent volume
- Cardinality: Depends on cluster size
Cardinality estimate:
- 10 metrics per zone × 3 labels = ~30 time series per zone
- 100 zones = ~3,000 time series
- 1,000 zones = ~30,000 time series
For large clusters, consider:
- Increasing the scrape interval (e.g., 60 seconds)
- Using remote write to long-term storage (Thanos, Cortex, Mimir)
- Implementing recording rules for expensive queries
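Recording rules precompute expensive aggregations at write time. A sketch for the per-namespace CPU query used earlier (the group name is hypothetical; the rule names follow the conventional level:metric:operation pattern but are otherwise arbitrary):

groups:
  - name: edera-recording-rules   # hypothetical group name
    rules:
      - record: namespace:zone_cpu_usage_percent:sum
        expr: sum(zone_cpu_usage_percent) by (k8s_namespace)
      - record: zone:memory_usage:ratio
        expr: zone_memory_used_bytes / zone_memory_total_bytes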
Now you understand what every metric means and why it matters. You know the three-layer hierarchy and how to query data at each level.
Next: Let’s put this knowledge to use by navigating the Grafana dashboard →
