Metrics Collection & Reporting

The internal/metrics package collects node metrics (system resources, tunnel health, peer latency) and reports them to the control plane via POST /v1/nodes/{node_id}/metrics. All data sources are abstracted behind injectable interfaces for testability.

The Manager runs a single goroutine driving two independent tickers: one for collection and one for reporting. Collected metrics are buffered in memory and flushed to the control plane at the configured report interval.

Metric Groups

plexd collects the following metric groups at the configured collect_interval:

| Metric Group | Data Points |
| --- | --- |
| node_resources | CPU usage (%), memory used/total, disk used/total, load average |
| tunnel_health | Per-peer handshake age, TX/RX bytes, packet loss (%), last handshake timestamp |
| peer_latency | Per-peer RTT (ms) via ICMP echo over the mesh interface |
| agent_stats | plexd goroutine count, heap memory, GC stats, uptime, reconnect count |

Metrics are delivered to the control plane as batch POST requests (JSON array, gzip-compressed) at report_interval (default 60s) or when batch_size (default 100 data points) is reached, whichever comes first. There is no local Prometheus or OpenTelemetry exposition endpoint — all metrics flow exclusively to the control plane.

Config

Config holds metrics collection and reporting parameters.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| Enabled | bool | true | Whether metrics collection is active |
| CollectInterval | time.Duration | 15s | Interval between collection cycles (min 5s) |
| ReportInterval | time.Duration | 60s | Interval between reports to the control plane (min 10s) |
| BatchSize | int | 100 | Max metric points per report batch (must be > 0) |
| LocalEndpoint | api.LocalEndpointConfig | (zero) | Optional local endpoint for dual-destination delivery (see below) |
```go
cfg := metrics.Config{}
cfg.ApplyDefaults() // Enabled=true, CollectInterval=15s, ReportInterval=60s, BatchSize=100
if err := cfg.Validate(); err != nil {
    log.Fatal(err)
}
```

ApplyDefaults sets Enabled=true on a zero-valued Config. To disable metrics, set Enabled=false after calling ApplyDefaults.

Validation Rules

| Field | Rule | Error Message |
| --- | --- | --- |
| CollectInterval | >= 5s | metrics: config: CollectInterval must be at least 5s |
| ReportInterval | >= 10s | metrics: config: ReportInterval must be at least 10s |
| ReportInterval | >= CollectInterval | metrics: config: ReportInterval must be >= CollectInterval |
| BatchSize | > 0 | metrics: config: BatchSize must be > 0 |

When Enabled=false, validation is skipped entirely (including LocalEndpoint validation).

Local Endpoint

For a step-by-step setup guide, see Setting Up Local Endpoint Delivery.

LocalEndpoint allows metrics to be sent to an additional local endpoint alongside the control plane. The type is api.LocalEndpointConfig, defined once in internal/api/types.go and shared across all three observability pipelines.

| Field | Type | YAML Key | Description |
| --- | --- | --- | --- |
| URL | string | local_endpoint.url | HTTPS endpoint URL. Empty means not configured. |
| SecretKey | string | local_endpoint.secret_key | Auth credential. Required when URL is set. Redacted in config dumps. |
| TLSInsecureSkipVerify | bool | local_endpoint.tls_insecure_skip_verify | Disable TLS certificate verification. |

Validation rules (applied only when Enabled=true and URL is non-empty):

| Rule | Error Message |
| --- | --- |
| URL must be parseable | metrics: config: local_endpoint: invalid URL "<url>" |
| Scheme must be https | metrics: config: local_endpoint: URL must be HTTPS, got "<scheme>" |
| SecretKey must be non-empty | metrics: config: local_endpoint: SecretKey is required when URL is set |

A zero-valued LocalEndpointConfig (all fields empty/false) is valid and means "not configured".
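The rules above can be sketched in isolation; this is an illustrative reduction, not the package's actual validation code, with `LocalEndpointConfig` restated locally so the sketch is self-contained:

```go
package main

import (
	"fmt"
	"net/url"
)

// LocalEndpointConfig mirrors the fields documented above.
type LocalEndpointConfig struct {
	URL                   string
	SecretKey             string
	TLSInsecureSkipVerify bool
}

// validateLocalEndpoint is a hypothetical sketch of the documented rules:
// a zero value is valid, and checks only run when URL is non-empty.
func validateLocalEndpoint(cfg LocalEndpointConfig) error {
	if cfg.URL == "" {
		return nil // zero value means "not configured"
	}
	u, err := url.Parse(cfg.URL)
	if err != nil {
		return fmt.Errorf("metrics: config: local_endpoint: invalid URL %q", cfg.URL)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("metrics: config: local_endpoint: URL must be HTTPS, got %q", u.Scheme)
	}
	if cfg.SecretKey == "" {
		return fmt.Errorf("metrics: config: local_endpoint: SecretKey is required when URL is set")
	}
	return nil
}

func main() {
	fmt.Println(validateLocalEndpoint(LocalEndpointConfig{}))               // valid: not configured
	fmt.Println(validateLocalEndpoint(LocalEndpointConfig{URL: "http://x"})) // rejected: not HTTPS
}
```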

```yaml
metrics:
  enabled: true
  collect_interval: 15s
  report_interval: 60s
  batch_size: 100
  local_endpoint:
    url: https://metrics.local:9090/ingest
    secret_key: local-metrics-token
    tls_insecure_skip_verify: false
```

Collector

Interface for subsystem-specific metric collection. Each collector returns a slice of api.MetricPoint.

```go
type Collector interface {
    Collect(ctx context.Context) ([]api.MetricPoint, error)
}
```

Metric Groups

| Constant | Value | Collector | Description |
| --- | --- | --- | --- |
| GroupSystem | "system" | SystemCollector | CPU, memory, disk, network, load avg |
| GroupTunnel | "tunnel" | TunnelCollector | Per-peer tunnel health + packet loss |
| GroupLatency | "latency" | LatencyCollector | Per-peer round-trip latency |
| GroupAgent | "agent" | AgentStatsCollector | Go runtime, uptime, reconnects |

SystemCollector

Collects system resource metrics via an injectable SystemReader.

SystemReader

```go
type SystemReader interface {
    ReadStats(ctx context.Context) (*SystemStats, error)
}
```

SystemStats

```go
type SystemStats struct {
    CPUUsagePercent  float64 `json:"cpu_usage_percent"`
    MemoryUsedBytes  uint64  `json:"memory_used_bytes"`
    MemoryTotalBytes uint64  `json:"memory_total_bytes"`
    DiskUsedBytes    uint64  `json:"disk_used_bytes"`
    DiskTotalBytes   uint64  `json:"disk_total_bytes"`
    NetworkRxBytes   uint64  `json:"network_rx_bytes"`
    NetworkTxBytes   uint64  `json:"network_tx_bytes"`
    LoadAvg1         float64 `json:"load_avg_1"`
    LoadAvg5         float64 `json:"load_avg_5"`
    LoadAvg15        float64 `json:"load_avg_15"`
}
```

Constructor

```go
func NewSystemCollector(reader SystemReader, logger *slog.Logger) *SystemCollector
```

Collect Behavior

Returns a single MetricPoint with Group="system". The Data field contains the JSON-encoded SystemStats.

Example JSON data:

```json
{
  "cpu_usage_percent": 42.5,
  "memory_used_bytes": 2147483648,
  "memory_total_bytes": 8589934592,
  "disk_used_bytes": 53687091200,
  "disk_total_bytes": 107374182400,
  "network_rx_bytes": 1048576,
  "network_tx_bytes": 524288,
  "load_avg_1": 1.5,
  "load_avg_5": 1.2,
  "load_avg_15": 0.9
}
```

On reader error, returns nil, fmt.Errorf("metrics: system: %w", err).

LinuxSystemReader

Concrete SystemReader implementation for Linux, reading from /proc and syscall.Statfs.

```go
func NewLinuxSystemReader(mountPoint, netIface string) *LinuxSystemReader
```

| Parameter | Default | Description |
| --- | --- | --- |
| mountPoint | "/" | Filesystem path for disk stats via syscall.Statfs |
| netIface | "" | Network interface for rx/tx bytes; empty sums all interfaces (excluding lo) |

Data Sources:

| Field | Source | Method |
| --- | --- | --- |
| CPUUsagePercent | /proc/stat | Two samples 100ms apart, delta busy/total |
| MemoryUsedBytes | /proc/meminfo | MemTotal - MemAvailable |
| MemoryTotalBytes | /proc/meminfo | MemTotal (kB × 1024) |
| DiskUsedBytes | syscall.Statfs | (Blocks - Bfree) × Bsize |
| DiskTotalBytes | syscall.Statfs | Blocks × Bsize |
| NetworkRxBytes | /proc/net/dev | Sum of rx bytes (field 0) per interface |
| NetworkTxBytes | /proc/net/dev | Sum of tx bytes (field 8) per interface |
| LoadAvg1/5/15 | /proc/loadavg | First three space-separated fields |

The implementation is build-tagged //go:build linux.
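The /proc/loadavg parsing named in the table can be sketched as a stand-alone function; this is illustrative, not the package's source:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadAvg extracts the three load averages from /proc/loadavg
// content, e.g. "1.50 1.20 0.90 2/345 6789" (first three fields).
func parseLoadAvg(content string) (l1, l5, l15 float64, err error) {
	fields := strings.Fields(content)
	if len(fields) < 3 {
		return 0, 0, 0, fmt.Errorf("metrics: system: malformed loadavg %q", content)
	}
	vals := make([]float64, 3)
	for i := 0; i < 3; i++ {
		if vals[i], err = strconv.ParseFloat(fields[i], 64); err != nil {
			return 0, 0, 0, err
		}
	}
	return vals[0], vals[1], vals[2], nil
}

func main() {
	l1, l5, l15, _ := parseLoadAvg("1.50 1.20 0.90 2/345 6789")
	fmt.Println(l1, l5, l15) // 1.5 1.2 0.9
}
```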

TunnelCollector

Collects per-peer tunnel health metrics via an injectable TunnelStatsReader.

TunnelStatsReader

```go
type TunnelStatsReader interface {
    ReadTunnelStats(ctx context.Context) ([]TunnelStats, error)
}
```

TunnelStats

```go
type TunnelStats struct {
    PeerID             string    `json:"peer_id"`
    LastHandshakeTime  time.Time `json:"last_handshake_time"`
    RxBytes            uint64    `json:"rx_bytes"`
    TxBytes            uint64    `json:"tx_bytes"`
    HandshakeSucceeded bool      `json:"handshake_succeeded"`
    HandshakeStale     bool      `json:"handshake_stale"`
    PacketLossPercent  float64   `json:"packet_loss_percent"`
}
```

PacketLossPercent is provided by the TunnelStatsReader implementation. When packet loss measurement is unavailable, the value is -1.

Constructor

```go
func NewTunnelCollector(reader TunnelStatsReader, logger *slog.Logger) *TunnelCollector
```

Collect Behavior

Returns one MetricPoint per peer with Group="tunnel" and PeerID set. Each point's Data field contains the JSON-encoded TunnelStats. Returns an empty slice (not nil) when no peers exist.

Example JSON data (per peer):

```json
{
  "peer_id": "peer-abc-123",
  "last_handshake_time": "2026-02-12T10:30:00Z",
  "rx_bytes": 104857600,
  "tx_bytes": 52428800,
  "handshake_succeeded": true,
  "handshake_stale": false,
  "packet_loss_percent": 0.5
}
```

On reader error, returns nil, fmt.Errorf("metrics: tunnel: %w", err).

LatencyCollector

Measures per-peer round-trip latency via an injectable Pinger and PeerLister.

Pinger

```go
type Pinger interface {
    Ping(ctx context.Context, peerID string) (rttNano int64, err error)
}
```

PeerLister

```go
type PeerLister interface {
    PeerIDs() []string
}
```

LatencyResult

```go
type LatencyResult struct {
    PeerID  string `json:"peer_id"`
    RTTNano int64  `json:"rtt_nano"`
}
```

Constructor

go
func NewLatencyCollector(pinger Pinger, lister PeerLister, logger *slog.Logger) *LatencyCollector

Collect Behavior

Returns one MetricPoint per peer with Group="latency" and PeerID set. When a ping fails for a peer, the collector logs a warning and records RTTNano=-1 (unreachable) — it does not skip the peer or return an error. Returns an empty slice when no peers exist. If the context is cancelled mid-iteration, returns partial results with ctx.Err().

Example JSON data (per peer):

```json
{
  "peer_id": "peer-abc-123",
  "rtt_nano": 15000000
}
```

Unreachable peer:

```json
{
  "peer_id": "peer-xyz-789",
  "rtt_nano": -1
}
```

AgentStatsCollector

Collects Go runtime and agent health metrics.

ReconnectCounter

```go
type ReconnectCounter interface {
    ReconnectCount() int
}
```

AgentStats

```go
type AgentStats struct {
    GoroutineCount int     `json:"goroutine_count"`
    HeapAllocBytes uint64  `json:"heap_alloc_bytes"`
    HeapSysBytes   uint64  `json:"heap_sys_bytes"`
    GCPauseTotalNs uint64  `json:"gc_pause_total_ns"`
    GCNumGC        uint32  `json:"gc_num_gc"`
    UptimeSeconds  float64 `json:"uptime_seconds"`
    ReconnectCount int     `json:"reconnect_count"`
}
```

Constructor

```go
func NewAgentStatsCollector(startTime time.Time, reconnects ReconnectCounter, logger *slog.Logger) *AgentStatsCollector
```

The reconnects parameter may be nil if reconnect counting is not available; reconnect_count defaults to 0.

Collect Behavior

Returns a single MetricPoint with Group="agent". Uses runtime.NumGoroutine() and runtime.ReadMemStats() for Go runtime metrics. Uptime is calculated as time.Since(startTime).

Example JSON data:

```json
{
  "goroutine_count": 42,
  "heap_alloc_bytes": 8388608,
  "heap_sys_bytes": 16777216,
  "gc_pause_total_ns": 1500000,
  "gc_num_gc": 12,
  "uptime_seconds": 3600.5,
  "reconnect_count": 3
}
```

MetricReporter

Interface abstracting the control plane metrics reporting API. Satisfied by api.ControlPlane.

```go
type MetricReporter interface {
    ReportMetrics(ctx context.Context, nodeID string, batch api.MetricBatch) error
}
```

LocalReporter

LocalReporter implements MetricReporter by POSTing metric batches to a locally configured HTTPS endpoint with bearer-token authentication. It operates independently from the control plane client, with its own http.Client, TLS settings, and credential cache.

Constructor

```go
func NewLocalReporter(cfg api.LocalEndpointConfig, fetcher SecretFetcher, nsk []byte, nodeID string, logger *slog.Logger) *LocalReporter
```

| Parameter | Description |
| --- | --- |
| cfg | Local endpoint configuration (URL, secret key, TLS settings) |
| fetcher | SecretFetcher for retrieving encrypted credentials (satisfied by api.ControlPlane) |
| nsk | Node secret key bytes for AES-256-GCM decryption via nodeapi.DecryptSecret |
| nodeID | Node identifier passed to SecretFetcher.FetchSecret |
| logger | Structured logger (log/slog) |

SecretFetcher

```go
type SecretFetcher interface {
    FetchSecret(ctx context.Context, nodeID, key string) (*api.SecretResponse, error)
}
```

Defined in the metrics package. The api.ControlPlane client satisfies this interface.

HTTP Client

Each LocalReporter creates its own http.Client at construction time:

| Setting | Value | Notes |
| --- | --- | --- |
| Timeout | 10s | Per-request timeout including connection, TLS handshake, and response body |
| TLS | Configurable | TLSInsecureSkipVerify controls certificate validation for this client only |
| Compression | None | Batches are sent as uncompressed JSON (unlike the gzip-compressed control plane client) |

Credential Resolution

On each ReportMetrics call, LocalReporter resolves a bearer token via the following flow:

  1. Check cache (read lock): if cachedToken is non-empty and fetchedAt is within the 5-minute TTL, return the cached token immediately.
  2. Acquire write lock and double-check the cache (another goroutine may have refreshed it).
  3. Fetch secret: call SecretFetcher.FetchSecret(ctx, nodeID, secretKey) to retrieve the encrypted credential.
  4. Decrypt: call nodeapi.DecryptSecret(nsk, resp.Ciphertext, resp.Nonce) to obtain the plaintext token.
  5. Update cache: store the plaintext token and current timestamp.

Failure modes:

| Scenario | Behavior |
| --- | --- |
| Fetch or decrypt fails, stale cache exists | Log warning, return stale cached token |
| Fetch or decrypt fails, no cache | Return error; no HTTP POST is made |

The cache is protected by sync.RWMutex with a double-checked locking pattern, making it safe for concurrent ReportMetrics calls.

ReportMetrics Behavior

```go
func (r *LocalReporter) ReportMetrics(ctx context.Context, nodeID string, batch api.MetricBatch) error
```

  1. Resolve bearer token (see above)
  2. JSON-marshal the MetricBatch
  3. POST to cfg.URL with headers:
    • Content-Type: application/json
    • Authorization: Bearer {token}
  4. Return nil on a 2xx response; return an error containing the status code on non-2xx

MultiReporter

MultiReporter implements MetricReporter by dispatching to both a platform reporter and a local reporter concurrently. It is the adapter that enables dual-destination delivery without modifying the Manager.

Constructor

```go
func NewMultiReporter(platform, local MetricReporter, logger *slog.Logger) *MultiReporter
```

| Parameter | Description |
| --- | --- |
| platform | Primary reporter (typically api.ControlPlane); its error is returned to the caller |
| local | Secondary reporter (typically LocalReporter); its error is logged but not returned |
| logger | Structured logger (log/slog) |

Error Semantics

ReportMetrics dispatches to both reporters in parallel goroutines and waits for both to complete.

| Platform result | Local result | Return value | Side effect |
| --- | --- | --- | --- |
| success | success | nil | |
| error | success | platform error | |
| success | error | nil | Local error logged as warning |
| error | error | platform error | Local error logged as warning |

Only the platform error is returned. This is critical because the Manager uses the return value to decide whether to retain the batch in its buffer for retry. Local endpoint failures must not trigger batch retention.

Concurrency

A slow or hanging local endpoint does not delay the availability of the platform result. Both goroutines run independently; MultiReporter waits for both via sync.WaitGroup before returning.

Manager

Orchestrates metric collection and reporting via two independent ticker loops.

Constructor

```go
func NewManager(cfg Config, collectors []Collector, reporter MetricReporter, nodeID string, logger *slog.Logger) *Manager
```

| Parameter | Description |
| --- | --- |
| cfg | Metrics configuration |
| collectors | Slice of Collector implementations to run each cycle |
| reporter | MetricReporter for sending batches to the control plane |
| nodeID | Node identifier included in report requests |
| logger | Structured logger (log/slog) |

Run Method

```go
func (m *Manager) Run(ctx context.Context) error
```

Blocks until the context is cancelled. Returns ctx.Err() on cancellation.

Lifecycle

```go
sysColl := metrics.NewSystemCollector(sysReader, logger)
tunColl := metrics.NewTunnelCollector(tunReader, logger)
latColl := metrics.NewLatencyCollector(pinger, lister, logger)

mgr := metrics.NewManager(cfg, []metrics.Collector{sysColl, tunColl, latColl}, controlPlane, nodeID, logger)

// Blocks until ctx is cancelled
err := mgr.Run(ctx)
// err == context.Canceled (normal shutdown)
```

Run Sequence

  1. If Enabled=false: log info, return nil immediately
  2. Start collect ticker (CollectInterval) and report ticker (ReportInterval)
  3. On collect tick: call each collector's Collect, append results to mutex-protected buffer, log errors per-collector but continue
  4. On report tick: swap buffer under lock, send via ReportMetrics, log errors but continue
  5. On context cancellation: best-effort flush of remaining buffer using context.Background(), return ctx.Err()

Buffer Management

  • Collected MetricPoints are appended to an internal buffer protected by sync.Mutex
  • On report tick, the buffer is swapped out atomically (lock, copy reference, set to nil, unlock)
  • Empty buffers skip the report call entirely
  • On shutdown, remaining buffered points are flushed with a background context
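The swap step can be sketched as below; this is an illustrative reduction in which strings stand in for api.MetricPoint values:

```go
package main

import (
	"fmt"
	"sync"
)

// buffer sketches the documented swap: on each report tick the slice is
// taken under the lock and reset to nil, so collection can keep appending
// while the swapped-out slice is being reported.
type buffer struct {
	mu     sync.Mutex
	points []string // stands in for []api.MetricPoint
}

func (b *buffer) append(p string) {
	b.mu.Lock()
	b.points = append(b.points, p)
	b.mu.Unlock()
}

func (b *buffer) swap() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.points
	b.points = nil // subsequent appends go into a fresh slice
	return out
}

func main() {
	var b buffer
	b.append("p1")
	b.append("p2")
	fmt.Println(len(b.swap()), len(b.swap())) // 2 0
}
```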

API Contract

POST /v1/nodes/{node_id}/metrics

Reports a batch of metric points to the control plane.

Request body (api.MetricBatch = []api.MetricPoint):

```json
[
  {
    "timestamp": "2026-02-12T10:30:00Z",
    "group": "system",
    "data": {
      "cpu_usage_percent": 42.5,
      "memory_used_bytes": 2147483648,
      "memory_total_bytes": 8589934592,
      "disk_used_bytes": 53687091200,
      "disk_total_bytes": 107374182400,
      "network_rx_bytes": 1048576,
      "network_tx_bytes": 524288,
      "load_avg_1": 1.5,
      "load_avg_5": 1.2,
      "load_avg_15": 0.9
    }
  },
  {
    "timestamp": "2026-02-12T10:30:00Z",
    "group": "tunnel",
    "peer_id": "peer-abc-123",
    "data": {
      "peer_id": "peer-abc-123",
      "last_handshake_time": "2026-02-12T10:29:55Z",
      "rx_bytes": 104857600,
      "tx_bytes": 52428800,
      "handshake_succeeded": true,
      "handshake_stale": false,
      "packet_loss_percent": 0.5
    }
  },
  {
    "timestamp": "2026-02-12T10:30:00Z",
    "group": "latency",
    "peer_id": "peer-abc-123",
    "data": {
      "peer_id": "peer-abc-123",
      "rtt_nano": 15000000
    }
  }
]
```

MetricPoint Schema

```go
type MetricPoint struct {
    Timestamp time.Time       `json:"timestamp"`
    Group     string          `json:"group"`              // "system", "tunnel", "latency", "agent"
    PeerID    string          `json:"peer_id,omitempty"`  // set for tunnel and latency groups
    Data      json.RawMessage `json:"data"`               // group-specific JSON payload
}
```

| Field | Type | Description |
| --- | --- | --- |
| Timestamp | time.Time | When the metric was collected (RFC 3339) |
| Group | string | Metric group identifier |
| PeerID | string | Peer identifier (omitted for system metrics) |
| Data | json.RawMessage | Group-specific payload (see schemas above) |

Error Handling

| Scenario | Behavior |
| --- | --- |
| Collector returns error | Log warn, skip collector, continue with others |
| Reporter returns error | Log warn, buffer is cleared, continue next cycle |
| All collectors fail | Empty buffer, report tick is a no-op |
| Ping fails for a peer | Log warn, record RTTNano=-1, continue other peers |
| Context cancelled mid-collect | Partial results returned by latency collector |
| Context cancelled (shutdown) | Best-effort flush, return ctx.Err() |
| Metrics disabled | Return nil immediately, no goroutines started |

Logging

All log entries use component=metrics.

| Level | Event | Keys |
| --- | --- | --- |
| Info | Metrics disabled | component |
| Warn | Collector failed | component, error |
| Warn | Metrics report failed | component, error |
| Warn | Latency ping failed | peer_id, error |
| Warn | Using cached credential | component, error |
| Warn | Local metrics report failed | component, error |
| Info | Local endpoint enabled | pipeline, url |

Integration Points

With api.ControlPlane

api.ControlPlane satisfies the MetricReporter interface directly:

```go
controlPlane, _ := api.NewControlPlane(apiCfg, "1.0.0", logger)

// controlPlane.ReportMetrics matches MetricReporter.ReportMetrics
mgr := metrics.NewManager(cfg, collectors, controlPlane, nodeID, logger)
mgr.Run(ctx)
```

With LocalReporter and MultiReporter

When LocalEndpoint.URL is configured, the pipeline manager receives a MultiReporter that wraps both the control plane client and a LocalReporter. When it is not configured, the control plane client is passed directly: no extra goroutines, no SecretFetcher calls, no MultiReporter wrapping.

```go
nsk := []byte(identity.NodeSecretKey)

var metricsReporter metrics.MetricReporter = controlPlane
if cfg.Metrics.LocalEndpoint.URL != "" {
    localReporter := metrics.NewLocalReporter(cfg.Metrics.LocalEndpoint, controlPlane, nsk, identity.NodeID, logger)
    metricsReporter = metrics.NewMultiReporter(controlPlane, localReporter, logger)
    logger.Info("local endpoint enabled", "pipeline", "metrics", "url", cfg.Metrics.LocalEndpoint.URL)
}
mgr := metrics.NewManager(cfg.Metrics, collectors, metricsReporter, identity.NodeID, logger)
```

Integration Tests

See Local Endpoint Integration Tests for the full integration test suite covering dual delivery, error isolation, credential resolution, and TLS skip-verify across all three pipelines.

With wireguard.Manager

wireguard.Manager.PeerIndex() can serve as the basis for a PeerLister implementation, providing the list of active peer IDs to the LatencyCollector.
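Assuming PeerIndex() exposes peers in a map keyed by peer ID (an assumption; the actual return type is not documented here), an adapter might look like:

```go
package main

import "fmt"

// peerIndexer is a hypothetical interface standing in for the relevant
// slice of wireguard.Manager; the real PeerIndex signature may differ.
type peerIndexer interface {
	PeerIndex() map[string]struct{}
}

// wgPeerLister adapts a peerIndexer to the PeerLister shape documented
// above (PeerIDs() []string).
type wgPeerLister struct {
	mgr peerIndexer
}

func (l wgPeerLister) PeerIDs() []string {
	idx := l.mgr.PeerIndex()
	ids := make([]string, 0, len(idx))
	for id := range idx {
		ids = append(ids, id)
	}
	return ids
}

// fakeManager is a test double for the sketch.
type fakeManager map[string]struct{}

func (f fakeManager) PeerIndex() map[string]struct{} { return f }

func main() {
	l := wgPeerLister{mgr: fakeManager{"peer-a": {}, "peer-b": {}}}
	fmt.Println(len(l.PeerIDs())) // 2
}
```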