Observability Contract

vfo should emit one canonical run model. That model feeds the browser dashboard today and can feed metrics, logs, and traces later, without changing what the data means.

Objective

  • keep the web app, CI reports, and future observability backends aligned
  • avoid one-off schemas that only serve a single UI
  • preserve low-cardinality metrics while still carrying rich per-run detail

Core Model

The app should emit structured events keyed by a small set of stable entity identifiers:

  • run_id
  • pipeline_id
  • asset_id
  • node_id
  • stage_id
  • attempt_id

Each event should carry:

  • timestamp
  • state transition
  • duration when available
  • exit code when available
  • command line or action name
  • input and output asset references
  • error class and human-readable detail when failed
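
As an illustration of the event described above, here is a minimal sketch in Python. The class name `RunEvent` and every field name (`previous_state`, `duration_s`, `error_detail`, and so on) are assumptions made for this example and are not part of an existing vfo schema. Optional fields default to `None` to reflect "when available".

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RunEvent:
    """One state transition for one attempt. All names here are hypothetical."""

    # Stable entity identifiers.
    run_id: str
    pipeline_id: str
    asset_id: str
    node_id: str
    stage_id: str
    attempt_id: str

    # What happened and when.
    timestamp: str                  # ISO 8601 string, UTC
    previous_state: Optional[str]   # None for the first event of an attempt
    state: str                      # state entered by this transition
    action: str                     # command line or action name

    # Details that are only present when available.
    duration_s: Optional[float] = None
    exit_code: Optional[int] = None
    input_assets: tuple[str, ...] = ()
    output_assets: tuple[str, ...] = ()
    error_class: Optional[str] = None
    error_detail: Optional[str] = None

    def to_json(self) -> str:
        """Serialize with sorted keys so identical events produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Example values are invented for illustration only.
example = RunEvent(
    run_id="run-1",
    pipeline_id="pipeline-a",
    asset_id="asset-1",
    node_id="node-1",
    stage_id="encode",
    attempt_id="attempt-1",
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    previous_state="running",
    state="failed",
    action="encode-video",
    exit_code=1,
    input_assets=("asset-1",),
    error_class="EncoderError",
    error_detail="encoder exited with a non-zero status",
)
```

Keeping the record immutable and serializing it in one place makes it easier for every sink to read the same fields with the same meaning.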

Canonical States

Use a small stable state set across all sinks:

  • queued
  • running
  • waiting
  • complete
  • failed
  • skipped
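
One way to keep this set stable across sinks is to define it once and validate every incoming value against it. The sketch below assumes Python; `RunState` and `parse_state` are hypothetical names chosen for this example.

```python
from enum import Enum


class RunState(str, Enum):
    """The canonical states listed above. The class name is an assumption."""

    QUEUED = "queued"
    RUNNING = "running"
    WAITING = "waiting"
    COMPLETE = "complete"
    FAILED = "failed"
    SKIPPED = "skipped"


def parse_state(raw: str) -> RunState:
    """Normalize a state string; raises ValueError for anything outside the set."""
    return RunState(raw.strip().lower())
```

Rejecting unknown values at the boundary stops a sink-specific spelling such as "done" or "error" from leaking into the shared event stream.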

Sink Mapping

The same event stream should be adapted into multiple outputs:

  • browser dashboard: denormalized snapshot for UI rendering
  • Prometheus: numeric counters, gauges, and histograms with low-cardinality labels
  • Grafana Mimir: long-term storage for Prometheus-style metrics
  • Grafana Loki: structured logs with stable metadata
  • Grafana Tempo: traces and span timing for pipeline execution
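
The fan-out can be pictured as one function that hands each event to several adapters. The sketch below is a simplified illustration: the sink classes are in-memory stand-ins invented for this example, not real Prometheus, Loki, or Tempo clients, and the event keys (`node_id`, `stage_id`, `state`) are assumed names.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, List, Protocol

Event = Dict[str, Any]


class Sink(Protocol):
    """Anything that can consume one event. Hypothetical interface."""

    def handle(self, event: Event) -> None: ...


class SnapshotSink:
    """Stand-in for the dashboard: keeps the latest event per node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Event] = {}

    def handle(self, event: Event) -> None:
        self.nodes[event["node_id"]] = event


class MetricsSink:
    """Stand-in for a metrics backend: counts transitions by (stage, state)."""

    def __init__(self) -> None:
        self.transitions: Counter = Counter()

    def handle(self, event: Event) -> None:
        self.transitions[(event["stage_id"], event["state"])] += 1


class LogSink:
    """Stand-in for a log backend: keeps one JSON line per event."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def handle(self, event: Event) -> None:
        self.lines.append(json.dumps(event, sort_keys=True))


def fan_out(event: Event, sinks: Iterable[Sink]) -> None:
    """Deliver the same event, unchanged, to every sink."""
    for sink in sinks:
        sink.handle(event)


snapshot, metrics, logs = SnapshotSink(), MetricsSink(), LogSink()
for state in ("running", "complete"):
    fan_out(
        {"node_id": "node-1", "stage_id": "encode", "state": state},
        [snapshot, metrics, logs],
    )
```

Each adapter decides how much of the event it keeps, so the event itself does not need to change when a new backend is added.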

Guiding Rules

  • never rely on UI labels as the canonical schema
  • keep labels stable and machine-friendly
  • avoid high-cardinality Prometheus labels such as raw paths or arbitrary filenames
  • preserve full detail in logs or snapshot payloads instead
  • version the dashboard payload so readers can evolve safely
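
A small sketch of the label rule, assuming Python: metric labels come from an explicit allow-list, while the full event goes to logs or the snapshot. The allow-list contents and the names `ALLOWED_LABELS` and `split_for_sinks` are assumptions for this example. Whether a field such as `stage_id` is bounded enough to be a label depends on the deployment.

```python
from typing import Any, Dict, Tuple

# Assumed allow-list: only fields with a small, bounded set of values.
ALLOWED_LABELS = ("stage_id", "state")


def split_for_sinks(event: Dict[str, Any]) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return (metric labels, full detail) for one event.

    Labels contain only allow-listed keys; the detail copy keeps everything,
    including raw paths and filenames, for logs or snapshot payloads.
    """
    labels = {key: str(event[key]) for key in ALLOWED_LABELS if key in event}
    detail = dict(event)
    return labels, detail


labels, detail = split_for_sinks(
    {
        "stage_id": "encode",
        "state": "failed",
        "input_path": "/media/incoming/example.mov",  # invented path
    }
)
```

With an allow-list, adding a new field to the event never adds a new metric label by accident.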

Suggested Shape

At a minimum, the dashboard snapshot should include:

  • run header metadata
  • one or more pipelines
  • asset lists per pipeline
  • workflow nodes and edges
  • node details for the selected asset/pipeline
  • summary counts and stage totals

That gives the web app enough information to render the live dashboard while leaving the event stream available for future observability exports.
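
To make the shape concrete, here is one possible layout as a Python dictionary, together with the version check a reader could perform. Every key name, the `SCHEMA_VERSION` constant, and both helper functions are hypothetical; the list above does not fix the actual payload format.

```python
from typing import Any, Dict, Iterable

SCHEMA_VERSION = 1  # assumed starting version


def new_snapshot(run_id: str) -> Dict[str, Any]:
    """Build an empty snapshot with the sections listed above."""
    return {
        "schema_version": SCHEMA_VERSION,
        "run": {"run_id": run_id},          # run header metadata
        "pipelines": [],                    # each entry: assets, nodes, edges
        "selected": None,                   # node details for the selected asset/pipeline
        "summary": {"counts": {}, "stage_totals": {}},
    }


def is_readable(payload: Dict[str, Any], supported: Iterable[int] = (1,)) -> bool:
    """Return True if a reader that supports these versions can parse the payload."""
    return payload.get("schema_version") in tuple(supported)


snapshot = new_snapshot("run-1")
snapshot["pipelines"].append(
    {
        "pipeline_id": "pipeline-a",
        "assets": ["asset-1"],
        "nodes": [{"node_id": "node-1", "state": "queued"}],
        "edges": [],
    }
)
```

A reader that checks the version first can refuse or downgrade gracefully when it meets a payload it does not understand.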

Next Step

The next implementation step is to derive this snapshot from runtime events, then fan the same events into metrics, logs, and traces.
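
Deriving the snapshot from events amounts to folding the stream into the latest state per node plus summary counts. The sketch below assumes Python, assumes that `node_id` is unique within a run, and assumes timestamps are ISO 8601 strings in one timezone so that they sort correctly as text. The function name `derive_snapshot` and the event keys are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def derive_snapshot(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold events into the latest state per node and counts per state."""
    latest: Dict[str, str] = {}
    # Sort by timestamp so late-arriving events do not overwrite newer states.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        latest[event["node_id"]] = event["state"]
    counts = Counter(latest.values())
    return {"nodes": latest, "summary": {"counts": dict(counts)}}


# Events are deliberately out of order to show the effect of sorting.
result = derive_snapshot(
    [
        {"node_id": "node-1", "state": "running", "timestamp": "2024-01-01T12:00:05+00:00"},
        {"node_id": "node-2", "state": "failed", "timestamp": "2024-01-01T12:00:01+00:00"},
        {"node_id": "node-1", "state": "queued", "timestamp": "2024-01-01T12:00:00+00:00"},
    ]
)
```

Because the snapshot is computed purely from events, the same stream can be replayed to rebuild the dashboard or to feed other backends.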