Service Level Objectives (SLOs)

This document defines our SLOs and alerting thresholds, and describes how we validate system performance under load, handle overload, and review compliance.


  1. Run Start Latency (p95)

    • Target: p95 time from API POST /runs to the workflow run being enqueued is ≤ 2s.
    • Alert: p95 latency > 5s sustained for 5 minutes.
  2. Run Success Rate

    • Target: ≥ 99% of runs succeed end-to-end.
    • Alert: success rate < 95% for 10 minutes.
  3. Stream Gaps (SSE)

    • Target: zero dropped or out-of-order events.
    • Alert: > 10 SSE gaps/minute per instance (warning level).
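
One way to count stream gaps is sketched below. It assumes each SSE event carries a monotonically increasing integer id, which this document does not specify, and the function name `count_stream_gaps` is hypothetical. Any event that is not the immediate successor of the highest id seen so far counts as one gap, so a dropped event, a duplicate, and a late arrival are all counted.

```python
from typing import Iterable, Optional


def count_stream_gaps(event_ids: Iterable[int]) -> int:
    """Count events that break the expected id sequence.

    Assumption: ids increase by exactly 1 per event on a healthy stream.
    """
    gaps = 0
    highest_seen: Optional[int] = None
    for event_id in event_ids:
        # An event is "in sequence" only if it directly follows the highest id seen.
        if highest_seen is not None and event_id != highest_seen + 1:
            gaps += 1
        # Keep the high-water mark so a late event does not move the position back.
        if highest_seen is None or event_id > highest_seen:
            highest_seen = event_id
    return gaps
```

Under this counting rule, an event that is skipped and then arrives late contributes two gaps (one for the skip, one for the late arrival), so a real implementation may want to distinguish the two cases.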

Each SLO has a warning and a critical alert level:

  • Run start latency (p95): critical > 5s, warning > 3s.
  • Run success rate: critical < 95%, warning < 98%.
  • Stream gap rate: critical > 50/min, warning > 10/min.
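
The thresholds above can be expressed as a small lookup table. The following is a minimal sketch, not the actual alerting rules: the metric keys and the `severity` function are hypothetical names, and the "sustained for N minutes" condition is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool  # True for latency and gaps, False for success rate


# Values taken from the thresholds listed above; the key names are assumptions.
THRESHOLDS = {
    "run_start_latency_p95_s": Threshold(warning=3.0, critical=5.0, higher_is_worse=True),
    "run_success_rate_pct": Threshold(warning=98.0, critical=95.0, higher_is_worse=False),
    "stream_gaps_per_min": Threshold(warning=10.0, critical=50.0, higher_is_worse=True),
}


def severity(metric: str, value: float) -> str:
    """Return "critical", "warning", or "ok" for a single metric sample."""
    t = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        # Comparisons are strict, matching the ">" and "<" in the list above.
        return value > limit if t.higher_is_worse else value < limit

    if breached(t.critical):
        return "critical"
    if breached(t.warning):
        return "warning"
    return "ok"
```

For example, `severity("run_success_rate_pct", 97.0)` returns `"warning"`, while a p95 latency of exactly 5.0s stays at `"warning"` because the critical comparison is strict.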

We run load tests (e.g. with Locust or k6) to validate SLOs at scale:

  • Ramp up to 1000 concurrent runs.
  • Measure latency distribution, stream continuity, and worker throughput.
  • Compare against defined targets.
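
The comparison step might look like the sketch below, which takes latency samples and run counts exported from a load test and checks them against the targets above. It uses only the Python standard library; the function names and the shape of the result are assumptions, and Locust and k6 each report percentiles in their own output formats.

```python
import statistics
from typing import Dict, Sequence, Union

RUN_START_TARGET_S = 2.0   # p95 target from the SLO list
SUCCESS_TARGET_PCT = 99.0  # success-rate target from the SLO list


def p95(samples: Sequence[float]) -> float:
    """95th percentile; statistics.quantiles needs at least two samples."""
    if len(samples) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def summarize_load_test(
    latencies_s: Sequence[float], succeeded: int, total: int
) -> Dict[str, Union[float, bool]]:
    """Compare measured results against the SLO targets."""
    latency_p95 = p95(latencies_s)
    success_pct = 100.0 * succeeded / total
    return {
        "latency_p95_s": latency_p95,
        "latency_ok": latency_p95 <= RUN_START_TARGET_S,
        "success_pct": success_pct,
        "success_ok": success_pct >= SUCCESS_TARGET_PCT,
    }
```

A run with 1000 runs of which 989 succeed yields a success rate of 98.9%, so `success_ok` is false even if the latency target is met.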

When the system is overloaded:

  • The API returns 429 Too Many Requests with a Retry-After header.
  • Clients must back off and retry after the suggested interval.
  • This keeps event queues and API servers from being overwhelmed.
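
A client-side sketch of this back-off behaviour follows. It is transport-agnostic: `send` is a hypothetical callable that performs one request and returns the status code and response headers, and the one-second fallback delay and the attempt limit are assumptions rather than documented values. Retry-After may be either a number of seconds or an HTTP date, so both forms are handled.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple


def retry_after_seconds(
    header_value: Optional[str], default: float = 1.0, now: Optional[datetime] = None
) -> float:
    """Convert a Retry-After header (delta-seconds or HTTP date) to seconds."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())


def send_with_backoff(
    send: Callable[[], Tuple[int, Mapping[str, str]]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it stops returning 429 or attempts run out."""
    status = 429
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts:
            sleep(retry_after_seconds(headers.get("Retry-After")))
    return status
```

Passing `sleep` as a parameter keeps the function easy to test; production clients would typically also add jitter so that many clients do not retry at the same instant.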

SLOs are tracked and reviewed as follows:

  • SLO compliance is reviewed quarterly.
  • Dashboards and alerting rules in Prometheus/Grafana enforce these thresholds.
  • SLO outcomes feed into error budgets for operational planning.
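
As a worked illustration of how an error budget follows from the run success target: with a 99% target, 1% of runs in a review period may fail before the budget is spent. The sketch below shows the arithmetic only; the function name, the returned fields, and the sample numbers are hypothetical and do not describe real traffic.

```python
from typing import Dict


def error_budget(total_runs: int, failed_runs: int, target_pct: float = 99.0) -> Dict[str, float]:
    """Compute allowed failures and how much of the budget has been used."""
    allowed_failures = total_runs * (100.0 - target_pct) / 100.0
    # With no runs there is no budget to consume, so report zero usage.
    consumed = failed_runs / allowed_failures if allowed_failures > 0 else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_runs,
        "budget_consumed": consumed,
    }
```

For 100,000 runs in a quarter with 250 failures, the budget allows 1,000 failures, so a quarter of the budget is consumed and 750 failures remain.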