SLOs and error budgets are easier to discuss than to operationalize. Many teams agree that reliability matters, but still struggle to answer basic planning questions: What should the target actually be? How much downtime or failure is acceptable in a month? When should releases slow down, and how much of the budget did the last incident consume? This guide gives SRE teams, platform engineers, and service owners a practical framework for using an SLO calculator or building one in a spreadsheet. You will get the core formulas, the inputs that matter, examples you can adapt, and a repeatable way to revisit your numbers as traffic, architecture, and business expectations change.
Overview
An SLO, or service level objective, defines the reliability target for a service over a measurement window. The companion concept, the error budget, represents how much unreliability is allowed before the service misses that target. Together, they create a shared language between engineering and the business: reliability is no longer a vague aspiration but a measurable operating constraint.
A simple way to think about it is this:
- SLI: the indicator you measure, such as successful requests, latency below a threshold, or job completion rate.
- SLO: the target for that indicator, such as 99.9% of requests succeeding over 30 days.
- Error budget: the portion of failure the SLO allows, such as 0.1% failed requests over the same period.
An SLO calculator helps teams turn those ideas into numbers they can use in planning, incident review, and release decisions. Instead of arguing in the abstract, you can quantify tradeoffs:
- How much outage time does 99.95% availability allow in 28 or 30 days?
- How many failed requests can a service absorb before it breaches?
- How quickly is the budget being burned during an incident?
- Should a team continue shipping features this week, or pause to improve reliability?
This matters because many reliability programs fail at the operational layer, not the conceptual one. Teams pick arbitrary targets, monitor the wrong signals, or create SLOs they cannot explain. A sound SLO practice takes the opposite approach: choose a user-centered indicator, define a realistic target, calculate the budget clearly, and connect the result to actions.
If you are still building your telemetry stack, pair this work with an instrumentation plan. Our OpenTelemetry Setup Guide for Logs, Metrics, and Traces can help establish the logs, metrics, and traces needed to support SLI collection.
How to estimate
The goal of an SLO calculator is not mathematical complexity. It is consistency. You want a method that any engineer, incident commander, or product stakeholder can follow and verify.
Start with the measurement model most relevant to your service:
- Availability-based SLO: useful for APIs, web apps, and request-driven systems.
- Latency-based SLO: useful when slow responses are effectively user-visible failures.
- Quality or correctness SLO: useful for background processing, data pipelines, or asynchronous workflows.
1. Define the SLI numerator and denominator
For request success rate, a common formula is:
SLI = good events / valid events
Examples:
- Good events = HTTP requests that return acceptable responses within the required threshold.
- Valid events = all eligible requests from real users, excluding clearly irrelevant traffic if your policy defines exclusions.
The most important part is consistency. If your team changes what counts as a valid event every week, the calculator becomes noise.
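A minimal sketch of the ratio above, assuming your telemetry can already give you counts of good and valid events. The function name `compute_sli` and all of the numbers are hypothetical and only illustrate the arithmetic.

```python
def compute_sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a fraction between 0 and 1."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive; with no eligible traffic the SLI is undefined")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


# Illustrative numbers, not taken from a real service.
total_requests = 1_000_000
excluded_requests = 20_000  # hypothetical policy exclusions, e.g. health checks
valid_requests = total_requests - excluded_requests
good_requests = 979_020

sli = compute_sli(good_requests, valid_requests)
print(f"SLI = {sli:.3%}")
```

Note that the exclusions are subtracted before the ratio is taken, so changing the exclusion policy changes the denominator and therefore the SLI.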
2. Set the SLO target
The basic error budget formula is:
Error budget percentage = 100% - SLO target percentage
Examples:
- 99.9% SLO → 0.1% error budget
- 99.95% SLO → 0.05% error budget
- 99.99% SLO → 0.01% error budget
Higher targets look attractive, but they sharply reduce the amount of failure your system can tolerate: moving from 99.9% to 99.99% cuts the budget by a factor of ten. That may be appropriate for a critical user path, but not for every internal service. An error budget calculator makes this visible quickly.
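The subtraction is trivial, but binary floating point can turn 100 - 99.9 into a value that is not exactly 0.1. The sketch below sidesteps that by using Python's `decimal` module with string inputs; the function name `error_budget_percent` is an assumption for illustration.

```python
from decimal import Decimal


def error_budget_percent(slo_target_percent: str) -> Decimal:
    """Error budget in percent for an SLO target given in percent, e.g. "99.9"."""
    target = Decimal(slo_target_percent)
    if not Decimal("0") < target < Decimal("100"):
        raise ValueError("SLO target must be strictly between 0 and 100 percent")
    return Decimal("100") - target


for target in ("99.9", "99.95", "99.99"):
    print(f"{target}% SLO -> {error_budget_percent(target)}% error budget")
```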
3. Choose the measurement window
Common windows include 7 days, 28 days, and 30 days. A shorter window reacts faster to recent incidents. A longer window smooths volatility. Many teams prefer 28 days because it always covers exactly four weeks, so every window contains the same number of each weekday and weekly traffic patterns do not skew comparisons. Monthly reporting may still push some teams toward 30-day windows.
4. Convert the percentage into allowed failures
There are two common ways to do this.
Request-based error budget:
Allowed bad events = valid events × error budget percentage
Time-based availability budget:
Allowed downtime = measurement window duration × error budget percentage
Both approaches are useful. Request-based budgets are often better for modern distributed systems because user impact is usually event-based, not simply uptime-based. Time-based budgets remain helpful for communicating with non-specialists and for services where availability is the primary concern.
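Both conversions can be sketched in a few lines. The example assumes the budget is passed as a fraction (0.0005 for a 99.95% SLO); the function names and the request volume are hypothetical.

```python
from datetime import timedelta


def allowed_bad_events(valid_events: int, budget_fraction: float) -> float:
    """Request-based budget: how many bad events the window can absorb."""
    return valid_events * budget_fraction


def allowed_downtime(window: timedelta, budget_fraction: float) -> timedelta:
    """Time-based budget: how much full downtime the window can absorb."""
    return window * budget_fraction


budget = 0.0005  # 99.95% SLO expressed as a fraction

for days in (28, 30):
    downtime = allowed_downtime(timedelta(days=days), budget)
    minutes = downtime.total_seconds() / 60
    print(f"{days}-day window: {minutes:.2f} minutes of downtime allowed")

# Illustrative volume only.
print(f"Allowed bad events: {allowed_bad_events(5_000_000, budget):,.0f}")
```

The time-based figure treats every minute as equally important. If traffic varies strongly by hour, the request-based figure is usually the closer proxy for user impact.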
5. Track burn rate
Burn rate tells you how quickly the service is consuming the error budget.
Burn rate = observed error rate / allowed error rate
If your SLO allows a 0.1% error rate and your service is currently seeing 0.5% errors, the burn rate is 5. That means the service is consuming the budget five times faster than the pace that would spend it exactly by the end of the window. This is often more actionable during incidents than the raw SLO number.
Burn rate can guide alerting and incident response. Fast burn over a short window suggests an acute event. Sustained moderate burn over a longer window suggests chronic degradation. This is where observability tools become operational tools. If you are comparing stacks for SLO dashboards and burn-rate alerts, see Prometheus vs Datadog vs Grafana Cloud: Monitoring Stack Comparison.
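As a rough sketch of how the formula and the short-window versus long-window idea fit together: the thresholds below (10 and 2) are hypothetical placeholders, not recommendations, and the function names are assumptions. Tune both to your own window length and paging tolerance.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_rate / allowed_error_rate


# Hypothetical thresholds, chosen only for illustration.
FAST_BURN_THRESHOLD = 10.0  # applied to a short window, e.g. the last hour
SLOW_BURN_THRESHOLD = 2.0   # applied to a long window, e.g. the last day


def classify_burn(short_window_rate: float, long_window_rate: float) -> str:
    """Label the situation using burn rates from a short and a long window."""
    if short_window_rate >= FAST_BURN_THRESHOLD:
        return "acute"
    if long_window_rate >= SLOW_BURN_THRESHOLD:
        return "chronic"
    return "ok"


rate = burn_rate(observed_error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}")
print(classify_burn(short_window_rate=rate, long_window_rate=2.5))
```

With these placeholder thresholds, a burn rate of 5 in the short window is not treated as acute, but a long-window rate of 2.5 is flagged as chronic.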
Inputs and assumptions
An SLO calculator is only as useful as the assumptions behind it. Before you trust the output, review the inputs carefully.
Pick a user-relevant SLI
The SLI should map to user experience, not just system internals. CPU saturation may explain a problem, but it is rarely the right top-level SLI. Better choices include:
- Successful API requests
- Checkout completions
- Job completions within a deadline
- Page loads below a latency threshold
For Kubernetes workloads, this distinction matters. A healthy pod count is useful, but a pod can be healthy while user requests still fail. Use infrastructure signals as diagnostics, not as the primary user-facing objective. For cluster-level debugging, related runbooks such as the Kubernetes Pending Pod Troubleshooting Guide and the Kubernetes CrashLoopBackOff Troubleshooting Checklist help connect symptoms to root causes.
Define what counts as good, bad, and ignored
Your calculator should make these rules explicit:
- Are 5xx responses bad events?
- Are some 4xx responses user-caused and therefore excluded?
- Does a timeout count as a failed event even if a retry eventually succeeds?
- Are internal health-check requests excluded?
- Is synthetic traffic included or separated?
This is where teams often create accidental loopholes. If exclusions are too broad, the SLO looks healthy while users are not.
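One way to make these rules explicit is to encode them in a single function that every report uses. The sketch below shows one possible policy, not a recommended one: it assumes timeouts count as bad even if a retry later succeeds, that health checks and synthetic traffic are ignored, and that only a short, hypothetical list of 4xx codes is treated as user-caused.

```python
from enum import Enum


class Outcome(Enum):
    GOOD = "good"
    BAD = "bad"
    IGNORED = "ignored"


# Hypothetical policy choice: only these client errors are treated as user-caused.
USER_CAUSED_STATUS_CODES = {400, 401, 403, 404}


def classify_request(
    status_code: int,
    *,
    timed_out: bool = False,
    is_health_check: bool = False,
    is_synthetic: bool = False,
) -> Outcome:
    """Apply one example good/bad/ignored policy to a single request."""
    if is_health_check or is_synthetic:
        return Outcome.IGNORED
    if timed_out or status_code >= 500:
        return Outcome.BAD
    if status_code in USER_CAUSED_STATUS_CODES:
        return Outcome.IGNORED
    if 400 <= status_code < 500:
        # Remaining 4xx responses (e.g. 429) count against the service here.
        return Outcome.BAD
    return Outcome.GOOD


print(classify_request(200))
print(classify_request(404))
print(classify_request(200, timed_out=True))
```

Keeping the excluded codes in a short explicit set makes it harder for exclusions to grow unnoticed.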
Choose realistic thresholds
For latency SLOs, the threshold should match user expectations and product requirements. A threshold that is too loose becomes meaningless. One that is too strict creates constant failure and trains teams to ignore the metric. The same applies to availability targets. If a non-critical internal dashboard is held to the same target as a revenue-critical API, the reliability program becomes disconnected from business impact.
Account for traffic shape
Not all services behave the same way:
- High-volume APIs fit request-based calculations well.
- Low-volume admin systems may need longer windows or event aggregation to avoid noisy percentages.
- Batch systems may need SLOs based on completion before a deadline, not request availability.
When traffic is low, one or two failures can swing percentages dramatically. In those cases, pair percentages with absolute counts so the team does not overreact to tiny sample sizes.
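To see why counts matter at low volume, the sketch below reports the percentage alongside the raw counts and shows how far a single failure moves the number. The function names and volumes are hypothetical.

```python
def points_per_failure(valid_events: int) -> float:
    """Percentage points of success rate lost for each additional failure."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100.0 / valid_events


def summarize(valid_events: int, bad_events: int) -> str:
    """Report the success percentage together with the absolute counts."""
    success_percent = 100.0 * (valid_events - bad_events) / valid_events
    return f"{success_percent:.4f}% success ({bad_events} bad of {valid_events} valid events)"


# The same two failures look very different at different volumes.
print(summarize(200, 2))
print(summarize(2_000_000, 2))
print(f"One failure costs {points_per_failure(200)} points at 200 events")
```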
Separate indicator design from policy decisions
The calculator should produce numbers. Your operating policy determines what happens next. For example:
- At 25% budget consumed: review reliability risks in planning.
- At 50% consumed: tighten release review or add canary checks.
- At 100% consumed: pause non-essential changes until stability improves.
These are not universal rules. They should reflect the service criticality, release velocity, and risk tolerance of your team. The key is that the actions are defined before the next incident, not negotiated in the middle of it.
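If the policy is written down as data, the calculator can report the agreed action next to the number. The sketch below encodes the three example thresholds from the list above; the fallback message for consumption below 25% is an assumption, since the list does not define one.

```python
# (threshold as a fraction of budget consumed, agreed action), highest first.
POLICY_STEPS = [
    (1.00, "pause non-essential changes until stability improves"),
    (0.50, "tighten release review or add canary checks"),
    (0.25, "review reliability risks in planning"),
]


def policy_action(budget_consumed: float) -> str:
    """Return the action for the highest threshold that has been reached."""
    for threshold, action in POLICY_STEPS:
        if budget_consumed >= threshold:
            return action
    return "no action defined"  # assumed fallback, not part of the example policy


for consumed in (0.10, 0.25, 0.60, 1.04):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```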
Make the calculator auditable
A practical SLO calculator should be easy to inspect. Whether you build it in a spreadsheet, dashboard query, or internal tool, include:
- The SLI definition
- The measurement window
- The target percentage
- The formula used
- The data source
- The owner responsible for keeping it current
That discipline prevents the common failure mode where an SLO exists in a slide deck but not in daily operations.
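For teams that keep SLO definitions in code rather than a spreadsheet, a small record type can hold the same checklist. This is only a sketch: the class name, the extra `service` field, and every field value are hypothetical.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SloDefinition:
    service: str
    sli_definition: str
    window_days: int
    target_percent: float
    formula: str
    data_source: str
    owner: str

    def missing_fields(self) -> list[str]:
        """Names of fields that were left empty and would block an audit."""
        return [name for name, value in asdict(self).items() if value in ("", None)]


# Hypothetical example record with the owner left blank on purpose.
checkout_slo = SloDefinition(
    service="checkout-api",
    sli_definition="non-5xx responses / valid requests",
    window_days=28,
    target_percent=99.9,
    formula="good events / valid events",
    data_source="request metrics dashboard",
    owner="",
)
print(checkout_slo.missing_fields())
```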
Worked examples
The examples below show how an error budget calculator works in practice. The specific numbers are illustrative, but the method is reusable.
Example 1: API availability SLO
A public API has a 30-day SLO of 99.9% successful requests.
- SLO target = 99.9%
- Error budget percentage = 0.1%
- Total valid requests in 30 days = 12,000,000
Allowed bad requests = 12,000,000 × 0.001 = 12,000
If the service has already recorded 3,000 failed requests in the current window, it has used 25% of the budget. If a later incident causes 9,500 more failures, the total becomes 12,500 and the SLO is breached.
This simple calculation helps answer a release question. If the team is near the budget limit, a risky deployment may not be worth it. If the service is comfortably within budget, the same change may be acceptable.
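The consumption figures in this example can be checked with a short helper. The function name `budget_consumed` is hypothetical, and the budget is passed as a fraction (0.001 for 99.9%).

```python
def budget_consumed(bad_events: int, valid_events: int, budget_fraction: float) -> float:
    """Fraction of the error budget used so far; values above 1.0 mean a breach."""
    allowed_bad = valid_events * budget_fraction
    if allowed_bad <= 0:
        raise ValueError("the budget must allow at least some failure")
    return bad_events / allowed_bad


valid = 12_000_000
budget = 0.001  # 99.9% SLO

print(f"after 3,000 failures:  {budget_consumed(3_000, valid, budget):.1%}")
print(f"after 12,500 failures: {budget_consumed(12_500, valid, budget):.1%}")
```

In practice the total number of valid requests is only known at the end of the window, so a mid-window figure depends on whether you divide by traffic so far or by a forecast for the full window. State which one the calculator uses.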
Example 2: Time-based availability budget
A service has a 28-day availability SLO of 99.95%.
- Error budget percentage = 0.05% = 0.0005
- 28 days = 40,320 minutes
Allowed downtime = 40,320 × 0.0005 = 20.16 minutes
This framing is useful in incident communication. Saying, “We have consumed 12 of 20 allowed minutes this period,” is often clearer than speaking only in percentages.
Example 3: Burn rate during an incident
A service has a 99.9% SLO, so the allowed error rate is 0.1%. During an incident, the observed error rate over the past hour is 2%.
Burn rate = 2% / 0.1% = 20
A burn rate of 20 means the team is consuming the budget twenty times faster than the pace that would use it up exactly over the window. Even if the absolute duration is short, this may justify immediate escalation because a relatively brief incident could wipe out the monthly budget.
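A burn rate translates directly into time until the budget is gone: divide the window by the burn rate, scaled by how much budget is left. The sketch below assumes a 30-day window, which the example does not state, and that the burn rate stays constant.

```python
from datetime import timedelta


def time_to_exhaustion(
    window: timedelta, burn_rate: float, budget_remaining: float = 1.0
) -> timedelta:
    """Time until the budget runs out if the current burn rate is sustained."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window * budget_remaining / burn_rate


window = timedelta(days=30)  # assumed window length

full = time_to_exhaustion(window, burn_rate=20)
half = time_to_exhaustion(window, burn_rate=20, budget_remaining=0.5)
print(f"full budget lasts {full.total_seconds() / 3600:.0f} hours")
print(f"half budget lasts {half.total_seconds() / 3600:.0f} hours")
```

Under these assumptions a full budget lasts 36 hours at a burn rate of 20, and a half-spent budget lasts 18.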
Example 4: Latency-based SLO
A web service sets an SLO that 99% of requests must complete within 300 ms over 7 days.
- Total valid requests = 2,500,000
- Good requests within 300 ms = 2,470,000
Observed SLI = 2,470,000 / 2,500,000 = 98.8%
The service misses the target because the result is below 99%. The error budget approach still helps here: the budget is the share of requests allowed to exceed the threshold, which is 1% of 2,500,000, or 25,000 requests. The service had 30,000 slow requests, so it overspent the budget by 5,000. If performance regressions often come from deployments, connect this review with your release process and CI/CD diagnostics. Our CI/CD Pipeline Failure Troubleshooting Guide by Error Pattern is useful when reliability issues originate in build or deployment workflow changes.
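When you start from raw latency samples rather than pre-aggregated counts, the same SLI can be computed as below. This sketch assumes the boundary is inclusive (a request at exactly 300 ms counts as good), and the sample values are made up.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_target(sli: float, target: float) -> bool:
    return sli >= target


# Made-up samples: 8 of 10 are at or under 300 ms.
samples = [120, 180, 250, 299, 300, 301, 450, 90, 210, 275]
print(latency_sli(samples, threshold_ms=300))

# The aggregated counts from the example above.
example_sli = 2_470_000 / 2_500_000
print(meets_target(example_sli, target=0.99))
```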
Example 5: Batch processing SLO
A nightly data pipeline has an SLO that 99.5% of jobs must finish before 6:00 AM over 30 days.
- Total scheduled jobs = 600
- Allowed late or failed jobs = 600 × 0.005 = 3
If four jobs miss the deadline, the SLO is breached even if all infrastructure remained technically “up.” This illustrates why an SLO should match the service model. Uptime alone would miss the real business impact.
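Job counts are whole numbers, so the allowed count has to be rounded down. The sketch below uses `fractions.Fraction` with a string target to keep the arithmetic exact; the function names are hypothetical.

```python
import math
from fractions import Fraction


def allowed_misses(total_jobs: int, slo_target: str) -> int:
    """Whole number of late or failed jobs the SLO tolerates, e.g. target "0.995"."""
    budget = 1 - Fraction(slo_target)
    return math.floor(total_jobs * budget)


def is_breached(missed_jobs: int, total_jobs: int, slo_target: str) -> bool:
    return missed_jobs > allowed_misses(total_jobs, slo_target)


print(allowed_misses(600, "0.995"))
print(is_breached(3, 600, "0.995"))
print(is_breached(4, 600, "0.995"))
```

Rounding down matters when the product is not a whole number: with 500 jobs the budget works out to 2.5, so only two misses are tolerated.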
When to recalculate
SLOs are not set-and-forget metrics. A useful calculator becomes more valuable when teams revisit it deliberately instead of only after a breach.
Recalculate or review your SLO and error budget assumptions when:
- The user journey changes. A new feature, API contract, or client behavior may change what “good” means.
- Traffic volume shifts materially. Growth, seasonality, or customer concentration can change the sensitivity of the budget.
- The architecture changes. Service decomposition, queue adoption, edge caching, or database redesign can change both the SLI and the realistic target.
- Observability coverage improves. Better instrumentation may reveal that the original measurement was incomplete.
- The team changes release cadence. Faster shipping often calls for tighter burn-rate alerting and clearer error budget policy.
- You see repeated false positives or false negatives. If the SLO says users are fine when support tickets say otherwise, revisit the SLI design.
- A major incident exposes ambiguity. If the team debates whether the incident “counts,” the definition needs work.
A practical review cycle might look like this:
- Quarterly: review SLI definitions, targets, and exclusion rules.
- Monthly: inspect budget consumption trends and top causes.
- After incidents: verify whether the SLO reflected real user pain and whether policy actions were triggered appropriately.
- Before major launches: run the calculator with forecasted traffic and likely failure modes.
To make this operational, keep a short runbook next to the calculator:
- Who owns each SLO?
- Where is the source query or dashboard?
- What thresholds trigger review or release restrictions?
- Who can approve exceptions?
- What remediation work happens after a breach?
The best next step is simple: choose one important service, define one user-centered SLI, set one realistic target, and calculate the budget in a visible place your team actually uses. Then attach action to the number. If the budget is healthy, keep shipping with confidence. If it is burning too fast, slow down and fix what users feel. That is the point of an SLO calculator: not just to produce a percentage, but to improve reliability decisions before, during, and after incidents.