
Observability in 2026: SLOs, Error Budgets, and the Modern SRE Stack

March 24, 2026

For years, "monitoring" meant staring at dashboards, waiting for something to turn red, and then scrambling to figure out what went wrong. That era is over. In 2026, the best engineering teams have shifted from reactive monitoring to proactive observability — and the difference is not just semantic.

This post walks through the modern SRE stack: what observability actually means, how to define SLOs and error budgets that drive engineering decisions, and how to instrument your services with OpenTelemetry. If you run anything in production — whether it is a Next.js app on Vercel or a fleet of microservices on Kubernetes — this is the reliability playbook you need.

Monitoring vs Observability: What Actually Changed

Monitoring answers predefined questions: "Is the CPU above 80%?" or "Did the health check fail?" You set thresholds, and you get alerts when they are crossed.

Observability answers questions you have not thought of yet. It gives you the ability to understand why something is broken by examining the system's outputs — traces, metrics, and logs — without deploying new code.

The Three Pillars

| Pillar | What It Captures | Example |
| --- | --- | --- |
| Metrics | Numeric measurements over time | Request latency p99 = 240ms |
| Logs | Discrete events with context | `ERROR: Payment failed for user_id=abc123` |
| Traces | End-to-end request journeys | API Gateway → Auth Service → DB → Response (total: 380ms) |

The key insight is that these three pillars are correlated. A spike in your latency metric should link to the specific traces that were slow, and those traces should connect to the log entries that explain the failure. Without correlation, you are just doing monitoring with extra steps.
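In practice, correlation means every log line carries the IDs of the trace and span that produced it. Here is a minimal sketch of a structured logger that does this; the `TraceContext` shape and `logWithTraceContext` helper are illustrative, and in a real OpenTelemetry app the IDs would come from the active span via `trace.getActiveSpan()?.spanContext()`.

```typescript
// Illustrative helper: emit JSON log lines carrying trace_id/span_id so a log
// entry can be joined back to the trace (and metric spike) that produced it.
interface TraceContext {
  traceId?: string;
  spanId?: string;
}

function logWithTraceContext(
  level: string,
  message: string,
  ctx: TraceContext = {}
): string {
  const line = JSON.stringify({
    level,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
  console.log(line);
  return line;
}

// In an instrumented service, ctx would come from the current span, e.g.
// trace.getActiveSpan()?.spanContext() with @opentelemetry/api.
```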

What Changed in Practice

The old way:

  1. Alert fires: "500 error rate above 5%"
  2. Engineer opens Kibana, searches for errors
  3. Finds a log message, guesses the cause
  4. Deploys a fix, hopes it works

The observability way:

  1. Alert fires: "SLO burn rate exceeded for checkout flow"
  2. Engineer opens a trace view filtered to failing requests
  3. Sees the exact span where latency spiked — a database query taking 4 seconds
  4. Drills into the query, finds a missing index on a new column
  5. Adds the index, watches the burn rate recover in real time

The difference is not the tools. It is the workflow: from "something is broken, let me investigate" to "I can see exactly what is broken and why."

SLOs: The Foundation of Reliability Engineering

Service Level Objectives are the most important concept in modern SRE, and they are surprisingly misunderstood.

SLIs, SLOs, and SLAs — Untangled

These three terms get mixed up constantly. Here is the hierarchy:

SLI (Service Level Indicator) — A quantitative measure of your service's behavior. This is raw data.

SLI = (successful requests / total requests) * 100
SLI = percentage of requests completing in under 300ms
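Expressed as code, an availability SLI of this kind is just a ratio of counts. The helper below is a sketch; the function name and the zero-traffic convention are mine.

```typescript
// Availability SLI as a percentage of successful requests (illustrative helper).
function availabilitySli(
  successfulRequests: number,
  totalRequests: number
): number {
  if (totalRequests === 0) return 100; // no traffic: nothing has failed
  return (successfulRequests / totalRequests) * 100;
}
```

For example, `availabilitySli(999_550, 1_000_000)` evaluates to about 99.955, comfortably above a 99.9% target.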

SLO (Service Level Objective) — A target value for an SLI. This is your internal goal.

SLO: 99.9% of requests should succeed (availability)
SLO: 95% of requests should complete in under 300ms (latency)

SLA (Service Level Agreement) — A contractual obligation with consequences. This is a business commitment.

SLA: 99.5% uptime, or the customer gets a credit

The relationship: your SLO should always be stricter than your SLA. If your SLA promises 99.5%, your SLO should target 99.9%. The gap is your safety margin.

Choosing the Right SLIs

Not every metric makes a good SLI. The best SLIs measure what users experience, not what your infrastructure does.

Bad SLIs:

  • CPU utilization (users do not care about your CPU)
  • Pod restart count (infrastructure detail)
  • Database connection pool size (internal metric)

Good SLIs:

  • Request success rate (did the user get what they asked for?)
  • Latency at p50, p95, and p99 (how fast was the experience?)
  • Freshness (is the data the user sees up to date?)

A Practical SLO for a Next.js Application

Let us say you run a Next.js e-commerce site. Here is a sensible SLO definition:

# slo-definition.yaml
service: web-storefront
slos:
  - name: availability
    description: "Proportion of successful HTTP responses"
    sli:
      type: availability
      good_events: "http_status < 500"
      total_events: "all HTTP requests"
    target: 99.9%
    window: 30d

  - name: latency
    description: "Proportion of fast page loads"
    sli:
      type: latency
      good_events: "response_time < 500ms"
      total_events: "all page navigation requests"
    target: 95.0%
    window: 30d

  - name: checkout-success
    description: "Proportion of checkout attempts that succeed"
    sli:
      type: availability
      good_events: "checkout completed without error"
      total_events: "all checkout attempts"
    target: 99.5%
    window: 30d

Notice that the checkout SLO is separate and has a different target. Critical user journeys deserve their own SLOs.

Error Budgets: Making Reliability a Feature

An error budget is the inverse of your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% — roughly 43 minutes of downtime per month.
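The arithmetic behind that "43 minutes" is worth a sketch (the function name is mine):

```typescript
// Minutes of full downtime an availability SLO permits over its window.
function downtimeBudgetMinutes(
  sloTargetPercent: number,
  windowDays: number
): number {
  const budgetFraction = 1 - sloTargetPercent / 100; // e.g. 0.001 for 99.9%
  return windowDays * 24 * 60 * budgetFraction;
}
```

`downtimeBudgetMinutes(99.9, 30)` is about 43.2 minutes; tightening the target to 99.99% shrinks the budget to roughly 4.3 minutes per month.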

Why Error Budgets Matter

Error budgets transform reliability from a vague aspiration into a spendable resource. They answer questions that used to cause endless arguments:

| Question | Without Error Budgets | With Error Budgets |
| --- | --- | --- |
| "Can we ship this risky feature?" | "I don't know, it might break things" | "We have 38 minutes of budget left — yes, but with a canary rollout" |
| "Should we freeze deployments?" | "Something feels wrong" | "We burned 80% of our budget in 3 days — freeze until we fix the root cause" |
| "Do we need to invest in reliability?" | "Probably?" | "We exhausted our budget in 2 of the last 4 months — yes, reliability work takes priority" |

Calculating Burn Rate

The burn rate tells you how fast you are consuming your error budget. A burn rate of 1.0 means you will exactly exhaust your budget by the end of the window. A burn rate of 10.0 means you will exhaust it in 1/10th of the time.

// Error budget calculation
interface ErrorBudgetStatus {
  sloTarget: number;
  windowDays: number;
  currentErrorRate: number;
  burnRate: number;
  budgetRemaining: number;
  minutesRemaining: number;
}

function calculateErrorBudget(
  sloTarget: number,
  windowDays: number,
  errorsInWindow: number,
  totalRequestsInWindow: number
): ErrorBudgetStatus {
  const errorBudgetFraction = 1 - sloTarget / 100;
  const totalMinutesInWindow = windowDays * 24 * 60;
  const allowedErrors = totalRequestsInWindow * errorBudgetFraction;

  const currentErrorRate = errorsInWindow / totalRequestsInWindow;
  const burnRate = currentErrorRate / errorBudgetFraction;
  const budgetRemaining = Math.max(
    0,
    ((allowedErrors - errorsInWindow) / allowedErrors) * 100
  );
  // Convert the remaining budget fraction into minutes of allowable downtime.
  const minutesRemaining =
    (budgetRemaining / 100) * totalMinutesInWindow * errorBudgetFraction;

  return {
    sloTarget,
    windowDays,
    currentErrorRate,
    burnRate,
    budgetRemaining,
    minutesRemaining,
  };
}

// Example: 99.9% SLO over 30 days
const status = calculateErrorBudget(99.9, 30, 150, 1_000_000);
// burnRate: 0.15 (consuming budget at 15% of the sustainable pace)
// budgetRemaining: 85%
// minutesRemaining: ~36.7 minutes of downtime budget left

Burn Rate Alerting

Traditional threshold alerts ("error rate > 1%") are either too noisy or too slow. Burn rate alerts solve this by combining urgency with significance.

The multi-window, multi-burn-rate approach from Google's SRE workbook is the gold standard:

// Multi-window burn rate alert configuration
const alertRules = [
  {
    // Page-worthy: will exhaust budget in 1 hour
    severity: "critical",
    shortWindow: "5m",
    longWindow: "1h",
    burnRateThreshold: 14.4,
    action: "page on-call engineer",
  },
  {
    // Urgent: will exhaust budget in 6 hours
    severity: "warning",
    shortWindow: "30m",
    longWindow: "6h",
    burnRateThreshold: 6.0,
    action: "create ticket, notify channel",
  },
  {
    // Slow burn: will exhaust budget before window ends
    severity: "info",
    shortWindow: "6h",
    longWindow: "3d",
    burnRateThreshold: 1.0,
    action: "add to weekly review",
  },
];

The dual window prevents false positives. The short window catches real incidents. The long window confirms the problem is sustained, not a blip.
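The dual-window check itself is tiny. Here is a sketch, assuming the burn rates for both windows have already been computed upstream (the function name is mine):

```typescript
// An alert fires only when BOTH windows exceed the threshold: the short
// window proves it is happening right now, the long window proves it is
// sustained rather than a momentary blip.
function burnRateAlertFires(
  shortWindowBurnRate: number,
  longWindowBurnRate: number,
  threshold: number
): boolean {
  return shortWindowBurnRate >= threshold && longWindowBurnRate >= threshold;
}
```

A 5-minute spike against a quiet 1-hour window stays silent; a sustained incident trips both windows and pages.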

OpenTelemetry: The Instrumentation Standard

OpenTelemetry (OTel) has won. In 2026, it is the de facto standard for generating, collecting, and exporting telemetry data. If you are starting a new project or modernizing an existing one, OpenTelemetry is the answer.

Why OpenTelemetry Won

Before OTel, you had to choose your observability vendor up front and instrument with their proprietary SDK. Switching from Datadog to Grafana Cloud meant rewriting all your instrumentation. OpenTelemetry decouples instrumentation from the backend — instrument once, export anywhere.

┌──────────────────────────────────────────────┐
│               Your Application               │
│                                              │
│   ┌──────────────────────────────────────┐   │
│   │          OpenTelemetry SDK           │   │
│   │   ┌────────┐ ┌─────────┐ ┌──────┐    │   │
│   │   │ Traces │ │ Metrics │ │ Logs │    │   │
│   │   └───┬────┘ └────┬────┘ └──┬───┘    │   │
│   └───────┼───────────┼─────────┼────────┘   │
└───────────┼───────────┼─────────┼────────────┘
            │           │         │
            ▼           ▼         ▼
       ┌─────────────────────────────┐
       │  OTel Collector (optional)  │
       └────┬─────────┬─────────┬────┘
            │         │         │
            ▼         ▼         ▼
      ┌─────────┐ ┌─────────┐ ┌───────────┐
      │ Grafana │ │ Datadog │ │ Honeycomb │
      │  Tempo  │ │   APM   │ │           │
      └─────────┘ └─────────┘ └───────────┘

Instrumenting a Node.js / Next.js Application

Here is a complete setup for adding OpenTelemetry to a Next.js application. Start with the instrumentation file that Next.js looks for automatically.

First, install the required packages:

pnpm add @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/sdk-metrics \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Then create the instrumentation file:

// instrumentation.ts (Next.js auto-detects this file)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { Resource } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
  ATTR_DEPLOYMENT_ENVIRONMENT_NAME,
} from "@opentelemetry/semantic-conventions";

export function register() {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: "web-storefront",
      [ATTR_SERVICE_VERSION]: process.env.NEXT_PUBLIC_APP_VERSION ?? "0.0.0",
      [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]:
        process.env.NODE_ENV ?? "development",
    }),
    traceExporter: new OTLPTraceExporter({
      // Note: `env + "/v1/traces" || fallback` never falls back, because the
      // concatenated string is always truthy. Resolve the base URL first.
      url:
        (process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318") +
        "/v1/traces",
    }),
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({
        url:
          (process.env.OTEL_EXPORTER_OTLP_ENDPOINT ??
            "http://localhost:4318") + "/v1/metrics",
      }),
      exportIntervalMillis: 15000,
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
        "@opentelemetry/instrumentation-http": {
          ignoreIncomingPaths: [/\/_next\/static/, /\/favicon\.ico/],
        },
        "@opentelemetry/instrumentation-fs": {
          enabled: false, // Too noisy for most use cases
        },
      }),
    ],
  });

  sdk.start();
}

Adding Custom Spans

Auto-instrumentation covers HTTP requests, database queries, and other common operations. But the most valuable traces come from custom spans around your business logic.

// app/lib/tracing.ts
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("web-storefront");

/**
 * Wrap an async function in a traced span.
 * Automatically records errors and sets span status.
 */
export async function withSpan<T>(
  name: string,
  attributes: Record<string, string | number | boolean>,
  fn: (span: Span) => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      span.setAttributes(attributes);
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Using it in your application code:

// app/api/checkout/route.ts
import { withSpan } from "@/app/lib/tracing";

export async function POST(request: Request) {
  return withSpan(
    "checkout.process",
    { "checkout.source": "web" },
    async (span) => {
      const body = await request.json();
      span.setAttribute("checkout.item_count", body.items.length);

      // Validate inventory — this gets its own child span
      const available = await withSpan(
        "checkout.validate_inventory",
        { "inventory.item_count": body.items.length },
        async () => {
          return checkInventory(body.items);
        }
      );

      if (!available) {
        span.setAttribute("checkout.result", "out_of_stock");
        return Response.json(
          { error: "Items out of stock" },
          { status: 409 }
        );
      }

      // Process payment — another child span
      const payment = await withSpan(
        "checkout.process_payment",
        { "payment.method": body.paymentMethod },
        async () => {
          return processPayment(body);
        }
      );

      span.setAttribute("checkout.result", "success");
      span.setAttribute("checkout.order_id", payment.orderId);

      return Response.json({ orderId: payment.orderId });
    }
  );
}

Custom Metrics

Beyond traces, you will want custom metrics for business-level monitoring:

// app/lib/metrics.ts
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("web-storefront");

// Counter: things that only go up
export const checkoutCounter = meter.createCounter("checkout.attempts", {
  description: "Number of checkout attempts",
  unit: "1",
});

// Histogram: distribution of values
export const checkoutDuration = meter.createHistogram("checkout.duration", {
  description: "Time to complete checkout",
  unit: "ms",
});

// UpDownCounter: things that go up and down
export const activeCartGauge = meter.createUpDownCounter("carts.active", {
  description: "Number of active shopping carts",
  unit: "1",
});

// Usage in your code
checkoutCounter.add(1, {
  "checkout.method": "credit_card",
  "checkout.result": "success",
});

checkoutDuration.record(342, {
  "checkout.method": "credit_card",
});

Building Dashboards That Matter

Most dashboards are useless. They show a wall of graphs that nobody looks at until something breaks, and then the graph you need is not there. Here is how to build dashboards that actually drive decisions.

The Four Golden Signals

Google's SRE book identified four signals that matter for every service:

  1. Latency — How long requests take (split by success vs error)
  2. Traffic — How much demand is hitting the system
  3. Errors — The rate of failed requests
  4. Saturation — How full your service is (CPU, memory, queue depth)

The RED Method for Microservices

For request-driven services, the RED method is simpler:

  • Rate — Requests per second
  • Errors — Failed requests per second
  • Duration — Distribution of request latencies
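Against Prometheus-style HTTP metrics like the ones used in the dashboard section below, the three RED panels map to queries along these lines (the metric and label names are assumptions, not a fixed standard):

```promql
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: failed requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: p95 latency derived from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```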

Dashboard Layout

A good SRE dashboard follows a top-down structure:

┌─────────────────────────────────────────────────┐
│ SLO STATUS: Are we meeting our objectives?      │
│ ┌──────────┐ ┌───────────┐ ┌──────────────────┐ │
│ │ Avail:   │ │ Latency:  │ │ Budget: 72% left │ │
│ │ 99.95%   │ │ p99:210ms │ │ ████████░░░░░░░  │ │
│ └──────────┘ └───────────┘ └──────────────────┘ │
├─────────────────────────────────────────────────┤
│ GOLDEN SIGNALS: What is happening right now?    │
│ [Traffic graph] [Error rate] [Latency heatmap]  │
├─────────────────────────────────────────────────┤
│ INFRASTRUCTURE: Where are the bottlenecks?      │
│ [CPU] [Memory] [Disk I/O] [Network] [Queue]     │
├─────────────────────────────────────────────────┤
│ DEPLOYMENTS: What changed?                      │
│ [Deploy markers on all graphs] [Config changes] │
└─────────────────────────────────────────────────┘

The top row is for executives and on-call engineers glancing at the dashboard. The middle rows are for active investigation. The bottom row correlates incidents with changes.

Grafana Dashboard as Code

Define your dashboards in code so they are version-controlled and reproducible:

{
  "dashboard": {
    "title": "Web Storefront — SLO Overview",
    "panels": [
      {
        "title": "Availability SLO (target: 99.9%)",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])) * 100",
            "legendFormat": "Availability %"
          }
        ],
        "thresholds": {
          "steps": [
            { "color": "red", "value": 99.0 },
            { "color": "yellow", "value": 99.5 },
            { "color": "green", "value": 99.9 }
          ]
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (sum(increase(http_requests_total{status=~\"5..\"}[30d])) / (sum(increase(http_requests_total[30d])) * 0.001))) * 100"
          }
        ]
      },
      {
        "title": "Latency Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)"
          }
        ]
      }
    ]
  }
}

Putting It All Together: An SRE Playbook

Here is a step-by-step playbook for implementing SRE practices in your team, whether you are a solo developer or part of a larger organization.

Step 1: Define Your Critical User Journeys

Before you write a single line of instrumentation code, identify the 3 to 5 things your users care about most:

1. Homepage loads successfully
2. Search returns relevant results
3. Product page displays with correct pricing
4. Checkout completes without error
5. Order confirmation email is delivered

Step 2: Set SLOs for Each Journey

For each journey, define an SLI and an SLO:

const slos = {
  homepage: {
    availability: { target: 99.95, window: "30d" },
    latency: { target: 95, threshold: "1s", window: "30d" },
  },
  search: {
    availability: { target: 99.9, window: "30d" },
    latency: { target: 90, threshold: "500ms", window: "30d" },
  },
  checkout: {
    availability: { target: 99.5, window: "30d" },
    latency: { target: 95, threshold: "3s", window: "30d" },
  },
};

Step 3: Instrument with OpenTelemetry

Use the auto-instrumentation setup shown earlier, then add custom spans for your critical journeys.

Step 4: Set Up Burn Rate Alerts

Configure alerts based on burn rate, not raw error rates. Start with two tiers — critical (pages you) and warning (creates a ticket).
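As a sketch, the critical tier might look like the following Prometheus alerting rule. The `slo:error_ratio:rate*` recording rules, the `service` label, and the 0.001 budget fraction (for a 99.9% SLO) are all assumptions you would adapt to your own metric names.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: CheckoutSloFastBurn
        # Burn rate = observed error ratio / budget fraction (0.001 for 99.9%).
        # Fires only when both the 5m and 1h windows burn faster than 14.4x.
        expr: |
          (slo:error_ratio:rate5m{service="web-storefront"} / 0.001) > 14.4
          and
          (slo:error_ratio:rate1h{service="web-storefront"} / 0.001) > 14.4
        labels:
          severity: critical
        annotations:
          summary: "Checkout SLO burning 14.4x faster than sustainable"
```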

Step 5: Run Error Budget Reviews

Every week or every sprint, review your error budget consumption:

  • How much budget did we spend?
  • What caused the biggest burns?
  • Do we need to prioritize reliability work or can we ship features?

Step 6: Iterate

SLOs are not set in stone. If you are always meeting your SLO easily, tighten it. If you are constantly burning through your budget, either loosen the SLO or invest more in reliability.

Common Mistakes to Avoid

Setting too many SLOs. Start with 2 to 3. You can always add more. Too many SLOs means none of them get attention.

Using infrastructure metrics as SLIs. Your users do not experience CPU usage. They experience page load time and error messages. Measure what they feel.

Alerting on every SLO violation. Use burn rate alerting. A brief spike that recovers is not worth waking someone up at 3 AM.

Treating SLOs as SLAs. SLOs are internal targets. They should be stricter than your SLAs and they should evolve.

Ignoring the human side. SRE is as much about process and culture as it is about tools. Error budget policies only work if leadership agrees to enforce them — including freezing features when the budget is exhausted.

The Stack I Recommend in 2026

| Component | Recommended | Alternative |
| --- | --- | --- |
| Instrumentation | OpenTelemetry SDK | Datadog APM (vendor lock-in) |
| Collector | OTel Collector | Grafana Alloy |
| Traces | Grafana Tempo | Jaeger, Honeycomb |
| Metrics | Prometheus + Grafana | Datadog, New Relic |
| Logs | Grafana Loki | Elasticsearch, Datadog Logs |
| Dashboards | Grafana | Datadog, Chronosphere |
| Alerting | Grafana Alerting | PagerDuty, Opsgenie |
| SLO Management | Sloth, OpenSLO | Nobl9, Datadog SLOs |

The Grafana stack (Tempo + Prometheus + Loki + Grafana) gives you a fully open-source observability platform with no vendor lock-in. Pair it with OpenTelemetry instrumentation and you have a stack that will serve you well regardless of scale.

Final Thoughts

The shift from monitoring to observability is not about buying new tools. It is about changing how you think about reliability. SLOs give you a shared language for talking about service health. Error budgets turn reliability into a measurable, spendable resource. OpenTelemetry ensures your instrumentation investment is portable.

Start small. Pick one service, define one SLO, instrument it with OpenTelemetry, and set up a burn rate alert. Once you see how much clarity that brings, you will never go back to threshold-based monitoring.

The best time to adopt SRE practices was when your service first went to production. The second best time is today.
