Kubernetes Requests and Limits Best Practices

A practical hub for tuning Kubernetes CPU and memory requests and limits without causing throttling, OOM kills, or wasted cluster capacity.

Kubernetes resource requests and limits shape far more than a pod spec. They influence scheduling, node efficiency, application latency, restart behavior, autoscaling, and the day-to-day experience of debugging production issues. This guide is designed as a durable hub teams can revisit as workloads evolve: it explains how requests and limits work, where they commonly go wrong, how to tune CPU and memory without guesswork, and which adjacent troubleshooting topics matter when cluster behavior starts to drift.

Overview

If your cluster feels unstable even when node utilization looks reasonable, resource configuration is often part of the story. Poorly chosen requests can leave pods pending, waste capacity, or mislead the scheduler. Poorly chosen limits can cause CPU throttling, out-of-memory kills, and performance that appears random under load. In many teams, these settings are added once during service creation and then ignored until an incident forces a review.

The practical goal is not to find a single “perfect” number. It is to establish a repeatable process for Kubernetes requests and limits that balances three things:

Application reliability: the workload gets enough resources to serve traffic consistently.
Cluster efficiency: nodes are packed sensibly without constant resource contention.
Operational clarity: engineers can explain why values were chosen and when they should change.

At a high level, think of requests and limits this way:

Requests are what the scheduler uses to decide where a pod can fit. They represent reserved capacity for placement decisions.
Limits are the hard ceiling enforced on the container at runtime.

That distinction matters because CPU and memory behave differently.

CPU is compressible. If a container tries to use more CPU than its limit allows, it is throttled rather than terminated. That is why CPU throttling in Kubernetes can produce slow responses and latency spikes without obvious crashes.

Memory is not compressible in the same way. A container that exceeds its memory limit is typically terminated by the kernel's out-of-memory (OOM) killer, which Kubernetes reports as an OOMKilled status. That is why memory limits usually require more caution than CPU limits.
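
To make the CPU side concrete, here is a minimal sketch of how a CPU limit translates into a time budget under the Linux CFS bandwidth controller. It assumes the common default period of 100 ms. The function names are illustrative and not part of any Kubernetes API, and the estimate ignores how demand is spread within a period.

```python
# Assumption: default CFS period of 100 ms; clusters can configure a different value.
CFS_PERIOD_US = 100_000


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time in microseconds a container may use per period."""
    return limit_millicores * period_us // 1000


def unmet_cpu_us_per_period(demand_millicores: int, limit_millicores: int,
                            period_us: int = CFS_PERIOD_US) -> int:
    """Estimate the CPU time per period the container wants but cannot get."""
    wanted_us = demand_millicores * period_us // 1000
    return max(0, wanted_us - cfs_quota_us(limit_millicores, period_us))


if __name__ == "__main__":
    print(cfs_quota_us(500))                  # budget for a 500m limit
    print(unmet_cpu_us_per_period(800, 500))  # burst above the limit
```

Under these assumptions, a 500m limit allows 50 ms of CPU time per 100 ms period, so a burst that needs 80 ms is paused for the rest of the period even if the node has idle cores.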

A workable baseline for most teams is:

Set requests for every production workload.
Set memory limits carefully, based on strong evidence of real usage.
Use CPU limits selectively, especially for latency-sensitive services.
Tune from observed usage over time, not from guesses or developer laptops.
Review values after feature changes, traffic changes, or incident patterns.

This topic also connects directly to common troubleshooting paths. A pod may sit in Pending because requests are too high for available nodes. A service may enter CrashLoopBackOff because memory pressure causes repeated OOM kills. For those scenarios, see the Kubernetes Pending Pod Troubleshooting Guide and the Kubernetes CrashLoopBackOff Troubleshooting Checklist.

Topic map

This topic is easiest to manage when broken into a small set of decisions. Use this map to orient tuning work and to turn one-off fixes into a consistent Kubernetes resource management practice.

1. Start with workload behavior, not default numbers

A batch job, API service, queue worker, and controller all consume resources differently. Before setting values, identify:

Whether the workload is latency-sensitive or throughput-oriented
Whether usage is steady, bursty, or tied to cron schedules
Whether startup requires more memory or CPU than steady state
Whether horizontal scaling is expected to absorb spikes
Whether downstream bottlenecks make extra CPU meaningless

A common mistake is assigning one standard request and limit profile to every deployment. Uniformity is convenient, but it often creates hidden waste or unstable behavior.

2. Set requests from observed baseline usage

Requests should reflect what the workload usually needs to run well, not its rarest peak and not its quietest idle period. In practice, teams often choose a value that covers normal sustained operation with some headroom. The exact percentage varies by tolerance for risk, but the principle stays consistent: requests are a scheduling input, so they should be grounded in realistic baseline demand.

If requests are too low:

The scheduler may overpack nodes.
Pods may compete aggressively during busy periods.
Node pressure becomes more likely.
Autoscaling signals may become less trustworthy.

If requests are too high:

Pods can remain unschedulable even when actual usage is modest.
Node utilization appears low while capacity is effectively stranded.
Cluster costs rise because the scheduler reserves capacity that workloads never use.
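
One way to turn observed baseline usage into a request is a percentile plus headroom. The sketch below assumes usage samples have already been exported from your metrics system. The 90th percentile and 15 percent headroom are illustrative defaults chosen for the example, not values this guide prescribes.

```python
import math


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of usage samples."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_request(samples: list, pct: int = 90, headroom_pct: int = 15) -> int:
    """Suggest a request: the chosen percentile of usage plus headroom, rounded up."""
    base = percentile(samples, pct)
    return math.ceil(base * (100 + headroom_pct) / 100)


if __name__ == "__main__":
    # Hypothetical CPU usage samples in millicores, including one rare spike.
    cpu_samples_m = [100, 120, 110, 130, 400, 125, 115, 105, 135, 140]
    print(suggest_request(cpu_samples_m))
```

With these hypothetical samples the suggestion is driven by sustained usage rather than the single 400m spike, which matches the principle of sizing requests for normal operation instead of the rarest peak.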

3. Treat CPU and memory as separate tuning problems

One of the most durable tuning habits is to treat CPU and memory as separate decisions. A service may need conservative memory settings to avoid OOMs while still benefiting from flexible CPU bursting. Another may have predictable CPU demand but highly variable memory during cache warmup or large object processing.

Good tuning usually asks separate questions:

How much CPU does the app need to meet latency or throughput targets?
How much memory does the app need to avoid garbage collection pressure, swapping behavior, or OOM termination?
Do startup and steady-state patterns differ enough to justify separate operational handling?

4. Be cautious with CPU limits

CPU limits are attractive because they seem to promise fairness. In reality, they can introduce hard-to-diagnose slowness. If a service is latency-sensitive and already has a sensible CPU request, a tight CPU limit may create throttling during short bursts that users experience as intermittent degradation.

That does not mean CPU limits are always wrong. They can still be useful for noisy-neighbor control, shared clusters, low-priority workloads, or specific platform policies. But they should be applied deliberately and observed in production.

If you suspect throttling, inspect container CPU usage, throttled time metrics, and request latency together. Tuning in isolation can mislead you.
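
A throttling check can be sketched as a ratio of two counters. cAdvisor exposes container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total; the function below assumes you have sampled both at the start and end of a window. The function name is illustrative, and any alerting threshold is yours to choose.

```python
def throttle_ratio(periods_start: int, periods_end: int,
                   throttled_start: int, throttled_end: int) -> float:
    """Fraction of CFS periods in the window during which the container was throttled."""
    periods = periods_end - periods_start
    if periods <= 0:
        # No elapsed periods (or a counter reset): report no throttling for this window.
        return 0.0
    return (throttled_end - throttled_start) / periods


if __name__ == "__main__":
    # Hypothetical counter readings taken a few minutes apart.
    print(throttle_ratio(1000, 1600, 50, 200))
```

A ratio on its own says little. Read it next to request latency for the same window, as described above.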

5. Be explicit about memory limits

Memory limits should reflect tested behavior, not assumptions. If memory limits are set too low, the container may be repeatedly killed under load, causing instability that looks like an application defect. If they are set far above realistic need, they may fail to provide useful guardrails in multi-tenant clusters.

For many teams, the safest pattern is:

Set a realistic memory request based on observed steady-state behavior.
Set a memory limit only after validating peak usage, startup overhead, and failure mode tolerance.
Leave enough margin for temporary spikes, runtime overhead, and instrumentation.
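
The pattern above can be sketched as a small calculation. It assumes you have measured both steady-state peak and startup peak memory. The 25 percent margin is an illustrative placeholder for spikes, runtime overhead, and instrumentation, not a recommended value.

```python
import math


def suggest_memory_limit_mib(steady_peak_mib: int, startup_peak_mib: int,
                             margin_pct: int = 25) -> int:
    """Suggest a memory limit in MiB: the worst observed peak plus a safety margin."""
    worst_peak = max(steady_peak_mib, startup_peak_mib)
    return math.ceil(worst_peak * (100 + margin_pct) / 100)


if __name__ == "__main__":
    # Hypothetical service whose startup briefly needs more memory than steady state.
    print(suggest_memory_limit_mib(steady_peak_mib=400, startup_peak_mib=520))
```

Taking the maximum of the two peaks keeps a limit sized for steady state from killing the container during startup.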

Instrumenting the workload matters here. The OpenTelemetry Setup Guide for Logs, Metrics, and Traces can help teams improve correlation between application behavior and container-level resource events.

6. Use policies to prevent drift

Even strong initial tuning degrades if every team uses different conventions. As your cluster grows, standardize around:

Namespaces or workload classes with expected resource profiles
Admission policies that require requests on production workloads
Review rules for memory limits on critical services
Dashboards that compare requests, limits, actual usage, and restart counts
Periodic cleanup of stale settings after architecture changes

This is where platform engineering adds value: not by forcing every service into one shape, but by giving teams a narrow, sensible operating path.
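
As an illustration of a rule that requires requests, the following sketch checks a pod spec that has already been parsed into a Python dictionary. It only mirrors the containers[].resources.requests layout of a pod spec. The function name is hypothetical, and a real cluster would enforce this through an admission mechanism rather than a standalone script.

```python
def missing_requests(pod_spec: dict) -> list:
    """Return one message per container resource that has no request set."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Treat a missing or null resources/requests block as empty.
        requests = (container.get("resources") or {}).get("requests") or {}
        for resource in ("cpu", "memory"):
            if resource not in requests:
                problems.append(f"{name}: missing {resource} request")
    return problems


if __name__ == "__main__":
    # Hypothetical pod spec with one compliant and one non-compliant container.
    spec = {
        "containers": [
            {"name": "api", "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
            {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
        ]
    }
    for problem in missing_requests(spec):
        print(problem)
```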

Resource tuning rarely lives alone. It intersects with scheduling, autoscaling, observability, and incident response. These are the subtopics worth tracking alongside requests and limits.

Scheduling and Pending pods

High requests are one of the simplest ways to create Pending pods. The cluster may have enough total capacity, but not enough allocatable space on any node that matches the scheduling constraints. Before adding nodes, verify whether requests are oversized for the workload. The Kubernetes Pending Pod Troubleshooting Guide is a useful companion when requests, affinities, taints, or topology rules interact.
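
The difference between total capacity and per-node fit can be shown with a small sketch. It considers only CPU and memory and ignores affinities, taints, and topology rules. The node figures are hypothetical.

```python
def fits_on_some_node(request_cpu_m: int, request_mem_mib: int, nodes_free: list) -> bool:
    """True if at least one node has enough free allocatable CPU and memory for the pod."""
    return any(
        free_cpu_m >= request_cpu_m and free_mem_mib >= request_mem_mib
        for free_cpu_m, free_mem_mib in nodes_free
    )


if __name__ == "__main__":
    # Three hypothetical nodes, each with 1500m CPU and 4096 MiB free.
    nodes = [(1500, 4096), (1500, 4096), (1500, 4096)]
    total_free_cpu_m = sum(cpu for cpu, _ in nodes)
    print(total_free_cpu_m)                      # cluster-wide free CPU
    print(fits_on_some_node(2000, 1024, nodes))  # pod requesting 2000m CPU
```

In this example the cluster has 4500m of free CPU in total, yet a pod requesting 2000m stays Pending because no single node can hold it.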

CrashLoopBackOff and OOM diagnosis

Repeated restarts are often blamed on application bugs first. Sometimes that is correct, but memory pressure can be the real trigger. Check events, restart reason, and workload logs together. If OOMKilled appears, revisit memory request and limit assumptions before changing code blindly. The Kubernetes CrashLoopBackOff Troubleshooting Checklist can help structure that review.

Autoscaling behavior

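Horizontal Pod Autoscaler and cluster autoscaling decisions both depend on resource configuration. The Horizontal Pod Autoscaler measures CPU and memory utilization as a percentage of the pod's requests, and cluster autoscalers typically add nodes when pods cannot be scheduled because of their requests. If requests are far from reality, autoscaling can become noisy or ineffective. Understated requests make utilization look high too early and trigger premature scale-out; overstated requests suppress scaling signals and reserve nodes that go unused. Resource values are not just guardrails; they shape scaling math.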
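
The scaling math can be sketched with the formula from the Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas * current metric / target metric). The sketch below is a simplification that ignores the tolerance band, stabilization windows, and min/max replica bounds. The usage and request figures are hypothetical.

```python
import math


def cpu_utilization_pct(usage_m: int, request_m: int) -> float:
    """CPU utilization as the autoscaler sees it: usage relative to the request."""
    return 100 * usage_m / request_m


def desired_replicas(current_replicas: int, current_util_pct: float,
                     target_util_pct: float) -> int:
    """Core HPA formula, without tolerance, stabilization, or replica bounds."""
    return math.ceil(current_replicas * current_util_pct / target_util_pct)


if __name__ == "__main__":
    usage_m, replicas, target_pct = 300, 4, 60
    # Same real usage, three different requests: understated, realistic, overstated.
    for request_m in (250, 500, 1000):
        util = cpu_utilization_pct(usage_m, request_m)
        print(request_m, util, desired_replicas(replicas, util, target_pct))
```

With identical real usage, the understated request doubles the replica count while the overstated one halves it. This is why requests that drift from reality distort scaling decisions.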

Observability and alert design

A useful dashboard for Kubernetes resources should show more than raw CPU and memory usage. Include request vs usage, limit proximity, throttling, OOM events, restarts, pod distribution, and service latency. Without this context, teams end up debating whether the problem is application performance or cluster contention.

For broader monitoring design, compare your stack choices in Prometheus vs Datadog vs Grafana Cloud: Monitoring Stack Comparison. For signal quality during incidents, the On-Call Alert Tuning Checklist to Reduce Noise Without Missing Incidents is a useful follow-up.

SLOs and performance budgeting

Requests and limits should support service goals, not exist apart from them. If a service has an availability or latency target, resource tuning should be evaluated against that target. A configuration that saves node capacity but increases tail latency may not be a good trade. The SLO and Error Budget Calculator Guide for SRE Teams can help teams connect tuning choices to reliability outcomes.

CI/CD and configuration hygiene

Resource settings are configuration, and configuration deserves code review, version history, and rollback safety. Treat request and limit changes like other production-affecting changes: test them, annotate them, and deploy them deliberately. If your delivery process already struggles with environment drift or opaque failures, it is worth tightening operational review in parallel. The CI/CD Pipeline Failure Troubleshooting Guide by Error Pattern is relevant when your deployment system makes safe config iteration harder than it should be.

Workload classes to define clearly

As a reusable mental model, it helps to group workloads into classes with different default expectations:

Latency-sensitive APIs: avoid unnecessarily strict CPU limits; validate memory headroom carefully.
Background workers: often tolerate CPU limits better; still need realistic memory sizing.
Batch jobs: may need large but temporary resource windows; tune around completion goals.
Controllers and operators: usually lightweight, but undersizing them can create broad platform instability.
Data-processing services: often experience bursty memory profiles that make simplistic limits risky.

How to use this hub

Use this page as a working reference rather than a one-time read. The most reliable way to improve Kubernetes resource requests and limits is to make tuning part of normal operations.

A simple review workflow

Pick one workload that is costly, noisy, or user-facing.
Collect a recent usage window that includes normal traffic and at least one busy period.
Compare actual CPU and memory usage against current requests and limits.
Check symptoms: throttling, OOM kills, Pending pods, restarts, latency changes, and scaling behavior.
Adjust one dimension at a time, especially for critical services.
Deploy gradually and watch both resource and application metrics.
Document why the value changed so the next review starts with context.
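
The comparison and symptom-check steps of this workflow could be sketched as follows. The WorkloadSnapshot type and every threshold are hypothetical, chosen only to illustrate the shape of a review. Substitute values that match your own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSnapshot:
    """Hypothetical summary of one review window for a single container."""
    cpu_request_m: int      # configured CPU request, millicores
    cpu_typical_m: int      # typical sustained CPU usage, millicores
    mem_limit_mib: int      # configured memory limit, MiB
    mem_peak_mib: int       # highest observed memory usage, MiB
    throttle_ratio: float   # fraction of CFS periods throttled (0.0 to 1.0)
    oom_kills: int          # OOMKilled events in the window


def review_findings(s: WorkloadSnapshot) -> list:
    """Return human-readable findings; all thresholds are illustrative."""
    findings = []
    if s.cpu_typical_m * 2 < s.cpu_request_m:
        findings.append("CPU request is more than double typical usage")
    if s.cpu_typical_m > s.cpu_request_m:
        findings.append("CPU request is below typical usage")
    if s.throttle_ratio > 0.10:
        findings.append("CPU throttling exceeds 10% of periods")
    if s.oom_kills > 0:
        findings.append("OOM kills observed in the window")
    elif s.mem_peak_mib * 10 > s.mem_limit_mib * 9:
        findings.append("memory peak is within 10% of the limit")
    return findings


if __name__ == "__main__":
    snapshot = WorkloadSnapshot(1000, 300, 512, 480, 0.25, 0)
    for finding in review_findings(snapshot):
        print(finding)
```

Each finding points at one dimension, which fits the advice to adjust one dimension at a time.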

Questions worth asking during each review

Is the request high because the app truly needs it, or because nobody revisited an early estimate?
Is a CPU limit preventing healthy burst behavior?
Is the memory limit low enough to cause avoidable OOMs?
Did a recent feature, library change, or traffic pattern shift change runtime behavior?
Would a horizontal scaling change be better than a per-pod resource increase?
Are alerts tied to the right symptom, or only to raw usage?

What not to do

Do not copy values from unrelated services.
Do not assume requests should equal limits in every case.
Do not raise limits first and ask questions later during an incident.
Do not tune only by average usage; peaks and startup patterns matter.
Do not separate resource changes from application and SLO outcomes.

For teams managing many repositories or many services, create a lightweight runbook template that records baseline usage, current values, observed problems, changes made, and follow-up date. That turns resource tuning into a repeatable operational habit rather than tribal knowledge.
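
A runbook entry of that kind could be modeled as a small record. This is only one possible shape. The class name, field names, and the example workload are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ResourceReviewEntry:
    """One review of a workload's requests and limits (hypothetical template)."""
    workload: str
    baseline_usage: dict               # e.g. {"cpu_m": 300, "memory_mib": 400}
    current_values: dict               # configured requests and limits
    observed_problems: list = field(default_factory=list)
    changes_made: list = field(default_factory=list)
    follow_up: Optional[date] = None

    def is_due(self, today: date) -> bool:
        """True when a follow-up date is set and has been reached."""
        return self.follow_up is not None and today >= self.follow_up


if __name__ == "__main__":
    entry = ResourceReviewEntry(
        workload="checkout-api",  # hypothetical service name
        baseline_usage={"cpu_m": 300, "memory_mib": 400},
        current_values={"cpu_request_m": 1000, "memory_limit_mib": 512},
        observed_problems=["CPU request far above typical usage"],
        changes_made=["lowered CPU request to 400m"],
        follow_up=date(2025, 1, 15),
    )
    print(entry.is_due(date(2025, 2, 1)))
```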

When to revisit

Resource settings should be reviewed whenever the inputs change. In practice, that means revisiting them more often than many teams expect. Use the following triggers as a practical checklist:

After major releases that change memory allocation patterns, concurrency, caching, or request volume
After repeated latency incidents, especially if CPU throttling is suspected
After any OOMKilled event in production or staging under realistic load
When pods remain Pending and scheduling constraints appear stricter than actual usage justifies
When autoscaling feels erratic or node growth outpaces demand
When observability improves and new runtime data becomes available
On a regular schedule, such as quarterly for critical services

A good next step is to choose three production workloads and review them this week: one user-facing service, one worker, and one platform component. For each, compare requests, limits, actual usage, and recent incidents. Remove obviously outdated guesses, note where CPU throttling or memory pressure may be present, and schedule a follow-up review after the next release cycle. Over time, this simple practice improves cluster stability, reduces wasted capacity, and gives your team a more confident approach to Kubernetes resource management.
