Kubernetes CrashLoopBackOff Troubleshooting Checklist

A reusable checklist for diagnosing Kubernetes CrashLoopBackOff issues across config, probes, images, dependencies, and resource limits.

A pod in CrashLoopBackOff is not a root cause. It is Kubernetes telling you that a container starts, fails, and is being restarted with an exponentially increasing delay between attempts (capped at five minutes by default). That distinction matters, because the fastest way to fix a restart loop is to work from symptoms toward the exact failing layer: process startup, image, configuration, dependency, probe behavior, or resource pressure. This checklist is designed as a reusable runbook. Use it when a deployment suddenly starts flapping, when a rollout stalls, or when a recurring workload becomes unreliable after an otherwise small change.

Overview

This guide gives you a practical sequence for CrashLoopBackOff troubleshooting without jumping straight to guesses. In most cases, you can narrow the issue quickly if you answer four questions in order.

Is the container actually starting and then exiting, or failing before startup?
What was the last known change: image, config, secret, dependency, probe, node, or policy?
Does the failure happen inside the app process, or is Kubernetes killing it?
Is the problem isolated to one workload, one node, one namespace, or one release?

Start with basic evidence collection before editing manifests or rolling back:

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp

Those commands answer most first-pass questions. describe shows state transitions, restart counts, probe failures, and scheduling context. Current logs show what the running container instance has emitted so far. --previous shows the output of the last terminated instance, which is especially useful when a pod restarts repeatedly, because the current container may not stay alive long enough to inspect.
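The evidence-collection commands above can be wrapped in a small helper so that every incident starts from the same snapshot. This is a minimal sketch, assuming a POSIX shell and a working kubectl context. The function name collect_crashloop_evidence and the output file names are illustrative and not part of kubectl.

```shell
# collect_crashloop_evidence NAMESPACE POD OUTPUT_DIR
# Hypothetical helper: saves first-pass evidence for one pod into OUTPUT_DIR.
collect_crashloop_evidence() {
  ns=$1
  pod=$2
  out=$3
  mkdir -p "$out" || return 1

  # Any single command may fail (for example, --previous when no earlier
  # container instance exists), so capture stderr too and keep going.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml" 2>&1 || true
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt" 2>&1 || true
  kubectl logs "$pod" -n "$ns" > "$out/logs-current.txt" 2>&1 || true
  kubectl logs "$pod" -n "$ns" --previous > "$out/logs-previous.txt" 2>&1 || true
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp \
    > "$out/events.txt" 2>&1 || true

  echo "evidence written to $out"
}
```

Call it as collect_crashloop_evidence <namespace> <pod-name> ./evidence. For pods with more than one container, you would likely add -c <container> to the two logs commands.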

As you work, avoid one common trap: treating every restart loop as an application bug. Plenty of restart loops come from missing environment variables, bad ports, invalid command overrides, failed mounts, or aggressive probes. The checklist below is organized by those scenarios.

Checklist by scenario

Use this section as a decision tree. Pick the scenario that most closely matches the evidence you already have.

1. The container exits immediately with an application error

Symptoms: Logs show a stack trace, startup exception, migration failure, config parser error, or a non-zero exit code. The pod reaches the Running state briefly, then dies.

Check:

Read the full startup log, not just the last line. Early lines often reveal missing config or failed initialization.
Inspect the container command and args in the pod spec. A bad override can bypass the image's expected entrypoint.
Verify environment variables from env, envFrom, ConfigMaps, and Secrets.
Confirm expected files exist at mounted paths.
Check whether a startup task such as schema migration, cache warm-up, or certificate load is failing.

Useful commands:

kubectl get pod <pod> -n <ns> -o yaml
kubectl logs <pod> -n <ns> --previous

Likely fixes: Restore missing config, correct the command, point the app to the right port or endpoint, or separate one-time initialization from the main container process.
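The non-zero exit code mentioned in the symptoms often narrows the search before you read a single log line. The sketch below maps a few conventional codes to a first guess. It assumes the usual shell conventions (126, 127, and 128 plus the signal number). Applications are free to define their own codes, so treat the output as a hint. The function name explain_exit_code is hypothetical.

```shell
# explain_exit_code CODE
# Hypothetical helper: prints a first guess for a container exit code, based on
# common shell and signal conventions (128 + signal number).
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check whether the process was meant to keep running" ;;
    1)   echo "generic application error; read the startup logs" ;;
    126) echo "command found but not executable; check file permissions and entrypoint" ;;
    127) echo "command not found; check command, args, and PATH in the image" ;;
    137) echo "killed by SIGKILL (128+9); check for OOMKilled or a forced kill" ;;
    139) echo "segmentation fault (128+11); check native libraries and architecture" ;;
    143) echo "terminated by SIGTERM (128+15); check probes, evictions, and rollouts" ;;
    *)   echo "application-specific code $code; consult the application's documentation" ;;
  esac
}
```

For example, explain_exit_code 127 points at the command or args override rather than at application logic.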

2. The image starts, but the wrong process is running

Symptoms: The container exits cleanly, prints a usage message, or terminates after doing a short task. Sometimes restart count rises without obvious application errors.

Check:

Compare the image's intended entrypoint with the deployment's command and args.
Confirm shell syntax if you use sh -c; quoting mistakes are easy to miss.
Make sure the container is running a long-lived server process, not a migration script or debug command.
Check whether the image tag changed recently and brought a different default entrypoint.

Likely fixes: Remove unnecessary command overrides, pin the correct image tag, or split jobs and services into separate workload types.
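To compare the deployment's command and args with what the image expects, it helps to print only those fields instead of reading the full pod YAML. A minimal sketch using kubectl's JSONPath output follows. The function name show_container_commands is an assumption made for illustration.

```shell
# show_container_commands NAMESPACE POD
# Hypothetical helper: prints each container's name, image, and any command or
# args override from the pod spec, one container per line (tab-separated).
# Empty command/args columns mean the image's own ENTRYPOINT and CMD are used.
show_container_commands() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\t"}{.command}{"\t"}{.args}{"\n"}{end}'
}
```

If the command column shows a migration script or a one-off debug command, the restart loop is probably a workload definition problem, not an application fault.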

3. Probes are killing a healthy-but-slow startup

Symptoms: Events mention liveness or startup probe failures. Logs may show the app starting normally but not becoming ready before probes fail. This is common after adding heavier initialization, remote config loading, or larger JVM/.NET startup times.

Check:

Review liveness, readiness, and startup probe settings together, not in isolation.
Confirm the probe path, port, scheme, and command are correct.
Verify the app binds to the expected interface and port before the liveness threshold is reached.
Look for dependencies in the health endpoint. A liveness check that requires database access can create unnecessary restarts.
Use a startup probe when initialization is genuinely slow.

What to look for in describe pod: repeated Unhealthy events, probe timeout messages, or restarts immediately after failed checks.

Likely fixes: Increase initialDelaySeconds or failureThreshold, add a startup probe, simplify health endpoints, and separate readiness from liveness semantics.
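When sizing a startup probe, the useful number is the time budget it grants before the kubelet restarts the container, which is roughly failureThreshold multiplied by periodSeconds. The arithmetic is sketched below. It ignores initialDelaySeconds and timeoutSeconds for simplicity, and the function names are hypothetical.

```shell
# startup_budget_seconds FAILURE_THRESHOLD PERIOD_SECONDS
# Approximate time a startup probe allows before the container is restarted.
startup_budget_seconds() {
  echo $(( $1 * $2 ))
}

# failure_threshold_for STARTUP_SECONDS PERIOD_SECONDS
# Smallest failureThreshold whose budget covers the observed startup time
# (integer ceiling of STARTUP_SECONDS / PERIOD_SECONDS).
failure_threshold_for() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

startup_budget_seconds 30 10    # prints 300
failure_threshold_for 95 10     # prints 10
```

Measure the slowest realistic startup first, then add headroom. A threshold that only just covers the average startup will still restart the container on a slow node.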

4. The container is being OOMKilled

Symptoms: Pod status or describe output shows OOMKilled, typically with exit code 137. Logs may end abruptly. Restarts may begin after a release, traffic spike, or configuration change.

Check:

Inspect memory requests and limits.
Compare recent memory usage trends if you have metrics available.
Review app-level memory settings such as JVM heap or worker concurrency.
Check whether sidecars increased total pod memory usage.
Confirm node-level memory pressure is not affecting multiple workloads.

Likely fixes: Raise memory limits carefully, tune runtime memory settings, reduce startup spikes, or right-size sidecars. If the app leaks memory, the Kubernetes symptom will return until the application issue is fixed.
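Memory requests, limits, and current usage can be pulled per container, which makes undersized sidecars visible alongside the main application. This is a sketch with hypothetical function names. The second helper assumes metrics-server or an equivalent metrics API is installed in the cluster.

```shell
# show_memory_settings NAMESPACE POD
# Hypothetical helper: prints name, memory request, and memory limit for every
# container in the pod, sidecars included (tab-separated).
show_memory_settings() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# show_memory_usage NAMESPACE POD
# Per-container CPU and memory usage. Requires a metrics API in the cluster.
show_memory_usage() {
  kubectl top pod "$2" -n "$1" --containers
}
```

An empty limit column means the container has no memory limit of its own, so look at namespace defaults and node-level pressure instead.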

5. The app depends on something that is unavailable

Symptoms: Logs show connection failures to a database, message broker, internal API, DNS name, object store, or identity provider. The application exits because a required dependency is not reachable at startup.

Check:

Verify service names, ports, namespaces, and DNS assumptions.
Confirm NetworkPolicies are not blocking egress or service-to-service traffic.
Check whether the dependency is healthy and available from the cluster.
Validate TLS certificates, CA bundles, and hostname matching if secure connections are involved.
Review startup behavior: does the app retry, or does it exit on the first failure?

Likely fixes: Correct endpoint configuration, restore network paths, improve retry logic, or decouple hard dependency checks from process startup where appropriate.
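If the application exits on the first failed connection, a short dependency blip becomes a restart loop. One option is to retry the dependency check with exponential backoff before giving up. The sketch below is a generic wrapper for a container entrypoint script, not a feature of Kubernetes. The function name retry_with_backoff and its argument order are assumptions.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after every failed attempt. Returns 0 on success, 1 on giving up.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A call such as retry_with_backoff 6 2 <your-dependency-check> waits 2, 4, 8, 16, and 32 seconds between six attempts. Keep the total below any startup probe budget, otherwise the probe will restart the container while it is still waiting.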

6. ConfigMap or Secret data is wrong, missing, or stale

Symptoms: The pod starts failing after a config rollout, secret rotation, or environment change. Logs show missing keys, parse errors, authentication failures, or invalid file content.

Check:

Confirm the referenced ConfigMap and Secret names are correct.
Check key names and mounted file paths.
Inspect whether the deployment expects one format but the mounted content provides another.
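Verify base64 encoding and line endings for Secret values. A trailing newline added during encoding is easy to miss.
Make sure the workload actually picked up the new config version. Environment variables sourced from a ConfigMap or Secret are read only at container start, and volumes mounted with subPath do not receive updates, so a restart or re-roll may be required.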

Likely fixes: Restore valid key names, align file paths with app expectations, re-roll the workload after config changes, and document required config schema clearly.
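The base64 and line-ending check is easy to demonstrate. The sketch below shows how a trailing newline sneaks into an encoded value and how to detect one afterwards. It assumes a base64 tool that accepts -d for decoding (as GNU coreutils does). The helper name ends_with_newline is hypothetical.

```shell
# echo appends "\n" to the value before encoding; printf '%s' does not.
echo 'secret' | base64            # c2VjcmV0Cg==  (encodes "secret\n")
printf '%s' 'secret' | base64     # c2VjcmV0      (encodes "secret")

# ends_with_newline
# Reads a base64 string on stdin and succeeds if the decoded value ends with
# a newline character (byte 0a).
ends_with_newline() {
  last_byte=$(base64 -d | tail -c 1 | od -An -tx1 | tr -d ' \n')
  [ "$last_byte" = "0a" ]
}
```

To check a live value, pipe kubectl get secret <name> -n <ns> -o jsonpath='{.data.<key>}' into ends_with_newline. Whether a trailing newline matters depends on how the application reads the value, so confirm against its parsing behavior before changing the Secret.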

7. Volume or filesystem issues prevent startup

Symptoms: Logs mention permission denied, missing file, read-only filesystem, failed mount, or inability to write temp files. Some containers crash before app logs appear.

Check:

Review volume mounts, mount paths, and subPath usage.
Verify file permissions, user IDs, and security context settings.
Confirm the app can write to expected directories if it does not run as root.
Check whether a projected secret or config volume replaced a directory the image expected to exist.
Look for storage attach or mount errors in pod events.

Likely fixes: Correct mount paths, update security context, pre-create writable directories in the image, or avoid masking required filesystem paths with volumes.

8. Image pull and startup assumptions changed together

Symptoms: A rollout using a new tag coincides with restart loops. The image pulls successfully, but behavior changed inside the container.

Check:

Compare the working and failing image digests, not just tags.
Review base image changes, runtime versions, and bundled shell/tools.
Check whether certificates, timezone data, or OS libraries changed.
Confirm the image still exposes the same port and startup command.

Likely fixes: Roll back to a known-good digest, pin image versions more tightly, and treat image changes as operational changes that deserve release notes.
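Comparing digests instead of tags is a one-line query against the pod status. This is a sketch with a hypothetical function name. It reads the image reference and image ID that the container runtime reported for each container.

```shell
# show_running_images NAMESPACE POD
# Hypothetical helper: prints, per container, the image reference reported by
# the runtime (.image) and the resolved image ID (.imageID), which normally
# includes the sha256 digest.
show_running_images() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.image}{"\t"}{.imageID}{"\n"}{end}'
}
```

Run it against one pod from the last good revision and one failing pod. Identical tags with different digests suggest the tag was re-pushed between rollouts.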

9. The issue only happens on one node or subset of nodes

Symptoms: Some replicas are healthy while others are in restart loops. The failing pods may be concentrated on a single node pool, architecture, or availability zone.

Check:

List pod placement and compare failing nodes.
Check node conditions, runtime health, and architecture compatibility.
Review taints, tolerations, and affinity rules.
Confirm the image supports the node architecture.
Look for node-local dependencies such as mounted paths, local DNS issues, or CNI problems.

Likely fixes: Drain or repair unhealthy nodes, correct scheduling constraints, or publish multi-architecture images if needed.
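To see whether restarts cluster on particular nodes, sum restart counts per node. The sketch below is a small filter that reads lines of the form "POD NODE RESTARTS" on stdin. The function name restarts_by_node is hypothetical, and the kubectl command in the note after the block only counts the first container of each pod.

```shell
# restarts_by_node
# Reads "POD NODE RESTARTS" lines on stdin and prints the total restart count
# per node, highest count first. Lines with fewer than three fields are skipped.
restarts_by_node() {
  awk 'NF >= 3 { total[$2] += $3 } END { for (n in total) print total[n], n }' \
    | sort -k1,1nr -k2,2
}
```

One way to feed it is kubectl get pods -n <ns> --no-headers -o 'custom-columns=POD:.metadata.name,NODE:.spec.nodeName,RESTARTS:.status.containerStatuses[0].restartCount' | restarts_by_node. A single node that dominates the output is a strong hint to check node conditions before touching the workload.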

10. The workload type is wrong for the job

Symptoms: A Deployment keeps restarting a container that is supposed to run once and exit. Teams sometimes mistake successful completion for failure because Kubernetes keeps enforcing a long-running replica model.

Check:

Is the process meant to complete and stop?
Should this be a Job or CronJob instead of a Deployment?
Does the process daemonize incorrectly or exit after initialization?

Likely fixes: Use the correct workload controller and make the container lifecycle match the Kubernetes object managing it.

What to double-check

After the first diagnosis, pause before changing multiple things at once. These are the details that most often turn a quick fix into a longer outage.

Confirm the real termination reason

Do not rely only on the STATUS column. CrashLoopBackOff is the waiting reason of the container between restarts, not the reason it stopped. Read the container state fields and events: the last terminated state may report Error, Completed, or OOMKilled, and events may show repeated probe failures. Your next step depends on that difference.
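The container state fields can be read directly with a JSONPath query, which avoids scanning the full describe output. This is a minimal sketch and the function name show_termination_reason is an assumption.

```shell
# show_termination_reason NAMESPACE POD
# Hypothetical helper: prints, per container (tab-separated): name, restart
# count, current waiting reason, and the reason and exit code of the last
# terminated instance.
show_termination_reason() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

As an illustration, a waiting reason of CrashLoopBackOff combined with a last terminated reason of OOMKilled points to the memory scenario, while Completed with exit code 0 points to a process that was never meant to run continuously.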

Compare current and previous logs

When investigating CrashLoopBackOff with kubectl, --previous logs are often more useful than the current stream. If a container restarts quickly, the current logs may only show startup noise.

Map the failure to the last change

Ask what changed in the last deployment window:

New image or base image
New config or secret value
Probe changes
Resource request or limit changes
Dependency migration or endpoint changes
Node pool or cluster version changes

Even if the failure appears application-level, a recent platform or configuration change often explains why it surfaced now.

Check one replica in detail before editing the whole deployment

If you have multiple replicas, compare a healthy pod with a failing pod. Differences in node placement, environment injection, mounted volumes, or startup timing can reveal what a bulk log review hides.
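A direct diff of a healthy and a failing pod is a quick way to surface those differences. The sketch below is a hypothetical helper that assumes both pods live in the same namespace. Expect noise from names, UIDs, and timestamps.

```shell
# diff_pod_specs NAMESPACE HEALTHY_POD FAILING_POD
# Hypothetical helper: prints a unified diff of the two pods' full YAML.
# Returns 0 if identical and 1 if differences were found, as diff does.
diff_pod_specs() {
  tmp=$(mktemp -d) || return 1
  kubectl get pod "$2" -n "$1" -o yaml > "$tmp/healthy.yaml"
  kubectl get pod "$3" -n "$1" -o yaml > "$tmp/failing.yaml"
  status=0
  diff -u "$tmp/healthy.yaml" "$tmp/failing.yaml" || status=$?
  rm -rf "$tmp"
  return "$status"
}
```

Focus on nodeName, env, volumes, and imageID in the output. Those fields account for most of the differences between replicas of the same workload that actually matter.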

Use ephemeral debugging carefully

If your cluster setup allows it, ephemeral containers or temporary debug pods can help inspect DNS, network paths, mounted data, and service reachability. Use them to gather evidence, not to patch around a broken spec that will fail again on the next rollout.

Inspect rollout mechanics

A restart loop during a release is not always a Kubernetes bug. Sometimes a bad revision is progressing exactly as configured. Review ReplicaSet revisions, rollout history, and whether a canary or progressive delivery step isolated the blast radius. If your failures started in CI/CD, it can help to pair this checklist with a broader deployment pipeline review, such as the CI/CD Pipeline Failure Troubleshooting Guide by Error Pattern.
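Reviewing revisions and rolling back can be done with the kubectl rollout subcommands. The sketch below wraps them in two hypothetical helpers and assumes the workload is a Deployment. Other controllers use a different resource prefix.

```shell
# rollout_overview NAMESPACE DEPLOYMENT
# Hypothetical helper: shows the revision history of the deployment and the
# ReplicaSets in the namespace, including their images.
rollout_overview() {
  kubectl rollout history "deployment/$2" -n "$1"
  kubectl get replicasets -n "$1" -o wide
}

# rollback_to NAMESPACE DEPLOYMENT REVISION
# Rolls the deployment back to a specific revision from the history.
rollback_to() {
  kubectl rollout undo "deployment/$2" -n "$1" --to-revision="$3"
}
```

After a rollback, kubectl rollout status deployment/<name> -n <ns> confirms whether the previous revision actually becomes available. Collect logs and events from the failing revision first, since its pods disappear once the rollback completes.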

Common mistakes

This section helps you avoid the habits that make debugging a restart loop take longer than it needs to.

Changing probes before reading logs

Probe settings are easy to blame, but many restart loops come from application startup errors that probes only expose. Read logs first, then adjust probes if timing is truly the issue.

Using `latest` or loosely pinned image tags

If the image changed without a clear digest trail, root cause analysis becomes slower. Pin known-good versions and record what changed between revisions.

Treating readiness and liveness as the same thing

Readiness decides whether traffic should be sent. Liveness decides whether the container should be restarted. If both hit the same deep dependency path, you can turn a temporary dependency outage into a hard restart loop.

Ignoring resource settings for sidecars

Service mesh proxies, log shippers, and security agents consume memory and CPU too. Teams often size the main app container and forget the rest of the pod.

Assuming a successful local run means the pod spec is correct

An app may work in Docker locally but fail in-cluster because of different DNS, injected environment variables, filesystem permissions, or service account behavior.

Fixing symptoms manually inside a live container

If you edit files or run commands interactively to make a failing pod behave, you may prove a theory but not solve the operational problem. Capture the evidence and put the fix into the image, manifest, or configuration source of truth.

Skipping basic dependency checks

Before deep code-level debugging, confirm the app can actually reach what it needs. DNS resolution, certificates, endpoint names, and policy changes break more startups than many teams expect.

No documented restart-loop runbook

Crash loops recur. If your team solves them from memory each time, handoffs stay brittle. Turn this checklist into a team runbook with cluster-specific commands, dashboards, and escalation points.

When to revisit

This checklist is most useful when it stays current with how your platform actually works. Revisit and update it whenever your Kubernetes workflows change.

Before seasonal planning cycles: review common restart causes from recent incidents, update runbooks, and identify recurring patterns worth automating.
When workflows or tools change: update commands, observability links, and debugging steps if you adopt new runtimes, service meshes, sidecars, deployment tooling, or policy engines.
After major probe or resource policy changes: capture new defaults for startup probes, memory limits, and health check behavior.
After cluster version upgrades: revisit assumptions about runtime behavior, deprecations, and node architecture.
After a real incident: add the exact signal that would have made diagnosis faster next time.

To make this practical, keep a short team-owned CrashLoopBackOff page with:

Standard evidence-collection commands
Links to dashboards and logs
Known-good probe patterns for your stacks
Resource sizing notes by service type
Dependency ownership and escalation paths
Rollback criteria and rollout verification steps

The final action item is simple: the next time a pod enters CrashLoopBackOff, do not start by editing YAML. Start by identifying whether the process is exiting on its own, being killed by probes, or being terminated by resource pressure. Once you know which layer is failing, the path to a stable fix is usually much shorter.
