# /k8s-debug (Stable)

Systematic Kubernetes debugging workflow. Diagnoses pod failures, networking issues, and resource constraints.

DevOps · Kubernetes · 2 min read

Quick import: Download the .md file and save it to .claude/commands/ (Claude Code), .cursorrules (Cursor), or paste as a system prompt in ChatGPT, Gemini, or any LLM API.

# What it does

The /k8s-debug skill provides a systematic debugging workflow for Kubernetes clusters. It diagnoses pod failures, CrashLoopBackOff errors, networking issues, resource exhaustion, and configuration problems by following a structured investigation path.

# How to use

```bash
/k8s-debug
/k8s-debug pod my-app-7d4b8c6f5-x2k9p
/k8s-debug service my-app --namespace production
```

# Workflow

  1. Triage -- Checks cluster health and node status, and identifies failing resources
  2. Inspect -- Pulls events, logs, and `describe` output for the affected resource
  3. Diagnose -- Classifies the failure (OOM, image pull, config, networking, scheduling)
  4. Resolve -- Applies the fix or provides the exact commands to run
  5. Prevent -- Recommends resource limits, probes, or policies to avoid recurrence
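Assuming `kubectl` access to the cluster, the triage and inspect steps map roughly onto standard kubectl commands (`<pod-name>` and `<namespace>` are placeholders):

```shell
# Triage: node health, plus any pods that are not currently Running
kubectl get nodes
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Inspect: describe output, logs from the previous (crashed) container, and recent events
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
```

`--previous` matters for CrashLoopBackOff: the current container may have no output yet, while the last crashed instance usually logged the failure reason.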

# Common issues it resolves

  • CrashLoopBackOff -- Application startup failures, missing config, port conflicts
  • ImagePullBackOff -- Registry auth, image tag typos, private repo access
  • OOMKilled -- Memory limits too low, memory leaks, JVM heap misconfiguration
  • Pending pods -- Insufficient resources, node affinity rules, PVC binding failures
  • Service unreachable -- Selector mismatches, missing endpoints, network policy blocks
  • Failed scheduling -- Taint/toleration mismatches, resource requests exceeding capacity
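For the service-unreachable case, for instance, a selector mismatch shows up as an empty Endpoints object. A quick manual check looks like this (the service name `my-app` and namespace `production` are taken from the usage example above):

```shell
# An empty ENDPOINTS column means no pods match the service's selector
kubectl get endpoints my-app -n production

# Compare the service's selector against the labels actually on the pods
kubectl get svc my-app -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels
```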

# Example

```bash
> /k8s-debug pod api-server-6b8f9c4d7-m3k2j

# Inspecting pod...
Status: CrashLoopBackOff (restarted 7 times)
Last exit code: 137 (OOMKilled)

# Resource analysis:
Memory limit:    256Mi
Memory at crash: 254Mi (99% of limit)
Typical usage:   180-240Mi (spikes during request bursts)

# Root cause:
Memory limit too tight for traffic spikes.
Container killed by OOM when concurrent requests exceed ~50.

# Fix applied:
  Updated memory limit: 256Mi -> 512Mi
  Added memory request: 256Mi (for scheduling accuracy)
  Added readiness probe to shed load during startup

# Verification:
  Pod running stable for 5 minutes
  Memory usage: 195Mi / 512Mi (38%)
  No OOM events
```
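A resource fix like the one above can also be applied by hand with `kubectl set resources` (the Deployment name `api-server` and label `app=api-server` are assumptions inferred from the pod name; adjust to your manifests):

```shell
# Raise the memory limit and set a request so the scheduler places the pod accurately
kubectl set resources deployment/api-server \
  --limits=memory=512Mi --requests=memory=256Mi

# Watch the rollout and confirm no further OOM restarts
kubectl rollout status deployment/api-server
kubectl get pods -l app=api-server -w
```

The readiness-probe addition still requires editing the manifest (or using `kubectl patch`), since `kubectl set` does not manage probes.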
Orel Ohayon