# /k8s-debug (Stable)

Systematic Kubernetes debugging workflow. Diagnoses pod failures, networking issues, and resource constraints.

DevOps · Kubernetes · 2 min read

Quick import: Download the .md file and save it to .claude/commands/ (Claude Code), .cursorrules (Cursor), or paste as a system prompt in ChatGPT, Gemini, or any LLM API.

# What it does

The /k8s-debug skill provides a systematic debugging workflow for Kubernetes clusters. It diagnoses pod failures, CrashLoopBackOff errors, networking issues, resource exhaustion, and configuration problems by following a structured investigation path.

# How to use

```bash
/k8s-debug
/k8s-debug pod my-app-7d4b8c6f5-x2k9p
/k8s-debug service my-app --namespace production
```

# Workflow

  1. Triage -- Checks cluster health and node status, and identifies failing resources
  2. Inspect -- Pulls events, logs, and `describe` output for the affected resource
  3. Diagnose -- Classifies the failure (OOM, image pull, config, networking, scheduling)
  4. Resolve -- Applies the fix or provides the exact commands to run
  5. Prevent -- Recommends resource limits, probes, or policies to avoid recurrence
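Assuming `kubectl` access to the cluster, the triage and inspect steps map roughly onto standard kubectl commands (`<pod-name>` and `<namespace>` are placeholders):

```shell
# Triage: node health, plus any pods that are not currently Running
kubectl get nodes
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Inspect: describe output, logs from the previous (crashed) container, and recent events
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
```

`--previous` matters for CrashLoopBackOff: the current container may have no output yet, while the last crashed instance usually logged the failure reason.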

# Common issues it resolves

  • CrashLoopBackOff -- Application startup failures, missing config, port conflicts
  • ImagePullBackOff -- Registry auth, image tag typos, private repo access
  • OOMKilled -- Memory limits too low, memory leaks, JVM heap misconfiguration
  • Pending pods -- Insufficient resources, node affinity rules, PVC binding failures
  • Service unreachable -- Selector mismatches, missing endpoints, network policy blocks
  • Failed scheduling -- Taint/toleration mismatches, resource requests exceeding capacity
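For the service-unreachable case, for instance, a selector mismatch shows up as an empty Endpoints object. A quick manual check looks like this (the service name `my-app` and namespace `production` are taken from the usage example above):

```shell
# An empty ENDPOINTS column means no pods match the service's selector
kubectl get endpoints my-app -n production

# Compare the service's selector against the labels actually on the pods
kubectl get svc my-app -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels
```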

# Example

```bash
> /k8s-debug pod api-server-6b8f9c4d7-m3k2j

# Inspecting pod...
Status: CrashLoopBackOff (restarted 7 times)
Last exit code: 137 (OOMKilled)

# Resource analysis:
Memory limit:    256Mi
Memory at crash: 254Mi (99% of limit)
Typical usage:   180-240Mi (spikes during request bursts)

# Root cause:
Memory limit too tight for traffic spikes.
Container killed by OOM when concurrent requests exceed ~50.

# Fix applied:
  Updated memory limit: 256Mi -> 512Mi
  Added memory request: 256Mi (for scheduling accuracy)
  Added readiness probe to shed load during startup

# Verification:
  Pod running stable for 5 minutes
  Memory usage: 195Mi / 512Mi (38%)
  No OOM events
```
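A resource fix like the one above can also be applied by hand with `kubectl set resources` (the Deployment name `api-server` and label `app=api-server` are assumptions inferred from the pod name; adjust to your manifests):

```shell
# Raise the memory limit and set a request so the scheduler places the pod accurately
kubectl set resources deployment/api-server \
  --limits=memory=512Mi --requests=memory=256Mi

# Watch the rollout and confirm no further OOM restarts
kubectl rollout status deployment/api-server
kubectl get pods -l app=api-server -w
```

The readiness-probe addition still requires editing the manifest (or using `kubectl patch`), since `kubectl set` does not manage probes.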
Orel Ohayon