GitHub - kubetail-org/kstack: Skill pack for Claude Code that helps you monitor and troubleshoot your K8s clusters superintelligently
Skill pack for Claude Code that helps you monitor your K8s clusters superintelligently
English | 简体中文 | 日本語 | 한국어 | Deutsch | Español | Português | Français
Introduction
Kstack is a skill pack for Claude Code that helps you perform monitoring, troubleshooting, and auditing tasks on your K8s clusters in a smart and efficient way. In addition to using standard tools like kubectl, it hands off shell work to tools like Kubetail, Helm, Trivy, and Pluto before sending results to Claude, keeping responses fast and token-efficient. Kstack also detects the services running in your cluster and uses their specialized tooling when necessary (e.g. Cilium, Istio). Once you install kstack, you'll have access to these skills inside Claude Code:
Monitoring
/cluster-status: Health snapshot (pod restarts, node conditions, resource pressure)
/events: Recent events, ranked by severity
Troubleshooting
/investigate: Root-cause analysis across events, logs, and related resources
/logs: Shared tmux session that translates natural language into log fetches and analysis (via Kubetail)
/metrics: Fetch CPU, memory, and other resource metrics for pods, nodes, and workloads
/exec: Shared tmux shell into a pod, node, or ephemeral debug container
Audits
/audit-security: RBAC, pod security posture, privilege tightening
/audit-network: NetworkPolicy, Service, Ingress, Gateway API, DNS, and encryption checks
/audit-cost: Requests vs. usage, over-provisioning, idle capacity
/audit-outdated: Outdated services, known CVEs, available version bumps
Miscellaneous
/cleanup: Remove all kstack-owned resources from the cluster (debug containers, pod clones, watcher jobs)
/forget: Clear kstack's local cache and discard what it learned about your cluster(s)
Our goal is to bring the power of AI to K8s monitoring in a user-friendly and cost-effective way that keeps you in control. If you notice a bug or have a suggestion please create a GitHub Issue or send us an email (hello@kubetail.com)!
Quickstart
To install the kstack skills globally, run this command:
    curl -sS https://kstack.sh/install | bash
Alternatively, you can install them locally inside a specific project directory:
    curl -sS https://kstack.sh/install | bash -s -- --local
Once installed, the skills will be available inside your agent sessions:
    ❯ /kstack-cluster-status
By default, the script installs the skills with a kstack-* namespace prefix, but you can disable this with the --no-prefix flag. It also installs the skills for all of your available agents (e.g. Claude, Codex, OpenCode), but you can target individual agents with the --agent flag instead (see Installation).
Kstack uses your local kubeconfig file for authentication, so it acts with your RBAC permissions when performing actions on your behalf. If it runs into permissions problems, it will let you know.
Other AI Agents
Kstack works with any AI agent that supports skills, not just Claude. The curl bootstrap auto-detects which agent CLIs are on your PATH and installs for each. You can target a specific agent with --agent <name>:
Agent              Flag               Global install path
OpenAI Codex CLI   --agent codex      ~/.codex/skills/
OpenCode           --agent opencode   ~/.config/opencode/skills/
Cursor             --agent cursor     ~/.cursor/skills/
Factory Droid      --agent factory    ~/.factory/skills/
Slate              --agent slate      ~/.slate/skills/
Kiro               --agent kiro       ~/.kiro/skills/
Hermes             --agent hermes     ~/.hermes/skills/
Local installs mirror this structure under the project directory (e.g. <project>/.codex/skills/) and are picked up only when the agent is run from inside that directory.
Skills Reference
Each skill is invoked with /<name> inside an agent session. All skills are read-only by default; any action that mutates cluster state requires explicit confirmation. Skills honor your local kubeconfig context and respect RBAC.
Global flags (supported by every skill):
Flag              Description
--context <ctx>   Override the current kubeconfig context
--namespace <n>   Scope the run to a single namespace (defaults to all accessible)
--json            Emit structured output for piping into other tools
--help            Open the reference documentation for the skill in your browser
Monitoring
/cluster-status
A dense health snapshot of the cluster: node conditions, pod aggregates, and a ranked list of the issues that actually matter.
What it checks: cluster identity (context, Kubernetes version, platform), node Ready/MemoryPressure/DiskPressure/PIDPressure conditions and SchedulingDisabled, control-plane vs. worker split, pod phase and Ready across all namespaces, pods with non-zero restart counts, and a ranked top-issues list (top 5 by severity).
How it works: fans out kubectl version, kubectl get nodes -o json, and kubectl get pods -A -o json in parallel, writing each to a per-context cache (cluster.json, nodes.json, pods.json). Aggregation and severity ranking happen client-side. Follow-up questions ("list pods", "pods on ", "which nodes are tainted") are answered by reading the cache with jq rather than re-invoking the skill.
Options:
--refresh: fetch most recent data, bypassing and refreshing the cache (default: false)
--ttl <duration>: only update the cache if older than <duration> (default: 15m)
Reference: kstack.sh/reference/skills/cluster-status
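The client-side aggregation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not kstack's actual code: it assumes the cached pods.json holds the output of kubectl get pods -A -o json, and the function name summarize_pods and the sample data are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
SAMPLE_PODS_JSON = """
{"items": [
  {"metadata": {"namespace": "payments", "name": "checkout-7c9"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "app", "restartCount": 4, "ready": false}]}},
  {"metadata": {"namespace": "payments", "name": "api-5d4"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "app", "restartCount": 0, "ready": true}]}},
  {"metadata": {"namespace": "batch", "name": "report-x1"},
   "status": {"phase": "Pending"}}
]}
"""


def summarize_pods(pod_list):
    """Count pods per phase and list pods with restarts, most restarts first."""
    phases = Counter()
    restarted = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phases[status.get("phase", "Unknown")] += 1
        # Sum restarts across containers; Pending pods may have no statuses yet.
        restarts = sum(c.get("restartCount", 0)
                       for c in status.get("containerStatuses", []))
        if restarts > 0:
            meta = pod["metadata"]
            restarted.append((f'{meta["namespace"]}/{meta["name"]}', restarts))
    restarted.sort(key=lambda item: item[1], reverse=True)
    return {"phases": dict(phases), "restarted": restarted}


summary = summarize_pods(json.loads(SAMPLE_PODS_JSON))
print(summary)
```

A real severity ranking would weigh more signals (node conditions, readiness, pod age); restart count alone is used here only to keep the sketch short.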
/events
Recent cluster events, grouped by reason and ranked by severity so the signal isn't drowned in Pulled/Created/Started noise.
What it checks: Warning events across all namespaces, grouped by (reason, involvedObject.kind, namespace), plus notable Normal events (Killing, Preempting, NodeNotReady, Rebooted, FailedScheduling), with chatty reasons (Pulled, Created, Started, Scheduled, SuccessfulCreate) collapsed into a tail line. Each group includes count, first/last timestamp, the most recent message, and the involved objects.
How it works: a single kubectl get events --all-namespaces call (against events.k8s.io/v1, sorted server-side by lastTimestamp), written to a per-context cache as events.json. Aggregation and ranking happen client-side. Follow-ups ("only payments", "events on pod/checkout-7c9", "show suppressed") are answered by reading the cache with jq, walking owners one level up (Pod → ReplicaSet → Deployment) so controller-fired events aren't missed.
Options:
--refresh: fetch most recent data, bypassing and refreshing the cache (default: false)
--ttl <duration>: only update the cache if older than <duration> (default: 5m)
Reference: kstack.sh/reference/skills/events
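The grouping step described above can be illustrated with a short Python sketch. It is an assumption-laden example rather than kstack's implementation: the events are hypothetical, use the core/v1 field names (involvedObject, firstTimestamp, lastTimestamp), and are ranked by record count only. The name group_warnings is made up for this example.

```python
# Hypothetical events shaped like items from `kubectl get events -A -o json`.
SAMPLE_EVENTS = [
    {"type": "Warning", "reason": "BackOff",
     "involvedObject": {"kind": "Pod", "namespace": "payments", "name": "checkout-7c9"},
     "firstTimestamp": "2024-05-01T10:00:00Z", "lastTimestamp": "2024-05-01T10:05:00Z",
     "message": "Back-off restarting failed container app"},
    {"type": "Warning", "reason": "BackOff",
     "involvedObject": {"kind": "Pod", "namespace": "payments", "name": "checkout-8d1"},
     "firstTimestamp": "2024-05-01T10:02:00Z", "lastTimestamp": "2024-05-01T10:07:00Z",
     "message": "Back-off restarting failed container sidecar"},
    {"type": "Warning", "reason": "FailedMount",
     "involvedObject": {"kind": "Pod", "namespace": "batch", "name": "report-x1"},
     "firstTimestamp": "2024-05-01T09:58:00Z", "lastTimestamp": "2024-05-01T09:58:00Z",
     "message": "MountVolume.SetUp failed for volume data"},
    {"type": "Normal", "reason": "Pulled",
     "involvedObject": {"kind": "Pod", "namespace": "payments", "name": "api-5d4"},
     "firstTimestamp": "2024-05-01T10:01:00Z", "lastTimestamp": "2024-05-01T10:01:00Z",
     "message": "Container image already present on machine"},
]


def group_warnings(events):
    """Group Warning events by (reason, kind, namespace), largest group first."""
    groups = {}
    for ev in events:
        if ev.get("type") != "Warning":
            continue
        obj = ev["involvedObject"]
        key = (ev["reason"], obj["kind"], obj["namespace"])
        group = groups.setdefault(key, {
            "count": 0,
            "first": ev["firstTimestamp"],
            "last": ev["lastTimestamp"],
            "message": ev["message"],
            "objects": set(),
        })
        group["count"] += 1
        group["objects"].add(obj["name"])
        # UTC ISO-8601 timestamps in the same format compare correctly as strings.
        group["first"] = min(group["first"], ev["firstTimestamp"])
        if ev["lastTimestamp"] >= group["last"]:
            group["last"] = ev["lastTimestamp"]
            group["message"] = ev["message"]
    return sorted(groups.items(), key=lambda kv: kv[1]["count"], reverse=True)


ranked = group_warnings(SAMPLE_EVENTS)
for key, group in ranked:
    print(key, group["count"], group["last"], group["message"])
```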
Troubleshooting
/investigate
Kick off a root-cause investigation on a failing or suspicious resource. When the skill is invoked, it runs a script to gather an initial data bundle and briefs the agent. From there, you can ask follow-up questions in natural language and the agent decides whether to answer from what it has, fetch something new, or reach for another tool.
What it gathers: spec and status of the problematic resources; events on those resources and their owners (a Pod's ReplicaSet and Deployment, a Job's CronJob, etc.); logs from current and previous containers, truncated to the lines most likely to contain the failure; obvious related resources (backing Service, mounted ConfigMap/Secret names, bound PVCs, referenced ServiceAccount); and the node the pods are scheduled on when relevant.
How it works: the skill loads the bundle from the Kubernetes API and briefs the agent on how to read it (exit codes, event reasons, common state combinations), when follow-ups should re-fetch rather than reason from the stale bundle, and when to hand off to /logs, /exec, or /metrics.
Arguments:
<target> — <kind>/<name> (e.g. pod/checkout-7c9) or natural language (the api deployment, why is checkout crashing). Optional — the skill will prompt if omitted.
Options: none. Scope logs, time windows, or resources via natural language in the prompt or follow-ups.
Reference: kstack.sh/reference/skills/investigate
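Gathering events on a resource's owners implies walking ownerReferences upward (Pod to ReplicaSet to Deployment, Job to CronJob). The Python sketch below shows one way to do that walk. It is an illustration only: the in-memory RESOURCES index, the resource names, and the owner_chain helper are hypothetical, and kstack's own script may resolve owners differently.

```python
# Hypothetical index keyed by (kind, name); each value is shaped like the
# `metadata` of a Kubernetes object in a single namespace.
RESOURCES = {
    ("Pod", "checkout-7c9"): {
        "ownerReferences": [
            {"kind": "ReplicaSet", "name": "checkout-7c", "controller": True}
        ]
    },
    ("ReplicaSet", "checkout-7c"): {
        "ownerReferences": [
            {"kind": "Deployment", "name": "checkout", "controller": True}
        ]
    },
    ("Deployment", "checkout"): {},
}


def owner_chain(kind, name, resources, max_depth=5):
    """Return [(kind, name), ...] from the resource up to its top-level owner."""
    chain = [(kind, name)]
    current = (kind, name)
    for _ in range(max_depth):  # depth cap guards against reference cycles
        refs = resources.get(current, {}).get("ownerReferences", [])
        if not refs:
            break
        # Prefer the managing controller; fall back to the first listed owner.
        ref = next((r for r in refs if r.get("controller")), refs[0])
        current = (ref["kind"], ref["name"])
        chain.append(current)
    return chain


chain = owner_chain("Pod", "checkout-7c9", RESOURCES)
print(" -> ".join(f"{k}/{n}" for k, n in chain))
```

Events can then be collected for every entry in the chain, so a failure reported on the ReplicaSet or Deployment is not missed when the question was asked about the Pod.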
/logs
An AI-powered log fetcher. Describe what you're looking for in natural language and the agent finds the right pods, picks the time window, and builds the grep filter to fetch only the lines that matter. The stream runs inside a tmux window that you and the agent are both attached to.
How it works: the agent translates your description into a Kubetail query, starts a detached tmux session (e.g. kstack-logs-api-server), tries to open a new terminal window attached to it, and prints the tmux attach command in chat as a fallback. You and the agent share the same pane: you can scroll, search, or watch the live tail, while the agent reads conservatively to save tokens.
Requirements: tmux on the agent's $PATH, and Kubetail installed in the cluster (the skill offers to install it via Helm if missing).
Arguments:
<target> — natural-language description of what to fetch (api, errors from the last hour on api, checkout for "timeout" in last 15m). Optional — the skill will prompt if omitted.
Options:
--attach: attach the agent to an existing kstack tmux session instead of starting a new one
--detach: start a new session detached (no terminal window opened, attach manually)
Reference: kstack.sh/reference/skills/logs
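The detached-session pattern described above can be sketched as follows. This only builds the command lines and does not run them. The helper name tmux_commands is hypothetical, and kubectl logs stands in for the Kubetail query the skill would actually generate, since the exact Kubetail invocation is not shown in this document.

```python
import shlex


def tmux_commands(session, command):
    """Return (start_argv, attach_hint) for running `command` in a detached tmux session."""
    # `tmux new-session -d -s NAME CMD` starts CMD in a detached session called NAME.
    start = ["tmux", "new-session", "-d", "-s", session, shlex.join(command)]
    # Fallback hint for the user if no terminal window could be opened.
    attach = f"tmux attach -t {shlex.quote(session)}"
    return start, attach


# Stand-in log command (assumption): the real skill would use a Kubetail query.
start, attach = tmux_commands(
    "kstack-logs-api-server",
    ["kubectl", "logs", "-f", "deployment/api-server"],
)
print(shlex.join(start))
print(attach)
```

Passing start to subprocess.run would launch the session on a machine with tmux installed; printing the attach hint mirrors the chat fallback the skill describes.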
/metrics
An AI-powered metrics fetcher. Describe what you want to see and the agent resolves the right target, picks a sensible time window, and returns a compact summary. Read-only and never mutates cluster state.
How it works: the agent translates your description into a query against whichever source fits (metrics-server or Prometheus), reports summary statistics (p50, p95, max) rather than piping the full series through the model, and shows the resolved query before running it when the scope looks broader than intended. For why a metric moved, it hands off to /logs; for root-cause context, /investigate; for a full right-sizing sweep, /audit-cost.
Arguments:
<target> — natural-language description (api, memory on checkout last 1h, top pods by cpu in payments). Optional — the skill will prompt if omitted.
Options: none. Scope the target, metric, and time window via natural language in the prompt or follow-ups.
Reference: kstack.sh/reference/skills/metrics
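Reducing a series to p50, p95, and max before it reaches the model can be sketched in a few lines of Python. This is a minimal example assuming a nearest-rank percentile definition. The skill may compute percentiles differently (for instance via a Prometheus quantile function), and the sample values and function names are hypothetical.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list (pct in 1..100)."""
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def summarize(samples):
    """Collapse a raw series into the three statistics named above."""
    values = sorted(samples)
    return {
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "max": values[-1],
    }


# Hypothetical CPU samples in millicores for one pod.
cpu_millicores = [120, 135, 110, 480, 125, 130, 128, 122, 140, 131]
stats = summarize(cpu_millicores)
print(stats)  # {'p50': 128, 'p95': 480, 'max': 480}
```

Three numbers replace the whole series, which is where the token savings come from; the single 480m spike still shows up in p95 and max.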
/exec
An AI-powered version of kubectl exec. Describe the target in natural language and the agent picks the right mechanism: a normal exec into a running container, an ephemeral debug container when the target has no usable shell, or a privileged shell on a node. The session runs inside a tmux window that you and the agent are both attached to, so either of you can type and both see the output.
How it works: the agent starts a detached tmux session (e.g. kstack-exec-api-server), tries to open a new terminal window attached to it, and prints the tmux attach command in chat as a fallback. The agent reads from the pane conservatively to save tokens. Tell it to tear down and it kills the tmux session and deletes any pod it created.
Requirements: tmux on the agent's $PATH.
Safety: /exec ships with disable-model-invocation: true, so the agent never starts a shell on its own. It runs only when you deliberately type /exec, a precaution given the privileged modes above.
Arguments:
<target> — natural-language description (api, api/sidecar, node worker-3, debug api). Optional — the skill will prompt if omitted.
Options:
--image <image>: image to use for node and debug-container modes (default netshoot)
--attach: attach the agent to an existing kstack tmux session instead of starting a new one
--detach: start a new session detached (no terminal window opened, attach manually)
Reference: kstack.sh/reference/skills/exec
Audits
All audit skills produce a ranked findings list (severity + evidence + suggested fix).