the_campfire

You think Claude is using your skills but it's mostly pretending

Custom skills, context windows, and the design patterns that make the difference between ‘works sometimes’ and ‘works every time.’

by Kem · the_campfire · hacker culture for the AI generation. One place for the researchers, engineers, hackers, founders, creators, rookies and hustlers. Applications for founding members are open →

There's an uncomfortable thing that happens when you build custom skills for Claude Code. You write a checklist, some brand guidelines, a workflow, whatever you need it to follow consistently. You test it. The output looks right. You move on.

Except there's a reasonable chance Claude never actually used your skill. It saw your message, matched it against training data, and produced something close enough that you couldn't tell the difference. Your skill sat there, loaded but ignored. This happens far more often than people realise, and there's a structural reason for it.

This was billed as a masterclass on slash commands, which is a bit like calling a talk on architecture "a masterclass on hammers." A slash command is just a trigger. The skill is the system behind it: the instructions, files, and configuration that control what happens when it fires. But even skills aren't quite the point here.

The point is system design. How you structure instructions, state and execution so Claude produces the same result every time, not just when conditions happen to be right. Skills are a useful lens for this because they sit exactly on the fault line between "Claude works it out" and "you tell Claude what to do." That's where most setups quietly break, and where the interesting decisions live.

Along the way I'll walk through kem-sec, a tool I built for my personal use that runs 100+ automated checks, multi-step workflows, and agents coordinating through files on disk rather than through a context window that degrades and forgets. Not because you need something that complex, but because it shows what's possible when you push these ideas to their limit. The same design thinking that fixes unreliable skills also opens the door to things like persistent memory and self-learning protocols, and none of it requires waiting for a new model.

01

The problem

When you start a Claude Code session, the model loads a stack of context before you've typed a single word.

What loads into the context window:

System prompt · ~4,200 tokens
Tool definitions (Read, Write, Glob, Grep, Bash, Edit + any MCP server tools) · ~14,000–17,000+
Memory + environment · ~1,000
Your skill descriptions · 250 chars each, within a 1% budget
CLAUDE.md files · ~2,000
Conversation history · grows with every message

Notice how small your skill descriptions are compared to the tool definitions above them.

The system prompt and tool definitions arrive first and take up the most space. Your skill descriptions sit below all of that, compressed to 250 characters each within a budget of 1% of the total context window. And they're not even the real instructions: the actual skill content, your files, checklists and guidelines, only loads when the skill is properly invoked. Everything after them, like your conversation history, grows with every message.

Researchers at Stanford, MIT and Google have all studied how language models handle information spread across long inputs. The findings are consistent: these models pay the most attention to what comes first and what comes last. Everything in the middle gets less attention. The longer the input, the worse it gets. Stanford found that moving an answer from the start to the middle of a document drops accuracy by around 20%. MIT identified what they call "attention sinks" where the very first tokens in any input attract a disproportionate share of attention regardless of content. Google confirmed this as a systematic U-shaped curve, not random noise, and showed that correcting for it improved performance by 15 percentage points. A 2025 consortium study took it further, confirming that simply making the input longer, even when the model can find everything, degrades performance by up to 85%.

Research group · Finding

Stanford (Liu et al. 2023) · Moving an answer from the start to the middle of a document drops accuracy by around 20%
MIT (Xiao et al. 2023) · The very first tokens in any input attract a disproportionate share of attention regardless of content; they called these "attention sinks"
Google (Hsieh et al. 2024) · Confirmed this as a systematic U-shaped curve, not random noise. Correcting for it improved performance by 15 percentage points
Consortium (Du et al. 2025) · Simply making the input longer, even when the model can find everything, degrades performance by up to 85%

In practical terms, this is exactly what happens during a Claude Code session. Your conversation history sits at the bottom of that stack and grows with every message. As the session runs, your skill descriptions get pushed deeper into the middle of the context window, right into the trough of that attention curve.

The U-shaped attention curve: models pay most attention to the start and end of their input, and everything in the middle gets less. Your skills sit in that middle trough.

The model isn't forgetting your skills. It just pays less attention to where they sit and falls back on training data rather than the instructions you actually wrote. The question is what to do about it.

The research

Liu et al. (2023) "Lost in the Middle" - arxiv 2307.03172 · Xiao et al. (2023) "Attention Sinks" - arxiv 2309.17453 · Hsieh et al. (2024) "Found in the Middle" - arxiv 2406.16008 · Du et al. (2025) "Context Length Alone Hurts" - arxiv 2510.05381

02

What to do about it

There are two ways to respond to this. The first is to work within the system and try to make auto-detection more reliable. The second is to sidestep it entirely.

Improving auto-detection

Anthropic provides a skill-creator plugin for this. Install it inside Claude Code:

claude plugin install skill-creator@claude-plugins-official

Once installed, type /skill-creator in Claude Code. There's no menu or interface to learn. It's a conversation. Tell Claude what you want to build and it walks you through it. Depending on where you are in the process it works in one of four modes:

Mode · What it does

Create · Interviews you about what your skill does and drafts it
Eval · Runs your skill against test prompts, one run with and one without, so you can see if it makes a difference
Improve · A/B tests your old version against a new one
Benchmark · Variance analysis across multiple runs to check consistency

The skill-creator also includes a description optimiser. Ask it to optimise your description and it generates 20 test prompts, runs them against your skill repeatedly and rewrites the 250-character description until it triggers reliably. Anthropic's own guidance here is to make descriptions "pushy" because Claude tends to undertrigger rather than overtrigger.

Built-in commands

Two commands are already built into Claude Code for debugging skill triggers:

/skills lists every skill Claude can see with descriptions as they appear in context.

/context shows a visual map of your context usage and warns you if skills are being excluded for space.

All of this helps and it is worth doing. Better descriptions trigger more often and the eval tools catch problems early. But you're still working within a system where your skill is a 250-character summary competing for attention in the middle of a growing prompt. You've improved the odds but the underlying problem is still there.

Not needing auto-detection

Every skill lives in a folder inside .claude/skills/. The folder name becomes your slash command:

.claude/skills/write-copy/     →  /write-copy
.claude/skills/run-tests/      →  /run-tests
.claude/skills/review-pr/      →  /review-pr

Inside that folder is a file called SKILL.md which is what Claude reads when the skill runs. At the top you can add frontmatter, a block of settings between two --- lines that control how the skill behaves.

---
description: Write marketing copy in our brand voice
disable-model-invocation: true
allowed-tools: Read, Write, Glob
---

Read the brand guidelines at ${CLAUDE_SKILL_DIR}/brand-voice.md
Write copy for whatever the user specifies.
Follow the tone and formatting in that file exactly.

Setting · What it controls

description · The 250-character summary Claude sees in context. This is what auto-detection matches against
disable-model-invocation: true · The skill only runs when you type the command. Claude can't auto-trigger it and the description is removed from context entirely
allowed-tools · Which tools the skill can use without asking permission (Read, Write, Glob, Grep, Bash, Edit etc.)

There are more settings available but these are the three that matter. When you reference a supporting file in your skill you use ${CLAUDE_SKILL_DIR} to point to the skill's own folder. You don't need to remember this syntax as Claude will set it up for you if you ask it to load from a supporting file.

You don't have to build this by hand

Claude can create all of this for you but the difference is in how you ask. If you say "make me a skill for writing copy" you'll get a basic skill with defaults: auto-detection on, no tool restrictions, a generic description. You're back to the same problem.

Instead, be specific about what you want:

Example prompts

"Create a skill called write-copy that only runs when I type the command. Give it access to Read, Write and Glob. It should load brand guidelines from a supporting file called brand-voice.md."

"Set up a skill for running our test suite. Disable auto-detection. Allow it to use Bash and Read. Have it save results to a JSON file."

"I want a code review skill that reads our review checklist from a supporting file, not from the skill description. Make it explicitly invoked only."

The vocabulary matters. "Disable auto-detection" or "only runs when I type the command" tells Claude to set disable-model-invocation: true. "Load from a supporting file" tells it to use file references instead of cramming everything into the SKILL.md itself.

This is the design shift. You stop hoping Claude picks the right skill and start telling it exactly when to run, what to read and where to save. Once you control the invocation, a whole set of patterns opens up that aren't possible when you're relying on auto-detection.

03

What becomes possible

Controlling when a skill runs is the first step but it's not the interesting part. What matters is what becomes possible once you're no longer fighting auto-detection. There are four patterns and each one builds on the last.

Pattern · What it does

Runtime file loading · Instructions loaded from files during execution, not from a 250-char summary
State persistence · Progress saved to disk so it survives beyond the session
Output validation · Results checked against a defined structure before saving
Multi-skill coordination · Skills communicate through shared files, not through the model

1. Load instructions at runtime

When Claude auto-detects a skill, all it has to work with is that 250-character description. Your actual instructions, guidelines and examples are not loaded.

When you invoke a skill explicitly, Claude reads the full SKILL.md and every file it references at the moment of execution, where they get full attention instead of competing with tool definitions or conversation history. This means you keep SKILL.md short and put the real detail in supporting files:

.claude/skills/write-copy/
├── SKILL.md              # 5 lines: what to do and where to find detail
├── brand-voice.md        # tone, personality, do and don't
├── format-rules.md       # headings, CTAs, length
└── examples/
    └── homepage-v2.md    # what good output looks like

SKILL.md says "read the brand guidelines at this path, follow the format rules, here's an example." The depth lives in the supporting files where Claude reads them with full attention.

Example prompt

"Create a skill called write-copy with separate supporting files for brand voice, format rules and an example. Keep the SKILL.md short and have it reference those files."

2. Save state to disk

If your skill does something in multiple steps, where does the progress live? By default it lives in the context window. Claude remembers what it's done because the conversation history contains it.

That works until the conversation gets long and older details lose attention. Run /clear and the history is gone. End the session and everything disappears.

The alternative is to write progress to a file after each step:

// .my-skill/state.json
{
  "status": "in-progress",
  "current_step": 5,
  "total_steps": 12,
  "completed": ["SEC-01", "SEC-02", "SEC-03", "SEC-04"],
  "findings": [...]
}

Tell your skill to check for this file when it starts. If it finds one it picks up where it left off. If not it starts fresh. Your skill now survives /clear, session breaks, even crashes.
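The load-or-resume logic your skill follows is simple enough to sketch in ordinary code. This is a minimal Python illustration of the pattern, not kem-sec's actual implementation; the file path and step names are invented:

```python
import json
from pathlib import Path

STATE_FILE = Path(".my-skill/state.json")  # hypothetical location

def load_state(total_steps: int) -> dict:
    """Resume from a previous run if a state file exists, otherwise start fresh."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"status": "in-progress", "current_step": 0,
            "total_steps": total_steps, "completed": [], "findings": []}

def checkpoint(state: dict) -> None:
    """Write state to disk after every step so a crash or /clear loses nothing."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))

state = load_state(total_steps=12)
for step in range(state["current_step"], state["total_steps"]):
    # ... the actual work for this step would happen here ...
    state["current_step"] = step + 1
    state["completed"].append(f"SEC-{step + 1:02d}")
    checkpoint(state)

state["status"] = "complete"
checkpoint(state)
```

Because the checkpoint happens after every step rather than at the end, re-running the script after an interruption continues from `current_step` instead of starting over.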

Example prompt

"Add state persistence to my skill. It should save progress to a JSON file after each step and pick up where it left off if the file already exists."

3. Validate the output

Language models are confident by default. They will produce output that looks structured and complete even when it's missing fields, contains duplicates or doesn't match the format you asked for.

If your skill produces structured results like JSON reports or checklists, define what the output should look like and check it before saving:

Check · Why

Required fields present · The model might skip fields it considers obvious
No duplicate entries · Parallel runs can produce overlapping results
Values in expected range · "critical" when you defined "high/medium/low"
Valid format · JSON that's actually valid, not markdown that looks like it

This isn't about distrusting Claude. It's about building systems that are reliable regardless of what any single model call produces.
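A validator covering those four checks can be a few lines of Python. This is a sketch only; the field names and severity levels here are assumptions for illustration, not kem-sec's actual schema:

```python
import json

REQUIRED_FIELDS = {"id", "severity", "file", "description"}  # assumed schema
VALID_SEVERITIES = {"high", "medium", "low"}                 # assumed levels

def validate_findings(raw: str) -> list[dict]:
    """Parse and check model output before trusting it. Raises on any problem."""
    findings = json.loads(raw)  # fails fast on markdown that merely looks like JSON
    seen = set()
    for f in findings:
        missing = REQUIRED_FIELDS - f.keys()
        if missing:
            raise ValueError(f"finding {f.get('id', '?')} missing {missing}")
        if f["severity"] not in VALID_SEVERITIES:
            raise ValueError(f"unexpected severity {f['severity']!r}")
        key = (f["file"], f["id"])
        if key in seen:
            raise ValueError(f"duplicate finding {key}")
        seen.add(key)
    return findings

ok = validate_findings('[{"id": "SEC-01", "severity": "high", '
                       '"file": "src/auth.ts", "description": "hardcoded secret"}]')
```

The point is that invalid output raises before anything is saved, so a single bad model call can't poison the state file.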

Example prompt

"Add validation to my skill so it checks the output has all required fields and is valid JSON before saving results."

4. Coordinate through files

When you have multiple skills that work together, the question is how they communicate. One option is to have one skill call another but that means the model needs to decide which skill to invoke next and we've already seen how that goes.

The simpler approach is that each skill reads from and writes to a shared file rather than calling another directly.

/audit    → writes findings to state.json
/fix      → reads findings, applies fixes, updates state.json
/pause    → saves current position
/resume   → reads position, continues
/verdict  → reads completed findings, formats report

Each skill is independent. They don't call each other and they don't need to share a context window. The state file on disk is the only thing connecting them. You can run an audit today, close your session and come back tomorrow to fix the findings. The second skill knows exactly what the first one found because it reads from the same file, not from a conversation that no longer exists.
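In code terms, each command is just an independent read-modify-write against the same file. A minimal sketch (the status values and finding shape are invented for illustration):

```python
import json
from pathlib import Path

STATE = Path("state.json")  # the only thing the commands share

def audit():
    """Writes findings; knows nothing about the other commands."""
    STATE.write_text(json.dumps({
        "findings": [{"id": "SEC-01", "status": "open"}]
    }, indent=2))

def fix():
    """Reads whatever audit left behind, possibly in a different session."""
    state = json.loads(STATE.read_text())
    for finding in state["findings"]:
        finding["status"] = "fixed"   # ... the actual fix would be applied here ...
    STATE.write_text(json.dumps(state, indent=2))

audit()   # today
fix()     # tomorrow, in a fresh session: the file is the only shared memory
```

Neither function calls the other, and neither depends on a live conversation; deleting everything except state.json between the two calls changes nothing.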

Example prompt

"I want three skills that work together: draft, review and publish. They should coordinate through a shared JSON file, not call each other. Each one reads the file, does its work and updates it."

The shift

A simple checklist may only require runtime file loading. Once your skill starts working through multiple steps you'll want state persistence. When it produces structured reports you add validation. And when you need a second skill that builds on the first, coordination through shared files ties them together. The patterns layer naturally as the complexity demands it.

a simple checklist
↓ Runtime file loading · instructions loaded with full attention
↓ State persistence · progress survives across sessions
↓ Output validation · results checked before saving
↓ Multi-skill coordination · skills communicate through shared files
a complete system

Each pattern layers on the last. Add them as the complexity demands it.

These four patterns together change what a skill is. It stops being a prompt hoping Claude picks it up and becomes a system with defined inputs, persistent state, validated outputs and a clear coordination model.

Let me show you what that looks like when you push all four to their limit.

04

kem-sec: a case study

I built a tool called kem-sec that I use to audit my own projects. It runs 148 checks across six categories: security, performance, error handling, database practices, compliance and code quality. I'm walking through it here because it uses every one of the four patterns we just covered and it's a good way to see how they work together in practice.

It installs with a single command:

npx kem-sec install

You can run this in your terminal or just ask Claude to do it for you. npx downloads the package, runs the installer and discards it afterwards. What it leaves behind are the skill files, copied into your Claude Code environment so that the slash commands are available the next time you start a session. The source is on npm if you want to review it before installing.

Here's how each of the four patterns shows up in practice.

Runtime file loading

~/.claude/
├── commands/kem-sec/        # 10 short command files
│   ├── audit.md
│   ├── fix.md
│   ├── pause.md
│   ├── resume.md
│   ├── verdict.md
│   └── ...
│
└── kem-sec/                 # supporting files loaded at runtime
    ├── checklists/          # 6 files, 148 checks total
    ├── references/          # 14 protocol documents
    ├── workflows/           # step-by-step orchestration
    └── templates/           # report formatting

Notice the separation. The commands are short and say what to do and where to find the detail. The supporting files contain the actual logic: what to check, how to validate results, how to format the output and what to do when something fails. Claude reads these files during execution when they have its full attention.

The audit workflow

When you type /kem-sec:audit, the workflow runs 14 steps in four phases:

Phase · Steps · What happens

Setup · 1–4 · Detects your project type, installs a matching CLAUDE.md if needed, determines scope (148 checks) and checks for a previous session to resume
Automated tools · 5 · Runs npm audit, ESLint, gitleaks and other tools that can check things without AI
Expert analysis · 6–11 · Spawns six analysis agents in three parallel pairs, each with a checklist and your source code
Reporting · 12–14 · Deduplicates findings, generates a report, displays the verdict

The expert analysis phase is where it gets interesting. Six separate instances of Claude run the checks, one per category, working in pairs:

Pair 1:  Security (20 checks)  +  Performance (15 checks)
Pair 2:  Error handling (18)   +  Database (25)
Pair 3:  Compliance (30)       +  Code quality (40)

Each pair finishes before the next starts and the workflow is roughly twice as fast as running them one after another. The next two patterns show up in how this workflow handles its data.

Why pairs and not all six at once?

There is a known bug in Claude Code where collecting results from three or more agents running in parallel causes race conditions that result in agents returning blank output or losing their results entirely (#14055). This applies when agents need to return structured data back to the parent process. If your agents are doing independent work like writing directly to files or working in separate worktrees, higher parallelism is fine. But when you need to collect and merge results, two at a time is the safe limit.
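If you ever orchestrate agents yourself rather than through Claude Code, the same two-at-a-time cap is easy to enforce with a semaphore. This is a generic concurrency sketch under that assumption, not how Claude Code schedules its agents internally:

```python
import asyncio

async def run_category(name: str, sem: asyncio.Semaphore) -> dict:
    """Run one category's checks; the semaphore caps how many run concurrently."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the actual analysis work
        return {"category": name, "findings": []}

async def run_all(categories: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(2)  # at most two agents in flight at once
    return await asyncio.gather(*(run_category(c, sem) for c in categories))

results = asyncio.run(run_all(
    ["security", "performance", "errors", "database", "compliance", "quality"]))
```

asyncio.gather preserves input order, so the results come back in a deterministic sequence even though the pairs complete at different times.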

State persistence

Progress is saved after each pair completes. If your session crashes halfway through, you run /kem-sec:resume and it picks up from where it left off. The progress is on disk, not in the context window.

Output validation

Each agent returns structured JSON, and before any of it is saved the system runs a series of checks: that all required fields are present, that severity values are valid, that there are no duplicates between AI findings and what the automated tools already caught, and that the JSON itself actually parses.

Findings from automated tools and AI analysis are then merged using a deduplication algorithm. If npm audit found a vulnerability in src/auth.ts:42 and the security agent flagged the same thing, they get combined into one finding rather than appearing twice.

Coordination through files

None of the ten commands call each other. They all read from and write to the same file: .kem-sec/state.json.

state.json holds findings and progress; /audit, /fix, /pause, /resume, /verdict and /cleanup all read from and write to it. No command calls another directly.
Command · Reads · Writes

/kem-sec:audit · Previous state (to resume) · Findings, progress, report
/kem-sec:fix · Findings (to know what to fix) · Updated status per finding
/kem-sec:pause · Current progress · Checkpoint with position
/kem-sec:resume · Saved checkpoint · Continues updating progress
/kem-sec:verdict · Completed findings · Nothing (display only)

You can run /kem-sec:audit today, close your laptop, open a new session tomorrow and run /kem-sec:fix. It knows exactly what was found because the findings are in a file, not in a conversation that ended yesterday.

Why this matters beyond skills

People are building increasingly complex systems to give AI agents memory. Vector databases, embedding pipelines, dedicated frameworks like Mem0 and Letta. These solve real problems at scale. But look at what we've just walked through with nothing more than JSON files on disk.

State that persists between sessions. Findings that accumulate and inform future actions. Output that's validated before it's trusted. Multiple agents coordinating through a shared file without a central orchestrator.

For skill-based workflows, that is persistent memory. Not through a future model capability or a complex infrastructure stack but through files, structure and clear rules about how information flows.

What people want · How this achieves it

AI that remembers across sessions · State saved to JSON files on disk
AI that improves over time · Findings accumulate, inform fixes, build on previous runs
Agents that work reliably · Explicit invocation, validated output, deterministic workflows
Systems that don't lose progress · Checkpoints after each step, resume from any point

The architecture is also one step from genuine self-improvement. If a skill writes updated rules or priorities based on what previous runs found, the system starts learning from its own output. That's not something you need to wait for. You can build it now with the same tools.

The architecture behind kem-sec isn't specific to security audits. It's a pattern for building any AI workflow that needs to be consistent, persistent and reliable. The tools are skills, slash commands, JSON files and markdown. You already have all of them.

05

Where to start

Start with one skill, one job, explicitly invoked with disable-model-invocation: true and supporting files. That alone is more reliable than any auto-detected skill. Add the other patterns when the complexity demands it, not before.

Study a working example

Everything I've walked through in this post is available for you to study. kem-sec is open source and installable. You can read every command file, every checklist, every workflow document.

Better yet, feed Claude the context. Give it this blog post, point it at the kem-sec source and ask it to build something similar for your needs. Something like:

Example prompt

"I've read this post about building deterministic skills. I want to build a content review system that works the same way. It should have three skills: draft, review and publish. Use the same patterns: explicit invocation, runtime file loading, state persistence and output validation. Structure it like kem-sec with short command files and separate supporting files."

You now have the vocabulary to ask for the right thing.

Install kem-sec
npx kem-sec install
Then type /kem-sec:help in Claude Code to see all commands
06

Resources and close

Everything referenced in this post:

What · Where

Official skills documentation · docs.anthropic.com/en/docs/claude-code/skills
The Complete Guide to Building Skills (32-page PDF) · resources.anthropic.com/.../Building-Skill-for-Claude.pdf
Skill-creator plugin · claude plugin install skill-creator@claude-plugins-official
Example skills you can study · github.com/anthropics/skills
Agent Skills open standard · agentskills.io
How skills work under the hood · anthropic.com/engineering/...agent-skills

Research · Link

Lost in the Middle (Liu et al. 2023) · arxiv 2307.03172
Attention Sinks (Xiao et al. 2023) · arxiv 2309.17453
Found in the Middle (Hsieh et al. 2024) · arxiv 2406.16008
Context Length Alone Hurts (Du et al. 2025) · arxiv 2510.05381

What we covered

Your custom skills sit in the least-attended part of the context window and Claude will often fall back on training data rather than the instructions you actually wrote. That's the problem.

The solution is to stop relying on auto-detection and start designing systems where you control the invocation, the state and the output. Four patterns make this possible: runtime file loading, state persistence, output validation and coordination through shared files.

We walked through kem-sec to see all four working together in practice and saw that the same architecture opens the door to things like persistent memory without waiting for new model capabilities.

Start with one skill. Make it work. Build from there.

kem@the_campfire:~$ _

one place for all of us.

the_campfire is for researchers, engineers, hackers, founders, creators, rookies and hustlers building with AI who want to go beyond the defaults. Applications for founding members are open now.

campfire.aura-intel.net →