I Put My CLAUDE.md in a Cage Match. I Got Results I Didn't Expect.

The Three Context File Types framework + the minimal CLAUDE.md template that won the test.

Ilia Karelin

Jun 21, 2026


I had this interesting idea...why don’t I put my CLAUDE.md in a cage match?

Yes, you read that right.

Here’s how it came about…

I kept complaining to myself: “Claude keeps missing a bunch of rules in each markdown file.”

It was missing rules that appeared in multiple other files, too. Because of that, I constantly had to point Claude back to those files to read them again...session after session.

So I started to think about a way to fix this problem. I created a bunch of CLAUDE.md files, with various instructions in various styles, but ended up with just two finalists that I will talk about in a bit.

I was expecting the full version to win by a comfortable margin. So I was asking myself: “Why am I even doing this testing?”

First of all, the result surprised me. But more than that, once the testing was done, I read through the AI’s thinking in the test sessions and realized something. Then, while I was talking it over with Claude, little lightbulbs began to pop up for me, so I decided to share my findings with you.

I’d spent months building and modifying my CLAUDE.md. I thought I had it all figured out. It had everything:

- Voice rules

- Post type definitions

- Quality gates

- A file index

- Task routing

- Knowledge base instructions

- Cover image specs

160 lines of carefully structured context. It seemed to hold enough information that nothing could be missed. If you looked at it, you’d think it was doing serious work.

Then I ran the test. Blind-scored 10 outputs from two versions: my full file versus a 39-line minimal version I’d stripped down for the experiment. GPT 5.5 judged every output without knowing which was which.

Final score: Minimal 88, Full 78.

Minimal won four out of five tasks.

I read the results twice. Then I opened “reveal-key.md” - the file that I wasn’t supposed to look at until the end to see which version was which.

The file showing the 5 tasks and which CLAUDE.md version was assigned to each

I got confused. The results were not matching what I had in my head.

Then I went back to check the session logs and that’s when the real finding showed up.

Most people who use Claude Code or any agent framework have something like a CLAUDE.md. Anthropic’s term for it is a memory file. It sits in your project directory and loads at the start of every session, automatically.

The standard approach to building one: add rules as you go, point to other files for deeper context, keep a routing table of which prompt to use for which task. The more you put in, the more consistent Claude gets...or so the logic goes.

I held this belief for a long time. I kept adding to my CLAUDE.md every time something went wrong. Voice rules, mechanics of things. Over time, it became a thorough, well-organized file covering nearly every scenario.

The one thing I never did: measure whether any of it was working.

The Experiment

For this experiment, I created a folder that I literally called “cage-match”:

File explorer showing a cage-match folder containing two files, Full CLAUDE.md and Minimal CLAUDE.md, set up for the blind comparison test.

After that, I created two separate CLAUDE.md files and three separate sessions in Devin Desktop: one generation session for each version, plus one for scoring. I chose Devin because I wanted to use a completely different model for judging the results.

Version A - Full CLAUDE.md.

My actual file: 160 lines, and the one I thought was doing its job well.

Full 160-line CLAUDE.md file open in an editor, scrolled to the Key Files and Common Tasks sections, spanning multiple screens.

Version B - Minimal CLAUDE.md.

A 39-line version I built from scratch by asking one question: what does Claude actually need to write well?

Minimal 39-line CLAUDE.md file open in an editor, with the entire file visible on a single screen.

Five tasks, same for both versions:

1. Write the opening hook paragraph for a post about building an AI research assistant with Perplexity and NotebookLM.

2. Generate 3 title + subtitle options for a post about why AI prompts fail.

3. Write the Honest Tradeoff section for a post about replacing your weekly review with an AI retro.

4. Edit a generic AI-sounding paragraph for Prosper voice.

5. Recommend a cover concept for a post about running an A/B test on your CLAUDE.md.

How it ran:

Same model (Claude Sonnet 4.6) for both versions. Each task ran once per version. I also built “randomize.py” with AI, a Python script just for this experiment. It assigned random X/Y labels across the 10 outputs and wrote a scoring sheet, so I couldn’t tell which version produced which output.
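The actual “randomize.py” isn’t shown in this post, so here is only a minimal sketch of how that kind of blind labeling could work. It assumes one coin flip per task decides which version gets X and which gets Y. The function names (`assign_blind_labels`, `scoring_sheet_rows`, `render_reveal_key`) and the layout of the key are made up for illustration.

```python
import random

VERSIONS = ("full", "minimal")
LABELS = ("X", "Y")


def assign_blind_labels(num_tasks, seed=None):
    """Return {task_number: {"X": version, "Y": version}}.

    Hypothetical helper: for each task, the two versions are shuffled
    independently, so a label says nothing about which file produced it.
    """
    rng = random.Random(seed)
    key = {}
    for task in range(1, num_tasks + 1):
        shuffled = list(VERSIONS)
        rng.shuffle(shuffled)
        key[task] = dict(zip(LABELS, shuffled))
    return key


def scoring_sheet_rows(key):
    """List the (task, label) pairs the judge scores, with no version names."""
    return [(task, label) for task in sorted(key) for label in LABELS]


def render_reveal_key(key):
    """Render the secret mapping as plain lines, to be opened only after scoring."""
    lines = []
    for task in sorted(key):
        for label in LABELS:
            lines.append(f"Task {task} / Output {label}: {key[task][label]}")
    return "\n".join(lines)


if __name__ == "__main__":
    key = assign_blind_labels(num_tasks=5)
    print(len(scoring_sheet_rows(key)), "outputs to score")
    # In a real run, the reveal key would go to a file that stays closed
    # until every output has a score.
    print(render_reveal_key(key))
```

The point of the design is that the scoring sheet and the reveal key are separate: the judge only ever sees task numbers and X/Y labels.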

GPT scored every output blind on:

- Voice authenticity

- Quality bar compliance

- No banned phrases

- Specificity

Each criterion was scored 1–5. Max 20 per task, 100 total per version.
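The scoring arithmetic is simple enough to sketch in a few lines. This is an illustration of the rubric described above, not the scoring sheet used in the experiment: the criterion keys and function names are assumptions, and no real scores are included.

```python
# Criterion names follow the rubric above; the snake_case keys are assumed.
CRITERIA = (
    "voice_authenticity",
    "quality_bar_compliance",
    "no_banned_phrases",
    "specificity",
)
MIN_SCORE, MAX_SCORE = 1, 5
NUM_TASKS = 5


def output_total(scores):
    """Sum one output's criterion scores after checking each is in 1..5."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if not MIN_SCORE <= scores[name] <= MAX_SCORE:
            raise ValueError(f"{name} must be between {MIN_SCORE} and {MAX_SCORE}")
    return sum(scores[name] for name in CRITERIA)


def version_total(per_task_scores):
    """Sum the per-task totals for one CLAUDE.md version."""
    return sum(output_total(scores) for scores in per_task_scores)


# 4 criteria x 5 points = 20 per task; 20 x 5 tasks = 100 per version.
max_per_task = len(CRITERIA) * MAX_SCORE
max_per_version = max_per_task * NUM_TASKS

if __name__ == "__main__":
    print(max_per_task, max_per_version)
```

Each version gets its own total out of 100, which is why the two final scores don’t need to add up to anything.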

The experiment design was deliberately minimal. The generation happened in two separate Devin Desktop sessions, one with each version loaded as the active CLAUDE.md. Then the scoring was done by GPT 5.5 in a third Devin session.

Devin Desktop model selector dropdown with GPT 5.5 highlighted as the active model used to score the experiment.

By the end of this post, you’ll have:

- The full experiment results - task by task, with the exception that tells the most interesting story

- The Three Context File Types framework - a way to categorize every file in your context system and tell Claude exactly how to apply each one

- The specific one-line fix that changes how Claude processes your rule files

- A fill-in-the-blanks minimal context file template - works for CLAUDE.md, AGENTS.md, ChatGPT project instructions, or Cursor rules

Now, let’s get into the cage.

The Results

Here’s how the two versions scored across all five tasks:
