RAG Corpus Chunking Strategy

Splitting documents at H2 headings with stable IDs and content hashes produces predictable, debuggable chunks that support incremental re-indexing.

The Lesson

When chunking documents for RAG retrieval, split at semantic boundaries the author already created (headings) rather than at arbitrary token counts. Assign deterministic chunk IDs and compute content hashes so that re-indexing can detect what changed without re-embedding everything.

Context

A lessons library of 116 markdown documents needed to be chunked for vector search. Each lesson is a standalone markdown file with an H1 title, optional H2 sections, and YAML frontmatter (title, summary, tags, repo source). The chunks feed into a RAG chatbot that retrieves relevant excerpts, cites sources by lesson title, and builds grounded answers. The corpus is rebuilt on every CI run (daily + on push), so re-indexing performance matters.

What Happened

  1. Chose H2 headings as the split boundary. Each lesson's content before the first H2 becomes chunk 0 (labeled "Introduction"). Each H2 section becomes a subsequent chunk. H3 and deeper headings stay within their parent H2 chunk — splitting on every heading would create fragments too small for meaningful retrieval.
  2. Assigned chunk IDs as {lesson_id}-{chunk_index} where chunk_index is the position within the lesson (0 for intro, 1+ for sections). This is deterministic: the same lesson content always produces the same IDs, regardless of when or where the build runs.
  3. Each chunk carries a heading path breadcrumb (e.g., "Git Basics > Branching") so the RAG prompt builder can tell the LLM where in a lesson the excerpt came from. This improves citation quality — the model can say "according to the Branching section of Git Basics" rather than just "according to Git Basics."
  4. Computed a SHA-256 content hash (truncated to 16 hex chars) of each chunk's text. The hash is stored in the chunk metadata. On re-indexing, the embedder can compare hashes against the vector store's existing metadata to skip unchanged chunks. This infrastructure exists but isn't wired yet — the current embedder does a full re-index every time.
  5. Token count is estimated as word_count * 1.33 (i.e., roughly 0.75 words per token), with a minimum of 1. The corpus validator warns if any chunk exceeds 5,000 estimated tokens but doesn't fail the build. In practice, the 793 chunks produced from 116 lessons averaged well under 1,000 tokens.
  6. The corpus builder outputs two files: rag-chunks.json (the chunks) and rag-manifest.json (statistics: total lessons, chunks, skipped empty lessons, average tokens). The manifest is a build artifact for monitoring corpus drift over time.
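Putting steps 1–5 together, the chunker can be sketched roughly as follows. This is illustrative, not the project's actual code: `chunk_lesson` and the exact field names are assumptions, and it assumes every lesson has intro text before its first H2.

```python
import hashlib
import re

def chunk_lesson(lesson_id: str, title: str, body: str) -> list[dict]:
    """Split lesson markdown at H2 boundaries; H3+ headings stay in their parent chunk."""
    # Lookahead split keeps each "## Heading" line attached to its section text.
    sections = [s for s in re.split(r"(?m)^(?=## )", body) if s.strip()]
    chunks = []
    for index, text in enumerate(sections):
        if text.startswith("## "):
            heading = text.splitlines()[0].removeprefix("## ").strip()
            heading_path = f"{title} > {heading}"      # breadcrumb for citations
        else:
            heading = "Introduction"                   # content before the first H2
            heading_path = title
        content = text.strip()
        chunks.append({
            "chunk_id": f"{lesson_id}-{index}",        # deterministic across builds
            "chunk_index": index,
            "heading": heading,
            "heading_path": heading_path,
            "content": content,
            # 16-hex-char SHA-256 prefix, stored for change detection on re-index
            "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest()[:16],
            # ~1.33 tokens per word, minimum 1
            "token_estimate": max(1, round(len(content.split()) * 1.33)),
        })
    return chunks
```

Because the IDs and hashes depend only on the lesson content, running this twice on the same file yields byte-identical chunks, which is what makes the CI rebuilds debuggable.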
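The hash comparison from step 4, once wired up, could work along these lines. A hypothetical sketch: `plan_reindex` and the shape of `existing` (a chunk_id → content_hash mapping read from the vector store's metadata) are assumptions, not the actual embedder API.

```python
def plan_reindex(chunks: list[dict], existing: dict[str, str]) -> dict[str, list]:
    """Diff freshly built chunks against stored hashes to avoid full re-embedding."""
    to_embed, unchanged = [], []
    for chunk in chunks:
        if existing.get(chunk["chunk_id"]) == chunk["content_hash"]:
            unchanged.append(chunk["chunk_id"])   # same ID, same content: skip
        else:
            to_embed.append(chunk)                # new chunk or changed content
    # IDs in the store but no longer in the corpus should be deleted.
    corpus_ids = {c["chunk_id"] for c in chunks}
    stale = [cid for cid in existing if cid not in corpus_ids]
    return {"embed": to_embed, "skip": unchanged, "delete": stale}
```

With deterministic IDs, a renamed or reordered section shows up as one changed chunk rather than cascading new IDs across the whole lesson.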

Key Insights

  - Split at boundaries the author already created: headings encode semantic structure that arbitrary token windows discard.
  - Deterministic chunk IDs ({lesson_id}-{chunk_index}) make builds reproducible and individual chunks debuggable across runs.
  - Content hashes are cheap to compute up front, even before incremental re-indexing is wired in.
  - Heading-path breadcrumbs let the model cite the specific section, not just the lesson.

Examples

Input markdown (the H1 supplies the lesson title):

```markdown
# Git Basics

This lesson covers Git basics.

## Branching

Create branches with `git checkout -b feature`.

## Merging

Merge with `git merge feature`.
```

Output chunks:

| chunk_id | chunk_index | heading_path | content |
| --- | --- | --- | --- |
| git-basics-0 | 0 | Git Basics | This lesson covers Git basics. |
| git-basics-1 | 1 | Git Basics > Branching | `## Branching\n\nCreate branches...` |
| git-basics-2 | 2 | Git Basics > Merging | `## Merging\n\nMerge with...` |

Applicability

Heading-based chunking works when:

  - documents are authored with meaningful heading structure, like these markdown lessons with H1 titles and H2 sections;
  - sections are sized for retrieval, comfortably under the embedder's token limit.

Consider fixed-size windows instead when:

  - documents lack headings entirely (transcripts, logs, scraped text);
  - individual sections routinely exceed the token budget, which is what the validator's 5,000-token warning flags.

Related Lessons