A paper called “Adaptive Chunking” dropped this week from Ekimetrics. It describes a system that picks the best chunking method per document instead of applying one strategy to everything. The code is open source on GitHub.
More interesting to me were their five metrics for scoring chunk quality. Each one captures a different dimension of quality, and improving one doesn’t automatically improve the others.
References completeness
When a document says “the committee recommended it in their final report,” the words “it” and “their” point back to specific nouns earlier in the text. If a chunk boundary lands between the pronoun and the noun it refers to, the chunk becomes an incomplete thought. The retriever hands the model a fragment that references things defined somewhere else.
This metric uses a model that tracks pronoun-to-noun relationships to find all those pairs in the document, then checks what fraction of them stay inside the same chunk.
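In other words, the metric is just the fraction of pronoun-antecedent pairs that survive chunking. A minimal sketch, assuming the coreference pairs have already been extracted by some model as character offsets (the function name and input shapes here are my own illustration, not the paper’s code):

```python
def reference_completeness(coref_pairs, chunk_spans):
    """Fraction of coreference pairs whose pronoun and antecedent
    land in the same chunk.

    coref_pairs:  list of (pronoun_pos, antecedent_pos) character
                  offsets, as produced by a coreference model.
    chunk_spans:  list of (start, end) character offsets, one per chunk.
    """
    if not coref_pairs:
        return 1.0  # nothing to break

    def chunk_of(pos):
        # Find which chunk a character offset falls into.
        for i, (start, end) in enumerate(chunk_spans):
            if start <= pos < end:
                return i
        return None

    intact = 0
    for pron, ante in coref_pairs:
        cp, ca = chunk_of(pron), chunk_of(ante)
        if cp is not None and cp == ca:
            intact += 1
    return intact / len(coref_pairs)
```

A pair like (“it” at offset 60, “report” at offset 40) counts against the score when a boundary at offset 50 separates them.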
Block integrity
A typical document is built out of paragraphs, tables, figure captions, and titles paired with their body text. When a chunk boundary cuts through the middle of a table, that table becomes useless in both chunks. Same with a heading that ends up in one chunk while its content lands in the next.
This metric checks whether those structural units survived the chunking process intact. It uses the document parser’s output to identify block boundaries, then counts how many blocks got split.
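The count itself is simple once you have block spans from the parser. A sketch under that assumption (my own function name and input format):

```python
def block_integrity(block_spans, chunk_boundaries):
    """Fraction of structural blocks (paragraphs, tables, headings...)
    left intact by chunking.

    block_spans:       (start, end) character offsets from the
                       document parser's output.
    chunk_boundaries:  character offsets where the chunker cut.
    """
    if not block_spans:
        return 1.0
    # A block is "split" if any cut falls strictly inside it;
    # a cut exactly on a block edge is harmless.
    split = sum(
        1 for start, end in block_spans
        if any(start < cut < end for cut in chunk_boundaries)
    )
    return 1 - split / len(block_spans)
```

A cut at offset 100 between two blocks ending/starting at 100 costs nothing; a cut at 250 inside a (100, 300) table does.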
Intrachunk cohesion
When a chunk covers tax law in the first half and environmental regulations in the second half, its numerical representation becomes a blurry average of two unrelated topics. A query about either topic gets a weaker match than it should.
This metric converts each sentence and the full chunk into embeddings, then measures how similar they are. If all the sentences are about the same topic, the similarity scores are high. If the chunk mixes topics, the scores drop.
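Concretely, that’s a mean cosine similarity between each sentence embedding and the full-chunk embedding. A sketch that takes the embeddings as plain arrays, so any sentence encoder can sit in front of it (the function name is mine):

```python
import numpy as np

def intrachunk_cohesion(sentence_embs, chunk_emb):
    """Mean cosine similarity between each sentence's embedding and
    the embedding of the full chunk. High when all sentences share
    one topic, lower when the chunk mixes topics."""
    sents = np.asarray(sentence_embs, dtype=float)
    chunk = np.asarray(chunk_emb, dtype=float)
    # Normalize so the dot product is cosine similarity.
    sents = sents / np.linalg.norm(sents, axis=1, keepdims=True)
    chunk = chunk / np.linalg.norm(chunk)
    return float((sents @ chunk).mean())
```

With toy vectors: sentences pointing the same way as the chunk score 1.0, while two orthogonal sentences against their blurry average score about 0.71.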
Document contextual coherence
A chunk that’s been ripped out of a longer argument might be internally cohesive but nonsensical on its own. Does it make sense in the context of its neighbors?
The metric looks at roughly 3,000-token stretches of the document and measures how well each chunk fits within the stretch it came from. Higher scores mean chunks that preserve the flow of the surrounding document.
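One straightforward way to score that fit is cosine similarity between the chunk’s embedding and the embedding of the ~3,000-token window it came from; this is a sketch of the idea, and the paper’s exact formulation may differ:

```python
import numpy as np

def contextual_coherence(chunk_emb, window_emb):
    """Cosine similarity between a chunk's embedding and the embedding
    of the ~3,000-token stretch of the document it came from. A high
    score means the chunk still 'sounds like' its surroundings."""
    c = np.asarray(chunk_emb, dtype=float)
    w = np.asarray(window_emb, dtype=float)
    return float(c @ w / (np.linalg.norm(c) * np.linalg.norm(w)))
```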
Size compliance
What fraction of your chunks fall within your target size range? Too small and the chunk carries almost no information, wasting a retrieval slot. Too large and the numerical representation tries to capture too many ideas at once, making it harder to match against a specific query.
The paper used a target range of roughly 75 to 825 words. Merging tiny fragments and re-splitting oversized ones as a post-processing step consistently improved scores across every method tested.
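Both the compliance score and that merge/re-split cleanup are easy to sketch. Word-level re-splitting here is a deliberate crudeness; a real implementation would cut at sentence or block boundaries (function names and the word-count splitting are my illustration):

```python
def size_compliance(chunks, lo=75, hi=825):
    """Fraction of chunks whose word count falls in [lo, hi]
    (the paper's target range was roughly 75-825 words)."""
    if not chunks:
        return 1.0
    ok = sum(1 for c in chunks if lo <= len(c.split()) <= hi)
    return ok / len(chunks)

def normalize_sizes(chunks, lo=75, hi=825):
    """Post-process: merge undersized chunks into the next chunk,
    then re-split oversized ones. Crude word-level sketch."""
    merged = []
    for c in chunks:
        if merged and len(merged[-1].split()) < lo:
            merged[-1] = merged[-1] + " " + c  # absorb the tiny fragment
        else:
            merged.append(c)
    out = []
    for c in merged:
        words = c.split()
        while len(words) > hi:  # re-split anything still oversized
            out.append(" ".join(words[:hi]))
            words = words[hi:]
        out.append(" ".join(words))
    return out
```

With a toy range of 3-5 words, a 2-word fragment gets absorbed into its neighbor and a 6-word chunk gets re-split.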
Pulling in different directions
These five metrics pull in different directions. Optimizing for cohesion pushes toward smaller chunks, which hurts contextual coherence. Keeping structural blocks intact can force unrelated content into the same chunk, hurting cohesion. Despite that tension, the paper found that small improvements across all five metrics, just 1-2 percentage points each, compounded into meaningful retrieval gains overall.


