Overview

ExuluChunkers provides text chunking utilities that split large documents into smaller, meaningful segments for embedding generation and semantic search. The package includes two specialized chunking strategies: sentence-based chunking for natural language text and recursive chunking for hierarchical document structures.

Key features

  • Sentence chunking: splits text into sentence-based chunks with configurable overlap
  • Recursive chunking: hierarchical chunking with customizable splitting rules
  • Token-aware: the built-in tokenizer respects token limits for embeddings
  • Configurable overlap: control chunk overlap for better context preservation
  • Callable interface: chunkers are callable functions for intuitive usage
  • Factory pattern: async initialization via the .create() method

Why chunking matters

When working with large documents and language models, chunking is essential:
  • Embedding models have token limits (e.g., 8,192 for text-embedding-3-small). Chunking ensures text fits within these limits while preserving semantic coherence.
  • Smaller chunks provide more precise search results: instead of an entire document, users get the specific paragraph or section relevant to their query.
  • Chunk overlap ensures important context isn't lost at boundaries. When a sentence is split across chunks, the overlap captures the complete thought.
  • Semantically coherent chunks produce better embeddings. Sentence-based and hierarchical chunking maintain natural text boundaries.

Available chunkers

SentenceChunker

Splits text into chunks at sentence boundaries, respecting token limits:
import { ExuluChunkers } from "@exulu/backend";

// Create sentence chunker
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,        // Max 512 tokens per chunk
  chunkOverlap: 50,      // 50 tokens overlap between chunks
  minSentencesPerChunk: 1,
  minCharactersPerSentence: 10
});

// Chunk text
const chunks = await chunker("Your long document text here...");

console.log(chunks.length);        // Number of chunks
console.log(chunks[0].text);       // First chunk text
console.log(chunks[0].tokenCount); // Token count
Use SentenceChunker for:
  • Natural language documents (articles, blog posts, documentation)
  • Text where sentence boundaries are important
  • Content that benefits from grammatical coherence

RecursiveChunker

Hierarchically splits text using customizable rules (paragraphs → sentences → pauses → words → tokens):
import { ExuluChunkers } from "@exulu/backend";

// Create recursive chunker with default rules
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,             // Max 1024 tokens per chunk
  minCharactersPerChunk: 50
});

// Or with custom rules
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n", "\n"] },      // Split by paragraphs
    { delimiters: [". ", "! ", "? "] },   // Then sentences
    { whitespace: true }                  // Then words
  ]
});

const customChunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  rules: rules
});

// Chunk text
const chunks = await customChunker("Your document...");

console.log(chunks[0].level); // Recursion level used for this chunk
Use RecursiveChunker for:
  • Code documentation with hierarchical structure
  • Markdown documents with headers and sections
  • Content with clear structural delimiters
  • When you need control over splitting priorities
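
The hierarchical strategy can be sketched in plain TypeScript, independent of the library. This is purely illustrative: the Rule type and function names below are not the ExuluChunkers API, and character lengths stand in for real token counts.

```typescript
// Try the coarsest delimiter first; any piece still too long is split
// again with the next, finer rule. The level records which rule
// produced each chunk.
type Rule = { delimiters: string[] } | { whitespace: true };

function splitByRule(text: string, rule: Rule): string[] {
  if ("whitespace" in rule) return text.split(/\s+/).filter(Boolean);
  // Split on any delimiter, keeping the delimiter with the left piece.
  const pattern = rule.delimiters
    .map(d => d.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  return text.split(new RegExp(`(?<=${pattern})`)).filter(p => p.length > 0);
}

function recursiveChunk(
  text: string,
  rules: Rule[],
  maxLen: number,
  level = 0
): { text: string; level: number }[] {
  if (text.length <= maxLen || level >= rules.length) {
    return [{ text, level }];
  }
  const chunks: { text: string; level: number }[] = [];
  for (const piece of splitByRule(text, rules[level])) {
    if (piece.length <= maxLen) {
      chunks.push({ text: piece, level });
    } else {
      chunks.push(...recursiveChunk(piece, rules, maxLen, level + 1));
    }
  }
  return chunks;
}

const rules: Rule[] = [
  { delimiters: ["\n\n"] },           // paragraphs first
  { delimiters: [". ", "! ", "? "] }, // then sentences
  { whitespace: true },               // then words
];

const doc = "First paragraph. It has two sentences.\n\nSecond paragraph is short.";
const chunks = recursiveChunk(doc, rules, 39);
// The long first paragraph falls through to the sentence rule (level 1);
// the short second paragraph is kept whole at the paragraph rule (level 0).
```

The key design point is that each level only handles what the previous level could not, so chunks stay as large and as structurally coherent as the limit allows.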

Chunking workflow

1. Create chunker

Initialize with .create() and configuration options:
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50
});

2. Chunk text

Call the chunker as a function with your text:
const chunks = await chunker(documentText);

3. Process chunks

Iterate through the chunks and use them in your application:
for (const chunk of chunks) {
  console.log(chunk.text);
  console.log(chunk.tokenCount);
  console.log(chunk.startIndex, chunk.endIndex);
}

4. Generate embeddings

Pass the chunks to your embedder for vector generation:
const embeddings = await embedder.generate(
  chunks.map(c => c.text)
);

Quick comparison

Feature        | SentenceChunker     | RecursiveChunker
Strategy       | Sentence boundaries | Hierarchical rules
Overlap        | ✅ Configurable      | ❌ No overlap
Best for       | Natural language    | Structured documents
Customization  | Minimal             | Extensive via rules
Complexity     | Simple              | Advanced
Level tracking | ❌ No                | ✅ Yes

Integration with ExuluContext

ExuluChunkers are designed to work with ExuluContext for semantic search:
import { ExuluContext, ExuluChunkers, ExuluEmbedder } from "@exulu/backend";

// Create chunker
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50
});

// Create context with chunker
const context = new ExuluContext({
  id: "docs",
  name: "Documentation",
  description: "Product documentation search",
  embedder: embedder,
  chunker: chunker, // Use the chunker
  fields: [
    { name: "title", type: "text", required: true },
    { name: "content", type: "longtext", required: true }
  ],
  sources: []
});

// Documents are automatically chunked during insertion
await context.createItem(
  {
    title: "Getting Started",
    content: "Very long documentation content..."
  },
  { generateEmbeddings: true }
);

Chunk structure

All chunkers return an array of Chunk objects:
type Chunk = {
  text: string;         // The chunk text
  startIndex: number;   // Start position in original text
  endIndex: number;     // End position in original text
  tokenCount: number;   // Number of tokens in chunk
  embedding?: number[]; // Optional embedding vector
}

// RecursiveChunk extends Chunk with:
type RecursiveChunk = Chunk & {
  level?: number;       // Recursion level (0 = top level)
}
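
The index fields make every chunk traceable back to the source text. A toy character-based chunker (hypothetical and purely illustrative; the real chunkers count tokens, not characters) shows how startIndex/endIndex always round-trip through slice():

```typescript
// Minimal Chunk shape from the docs, filled in by a naive
// fixed-size chunker with overlap.
type Chunk = {
  text: string;
  startIndex: number;
  endIndex: number;
  tokenCount: number;
};

function toyChunk(text: string, size: number, overlap: number): Chunk[] {
  const chunks: Chunk[] = [];
  const step = size - overlap; // each chunk starts `overlap` chars early
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    const slice = text.slice(start, end);
    chunks.push({
      text: slice,
      startIndex: start,
      endIndex: end,
      // crude stand-in for real tokenization: whitespace word count
      tokenCount: slice.split(/\s+/).filter(Boolean).length,
    });
    if (end === text.length) break;
  }
  return chunks;
}

const text = "abcdefghij";
const chunks = toyChunk(text, 4, 1);
// Every chunk's text can be recovered from the original string:
for (const c of chunks) {
  console.assert(text.slice(c.startIndex, c.endIndex) === c.text);
}
```

Note how consecutive chunks share one character ("abcd", "defg", "ghij"): that shared region is exactly what chunkOverlap provides at token granularity.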

Token counting

Both chunkers use ExuluTokenizer for accurate token counting:
// Chunkers respect token limits, not character limits
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512 // 512 tokens, not characters
});

// Handles multi-byte characters correctly
const text = "Hello 世界! This is a test.";
const chunks = await chunker(text);

// Each chunk's tokenCount is accurate
console.log(chunks[0].tokenCount); // Actual token count

Best practices

Match embedding limits: Set chunkSize based on your embedding model's token limit. Leave room for context (e.g., 512 tokens for a 1536-token limit).
Use overlap for continuity: For natural language, use 10-20% overlap (e.g., 50-100 tokens for 512-token chunks) to preserve context across boundaries.
Validate chunk size: Ensure chunkSize is larger than chunkOverlap. The chunker will throw an error if overlap equals or exceeds chunk size.
Choose the right chunker: Use SentenceChunker for most text documents. Use RecursiveChunker when you need fine control over structural boundaries.
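
The validation and overlap guidance above can be sketched as small helpers (hypothetical names, not part of the ExuluChunkers API):

```typescript
// Reject configs where overlap equals or exceeds chunk size, mirroring
// the error the chunker itself would throw.
function validateChunkConfig(chunkSize: number, chunkOverlap: number): void {
  if (chunkOverlap >= chunkSize) {
    throw new Error(
      `chunkOverlap (${chunkOverlap}) must be smaller than chunkSize (${chunkSize})`
    );
  }
}

// 10-20% of chunkSize is the suggested overlap range for prose.
function suggestedOverlap(chunkSize: number, ratio = 0.1): number {
  return Math.round(chunkSize * ratio);
}

validateChunkConfig(512, 50); // ok
suggestedOverlap(512);        // ~51 tokens, close to the 50 used above
```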

Performance considerations

  • Tokenization cost: Token counting requires encoding text. For large documents, this adds processing time.
  • Chunk count: Smaller chunks = more chunks = more embeddings = higher API costs and storage.
  • Overlap vs. accuracy: Higher overlap improves context but increases chunk count and costs.
Recommended chunk sizes:
  • Small documents (< 10K tokens): 256-512 tokens per chunk
  • Medium documents (10K-100K tokens): 512-1024 tokens per chunk
  • Large documents (> 100K tokens): 1024-2048 tokens per chunk
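
The size guidance above, expressed as a tiny helper (the thresholds come straight from the list; they are a rule of thumb, not a library requirement):

```typescript
// Pick a chunk size from the document's total token count, using the
// upper end of each recommended range.
function recommendedChunkSize(documentTokens: number): number {
  if (documentTokens < 10_000) return 512;    // small documents
  if (documentTokens <= 100_000) return 1024; // medium documents
  return 2048;                                // large documents
}
```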

Example: Chunking strategies

// Blog post, article, documentation
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50,
  minSentencesPerChunk: 2,  // At least 2 sentences per chunk
  minCharactersPerSentence: 15
});

const text = `
  Introduction to Machine Learning

  Machine learning is a subset of AI. It enables computers
  to learn from data without explicit programming.

  Types of Learning

  Supervised learning uses labeled data. Unsupervised
  learning finds patterns in unlabeled data.
`;

const chunks = await chunker(text);
// Result: Chunks split at sentence boundaries

Next steps