ExuluChunkers provides text chunking utilities that split large documents into smaller, meaningful segments for embedding generation and semantic search. The package includes two specialized chunking strategies: sentence-based chunking for natural language text and recursive chunking for hierarchical document structures.
When working with large documents and language models, chunking is essential:
Token limits
Embedding models have token limits (e.g., 8,192 for text-embedding-3-small). Chunking ensures text fits within these limits while preserving semantic coherence.
Search granularity
Smaller chunks provide more precise search results. Instead of returning an entire document, users get the specific paragraph or section relevant to their query.
Context preservation
Chunk overlap ensures important context isn't lost at boundaries. When a sentence is split across chunks, overlap captures the complete thought.
Embedding quality
Semantically coherent chunks produce better embeddings. Sentence-based and hierarchical chunking maintain natural text boundaries.
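How a size limit and an overlap interact can be illustrated with a minimal sketch. This is not the ExuluChunkers implementation: `windowWithOverlap` is a hypothetical helper, and whitespace-separated words stand in for real tokens purely for illustration.

```typescript
// Hypothetical helper: splits a list of tokens into windows of at most `size`
// tokens, where each window repeats the last `overlap` tokens of the previous one.
function windowWithOverlap<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const result: T[][] = [];
  const step = size - overlap; // how far the window start advances each time
  for (let start = 0; start < tokens.length; start += step) {
    result.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return result;
}

// Assumption: words approximate tokens here; a real tokenizer would differ.
const sampleWords = "the quick brown fox jumps over the lazy dog again".split(" ");
const overlapWindows = windowWithOverlap(sampleWords, 4, 1);
for (const w of overlapWindows) console.log(w.join(" "));
```

With a window of 4 and an overlap of 1, the last word of each window reappears as the first word of the next, so text that straddles a boundary is visible from both sides.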
type Chunk = {
  text: string;         // The chunk text
  startIndex: number;   // Start position in original text
  endIndex: number;     // End position in original text
  tokenCount: number;   // Number of tokens in chunk
  embedding?: number[]; // Optional embedding vector
};

// RecursiveChunk extends Chunk with:
type RecursiveChunk = Chunk & {
  level?: number;       // Recursion level (0 = top level)
};
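As a rough sketch of how these shapes might be consumed downstream, the example below repeats the two types so that it is self-contained. The helpers `overBudget` and `groupByLevel` are hypothetical, the sample data is made up, and treating a missing `level` as 0 is an assumption rather than documented behavior.

```typescript
type Chunk = {
  text: string;
  startIndex: number;
  endIndex: number;
  tokenCount: number;
  embedding?: number[];
};
type RecursiveChunk = Chunk & { level?: number };

// Hypothetical helper: returns the chunks whose token count exceeds a model limit.
function overBudget<C extends Chunk>(chunks: C[], maxTokens: number): C[] {
  return chunks.filter((c) => c.tokenCount > maxTokens);
}

// Hypothetical helper: groups recursive chunks by level.
// Assumption: a chunk without a level is treated as top level (0).
function groupByLevel(chunks: RecursiveChunk[]): Map<number, RecursiveChunk[]> {
  const groups = new Map<number, RecursiveChunk[]>();
  for (const c of chunks) {
    const level = c.level ?? 0;
    const bucket = groups.get(level) ?? [];
    bucket.push(c);
    groups.set(level, bucket);
  }
  return groups;
}

// Made-up sample data for illustration only.
const sampleChunks: RecursiveChunk[] = [
  { text: "Intro paragraph.", startIndex: 0, endIndex: 16, tokenCount: 3, level: 0 },
  { text: "A nested detail.", startIndex: 17, endIndex: 33, tokenCount: 4, level: 1 },
  { text: "Another detail.", startIndex: 34, endIndex: 49, tokenCount: 600, level: 1 },
];

console.log(overBudget(sampleChunks, 512).length);      // 1
console.log(groupByLevel(sampleChunks).get(1)?.length); // 2
```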
Both chunkers use ExuluTokenizer for accurate token counting:
// Chunkers respect token limits, not character limits
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512 // 512 tokens, not characters
});

// Handles multi-byte characters correctly
const text = "Hello 世界! This is a test.";
const chunks = await chunker(text);

// Each chunk's tokenCount is accurate
console.log(chunks[0].tokenCount); // Actual token count
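To see why character counts are a poor proxy for token counts, it can help to compare the different ways the same string can be measured. The sketch below uses only standard JavaScript APIs and does not compute token counts, which depend on the tokenizer in use; `measure` is a hypothetical helper.

```typescript
// Hypothetical helper: reports three different "lengths" of the same string.
function measure(s: string): { codeUnits: number; codePoints: number; utf8Bytes: number } {
  return {
    codeUnits: s.length,                           // UTF-16 code units
    codePoints: Array.from(s).length,              // Unicode code points
    utf8Bytes: new TextEncoder().encode(s).length, // bytes when encoded as UTF-8
  };
}

console.log(measure("Hello 世界!")); // { codeUnits: 9, codePoints: 9, utf8Bytes: 13 }
console.log(measure("👍"));          // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
```

None of these three numbers is a token count, which is why a limit expressed in tokens has to be checked with the tokenizer itself.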
// Blog post, article, documentation
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50,
  minSentencesPerChunk: 2, // At least 2 sentences per chunk
  minCharactersPerSentence: 15
});

const text = `
Introduction to Machine Learning
Machine learning is a subset of AI. It enables computers to learn from data without explicit programming.
Types of Learning
Supervised learning uses labeled data. Unsupervised learning finds patterns in unlabeled data.
`;

const chunks = await chunker(text);
// Result: Chunks split at sentence boundaries
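The idea of splitting at sentence boundaries can be sketched with a simplified greedy packer. This is an illustration only, not how ExuluChunkers is implemented: `splitSentences`, `countWords`, and `packSentences` are hypothetical helpers, sentences are detected with a naive regular expression, word counts stand in for token counts, and overlap and minimum-sentence settings are ignored.

```typescript
// Naive sentence splitter: a run of text followed by ., ! or ? (or the end of input).
function splitSentences(input: string): string[] {
  const matches: string[] = input.match(/[^.!?]+[.!?]*/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Stand-in for a tokenizer: counts whitespace-separated words.
function countWords(s: string): number {
  return s.split(/\s+/).filter((w) => w.length > 0).length;
}

// Hypothetical helper: greedily packs whole sentences into chunks of at most
// `maxWords` words. A single sentence longer than the budget becomes its own chunk.
function packSentences(input: string, maxWords: number): string[] {
  const packedChunks: string[] = [];
  let current: string[] = [];
  let currentWords = 0;
  for (const sentence of splitSentences(input)) {
    const n = countWords(sentence);
    // Start a new chunk if adding this sentence would exceed the budget.
    if (current.length > 0 && currentWords + n > maxWords) {
      packedChunks.push(current.join(" "));
      current = [];
      currentWords = 0;
    }
    current.push(sentence);
    currentWords += n;
  }
  if (current.length > 0) packedChunks.push(current.join(" "));
  return packedChunks;
}

const packed = packSentences(
  "Supervised learning uses labeled data. Unsupervised learning finds patterns in unlabeled data. Both are widely used.",
  12
);
console.log(packed);
```

With a budget of 12 words, the first two sentences (5 and 7 words) fit together and the third starts a new chunk, so no sentence is cut in the middle.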