## SentenceChunker configuration

### Factory method

Create a SentenceChunker using the async factory method:

```typescript
import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.sentence.create(options);
```
### Options

**chunkSize** (`number`)

Maximum number of tokens per chunk.

```typescript
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512 // Max 512 tokens per chunk
});
```
Guidelines:

- Small chunks (128-256): High granularity, more chunks, higher costs
- Medium chunks (256-512): Balanced for most use cases
- Large chunks (512-1024): Less granular, fewer chunks, lower costs

Match your embedding model (see the sketch below):

- OpenAI text-embedding-3-small: 8,191 tokens → use 512-1024
- OpenAI text-embedding-3-large: 8,191 tokens → use 512-1024
- Cohere embed-english-v3.0: 512 tokens → use 256-512
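Those recommendations can be codified; a minimal sketch where the limits and suggested sizes come from the list above, and `MODEL_CHUNK_SIZES` is an illustrative constant rather than part of the Exulu API:

```typescript
import { ExuluChunkers } from "@exulu/backend";

// Suggested chunk sizes per embedding model, per the guidelines above.
const MODEL_CHUNK_SIZES: Record<string, number> = {
  "text-embedding-3-small": 1024, // 8,191-token limit
  "text-embedding-3-large": 1024, // 8,191-token limit
  "embed-english-v3.0": 384       // 512-token limit
};

const chunker = await ExuluChunkers.sentence.create({
  chunkSize: MODEL_CHUNK_SIZES["embed-english-v3.0"] ?? 512
});
```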
**chunkOverlap** (`number`, default: `0`)

Number of tokens to overlap between consecutive chunks.

```typescript
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50 // 50 tokens overlap
});
```
Guidelines:

- No overlap (0): No redundancy, sharp boundaries
- Low overlap (10-20): Minimal context preservation
- Medium overlap (50-100): Good balance for natural language
- High overlap (100-200): Maximum context, but increases chunk count

Recommended overlap ratios:

- 10-15% of chunk size for technical docs
- 15-20% of chunk size for natural language
- 20-25% of chunk size for narrative content

Overlap must be less than chunkSize; the chunker throws an error if chunkOverlap >= chunkSize.
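Overlap sets the stride between consecutive chunks, so it directly affects chunk count: each chunk advances roughly chunkSize - chunkOverlap tokens. A minimal sketch that derives the overlap from one of the ratios above and estimates the resulting chunk count (`estimateChunkCount` is an illustrative helper, not part of the Exulu API):

```typescript
const chunkSize = 512;
const overlapRatio = 0.15; // 15%, per the natural-language guideline above
const chunkOverlap = Math.min(
  Math.floor(chunkSize * overlapRatio), // 76 tokens
  chunkSize - 1                         // must stay strictly below chunkSize
);

// Each chunk advances (chunkSize - chunkOverlap) tokens past the previous one.
function estimateChunkCount(totalTokens: number): number {
  return Math.ceil(totalTokens / (chunkSize - chunkOverlap));
}

console.log(estimateChunkCount(100_000)); // ≈ 230 chunks at a 436-token stride
```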
**minSentencesPerChunk** (`number`, default: `1`)

Minimum number of sentences per chunk.

```typescript
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  minSentencesPerChunk: 2 // At least 2 sentences per chunk
});
```
Use cases:

- 1: Allow single-sentence chunks (default)
- 2-3: Ensure contextual coherence
- 3+: For documents where individual sentences lack context
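To see how the minimum shifts chunk boundaries, chunk the same text with different values and inspect the output; a minimal sketch using the factory shown above (the deliberately small chunkSize just makes boundaries visible):

```typescript
import { ExuluChunkers } from "@exulu/backend";

const sample = "Short one. Another short one. A third sentence. And a fourth.";

for (const minSentencesPerChunk of [1, 3]) {
  const chunker = await ExuluChunkers.sentence.create({
    chunkSize: 32, // small on purpose so boundaries are visible
    minSentencesPerChunk
  });
  const chunks = await chunker(sample);
  console.log(`min=${minSentencesPerChunk}: ${chunks.length} chunks`);
  chunks.forEach((c, i) => console.log(`  [${i}] ${c.text}`));
}
```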
**minCharactersPerSentence** (`number`, default: `10`)

Minimum character length for a text segment to be considered a sentence.

```typescript
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  minCharactersPerSentence: 20 // Sentences must be at least 20 chars
});
```
Use cases:

- 5-10: Allow short sentences (e.g., "Yes.", "No.")
- 10-20: Filter out fragments (default)
- 20+: Ensure substantive sentences
### Complete example

```typescript
import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,              // Max 512 tokens per chunk
  chunkOverlap: 75,            // 75 tokens overlap (15% of chunk size)
  minSentencesPerChunk: 2,     // At least 2 sentences per chunk
  minCharactersPerSentence: 15 // Sentences must be at least 15 chars
});

const text = `
Machine learning is transforming industries. It enables
computers to learn patterns from data. This technology
powers recommendation systems, fraud detection, and more.

Deep learning is a subset of machine learning. It uses
neural networks with many layers. These networks can
recognize complex patterns in images, text, and audio.
`;

const chunks = await chunker(text);

console.log(chunks.length);        // Number of chunks
console.log(chunks[0].text);       // First chunk text
console.log(chunks[0].tokenCount); // Token count
```
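A typical next step is to embed each chunk for retrieval. Continuing the example above, a sketch of wiring the output into an embedding call; `embedText` is a hypothetical stand-in for your embedding client, not an Exulu API:

```typescript
// Hypothetical embedding call: replace with your embedding client.
declare function embedText(text: string): Promise<number[]>;

const embedded = await Promise.all(
  chunks.map(async (chunk) => ({
    text: chunk.text,
    tokenCount: chunk.tokenCount,
    embedding: await embedText(chunk.text)
  }))
);
```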
## RecursiveChunker configuration

### Factory method

Create a RecursiveChunker using the async factory method:

```typescript
import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.recursive.function.create(options);
```

### Options

**chunkSize** (`number`)

Maximum number of tokens per chunk.

```typescript
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024 // Max 1024 tokens per chunk
});
```

The same guidelines as the SentenceChunker `chunkSize` apply.
**rules** (`RecursiveRules`, default: default rules)

Recursive splitting rules defining the hierarchy (default: paragraphs → sentences → pauses → words → tokens).

```typescript
// Use default rules
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024
  // rules not specified = default rules
});

// Or specify custom rules
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n"] },           // Split by double newline
    { delimiters: [". ", "! ", "? "] }, // Then sentences
    { whitespace: true }                // Then whitespace
  ]
});

const customChunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  rules: rules
});
```
Default rules hierarchy:

1. Paragraphs: ["\n\n", "\r\n", "\n", "\r"]
2. Sentences: [". ", "! ", "? "]
3. Pauses: ["{", "}", "\"", "[", "]", "<", ">", "(", ")", ":", ";", ",", "—", "|", "~", "-", "...", "`", "'"]
4. Words: whitespace: true
5. Tokens: No delimiters (fallback to token-level splitting)
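Written out with the custom-rules API, that default hierarchy would look roughly like the following; this is a sketch of the documented defaults, not code taken from the library:

```typescript
import { ExuluChunkers } from "@exulu/backend";

// Approximation of the default hierarchy using the custom-rules API.
const defaultLikeRules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n", "\r\n", "\n", "\r"] }, // Paragraphs
    { delimiters: [". ", "! ", "? "] },           // Sentences
    { delimiters: ["{", "}", "\"", "[", "]", "<", ">", "(", ")",
                   ":", ";", ",", "—", "|", "~", "-", "...", "`", "'"] }, // Pauses
    { whitespace: true }                          // Words
    // The final token-level fallback has no explicit level here.
  ]
});
```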
**minCharactersPerChunk** (`number`, default: `50`)

Minimum character length for a chunk.

```typescript
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  minCharactersPerChunk: 100 // Chunks must be at least 100 chars
});
```

Guidelines:

- 20-50: Allow smaller chunks
- 50-100: Default range
- 100+: Ensure substantive chunks
### Complete example

```typescript
import { ExuluChunkers } from "@exulu/backend";

// Custom rules for markdown documents
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n## ", "\n### "] }, // Split by headers
    { delimiters: ["\n\n"] },            // Then paragraphs
    { delimiters: [". ", "! ", "? "] },  // Then sentences
    { whitespace: true }                 // Then words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  rules: rules,
  minCharactersPerChunk: 75
});

const markdown = `
## Introduction

Machine learning enables computers to learn from data.
It powers many modern applications.

## Applications

Recommendation systems use ML to suggest content.
Fraud detection systems identify suspicious activity.

## Future Directions

The field continues to evolve rapidly.
New techniques emerge regularly.
`;

const chunks = await chunker(markdown);

for (const chunk of chunks) {
  console.log(`Level ${chunk.level}: ${chunk.text.slice(0, 50)}...`);
  console.log(`Tokens: ${chunk.tokenCount}`);
}
```
## RecursiveRules configuration

### Constructor

Create custom recursive rules:

```typescript
import { ExuluChunkers } from "@exulu/backend";

const rules = new ExuluChunkers.recursive.rules({
  levels: [ ... ] // Array of RecursiveLevelData
});
```

**levels** (`RecursiveLevelData[]`)

Array of recursive levels defining the splitting hierarchy.

```typescript
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n"] },
    { delimiters: [". "] },
    { whitespace: true }
  ]
});
```
Each level is a RecursiveLevelData object with:

**delimiters** (`string | string[]`)

Delimiter(s) to use for splitting at this level.

```typescript
// Single delimiter
{ delimiters: "\n\n" }

// Multiple delimiters
{ delimiters: [". ", "! ", "? "] }
```

**whitespace** (`boolean`, default: `false`)

Whether to split on whitespace at this level.

```typescript
// Split on any whitespace character
{ whitespace: true }
```

You cannot use both delimiters and whitespace in the same level; they are mutually exclusive.
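One way to picture that constraint is as a TypeScript union; this is an illustrative sketch of the shape, not the library's actual exported type:

```typescript
// Illustrative only: a level either declares delimiters (optionally with
// includeDelim, described below) or opts into whitespace splitting.
type RecursiveLevelSketch =
  | { delimiters: string | string[]; includeDelim?: "prev" | "next" }
  | { whitespace: true };
```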
**includeDelim** (`'prev' | 'next'`, default: `"prev"`)

Whether to include the delimiter in the previous or next chunk.

```typescript
// Delimiter stays with previous chunk
{ delimiters: [". "], includeDelim: "prev" }

// Delimiter moves to next chunk
{ delimiters: ["\n## "], includeDelim: "next" }
```

Use cases:

- "prev": For punctuation (sentences keep their periods)
- "next": For headers (headers stay with their content)
## RecursiveLevel examples

### Markdown documents

```typescript
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    // Split by headers (keep header with content)
    {
      delimiters: ["\n# ", "\n## ", "\n### "],
      includeDelim: "next"
    },
    // Split by paragraphs
    { delimiters: ["\n\n"] },
    // Split by sentences
    { delimiters: [". ", "! ", "? "] },
    // Split by words
    { whitespace: true }
  ]
});
```

### Code documentation

````typescript
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    // Split by code blocks
    { delimiters: ["```"] },
    // Split by sections
    { delimiters: ["\n\n"] },
    // Split by list items
    { delimiters: ["\n- ", "\n* ", "\n1. "] },
    // Split by sentences
    { delimiters: [". "] },
    // Split by words
    { whitespace: true }
  ]
});
````

### Structured data

```typescript
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    // Split by JSON objects
    { delimiters: ["},\n"] },
    // Split by object properties
    { delimiters: [",\n"] },
    // Split by words
    { whitespace: true }
  ]
});
```

### Minimal (aggressive)

```typescript
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    // Only paragraphs and words
    { delimiters: ["\n\n"] },
    { whitespace: true }
  ]
});
```
## Configuration patterns

### Natural language documents

```typescript
// Optimized for articles, blog posts, documentation
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 75,        // 15% overlap
  minSentencesPerChunk: 2, // At least 2 sentences
  minCharactersPerSentence: 15
});
```

Why this works:

- 512 tokens fits most embedding models
- 15% overlap preserves context
- Minimum 2 sentences ensures coherence
- 15 char minimum filters fragments
### Technical documentation

````typescript
// Optimized for API docs, guides, tutorials
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n## ", "\n### "] }, // Headers
    { delimiters: ["```"] },             // Code blocks
    { delimiters: ["\n\n"] },            // Paragraphs
    { delimiters: [". "] },              // Sentences
    { whitespace: true }                 // Words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024, // Larger for code examples
  rules: rules,
  minCharactersPerChunk: 100
});
````

Why this works:

- Respects structural boundaries (headers, code)
- 1024 tokens accommodates code examples
- 100 char minimum ensures substantive chunks
### Long-form content

```typescript
// Optimized for books, papers, long articles
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 768,          // Larger chunks
  chunkOverlap: 150,       // ~20% overlap
  minSentencesPerChunk: 3, // More context per chunk
  minCharactersPerSentence: 20
});
```

Why this works:

- Larger chunks capture more context
- Higher overlap maintains narrative flow
- Minimum 3 sentences ensures coherence
- 20 char minimum ensures quality sentences
### High-precision search

```typescript
// Optimized for precise search results
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 256,          // Smaller chunks
  chunkOverlap: 25,        // ~10% overlap
  minSentencesPerChunk: 1, // Allow single sentences
  minCharactersPerSentence: 10
});
```

Why this works:

- Smaller chunks = more precise results
- Lower overlap = less redundancy
- Single sentences allowed for granularity
### Code files

```typescript
// Optimized for source code
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\nclass ", "\nfunction ", "\nconst "] }, // Top-level declarations
    { delimiters: ["{\n", "}\n"] },                          // Blocks
    { delimiters: ["\n"] },                                  // Lines
    { whitespace: true }                                     // Words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 2048, // Larger for functions/classes
  rules: rules,
  minCharactersPerChunk: 50
});
```

Why this works:

- Respects code structure (functions, classes)
- Large chunks keep functions/methods together
- Line-level splitting for smaller units
## Tuning recommendations

### Start conservative

```typescript
// Begin with safe defaults
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50,
  minSentencesPerChunk: 1,
  minCharactersPerSentence: 10
});

// Test and adjust based on:
// - Search result quality
// - Chunk count and costs
// - Context preservation
```
### Monitor chunk statistics

```typescript
const chunks = await chunker(text);

const avgTokens = chunks.reduce((sum, c) => sum + c.tokenCount, 0) / chunks.length;
const maxTokens = Math.max(...chunks.map(c => c.tokenCount));
const minTokens = Math.min(...chunks.map(c => c.tokenCount));

console.log(`Chunks: ${chunks.length}`);
console.log(`Avg tokens: ${avgTokens.toFixed(2)}`);
console.log(`Max tokens: ${maxTokens}`);
console.log(`Min tokens: ${minTokens}`);

// Adjust configuration based on statistics
```
### Test with real data

```typescript
// Sample representative documents
const sampleDocs = [
  "Short article content...",
  "Medium length blog post...",
  "Very long technical documentation..."
];

// Test chunking
for (const doc of sampleDocs) {
  const chunks = await chunker(doc);
  console.log(`Doc length: ${doc.length} chars`);
  console.log(`Chunks: ${chunks.length}`);
  console.log(`Avg chunk: ${doc.length / chunks.length} chars`);
  console.log("---");
}

// Adjust based on results
```
## Common pitfalls

- **Overlap >= chunk size**: The chunker will throw an error. Ensure chunkOverlap < chunkSize (see the sketch after this list).
- **Chunk size too small**: Very small chunks lose context. Minimum recommended: 128 tokens.
- **No overlap on narrative content**: Natural language benefits from overlap. Use 10-20% overlap for continuity.
- **Wrong chunker for content type**: Use SentenceChunker for natural language, RecursiveChunker for structured content.
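For the first pitfall, the invalid configuration fails at creation time; a minimal sketch (the exact error type and message are not specified here, so the handler just logs whatever the factory throws):

```typescript
import { ExuluChunkers } from "@exulu/backend";

try {
  // Invalid: chunkOverlap is not strictly less than chunkSize.
  await ExuluChunkers.sentence.create({
    chunkSize: 256,
    chunkOverlap: 256
  });
} catch (err) {
  console.error("Rejected chunker configuration:", err);
}
```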
## Best practices

- **Match your embedding model**: Set chunkSize to 60-80% of your embedding model's token limit to leave room for metadata (see the sketch after this list).
- **Use overlap for RAG**: Overlap improves retrieval quality in RAG systems by ensuring context isn't lost at boundaries.
- **Custom rules for domain content**: If your documents have consistent structure (e.g., legal docs, medical records), create custom RecursiveRules to respect that structure.
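A quick sketch of the first practice, using the 512-token Cohere limit quoted earlier; 75% sits in the middle of the recommended range:

```typescript
import { ExuluChunkers } from "@exulu/backend";

const modelTokenLimit = 512; // e.g. Cohere embed-english-v3.0
const chunkSize = Math.floor(modelTokenLimit * 0.75); // 384 tokens, within 60-80%

const chunker = await ExuluChunkers.sentence.create({ chunkSize });
```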
## Next steps

- API reference: Explore methods and properties
- Overview: Learn about chunking concepts