SentenceChunker configuration

Factory method

Create a SentenceChunker using the async factory method:
import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.sentence.create(options);

Options

chunkSize
number
required
Maximum number of tokens per chunk
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512  // Max 512 tokens per chunk
});
Guidelines:
  • Small chunks (128-256): High granularity, more chunks, higher costs
  • Medium chunks (256-512): Balanced for most use cases
  • Large chunks (512-1024): Less granular, fewer chunks, lower costs
Match your embedding model:
  • OpenAI text-embedding-3-small: 8,191 tokens → use 512-1024
  • OpenAI text-embedding-3-large: 8,191 tokens → use 512-1024
  • Cohere embed-english-v3.0: 512 tokens → use 256-512
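As a sketch, the model-matching guideline above can be automated. The helper and the model limit table below are illustrative assumptions, not part of @exulu/backend; verify the limits against your provider's documentation:

```typescript
// Hypothetical helper: derive a chunkSize from a model's token limit.
// The limits below are assumptions copied from the guidelines above.
const MODEL_TOKEN_LIMITS: Record<string, number> = {
  "text-embedding-3-small": 8191,
  "text-embedding-3-large": 8191,
  "embed-english-v3.0": 512,
};

function recommendedChunkSize(model: string, ratio = 0.7): number {
  const limit = MODEL_TOKEN_LIMITS[model];
  if (limit === undefined) throw new Error(`Unknown model: ${model}`);
  // Clamp to the 128-1024 token range the guidelines above suggest.
  return Math.min(1024, Math.max(128, Math.floor(limit * ratio)));
}

console.log(recommendedChunkSize("embed-english-v3.0")); // 358
```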
chunkOverlap
number
default:0
Number of tokens to overlap between consecutive chunks (default: 0)
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50  // 50 tokens overlap
});
Guidelines:
  • No overlap (0): No redundancy, sharp boundaries
  • Low overlap (10-20): Minimal context preservation
  • Medium overlap (50-100): Good balance for natural language
  • High overlap (100-200): Maximum context, but increases chunk count
Recommended overlap ratios:
  • 10-15% of chunk size for technical docs
  • 15-20% of chunk size for natural language
  • 20-25% of chunk size for narrative content
Overlap must be less than chunkSize. The chunker throws an error if chunkOverlap >= chunkSize.
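The ratio guidelines above can be turned into a small helper. overlapFor below is a hypothetical utility, not a library export; it also enforces the chunkOverlap < chunkSize invariant:

```typescript
// Hypothetical helper: derive chunkOverlap from a target ratio and
// validate the chunkOverlap < chunkSize invariant up front.
function overlapFor(chunkSize: number, ratio: number): number {
  const overlap = Math.floor(chunkSize * ratio);
  if (overlap >= chunkSize) {
    throw new Error(`chunkOverlap (${overlap}) must be less than chunkSize (${chunkSize})`);
  }
  return overlap;
}

console.log(overlapFor(512, 0.15)); // 76
```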
minSentencesPerChunk
number
default:1
Minimum number of sentences per chunk (default: 1)
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  minSentencesPerChunk: 2  // At least 2 sentences per chunk
});
Use cases:
  • 1: Allow single-sentence chunks (default)
  • 2-3: Ensure contextual coherence
  • 3+: For documents where individual sentences lack context
minCharactersPerSentence
number
default:10
Minimum character length for a text segment to be considered a sentence (default: 10)
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  minCharactersPerSentence: 20  // Sentences must be at least 20 chars
});
Use cases:
  • 5-10: Allow short sentences (e.g., “Yes.”, “No.”)
  • 10-20: Filter out fragments (default)
  • 20+: Ensure substantive sentences

Complete example

import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,                // Max 512 tokens per chunk
  chunkOverlap: 75,              // 75 tokens overlap (15% of chunk size)
  minSentencesPerChunk: 2,       // At least 2 sentences per chunk
  minCharactersPerSentence: 15   // Sentences must be at least 15 chars
});

const text = `
  Machine learning is transforming industries. It enables
  computers to learn patterns from data. This technology
  powers recommendation systems, fraud detection, and more.

  Deep learning is a subset of machine learning. It uses
  neural networks with many layers. These networks can
  recognize complex patterns in images, text, and audio.
`;

const chunks = await chunker(text);

console.log(chunks.length);        // Number of chunks
console.log(chunks[0].text);       // First chunk text
console.log(chunks[0].tokenCount); // Token count

RecursiveChunker configuration

Factory method

Create a RecursiveChunker using the async factory method:
import { ExuluChunkers } from "@exulu/backend";

const chunker = await ExuluChunkers.recursive.function.create(options);

Options

chunkSize
number
required
Maximum number of tokens per chunk
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024  // Max 1024 tokens per chunk
});
Same guidelines as SentenceChunker chunkSize.
rules
RecursiveRules
default:"default rules"
Recursive splitting rules defining the hierarchy (default: paragraphs → sentences → pauses → words → tokens)
// Use default rules
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024
  // rules not specified = default rules
});

// Or specify custom rules
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n"] },            // Split by double newline
    { delimiters: [". ", "! ", "? "] },  // Then sentences
    { whitespace: true }                 // Then whitespace
  ]
});

const customChunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  rules: rules
});
Default rules hierarchy:
  1. Paragraphs: ["\n\n", "\r\n", "\n", "\r"]
  2. Sentences: [". ", "! ", "? "]
  3. Pauses: ["{", "}", '"', "[", "]", "<", ">", "(", ")", ":", ";", ",", "—", "|", "~", "-", "...", """, "'"]
  4. Words: whitespace: true
  5. Tokens: No delimiters (fallback to token-level splitting)
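To make the fallback behavior concrete, here is a minimal, self-contained sketch of the recursive idea (an illustration, not the library's implementation): any piece still longer than the budget falls through to the next, finer-grained level.

```typescript
// Minimal sketch of recursive splitting. Each piece longer than maxLen
// falls through to the next level; with no levels left it is returned as-is.
type Level = { delimiters?: string[]; whitespace?: boolean };

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitOnDelims(text: string, delims: string[]): string[] {
  if (delims.length === 0) return [text];
  const pattern = new RegExp(delims.map(escapeRegExp).join("|"), "g");
  return text.split(pattern).filter(Boolean);
}

function recursiveSplit(text: string, levels: Level[], maxLen: number): string[] {
  if (text.length <= maxLen || levels.length === 0) return [text];
  const [level, ...rest] = levels;
  const parts = level.whitespace
    ? text.split(/\s+/).filter(Boolean)
    : splitOnDelims(text, level.delimiters ?? []);
  // Pieces that still exceed maxLen recurse into the next level.
  return parts.flatMap(p => (p.length > maxLen ? recursiveSplit(p, rest, maxLen) : [p]));
}
```

The real chunker measures tokens rather than characters and merges pieces back up to chunkSize, but the level-by-level fallback is the core idea.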
minCharactersPerChunk
number
default:50
Minimum character length for a chunk (default: 50)
const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  minCharactersPerChunk: 100  // Chunks must be at least 100 chars
});
Guidelines:
  • 20-50: Allow smaller chunks
  • 50-100: Default range
  • 100+: Ensure substantive chunks

Complete example

import { ExuluChunkers } from "@exulu/backend";

// Custom rules for markdown documents
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n## ", "\n### "] },  // Split by headers
    { delimiters: ["\n\n"] },              // Then paragraphs
    { delimiters: [". ", "! ", "? "] },   // Then sentences
    { whitespace: true }                   // Then words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,
  rules: rules,
  minCharactersPerChunk: 75
});

const markdown = `
## Introduction

Machine learning enables computers to learn from data.
It powers many modern applications.

## Applications

Recommendation systems use ML to suggest content.
Fraud detection systems identify suspicious activity.

## Future Directions

The field continues to evolve rapidly.
New techniques emerge regularly.
`;

const chunks = await chunker(markdown);

for (const chunk of chunks) {
  console.log(`Level ${chunk.level}: ${chunk.text.slice(0, 50)}...`);
  console.log(`Tokens: ${chunk.tokenCount}`);
}

RecursiveRules configuration

Constructor

Create custom recursive rules:
import { ExuluChunkers } from "@exulu/backend";

const rules = new ExuluChunkers.recursive.rules({
  levels: [...]  // Array of RecursiveLevelData
});

Levels

levels
RecursiveLevelData[]
Array of recursive levels defining the splitting hierarchy
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n\n"] },
    { delimiters: [". "] },
    { whitespace: true }
  ]
});
Each level is a RecursiveLevelData object with:
delimiters
string | string[]
Delimiter(s) to use for splitting at this level
// Single delimiter
{ delimiters: "\n\n" }

// Multiple delimiters
{ delimiters: [". ", "! ", "? "] }
whitespace
boolean
default:false
Whether to split on whitespace at this level (default: false)
// Split on any whitespace character
{ whitespace: true }
Cannot use both delimiters and whitespace in the same level. They are mutually exclusive.
includeDelim
'prev' | 'next'
default:"prev"
Whether to include the delimiter in the previous or next chunk (default: “prev”)
// Delimiter stays with previous chunk
{ delimiters: [". "], includeDelim: "prev" }

// Delimiter moves to next chunk
{ delimiters: ["\n## "], includeDelim: "next" }
Use cases:
  • "prev": For punctuation (sentences keep their periods)
  • "next": For headers (headers stay with their content)

RecursiveLevel examples

const rules = new ExuluChunkers.recursive.rules({
  levels: [
    // Split by headers (keep header with content)
    {
      delimiters: ["\n# ", "\n## ", "\n### "],
      includeDelim: "next"
    },
    // Split by paragraphs
    { delimiters: ["\n\n"] },
    // Split by sentences
    { delimiters: [". ", "! ", "? "] },
    // Split by words
    { whitespace: true }
  ]
});

Configuration patterns

Natural language documents

// Optimized for articles, blog posts, documentation
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 75,              // 15% overlap
  minSentencesPerChunk: 2,       // At least 2 sentences
  minCharactersPerSentence: 15
});
Why this works:
  • 512 tokens fits most embedding models
  • 15% overlap preserves context
  • Minimum 2 sentences ensures coherence
  • 15 char minimum filters fragments

Technical documentation

// Optimized for API docs, guides, tutorials
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\n## ", "\n### "] },  // Headers
    { delimiters: ["```"] },              // Code blocks
    { delimiters: ["\n\n"] },             // Paragraphs
    { delimiters: [". "] },               // Sentences
    { whitespace: true }                   // Words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 1024,              // Larger for code examples
  rules: rules,
  minCharactersPerChunk: 100
});
Why this works:
  • Respects structural boundaries (headers, code)
  • 1024 tokens accommodates code examples
  • 100 char minimum ensures substantive chunks

Long-form content

// Optimized for books, papers, long articles
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 768,               // Larger chunks
  chunkOverlap: 150,            // ~20% overlap
  minSentencesPerChunk: 3,      // More context per chunk
  minCharactersPerSentence: 20
});
Why this works:
  • Larger chunks capture more context
  • Higher overlap maintains narrative flow
  • Minimum 3 sentences ensures coherence
  • 20 char minimum ensures quality sentences

Precise search

// Optimized for precise search results
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 256,               // Smaller chunks
  chunkOverlap: 25,             // ~10% overlap
  minSentencesPerChunk: 1,      // Allow single sentences
  minCharactersPerSentence: 10
});
Why this works:
  • Smaller chunks = more precise results
  • Lower overlap = less redundancy
  • Single sentences allowed for granularity

Code files

// Optimized for source code
const rules = new ExuluChunkers.recursive.rules({
  levels: [
    { delimiters: ["\nclass ", "\nfunction ", "\nconst "] },  // Top-level declarations
    { delimiters: ["{\n", "}\n"] },                           // Blocks
    { delimiters: ["\n"] },                                   // Lines
    { whitespace: true }                                       // Words
  ]
});

const chunker = await ExuluChunkers.recursive.function.create({
  chunkSize: 2048,              // Larger for functions/classes
  rules: rules,
  minCharactersPerChunk: 50
});
Why this works:
  • Respects code structure (functions, classes)
  • Large chunks keep functions/methods together
  • Line-level splitting for smaller units

Tuning recommendations

Start conservative

// Begin with safe defaults
const chunker = await ExuluChunkers.sentence.create({
  chunkSize: 512,
  chunkOverlap: 50,
  minSentencesPerChunk: 1,
  minCharactersPerSentence: 10
});

// Test and adjust based on:
// - Search result quality
// - Chunk count and costs
// - Context preservation

Monitor chunk statistics

const chunks = await chunker(text);

const avgTokens = chunks.reduce((sum, c) => sum + c.tokenCount, 0) / chunks.length;
const maxTokens = Math.max(...chunks.map(c => c.tokenCount));
const minTokens = Math.min(...chunks.map(c => c.tokenCount));

console.log(`Chunks: ${chunks.length}`);
console.log(`Avg tokens: ${avgTokens.toFixed(2)}`);
console.log(`Max tokens: ${maxTokens}`);
console.log(`Min tokens: ${minTokens}`);

// Adjust configuration based on statistics

Test with real data

// Sample representative documents
const sampleDocs = [
  "Short article content...",
  "Medium length blog post...",
  "Very long technical documentation..."
];

// Test chunking
for (const doc of sampleDocs) {
  const chunks = await chunker(doc);
  console.log(`Doc length: ${doc.length} chars`);
  console.log(`Chunks: ${chunks.length}`);
  console.log(`Avg chunk: ${doc.length / chunks.length} chars`);
  console.log("---");
}

// Adjust based on results

Common pitfalls

Overlap >= chunk size: The chunker will throw an error. Ensure chunkOverlap < chunkSize.
Chunk size too small: Very small chunks lose context. Minimum recommended: 128 tokens.
No overlap on narrative content: Natural language benefits from overlap. Use 10-20% overlap for continuity.
Wrong chunker for content type: Use SentenceChunker for natural language, RecursiveChunker for structured content.
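These pitfalls can be caught before creating a chunker. validateConfig below is a hypothetical pre-flight helper, not part of the library:

```typescript
// Hypothetical pre-flight check against the pitfalls listed above.
function validateConfig(cfg: { chunkSize: number; chunkOverlap?: number }): string[] {
  const warnings: string[] = [];
  const overlap = cfg.chunkOverlap ?? 0;
  if (overlap >= cfg.chunkSize) warnings.push("chunkOverlap must be less than chunkSize");
  if (cfg.chunkSize < 128) warnings.push("chunkSize below 128 tokens loses context");
  if (overlap === 0) warnings.push("consider 10-20% overlap for narrative content");
  return warnings;
}

console.log(validateConfig({ chunkSize: 512, chunkOverlap: 50 })); // []
```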

Best practices

Match embedding model: Set chunkSize to 60-80% of your embedding model’s token limit to leave room for metadata.
Use overlap for RAG: Overlap improves retrieval quality in RAG systems by ensuring context isn’t lost at boundaries.
Custom rules for domain content: If your documents have consistent structure (e.g., legal docs, medical records), create custom RecursiveRules to respect that structure.
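To see how overlap affects chunk count (and therefore embedding cost), a rough estimate can be computed. estimateChunkCount is an illustrative assumption, not a library function; real counts also depend on sentence boundaries:

```typescript
// Rough estimate: each chunk after the first contributes
// (chunkSize - chunkOverlap) tokens of new content.
function estimateChunkCount(totalTokens: number, chunkSize: number, chunkOverlap: number): number {
  if (chunkOverlap >= chunkSize) throw new Error("chunkOverlap must be less than chunkSize");
  const stride = chunkSize - chunkOverlap; // new tokens per chunk
  return Math.max(1, Math.ceil((totalTokens - chunkOverlap) / stride));
}

console.log(estimateChunkCount(10_000, 512, 0));  // 20
console.log(estimateChunkCount(10_000, 512, 75)); // 23
```

A 15% overlap inflates the chunk count by roughly 15% here, which is usually a worthwhile trade for better boundary context in RAG.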

Next steps