Constructor parameters
ExuluEval requires specific configuration to define evaluation behavior:

new ExuluEval({
id: string;
name: string;
description: string;
llm: boolean;
execute: (params) => Promise<number>;
config?: { name: string; description: string }[];
queue?: Promise<ExuluQueueConfig>;
})
id: Unique identifier for the evaluation function
name: Human-readable name for the evaluation
description: Description of what this evaluation measures
llm: Whether this evaluation uses an LLM for scoring (LLM-as-judge)
execute: Function that performs the evaluation and returns a score from 0 to 100
config: Optional configuration parameters for the evaluation function
queue: Optional queue configuration for running evaluations as background jobs
Execute function
The execute function receives evaluation parameters and must return a score between 0 and 100:
execute: async ({
agent, // Agent database record
backend, // ExuluAgent instance
messages, // Conversation messages
testCase, // Test case with expected output
config // Optional runtime configuration
}) => {
// Your evaluation logic
return score; // Must be 0-100
}
Parameters
agent: The agent database record being evaluated
interface Agent {
id: string;
name: string;
description: string;
// ... other agent properties
}
backend: ExuluAgent instance for generating responses or using LLM-as-judge
messages: Array of conversation messages, including the inputs and the generated response
interface UIMessage {
role: "user" | "assistant" | "system";
content: string;
toolInvocations?: ToolInvocation[];
}
testCase: Test case containing inputs and expected outputs
interface TestCase {
id: string;
name: string;
description?: string;
inputs: UIMessage[];
expected_output: string;
expected_tools?: string[];
expected_knowledge_sources?: string[];
expected_agent_tools?: string[];
}
config: Runtime configuration values (optional)
Configuration patterns
Basic exact match evaluation
import { ExuluEval } from "@exulu/backend";
const exactMatchEval = new ExuluEval({
id: "exact_match",
name: "Exact Match",
description: "Returns 100 if response exactly matches expected output, 0 otherwise",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
return response === testCase.expected_output ? 100 : 0;
}
});
Partial match with scoring
const partialMatchEval = new ExuluEval({
id: "partial_match",
name: "Partial Match",
description: "Scores based on how much of expected output appears in response",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const expected = testCase.expected_output.toLowerCase();
// Split into words
const expectedWords = expected.split(/\s+/);
const matchedWords = expectedWords.filter(word =>
response.includes(word)
);
return (matchedWords.length / expectedWords.length) * 100;
}
});
Keyword presence evaluation
const keywordEval = new ExuluEval({
id: "keyword_presence",
name: "Keyword Presence",
description: "Checks if response contains required keywords",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const keywords = config?.keywords || [];
if (keywords.length === 0) return 100;
const foundKeywords = keywords.filter(kw =>
response.includes(kw.toLowerCase())
);
return (foundKeywords.length / keywords.length) * 100;
},
config: [
{
name: "keywords",
description: "Array of keywords that should appear in response"
}
]
});
// Run with config
const score = await keywordEval.run(
agent,
backend,
testCase,
messages,
{ keywords: ["weather", "temperature", "San Francisco"] }
);
LLM-as-judge evaluation
const llmJudgeEval = new ExuluEval({
id: "llm_judge",
name: "LLM Judge",
description: "Uses an LLM to evaluate response quality",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const judgePrompt = `
You are an expert evaluator. Rate the following response on a scale of 0-100.
Test Case: ${testCase.name}
Description: ${testCase.description || "N/A"}
Expected Output:
${testCase.expected_output}
Actual Response:
${response}
Criteria:
1. Accuracy: Does it match the expected output?
2. Completeness: Does it address all required aspects?
3. Clarity: Is it well-structured and understandable?
4. Relevance: Does it stay on topic?
Respond with ONLY a number from 0 to 100. No explanation.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "llm_judge" }
});
const score = parseInt(result.text.trim());
if (isNaN(score)) {
console.warn(`LLM judge returned non-numeric: ${result.text}`);
return 0;
}
return Math.max(0, Math.min(100, score));
},
config: [
{
name: "judgeAgentId",
description: "Agent ID to use for evaluation (must support text generation)"
}
]
});
Tool usage evaluation
const toolUsageEval = new ExuluEval({
id: "tool_usage",
name: "Tool Usage",
description: "Checks if agent used expected tools",
llm: false,
execute: async ({ messages, testCase }) => {
// Extract tool calls from conversation
const toolCalls = messages
.flatMap(msg => msg.toolInvocations || [])
.map(inv => inv.toolName);
const expectedTools = testCase.expected_tools || [];
// If no tools expected, check that no tools were used
if (expectedTools.length === 0) {
return toolCalls.length === 0 ? 100 : 0;
}
// Check if all expected tools were used
const usedExpected = expectedTools.filter(tool =>
toolCalls.includes(tool)
);
return (usedExpected.length / expectedTools.length) * 100;
}
});
Regex pattern matching
const regexMatchEval = new ExuluEval({
id: "regex_match",
name: "Regex Pattern Match",
description: "Checks if response matches regex pattern",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const pattern = config?.pattern;
if (!pattern) {
throw new Error("Regex pattern required in config");
}
const regex = new RegExp(pattern, config?.flags || "");
return regex.test(response) ? 100 : 0;
},
config: [
{
name: "pattern",
description: "Regex pattern to match"
},
{
name: "flags",
description: "Regex flags (e.g., 'i' for case-insensitive)"
}
]
});
// Run with regex config
const score = await regexMatchEval.run(
agent,
backend,
testCase,
messages,
{
pattern: "\\d{2}°[FC]", // Matches temperature like "68°F"
flags: "i"
}
);
Length-based evaluation
const lengthEval = new ExuluEval({
id: "response_length",
name: "Response Length",
description: "Scores based on response length within acceptable range",
llm: false,
execute: async ({ messages, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const length = response.length;
const minLength = config?.minLength || 0;
const maxLength = config?.maxLength || Infinity;
const targetLength = config?.targetLength;
// If within range, score based on proximity to target
if (length < minLength) {
return Math.max(0, (length / minLength) * 100);
}
if (length > maxLength) {
return Math.max(0, 100 - ((length - maxLength) / maxLength) * 100);
}
// Within range
if (targetLength) {
const deviation = Math.abs(length - targetLength);
const maxDeviation = Math.max(
targetLength - minLength,
maxLength - targetLength
);
return Math.max(0, 100 - (deviation / maxDeviation) * 50);
}
return 100;
},
config: [
{
name: "minLength",
description: "Minimum acceptable character count"
},
{
name: "maxLength",
description: "Maximum acceptable character count"
},
{
name: "targetLength",
description: "Ideal character count (optional)"
}
]
});
Composite evaluation
Combine multiple evaluation criteria:

const compositeEval = new ExuluEval({
id: "composite",
name: "Composite Evaluation",
description: "Combines multiple evaluation criteria with weights",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
let totalScore = 0;
let totalWeight = 0;
// Criterion 1: Contains expected output (weight: 50%)
const containsExpected = response.includes(testCase.expected_output);
totalScore += containsExpected ? 50 : 0;
totalWeight += 50;
// Criterion 2: Reasonable length (weight: 20%)
const isReasonableLength = response.length >= 50 && response.length <= 500;
totalScore += isReasonableLength ? 20 : 0;
totalWeight += 20;
// Criterion 3: Uses tools if expected (weight: 30%)
const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
const expectedTools = testCase.expected_tools || [];
if (expectedTools.length > 0) {
const toolsUsed = expectedTools.every(tool =>
toolCalls.some(call => call.toolName === tool)
);
totalScore += toolsUsed ? 30 : 0;
totalWeight += 30;
} else {
totalScore += 30; // No tools expected, full points
totalWeight += 30;
}
return (totalScore / totalWeight) * 100;
}
});
Queue configuration
Run evaluations as background jobs using ExuluQueues:

import { ExuluEval, ExuluQueues } from "@exulu/backend";
const backgroundEval = new ExuluEval({
id: "background_eval",
name: "Background Evaluation",
description: "Runs as queued job",
llm: true,
execute: async ({ backend, messages, testCase }) => {
// Long-running evaluation logic
return 85;
},
// Wrap in an async IIFE so `await` is valid even without top-level await
queue: (async () => ({
connection: await ExuluQueues.getConnection(),
name: "evaluations",
prefix: "{exulu}",
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential",
delay: 2000
},
removeOnComplete: true,
removeOnFail: false
}
}))()
});
Advanced patterns
Semantic similarity evaluation
Use embeddings to measure semantic similarity:

import { ExuluEval, ExuluEmbedder, ExuluVariables } from "@exulu/backend";
const semanticSimilarityEval = new ExuluEval({
id: "semantic_similarity",
name: "Semantic Similarity",
description: "Measures semantic similarity using embeddings",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const embedder = new ExuluEmbedder({
id: "eval_embedder",
name: "Evaluation Embedder",
provider: "openai",
model: "text-embedding-3-small",
vectorDimensions: 1536,
authenticationInformation: await ExuluVariables.get("openai_api_key")
});
const [responseEmb, expectedEmb] = await embedder.generate([
response,
testCase.expected_output
]);
// Cosine similarity
const similarity = cosineSimilarity(responseEmb, expectedEmb);
// Scale to 0-100
return similarity * 100;
}
});
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
Multi-aspect LLM judge
Evaluate multiple aspects separately:

const multiAspectJudgeEval = new ExuluEval({
id: "multi_aspect_judge",
name: "Multi-Aspect LLM Judge",
description: "Evaluates multiple aspects with separate LLM calls",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const aspects = [
{
name: "accuracy",
weight: 40,
prompt: "Rate the accuracy of this response (0-100):"
},
{
name: "clarity",
weight: 30,
prompt: "Rate the clarity and readability (0-100):"
},
{
name: "completeness",
weight: 30,
prompt: "Rate how complete the response is (0-100):"
}
];
let totalScore = 0;
let totalWeight = 0;
for (const aspect of aspects) {
const judgePrompt = `
${aspect.prompt}
Expected: ${testCase.expected_output}
Actual: ${response}
Respond with ONLY a number 0-100.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "multi_aspect_judge" }
});
const score = parseInt(result.text.trim());
if (!isNaN(score)) {
totalScore += Math.max(0, Math.min(100, score)) * aspect.weight;
totalWeight += aspect.weight;
}
}
return totalWeight > 0 ? totalScore / totalWeight : 0;
},
config: [
{
name: "judgeAgentId",
description: "Agent ID for LLM judge"
}
]
});
A/B testing evaluation
Compare two agent configurations:

async function compareAgents(
agentA: Agent,
agentB: Agent,
backendA: ExuluAgent,
backendB: ExuluAgent,
testCases: TestCase[],
evals: ExuluEval[]
) {
const resultsA = [];
const resultsB = [];
for (const testCase of testCases) {
// Generate response from Agent A
const responseA = await backendA.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentA.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesA = [
...testCase.inputs,
{ role: "assistant", content: responseA.text }
];
// Generate response from Agent B
const responseB = await backendB.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentB.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesB = [
...testCase.inputs,
{ role: "assistant", content: responseB.text }
];
// Run evaluations on both
// "eval" is a reserved identifier in strict mode, so use a different name
for (const evaluator of evals) {
const scoreA = await evaluator.run(agentA, backendA, testCase, messagesA);
const scoreB = await evaluator.run(agentB, backendB, testCase, messagesB);
resultsA.push({ testCase: testCase.name, eval: evaluator.name, score: scoreA });
resultsB.push({ testCase: testCase.name, eval: evaluator.name, score: scoreB });
}
}
// Calculate averages
const avgA = resultsA.reduce((sum, r) => sum + r.score, 0) / resultsA.length;
const avgB = resultsB.reduce((sum, r) => sum + r.score, 0) / resultsB.length;
return {
agentA: { results: resultsA, average: avgA },
agentB: { results: resultsB, average: avgB },
winner: avgA > avgB ? "Agent A" : "Agent B"
};
}
Best practices
Weighted scoring: For composite evaluations, use weighted scoring to prioritize important criteria.
Error handling: Always handle errors in execute functions and return 0 or throw descriptive errors.
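One way to apply this is a small wrapper, sketched below as a hypothetical helper (not part of the ExuluEval API), that converts thrown errors into a score of 0 and clamps the result into the required range:

```typescript
type ExecuteFn<P> = (params: P) => Promise<number>;

// Wraps an execute-style function: logs any thrown error and returns 0
// instead of failing the evaluation run, and clamps the score to 0-100.
function withErrorHandling<P>(execute: ExecuteFn<P>): ExecuteFn<P> {
  return async (params: P) => {
    try {
      const score = await execute(params);
      return Math.max(0, Math.min(100, score));
    } catch (err) {
      console.warn(
        `Evaluation failed: ${err instanceof Error ? err.message : String(err)}`
      );
      return 0;
    }
  };
}
```

You would then pass `withErrorHandling(yourExecute)` as the `execute` option when constructing an evaluation.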
LLM judge reliability: LLM judges can be inconsistent. Run them multiple times and aggregate the scores, or use temperature 0 to reduce (though not eliminate) variance.
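One way to aggregate repeated judge runs is to take the median, which is robust to a single outlier score. A minimal sketch, where `judge` is a stand-in for one scoring call (for example, a generateSync round trip) and is not an Exulu API:

```typescript
// Scores the judge `runs` times and returns the median of the results.
async function medianJudgeScore(
  judge: () => Promise<number>,
  runs = 3
): Promise<number> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge());
  }
  scores.sort((a, b) => a - b);
  const mid = Math.floor(scores.length / 2);
  // Odd count: middle element; even count: average of the two middle elements
  return scores.length % 2 === 1
    ? scores[mid]
    : (scores[mid - 1] + scores[mid]) / 2;
}
```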
Config validation: Validate config parameters at the start of execute functions to provide clear error messages.
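A small helper for this, again hypothetical rather than part of the library, that fails fast with a descriptive message naming the missing keys:

```typescript
// Throws a descriptive error if `config` is missing or lacks required keys,
// otherwise returns it for destructuring.
function requireConfig(
  config: Record<string, unknown> | undefined,
  keys: string[]
): Record<string, unknown> {
  if (!config) {
    throw new Error(`Missing config; expected keys: ${keys.join(", ")}`);
  }
  const missing = keys.filter(k => config[k] === undefined);
  if (missing.length > 0) {
    throw new Error(`Missing config keys: ${missing.join(", ")}`);
  }
  return config;
}
```

At the top of an execute function this reads, for example, `const { pattern } = requireConfig(config, ["pattern"]);`.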
Next steps
API reference
Explore all methods and properties
Overview
Learn about evaluation concepts
ExuluQueues
Run evaluations as background jobs
ExuluAgent
Create agents to evaluate