Constructor parameters
ExuluEval requires specific configuration to define evaluation behavior:
new ExuluEval({
id: string;
name: string;
description: string;
llm: boolean;
execute: (params) => Promise<number>;
config?: { name: string; description: string }[];
queue?: Promise<ExuluQueueConfig>;
})
id: Unique identifier for the evaluation function
name: Human-readable name for the evaluation
description: Description of what this evaluation measures
llm: Whether this evaluation uses an LLM for scoring (LLM-as-judge)
execute: Function that performs the evaluation and returns a score from 0 to 100
config: Optional configuration parameters for the evaluation function
queue: Optional queue configuration for running evaluations as background jobs
Execute function
The execute function receives evaluation parameters and must return a score between 0 and 100:
execute: async ({
agent, // Agent database record
backend, // ExuluAgent instance
messages, // Conversation messages
testCase, // Test case with expected output
config // Optional runtime configuration
}) => {
// Your evaluation logic
return score; // Must be 0-100
}
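Scores outside the 0-100 range should be clamped before returning. As a sketch (assuming the framework stores whatever execute returns without clamping), a small helper keeps results in bounds; clampScore is a hypothetical name, not part of the ExuluEval API:

```typescript
// Hypothetical helper (not part of @exulu/backend): clamp a raw score to 0-100.
function clampScore(raw: number): number {
  if (Number.isNaN(raw)) return 0; // treat NaN as a failed evaluation
  return Math.max(0, Math.min(100, raw));
}

console.log(clampScore(150)); // 100
console.log(clampScore(-5));  // 0
```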
Parameters
agent: The agent database record being evaluated
interface Agent {
id: string;
name: string;
description: string;
// ... other agent properties
}
backend: ExuluAgent instance for generating responses or using LLM-as-judge
messages: Array of conversation messages including inputs and the generated response
interface UIMessage {
role: "user" | "assistant" | "system";
content: string;
toolInvocations?: ToolInvocation[];
}
testCase: Test case containing inputs and expected outputs
interface TestCase {
id: string;
name: string;
description?: string;
inputs: UIMessage[];
expected_output: string;
expected_tools?: string[];
expected_knowledge_sources?: string[];
expected_agent_tools?: string[];
}
config: Runtime configuration values (optional)
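For reference, a test case literal matching the TestCase interface above might look like this (all field values are illustrative, not taken from a real test suite):

```typescript
// Illustrative test case matching the TestCase interface (values are made up).
const sampleTestCase = {
  id: "tc_weather_sf",
  name: "SF weather query",
  description: "Checks that the agent reports SF weather using the right tool",
  inputs: [{ role: "user" as const, content: "What's the weather in San Francisco?" }],
  expected_output: "68°F and sunny",
  expected_tools: ["get_weather"]
};

console.log(sampleTestCase.expected_tools.length); // 1
```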
Configuration patterns
Basic exact match evaluation
import { ExuluEval } from "@exulu/backend";
const exactMatchEval = new ExuluEval({
id: "exact_match",
name: "Exact Match",
description: "Returns 100 if response exactly matches expected output, 0 otherwise",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
return response === testCase.expected_output ? 100 : 0;
}
});
Partial match with scoring
const partialMatchEval = new ExuluEval({
id: "partial_match",
name: "Partial Match",
description: "Scores based on how much of expected output appears in response",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const expected = testCase.expected_output.toLowerCase();
// Split into words
const expectedWords = expected.split(/\s+/);
const matchedWords = expectedWords.filter(word =>
response.includes(word)
);
return (matchedWords.length / expectedWords.length) * 100;
}
});
Keyword presence evaluation
const keywordEval = new ExuluEval({
id: "keyword_presence",
name: "Keyword Presence",
description: "Checks if response contains required keywords",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const keywords = config?.keywords || [];
if (keywords.length === 0) return 100;
const foundKeywords = keywords.filter(kw =>
response.includes(kw.toLowerCase())
);
return (foundKeywords.length / keywords.length) * 100;
},
config: [
{
name: "keywords",
description: "Array of keywords that should appear in response"
}
]
});
// Run with config
const score = await keywordEval.run(
agent,
backend,
testCase,
messages,
{ keywords: ["weather", "temperature", "San Francisco"] }
);
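The scoring logic above can also be exercised standalone. This sketch mirrors the execute body without the ExuluEval wrapper (keywordScore is a hypothetical helper introduced here for illustration):

```typescript
// Standalone version of the keyword-coverage scoring used in keywordEval.
function keywordScore(response: string, keywords: string[]): number {
  if (keywords.length === 0) return 100; // nothing required, full score
  const lower = response.toLowerCase();
  const found = keywords.filter(kw => lower.includes(kw.toLowerCase()));
  return (found.length / keywords.length) * 100;
}

console.log(keywordScore("Sunny weather in San Francisco", ["weather", "San Francisco"])); // 100
console.log(keywordScore("Sunny weather", ["weather", "temperature"])); // 50
```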
LLM-as-judge evaluation
const llmJudgeEval = new ExuluEval({
id: "llm_judge",
name: "LLM Judge",
description: "Uses an LLM to evaluate response quality",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const judgePrompt = `
You are an expert evaluator. Rate the following response on a scale of 0-100.
Test Case: ${testCase.name}
Description: ${testCase.description || "N/A"}
Expected Output:
${testCase.expected_output}
Actual Response:
${response}
Criteria:
1. Accuracy: Does it match the expected output?
2. Completeness: Does it address all required aspects?
3. Clarity: Is it well-structured and understandable?
4. Relevance: Does it stay on topic?
Respond with ONLY a number from 0 to 100. No explanation.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "llm_judge" }
});
const score = parseInt(result.text.trim());
if (isNaN(score)) {
console.warn(`LLM judge returned non-numeric: ${result.text}`);
return 0;
}
return Math.max(0, Math.min(100, score));
},
config: [
{
name: "judgeAgentId",
description: "Agent ID to use for evaluation (must support text generation)"
}
]
});
Tool usage evaluation
const toolUsageEval = new ExuluEval({
id: "tool_usage",
name: "Tool Usage",
description: "Checks if agent used expected tools",
llm: false,
execute: async ({ messages, testCase }) => {
// Extract tool calls from conversation
const toolCalls = messages
.flatMap(msg => msg.toolInvocations || [])
.map(inv => inv.toolName);
const expectedTools = testCase.expected_tools || [];
// If no tools expected, check that no tools were used
if (expectedTools.length === 0) {
return toolCalls.length === 0 ? 100 : 0;
}
// Check if all expected tools were used
const usedExpected = expectedTools.filter(tool =>
toolCalls.includes(tool)
);
return (usedExpected.length / expectedTools.length) * 100;
}
});
Regex pattern matching
const regexMatchEval = new ExuluEval({
id: "regex_match",
name: "Regex Pattern Match",
description: "Checks if response matches regex pattern",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const pattern = config?.pattern;
if (!pattern) {
throw new Error("Regex pattern required in config");
}
const regex = new RegExp(pattern, config?.flags || "");
return regex.test(response) ? 100 : 0;
},
config: [
{
name: "pattern",
description: "Regex pattern to match"
},
{
name: "flags",
description: "Regex flags (e.g., 'i' for case-insensitive)"
}
]
});
// Run with regex config
const score = await regexMatchEval.run(
agent,
backend,
testCase,
messages,
{
pattern: "\\d{2}°[FC]", // Matches temperature like "68°F"
flags: "i"
}
);
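The temperature pattern passed above can be sanity-checked on its own before wiring it into an evaluation:

```typescript
// Sanity check of the temperature pattern used with regexMatchEval above.
const tempPattern = new RegExp("\\d{2}°[FC]", "i");

console.log(tempPattern.test("It is currently 68°F in San Francisco")); // true
console.log(tempPattern.test("It is warm outside")); // false
```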
Length-based evaluation
const lengthEval = new ExuluEval({
id: "response_length",
name: "Response Length",
description: "Scores based on response length within acceptable range",
llm: false,
execute: async ({ messages, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const length = response.length;
const minLength = config?.minLength || 0;
const maxLength = config?.maxLength || Infinity;
const targetLength = config?.targetLength;
// If within range, score based on proximity to target
if (length < minLength) {
return Math.max(0, (length / minLength) * 100);
}
if (length > maxLength) {
return Math.max(0, 100 - ((length - maxLength) / maxLength) * 100);
}
// Within range
if (targetLength) {
const deviation = Math.abs(length - targetLength);
const maxDeviation = Math.max(
targetLength - minLength,
maxLength - targetLength
);
return Math.max(0, 100 - (deviation / maxDeviation) * 50);
}
return 100;
},
config: [
{
name: "minLength",
description: "Minimum acceptable character count"
},
{
name: "maxLength",
description: "Maximum acceptable character count"
},
{
name: "targetLength",
description: "Ideal character count (optional)"
}
]
});
Composite evaluation
Combine multiple evaluation criteria:
const compositeEval = new ExuluEval({
id: "composite",
name: "Composite Evaluation",
description: "Combines multiple evaluation criteria with weights",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
let totalScore = 0;
let totalWeight = 0;
// Criterion 1: Contains expected output (weight: 50%)
const containsExpected = response.includes(testCase.expected_output);
totalScore += containsExpected ? 50 : 0;
totalWeight += 50;
// Criterion 2: Reasonable length (weight: 20%)
const isReasonableLength = response.length >= 50 && response.length <= 500;
totalScore += isReasonableLength ? 20 : 0;
totalWeight += 20;
// Criterion 3: Uses tools if expected (weight: 30%)
const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
const expectedTools = testCase.expected_tools || [];
if (expectedTools.length > 0) {
const toolsUsed = expectedTools.every(tool =>
toolCalls.some(call => call.toolName === tool)
);
totalScore += toolsUsed ? 30 : 0;
totalWeight += 30;
} else {
totalScore += 30; // No tools expected, full points
totalWeight += 30;
}
return (totalScore / totalWeight) * 100;
}
});
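The composite pattern above accumulates weighted points inline. An equivalent, reusable formulation (a sketch, not part of the ExuluEval API) takes per-criterion 0-100 scores and weights:

```typescript
// Generic weighted average over 0-100 sub-scores (weights need not sum to 100).
function weightedAverage(parts: { score: number; weight: number }[]): number {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  if (totalWeight === 0) return 0;
  return parts.reduce((sum, p) => sum + p.score * p.weight, 0) / totalWeight;
}

console.log(weightedAverage([
  { score: 100, weight: 50 }, // contains expected output
  { score: 0, weight: 20 },   // length out of range
  { score: 100, weight: 30 }  // tools used as expected
])); // 80
```

Keeping each criterion as a 0-100 score makes individual criteria easy to report alongside the combined result.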
Queue configuration
Run evaluations as background jobs using ExuluQueues:
import { ExuluEval, ExuluQueues } from "@exulu/backend";
const backgroundEval = new ExuluEval({
id: "background_eval",
name: "Background Evaluation",
description: "Runs as queued job",
llm: true,
execute: async ({ backend, messages, testCase }) => {
// Long-running evaluation logic
return 85;
},
queue: (async () => ({
connection: await ExuluQueues.getConnection(),
name: "evaluations",
prefix: "{exulu}",
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential",
delay: 2000
},
removeOnComplete: true,
removeOnFail: false
}
}))()
});
Advanced patterns
Semantic similarity evaluation
Use embeddings to measure semantic similarity:
import { ExuluEval, ExuluEmbedder, ExuluVariables } from "@exulu/backend";
const semanticSimilarityEval = new ExuluEval({
id: "semantic_similarity",
name: "Semantic Similarity",
description: "Measures semantic similarity using embeddings",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const embedder = new ExuluEmbedder({
id: "eval_embedder",
name: "Evaluation Embedder",
provider: "openai",
model: "text-embedding-3-small",
vectorDimensions: 1536,
authenticationInformation: await ExuluVariables.get("openai_api_key")
});
const [responseEmb, expectedEmb] = await embedder.generate([
response,
testCase.expected_output
]);
// Cosine similarity ranges from -1 to 1; clamp negatives and scale to 0-100
const similarity = cosineSimilarity(responseEmb, expectedEmb);
return Math.max(0, similarity) * 100;
}
});
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
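A quick, self-contained sanity check of the cosine-similarity math (duplicating the helper so the snippet runs standalone): parallel vectors score 1, orthogonal vectors score 0.

```typescript
// Self-contained copy of cosineSimilarity for a quick sanity check.
function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

console.log(cosineSim([1, 2, 3], [2, 4, 6])); // ≈ 1 (parallel vectors)
console.log(cosineSim([1, 0], [0, 1]));       // 0 (orthogonal vectors)
```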
Multi-aspect LLM judge
Evaluate multiple aspects separately:
const multiAspectJudgeEval = new ExuluEval({
id: "multi_aspect_judge",
name: "Multi-Aspect LLM Judge",
description: "Evaluates multiple aspects with separate LLM calls",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const aspects = [
{
name: "accuracy",
weight: 40,
prompt: "Rate the accuracy of this response (0-100):"
},
{
name: "clarity",
weight: 30,
prompt: "Rate the clarity and readability (0-100):"
},
{
name: "completeness",
weight: 30,
prompt: "Rate how complete the response is (0-100):"
}
];
let totalScore = 0;
let totalWeight = 0;
for (const aspect of aspects) {
const judgePrompt = `
${aspect.prompt}
Expected: ${testCase.expected_output}
Actual: ${response}
Respond with ONLY a number 0-100.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "multi_aspect_judge" }
});
const score = parseInt(result.text.trim());
if (!isNaN(score)) {
totalScore += Math.max(0, Math.min(100, score)) * aspect.weight;
totalWeight += aspect.weight;
}
}
return totalWeight > 0 ? totalScore / totalWeight : 0;
},
config: [
{
name: "judgeAgentId",
description: "Agent ID for LLM judge"
}
]
});
A/B testing evaluation
Compare two agent configurations:
async function compareAgents(
agentA: Agent,
agentB: Agent,
backendA: ExuluAgent,
backendB: ExuluAgent,
testCases: TestCase[],
evals: ExuluEval[]
) {
const resultsA = [];
const resultsB = [];
for (const testCase of testCases) {
// Generate response from Agent A
const responseA = await backendA.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentA.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesA = [
...testCase.inputs,
{ role: "assistant", content: responseA.text }
];
// Generate response from Agent B
const responseB = await backendB.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentB.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesB = [
...testCase.inputs,
{ role: "assistant", content: responseB.text }
];
// Run evaluations on both
for (const evaluation of evals) {
const scoreA = await evaluation.run(agentA, backendA, testCase, messagesA);
const scoreB = await evaluation.run(agentB, backendB, testCase, messagesB);
resultsA.push({ testCase: testCase.name, eval: evaluation.name, score: scoreA });
resultsB.push({ testCase: testCase.name, eval: evaluation.name, score: scoreB });
}
}
// Calculate averages
const avgA = resultsA.reduce((sum, r) => sum + r.score, 0) / resultsA.length;
const avgB = resultsB.reduce((sum, r) => sum + r.score, 0) / resultsB.length;
return {
agentA: { results: resultsA, average: avgA },
agentB: { results: resultsB, average: avgB },
winner: avgA > avgB ? "Agent A" : avgB > avgA ? "Agent B" : "Tie"
};
}
Best practices
Weighted scoring: For composite evaluations, use weighted scoring to prioritize important criteria.
Error handling: Always handle errors in execute functions and return 0 or throw descriptive errors.
LLM judge reliability: LLM judges can be inconsistent. Run multiple times or use temperature=0 for deterministic results.
Config validation: Validate config parameters at the start of execute functions to provide clear error messages.
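The config-validation advice can be sketched as a guard at the top of an execute function. This is an illustrative pattern, and validateKeywords is a hypothetical helper, not part of @exulu/backend:

```typescript
// Hypothetical guard: fail fast with a clear message before scoring anything.
function validateKeywords(config: { keywords?: unknown } | undefined): string[] {
  const keywords = config?.keywords;
  if (keywords === undefined) {
    throw new Error("keyword_presence: config.keywords is required");
  }
  if (!Array.isArray(keywords) || !keywords.every(k => typeof k === "string")) {
    throw new Error("keyword_presence: config.keywords must be an array of strings");
  }
  return keywords;
}

console.log(validateKeywords({ keywords: ["weather"] })); // [ 'weather' ]
```

Throwing a descriptive error here surfaces misconfiguration immediately, instead of silently returning a misleading score.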