Overview
ExuluEval is a class for creating custom evaluation functions that measure and score agent performance against test cases. Evaluations allow you to systematically test your agents, track quality over time, and identify areas for improvement.
What is ExuluEval?
ExuluEval provides a framework for defining evaluation logic that:
Scores agent responses: Returns a score from 0-100 based on custom criteria
Runs against test cases: Evaluates agent behavior using structured test inputs
Supports any evaluation method: Custom logic, LLM-as-judge, regex matching, or any scoring approach
Integrates with queues: Can be run as background jobs using ExuluQueues
Enables A/B testing: Compare different agent configurations, prompts, or models
Custom scoring logic: Write any evaluation function in TypeScript
Test cases: Structured inputs with expected outputs
LLM-as-judge: Use LLMs to evaluate response quality
Queue integration: Run evaluations as background jobs
Why use evaluations?
Evaluations help you:
Quantify agent performance with consistent scoring criteria across all responses
Catch performance degradation when updating prompts, models, or tools
A/B test different agent setups to find the best performing configuration
Monitor evaluation scores over time to verify that changes improve quality
Build CI/CD pipelines that fail if evaluation scores drop below thresholds
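As a sketch of the CI/CD idea: collect scores from your evaluation runs and fail the build when any fall below a threshold. The result shape and helper name below are illustrative, not part of the ExuluEval API.

```typescript
// Hypothetical CI gate: throw (failing the pipeline) if any evaluation
// score drops below a chosen threshold. `results` would come from your
// own evaluation run.
interface EvalResult {
  evalId: string;
  testCaseId: string;
  score: number; // 0-100, as returned by ExuluEval.run
}

function assertScoresAboveThreshold(results: EvalResult[], threshold: number): void {
  const failures = results.filter((r) => r.score < threshold);
  if (failures.length > 0) {
    const details = failures
      .map((f) => `${f.evalId}/${f.testCaseId}: ${f.score}`)
      .join(", ");
    throw new Error(`Evaluation scores below ${threshold}: ${details}`);
  }
}

// Passes silently when all scores meet the threshold
assertScoresAboveThreshold(
  [
    { evalId: "exact_match", testCaseId: "tc_001", score: 100 },
    { evalId: "llm_judge_quality", testCaseId: "tc_001", score: 85 },
  ],
  80
);
```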
Quick start
import { ExuluEval } from "@exulu/backend";

// Create an evaluation function
const exactMatchEval = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "Checks if response exactly matches expected output",
  llm: false, // Not using LLM-as-judge
  execute: async ({ messages, testCase }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  }
});

// Run against a test case
const score = await exactMatchEval.run(
  agent,    // Agent database record
  backend,  // ExuluAgent instance
  testCase, // Test case with inputs and expected output
  messages  // Conversation messages
);

console.log(`Score: ${score}/100`);
Evaluation types
Custom logic evaluations
Write any scoring logic in TypeScript:
const containsKeywordEval = new ExuluEval({
  id: "contains_keyword",
  name: "Contains Keyword",
  description: "Checks if response contains required keywords",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content?.toLowerCase() || "";
    const keywords = config?.keywords || [];
    if (keywords.length === 0) return 100; // Nothing required; avoids dividing by zero
    const foundKeywords = keywords.filter(kw => response.includes(kw.toLowerCase()));
    return (foundKeywords.length / keywords.length) * 100;
  },
  config: [
    {
      name: "keywords",
      description: "List of keywords that should appear in the response"
    }
  ]
});
LLM-as-judge evaluations
Use an LLM to evaluate response quality:
const llmJudgeEval = new ExuluEval({
  id: "llm_judge_quality",
  name: "LLM Judge - Quality",
  description: "Uses an LLM to evaluate response quality",
  llm: true, // Using LLM-as-judge
  execute: async ({ backend, messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const prompt = `
You are an expert evaluator. Rate the quality of this response on a scale of 0-100.

Test Case: ${testCase.name}
Expected: ${testCase.expected_output}
Actual Response: ${response}

Consider:
- Accuracy: Does it match the expected output?
- Completeness: Does it address all aspects?
- Clarity: Is it well-structured and clear?

Respond with ONLY a number from 0-100.
`;

    // loadAgent is your own helper for resolving an agent instance by ID
    const result = await backend.generateSync({
      prompt,
      agentInstance: await loadAgent(config?.judgeAgentId),
      statistics: { label: "eval", trigger: "system" }
    });

    const score = parseInt(result.text, 10);
    return isNaN(score) ? 0 : Math.max(0, Math.min(100, score));
  },
  config: [
    {
      name: "judgeAgentId",
      description: "Agent ID to use as judge"
    }
  ]
});
Tool usage evaluations
Check whether the agent called the expected tools:

const toolUsageEval = new ExuluEval({
  id: "tool_usage",
  name: "Tool Usage",
  description: "Checks if agent used expected tools",
  llm: false,
  execute: async ({ messages, testCase }) => {
    // Extract tool calls from messages
    const toolCalls = messages
      .flatMap(msg => msg.toolInvocations || [])
      .map(inv => inv.toolName);

    const expectedTools = testCase.expected_tools || [];
    if (expectedTools.length === 0) return 100;

    const usedExpected = expectedTools.filter(tool => toolCalls.includes(tool));
    return (usedExpected.length / expectedTools.length) * 100;
  }
});
Similarity evaluations
Use embeddings to measure semantic similarity:
import { ExuluEval, ExuluEmbedder, ExuluVariables } from "@exulu/backend";

const similarityEval = new ExuluEval({
  id: "semantic_similarity",
  name: "Semantic Similarity",
  description: "Measures semantic similarity between response and expected output",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const embedder = new ExuluEmbedder({
      id: "eval_embedder",
      name: "Eval Embedder",
      provider: "openai",
      model: "text-embedding-3-small",
      vectorDimensions: 1536,
      authenticationInformation: await ExuluVariables.get("openai_api_key")
    });

    const [responseEmb, expectedEmb] = await embedder.generate([
      response,
      testCase.expected_output
    ]);

    // Cosine similarity is in [-1, 1]; clamp at 0 so the score stays in 0-100
    const similarity = cosineSimilarity(responseEmb, expectedEmb);
    return Math.max(0, similarity * 100);
  }
});

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
Test cases
Test cases define the inputs and expected outputs for evaluations:
interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];                   // Input conversation
  expected_output: string;               // Expected response
  expected_tools?: string[];             // Expected tool calls
  expected_knowledge_sources?: string[]; // Expected contexts used
  expected_agent_tools?: string[];       // Expected agent tools
  createdAt: string;
  updatedAt: string;
}
Example test case:
const testCase: TestCase = {
  id: "tc_001",
  name: "Weather query",
  description: "User asks about weather",
  inputs: [
    {
      role: "user",
      content: "What's the weather like in San Francisco?"
    }
  ],
  expected_output: "Based on current data, it's 68°F and sunny in San Francisco.",
  expected_tools: ["get_weather"],
  expected_knowledge_sources: [],
  expected_agent_tools: [],
  createdAt: "2025-01-15T10:00:00Z",
  updatedAt: "2025-01-15T10:00:00Z"
};
Running evaluations
Basic evaluation run
import { ExuluEval, ExuluAgent } from "@exulu/backend";

// Note: "eval" is a reserved word in strict mode, so use another name
const myEval = new ExuluEval({
  id: "my_eval",
  name: "My Evaluation",
  description: "Custom evaluation",
  llm: false,
  execute: async ({ messages, testCase }) => {
    // Your scoring logic
    return 85; // Score from 0-100
  }
});

// Run evaluation
const score = await myEval.run(
  agent,    // Agent DB record
  backend,  // ExuluAgent instance
  testCase, // TestCase
  messages, // UIMessage[]
  config    // Optional config
);

console.log(`Score: ${score}/100`);
Batch evaluation
Run multiple evaluations on a test suite:
async function runEvaluations(
  agent: Agent,
  backend: ExuluAgent,
  testCases: TestCase[],
  evals: ExuluEval[]
) {
  const results = [];

  for (const testCase of testCases) {
    // Generate a response to the last user message in the test case
    const response = await backend.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agent.id), // loadAgent is your own helper
      statistics: { label: "eval", trigger: "test" }
    });

    const messages = [
      ...testCase.inputs,
      { role: "assistant", content: response.text }
    ];

    // Run all evals on this test case ("eval" is reserved, so name it "evaluation")
    for (const evaluation of evals) {
      const score = await evaluation.run(agent, backend, testCase, messages);
      results.push({
        testCaseId: testCase.id,
        testCaseName: testCase.name,
        evalId: evaluation.id,
        evalName: evaluation.name,
        score
      });
    }
  }

  return results;
}
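The flat results array from a batch run like this can then be rolled up into a per-evaluation average. The helper below is a plain utility sketch, not part of the ExuluEval API.

```typescript
// Result shape matching the objects pushed by the batch example above.
interface EvalRunResult {
  testCaseId: string;
  testCaseName: string;
  evalId: string;
  evalName: string;
  score: number;
}

// Average the scores per evaluation id across all test cases.
function averageByEval(results: EvalRunResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    const entry = (sums[r.evalId] ??= { total: 0, count: 0 });
    entry.total += r.score;
    entry.count += 1;
  }
  const averages: Record<string, number> = {};
  for (const [evalId, { total, count }] of Object.entries(sums)) {
    averages[evalId] = total / count;
  }
  return averages;
}

const averages = averageByEval([
  { testCaseId: "tc_001", testCaseName: "Weather query", evalId: "exact_match", evalName: "Exact Match", score: 100 },
  { testCaseId: "tc_002", testCaseName: "Refund query", evalId: "exact_match", evalName: "Exact Match", score: 0 },
]);
console.log(averages); // { exact_match: 50 }
```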
Integration with ExuluQueues
Run evaluations as background jobs:
import { ExuluEval, ExuluQueues } from "@exulu/backend";

// Create an eval with queue config ("eval" is reserved, so use another name)
const backgroundEval = new ExuluEval({
  id: "background_eval",
  name: "Background Evaluation",
  description: "Runs as background job",
  llm: true,
  execute: async ({ backend, messages, testCase }) => {
    // Evaluation logic
    return 90;
  },
  // Async IIFE produces the Promise without blocking module evaluation
  queue: (async () => ({
    connection: await ExuluQueues.getConnection(),
    name: "evaluations",
    prefix: "{exulu}",
    defaultJobOptions: {
      attempts: 3,
      backoff: { type: "exponential", delay: 2000 }
    }
  }))()
});

// Queue the evaluation job
// (Implementation depends on your worker setup)
Best practices
Start simple: Begin with basic evaluations (exact match, keyword presence) before building complex LLM-as-judge evaluations.
Multiple evaluations: Use multiple evaluation functions to assess different aspects (accuracy, tone, tool usage, etc.).
Score range: Evaluation functions must return a score between 0 and 100. Scores outside this range will throw an error.
Test case quality: Good test cases are specific, representative of real usage, and have clear expected outputs.
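Because out-of-range scores throw, scoring math that can drift outside 0-100 (for example, floating-point similarity slightly above 1) is worth clamping before returning. A small defensive helper, as a sketch:

```typescript
// Defensive clamp so an evaluation never returns an out-of-range score.
// NaN (e.g. from a failed parseInt) is mapped to 0.
function clampScore(raw: number): number {
  if (Number.isNaN(raw)) return 0;
  return Math.max(0, Math.min(100, raw));
}

console.log(clampScore(100.0000001)); // 100
console.log(clampScore(-5));          // 0
console.log(clampScore(87.5));        // 87.5
```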
When to use ExuluEval
Test agent behavior during development to catch issues early
Compare prompt variations to find the best performing instructions
Evaluate the same agent with different LLM models (GPT-4 vs Claude vs Gemini)
Automated testing in deployment pipelines to prevent regressions
Continuous evaluation in production to track performance over time
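For the comparison use cases, run the same test suite against each configuration (for example with the runEvaluations helper above) and compare averages. The helper below is an illustrative sketch, not part of the ExuluEval API.

```typescript
// Minimal A/B comparison sketch: given per-test-case scores for two
// agent configurations, report which configuration scored higher on average.
function compareConfigs(scoresA: number[], scoresB: number[]): string {
  const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const a = avg(scoresA);
  const b = avg(scoresB);
  if (a === b) return `tie at ${a.toFixed(1)}`;
  return a > b
    ? `A wins (${a.toFixed(1)} vs ${b.toFixed(1)})`
    : `B wins (${b.toFixed(1)} vs ${a.toFixed(1)})`;
}

console.log(compareConfigs([100, 80, 90], [70, 60, 95])); // A wins (90.0 vs 75.0)
```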