ExuluEval class

class ExuluEval {
  public id: string;
  public name: string;
  public description: string;
  public llm: boolean;
  public config?: { name: string; description: string }[];
  public queue?: Promise<ExuluQueueConfig>;

  constructor(params: ExuluEvalParams);
  async run(
    agent: Agent,
    backend: ExuluAgent,
    testCase: TestCase,
    messages: UIMessage[],
    config?: Record<string, any>
  ): Promise<number>;
}

Constructor

Creates a new evaluation function instance.
new ExuluEval(params: ExuluEvalParams)

Parameters

params
ExuluEvalParams
required
Configuration object for the evaluation function
interface ExuluEvalParams {
  id: string;
  name: string;
  description: string;
  llm: boolean;
  execute: (params: ExecuteParams) => Promise<number>;
  config?: { name: string; description: string }[];
  queue?: Promise<ExuluQueueConfig>;
}
params.id
string
required
Unique identifier for this evaluation function
params.name
string
required
Human-readable name
params.description
string
required
Description of what this evaluation measures
params.llm
boolean
required
Whether this evaluation uses an LLM (LLM-as-judge)
params.execute
function
required
Function that performs the evaluation
async (params: {
  agent: Agent;
  backend: ExuluAgent;
  messages: UIMessage[];
  testCase: TestCase;
  config?: Record<string, any>;
}) => Promise<number>
Must return a score between 0 and 100
params.config
array
Optional configuration schema
{
  name: string;        // Config parameter name
  description: string; // What this parameter does
}[]
params.queue
Promise<ExuluQueueConfig>
Optional queue configuration for background execution
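
The Example below does not use queue; for completeness, here is a minimal sketch of passing one. It assumes a hypothetical getQueueConfig() helper that resolves to an ExuluQueueConfig from your own setup:

import { ExuluEval } from "@exulu/backend";

// Hypothetical helper — resolve an ExuluQueueConfig however your project does.
declare function getQueueConfig(): Promise<ExuluQueueConfig>;

const queuedEval = new ExuluEval({
  id: "queued_exact_match",
  name: "Queued Exact Match",
  description: "Exact match, executed as a background job",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  },
  // Resolved when the evaluation is scheduled on the queue
  queue: getQueueConfig()
});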

Example

import { ExuluEval } from "@exulu/backend";

const exactMatch = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "Checks if response exactly matches expected output",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  }
});

Properties

id
string
Unique identifier for this evaluation function
const evalId = exactMatch.id; // "exact_match"

name
string
Human-readable name for the evaluation
const evalName = exactMatch.name; // "Exact Match"

description
string
Description of what this evaluation measures
const evalDesc = exactMatch.description; // "Checks if response exactly matches expected output"

llm
boolean
Whether this evaluation uses an LLM for scoring
const usesLLM = exactMatch.llm; // false

config
array | undefined
Configuration schema defining runtime parameters
{
  name: string;
  description: string;
}[]
const configSchema = exactMatch.config;
// [{ name: "threshold", description: "Minimum score threshold" }]

queue
Promise<ExuluQueueConfig> | undefined
Queue configuration for background execution
const queueConfig = await exactMatch.queue;

Methods

run()

Executes the evaluation function and returns a score.
async run(
  agent: Agent,
  backend: ExuluAgent,
  testCase: TestCase,
  messages: UIMessage[],
  config?: Record<string, any>
): Promise<number>
agent
Agent
required
Agent database record being evaluated
interface Agent {
  id: string;
  name: string;
  description: string;
  // ... other properties
}
backend
ExuluAgent
required
ExuluAgent instance for generating responses or using LLM-as-judge
testCase
TestCase
required
Test case with inputs and expected outputs
interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];
  expected_output: string;
  expected_tools?: string[];
  expected_knowledge_sources?: string[];
  expected_agent_tools?: string[];
  createdAt: string;
  updatedAt: string;
}
messages
UIMessage[]
required
Conversation messages including inputs and agent responses
interface UIMessage {
  role: "user" | "assistant" | "system";
  content: string;
  toolInvocations?: ToolInvocation[];
}
config
Record<string, any>
Optional runtime configuration values
return
Promise<number>
Score from 0 to 100
Example:
const score = await exactMatch.run(
  agent,
  backend,
  testCase,
  messages,
  { threshold: 80 }
);

console.log(`Score: ${score}/100`);
Error handling:
try {
  const score = await exactMatch.run(agent, backend, testCase, messages);
  console.log(`Score: ${score}`);
} catch (error) {
  console.error("Evaluation failed:", error.message);
  // Error: Eval function must return a score between 0 and 100, got 150
}
Throws:
  • Error if the execute function returns a score < 0 or > 100
  • Error if the execute function throws an error

Type definitions

ExuluEvalParams

interface ExuluEvalParams {
  id: string;
  name: string;
  description: string;
  llm: boolean;
  execute: (params: {
    agent: Agent;
    backend: ExuluAgent;
    messages: UIMessage[];
    testCase: TestCase;
    config?: Record<string, any>;
  }) => Promise<number>;
  config?: {
    name: string;
    description: string;
  }[];
  queue?: Promise<ExuluQueueConfig>;
}

TestCase

interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];                   // Input messages
  expected_output: string;               // Expected response
  expected_tools?: string[];             // Expected tool names
  expected_knowledge_sources?: string[]; // Expected context IDs
  expected_agent_tools?: string[];       // Expected agent tool IDs
  createdAt: string;
  updatedAt: string;
}
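
For illustration, a hypothetical test case matching this shape (all values invented):

const testCase: TestCase = {
  id: "tc_weather_1",
  name: "Weather question",
  description: "Agent should report the forecast",
  inputs: [{ role: "user", content: "What's the weather in Berlin?" }],
  expected_output: "It is sunny in Berlin.",
  expected_tools: ["get_weather"],
  createdAt: "2024-01-01T00:00:00.000Z",
  updatedAt: "2024-01-01T00:00:00.000Z"
};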

UIMessage

interface UIMessage {
  role: "user" | "assistant" | "system";
  content: string;
  toolInvocations?: ToolInvocation[];
}

interface ToolInvocation {
  toolName: string;
  toolCallId: string;
  args: Record<string, any>;
  result?: any;
}
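
And a hypothetical conversation in this shape, with a tool invocation recorded on the assistant message:

const messages: UIMessage[] = [
  { role: "user", content: "What's the weather in Berlin?" },
  {
    role: "assistant",
    content: "It is sunny in Berlin.",
    toolInvocations: [
      {
        toolName: "get_weather",
        toolCallId: "call_1",
        args: { city: "Berlin" },
        result: { forecast: "sunny" }
      }
    ]
  }
];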

Usage examples

Basic exact match

import { ExuluEval } from "@exulu/backend";

const exactMatch = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "100 if exact match, 0 otherwise",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  }
});

const score = await exactMatch.run(agent, backend, testCase, messages);
console.log(`Score: ${score}/100`);

Keyword evaluation with config

const keywordEval = new ExuluEval({
  id: "keyword_check",
  name: "Keyword Check",
  description: "Checks for presence of keywords",
  llm: false,
  execute: async ({ messages, config }) => {
    const response = messages[messages.length - 1]?.content?.toLowerCase() || "";
    const keywords = config?.keywords || [];

    if (keywords.length === 0) return 100;

    const found = keywords.filter(kw => response.includes(kw.toLowerCase()));
    return (found.length / keywords.length) * 100;
  },
  config: [
    {
      name: "keywords",
      description: "Array of required keywords"
    }
  ]
});

const score = await keywordEval.run(
  agent,
  backend,
  testCase,
  messages,
  { keywords: ["weather", "temperature"] }
);

LLM-as-judge

const llmJudge = new ExuluEval({
  id: "llm_judge",
  name: "LLM Quality Judge",
  description: "Uses LLM to evaluate response quality",
  llm: true,
  execute: async ({ backend, messages, testCase, config }) => {
    const response = messages[messages.length - 1]?.content || "";

    const judgePrompt = `
Rate this response on a scale of 0-100.

Expected: ${testCase.expected_output}
Actual: ${response}

Respond with ONLY a number 0-100.
    `.trim();

    const result = await backend.generateSync({
      prompt: judgePrompt,
      agentInstance: await loadAgent(config?.judgeAgentId),
      statistics: { label: "eval", trigger: "llm_judge" }
    });

    const score = parseInt(result.text.trim(), 10);
    return isNaN(score) ? 0 : Math.max(0, Math.min(100, score));
  },
  config: [
    {
      name: "judgeAgentId",
      description: "Agent to use for judging"
    }
  ]
});

const score = await llmJudge.run(
  agent,
  backend,
  testCase,
  messages,
  { judgeAgentId: "claude_opus_judge" }
);

Tool usage evaluation

const toolUsageEval = new ExuluEval({
  id: "tool_usage",
  name: "Tool Usage Check",
  description: "Verifies correct tools were used",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const toolCalls = messages
      .flatMap(msg => msg.toolInvocations || [])
      .map(inv => inv.toolName);

    const expectedTools = testCase.expected_tools || [];

    if (expectedTools.length === 0) {
      return toolCalls.length === 0 ? 100 : 0;
    }

    const usedExpected = expectedTools.filter(tool =>
      toolCalls.includes(tool)
    );

    return (usedExpected.length / expectedTools.length) * 100;
  }
});

const score = await toolUsageEval.run(agent, backend, testCase, messages);

Batch evaluation

async function runAllEvaluations(
  agent: Agent,
  backend: ExuluAgent,
  testCases: TestCase[],
  evaluations: ExuluEval[]
) {
  const results: {
    testCaseId: string;
    testCaseName: string;
    evaluationId: string;
    evaluationName: string;
    score: number;
  }[] = [];

  for (const testCase of testCases) {
    // Generate response
    const response = await backend.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agent.id),
      statistics: { label: "eval", trigger: "test" }
    });

    const messages: UIMessage[] = [
      ...testCase.inputs,
      { role: "assistant", content: response.text }
    ];

    // Run all evaluations
    for (const evaluation of evaluations) {
      const score = await evaluation.run(agent, backend, testCase, messages);

      results.push({
        testCaseId: testCase.id,
        testCaseName: testCase.name,
        evaluationId: evaluation.id,
        evaluationName: evaluation.name,
        score
      });
    }
  }

  return results;
}

// Use
const results = await runAllEvaluations(
  agent,
  backend,
  testCases,
  [exactMatch, keywordEval, toolUsageEval]
);

console.log("Results:", results);

Evaluation suite

import { ExuluEval } from "@exulu/backend";

class EvaluationSuite {
  private evaluations: ExuluEval[] = [];

  add(evaluation: ExuluEval) {
    this.evaluations.push(evaluation);
  }

  async runAll(
    agent: Agent,
    backend: ExuluAgent,
    testCase: TestCase,
    messages: UIMessage[],
    config?: Record<string, any>
  ) {
    const results = await Promise.all(
      this.evaluations.map(async (evaluation) => ({
        id: evaluation.id,
        name: evaluation.name,
        score: await evaluation.run(agent, backend, testCase, messages, config)
      }))
      }))
    );

    return {
      testCase: testCase.name,
      evaluations: results,
      average: results.reduce((sum, r) => sum + r.score, 0) / results.length,
      passed: results.every(r => r.score >= (config?.threshold || 80))
    };
  }
}

// Use
const suite = new EvaluationSuite();
suite.add(exactMatch);
suite.add(keywordEval);
suite.add(toolUsageEval);

const result = await suite.runAll(agent, backend, testCase, messages);
console.log("Suite result:", result);

Composite evaluation

const compositeEval = new ExuluEval({
  id: "composite",
  name: "Composite Evaluation",
  description: "Combines multiple criteria with weights",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    let totalScore = 0;

    // Accuracy (50%)
    const containsExpected = response.includes(testCase.expected_output);
    totalScore += containsExpected ? 50 : 0;

    // Length (20%)
    const isReasonableLength = response.length >= 50 && response.length <= 500;
    totalScore += isReasonableLength ? 20 : 0;

    // Tool usage (30%)
    const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
    const expectedTools = testCase.expected_tools || [];
    if (expectedTools.length > 0) {
      const toolsUsed = expectedTools.every(tool =>
        toolCalls.some(call => call.toolName === tool)
      );
      totalScore += toolsUsed ? 30 : 0;
    } else {
      totalScore += 30;
    }

    return totalScore;
  }
});

Error handling

const safeEval = new ExuluEval({
  id: "safe_eval",
  name: "Safe Evaluation",
  description: "Evaluation with comprehensive error handling",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    try {
      const response = messages[messages.length - 1]?.content;

      if (!response) {
        console.warn("No response content found");
        return 0;
      }

      // Your evaluation logic
      const score = computeScore(response, testCase.expected_output);

      // Validate score range
      if (score < 0 || score > 100) {
        throw new Error(`Score out of range: ${score}`);
      }

      return score;
    } catch (error) {
      console.error(`Evaluation error: ${error.message}`);
      throw error; // Re-throw for ExuluEval to handle
    }
  }
});

// Run with error handling
try {
  const score = await safeEval.run(agent, backend, testCase, messages);
  console.log(`Score: ${score}`);
} catch (error) {
  console.error("Evaluation failed:", error.message);
  // Handle failure (log, alert, retry, etc.)
}

Integration patterns

With test management system

interface EvaluationResult {
  evaluationId: string;
  testCaseId: string;
  score: number;
  timestamp: string;
  agentId: string;
  passed: boolean;
}

async function runAndStoreEvaluation(
  evaluation: ExuluEval,
  agent: Agent,
  backend: ExuluAgent,
  testCase: TestCase,
  messages: UIMessage[],
  threshold: number = 80
): Promise<EvaluationResult> {
  const score = await evaluation.run(agent, backend, testCase, messages);

  const result: EvaluationResult = {
    evaluationId: evaluation.id,
    testCaseId: testCase.id,
    score,
    timestamp: new Date().toISOString(),
    agentId: agent.id,
    passed: score >= threshold
  };

  // Store in database
  const { db } = await postgresClient();
  await db.into("evaluation_results").insert(result);

  return result;
}

CI/CD integration

async function runCIPipeline(
  agent: Agent,
  backend: ExuluAgent,
  testCases: TestCase[],
  evaluations: ExuluEval[],
  minPassRate: number = 0.8
) {
  const results: { testCase: string; eval: string; score: number }[] = [];

  for (const testCase of testCases) {
    const response = await backend.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agent.id),
      statistics: { label: "ci", trigger: "test" }
    });

    const messages: UIMessage[] = [
      ...testCase.inputs,
      { role: "assistant", content: response.text }
    ];

    for (const evaluation of evaluations) {
      const score = await evaluation.run(agent, backend, testCase, messages);
      results.push({ testCase: testCase.name, eval: evaluation.name, score });
    }
  }

  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  const passRate = results.filter(r => r.score >= 80).length / results.length;

  if (passRate < minPassRate) {
    throw new Error(
      `CI failed: Pass rate ${passRate.toFixed(2)} below minimum ${minPassRate}. ` +
      `Average score: ${averageScore.toFixed(2)}/100`
    );
  }

  console.log(`✓ CI passed: ${passRate.toFixed(2)} pass rate, ${averageScore.toFixed(2)} avg score`);
  return { averageScore, passRate, results };
}

Best practices

Validate inputs: Check that messages and testCase have the expected structure before running evaluation logic (see the sketch below).
Score range: Always ensure your execute function returns a value between 0 and 100, inclusive.
LLM consistency: LLM judges can be inconsistent. Use temperature=0 for more deterministic scoring.
Multiple evaluations: Use multiple evaluation functions to assess different aspects (accuracy, style, tool usage).
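
A minimal sketch combining the first two practices, with a hypothetical computeScore helper standing in for real scoring logic:

const guardedEval = new ExuluEval({
  id: "guarded_eval",
  name: "Guarded Evaluation",
  description: "Validates inputs and clamps the score to the 0-100 range",
  llm: false,
  execute: async ({ messages, testCase }) => {
    // Validate inputs before scoring
    const last = messages[messages.length - 1];
    if (!last?.content || !testCase.expected_output) return 0;

    // Hypothetical helper — replace with your own scoring logic
    const raw = computeScore(last.content, testCase.expected_output);

    // Clamp into the required range
    return Math.max(0, Math.min(100, raw));
  }
});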

Next steps

Configuration guide

Learn about evaluation configuration

Overview

Understand evaluation concepts

ExuluAgent

Create agents to evaluate

ExuluQueues

Run evaluations as background jobs