
ExuluEval class

class ExuluEval {
  public id: string;
  public name: string;
  public description: string;
  public llm: boolean;
  public config?: { name: string; description: string }[];
  public queue?: Promise<ExuluQueueConfig>;

  constructor(params: ExuluEvalParams);
  async run(
    agent: Agent,
    backend: ExuluAgent,
    testCase: TestCase,
    messages: UIMessage[],
    config?: Record<string, any>
  ): Promise<number>;
}

Constructor

Creates a new evaluation function instance.
new ExuluEval(params: ExuluEvalParams)

Parameters

params
ExuluEvalParams
required
Configuration object for the evaluation function
interface ExuluEvalParams {
  id: string;
  name: string;
  description: string;
  llm: boolean;
  execute: (params: ExecuteParams) => Promise<number>;
  config?: { name: string; description: string }[];
  queue?: Promise<ExuluQueueConfig>;
}
params.id
string
required
Unique identifier for this evaluation function
params.name
string
required
Human-readable name
params.description
string
required
Description of what this evaluation measures
params.llm
boolean
required
Whether this evaluation uses an LLM (LLM-as-judge)
params.execute
function
required
Function that performs the evaluation
async (params: {
  agent: Agent;
  backend: ExuluAgent;
  messages: UIMessage[];
  testCase: TestCase;
  config?: Record<string, any>;
}) => Promise<number>
Must return a score between 0 and 100
params.config
array
Optional configuration schema
{
  name: string;        // Config parameter name
  description: string; // What this parameter does
}[]
params.queue
Promise<ExuluQueueConfig>
Optional queue configuration for background execution

Example

import { ExuluEval } from "@exulu/backend";

// Note: "eval" is a reserved word in strict mode, so use a different name
const exactMatch = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "Checks if response exactly matches expected output",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  }
});

Properties

id
string
Unique identifier for this evaluation function
const evalId = exactMatch.id; // "exact_match"

name
string
Human-readable name for the evaluation
const evalName = exactMatch.name; // "Exact Match"

description
string
Description of what this evaluation measures
const evalDesc = exactMatch.description; // "Checks if response exactly matches expected output"

llm
boolean
Whether this evaluation uses an LLM for scoring
const usesLLM = exactMatch.llm; // false

config
array | undefined
Configuration schema defining runtime parameters
{
  name: string;
  description: string;
}[]
const configSchema = exactMatch.config;
// [{ name: "threshold", description: "Minimum score threshold" }]

queue
Promise<ExuluQueueConfig> | undefined
Queue configuration for background execution
const queueConfig = await exactMatch.queue;

Methods

run()

Executes the evaluation function and returns a score.
async run(
  agent: Agent,
  backend: ExuluAgent,
  testCase: TestCase,
  messages: UIMessage[],
  config?: Record<string, any>
): Promise<number>
agent
Agent
required
Agent database record being evaluated
interface Agent {
  id: string;
  name: string;
  description: string;
  // ... other properties
}
backend
ExuluAgent
required
ExuluAgent instance for generating responses or using LLM-as-judge
testCase
TestCase
required
Test case with inputs and expected outputs
interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];
  expected_output: string;
  expected_tools?: string[];
  expected_knowledge_sources?: string[];
  expected_agent_tools?: string[];
  createdAt: string;
  updatedAt: string;
}
messages
UIMessage[]
required
Conversation messages including inputs and agent responses
interface UIMessage {
  role: "user" | "assistant" | "system";
  content: string;
  toolInvocations?: ToolInvocation[];
}
config
Record<string, any>
Optional runtime configuration values
return
Promise<number>
Score from 0 to 100
Example:
const score = await exactMatch.run(
  agent,
  backend,
  testCase,
  messages,
  { threshold: 80 }
);

console.log(`Score: ${score}/100`);
Error handling:
try {
  const score = await exactMatch.run(agent, backend, testCase, messages);
  console.log(`Score: ${score}`);
} catch (error) {
  console.error("Evaluation failed:", error.message);
  // Error: Eval function must return a score between 0 and 100, got 150
}
Throws:
  • Error if execute function returns score < 0 or > 100
  • Error if execute function throws an error

Type definitions

ExuluEvalParams

interface ExuluEvalParams {
  id: string;
  name: string;
  description: string;
  llm: boolean;
  execute: (params: {
    agent: Agent;
    backend: ExuluAgent;
    messages: UIMessage[];
    testCase: TestCase;
    config?: Record<string, any>;
  }) => Promise<number>;
  config?: {
    name: string;
    description: string;
  }[];
  queue?: Promise<ExuluQueueConfig>;
}

TestCase

interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];                   // Input messages
  expected_output: string;               // Expected response
  expected_tools?: string[];             // Expected tool names
  expected_knowledge_sources?: string[]; // Expected context IDs
  expected_agent_tools?: string[];       // Expected agent tool IDs
  createdAt: string;
  updatedAt: string;
}

UIMessage

interface UIMessage {
  role: "user" | "assistant" | "system";
  content: string;
  toolInvocations?: ToolInvocation[];
}

interface ToolInvocation {
  toolName: string;
  toolCallId: string;
  args: Record<string, any>;
  result?: any;
}

Usage examples

Basic exact match

import { ExuluEval } from "@exulu/backend";

const exactMatch = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "100 if exact match, 0 otherwise",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    return response === testCase.expected_output ? 100 : 0;
  }
});

const score = await exactMatch.run(agent, backend, testCase, messages);
console.log(`Score: ${score}/100`);

Keyword evaluation with config

const keywordEval = new ExuluEval({
  id: "keyword_check",
  name: "Keyword Check",
  description: "Checks for presence of keywords",
  llm: false,
  execute: async ({ messages, config }) => {
    const response = messages[messages.length - 1]?.content?.toLowerCase() || "";
    const keywords = config?.keywords || [];

    if (keywords.length === 0) return 100;

    const found = keywords.filter(kw => response.includes(kw.toLowerCase()));
    return (found.length / keywords.length) * 100;
  },
  config: [
    {
      name: "keywords",
      description: "Array of required keywords"
    }
  ]
});

const score = await keywordEval.run(
  agent,
  backend,
  testCase,
  messages,
  { keywords: ["weather", "temperature"] }
);

LLM-as-judge

const llmJudge = new ExuluEval({
  id: "llm_judge",
  name: "LLM Quality Judge",
  description: "Uses LLM to evaluate response quality",
  llm: true,
  execute: async ({ backend, messages, testCase, config }) => {
    const response = messages[messages.length - 1]?.content || "";

    const judgePrompt = `
Rate this response on a scale of 0-100.

Expected: ${testCase.expected_output}
Actual: ${response}

Respond with ONLY a number 0-100.
    `.trim();

    const result = await backend.generateSync({
      prompt: judgePrompt,
      agentInstance: await loadAgent(config?.judgeAgentId),
      statistics: { label: "eval", trigger: "llm_judge" }
    });

    const score = parseInt(result.text.trim(), 10);
    return isNaN(score) ? 0 : Math.max(0, Math.min(100, score));
  },
  config: [
    {
      name: "judgeAgentId",
      description: "Agent to use for judging"
    }
  ]
});

const score = await llmJudge.run(
  agent,
  backend,
  testCase,
  messages,
  { judgeAgentId: "claude_opus_judge" }
);

Tool usage evaluation

const toolUsageEval = new ExuluEval({
  id: "tool_usage",
  name: "Tool Usage Check",
  description: "Verifies correct tools were used",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const toolCalls = messages
      .flatMap(msg => msg.toolInvocations || [])
      .map(inv => inv.toolName);

    const expectedTools = testCase.expected_tools || [];

    if (expectedTools.length === 0) {
      return toolCalls.length === 0 ? 100 : 0;
    }

    const usedExpected = expectedTools.filter(tool =>
      toolCalls.includes(tool)
    );

    return (usedExpected.length / expectedTools.length) * 100;
  }
});

const score = await toolUsageEval.run(agent, backend, testCase, messages);

Batch evaluation

async function runAllEvaluations(
  agent: Agent,
  backend: ExuluAgent,
  testCases: TestCase[],
  evaluations: ExuluEval[]
) {
  const results = [];

  for (const testCase of testCases) {
    // Generate response
    const response = await backend.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agent.id),
      statistics: { label: "eval", trigger: "test" }
    });

    const messages = [
      ...testCase.inputs,
      { role: "assistant", content: response.text }
    ];

    // Run all evaluations
    for (const evaluation of evaluations) {
      const score = await evaluation.run(agent, backend, testCase, messages);

      results.push({
        testCaseId: testCase.id,
        testCaseName: testCase.name,
        evaluationId: evaluation.id,
        evaluationName: evaluation.name,
        score
      });
    }
  }

  return results;
}

// Use
const results = await runAllEvaluations(
  agent,
  backend,
  testCases,
  [exactMatch, keywordEval, toolUsageEval]
);

console.log("Results:", results);
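The flat result rows can then be summarized per evaluation across all test cases. A minimal sketch (the `averageByEvaluation` helper and `EvalRow` shape are hypothetical, not part of @exulu/backend):

```typescript
interface EvalRow {
  evaluationId: string;
  score: number;
}

// Hypothetical post-processing helper: compute the mean score per
// evaluation across all test cases from runAllEvaluations output.
function averageByEvaluation(rows: EvalRow[]): Record<string, number> {
  const totals: Record<string, { sum: number; n: number }> = {};
  for (const row of rows) {
    const t = (totals[row.evaluationId] ??= { sum: 0, n: 0 });
    t.sum += row.score;
    t.n += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([id, t]) => [id, t.sum / t.n])
  );
}
```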

Evaluation suite

import { ExuluEval } from "@exulu/backend";

class EvaluationSuite {
  private evaluations: ExuluEval[] = [];

  add(evaluation: ExuluEval) {
    this.evaluations.push(evaluation);
  }

  async runAll(
    agent: Agent,
    backend: ExuluAgent,
    testCase: TestCase,
    messages: UIMessage[],
    config?: Record<string, any>
  ) {
    const results = await Promise.all(
      this.evaluations.map(async (e) => ({
        id: e.id,
        name: e.name,
        score: await e.run(agent, backend, testCase, messages, config)
      }))
    );

    return {
      testCase: testCase.name,
      evaluations: results,
      average: results.reduce((sum, r) => sum + r.score, 0) / results.length,
      passed: results.every(r => r.score >= (config?.threshold || 80))
    };
  }
}

// Use
const suite = new EvaluationSuite();
suite.add(exactMatch);
suite.add(keywordEval);
suite.add(toolUsageEval);

const result = await suite.runAll(agent, backend, testCase, messages);
console.log("Suite result:", result);

Composite evaluation

const compositeEval = new ExuluEval({
  id: "composite",
  name: "Composite Evaluation",
  description: "Combines multiple criteria with weights",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const response = messages[messages.length - 1]?.content || "";
    let totalScore = 0;

    // Accuracy (50%)
    const containsExpected = response.includes(testCase.expected_output);
    totalScore += containsExpected ? 50 : 0;

    // Length (20%)
    const isReasonableLength = response.length >= 50 && response.length <= 500;
    totalScore += isReasonableLength ? 20 : 0;

    // Tool usage (30%)
    const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
    const expectedTools = testCase.expected_tools || [];
    if (expectedTools.length > 0) {
      const toolsUsed = expectedTools.every(tool =>
        toolCalls.some(call => call.toolName === tool)
      );
      totalScore += toolsUsed ? 30 : 0;
    } else {
      totalScore += 30;
    }

    return totalScore;
  }
});
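The weighting logic above can be factored into a reusable helper so the criteria and their weights stay in one list. A minimal sketch, assuming a hypothetical `Criterion` shape (not part of the ExuluEval API); weights should sum to 100 so the result stays in the required score range:

```typescript
// Hypothetical generalization of the composite pattern: each criterion
// pairs a weight with a boolean check, and the score is the sum of
// weights for the checks that pass.
interface Criterion {
  weight: number;         // contribution to the 0-100 score
  check: () => boolean;   // whether this criterion is satisfied
}

function weightedScore(criteria: Criterion[]): number {
  return criteria.reduce((sum, c) => sum + (c.check() ? c.weight : 0), 0);
}
```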

Error handling

const safeEval = new ExuluEval({
  id: "safe_eval",
  name: "Safe Evaluation",
  description: "Evaluation with comprehensive error handling",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    try {
      const response = messages[messages.length - 1]?.content;

      if (!response) {
        console.warn("No response content found");
        return 0;
      }

      // Your evaluation logic
      const score = computeScore(response, testCase.expected_output);

      // Validate score range
      if (score < 0 || score > 100) {
        throw new Error(`Score out of range: ${score}`);
      }

      return score;
    } catch (error) {
      console.error(`Evaluation error: ${error.message}`);
      throw error; // Re-throw for ExuluEval to handle
    }
  }
});

// Run with error handling
try {
  const score = await safeEval.run(agent, backend, testCase, messages);
  console.log(`Score: ${score}`);
} catch (error) {
  console.error("Evaluation failed:", error.message);
  // Handle failure (log, alert, retry, etc.)
}

Integration patterns

With test management system

interface EvaluationResult {
  evaluationId: string;
  testCaseId: string;
  score: number;
  timestamp: string;
  agentId: string;
  passed: boolean;
}

async function runAndStoreEvaluation(
  evaluation: ExuluEval,
  agent: Agent,
  backend: ExuluAgent,
  testCase: TestCase,
  messages: UIMessage[],
  threshold: number = 80
): Promise<EvaluationResult> {
  const score = await evaluation.run(agent, backend, testCase, messages);

  const result: EvaluationResult = {
    evaluationId: evaluation.id,
    testCaseId: testCase.id,
    score,
    timestamp: new Date().toISOString(),
    agentId: agent.id,
    passed: score >= threshold
  };

  // Store in database
  const { db } = await postgresClient();
  await db.into("evaluation_results").insert(result);

  return result;
}

CI/CD integration

async function runCIPipeline(
  agent: Agent,
  backend: ExuluAgent,
  testCases: TestCase[],
  evaluations: ExuluEval[],
  minPassRate: number = 0.8
) {
  const results = [];

  for (const testCase of testCases) {
    const response = await backend.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agent.id),
      statistics: { label: "ci", trigger: "test" }
    });

    const messages = [
      ...testCase.inputs,
      { role: "assistant", content: response.text }
    ];

    for (const evaluation of evaluations) {
      const score = await evaluation.run(agent, backend, testCase, messages);
      results.push({ testCase: testCase.name, eval: evaluation.name, score });
    }
  }

  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  const passRate = results.filter(r => r.score >= 80).length / results.length;

  if (passRate < minPassRate) {
    throw new Error(
      `CI failed: Pass rate ${passRate.toFixed(2)} below minimum ${minPassRate}. ` +
      `Average score: ${averageScore.toFixed(2)}/100`
    );
  }

  console.log(`✓ CI passed: ${passRate.toFixed(2)} pass rate, ${averageScore.toFixed(2)} avg score`);
  return { averageScore, passRate, results };
}

Best practices

Validate inputs: Check that messages and testCase have expected structure before running evaluation logic.
Score range: Always ensure your execute function returns a value between 0 and 100, inclusive.
LLM consistency: LLM judges can be inconsistent. Use temperature=0 for more deterministic scoring.
Multiple evaluations: Use multiple evaluation functions to assess different aspects (accuracy, style, tool usage).
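The score-range practice above can be enforced with a small guard at the end of every `execute` function. A minimal sketch (the `toValidScore` helper is hypothetical, not part of @exulu/backend):

```typescript
// Hypothetical guard: clamp a raw score into [0, 100] so execute()
// never triggers ExuluEval's out-of-range error.
function toValidScore(raw: number): number {
  if (Number.isNaN(raw)) return 0;        // treat unparseable scores as failure
  return Math.max(0, Math.min(100, raw)); // clamp into the accepted range
}
```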
