
Constructor parameters

ExuluEval takes a configuration object that defines the evaluation's identity, its scoring logic, and optional runtime settings:
new ExuluEval({
  id: string;
  name: string;
  description: string;
  llm: boolean;
  execute: (params) => Promise<number>;
  config?: { name: string; description: string }[];
  queue?: Promise<ExuluQueueConfig>;
})
id (string, required): Unique identifier for the evaluation function
name (string, required): Human-readable name for the evaluation
description (string, required): Description of what this evaluation measures
llm (boolean, required): Whether this evaluation uses an LLM for scoring (LLM-as-judge)
execute (function, required): Function that performs the evaluation and returns a score from 0-100
config (array, optional): Configuration parameters for the evaluation function
queue (Promise<ExuluQueueConfig>, optional): Queue configuration for running evaluations as background jobs

Execute function

The execute function receives evaluation parameters and must return a score between 0 and 100:
execute: async ({
  agent,      // Agent database record
  backend,    // ExuluAgent instance
  messages,   // Conversation messages
  testCase,   // Test case with expected output
  config      // Optional runtime configuration
}) => {
  // Your evaluation logic
  return score; // Must be 0-100
}

Parameters

agent (Agent): The agent database record being evaluated
interface Agent {
  id: string;
  name: string;
  description: string;
  // ... other agent properties
}
backend (ExuluAgent): ExuluAgent instance for generating responses or using LLM-as-judge
messages (UIMessage[]): Array of conversation messages, including the inputs and the generated response
interface UIMessage {
  role: "user" | "assistant" | "system";
  content: string;
  toolInvocations?: ToolInvocation[];
}
testCase (TestCase): Test case containing inputs and expected outputs
interface TestCase {
  id: string;
  name: string;
  description?: string;
  inputs: UIMessage[];
  expected_output: string;
  expected_tools?: string[];
  expected_knowledge_sources?: string[];
  expected_agent_tools?: string[];
}
config (Record<string, any>, optional): Runtime configuration values

Configuration patterns

Basic exact match evaluation

import { ExuluEval } from "@exulu/backend";

const exactMatchEval = new ExuluEval({
  id: "exact_match",
  name: "Exact Match",
  description: "Returns 100 if response exactly matches expected output, 0 otherwise",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    return response === testCase.expected_output ? 100 : 0;
  }
});

Partial match with scoring

const partialMatchEval = new ExuluEval({
  id: "partial_match",
  name: "Partial Match",
  description: "Scores based on how much of expected output appears in response",
  llm: false,
  execute: async ({ messages, testCase }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content?.toLowerCase() || "";
    const expected = testCase.expected_output.toLowerCase();

    // Split into words
    const expectedWords = expected.split(/\s+/);
    const matchedWords = expectedWords.filter(word =>
      response.includes(word)
    );

    return (matchedWords.length / expectedWords.length) * 100;
  }
});
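
Note that includes() matches substrings, so "cat" counts as found inside "catalog". If whole-word matching matters, a stricter variant of the filter above (a sketch, assuming ASCII word boundaries) can test each expected word with a word-boundary regex:
// Stricter word matching (sketch): escape regex metacharacters, then require
// word boundaries so "cat" no longer matches inside "catalog".
const escapeRegex = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const matchedWords = expectedWords.filter((word) =>
  new RegExp(`\\b${escapeRegex(word)}\\b`).test(response)
);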

Keyword presence evaluation

const keywordEval = new ExuluEval({
  id: "keyword_presence",
  name: "Keyword Presence",
  description: "Checks if response contains required keywords",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content?.toLowerCase() || "";

    const keywords = config?.keywords || [];
    if (keywords.length === 0) return 100;

    const foundKeywords = keywords.filter(kw =>
      response.includes(kw.toLowerCase())
    );

    return (foundKeywords.length / keywords.length) * 100;
  },
  config: [
    {
      name: "keywords",
      description: "Array of keywords that should appear in response"
    }
  ]
});

// Run with config
const score = await keywordEval.run(
  agent,
  backend,
  testCase,
  messages,
  { keywords: ["weather", "temperature", "San Francisco"] }
);

LLM-as-judge evaluation

const llmJudgeEval = new ExuluEval({
  id: "llm_judge",
  name: "LLM Judge",
  description: "Uses an LLM to evaluate response quality",
  llm: true,
  execute: async ({ backend, messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const judgePrompt = `
You are an expert evaluator. Rate the following response on a scale of 0-100.

Test Case: ${testCase.name}
Description: ${testCase.description || "N/A"}

Expected Output:
${testCase.expected_output}

Actual Response:
${response}

Criteria:
1. Accuracy: Does it match the expected output?
2. Completeness: Does it address all required aspects?
3. Clarity: Is it well-structured and understandable?
4. Relevance: Does it stay on topic?

Respond with ONLY a number from 0 to 100. No explanation.
    `.trim();

    const result = await backend.generateSync({
      prompt: judgePrompt,
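      // loadAgent is assumed to be a helper in your codebase that resolves
      // an agent instance by ID; substitute your own agent loading logic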
      agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
      statistics: { label: "eval", trigger: "llm_judge" }
    });

    const score = parseInt(result.text.trim());

    if (isNaN(score)) {
      console.warn(`LLM judge returned non-numeric: ${result.text}`);
      return 0;
    }

    return Math.max(0, Math.min(100, score));
  },
  config: [
    {
      name: "judgeAgentId",
      description: "Agent ID to use for evaluation (must support text generation)"
    }
  ]
});
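
Because LLM judges can vary between runs, one way to stabilize scores is to run the same evaluation several times and average the results. A minimal sketch, assuming the llmJudgeEval instance above and the run method shown in the keyword example:
// Stability sketch: run the LLM judge several times and average the scores.
// The repeat count of 3 is an arbitrary illustration.
async function runJudgeStable(
  agent: Agent,
  backend: ExuluAgent,
  testCase: TestCase,
  messages: UIMessage[],
  repeats = 3
): Promise<number> {
  const scores: number[] = [];
  for (let i = 0; i < repeats; i++) {
    scores.push(await llmJudgeEval.run(agent, backend, testCase, messages));
  }
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}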

Tool usage evaluation

const toolUsageEval = new ExuluEval({
  id: "tool_usage",
  name: "Tool Usage",
  description: "Checks if agent used expected tools",
  llm: false,
  execute: async ({ messages, testCase }) => {
    // Extract tool calls from conversation
    const toolCalls = messages
      .flatMap(msg => msg.toolInvocations || [])
      .map(inv => inv.toolName);

    const expectedTools = testCase.expected_tools || [];

    // If no tools expected, check that no tools were used
    if (expectedTools.length === 0) {
      return toolCalls.length === 0 ? 100 : 0;
    }

    // Check if all expected tools were used
    const usedExpected = expectedTools.filter(tool =>
      toolCalls.includes(tool)
    );

    return (usedExpected.length / expectedTools.length) * 100;
  }
});

Regex pattern matching

const regexMatchEval = new ExuluEval({
  id: "regex_match",
  name: "Regex Pattern Match",
  description: "Checks if response matches regex pattern",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const pattern = config?.pattern;
    if (!pattern) {
      throw new Error("Regex pattern required in config");
    }

    const regex = new RegExp(pattern, config?.flags || "");
    return regex.test(response) ? 100 : 0;
  },
  config: [
    {
      name: "pattern",
      description: "Regex pattern to match"
    },
    {
      name: "flags",
      description: "Regex flags (e.g., 'i' for case-insensitive)"
    }
  ]
});

// Run with regex config
const score = await regexMatchEval.run(
  agent,
  backend,
  testCase,
  messages,
  {
    pattern: "\\d{2}°[FC]",  // Matches temperature like "68°F"
    flags: "i"
  }
);

Length-based evaluation

const lengthEval = new ExuluEval({
  id: "response_length",
  name: "Response Length",
  description: "Scores based on response length within acceptable range",
  llm: false,
  execute: async ({ messages, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";
    const length = response.length;

    const minLength = config?.minLength || 0;
    const maxLength = config?.maxLength || Infinity;
    const targetLength = config?.targetLength;

    // Below the minimum: score proportionally to how close the response is
    if (length < minLength) {
      return Math.max(0, (length / minLength) * 100);
    }

    // Above the maximum: penalize based on how far past the limit it runs
    if (length > maxLength) {
      return Math.max(0, 100 - ((length - maxLength) / maxLength) * 100);
    }

    // Within range
    if (targetLength) {
      const deviation = Math.abs(length - targetLength);
      const maxDeviation = Math.max(
        targetLength - minLength,
        maxLength - targetLength
      );
      return Math.max(0, 100 - (deviation / maxDeviation) * 50);
    }

    return 100;
  },
  config: [
    {
      name: "minLength",
      description: "Minimum acceptable character count"
    },
    {
      name: "maxLength",
      description: "Maximum acceptable character count"
    },
    {
      name: "targetLength",
      description: "Ideal character count (optional)"
    }
  ]
});
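
A hypothetical run with a 100 to 500 character window and a 250 character target, following the run pattern used in the keyword example:
// Hypothetical usage: accept 100-500 characters, ideally around 250.
const lengthScore = await lengthEval.run(agent, backend, testCase, messages, {
  minLength: 100,
  maxLength: 500,
  targetLength: 250
});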

Composite evaluation

Combine multiple evaluation criteria:
const compositeEval = new ExuluEval({
  id: "composite",
  name: "Composite Evaluation",
  description: "Combines multiple evaluation criteria with weights",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    let totalScore = 0;
    let totalWeight = 0;

    // Criterion 1: Contains expected output (weight: 50%)
    const containsExpected = response.includes(testCase.expected_output);
    totalScore += containsExpected ? 50 : 0;
    totalWeight += 50;

    // Criterion 2: Reasonable length (weight: 20%)
    const isReasonableLength = response.length >= 50 && response.length <= 500;
    totalScore += isReasonableLength ? 20 : 0;
    totalWeight += 20;

    // Criterion 3: Uses tools if expected (weight: 30%)
    const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
    const expectedTools = testCase.expected_tools || [];
    if (expectedTools.length > 0) {
      const toolsUsed = expectedTools.every(tool =>
        toolCalls.some(call => call.toolName === tool)
      );
      totalScore += toolsUsed ? 30 : 0;
      totalWeight += 30;
    } else {
      totalScore += 30; // No tools expected, full points
      totalWeight += 30;
    }

    return (totalScore / totalWeight) * 100;
  }
});

Queue configuration

Run evaluations as background jobs using ExuluQueues:
import { ExuluEval, ExuluQueues } from "@exulu/backend";

const backgroundEval = new ExuluEval({
  id: "background_eval",
  name: "Background Evaluation",
  description: "Runs as queued job",
  llm: true,
  execute: async ({ backend, messages, testCase }) => {
    // Long-running evaluation logic
    return 85;
  },
  // Build the queue config promise without top-level await so this works
  // in any module context
  queue: ExuluQueues.getConnection().then((connection) => ({
    connection,
    name: "evaluations",
    prefix: "{exulu}",
    defaultJobOptions: {
      attempts: 3,
      backoff: {
        type: "exponential",
        delay: 2000
      },
      removeOnComplete: true,
      removeOnFail: false
    }
  }))
});

Advanced patterns

Semantic similarity evaluation

Use embeddings to measure semantic similarity:
import { ExuluEval, ExuluEmbedder, ExuluVariables } from "@exulu/backend";

const semanticSimilarityEval = new ExuluEval({
  id: "semantic_similarity",
  name: "Semantic Similarity",
  description: "Measures semantic similarity using embeddings",
  llm: false,
  execute: async ({ messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const embedder = new ExuluEmbedder({
      id: "eval_embedder",
      name: "Evaluation Embedder",
      provider: "openai",
      model: "text-embedding-3-small",
      vectorDimensions: 1536,
      authenticationInformation: await ExuluVariables.get("openai_api_key")
    });

    const [responseEmb, expectedEmb] = await embedder.generate([
      response,
      testCase.expected_output
    ]);

    // Cosine similarity ranges from -1 to 1
    const similarity = cosineSimilarity(responseEmb, expectedEmb);

    // Clamp negative similarity to 0, then scale to 0-100
    return Math.max(0, similarity) * 100;
  }
});

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

Multi-aspect LLM judge

Evaluate multiple aspects separately:
const multiAspectJudgeEval = new ExuluEval({
  id: "multi_aspect_judge",
  name: "Multi-Aspect LLM Judge",
  description: "Evaluates multiple aspects with separate LLM calls",
  llm: true,
  execute: async ({ backend, messages, testCase, config }) => {
    const lastMessage = messages[messages.length - 1];
    const response = lastMessage?.content || "";

    const aspects = [
      {
        name: "accuracy",
        weight: 40,
        prompt: "Rate the accuracy of this response (0-100):"
      },
      {
        name: "clarity",
        weight: 30,
        prompt: "Rate the clarity and readability (0-100):"
      },
      {
        name: "completeness",
        weight: 30,
        prompt: "Rate how complete the response is (0-100):"
      }
    ];

    let totalScore = 0;
    let totalWeight = 0;

    for (const aspect of aspects) {
      const judgePrompt = `
${aspect.prompt}

Expected: ${testCase.expected_output}
Actual: ${response}

Respond with ONLY a number 0-100.
      `.trim();

      const result = await backend.generateSync({
        prompt: judgePrompt,
        agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
        statistics: { label: "eval", trigger: "multi_aspect_judge" }
      });

      const score = parseInt(result.text.trim());
      if (!isNaN(score)) {
        totalScore += Math.max(0, Math.min(100, score)) * aspect.weight;
        totalWeight += aspect.weight;
      }
    }

    return totalWeight > 0 ? totalScore / totalWeight : 0;
  },
  config: [
    {
      name: "judgeAgentId",
      description: "Agent ID for LLM judge"
    }
  ]
});

A/B testing evaluation

Compare two agent configurations:
async function compareAgents(
  agentA: Agent,
  agentB: Agent,
  backendA: ExuluAgent,
  backendB: ExuluAgent,
  testCases: TestCase[],
  evals: ExuluEval[]
) {
  const resultsA: { testCase: string; eval: string; score: number }[] = [];
  const resultsB: { testCase: string; eval: string; score: number }[] = [];

  for (const testCase of testCases) {
    // Generate response from Agent A
    const responseA = await backendA.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agentA.id),
      statistics: { label: "ab_test", trigger: "test" }
    });

    const messagesA: UIMessage[] = [
      ...testCase.inputs,
      { role: "assistant", content: responseA.text }
    ];

    // Generate response from Agent B
    const responseB = await backendB.generateSync({
      prompt: testCase.inputs[testCase.inputs.length - 1].content,
      agentInstance: await loadAgent(agentB.id),
      statistics: { label: "ab_test", trigger: "test" }
    });

    const messagesB: UIMessage[] = [
      ...testCase.inputs,
      { role: "assistant", content: responseB.text }
    ];

    // Run evaluations on both
    for (const evaluation of evals) {
      const scoreA = await evaluation.run(agentA, backendA, testCase, messagesA);
      const scoreB = await evaluation.run(agentB, backendB, testCase, messagesB);

      resultsA.push({ testCase: testCase.name, eval: evaluation.name, score: scoreA });
      resultsB.push({ testCase: testCase.name, eval: evaluation.name, score: scoreB });
    }
  }

  // Calculate averages
  const avgA = resultsA.reduce((sum, r) => sum + r.score, 0) / resultsA.length;
  const avgB = resultsB.reduce((sum, r) => sum + r.score, 0) / resultsB.length;

  return {
    agentA: { results: resultsA, average: avgA },
    agentB: { results: resultsB, average: avgB },
    winner: avgA > avgB ? "Agent A" : "Agent B"
  };
}
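
A hypothetical invocation, reusing evaluations defined earlier on this page (the agents, backends, and test cases are assumed to come from your own setup):
// Hypothetical usage: run the A/B comparison with two of the evaluations
// defined earlier and print the outcome.
const report = await compareAgents(
  agentA,
  agentB,
  backendA,
  backendB,
  testCases,
  [exactMatchEval, llmJudgeEval]
);
console.log(
  `${report.winner}: ${report.agentA.average.toFixed(1)} vs ${report.agentB.average.toFixed(1)}`
);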

Best practices

Weighted scoring: For composite evaluations, use weighted scoring to prioritize important criteria.
Error handling: Always handle errors in execute functions; return 0 or throw a descriptive error (see the sketch after this list).
LLM judge reliability: LLM judges can be inconsistent. Run them multiple times and average the scores (see the stability sketch under LLM-as-judge evaluation), or use temperature=0 for more deterministic results.
Config validation: Validate config parameters at the start of execute functions to provide clear error messages.
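
A minimal error-handling sketch for the practice above (catching and returning 0 is one reasonable policy; rethrowing a descriptive error is the other):
// Error-handling sketch: unexpected failures are logged and scored 0 so one
// bad test case cannot crash an entire evaluation run.
const safeExactMatchEval = new ExuluEval({
  id: "safe_exact_match",
  name: "Safe Exact Match",
  description: "Exact match that degrades to 0 on unexpected errors",
  llm: false,
  execute: async ({ messages, testCase }) => {
    try {
      const response = messages[messages.length - 1]?.content || "";
      return response === testCase.expected_output ? 100 : 0;
    } catch (error) {
      console.warn(
        `safe_exact_match failed: ${error instanceof Error ? error.message : String(error)}`
      );
      return 0;
    }
  }
});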

Next steps

API reference: Explore all methods and properties
Overview: Learn about evaluation concepts
ExuluQueues: Run evaluations as background jobs
ExuluAgent: Create agents to evaluate