Constructor parameters
ExuluEval requires specific configuration to define evaluation behavior:
new ExuluEval({
id: string;
name: string;
description: string;
llm: boolean;
execute: (params) => Promise<number>;
config?: { name: string; description: string }[];
queue?: Promise<ExuluQueueConfig>;
})
id: Unique identifier for the evaluation function
name: Human-readable name for the evaluation
description: Description of what this evaluation measures
llm: Whether this evaluation uses an LLM for scoring (LLM-as-judge)
execute: Function that performs the evaluation and returns a score from 0 to 100
config: Optional configuration parameters for the evaluation function
queue: Optional queue configuration for running evaluations as background jobs
Execute function
The execute function receives evaluation parameters and must return a score between 0 and 100:
execute: async ({
agent, // Agent database record
backend, // ExuluAgent instance
messages, // Conversation messages
testCase, // Test case with expected output
config // Optional runtime configuration
}) => {
// Your evaluation logic
return score; // Must be 0-100
}
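Scores outside the 0-100 range should be clamped before returning. As a sketch (assuming the framework stores whatever execute returns without clamping), a small helper keeps results in bounds; clampScore is a hypothetical name, not part of the ExuluEval API:

```typescript
// Hypothetical helper (not part of @exulu/backend): clamp a raw score to 0-100.
function clampScore(raw: number): number {
  if (Number.isNaN(raw)) return 0; // treat NaN as a failed evaluation
  return Math.max(0, Math.min(100, raw));
}

console.log(clampScore(150)); // 100
console.log(clampScore(-5));  // 0
```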
Parameters
agent: The agent database record being evaluated
interface Agent {
id: string;
name: string;
description: string;
// ... other agent properties
}
backend: ExuluAgent instance for generating responses or using LLM-as-judge
messages: Array of conversation messages including inputs and the generated response
interface UIMessage {
role: "user" | "assistant" | "system";
content: string;
toolInvocations?: ToolInvocation[];
}
testCase: Test case containing inputs and expected outputs
interface TestCase {
id: string;
name: string;
description?: string;
inputs: UIMessage[];
expected_output: string;
expected_tools?: string[];
expected_knowledge_sources?: string[];
expected_agent_tools?: string[];
}
config: Runtime configuration values (optional)
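For reference, a test case literal matching the TestCase interface above might look like this (all field values are illustrative, not taken from a real test suite):

```typescript
// Illustrative test case matching the TestCase interface (values are made up).
const sampleTestCase = {
  id: "tc_weather_sf",
  name: "SF weather query",
  description: "Checks that the agent reports SF weather using the right tool",
  inputs: [{ role: "user" as const, content: "What's the weather in San Francisco?" }],
  expected_output: "68°F and sunny",
  expected_tools: ["get_weather"]
};

console.log(sampleTestCase.expected_tools.length); // 1
```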
Configuration patterns
Basic exact match evaluation
import { ExuluEval } from "@exulu/backend";
const exactMatchEval = new ExuluEval({
id: "exact_match",
name: "Exact Match",
description: "Returns 100 if response exactly matches expected output, 0 otherwise",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
return response === testCase.expected_output ? 100 : 0;
}
});
Partial match with scoring
const partialMatchEval = new ExuluEval({
id: "partial_match",
name: "Partial Match",
description: "Scores based on how much of expected output appears in response",
llm: false,
execute: async ({ messages, testCase }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const expected = testCase.expected_output.toLowerCase();
// Split into words
const expectedWords = expected.split(/\s+/);
const matchedWords = expectedWords.filter(word =>
response.includes(word)
);
return (matchedWords.length / expectedWords.length) * 100;
}
});
Keyword presence evaluation
const keywordEval = new ExuluEval({
id: "keyword_presence",
name: "Keyword Presence",
description: "Checks if response contains required keywords",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content?.toLowerCase() || "";
const keywords = config?.keywords || [];
if (keywords.length === 0) return 100;
const foundKeywords = keywords.filter(kw =>
response.includes(kw.toLowerCase())
);
return (foundKeywords.length / keywords.length) * 100;
},
config: [
{
name: "keywords",
description: "Array of keywords that should appear in response"
}
]
});
// Run with config
const score = await keywordEval.run(
agent,
backend,
testCase,
messages,
{ keywords: ["weather", "temperature", "San Francisco"] }
);
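The scoring logic above can also be exercised standalone. This sketch mirrors the execute body without the ExuluEval wrapper (keywordScore is a hypothetical helper introduced here for illustration):

```typescript
// Standalone version of the keyword-coverage scoring used in keywordEval.
function keywordScore(response: string, keywords: string[]): number {
  if (keywords.length === 0) return 100; // nothing required, full score
  const lower = response.toLowerCase();
  const found = keywords.filter(kw => lower.includes(kw.toLowerCase()));
  return (found.length / keywords.length) * 100;
}

console.log(keywordScore("Sunny weather in San Francisco", ["weather", "San Francisco"])); // 100
console.log(keywordScore("Sunny weather", ["weather", "temperature"])); // 50
```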
LLM-as-judge evaluation
const llmJudgeEval = new ExuluEval({
id: "llm_judge",
name: "LLM Judge",
description: "Uses an LLM to evaluate response quality",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const judgePrompt = `
You are an expert evaluator. Rate the following response on a scale of 0-100.
Test Case: ${testCase.name}
Description: ${testCase.description || "N/A"}
Expected Output:
${testCase.expected_output}
Actual Response:
${response}
Criteria:
1. Accuracy: Does it match the expected output?
2. Completeness: Does it address all required aspects?
3. Clarity: Is it well-structured and understandable?
4. Relevance: Does it stay on topic?
Respond with ONLY a number from 0 to 100. No explanation.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "llm_judge" }
});
const score = parseInt(result.text.trim());
if (isNaN(score)) {
console.warn(`LLM judge returned non-numeric: ${result.text}`);
return 0;
}
return Math.max(0, Math.min(100, score));
},
config: [
{
name: "judgeAgentId",
description: "Agent ID to use for evaluation (must support text generation)"
}
]
});
Tool usage evaluation
const toolUsageEval = new ExuluEval({
id: "tool_usage",
name: "Tool Usage",
description: "Checks if agent used expected tools",
llm: false,
execute: async ({ messages, testCase }) => {
// Extract tool calls from conversation
const toolCalls = messages
.flatMap(msg => msg.toolInvocations || [])
.map(inv => inv.toolName);
const expectedTools = testCase.expected_tools || [];
// If no tools expected, check that no tools were used
if (expectedTools.length === 0) {
return toolCalls.length === 0 ? 100 : 0;
}
// Check if all expected tools were used
const usedExpected = expectedTools.filter(tool =>
toolCalls.includes(tool)
);
return (usedExpected.length / expectedTools.length) * 100;
}
});
Regex pattern matching
const regexMatchEval = new ExuluEval({
id: "regex_match",
name: "Regex Pattern Match",
description: "Checks if response matches regex pattern",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const pattern = config?.pattern;
if (!pattern) {
throw new Error("Regex pattern required in config");
}
const regex = new RegExp(pattern, config?.flags || "");
return regex.test(response) ? 100 : 0;
},
config: [
{
name: "pattern",
description: "Regex pattern to match"
},
{
name: "flags",
description: "Regex flags (e.g., 'i' for case-insensitive)"
}
]
});
// Run with regex config
const score = await regexMatchEval.run(
agent,
backend,
testCase,
messages,
{
pattern: "\\d{2}°[FC]", // Matches temperature like "68°F"
flags: "i"
}
);
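The temperature pattern passed above can be sanity-checked on its own before wiring it into an evaluation:

```typescript
// Sanity check of the temperature pattern used with regexMatchEval above.
const tempPattern = new RegExp("\\d{2}°[FC]", "i");

console.log(tempPattern.test("It is currently 68°F in San Francisco")); // true
console.log(tempPattern.test("It is warm outside")); // false
```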
Length-based evaluation
const lengthEval = new ExuluEval({
id: "response_length",
name: "Response Length",
description: "Scores based on response length within acceptable range",
llm: false,
execute: async ({ messages, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const length = response.length;
const minLength = config?.minLength || 0;
const maxLength = config?.maxLength || Infinity;
const targetLength = config?.targetLength;
// If within range, score based on proximity to target
if (length < minLength) {
return Math.max(0, (length / minLength) * 100);
}
if (length > maxLength) {
return Math.max(0, 100 - ((length - maxLength) / maxLength) * 100);
}
// Within range
if (targetLength) {
const deviation = Math.abs(length - targetLength);
const maxDeviation = Math.max(
targetLength - minLength,
maxLength - targetLength
);
return Math.max(0, 100 - (deviation / maxDeviation) * 50);
}
return 100;
},
config: [
{
name: "minLength",
description: "Minimum acceptable character count"
},
{
name: "maxLength",
description: "Maximum acceptable character count"
},
{
name: "targetLength",
description: "Ideal character count (optional)"
}
]
});
Composite evaluation
Combine multiple evaluation criteria:
const compositeEval = new ExuluEval({
id: "composite",
name: "Composite Evaluation",
description: "Combines multiple evaluation criteria with weights",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
let totalScore = 0;
let totalWeight = 0;
// Criterion 1: Contains expected output (weight: 50%)
const containsExpected = response.includes(testCase.expected_output);
totalScore += containsExpected ? 50 : 0;
totalWeight += 50;
// Criterion 2: Reasonable length (weight: 20%)
const isReasonableLength = response.length >= 50 && response.length <= 500;
totalScore += isReasonableLength ? 20 : 0;
totalWeight += 20;
// Criterion 3: Uses tools if expected (weight: 30%)
const toolCalls = messages.flatMap(msg => msg.toolInvocations || []);
const expectedTools = testCase.expected_tools || [];
if (expectedTools.length > 0) {
const toolsUsed = expectedTools.every(tool =>
toolCalls.some(call => call.toolName === tool)
);
totalScore += toolsUsed ? 30 : 0;
totalWeight += 30;
} else {
totalScore += 30; // No tools expected, full points
totalWeight += 30;
}
return (totalScore / totalWeight) * 100;
}
});
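The composite pattern above accumulates weighted points inline. An equivalent, reusable formulation (a sketch, not part of the ExuluEval API) takes per-criterion 0-100 scores and weights:

```typescript
// Generic weighted average over 0-100 sub-scores (weights need not sum to 100).
function weightedAverage(parts: { score: number; weight: number }[]): number {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  if (totalWeight === 0) return 0;
  return parts.reduce((sum, p) => sum + p.score * p.weight, 0) / totalWeight;
}

console.log(weightedAverage([
  { score: 100, weight: 50 }, // contains expected output
  { score: 0, weight: 20 },   // length out of range
  { score: 100, weight: 30 }  // tools used as expected
])); // 80
```

Keeping each criterion as a 0-100 score makes individual criteria easy to report alongside the combined result.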
Queue configuration
Run evaluations as background jobs using ExuluQueues:
import { ExuluEval, ExuluQueues } from "@exulu/backend";
const backgroundEval = new ExuluEval({
id: "background_eval",
name: "Background Evaluation",
description: "Runs as queued job",
llm: true,
execute: async ({ backend, messages, testCase }) => {
// Long-running evaluation logic
return 85;
},
queue: (async () => ({
connection: await ExuluQueues.getConnection(),
name: "evaluations",
prefix: "{exulu}",
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential",
delay: 2000
},
removeOnComplete: true,
removeOnFail: false
}
}))()
});
Advanced patterns
Semantic similarity evaluation
Use embeddings to measure semantic similarity:
import { ExuluEval, ExuluEmbedder, ExuluVariables } from "@exulu/backend";
const semanticSimilarityEval = new ExuluEval({
id: "semantic_similarity",
name: "Semantic Similarity",
description: "Measures semantic similarity using embeddings",
llm: false,
execute: async ({ messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const embedder = new ExuluEmbedder({
id: "eval_embedder",
name: "Evaluation Embedder",
provider: "openai",
model: "text-embedding-3-small",
vectorDimensions: 1536,
authenticationInformation: await ExuluVariables.get("openai_api_key")
});
const [responseEmb, expectedEmb] = await embedder.generate([
response,
testCase.expected_output
]);
// Cosine similarity ranges from -1 to 1; clamp negatives and scale to 0-100
const similarity = cosineSimilarity(responseEmb, expectedEmb);
return Math.max(0, similarity) * 100;
}
});
function cosineSimilarity(a: number[], b: number[]): number {
const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
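A quick, self-contained sanity check of the cosine-similarity math (duplicating the helper so the snippet runs standalone): parallel vectors score 1, orthogonal vectors score 0.

```typescript
// Self-contained copy of cosineSimilarity for a quick sanity check.
function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const magB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (magA * magB);
}

console.log(cosineSim([1, 2, 3], [2, 4, 6])); // ≈ 1 (parallel vectors)
console.log(cosineSim([1, 0], [0, 1]));       // 0 (orthogonal vectors)
```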
Multi-aspect LLM judge
Evaluate multiple aspects separately:
const multiAspectJudgeEval = new ExuluEval({
id: "multi_aspect_judge",
name: "Multi-Aspect LLM Judge",
description: "Evaluates multiple aspects with separate LLM calls",
llm: true,
execute: async ({ backend, messages, testCase, config }) => {
const lastMessage = messages[messages.length - 1];
const response = lastMessage?.content || "";
const aspects = [
{
name: "accuracy",
weight: 40,
prompt: "Rate the accuracy of this response (0-100):"
},
{
name: "clarity",
weight: 30,
prompt: "Rate the clarity and readability (0-100):"
},
{
name: "completeness",
weight: 30,
prompt: "Rate how complete the response is (0-100):"
}
];
let totalScore = 0;
let totalWeight = 0;
for (const aspect of aspects) {
const judgePrompt = `
${aspect.prompt}
Expected: ${testCase.expected_output}
Actual: ${response}
Respond with ONLY a number 0-100.
`.trim();
const result = await backend.generateSync({
prompt: judgePrompt,
agentInstance: await loadAgent(config?.judgeAgentId || "default_judge"),
statistics: { label: "eval", trigger: "multi_aspect_judge" }
});
const score = parseInt(result.text.trim());
if (!isNaN(score)) {
totalScore += Math.max(0, Math.min(100, score)) * aspect.weight;
totalWeight += aspect.weight;
}
}
return totalWeight > 0 ? totalScore / totalWeight : 0;
},
config: [
{
name: "judgeAgentId",
description: "Agent ID for LLM judge"
}
]
});
A/B testing evaluation
Compare two agent configurations:
async function compareAgents(
agentA: Agent,
agentB: Agent,
backendA: ExuluAgent,
backendB: ExuluAgent,
testCases: TestCase[],
evals: ExuluEval[]
) {
const resultsA = [];
const resultsB = [];
for (const testCase of testCases) {
// Generate response from Agent A
const responseA = await backendA.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentA.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesA = [
...testCase.inputs,
{ role: "assistant", content: responseA.text }
];
// Generate response from Agent B
const responseB = await backendB.generateSync({
prompt: testCase.inputs[testCase.inputs.length - 1].content,
agentInstance: await loadAgent(agentB.id),
statistics: { label: "ab_test", trigger: "test" }
});
const messagesB = [
...testCase.inputs,
{ role: "assistant", content: responseB.text }
];
// Run evaluations on both
for (const evaluation of evals) {
const scoreA = await evaluation.run(agentA, backendA, testCase, messagesA);
const scoreB = await evaluation.run(agentB, backendB, testCase, messagesB);
resultsA.push({ testCase: testCase.name, eval: evaluation.name, score: scoreA });
resultsB.push({ testCase: testCase.name, eval: evaluation.name, score: scoreB });
}
}
// Calculate averages
const avgA = resultsA.reduce((sum, r) => sum + r.score, 0) / resultsA.length;
const avgB = resultsB.reduce((sum, r) => sum + r.score, 0) / resultsB.length;
return {
agentA: { results: resultsA, average: avgA },
agentB: { results: resultsB, average: avgB },
winner: avgA > avgB ? "Agent A" : avgB > avgA ? "Agent B" : "Tie"
};
}
Best practices
Weighted scoring: For composite evaluations, use weighted scoring to prioritize important criteria.
Error handling: Always handle errors in execute functions and return 0 or throw descriptive errors.
LLM judge reliability: LLM judges can be inconsistent. Run multiple times or use temperature=0 for deterministic results.
Config validation: Validate config parameters at the start of execute functions to provide clear error messages.
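The config-validation advice can be sketched as a guard at the top of an execute function. This is an illustrative pattern, and validateKeywords is a hypothetical helper, not part of @exulu/backend:

```typescript
// Hypothetical guard: fail fast with a clear message before scoring anything.
function validateKeywords(config: { keywords?: unknown } | undefined): string[] {
  const keywords = config?.keywords;
  if (keywords === undefined) {
    throw new Error("keyword_presence: config.keywords is required");
  }
  if (!Array.isArray(keywords) || !keywords.every(k => typeof k === "string")) {
    throw new Error("keyword_presence: config.keywords must be an array of strings");
  }
  return keywords;
}

console.log(validateKeywords({ keywords: ["weather"] })); // [ 'weather' ]
```

Throwing a descriptive error here surfaces misconfiguration immediately, instead of silently returning a misleading score.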