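Writing Genkit evaluators

You can extend Firebase Genkit to support custom evaluation, using either an LLM as a judge or programmatic (heuristic) evaluation.

Evaluator definition

Evaluators are functions that assess an LLM's response. There are two main approaches to automated evaluation: heuristic evaluation and LLM-based evaluation. In the heuristic approach, you define a deterministic function. By contrast, in an LLM-based evaluation, the content is fed back to an LLM, and the LLM is asked to score the output according to criteria set in a prompt.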
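
As a rough illustration of the difference, the sketch below models both approaches in plain TypeScript. The names `SimpleScore`, `maxLengthHeuristic`, and `buildJudgePrompt` are hypothetical and not part of Genkit. The heuristic is a deterministic function of the output, while the LLM-based path only builds the prompt that would carry the grading criteria to a judge model (the model call itself is left out).

```typescript
// Hypothetical local type for illustration; Genkit's real Score type is richer.
type SimpleScore = { score: boolean | string; reasoning: string };

// Heuristic approach: a deterministic function of the output.
function maxLengthHeuristic(output: string, maxChars: number): SimpleScore {
  const ok = output.length <= maxChars;
  return {
    score: ok,
    reasoning: ok
      ? `Output has ${output.length} characters (limit ${maxChars})`
      : `Output exceeds the ${maxChars}-character limit`,
  };
}

// LLM-based approach: the criteria live in a prompt sent to a judge model.
// Only prompt construction is shown; calling a model is out of scope here.
function buildJudgePrompt(criteria: string, output: string): string {
  return `${criteria}\n\nOutput: ${output}\nResponse:`;
}

const heuristicResult = maxLengthHeuristic('Chicken parm sandwich', 40);
const judgePrompt = buildJudgePrompt(
  'Assess whether the output is concise. Answer "yes" or "no".',
  'Chicken parm sandwich'
);
console.log(heuristicResult, judgePrompt);
```

The same input always produces the same heuristic score, whereas a judge model's answer to the prompt may vary between runs.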

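The ai.defineEvaluator method, which you use to define an evaluator action in Genkit, supports both approaches. This document walks through examples of using this method for heuristic and LLM-based evaluations.

LLM-based evaluators

An LLM-based evaluator uses an LLM to evaluate the input, context, and output of your generative AI feature.

LLM-based evaluators in Genkit are made up of 3 components:

A prompt
A scoring function
An evaluator action

Define the prompt

For this example, the evaluator uses an LLM to determine whether a food (the output) is delicious. First, provide context to the LLM, then describe what you want it to do, and finally, give it a few examples to base its response on.

Genkit's definePrompt utility provides an easy way to define prompts with input and output validation. The following code shows how to set up an evaluation prompt with definePrompt.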

import { Genkit, z } from 'genkit';

const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;

const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(DELICIOUSNESS_VALUES),
});

function getDeliciousnessPrompt(ai: Genkit) {
  return ai.definePrompt({
      name: 'deliciousnessPrompt',
      input: {
        schema: z.object({
          responseToTest: z.string(),
        }),
      },
      output: {
        schema: DeliciousnessDetectionResponseSchema,
      },
      prompt: `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

      Examples:
      Output: Chicken parm sandwich
      Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

      Output: Boston Logan Airport tarmac
      Response: { "reason": "Not edible.", "verdict": "no" }

      Output: A juicy piece of gossip
      Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

      New Output: {{ responseToTest }}
      Response:
      `
  });
}

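Define the scoring function

Define a function that takes an example containing output, as required by the prompt, and scores the result. Genkit test cases include input as a required field, with output and context as optional fields. It is the evaluator's responsibility to validate that all fields required for evaluation are present.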

import { ModelArgument } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
 * Score an individual test case for deliciousness.
 */
export async function deliciousnessScore<
  CustomModelOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judgeLlm: ModelArgument<CustomModelOptions>,
  dataPoint: BaseEvalDataPoint,
  judgeConfig?: CustomModelOptions
): Promise<Score> {
  const d = dataPoint;
  // Validate the input has required fields
  if (!d.output) {
    throw new Error('Output is required for Deliciousness detection');
  }

  // Hydrate the prompt and generate an evaluation result
  const deliciousnessPrompt = getDeliciousnessPrompt(ai);
  const response = await deliciousnessPrompt(
    {
      responseToTest: d.output as string,
    },
    {
      model: judgeLlm,
      config: judgeConfig,
    }
  );

  // Parse the output
  const parsedResponse = response.output;
  if (!parsedResponse) {
    throw new Error(`Unable to parse evaluator response: ${response.text}`);
  }

  // Return a scored response
  return {
    score: parsedResponse.verdict,
    details: { reasoning: parsedResponse.reason },
  };
}
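
When several evaluators each need a different subset of the optional fields, the presence check at the top of the scoring function can be factored into a small helper. The sketch below is one possible shape, not a Genkit API: `DataPointLike` is a local stand-in for the fields of BaseEvalDataPoint, and `requireFields` is a hypothetical name.

```typescript
// Local stand-in mirroring the fields of BaseEvalDataPoint shown in this guide.
interface DataPointLike {
  testCaseId: string;
  input: unknown;
  output?: unknown;
  context?: unknown[];
  reference?: unknown;
}

type OptionalField = 'output' | 'context' | 'reference';

// Hypothetical helper: throws one descriptive error listing every missing field.
function requireFields(d: DataPointLike, fields: OptionalField[]): void {
  const missing = fields.filter((f) => d[f] === undefined || d[f] === null);
  if (missing.length > 0) {
    throw new Error(
      `Test case ${d.testCaseId} is missing required field(s): ${missing.join(', ')}`
    );
  }
}

// A complete data point passes silently.
const complete: DataPointLike = { testCaseId: 'a', input: 'q', output: 'Mango' };
requireFields(complete, ['output']);

// An incomplete data point reports all missing fields at once.
let message = '';
try {
  requireFields({ testCaseId: 'b', input: 'q' }, ['output', 'context']);
} catch (e) {
  message = (e as Error).message;
}
console.log(message);
```

Unlike the `!d.output` check above, this helper treats an empty string as present; whether that is desirable depends on the evaluator.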

Define the evaluator action

The final step is to write a function that defines the EvaluatorAction.

import { EvaluatorAction } from 'genkit/evaluator';

/**
 * Create the Deliciousness evaluator action.
 */
export function createDeliciousnessEvaluator<
  ModelCustomOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judge: ModelArgument<ModelCustomOptions>,
  judgeConfig?: z.infer<ModelCustomOptions>
): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/deliciousnessEvaluator`,
      displayName: 'Deliciousness',
      definition: 'Determines if output is considered delicious.',
      isBilled: true,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await deliciousnessScore(ai, judge, datapoint, judgeConfig);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

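The defineEvaluator method is similar to other Genkit constructors such as defineFlow and defineRetriever. It requires an EvaluatorFn to be provided as a callback. The EvaluatorFn accepts a BaseEvalDataPoint object, which corresponds to a single entry in the dataset under evaluation, along with an optional custom-options parameter if specified. The function processes the data point and returns an EvalResponse object.

The Zod schemas for BaseEvalDataPoint and EvalResponse are as follows.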

`BaseEvalDataPoint`

export const BaseEvalDataPoint = z.object({
  testCaseId: z.string(),
  input: z.unknown(),
  output: z.unknown().optional(),
  context: z.array(z.unknown()).optional(),
  reference: z.unknown().optional(),
  traceIds: z.array(z.string()).optional(),
});

`EvalResponse`

export const EvalResponse = z.object({
  sampleIndex: z.number().optional(),
  testCaseId: z.string(),
  traceId: z.string().optional(),
  spanId: z.string().optional(),
  evaluation: z.union([ScoreSchema, z.array(ScoreSchema)]),
});

`ScoreSchema`

const ScoreSchema = z.object({
  id: z.string().describe('Optional ID to differentiate multiple scores').optional(),
  score: z.union([z.number(), z.string(), z.boolean()]).optional(),
  error: z.string().optional(),
  details: z
    .object({
      reasoning: z.string().optional(),
    })
    .passthrough()
    .optional(),
});
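
To make these shapes concrete, the sketch below builds a few response objects by hand. It uses local TypeScript types (`ScoreLike`, `EvalResponseLike`) that mirror the schemas above without depending on Zod or Genkit; the type names, the sample values, and the `toScoreArray` helper are illustrative assumptions.

```typescript
// Local types mirroring ScoreSchema and EvalResponse (illustration only).
type ScoreLike = {
  id?: string;
  score?: number | string | boolean;
  error?: string;
  details?: { reasoning?: string; [key: string]: unknown };
};

type EvalResponseLike = {
  testCaseId: string;
  evaluation: ScoreLike | ScoreLike[];
};

// A single score with reasoning.
const single: EvalResponseLike = {
  testCaseId: 'case_1',
  evaluation: { score: 'yes', details: { reasoning: 'Sounds sweet and juicy.' } },
};

// Several scores for one test case, told apart by `id`.
// `details` may carry extra keys because the schema uses passthrough().
const multiple: EvalResponseLike = {
  testCaseId: 'case_1',
  evaluation: [
    { id: 'verdict', score: 'yes' },
    { id: 'confidence', score: 0.9, details: { reasoning: 'Clear case.', judge: 'stub' } },
  ],
};

// A failed evaluation can report an error instead of a score.
const failed: EvalResponseLike = {
  testCaseId: 'case_2',
  evaluation: { error: 'Judge returned no output' },
};

// Hypothetical helper: normalize `evaluation` to an array for uniform handling.
function toScoreArray(r: EvalResponseLike): ScoreLike[] {
  return Array.isArray(r.evaluation) ? r.evaluation : [r.evaluation];
}

console.log(toScoreArray(single).length, toScoreArray(multiple).length);
```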

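The options object passed to defineEvaluator lets you provide a name, a user-readable display name, and a definition for the evaluator. The display name and definition are shown alongside evaluation results in the Developer UI. The object also has an optional isBilled field that marks whether the evaluator can incur charges (for example, because it uses a billed LLM or API). If an evaluator is billed, the user is prompted for confirmation in the CLI before the evaluation runs. This step helps guard against unintended expenses.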

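Heuristic evaluators

A heuristic evaluator can be any function that evaluates the input, context, or output of your generative AI feature.

Heuristic evaluators in Genkit are made up of 2 components:

A scoring function
An evaluator action

Define the scoring function

As with the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need a judge LLM.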

import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
  /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;

/**
 * Scores whether a datapoint output contains a US Phone number.
 */
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const d = dataPoint;
  if (!d.output || typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = US_PHONE_REGEX.test(d.output as string);
  const reasoning = matches
    ? `Output matched US_PHONE_REGEX`
    : `Output did not match US_PHONE_REGEX`;
  return {
    score: matches,
    details: { reasoning },
  };
}
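
Before relying on a regex-based score, it helps to check the pattern against a few strings. The sketch below reuses the same regular expression on made-up sample inputs. Because the pattern is not anchored, any run of ten digits inside a longer number also counts as a match, which the last sample demonstrates.

```typescript
// Same pattern as US_PHONE_REGEX above.
const US_PHONE_REGEX =
  /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;

// Made-up sample outputs paired with the result the pattern gives for each.
const samples: Array<[string, boolean]> = [
  ['Call me at (415) 555-0132.', true],
  ['Reach us at 415.555.0132', true],
  ['No phone number here', false],
  ['Order #12345 shipped', false],
  // False positive: a 13-digit ticket number contains a 10-digit run.
  ['Ticket 1234567890123 was closed', true],
];

const results = samples.map(([text]) => US_PHONE_REGEX.test(text));

samples.forEach(([text, expected], i) => {
  console.log(`${JSON.stringify(text)} -> ${results[i]} (expected ${expected})`);
});
```

If such false positives matter for your use case, a stricter pattern (for example, one with word boundaries) may be a better fit.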

Define the evaluator action

import { Genkit } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Configures a regex evaluator to match a US phone number.
 */
export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myCustomEvals/usPhoneRegexEvaluator`,
      displayName: "Regex Match for US PHONE NUMBER",
      definition: "Uses Regex to check if output matches a US phone number",
      isBilled: false,
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await usPhoneRegexScore(datapoint);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

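Putting it together

Plugin definition

Plugins are registered with the framework by installing them when you initialize Genkit. To define a new plugin, use the genkitPlugin helper method to instantiate all Genkit actions within the plugin context.

This code sample shows two evaluators: the LLM-based deliciousness evaluator and the regex-based US phone number evaluator. Instantiating these evaluators within the plugin context registers them with the plugin.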

import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';

export function myCustomEvals<
  ModelCustomOptions extends z.ZodTypeAny
>(options: {
  judge: ModelArgument<ModelCustomOptions>;
  judgeConfig?: ModelCustomOptions;
}): GenkitPlugin {
  // Define the new plugin
  return genkitPlugin("myCustomEvals", async (ai: Genkit) => {
    const { judge, judgeConfig } = options;

    // The plugin instantiates our custom evaluators within the context
    // of the `ai` object, making them available
    // throughout our Genkit application.
    createDeliciousnessEvaluator(ai, judge, judgeConfig);
    createUSPhoneRegexEvaluator(ai);
  });
}
export default myCustomEvals;

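Configure Genkit

Add the myCustomEvals plugin to your Genkit configuration.

For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.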

import { genkit } from 'genkit';
import { gemini15Pro, googleAI } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [
    googleAI(),
    ...
    myCustomEvals({
      judge: gemini15Pro,
    }),
  ],
  ...
});
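
The configuration above does not show the safety settings themselves. A minimal sketch of a permissive judge configuration follows, assuming the harm category names and the `BLOCK_NONE` threshold used by the Gemini API's safety settings; confirm both against the plugin version you use. The constant names are hypothetical.

```typescript
// Assumed harm categories, taken from the Gemini API's safety settings.
const HARM_CATEGORIES = [
  'HARM_CATEGORY_HATE_SPEECH',
  'HARM_CATEGORY_DANGEROUS_CONTENT',
  'HARM_CATEGORY_HARASSMENT',
  'HARM_CATEGORY_SEXUALLY_EXPLICIT',
] as const;

// Hypothetical config object: one entry per category, each set to never block.
const permissiveJudgeConfig = {
  safetySettings: HARM_CATEGORIES.map((category) => ({
    category,
    threshold: 'BLOCK_NONE' as const,
  })),
};

// Intended use (not executed here):
//   myCustomEvals({ judge: gemini15Pro, judgeConfig: permissiveJudgeConfig })
console.log(JSON.stringify(permissiveJudgeConfig, null, 2));
```

Relaxing the filters applies only to the judge model, so the evaluator can score content that the application's own model would normally refuse to process.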

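Use your custom evaluators

Once you instantiate your custom evaluators within the Genkit app context (either through a plugin or directly), they are ready to use. The following example shows how to try out the deliciousness evaluator with a few sample inputs and outputs.

1. Create a JSON file named deliciousness_dataset.json with the following content: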

[
  {
    "testCaseId": "delicious_mango",
    "input": "What is a super delicious fruit",
    "output": "A perfectly ripe mango – sweet, juicy, and with a hint of tropical sunshine."
  },
  {
    "testCaseId": "disgusting_soggy_cereal",
    "input": "What is something that is tasty when fresh but less tasty after some time?",
    "output": "Stale, flavorless cereal that's been sitting in the box too long."
  }
]
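
A quick sanity check on the dataset before running an evaluation can catch missing fields early. The sketch below is a hypothetical helper (`findDatasetProblems` is not part of Genkit) that flags entries without input or output and duplicate IDs. It treats testCaseId as optional, which is an assumption, and parses an inline JSON string so that it runs on its own; in practice you would read the file from disk instead.

```typescript
type DatasetEntry = { testCaseId?: string; input?: unknown; output?: unknown };

// Hypothetical helper: returns a list of human-readable problems, empty if none.
function findDatasetProblems(entries: DatasetEntry[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  entries.forEach((entry, i) => {
    if (entry.input === undefined) {
      problems.push(`entry ${i}: missing "input"`);
    }
    if (entry.output === undefined) {
      // The deliciousness evaluator throws when output is absent.
      problems.push(`entry ${i}: missing "output"`);
    }
    if (entry.testCaseId !== undefined) {
      if (seen.has(entry.testCaseId)) {
        problems.push(`entry ${i}: duplicate testCaseId "${entry.testCaseId}"`);
      }
      seen.add(entry.testCaseId);
    }
  });
  return problems;
}

// Inline stand-in for the contents of deliciousness_dataset.json.
const datasetJson = `[
  { "testCaseId": "case_a", "input": "What is a super delicious fruit", "output": "A ripe mango." },
  { "testCaseId": "case_b", "input": "Name a stale food", "output": "Stale cereal." }
]`;
const dataset: DatasetEntry[] = JSON.parse(datasetJson);

console.log(findDatasetProblems(dataset));
```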

2. Use the Genkit CLI to run the evaluator against these test cases.

# Start your genkit runtime
genkit start -- <command to start your app>
genkit eval:run deliciousness_dataset.json --evaluators=myCustomEvals/deliciousnessEvaluator

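3. Navigate to `localhost:4000/evaluate` to view your results in the Genkit UI.

Note that confidence in a custom evaluator grows as you benchmark it against standard datasets or approaches. Iterate on the results of such benchmarks to improve your evaluator's performance until it reaches the targeted level of quality.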
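
One simple way to benchmark an evaluator is to compare its verdicts with human labels on the same test cases. The sketch below computes an agreement rate and lists the disagreements. The data is made up for illustration, and the names `LabeledCase`, `agreementRate`, and `disagreements` are hypothetical.

```typescript
type Verdict = 'yes' | 'no' | 'maybe';

interface LabeledCase {
  testCaseId: string;
  humanLabel: Verdict;
  evaluatorVerdict: Verdict;
}

// Fraction of cases where the evaluator agrees with the human label.
function agreementRate(cases: LabeledCase[]): number {
  if (cases.length === 0) {
    throw new Error('At least one labeled case is required');
  }
  const agree = cases.filter((c) => c.humanLabel === c.evaluatorVerdict).length;
  return agree / cases.length;
}

// IDs of the cases worth inspecting when refining the prompt or heuristic.
function disagreements(cases: LabeledCase[]): string[] {
  return cases
    .filter((c) => c.humanLabel !== c.evaluatorVerdict)
    .map((c) => c.testCaseId);
}

// Made-up benchmark data for illustration only.
const benchmark: LabeledCase[] = [
  { testCaseId: 'mango', humanLabel: 'yes', evaluatorVerdict: 'yes' },
  { testCaseId: 'cereal', humanLabel: 'no', evaluatorVerdict: 'no' },
  { testCaseId: 'gossip', humanLabel: 'maybe', evaluatorVerdict: 'no' },
  { testCaseId: 'tarmac', humanLabel: 'no', evaluatorVerdict: 'no' },
];

console.log(agreementRate(benchmark), disagreements(benchmark));
```

Tracking this rate across prompt revisions gives a rough signal of whether a change improved the evaluator.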