# Autoevals

Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods including:
- LLM-as-a-judge
- Heuristic (e.g. Levenshtein distance)
- Statistical (e.g. BLEU)
Autoevals is developed by the team at Braintrust.
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.
## Requirements

- Python 3.9 or higher
- Compatible with both OpenAI Python SDK v0.x and v1.x
## Installation
### TypeScript

```bash
npm install autoevals
```

### Python

```bash
pip install autoevals
```
## Getting started
Use Autoevals to model-grade an example LLM completion using the Factuality prompt.
By default, Autoevals uses your `OPENAI_API_KEY` environment variable to authenticate with OpenAI's API.
### Python

```python
from autoevals.llm import *
import asyncio

# Create a new LLM-based evaluator
evaluator = Factuality()

# An example LLM completion to grade
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

# Using the synchronous API
result = evaluator(output, expected, input=input)
print(f"Factuality score (sync): {result.score}")
print(f"Factuality metadata (sync): {result.metadata['rationale']}")

# Using the asynchronous API
async def main():
    result = await evaluator.eval_async(output, expected, input=input)
    print(f"Factuality score (async): {result.score}")
    print(f"Factuality metadata (async): {result.metadata['rationale']}")

# Run the async example
asyncio.run(main())
```
### TypeScript

```typescript
import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();
```
## Using other AI providers
When you use Autoevals, it looks for an `OPENAI_BASE_URL` environment variable to use as the base URL for requests to an OpenAI-compatible API. If `OPENAI_BASE_URL` is not set, it defaults to the Braintrust AI proxy.
If you choose to use the proxy, you'll also get:
- Simplified access to many AI providers
- Reduced costs with automatic request caching
- Increased observability when you enable logging to Braintrust
The proxy is free to use, even if you don't have a Braintrust account.
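To opt into the proxy explicitly, point `OPENAI_BASE_URL` at it before creating an evaluator. A minimal sketch, assuming `https://api.braintrust.dev/v1/proxy` is the proxy endpoint (check Braintrust's docs for the current URL):

```python
import os

# Assumption: Braintrust AI proxy endpoint; autoevals falls back to the proxy
# automatically when OPENAI_BASE_URL is unset.
os.environ["OPENAI_BASE_URL"] = "https://api.braintrust.dev/v1/proxy"
```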
If you have a Braintrust account, you can optionally set the `BRAINTRUST_API_KEY` environment variable instead of `OPENAI_API_KEY` to unlock additional features like logging and monitoring. You can also route requests to supported AI providers and models, or to custom models you have configured in Braintrust.
### Python

```python
# NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
from autoevals.llm import *

# Create an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
evaluator = Factuality(model="claude-3-5-sonnet-latest")

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
```
### TypeScript

```typescript
// NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Run an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
  const result = await Factuality({
    model: "claude-3-5-sonnet-latest",
    output,
    expected,
    input,
  });

  // The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();
```
## Custom client configuration
There are two ways to configure a custom client when you need to use a different OpenAI-compatible API:
- Global configuration: Initialize a client that will be used by all evaluators
- Instance configuration: Configure a client for a specific evaluator
### Global configuration
Set up a client that all your evaluators will use:
#### Python

```python
import openai
import asyncio
from autoevals import init
from autoevals.llm import Factuality

# Register a global client that all evaluators will use
client = init(openai.AsyncOpenAI(base_url="https://api.openai.com/v1/"))

async def main():
    evaluator = Factuality()
    result = await evaluator.eval_async(
        input="What is the speed of light in a vacuum?",
        output="The speed of light in a vacuum is 299,792,458 meters per second.",
        expected="The speed of light in a vacuum is approximately 300,000 kilometers per second.",
    )
    print(f"Factuality score: {result.score}")

asyncio.run(main())
```
#### TypeScript

```typescript
import OpenAI from "openai";
import { init, Factuality } from "autoevals";

// Register a global client that all evaluators will use
const client = new OpenAI({
  baseURL: "https://api.openai.com/v1/",
});

init({ client });

(async () => {
  const result = await Factuality({
    input: "What is the speed of light in a vacuum?",
    output: "The speed of light in a vacuum is 299,792,458 meters per second.",
    expected:
      "The speed of light in a vacuum is approximately 300,000 kilometers per second (or precisely 299,792,458 meters per second).",
  });
  console.log("Factuality Score:", result);
})();
```
### Instance configuration
Configure a client for a specific evaluator instance:
#### Python

```python
import openai
from autoevals.llm import Factuality

# Configure a client for this evaluator instance only
custom_client = openai.OpenAI(base_url="https://custom-api.example.com/v1/")
evaluator = Factuality(client=custom_client)
```
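The configured evaluator can then be called like any other scorer. A minimal usage sketch (the strings mirror the TypeScript tab below and are purely illustrative):

```python
# Score an output against an expected answer via the custom client
result = evaluator(
    "Paris is the capital of France",
    "Paris is the capital of France and has a population of over 2 million",
    input="Tell me about Paris",
)
print(result.score)
```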
#### TypeScript

```typescript
import OpenAI from "openai";
import { Factuality } from "autoevals";

(async () => {
  // Configure a client for this call only
  const customClient = new OpenAI({
    baseURL: "https://custom-api.example.com/v1/",
  });

  const result = await Factuality({
    client: customClient,
    output: "Paris is the capital of France",
    expected:
      "Paris is the capital of France and has a population of over 2 million",
    input: "Tell me about Paris",
  });

  console.log(result);
})();
```
## Using Braintrust with Autoevals (optional)
Once you grade an output using Autoevals, you can optionally use Braintrust to log and compare your evaluation results. This integration is not required to use Autoevals.
### TypeScript

Create a file named `example.eval.js` (it must take the form `*.eval.[ts|tsx|js|jsx]`):

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Autoevals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});
```

Then, run:

```bash
npx braintrust run example.eval.js
```
### Python

Create a file named `eval_example.py` (it must take the form `eval_*.py`):

```python
from braintrust import Eval
from autoevals.llm import Factuality

Eval(
    "Autoevals",
    data=lambda: [
        dict(
            input="Which country has the highest population?",
            expected="China",
        ),
    ],
    task=lambda *args: "People's Republic of China",
    scores=[Factuality],
)
```
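Then, run the eval with the Braintrust CLI. A minimal sketch, assuming the `braintrust eval` subcommand installed by the Braintrust Python SDK (check Braintrust's docs for the exact invocation in your version):

```bash
braintrust eval eval_example.py
```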
## Supported evaluation methods
### LLM-as-a-judge evaluations
- Battle
- Closed QA
- Humor
- Factuality
- Moderation
- Security
- Summarization
- SQL
- Translation
- Fine-tuned binary classifiers
### RAG evaluations
- Context precision
- Context relevancy
- Context recall
- Context entity recall
- Faithfulness
- Answer relevancy
- Answer similarity
- Answer correctness
### Composite evaluations
- Semantic list contains
- JSON validity
### Embedding evaluations
- Embedding similarity
### Heuristic evaluations
- Levenshtein distance
- Exact match
- Numeric difference
- JSON diff
For detailed documentation on all scorers, including parameters, score ranges, and usage examples, see the Scorer Reference.
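As a quick illustration, heuristic scorers run locally, need no API key, and follow the same calling convention as the LLM-based scorers above. A minimal sketch, assuming `Levenshtein` and `ExactMatch` are exported from the top-level `autoevals` package:

```python
from autoevals import ExactMatch, Levenshtein

# Each scorer returns a Score object with a numeric `score` in [0,1]
lev = Levenshtein()(output="People's Republic of China", expected="China")
exact = ExactMatch()(output="China", expected="China")

print(f"Levenshtein score: {lev.score}")
print(f"Exact match score: {exact.score}")  # 1.0 for an exact match
```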
## Custom evaluation prompts
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
### Python

```python
from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism:
# 1 if the generated title is better than the expected title, 0 otherwise
```