---
title: Overview
description: Define repeatable scored checks for an Eve agent with defineEval and run them with eve eval.
---

# Overview


An eval is a scored check that runs your agent through real sessions and grades the result, so regressions surface when you change a prompt or a tool. In each eval you drive the agent through one or more turns, assert on what it did (the run completed, the right tool ran, the reply contains the right text), and optionally ship the results to Braintrust.

Evals exercise the same HTTP surface your users hit. The runner boots (or targets) a real agent server, drives sessions through the [TypeScript client](../guides/client/overview) protocol, and grades what comes back, so a passing eval means the agent booted, accepted a request, and produced the result you asserted.

## `defineEval`

Eve discovers evals under the app-root `evals/` directory, in `.eval.ts` files. Each file is one eval by default. A file can also default-export an array to fan out over a dataset (see [Cases](./cases)). The file path is the eval's identity, so you don't author an `id` or `name`. Directories group related evals (`evals/weather/brooklyn-forecast.eval.ts` becomes id `weather/brooklyn-forecast`).

```text
my-agent/
├── agent/
├── evals/
│   ├── evals.config.ts
│   ├── smoke.eval.ts
│   └── weather/
│       ├── brooklyn-forecast.eval.ts
│       └── no-tools-for-greetings.eval.ts
└── package.json
```
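
The path-to-id rule above can be sketched as a small pure function. This is an illustrative model of the mapping the text describes, not Eve's discovery code; `evalIdFromPath` is a hypothetical helper name, and the handling of Windows separators is an assumption.

```typescript
// Illustrative model only: `evalIdFromPath` is a hypothetical helper, not an Eve API.
// It strips everything up to and including `evals/` and drops the `.eval.ts` suffix.
function evalIdFromPath(filePath: string): string | undefined {
  // Assumption: backslash separators are normalized to forward slashes first.
  const normalized = filePath.replace(/\\/g, "/");
  const match = /(?:^|\/)evals\/(.+)\.eval\.ts$/.exec(normalized);
  // Files that are not `.eval.ts` (for example `evals.config.ts`) yield no id.
  return match ? match[1] : undefined;
}

console.log(evalIdFromPath("evals/weather/brooklyn-forecast.eval.ts"));
// weather/brooklyn-forecast
console.log(evalIdFromPath("evals/evals.config.ts"));
// undefined
```

Because the id comes from the path, renaming or moving a file changes the eval's identity in reports.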

An eval is a single `async test(t)` function. You drive the agent with `t` and assert on the run with the same `t`:

```ts title="evals/weather/brooklyn-forecast.eval.ts"
import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  description: "Basic message and tool-usage coverage for the weather agent.",
  async test(t) {
    await t.send("What is the weather in Brooklyn?");
    t.completed();
    t.calledTool("get_weather");
    t.check(t.reply, includes("Sunny"));
  },
});
```

`test` is the only required field. The rest are optional: `description`, `judge`, `tags`, `metadata`, `timeoutMs`, and `reporters`. The init template adds `evals/**/*.ts` to `tsconfig.json`, so your eval code type-checks alongside the app.

## `evals.config.ts`

Every `evals/` directory needs exactly one `evals.config.ts` at its root. It declares the defaults every eval shares:

```ts title="evals/evals.config.ts"
import { defineEvalConfig } from "eve/evals";
import { Braintrust } from "eve/evals/reporters";

export default defineEvalConfig({
  judge: { model: "openai/gpt-5.4-mini" },
  reporters: [Braintrust({ projectName: "my-agent" })],
});
```

Everything is optional. `judge` sets the default model for [LLM-as-judge](./judge) assertions (`t.judge.*`); a tree of fully deterministic evals can omit it. `reporters`, `maxConcurrency`, and `timeoutMs` round out the defaults. Config `reporters` observe every eval in the run, so set one `Braintrust()` here instead of adding it to each eval. CLI flags (`--max-concurrency`, `--timeout`) and per-eval values take precedence over the config defaults.
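
The precedence rule can be pictured as a simple fallback chain. The sketch below is a conceptual model, not Eve's resolver: `resolveSetting` is a hypothetical name, the timeout values are made up, and the ordering between a CLI flag and a per-eval value is an assumption (the text only says both beat the config default).

```typescript
// Illustrative model only: `resolveSetting` is a hypothetical helper, not an Eve API.
// Assumption: a CLI flag wins over a per-eval value. The documented rule is only
// that both take precedence over the config default.
function resolveSetting<T>(layers: {
  cli?: T;
  perEval?: T;
  config?: T;
}): T | undefined {
  // Take the first layer that is defined, from most to least specific.
  return layers.cli ?? layers.perEval ?? layers.config;
}

// A per-eval `timeoutMs` overrides the config default (values are hypothetical).
console.log(resolveSetting({ perEval: 60_000, config: 30_000 })); // 60000
// With no override, the config default applies.
console.log(resolveSetting({ config: 30_000 })); // 30000
```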

## The `t` context

`t` is both the driver and the assertion surface. There are no separate `input`, `run`, `checks`, or `scores` fields. You write ordinary control flow, sending turns and asserting inline.

* **Drive** the agent: `t.send(...)`, `t.respond(...)`, `t.respondAll(...)`, `t.sendFile(...)`, `t.expectInputRequests(...)`, `t.newSession()`. Read what came back with `t.reply` (the last assistant message), `t.sessionId`, and `t.events`. See [Cases](./cases).
* **Assert** with three surfaces, covered next.

## Three assertion surfaces

Each surface matches a genuinely different kind of judgment:

* **Run-level methods** read the whole run, like `t.completed()`, `t.calledTool("get_weather")`, `t.usedNoTools()`, and `t.toolOrder([...])`. They take no value because they observe the run itself. See [Assertions](./assertions).
* **`t.check(value, assertion)`** grades an explicit value with a deterministic builder from `eve/evals/expect`, such as `t.check(t.reply, includes("Sunny"))`. Grade `t.reply`, an intermediate draft, parsed JSON, or anything else. See [Assertions](./assertions).
* **`t.judge.autoevals.*`** is the LLM-as-judge surface, like `t.judge.autoevals.closedQA("cites a source")`. It grades `t.reply` by default and uses the configured judge model, never the agent under test. See [Judge](./judge).

## Gate vs soft

Every assertion returns a chainable handle, so severity rides on the assertion itself. There is no separate thresholds map.

* **Gates** are hard. A failed gate marks the eval `failed` and `eve eval` exits non-zero. Run-level methods, `includes`, `equals`, and `matches` are gates by default.
* **Soft** assertions are tracked data. They land in reports and artifacts, and a below-threshold soft assertion marks the eval `scored` (visible but not fatal, unless you pass `--strict`). `similarity` and every `t.judge.*` assertion are soft by default. A soft assertion with no threshold is tracked-only and never fails.

Override per assertion: `.gate(threshold?)` promotes to a hard gate, `.soft(threshold?)` demotes to tracked, and `.atLeast(threshold)` is a soft assertion with a bar.

```ts
t.completed(); // gate
t.calledTool("get_weather").soft(); // record as a metric, don't gate
t.judge.autoevals.closedQA("cites a source"); // soft, tracked (no threshold)
t.judge.autoevals.factuality(reference).atLeast(0.7); // soft with a 0.7 bar; a miss fails only under --strict
```
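
To make the severity rules concrete, here is a minimal model of how assertion results could roll up into an eval status and a process exit code. It is a sketch of the behavior described above, not Eve's implementation. The type and function names, the `passed` status label, the `[0, 1]` score range, the default gate bar of `1`, and the specific exit code `1` are all assumptions.

```typescript
// Illustrative model only: none of these names are Eve APIs.
type Severity = "gate" | "soft";
type EvalStatus = "passed" | "scored" | "failed"; // "passed" is an assumed label

interface AssertionResult {
  name: string;
  severity: Severity;
  score: number; // assumed to lie in [0, 1]
  threshold?: number; // optional bar
}

function meetsBar(r: AssertionResult): boolean {
  if (r.severity === "gate") {
    // Assumption: a gate without an explicit threshold needs a full score of 1.
    return r.score >= (r.threshold ?? 1);
  }
  // A soft assertion with no threshold is tracked-only and never fails.
  return r.threshold === undefined || r.score >= r.threshold;
}

function evalStatus(results: AssertionResult[]): EvalStatus {
  if (results.some((r) => r.severity === "gate" && !meetsBar(r))) return "failed";
  if (results.some((r) => r.severity === "soft" && !meetsBar(r))) return "scored";
  return "passed";
}

function exitCode(statuses: EvalStatus[], strict: boolean): number {
  // Under --strict, a below-threshold soft assertion is fatal too.
  const fatal = statuses.some((s) => s === "failed" || (strict && s === "scored"));
  return fatal ? 1 : 0; // assumption: 1 stands in for "non-zero"
}

const status = evalStatus([
  { name: "completed", severity: "gate", score: 1 },
  { name: "closedQA", severity: "soft", score: 0.4 }, // tracked-only
  { name: "factuality", severity: "soft", score: 0.6, threshold: 0.7 },
]);
console.log(status); // scored
console.log(exitCode([status], false), exitCode([status], true)); // 0 1
```

In this model the unthresholded `closedQA` score never affects the status, while the `factuality` miss is visible by default and becomes fatal only in strict mode.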

## Run evals with `eve eval`

```bash
eve eval                       # run all discovered evals against a local dev server
eve eval weather               # run one eval, or every eval under evals/weather/
eve eval --url https://<app>   # target an existing server or deployment
```

Exit code `0` means every eval passed its gates (and, under `--strict`, met its soft thresholds). See [Running evals](./running) for the full flag list, exit codes, and CI guidance.

## A good baseline

Most apps do fine with a few small smoke evals. Assert behavior with `t.completed()` plus one or two content checks, keep dataset fixtures in `evals/data/`, and reach for a judge or Braintrust only when you need fuzzy grading or shared result review. In CI, run `eve eval --strict` so soft threshold misses fail the build too.

## What to read next

The rest of this section covers each piece:

* [Cases](./cases): single-turn evals, scripted multi-turn evals, and dataset fan-out
* [Assertions](./assertions): run-level methods and `t.check` value assertions, with matchers and severity
* [Judge](./judge): LLM-as-judge grading and the judge model
* [Targets](./targets): local vs remote targets for the same eval files
* [Reporters](./reporters): Braintrust experiments and JUnit XML
* [Running evals](./running): the `eve eval` CLI, exit codes, and artifacts
* [Tools](../tools): the surface most evals assert on

