---
title: Assertions
description: Run-level methods, t.check value assertions, the matcher mini-language, and gate vs soft severity.
---

# Assertions



Assertions are how an eval grades what its `test(t)` function produced. Each one **records** a result onto `t` and returns a chainable handle. The runner reads the recorded results to compute the verdict, so a single run reports every failing assertion rather than dying on the first. There are two deterministic surfaces: run-level methods on `t`, and `t.check` for grading a specific value. For model-graded assertions, see [Judge](./judge).

## Run-level assertions

Run-level assertions read the whole run, so they take no value. They are methods on `t` and gate by default. Several key off whether a run **parked**: paused on an unanswered human-in-the-loop (HITL) input request, waiting for an approval or answer before it can continue.

| Assertion                                           | Asserts                                                                                 |
| --------------------------------------------------- | --------------------------------------------------------------------------------------- |
| `t.completed()`                                     | The run did not fail and did not park on unanswered HITL input                          |
| `t.didNotFail()`                                    | No terminal failure and no `turn.failed`/`step.failed` events (parked runs pass)        |
| `t.waiting()`                                       | The run parked on HITL input (for approval-shaped evals)                                |
| `t.messageIncludes(token)`                          | Joined assistant text contains `token` (string or RegExp)                               |
| `t.outputEquals(value)` / `t.outputMatches(schema)` | Deep equality or Standard Schema (e.g. Zod) validation of the agent's structured output |
| `t.calledTool(name, opts?)`                         | A matching tool call happened (`input`, `output`, `isError`, `times` constraints)       |
| `t.notCalledTool(name)`                             | No call to `name`                                                                       |
| `t.toolOrder([...names])`                           | Tool names appear in order (other calls may interleave)                                 |
| `t.usedNoTools()`                                   | No tool calls at all                                                                    |
| `t.maxToolCalls(n)`                                 | At most `n` tool calls                                                                  |
| `t.noFailedActions()`                               | No tool, subagent, or skill action reported a failure                                   |
| `t.calledSubagent(name, opts?)`                     | A subagent delegation happened (`remoteUrl`, `output` constraints)                      |
| `t.event(predicate, label)`                         | Escape hatch: any predicate over the typed event stream                                 |

`t.completed()` subsumes `t.didNotFail()`, so reach for `completed` unless you specifically want to allow a parked run. `t.outputEquals` and `t.outputMatches` grade the agent's structured output (see the [output schema guide](../guides/client/output-schema)).

```ts
await t.send("What is the weather in Brooklyn?");
t.completed();
t.calledTool("get_weather");
```

`t.calledTool` and `t.usedNoTools` are mutually exclusive: one requires a matching tool call, the other forbids any tool call. Assert one or the other, never both in the same run.
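The `t.toolOrder` row in the table describes an in-order match that tolerates interleaved calls. A minimal sketch of that idea as an ordered-subsequence check follows; `appearsInOrder` is a hypothetical helper written for illustration, not eve's implementation, and how eve treats repeated names is not covered here.

```typescript
// Hypothetical helper (not part of eve): returns true when every name in
// `expected` occurs in `called` in the same relative order. Other calls
// may sit between the expected ones.
function appearsInOrder(called: string[], expected: string[]): boolean {
  let next = 0; // index of the expected name we are still waiting for
  for (const name of called) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

const called = ["search", "bash", "get_weather", "bash", "summarize"];

console.log(appearsInOrder(called, ["search", "get_weather", "summarize"])); // true
console.log(appearsInOrder(called, ["get_weather", "search"])); // false
```

The check is deliberately loose: it says nothing about how many other calls happened, which is what `t.maxToolCalls(n)` is for.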

## Value assertions with `t.check`

`t.check(value, assertion)` grades an explicit value against a builder from `eve/evals/expect`. The value can be `t.reply`, a turn's `.message`, parsed JSON, or any local you computed:

```ts
import { includes, equals, matches, similarity } from "eve/evals/expect";

t.check(t.reply, includes("sunny")); // substring (gate)
t.check(parsed, equals({ city: "Brooklyn" })); // deep structural equality (gate)
t.check(parsed, matches(WeatherSchema)); // Standard Schema, e.g. Zod (gate)
t.check(t.reply, similarity("Sunny, 72F")); // fuzzy 0–1 Levenshtein (soft)
```

| Builder                | Scores                                           | Default |
| ---------------------- | ------------------------------------------------ | ------- |
| `includes(substring)`  | value (coerced to string) contains `substring`   | gate    |
| `equals(value)`        | deep structural equality                         | gate    |
| `matches(schema)`      | validates against a Standard Schema              | gate    |
| `similarity(expected)` | normalized Levenshtein similarity, 1 = identical | soft    |

Pick the cheapest builder that captures what "correct" means. When exact match is too strict but a judge model is overkill, `similarity` is the middle ground. For nuanced grading, reach for the [judge](./judge).
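To get a feel for what a normalized Levenshtein score looks like, here is a self-contained sketch. The normalization used (one minus edit distance divided by the longer string's length) is an assumption for illustration; eve's `similarity` may normalize differently, and `levenshtein` and `similarityScore` are hypothetical names.

```typescript
// Classic dynamic-programming edit distance over UTF-16 code units
// (insert, delete, substitute each cost 1).
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 - distance / longer length.
// Two empty strings are treated as identical.
function similarityScore(actual: string, expected: string): number {
  const longest = Math.max(actual.length, expected.length);
  if (longest === 0) return 1;
  return 1 - levenshtein(actual, expected) / longest;
}

console.log(similarityScore("Sunny, 72F", "Sunny, 72F")); // 1
console.log(similarityScore("Sunny, 75F", "Sunny, 72F")); // 0.9
```

Under this normalization a single wrong character in a ten-character reply still scores 0.9, which is why a fuzzy score suits replies where small wording drift is acceptable.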

## The matcher mini-language

`t.calledTool` and `t.calledSubagent` take a matcher object: `{ input, output, isError, times }` for tools, `{ remoteUrl, output }` for subagents. Each field accepts a literal (objects are partial-deep-matched, so keys you do not name are ignored), a RegExp, or a function. A matcher function receives the actual value and returns either a boolean, which is used as the predicate's verdict, or an expected value to compare against (handy for runner-assigned values like environment-provided URLs):

```ts
t.calledTool("bash", { input: { command: /^pwd/ }, isError: false, times: 1 });

t.calledTool("echo", { output: (value) => String(value).includes(marker) });

t.calledSubagent("weather", {
  remoteUrl: () => process.env.WEATHER_AGENT_URL!,
  output: /72F/,
});
```
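The three matcher forms can be modeled with one small recursive function. The sketch below is an illustration of the semantics described above, not eve's code: `matchesField` and `isPlainObject` are hypothetical names, and the handling of arrays (same length, element by element) and of RegExps against non-strings (coerce with `String`) are assumptions.

```typescript
function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v) && !(v instanceof RegExp);
}

// Hypothetical model of a single matcher field.
function matchesField(actual: unknown, matcher: unknown): boolean {
  if (matcher instanceof RegExp) return matcher.test(String(actual));
  if (typeof matcher === "function") {
    const result = (matcher as (value: unknown) => unknown)(actual);
    // A boolean is the predicate's verdict; anything else is treated as
    // an expected value and matched like a literal.
    return typeof result === "boolean" ? result : matchesField(actual, result);
  }
  if (isPlainObject(matcher)) {
    if (!isPlainObject(actual)) return false;
    const source = actual;
    const wanted = matcher;
    // Partial deep match: only keys named in the matcher are checked.
    return Object.keys(wanted).every((key) => matchesField(source[key], wanted[key]));
  }
  if (Array.isArray(matcher)) {
    if (!Array.isArray(actual) || actual.length !== matcher.length) return false;
    const items = actual;
    return matcher.every((m, i) => matchesField(items[i], m));
  }
  return Object.is(actual, matcher);
}

const call = { input: { command: "pwd -P", cwd: "/tmp" }, output: "/tmp\n", isError: false };

console.log(matchesField(call.input, { command: /^pwd/ })); // true (extra `cwd` key is ignored)
console.log(matchesField(call.output, (v: unknown) => String(v).includes("/tmp"))); // true (predicate)
console.log(matchesField(call.output, () => "/tmp\n")); // true (returned value is compared)
console.log(matchesField(call.isError, true)); // false
```

One consequence of the boolean-or-expected rule in this model: a function cannot return `true` or `false` as an expected value, because a boolean return is always read as the verdict. Use a literal for boolean fields such as `isError`.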

## Run state and derived facts

Beyond the raw `t.events` stream, the runner derives typed facts the assertions read: tool calls (name, input, output, error state), subagent calls, and HITL input requests. Leaving the session open for a next message is the normal end state of a successful turn and does not count as parking. Parking on unanswered HITL input is tracked separately, and that is what `t.completed()` and `t.waiting()` key off.
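The relationship between `t.completed()`, `t.didNotFail()`, and `t.waiting()` reduces to two derived facts: did the run fail, and did it park. The sketch below models that with a hypothetical `RunFacts` shape; the field names are invented for illustration and are not eve's types.

```typescript
// Hypothetical shape for the two derived facts (illustrative names only).
interface RunFacts {
  failed: boolean; // terminal failure, or a turn.failed / step.failed event
  parked: boolean; // paused on an unanswered HITL input request
}

const didNotFail = (run: RunFacts): boolean => !run.failed;
const waiting = (run: RunFacts): boolean => run.parked;
const completed = (run: RunFacts): boolean => !run.failed && !run.parked;

const finished: RunFacts = { failed: false, parked: false };
const awaitingApproval: RunFacts = { failed: false, parked: true };
const crashed: RunFacts = { failed: true, parked: false };

console.log(completed(finished), didNotFail(finished), waiting(finished)); // true true false
console.log(completed(awaitingApproval), didNotFail(awaitingApproval), waiting(awaitingApproval)); // false true true
console.log(completed(crashed), didNotFail(crashed), waiting(crashed)); // false false false
```

The middle row is the approval-shaped case: the run has not failed, but it has not completed either, so an eval that expects a pause asserts `t.waiting()` rather than `t.completed()`.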

The built-in assertions cover almost everything. When you need to read the stream directly, `t.event(predicate, label)` is the escape hatch:

```ts
t.event(
  (events) =>
    events.some((e) => e.type === "message.completed" && e.data.message?.includes(marker)),
  "assistant reply includes the marker",
);
```

## Severity

Every assertion returns a chainable handle. Severity rides on the assertion, so there is no separate thresholds map to keep in sync.

* `.gate(threshold?)` is hard. A miss marks the eval `failed` and `eve eval` exits non-zero.
* `.soft(threshold?)` is tracked data. A score below the threshold marks the eval `scored`, which is fatal only under `--strict`. With no threshold, the assertion is tracked-only and never fails.
* `.atLeast(threshold)` is soft with a bar (equivalent to `.soft(threshold)`).

The defaults are chosen so you rarely set severity. Run-level methods and `includes`/`equals`/`matches` are gates; `similarity` and every `t.judge.*` assertion are soft. Annotate only when you deviate:

```ts
t.calledTool("get_weather").soft(); // record the tool call as a metric, don't gate
t.check(t.reply, similarity("Sunny")).atLeast(0.8); // soft with a bar: a score below 0.8 fails only under --strict
t.check(t.reply, includes("error")).soft(); // track without failing the build
```
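How recorded severities roll up into a verdict can be sketched as a small reducer. This is a model of the rules listed above under stated assumptions, not the runner's code: the `Recorded` shape, the `"passed"` verdict name, the exit code `1`, and the rule that a gate with no threshold requires a score of 1 are all hypothetical.

```typescript
type Severity = "gate" | "soft";
type Verdict = "passed" | "scored" | "failed"; // "passed" is an assumed name

// Hypothetical record of one assertion result.
interface Recorded {
  label: string;
  score: number; // 0..1; pass/fail assertions record 1 or 0
  severity: Severity;
  threshold?: number;
}

function verdictOf(results: Recorded[]): Verdict {
  let verdict: Verdict = "passed";
  for (const r of results) {
    if (r.severity === "gate") {
      // Assumption: a gate with no threshold requires a perfect score.
      if (r.score < (r.threshold ?? 1)) return "failed";
    } else if (r.threshold !== undefined && r.score < r.threshold) {
      verdict = "scored"; // soft miss: tracked, fatal only under --strict
    }
    // Soft with no threshold: tracked-only, never changes the verdict.
  }
  return verdict;
}

// Assumed mapping: any non-zero code would do; 1 is used for illustration.
function exitCode(verdict: Verdict, strict: boolean): number {
  if (verdict === "failed") return 1;
  return verdict === "scored" && strict ? 1 : 0;
}

const results: Recorded[] = [
  { label: "completed", score: 1, severity: "gate" },
  { label: "similarity", score: 0.7, severity: "soft", threshold: 0.8 },
  { label: "judge (tracked only)", score: 0.2, severity: "soft" },
];

const verdict = verdictOf(results);
console.log(verdict); // scored
console.log(exitCode(verdict, false), exitCode(verdict, true)); // 0 1
```

Because every result is recorded before the verdict is computed, all three assertions show up in the report even though only the soft miss affects the outcome.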

## What to read next

* [Judge](./judge): LLM-graded assertions with thresholds
* [Cases](./cases): where assertions attach
* [Running evals](./running): how verdicts map to exit codes

