---
title: Cases
description: Author single-turn and multi-turn evals with test(t), and fan one file out over a dataset.
---

# Cases



By default, each eval file is one graded case; a single file can also fan out over a dataset by default-exporting an array (covered below). The runner executes each `test(t)` function against the target, captures every event, and computes a verdict from the [assertions](./assertions) you recorded. Every eval shares one shape, whether single-turn, multi-turn, human-in-the-loop (HITL), or dataset-driven: one `async test(t)` function that drives the agent and asserts inline.

## Single-turn evals

The common case sends one turn and asserts on the reply. `t.send(input)` resolves once the turn settles, and `t.reply` is the last assistant message:

```ts title="evals/weather/brooklyn-forecast.eval.ts"
import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  async test(t) {
    await t.send("What is the weather in Brooklyn?");
    t.completed();
    t.check(t.reply, includes("Sunny"));
  },
});
```

Some evals only care about behavior, not text. Assert on the run and skip the content check entirely:

```ts title="evals/weather/no-tools-for-greetings.eval.ts"
import { defineEval } from "eve/evals";

export default defineEval({
  async test(t) {
    await t.send("Hello!");
    t.completed();
    t.notCalledTool("get_weather");
  },
});
```

## Organizing with directories

Identity is the file path, so directories are the grouping mechanism. `evals/weather/brooklyn-forecast.eval.ts` gets the id `weather/brooklyn-forecast`, and `eve eval weather` runs everything under `evals/weather/`. Shared constants and helpers live in sibling non-eval files (any name that doesn't end in `.eval.ts`):

```text
evals/
├── weather/
│   ├── shared.ts                    # helpers, not an eval
│   ├── brooklyn-forecast.eval.ts
│   └── no-tools-for-greetings.eval.ts
└── smoke.eval.ts
```
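The path-to-id convention can be sketched as a small pure function. This is only an illustration of the rule described above, not the runner's actual code: `evalIdFromPath` and `matchesFilter` are hypothetical names, and treating the filter as a directory-prefix match is an assumption.

```ts
// Hypothetical helpers that mirror the documented convention; not part of eve.
const EVAL_SUFFIX = ".eval.ts";

// Returns the eval id for a file under the evals root, or null for non-eval files.
function evalIdFromPath(filePath: string, root = "evals/"): string | null {
  // Normalize Windows separators so ids always use forward slashes.
  const normalized = filePath.replace(/\\/g, "/");
  // Sibling helper files (for example shared.ts) do not end in .eval.ts.
  if (!normalized.startsWith(root) || !normalized.endsWith(EVAL_SUFFIX)) {
    return null;
  }
  return normalized.slice(root.length, normalized.length - EVAL_SUFFIX.length);
}

// Assumption: a filter selects an exact id or everything under that directory.
function matchesFilter(id: string, filter: string): boolean {
  return id === filter || id.startsWith(filter + "/");
}
```

Under these assumptions, `evals/weather/shared.ts` yields no id at all, and a `weather` filter selects `weather/brooklyn-forecast` but not an id in a sibling directory such as `weather-alerts/`.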

## Multi-turn evals

Drive several turns in sequence for branching, HITL approvals, structured output, attachments, or multiple sessions. Because assertions live in the function, an intermediate value is a local variable. Judge a draft before the next turn overwrites it, then keep going.

```ts title="evals/draft-then-send.eval.ts"
import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  async test(t) {
    const draft = await t.send("Draft the follow-up email.");
    t.check(draft.message, includes("Best regards"));
    t.judge.autoevals.closedQA("professional tone", { on: draft.message }).atLeast(0.6);

    await t.send("Now send it.");
    t.calledTool("send_email");
  },
});
```

For a precondition that no built-in assertion expresses, throw an error. A thrown error marks the eval `failed` and puts its message in the result:

```ts title="evals/session-continuity.eval.ts"
import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  async test(t) {
    await t.send("My favorite word is marigold.");
    const firstSessionId = t.sessionId;

    const second = await t.send("Thanks for remembering.");
    second.expectOk();
    if (t.sessionId !== firstSessionId) {
      throw new Error(`Expected one session; got ${firstSessionId} then ${t.sessionId}.`);
    }

    t.completed();
    t.check(second.message, includes("Thanks for remembering."));
  },
});
```
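When several evals need the same kind of precondition, the throw can be factored into a shared helper. The sketch below is one possible shape; `requireEqual` is a hypothetical name, not an eve API, and it could live in a sibling non-eval file.

```ts
// Hypothetical shared helper; not part of eve.
// Throws a descriptive error when two values differ, so the message
// ends up in the eval result like any other thrown error.
function requireEqual<T>(label: string, actual: T, expected: T): void {
  if (!Object.is(actual, expected)) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}.`);
  }
}
```

In the eval above, the manual `if` could then read `requireEqual("session id", t.sessionId, firstSessionId)`.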

## The drive API

`t` drives the primary session; `t.newSession()` returns an independent `EveEvalSession` against the same target, whose events feed the same run-level assertions.

* `t.send(input)` sends a turn and waits for it to settle. It accepts the same input as `ClientSession.send()` (a string or a structured message) and resolves to a turn carrying `.message` and `.expectOk()`.
* `t.sendFile(text, path, mediaType?)` attaches a local file as a data URL.
* `t.expectInputRequests(filter?)` asserts the previous turn parked on HITL input and returns the pending requests.
* `t.respond(...responses)` answers specific pending input requests and sends them as the next turn.
* `t.respondAll(optionId)` answers every pending input request with the same option and sends the responses as the next turn.
* `t.reply` is the last assistant message (or `null`); `t.sessionId` is the current session id; `t.events` is the full typed event stream captured so far.

Each `send` (and `respond`/`respondAll`) resolves to a turn whose `expectOk()` throws only if the turn ended in a failed state. A session left open for a next message is the normal end state of a successful turn.

Events from every session are captured in the result and artifacts. `t.log(message)` records debug lines into the eval artifact; `--verbose` also streams them to stdout as evals run. `t.signal` is an `AbortSignal` that fires on timeout.
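Because `t.signal` is a standard `AbortSignal`, custom waiting or polling code inside an eval can honor the timeout. The sketch below shows one way to write an abortable delay; `sleep` is a hypothetical helper, not something eve provides, and it relies only on the standard `AbortSignal` and timer APIs.

```ts
// Hypothetical helper; not part of eve.
// Resolves after `ms` milliseconds, or rejects as soon as the signal aborts.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Bail out immediately if the signal has already fired.
    if (signal.aborted) {
      reject(new Error("aborted before sleeping"));
      return;
    }
    const timer = setTimeout(() => {
      // Normal completion: detach the listener so it cannot fire later.
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      // Abort: cancel the pending timer so nothing keeps the process alive.
      clearTimeout(timer);
      reject(new Error("aborted while sleeping"));
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

Inside `test(t)`, a call such as `await sleep(500, t.signal)` would then stop waiting once the eval times out instead of holding the run open.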

For driving sessions created outside the eval (by a channel webhook or a schedule), see [Targets](./targets).

## Datasets: exporting an array

To fan one file out over a dataset, default-export an array of `defineEval(...)` values. Eval modules are ESM, so top-level `await` can load the dataset before the array is built. Ids derive from the file name plus a zero-padded index in array order (`sql/0000`, `sql/0001`, and so on). The loaders (`loadJson`, `loadYaml` from `eve/evals/loaders`) parse fixture files relative to the app root:

```ts title="evals/sql.eval.ts"
import { defineEval } from "eve/evals";
import { loadYaml } from "eve/evals/loaders";
import { equals } from "eve/evals/expect";

const doc = await loadYaml("evals/data/cases.yaml");
const rows = doc.evals as readonly { task: string; prompt: string; sql: string }[];

export default rows.map((row) =>
  defineEval({
    description: row.task,
    async test(t) {
      await t.send(row.prompt);
      t.completed();
      t.check(t.reply, equals(row.sql));
    },
  }),
);
```

The loaders are meant for fixtures, not runtime agent code.
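The id scheme for dataset evals can be sketched as follows. `datasetIds` is a hypothetical helper written for illustration, and the four-digit width is inferred from the `sql/0000` example above; how the runner handles more than 10,000 rows is not covered here.

```ts
// Hypothetical helper; reproduces the documented "<file>/<zero-padded index>" scheme.
// Assumption: the pad width is 4, as in the sql/0000 example.
function datasetIds(fileId: string, count: number, width = 4): string[] {
  return Array.from({ length: count }, (_, index) =>
    `${fileId}/${String(index).padStart(width, "0")}`,
  );
}

// Three rows in evals/sql.eval.ts would map to these ids, in array order.
const ids = datasetIds("sql", 3);
```

Since ids follow array order, inserting or reordering rows in the fixture shifts the ids of the rows after the change.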

## What to read next

* [Assertions](./assertions): assert on what the eval did
* [Judge](./judge): grade quality with an LLM judge
* [TypeScript client](../guides/client/messages): the send/turn protocol eval sessions build on

