Documentation ¶
Overview ¶
Package testkit provides a thin HTTP harness for writing phased, real-LLM tests against a live apteva-server. A Session wraps one server + one agent instance and exposes the operations you'd do by hand through the dashboard: set a directive, inject console events, reset the context window, query telemetry, tear down.
Typical shape:
func TestMyAgent(t *testing.T) {
	s := testkit.New(t) // auto-starts a local server if
	                    // APTEVA_SERVER_URL isn't live
	s.SetDirective(`Respond "pong" to "ping" messages.`)
	s.Run("ping replies", func(p *testkit.Phase) {
		p.Inject("[console] ping")
		p.WaitUntil(30*time.Second, "an iteration completed", func() bool {
			return s.Status().Iteration >= 1
		})
	})
	report := s.Report()
	t.Logf("tokens=%d cost=$%.4f", report.TokensIn+report.TokensOut, report.Cost)
}
No credentials or client data appear in committed code — tests that target production systems live in core/private_scenarios (gitignored) and read URLs/keys from the environment.
Index ¶
- type ComputerConfig
- type Config
- type Phase
- type PhaseStats
- type Report
- type Session
- func (s *Session) ContextUsage(since time.Time) []ThreadContextUsage
- func (s *Session) Events(typeFilter string, limit int) []TelemetryEvent
- func (s *Session) HasToolCall(toolName string) bool
- func (s *Session) HasToolCallWithPrefix(prefix string) bool
- func (s *Session) Inject(text string)
- func (s *Session) LogContextUsage()
- func (s *Session) Report() Report
- func (s *Session) ResetContext()
- func (s *Session) Run(name string, fn func(p *Phase))
- func (s *Session) SetDirective(directive string)
- func (s *Session) SetProviderModels(provider string, large, medium, small string)
- func (s *Session) SpawnedThreads() []string
- func (s *Session) StatsSince(since time.Time) PhaseStats
- func (s *Session) Status() StatusInfo
- func (s *Session) ToolCallsByThread() map[string][]string
- func (s *Session) Verify(desc string, fn func())
- func (s *Session) WaitIdle(timeout, quiet time.Duration)
- func (s *Session) WaitUntil(timeout time.Duration, desc string, cond func() bool)
- type StatusInfo
- type TelemetryEvent
- type ThreadContextUsage
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ComputerConfig ¶
type ComputerConfig struct {
	Type      string `json:"type"`                 // "local" | "browserbase" | "steel" | "browser-engine" | "service"
	URL       string `json:"url,omitempty"`        // for "service", or local CDP endpoint
	APIKey    string `json:"api_key,omitempty"`    // override saved provider API key (rare)
	ProjectID string `json:"project_id,omitempty"` // for "browserbase"
	Width     int    `json:"width,omitempty"`      // 0 = core default per LLM
	Height    int    `json:"height,omitempty"`
}
ComputerConfig mirrors core.ComputerConfig (we restate the shape here so testkit users don't have to import core just to flip on browser tools). Only Type is required; Width, Height, URL, APIKey, and ProjectID are optional, and the server fills in credentials from saved providers when Type is "browserbase" or "steel".
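As a rough illustration of the "only Type is required" rule, the sketch below restates the documented struct locally and marshals it with encoding/json, so the effect of the omitempty tags on the wire form is visible. The encode helper and the example URL are hypothetical and are not part of testkit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComputerConfig restates the documented shape so the sketch is self-contained.
type ComputerConfig struct {
	Type      string `json:"type"`
	URL       string `json:"url,omitempty"`
	APIKey    string `json:"api_key,omitempty"`
	ProjectID string `json:"project_id,omitempty"`
	Width     int    `json:"width,omitempty"`
	Height    int    `json:"height,omitempty"`
}

// encode is a hypothetical helper that returns the JSON wire form of c.
func encode(c ComputerConfig) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err) // not expected for a struct of strings and ints
	}
	return string(b)
}

func main() {
	// Zero-valued optional fields are dropped from the payload.
	fmt.Println(encode(ComputerConfig{Type: "local"}))
	// The URL below is an illustrative placeholder for a custom browser service.
	fmt.Println(encode(ComputerConfig{Type: "service", URL: "http://localhost:9222"}))
}
```

Because unset optional fields never reach the payload, a test that only sets Type leaves sizing and credentials for the server and core to decide.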
type Config ¶
type Config struct {
	ServerURL  string // default: env APTEVA_SERVER_URL, else http://localhost:5280
	APIKey     string // default: env APTEVA_API_KEY
	InstanceID int64  // default: env APTEVA_TEST_INSTANCE_ID (if 0 and autostart, one is created)
	ProjectID  string // default: env APTEVA_TEST_PROJECT_ID

	// DisableAutoStart, when true, forbids spawning a server; tests
	// then fail fast if APTEVA_SERVER_URL isn't already live. Useful
	// for CI where you want the infrastructure pre-set and failures
	// to be loud rather than silent. Default is zero-value (autostart
	// enabled) because the zero-value of a bool is what Go hands a
	// user who writes `testkit.New(t, testkit.Config{...})` without
	// thinking about AutoStart; opt-out is the only way to keep the
	// "just works" shape.
	DisableAutoStart bool

	StartTimeout time.Duration // default: 15s; how long to wait for a spawned server to respond to /health

	// CreateInstance forces creation of a fresh throwaway instance
	// instead of reusing InstanceID / APTEVA_TEST_INSTANCE_ID. The
	// instance is deleted via t.Cleanup so each test run is fully
	// isolated from the last. Use this for "proper" tests, when you
	// want guaranteed no leftover state, no stale threads, no
	// cross-contamination between test files.
	CreateInstance bool

	// InstanceName applies only with CreateInstance. Defaults to a
	// randomized "testkit-<hex>" so parallel tests don't collide.
	InstanceName string

	// Directive is the initial directive for a newly-created instance.
	// Set here OR call s.SetDirective() after New; they're equivalent
	// except setting it here saves one round-trip.
	Directive string

	// AttachMCPs lists MCP server names to attach as CATALOG entries
	// on the instance: visible to main as a "spawn workers with
	// mcp=X" hint but NOT exposed as native tools on main's registry.
	// This forces the agent to exercise the spawn path: main decides
	// to create a worker, passes the MCP name on spawn, the worker
	// connects and uses the tool. That matches the hub-and-spoke
	// shape real instances run under and gives the tests visibility
	// into thread lifecycle.
	//
	// Each name must match an already-registered MCP server in the
	// project (set up via Dashboard → Settings → MCP Servers or via
	// Composio). Credentials and URLs are looked up from the server's
	// DB; tests never see them.
	AttachMCPs []string

	// AttachMCPsMainAccess is a deprecated alias for AttachMCPs kept
	// to avoid churning every existing test in one go. The legacy
	// main_access flag is gone: every attached MCP is now searchable
	// by main, and sub-thread visibility is governed by no_spawn.
	// Tests written against this field should migrate to AttachMCPs.
	AttachMCPsMainAccess []string

	// IncludeAptevaServer / IncludeChannels match the same flags on
	// instance creation. Both default to false for test instances;
	// most tests don't want the management gateway or the chat bridge
	// cluttering their tool list.
	IncludeAptevaServer bool
	IncludeChannels     bool

	// Computer enables the browser/computer-use tool surface
	// (browser_session + computer_use) on the instance. When non-nil
	// testkit forwards this block to PUT /api/instances/:id/config in
	// the same call that lands the real directive + MCPs, so core
	// spins up its computer backend before the agent's first real
	// iteration. nil = no computer mode (default).
	//
	// Type=="browserbase"/"steel" relies on a saved provider in the
	// project for credentials; the server enriches the block before
	// forwarding to core. Type=="local" runs against a local Chrome
	// (or an existing CDP endpoint if the matching `browser` provider
	// has CDP_URL set). Type=="service" points at a custom HTTP
	// browser service (URL required).
	Computer *ComputerConfig
}
Config lets callers override the automatic environment detection. Most tests should use zero-value New(t) and drive through env vars (APTEVA_SERVER_URL, APTEVA_API_KEY, APTEVA_TEST_INSTANCE_ID).
type Phase ¶
type Phase struct {
	// contains filtered or unexported fields
}
Phase is the argument passed to Session.Run's callback. Inside the callback you inject events, poll for conditions, and assert. If any method on Phase fails, the phase (and test) aborts with a clear message identifying which phase failed.
func (*Phase) Inject ¶
Inject is the phase-scoped equivalent of Session.Inject. It is provided as a convenience so phase callbacks don't need to close over the session.
func (*Phase) Stats ¶
func (p *Phase) Stats() PhaseStats
Stats returns totals for the events emitted inside this phase so far. Called automatically at end of Run; available mid-phase too for fine-grained assertions (e.g. "this phase should cost under $X").
func (*Phase) Verify ¶
Verify runs assertions. It's just a named wrapper so phase traces include which assertion block ran. Use standard t.Errorf/Fatalf inside.
type PhaseStats ¶
type PhaseStats struct {
	Iterations int     // llm.done events
	ToolCalls  int     // tool.call events
	TokensIn   int     // sum of llm.done.tokens_in
	TokensOut  int     // sum of llm.done.tokens_out
	Cost       float64 // sum of server-enriched llm.done.cost_usd
	Errors     int     // llm.error events
}
PhaseStats is the per-window rollup logPhaseStats / StatsSince returns. Time bounds are caller-supplied so the same helper works for phase-scoped, inject-to-now, or arbitrary-slice queries.
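To make the rollup concrete, here is a small sketch of how events could be summed into a PhaseStats for a caller-supplied lower bound. It restates the two types locally. It assumes that event times are RFC 3339 strings and that numeric fields arrive as float64 (as they do after generic JSON decoding); neither is confirmed by this page. The rollup, num, and demo names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// TelemetryEvent and PhaseStats restate the documented shapes.
type TelemetryEvent struct {
	ID       string
	ThreadID string
	Type     string
	Time     string // assumed to be RFC 3339 in this sketch
	Data     map[string]interface{}
}

type PhaseStats struct {
	Iterations int
	ToolCalls  int
	TokensIn   int
	TokensOut  int
	Cost       float64
	Errors     int
}

// num reads a numeric field from the free-form data map, or 0 if absent.
func num(data map[string]interface{}, key string) float64 {
	f, _ := data[key].(float64)
	return f
}

// rollup sums events whose timestamp is at or after since.
func rollup(events []TelemetryEvent, since time.Time) PhaseStats {
	var st PhaseStats
	for _, ev := range events {
		ts, err := time.Parse(time.RFC3339, ev.Time)
		if err != nil || ts.Before(since) {
			continue // skip unparseable or out-of-window events
		}
		switch ev.Type {
		case "llm.done":
			st.Iterations++
			st.TokensIn += int(num(ev.Data, "tokens_in"))
			st.TokensOut += int(num(ev.Data, "tokens_out"))
			st.Cost += num(ev.Data, "cost_usd")
		case "tool.call":
			st.ToolCalls++
		case "llm.error":
			st.Errors++
		}
	}
	return st
}

// demo rolls up a fixed slice: one event before the window, two inside it.
func demo() PhaseStats {
	since := time.Date(2024, 1, 1, 10, 0, 0, 0, time.UTC)
	events := []TelemetryEvent{
		{Type: "llm.done", Time: "2024-01-01T09:59:00Z", Data: map[string]interface{}{"tokens_in": 900.0, "tokens_out": 50.0, "cost_usd": 0.01}},
		{Type: "llm.done", Time: "2024-01-01T10:00:05Z", Data: map[string]interface{}{"tokens_in": 120.0, "tokens_out": 30.0, "cost_usd": 0.002}},
		{Type: "tool.call", Time: "2024-01-01T10:00:06Z", Data: map[string]interface{}{"name": "pushover_send_notification"}},
	}
	return rollup(events, since)
}

func main() {
	fmt.Printf("%+v\n", demo())
}
```

Passing the bound in as a parameter is what lets one helper serve phase-scoped, inject-to-now, and arbitrary-slice queries.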
type Report ¶
type Report struct {
	Iterations  int     `json:"iterations"`
	TokensIn    int     `json:"tokens_in"`
	TokensOut   int     `json:"tokens_out"`
	Cost        float64 `json:"cost"`
	ToolCalls   int     `json:"tool_calls"`
	Errors      int     `json:"errors"`
	DurationSec float64 `json:"duration_sec"`
}
Report summarises what the session spent since it started. It is computed from the server's telemetry-stats endpoint, filtered to the shorter of two windows: the time since the session started, or 1h.
type Session ¶
type Session struct {
	// contains filtered or unexported fields
}
Session is the test harness for one apteva-server + one agent instance. Create with New(t). Safe for use only from a single test goroutine.
func New ¶
New wires up a Session against the URL in APTEVA_SERVER_URL (or http://localhost:5280), using APTEVA_API_KEY + APTEVA_TEST_INSTANCE_ID from the environment. If the URL isn't reachable and autostart is enabled (the default; see Config.DisableAutoStart), New spawns apteva-server with an ephemeral data directory, bootstraps the setup-token flow to mint an API key, and creates a fresh test instance. Everything it spawned is cleaned up via t.Cleanup.
func (*Session) ContextUsage ¶
func (s *Session) ContextUsage(since time.Time) []ThreadContextUsage
ContextUsage walks llm.done events since `since` and returns a peak-tokens snapshot per thread. Used by the end-of-phase and end-of-run reports to surface context bloat before it bites.
func (*Session) Events ¶
func (s *Session) Events(typeFilter string, limit int) []TelemetryEvent
Events fetches recent telemetry events for the session's instance, optionally filtered by type. limit caps the result (default 200 if zero). Ordered newest-first by the server.
Typical use: after injecting an event, poll Events("tool.call",...) until the expected tool name shows up, then assert on the args.
func (*Session) HasToolCall ¶
func (s *Session) HasToolCall(toolName string) bool
HasToolCall returns true if a tool.call event for the named tool has been recorded since the session started. The check is exact on the tool name: pass "pushover_send_notification", not "pushover".
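func (*Session) HasToolCallWithPrefix ¶
func (s *Session) HasToolCallWithPrefix(prefix string) bool
HasToolCallWithPrefix returns true when any tool.call event's name starts with prefix. Useful for MCP-backed tools where the agent picks the specific endpoint (pushover_send_notification vs. pushover_send_priority_alert vs. ...): the test cares that "some pushover tool fired", not which one.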
func (*Session) Inject ¶
func (s *Session) Inject(text string)
Inject sends a console event to main. The text is what the agent sees in its [console] block, e.g. "[console] [chat] Hi".
func (*Session) LogContextUsage ¶
func (s *Session) LogContextUsage()
LogContextUsage prints the full-session per-thread context usage table. Call from the test at end-of-run for a final bloat check.
func (*Session) Report ¶
func (s *Session) Report() Report
Report fetches cumulative stats since New() or the last ResetContext(). It uses the server's 1h stats window and lets the caller subtract the session start, which is close enough for iteration-scale tests.
Report waits for a brief settle window before the fetch: core posts telemetry events to the server asynchronously (batched through forwardLoop), so the most recent iteration's llm.done may not have landed in the DB yet when a fast test wraps up. Waiting ~1s covers the typical forward latency without noticeably slowing tests.
func (*Session) ResetContext ¶
func (s *Session) ResetContext()
ResetContext wipes the main thread's message history and kills every sub-thread. Called automatically by New; call again between phases if you want isolation.
func (*Session) Run ¶
func (s *Session) Run(name string, fn func(p *Phase))
Run executes a named phase. The callback receives a Phase bound to the session. Each phase's runtime and token/cost stats are logged. Nothing about Phase auto-resets context; call s.ResetContext() yourself between phases if needed.
func (*Session) SetDirective ¶
func (s *Session) SetDirective(directive string)
SetDirective rewrites the instance's directive. Use this to iterate: edit your test's directive string, re-run, compare reports.
func (*Session) SetProviderModels ¶
func (s *Session) SetProviderModels(provider string, large, medium, small string)
SetProviderModels forces an LLM provider to use specific model IDs for large/medium/small tiers on this instance. Useful to pin tests to a specific model regardless of what the user configured in their DB.
func (*Session) SpawnedThreads ¶
func (s *Session) SpawnedThreads() []string
SpawnedThreads returns the IDs of sub-threads that were spawned during the session. It is driven by thread.spawn telemetry events, so it works after the thread has already finished and there is no polling race with active-thread enumeration.
func (*Session) StatsSince ¶
func (s *Session) StatsSince(since time.Time) PhaseStats
StatsSince walks telemetry for this session's instance and sums iteration / tool-call / token / cost totals for events at or after `since`. Use it to measure a specific phase or inject-to-now window; for whole-session cumulative totals use Report().
func (*Session) Status ¶
func (s *Session) Status() StatusInfo
Status returns the current agent status. Fails the test on HTTP error.
func (*Session) ToolCallsByThread ¶
func (s *Session) ToolCallsByThread() map[string][]string
ToolCallsByThread returns all recorded tool.call events grouped by the thread that made them. It lets tests assert specifically that "some sub-thread (not main) called X", which is the hub-and-spoke shape we want most tests to validate rather than short-circuiting tool calls straight from main.
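The grouping and the "some sub-thread (not main) called X" assertion can be sketched as below. This is not testkit's implementation: the trimmed event type, the toolCallsByThread and calledBySubThread helpers, and the thread IDs "main" and "worker-1" are all assumptions made for illustration. The "name" key follows the TelemetryEvent description on this page.

```go
package main

import "fmt"

// TelemetryEvent is a trimmed stand-in for the documented type.
type TelemetryEvent struct {
	ThreadID string
	Type     string
	Data     map[string]interface{}
}

// toolCallsByThread groups tool names from tool.call events by thread ID.
func toolCallsByThread(events []TelemetryEvent) map[string][]string {
	out := map[string][]string{}
	for _, ev := range events {
		if ev.Type != "tool.call" {
			continue
		}
		if name, ok := ev.Data["name"].(string); ok {
			out[ev.ThreadID] = append(out[ev.ThreadID], name)
		}
	}
	return out
}

// calledBySubThread reports whether any thread other than mainID called tool.
func calledBySubThread(groups map[string][]string, mainID, tool string) bool {
	for thread, names := range groups {
		if thread == mainID {
			continue
		}
		for _, n := range names {
			if n == tool {
				return true
			}
		}
	}
	return false
}

func main() {
	events := []TelemetryEvent{
		{ThreadID: "main", Type: "llm.done"},
		{ThreadID: "worker-1", Type: "tool.call", Data: map[string]interface{}{"name": "pushover_send_notification"}},
	}
	groups := toolCallsByThread(events)
	fmt.Println(calledBySubThread(groups, "main", "pushover_send_notification"))
}
```

In a real test the map would come from s.ToolCallsByThread(), and only the second helper would be needed.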
func (*Session) Verify ¶
func (s *Session) Verify(desc string, fn func())
Verify is the session-level counterpart of Phase.Verify: a named assertion block so test logs show which checks ran. Intended for tests that don't use s.Run phases.
func (*Session) WaitIdle ¶
func (s *Session) WaitIdle(timeout, quiet time.Duration)
WaitIdle blocks until we're confident the scenario has settled. It returns as soon as any one of these three conditions is met, bounded by `timeout`:
1. No new telemetry events have arrived for a continuous `quiet` period. This is the plain-silence rule for tests where the agent just stops emitting.
2. Main has paced to a long sleep (≥1m) and a short grace period (`min(quiet, 5s)`) has passed since it did so. Main is the orchestrator: if it's asleep for an hour, nothing meaningful is going to happen regardless of what workers do. This specifically avoids the trap where a worker's 5m pace fires a heartbeat event that resets the quiet timer long after the real work is done.
3. All live threads have paced with a long sleep. Same spirit as (2), but for tests that finish before main emits a long pace (e.g. when the scenario uses a different top-level strategy).
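The three settle conditions above can be expressed as a single decision function, sketched here under assumptions. The snapshot type, its field names, and the settled function are hypothetical stand-ins for whatever state WaitIdle actually polls. The outer `timeout` bound is left out to keep the example short.

```go
package main

import (
	"fmt"
	"time"
)

// snapshot is a hypothetical view of the state a poll loop would observe.
type snapshot struct {
	now            time.Time
	lastEvent      time.Time // time of the most recent telemetry event
	mainLongSleep  time.Time // when main paced to a long sleep; zero if it has not
	allThreadsLong bool      // every live thread has paced with a long sleep
}

// settled returns true plus the rule that fired, checking the rules in order.
func settled(s snapshot, quiet time.Duration) (bool, string) {
	// Rule 1: plain silence for the whole quiet period.
	if s.now.Sub(s.lastEvent) >= quiet {
		return true, "quiet"
	}
	// Rule 2: main is in a long sleep and min(quiet, 5s) has passed since.
	grace := quiet
	if grace > 5*time.Second {
		grace = 5 * time.Second
	}
	if !s.mainLongSleep.IsZero() && s.now.Sub(s.mainLongSleep) >= grace {
		return true, "main-asleep"
	}
	// Rule 3: every live thread is in a long sleep.
	if s.allThreadsLong {
		return true, "all-asleep"
	}
	return false, ""
}

// demo evaluates four fixed snapshots with quiet = 30s.
func demo() []string {
	base := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	at := func(sec int) time.Time { return base.Add(time.Duration(sec) * time.Second) }
	cases := []snapshot{
		{now: at(40), lastEvent: at(0)},
		{now: at(10), lastEvent: at(8), mainLongSleep: at(4)},
		{now: at(10), lastEvent: at(8), allThreadsLong: true},
		{now: at(10), lastEvent: at(8), mainLongSleep: at(7)},
	}
	var reasons []string
	for _, c := range cases {
		_, why := settled(c, 30*time.Second)
		reasons = append(reasons, why)
	}
	return reasons
}

func main() {
	fmt.Printf("%q\n", demo())
}
```

The second case shows why rule 2 exists: an event arrived only 2s earlier, so the quiet timer has not elapsed, yet main has been asleep for longer than the grace period.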
type StatusInfo ¶
type StatusInfo struct {
	Iteration int    `json:"iteration"`
	Threads   int    `json:"threads"`
	Paused    bool   `json:"paused"`
	Mode      string `json:"mode"`
	Rate      string `json:"rate"`
	Model     string `json:"model"`
}
StatusInfo is the minimum subset of the /status payload tests care about. Exposed so tests can write conditions like `s.Status().Iteration >= 3` without depending on core types.
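Because StatusInfo is a subset of the /status payload, decoding has to tolerate fields the struct does not declare. The sketch below shows that with encoding/json, which ignores unknown keys by default. The sample payload, including the "uptime_sec" key and the "auto" mode value, is invented for illustration and may not match what the server returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusInfo restates the documented shape.
type StatusInfo struct {
	Iteration int    `json:"iteration"`
	Threads   int    `json:"threads"`
	Paused    bool   `json:"paused"`
	Mode      string `json:"mode"`
	Rate      string `json:"rate"`
	Model     string `json:"model"`
}

// decodeStatus is a hypothetical helper that parses a /status response body.
func decodeStatus(body []byte) (StatusInfo, error) {
	var st StatusInfo
	err := json.Unmarshal(body, &st)
	return st, err
}

func main() {
	// Illustrative payload; "uptime_sec" is not declared on StatusInfo.
	body := []byte(`{"iteration":3,"threads":2,"paused":false,"mode":"auto","uptime_sec":42}`)
	st, err := decodeStatus(body)
	if err != nil {
		panic(err)
	}
	// A condition of the kind a wait loop would poll.
	fmt.Println(st.Iteration >= 3)
}
```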
type TelemetryEvent ¶
type TelemetryEvent struct {
	ID       string                 `json:"id"`
	ThreadID string                 `json:"thread_id"`
	Type     string                 `json:"type"`
	Time     string                 `json:"time"`
	Data     map[string]interface{} `json:"data"`
}
TelemetryEvent is the subset of a stored telemetry row tests look at: type (e.g. "tool.call"), thread id, time, and the free-form data map provided by the emitter. Tests typically grep Data for {"name": "pushover_send_notification", ...} or similar.
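Searching Data for a tool name can look like the following sketch, which decodes a small JSON array into the documented struct and then does exact and prefix matching on data["name"]. The sample events, the IDs, and the "args" key are made up, and the helper names are hypothetical. Session.HasToolCall and Session.HasToolCallWithPrefix provide these checks directly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// TelemetryEvent restates the documented shape.
type TelemetryEvent struct {
	ID       string                 `json:"id"`
	ThreadID string                 `json:"thread_id"`
	Type     string                 `json:"type"`
	Time     string                 `json:"time"`
	Data     map[string]interface{} `json:"data"`
}

// sample is an invented payload, newest event first.
const sample = `[
  {"id":"e2","thread_id":"worker-1","type":"tool.call","time":"2024-01-01T10:00:06Z",
   "data":{"name":"pushover_send_notification","args":{"message":"hi"}}},
  {"id":"e1","thread_id":"main","type":"llm.done","time":"2024-01-01T10:00:05Z",
   "data":{"tokens_in":120}}
]`

// decodeEvents parses a JSON array of events.
func decodeEvents(raw string) ([]TelemetryEvent, error) {
	var evs []TelemetryEvent
	err := json.Unmarshal([]byte(raw), &evs)
	return evs, err
}

// matchTool reports whether any tool.call event's name satisfies pred.
func matchTool(evs []TelemetryEvent, pred func(name string) bool) bool {
	for _, ev := range evs {
		if ev.Type != "tool.call" {
			continue
		}
		if name, ok := ev.Data["name"].(string); ok && pred(name) {
			return true
		}
	}
	return false
}

// hasToolCall requires an exact tool-name match.
func hasToolCall(evs []TelemetryEvent, tool string) bool {
	return matchTool(evs, func(n string) bool { return n == tool })
}

// hasToolCallWithPrefix accepts any tool whose name starts with prefix.
func hasToolCallWithPrefix(evs []TelemetryEvent, prefix string) bool {
	return matchTool(evs, func(n string) bool { return strings.HasPrefix(n, prefix) })
}

func main() {
	evs, err := decodeEvents(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(hasToolCall(evs, "pushover_send_notification"), hasToolCallWithPrefix(evs, "pushover_"))
}
```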
type ThreadContextUsage ¶
type ThreadContextUsage struct {
	ThreadID   string
	PeakIn     int // max tokens_in seen
	LastIn     int // most recent tokens_in
	ContextMax int // model's max_context_tokens (0 if unknown)
	PeakMsgs   int // max context_msgs
	PeakChars  int // max context_chars
	Iters      int // # llm.done events
}
ThreadContextUsage summarises context-window usage for one thread across a telemetry slice. Peaks come from llm.done events.
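A per-thread peak computation along these lines could look like the sketch below. The llmDone type is a hypothetical flattened view of one llm.done event, and the input is assumed to be in chronological order so that LastIn ends up holding the most recent value. ContextMax is left at 0 (unknown) because this page does not say where it is read from.

```go
package main

import "fmt"

// llmDone is a hypothetical flattened view of a single llm.done event.
type llmDone struct {
	ThreadID     string
	TokensIn     int
	ContextMsgs  int
	ContextChars int
}

// ThreadContextUsage restates the documented shape.
type ThreadContextUsage struct {
	ThreadID   string
	PeakIn     int
	LastIn     int
	ContextMax int
	PeakMsgs   int
	PeakChars  int
	Iters      int
}

// peaks folds chronologically ordered events into one summary per thread.
func peaks(events []llmDone) map[string]ThreadContextUsage {
	out := map[string]ThreadContextUsage{}
	for _, ev := range events {
		u := out[ev.ThreadID] // zero value on first sight of a thread
		u.ThreadID = ev.ThreadID
		if ev.TokensIn > u.PeakIn {
			u.PeakIn = ev.TokensIn
		}
		u.LastIn = ev.TokensIn
		if ev.ContextMsgs > u.PeakMsgs {
			u.PeakMsgs = ev.ContextMsgs
		}
		if ev.ContextChars > u.PeakChars {
			u.PeakChars = ev.ContextChars
		}
		u.Iters++
		out[ev.ThreadID] = u
	}
	return out
}

func main() {
	usage := peaks([]llmDone{
		{ThreadID: "main", TokensIn: 100, ContextMsgs: 4},
		{ThreadID: "main", TokensIn: 300, ContextMsgs: 9},
		{ThreadID: "main", TokensIn: 200, ContextMsgs: 6},
	})
	fmt.Printf("%+v\n", usage["main"])
}
```

A PeakIn well above LastIn, as in this sample, suggests the context was trimmed or reset partway through the run.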