testkit

package

v0.2.0 (latest) | Published: May 18, 2026 | License: MIT | Imports: 23 | Imported by: 0

Repository

github.com/apteva/core

Documentation ¶

Overview ¶

Package testkit provides a thin HTTP harness for writing phased, real-LLM tests against a live apteva-server. A Session wraps one server + one agent instance and exposes the operations you'd do by hand through the dashboard: set a directive, inject console events, reset the context window, query telemetry, tear down.

Typical shape:

func TestMyAgent(t *testing.T) {
    s := testkit.New(t)           // auto-starts a local server if
                                  // APTEVA_SERVER_URL isn't live
    s.SetDirective(`Respond "pong" to "ping" messages.`)
    s.Run("ping replies", func(p *testkit.Phase) {
        p.Inject("[console] ping")
        p.WaitUntil(30*time.Second, "an iteration completed", func() bool {
            return s.Status().Iteration >= 1
        })
    })
    report := s.Report()
    t.Logf("tokens=%d cost=$%.4f", report.TokensIn+report.TokensOut, report.Cost)
}

No credentials or client data appear in committed code — tests that target production systems live in core/private_scenarios (gitignored) and read URLs/keys from the environment.

Index ¶

type ComputerConfig
type Config
type Phase
type PhaseStats
type Report
type Session
- func New(t *testing.T, cfg ...Config) *Session
type StatusInfo
type TelemetryEvent
type ThreadContextUsage

Types ¶

type ComputerConfig ¶

type ComputerConfig struct {
	Type      string `json:"type"`                 // "local" | "browserbase" | "steel" | "browser-engine" | "service"
	URL       string `json:"url,omitempty"`        // for "service", or local CDP endpoint
	APIKey    string `json:"api_key,omitempty"`    // override saved provider API key (rare)
	ProjectID string `json:"project_id,omitempty"` // for "browserbase"
	Width     int    `json:"width,omitempty"`      // 0 = core default per LLM
	Height    int    `json:"height,omitempty"`
}

ComputerConfig mirrors core.ComputerConfig. The shape is restated here so testkit users don't have to import core just to turn on browser tools. Only Type is required; Width, Height, URL, APIKey, and ProjectID are optional, and the server fills in credentials from saved providers when Type is "browserbase" or "steel".
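As a rough illustration of what the omitempty tags mean for the request body, the sketch below copies the struct declaration from above and marshals two values. The width and height numbers are arbitrary examples, not defaults taken from core, and encode is a helper invented for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComputerConfig is copied from the declaration above so the sketch compiles
// without importing testkit.
type ComputerConfig struct {
	Type      string `json:"type"`
	URL       string `json:"url,omitempty"`
	APIKey    string `json:"api_key,omitempty"`
	ProjectID string `json:"project_id,omitempty"`
	Width     int    `json:"width,omitempty"`
	Height    int    `json:"height,omitempty"`
}

// encode (hypothetical helper) marshals cfg to its JSON wire form.
func encode(cfg ComputerConfig) string {
	b, err := json.Marshal(cfg)
	if err != nil {
		panic(err) // not expected for a struct of strings and ints
	}
	return string(b)
}

func main() {
	// Only Type set: every optional field is dropped from the payload.
	fmt.Println(encode(ComputerConfig{Type: "steel"}))
	// Illustrative viewport size; zero would mean "core default per LLM".
	fmt.Println(encode(ComputerConfig{Type: "browserbase", Width: 1280, Height: 800}))
}
```

Zero-valued optional fields never reach the wire, which is consistent with the server being free to fill in credentials and defaults on its side.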

type Config ¶

type Config struct {
	ServerURL  string // default: env APTEVA_SERVER_URL, else http://localhost:5280
	APIKey     string // default: env APTEVA_API_KEY
	InstanceID int64  // default: env APTEVA_TEST_INSTANCE_ID (if 0 and autostart, one is created)
	ProjectID  string // default: env APTEVA_TEST_PROJECT_ID
	// DisableAutoStart, when true, forbids spawning a server — tests
	// then fail fast if APTEVA_SERVER_URL isn't already live. Useful
	// for CI where you want the infrastructure pre-set and failures
	// to be loud rather than silent. Default is zero-value (autostart
	// enabled) because the zero-value of a bool is what Go hands a
	// user who writes `testkit.New(t, testkit.Config{...})` without
	// thinking about AutoStart — opt-out is the only way to keep the
	// "just works" shape.
	DisableAutoStart bool
	StartTimeout     time.Duration // default: 15s — how long to wait for a spawned server to respond to /health

	// CreateInstance forces creation of a fresh throwaway instance
	// instead of reusing InstanceID / APTEVA_TEST_INSTANCE_ID. The
	// instance is deleted via t.Cleanup so each test run is fully
	// isolated from the last. Use this for "proper" tests — when you
	// want guaranteed no leftover state, no stale threads, no
	// cross-contamination between test files.
	CreateInstance bool

	// InstanceName applies only with CreateInstance. Defaults to a
	// randomized "testkit-<hex>" so parallel tests don't collide.
	InstanceName string

	// Directive is the initial directive for a newly-created instance.
	// Set here OR call s.SetDirective() after New — they're equivalent
	// except setting it here saves one round-trip.
	Directive string

	// AttachMCPs lists MCP server names to attach as CATALOG entries
	// on the instance — visible to main as a "spawn workers with
	// mcp=X" hint but NOT exposed as native tools on main's registry.
	// This forces the agent to exercise the spawn path: main decides
	// to create a worker, passes the MCP name on spawn, the worker
	// connects and uses the tool. That matches the hub-and-spoke
	// shape real instances run under and gives the tests visibility
	// into thread lifecycle.
	//
	// Each name must match an already-registered MCP server in the
	// project (set up via Dashboard → Settings → MCP Servers or via
	// Composio). Credentials and URLs are looked up from the server's
	// DB — tests never see them.
	AttachMCPs []string

	// AttachMCPsMainAccess is a deprecated alias for AttachMCPs kept
	// to avoid churning every existing test in one go. The legacy
	// main_access flag is gone — every attached MCP is now searchable
	// by main, and sub-thread visibility is governed by no_spawn.
	// Tests written against this field should migrate to AttachMCPs.
	AttachMCPsMainAccess []string

	// IncludeAptevaServer / IncludeChannels match the same flags on
	// instance creation. Both default to false for test instances —
	// most tests don't want the management gateway or the chat bridge
	// cluttering their tool list.
	IncludeAptevaServer bool
	IncludeChannels     bool

	// Computer enables the browser/computer-use tool surface
	// (browser_session + computer_use) on the instance. When non-nil
	// testkit forwards this block to PUT /api/instances/:id/config in
	// the same call that lands the real directive + MCPs, so core
	// spins up its computer backend before the agent's first real
	// iteration. nil = no computer mode (default).
	//
	// Type=="browserbase"/"steel" relies on a saved provider in the
	// project for credentials — the server enriches the block before
	// forwarding to core. Type=="local" runs against a local Chrome
	// (or an existing CDP endpoint if the matching `browser` provider
	// has CDP_URL set). Type=="service" points at a custom HTTP
	// browser service (URL required).
	Computer *ComputerConfig
}

Config lets callers override the automatic environment detection. Most tests should use zero-value New(t) and drive through env vars (APTEVA_SERVER_URL, APTEVA_API_KEY, APTEVA_TEST_INSTANCE_ID).
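The field comments describe a three-step default for ServerURL: the explicit Config value, then the environment variable, then http://localhost:5280. A minimal sketch of that precedence follows; resolve and its lookup parameter are hypothetical names for this illustration, and testkit's real resolution code is not shown in this documentation.

```go
package main

import (
	"fmt"
	"os"
)

// defaultServerURL is the fallback named in the Config field comments.
const defaultServerURL = "http://localhost:5280"

// resolve (hypothetical helper) returns explicit if it is set, else the
// environment variable named key, else fallback. lookup is injected so the
// rule can be exercised without touching the real environment.
func resolve(explicit, key, fallback string, lookup func(string) string) string {
	if explicit != "" {
		return explicit
	}
	if v := lookup(key); v != "" {
		return v
	}
	return fallback
}

func main() {
	// With an empty Config.ServerURL the environment decides, then the fallback.
	fmt.Println(resolve("", "APTEVA_SERVER_URL", defaultServerURL, os.Getenv))
}
```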

type Phase ¶

type Phase struct {
	// contains filtered or unexported fields
}

Phase is the argument passed to Session.Run's callback. Inside the callback you inject events, poll for conditions, and assert. If any method on Phase fails, the phase (and test) aborts with a clear message identifying which phase failed.

func (*Phase) Inject ¶

func (p *Phase) Inject(text string)

Inject is the phase-scoped equivalent of Session.Inject. Here as a convenience so phase callbacks don't close over the session.

func (*Phase) Stats ¶

func (p *Phase) Stats() PhaseStats

Stats returns totals for the events emitted inside this phase so far. Called automatically at end of Run; available mid-phase too for fine-grained assertions (e.g. "this phase should cost under $X").

func (*Phase) Verify ¶

func (p *Phase) Verify(desc string, fn func())

Verify runs assertions. It's just a named wrapper so phase traces include which assertion block ran. Use standard t.Errorf/Fatalf inside.

func (*Phase) WaitUntil ¶

func (p *Phase) WaitUntil(timeout time.Duration, desc string, cond func() bool)

WaitUntil is the phase-scoped shim around Session.WaitUntil — it exists so Phase callbacks can call p.WaitUntil without closing over the session. New tests should prefer Session.WaitUntil directly.

type PhaseStats ¶

type PhaseStats struct {
	Iterations int     // llm.done events
	ToolCalls  int     // tool.call events
	TokensIn   int     // sum of llm.done.tokens_in
	TokensOut  int     // sum of llm.done.tokens_out
	Cost       float64 // sum of server-enriched llm.done.cost_usd
	Errors     int     // llm.error events
}

PhaseStats is the per-window rollup logPhaseStats / StatsSince returns. Time bounds are caller-supplied so the same helper works for phase-scoped, inject-to-now, or arbitrary-slice queries.

type Report ¶

type Report struct {
	Iterations  int     `json:"iterations"`
	TokensIn    int     `json:"tokens_in"`
	TokensOut   int     `json:"tokens_out"`
	Cost        float64 `json:"cost"`
	ToolCalls   int     `json:"tool_calls"`
	Errors      int     `json:"errors"`
	DurationSec float64 `json:"duration_sec"`
}

Report summarises what the session spent since it started. It is computed from the server's telemetry-stats endpoint over the shorter of two windows: the time since the session started, or 1h.

type Session ¶

type Session struct {
	// contains filtered or unexported fields
}

Session is the test harness for one apteva-server + one agent instance. Create with New(t). Safe for use only from a single test goroutine.

func New ¶

func New(t *testing.T, cfg ...Config) *Session

New wires up a Session against the URL in APTEVA_SERVER_URL (or http://localhost:5280), using APTEVA_API_KEY and APTEVA_TEST_INSTANCE_ID from the environment. If the URL isn't reachable and autostart is enabled (the default; set Config.DisableAutoStart to turn it off), New spawns apteva-server with an ephemeral data directory, bootstraps the setup-token flow to mint an API key, and creates a fresh test instance. Everything it spawned is cleaned up via t.Cleanup.

func (*Session) ContextUsage ¶

func (s *Session) ContextUsage(since time.Time) []ThreadContextUsage

ContextUsage walks llm.done events since `since` and returns a peak-tokens snapshot per thread. Used by the end-of-phase and end-of-run reports to surface context bloat before it bites.

func (*Session) Events ¶

func (s *Session) Events(typeFilter string, limit int) []TelemetryEvent

Events fetches recent telemetry events for the session's instance, optionally filtered by type. limit caps the result (default 200 if zero). Ordered newest-first by the server.

Typical use: after injecting an event, poll Events("tool.call",...) until the expected tool name shows up, then assert on the args.

func (*Session) HasToolCall ¶

func (s *Session) HasToolCall(toolName string) bool

HasToolCall returns true if a tool.call event for the named tool has been recorded since the session started. The check is exact on the tool name — pass "pushover_send_notification" not "pushover".
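To make the difference between the exact check and the prefix check of HasToolCallWithPrefix concrete, here is a self-contained sketch over an in-memory event slice. It assumes the tool name is stored under Data["name"], as the TelemetryEvent description suggests; the trimmed TelemetryEvent type, the helper names, and the sample events are illustrative only and are not testkit's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// TelemetryEvent restates only the fields this sketch needs.
type TelemetryEvent struct {
	Type string
	Data map[string]interface{}
}

// toolName returns the tool name of a tool.call event, or "" otherwise.
// Reading it from Data["name"] is an assumption.
func toolName(ev TelemetryEvent) string {
	if ev.Type != "tool.call" {
		return ""
	}
	name, _ := ev.Data["name"].(string)
	return name
}

// hasToolCall reports whether any tool.call event has exactly this name.
func hasToolCall(events []TelemetryEvent, name string) bool {
	for _, ev := range events {
		if n := toolName(ev); n != "" && n == name {
			return true
		}
	}
	return false
}

// hasToolCallWithPrefix reports whether any tool.call name starts with prefix.
func hasToolCallWithPrefix(events []TelemetryEvent, prefix string) bool {
	for _, ev := range events {
		if n := toolName(ev); n != "" && strings.HasPrefix(n, prefix) {
			return true
		}
	}
	return false
}

// sample returns made-up events for the demonstration.
func sample() []TelemetryEvent {
	return []TelemetryEvent{
		{Type: "llm.done", Data: map[string]interface{}{"tokens_in": 120.0}},
		{Type: "tool.call", Data: map[string]interface{}{"name": "pushover_send_notification"}},
	}
}

func main() {
	evs := sample()
	fmt.Println(hasToolCall(evs, "pushover"))                   // false: exact match only
	fmt.Println(hasToolCall(evs, "pushover_send_notification")) // true
	fmt.Println(hasToolCallWithPrefix(evs, "pushover_"))        // true
}
```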

func (*Session) HasToolCallWithPrefix ¶

func (s *Session) HasToolCallWithPrefix(prefix string) bool

func (*Session) Inject ¶

func (s *Session) Inject(text string)

Inject sends a console event to main. The text is what the agent sees in its [console] block, e.g. "[console] [chat] Hi".

func (*Session) LogContextUsage ¶

func (s *Session) LogContextUsage()

LogContextUsage prints the full-session per-thread context usage table. Call from the test at end-of-run for a final bloat check.

func (*Session) Report ¶

func (s *Session) Report() Report

Report fetches cumulative stats since New() or the last ResetContext(). Uses the server's 1h stats window and lets the caller subtract the session start — close enough for iteration-scale tests.

Report waits briefly (about 1s) before fetching. Core posts telemetry events to the server asynchronously (batched through forwardLoop), so the most recent iteration's llm.done may not have landed in the DB yet when a fast test wraps up. The short wait covers the typical forward latency without noticeably slowing tests.

func (*Session) ResetContext ¶

func (s *Session) ResetContext()

ResetContext wipes the main thread's message history and kills every sub-thread. Called automatically by New; call again between phases if you want isolation.

func (*Session) Run ¶

func (s *Session) Run(name string, fn func(p *Phase))

Run executes a named phase. The callback receives a Phase bound to the session. Each phase's runtime + token/cost stats are logged; nothing about Phase auto-resets context — call s.ResetContext() yourself between phases if needed.

func (*Session) SetDirective ¶

func (s *Session) SetDirective(directive string)

SetDirective rewrites the instance's directive. Use this to iterate: edit your test's directive string, re-run, compare reports.

func (*Session) SetProviderModels ¶

func (s *Session) SetProviderModels(provider string, large, medium, small string)

SetProviderModels forces an LLM provider to use specific model IDs for large/medium/small tiers on this instance. Useful to pin tests to a specific model regardless of what the user configured in their DB.

func (*Session) SpawnedThreads ¶

func (s *Session) SpawnedThreads() []string

SpawnedThreads returns the IDs of sub-threads that were spawned during the session. Driven by thread.spawn telemetry events so it works after the thread has already finished — no polling race with active-thread enumeration.

func (*Session) StatsSince ¶

func (s *Session) StatsSince(since time.Time) PhaseStats

StatsSince walks telemetry for this session's instance and sums iteration / tool-call / token / cost totals for events at or after `since`. Use it to measure a specific phase or inject-to-now window; for whole-session cumulative totals use Report().
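The rollup can be pictured as a single pass over the event list, using the event types and data keys named in the PhaseStats field comments. The sketch below is an approximation under stated assumptions: timestamps are taken to be RFC 3339 strings, numeric data values are float64 (as encoding/json decodes them), and statsSince, num, mustTime, and sample are names invented here.

```go
package main

import (
	"fmt"
	"time"
)

// TelemetryEvent restates only the fields this sketch needs.
type TelemetryEvent struct {
	Type string
	Time string // assumed RFC 3339
	Data map[string]interface{}
}

// PhaseStats is copied from the declaration in this package's docs.
type PhaseStats struct {
	Iterations, ToolCalls, TokensIn, TokensOut int
	Cost                                       float64
	Errors                                     int
}

// num reads a JSON-style number from data, or 0 if absent or mistyped.
func num(data map[string]interface{}, key string) float64 {
	f, _ := data[key].(float64)
	return f
}

// statsSince sums the events whose timestamp is at or after since.
func statsSince(events []TelemetryEvent, since time.Time) PhaseStats {
	var st PhaseStats
	for _, ev := range events {
		ts, err := time.Parse(time.RFC3339, ev.Time)
		if err != nil || ts.Before(since) {
			continue // unparseable or outside the window
		}
		switch ev.Type {
		case "llm.done":
			st.Iterations++
			st.TokensIn += int(num(ev.Data, "tokens_in"))
			st.TokensOut += int(num(ev.Data, "tokens_out"))
			st.Cost += num(ev.Data, "cost_usd")
		case "tool.call":
			st.ToolCalls++
		case "llm.error":
			st.Errors++
		}
	}
	return st
}

// mustTime parses an RFC 3339 timestamp or panics.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// sample returns made-up events; the first one predates the window used in main.
func sample() []TelemetryEvent {
	done := func(ts string, in, out, cost float64) TelemetryEvent {
		return TelemetryEvent{Type: "llm.done", Time: ts, Data: map[string]interface{}{
			"tokens_in": in, "tokens_out": out, "cost_usd": cost}}
	}
	return []TelemetryEvent{
		done("2026-01-01T00:00:05Z", 100, 10, 0.5),
		done("2026-01-01T00:00:10Z", 200, 20, 0.25),
		{Type: "tool.call", Time: "2026-01-01T00:00:11Z"},
		{Type: "llm.error", Time: "2026-01-01T00:00:12Z"},
		done("2026-01-01T00:00:13Z", 300, 30, 0.5),
	}
}

func main() {
	fmt.Printf("%+v\n", statsSince(sample(), mustTime("2026-01-01T00:00:10Z")))
}
```

An event stamped exactly at `since` is counted, matching the "at or after" wording above.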

func (*Session) Status ¶

func (s *Session) Status() StatusInfo

Status returns the current agent status. Fails the test on HTTP error.

func (*Session) ToolCallsByThread ¶

func (s *Session) ToolCallsByThread() map[string][]string

ToolCallsByThread returns all recorded tool.call events grouped by the thread that made them. Lets tests assert specifically that "some sub-thread (not main) called X" — which is the hub-and-spoke shape we want most tests to validate rather than short-circuiting tool calls straight from main.
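The hub-and-spoke assertion described above boils down to grouping tool names by thread and then ignoring main. A minimal sketch follows, assuming the main thread's ID is the literal "main" and that tool names live under Data["name"]; both are assumptions, and "spawn" and "worker-1" are made-up sample values.

```go
package main

import "fmt"

// mainThreadID is an assumed identifier for the orchestrator thread.
const mainThreadID = "main"

// TelemetryEvent restates only the fields this sketch needs.
type TelemetryEvent struct {
	ThreadID string
	Type     string
	Data     map[string]interface{}
}

// groupByThread collects tool.call names per thread, in event order.
func groupByThread(events []TelemetryEvent) map[string][]string {
	out := make(map[string][]string)
	for _, ev := range events {
		if ev.Type != "tool.call" {
			continue
		}
		name, ok := ev.Data["name"].(string)
		if !ok {
			continue
		}
		out[ev.ThreadID] = append(out[ev.ThreadID], name)
	}
	return out
}

// subThreadCalled reports whether any thread other than main called tool.
func subThreadCalled(byThread map[string][]string, tool string) bool {
	for thread, names := range byThread {
		if thread == mainThreadID {
			continue
		}
		for _, n := range names {
			if n == tool {
				return true
			}
		}
	}
	return false
}

// sample returns made-up events: main spawns, a worker sends the notification.
func sample() []TelemetryEvent {
	return []TelemetryEvent{
		{ThreadID: "main", Type: "tool.call", Data: map[string]interface{}{"name": "spawn"}},
		{ThreadID: "worker-1", Type: "tool.call", Data: map[string]interface{}{"name": "pushover_send_notification"}},
	}
}

func main() {
	byThread := groupByThread(sample())
	fmt.Println(subThreadCalled(byThread, "pushover_send_notification")) // true
	fmt.Println(subThreadCalled(byThread, "spawn"))                      // false: only main called it
}
```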

func (*Session) Verify ¶

func (s *Session) Verify(desc string, fn func())

Verify is the session-level counterpart of Phase.Verify — named assertion block so test logs show which checks ran. Intended for tests that don't use s.Run phases.

func (*Session) WaitIdle ¶

func (s *Session) WaitIdle(timeout, quiet time.Duration)

Note on HasToolCallWithPrefix (declared above): it returns true when any tool.call event's name starts with prefix. This is useful for MCP-backed tools where the agent picks the specific endpoint (pushover_send_notification vs. pushover_send_priority_alert vs. ...): the test cares that "some pushover tool fired", not which one.

WaitIdle blocks until we're confident the scenario has settled. It returns when any of these three conditions is met (whichever comes first), bounded by `timeout`:

1. `quiet` of consecutive time passes with no new telemetry events. This is the plain-silence rule for tests where the agent just stops emitting.
2. main has paced to a long sleep (≥1m) and a short grace period (`min(quiet, 5s)`) has passed since it did so. Main is the orchestrator: if it's asleep for an hour, nothing meaningful is going to happen regardless of what workers do. This specifically avoids the trap where a worker's 5m pace fires a heartbeat event that resets the quiet timer long after the real work is done.
3. All live threads have paced with a long sleep. Same spirit as (2), but for tests that finish before main emits a long pace (e.g. when the scenario uses a different top-level strategy).
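The three settle conditions can be read as one pure decision over a snapshot of the scenario. The sketch below encodes them that way; the snapshot type, its fields, and the helpers are hypothetical, and how testkit actually observes pacing and silence is not described in this documentation. Only the ≥1m threshold and the min(quiet, 5s) grace period come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	longSleep = time.Minute     // "long sleep" threshold from the docs
	maxGrace  = 5 * time.Second // upper bound of the grace period
)

// snapshot is a hypothetical view of the scenario at one instant.
type snapshot struct {
	sinceLastEvent time.Duration   // time since the newest telemetry event
	mainSleep      time.Duration   // main's current pace, 0 if it has not paced
	sinceMainPaced time.Duration   // time since main set that pace
	liveSleeps     []time.Duration // current pace of every live thread
}

// settled applies the three rules in order and names the one that fired.
func settled(s snapshot, quiet time.Duration) (bool, string) {
	// Rule 1: plain silence.
	if s.sinceLastEvent >= quiet {
		return true, "quiet"
	}
	// Rule 2: main is in a long sleep and the grace period has elapsed.
	grace := quiet
	if grace > maxGrace {
		grace = maxGrace
	}
	if s.mainSleep >= longSleep && s.sinceMainPaced >= grace {
		return true, "main asleep"
	}
	// Rule 3: every live thread is in a long sleep.
	if len(s.liveSleeps) > 0 {
		all := true
		for _, d := range s.liveSleeps {
			if d < longSleep {
				all = false
				break
			}
		}
		if all {
			return true, "all threads asleep"
		}
	}
	return false, ""
}

// secs converts whole seconds to a Duration.
func secs(n int) time.Duration { return time.Duration(n) * time.Second }

// snap builds a snapshot from second counts, purely for readability.
func snap(sinceLastEvent, mainSleep, sinceMainPaced int, liveSleeps ...int) snapshot {
	s := snapshot{sinceLastEvent: secs(sinceLastEvent), mainSleep: secs(mainSleep), sinceMainPaced: secs(sinceMainPaced)}
	for _, n := range liveSleeps {
		s.liveSleeps = append(s.liveSleeps, secs(n))
	}
	return s
}

func main() {
	// A worker heartbeat arrived 2s ago, but main has been asleep (1h pace) for 6s.
	ok, why := settled(snap(2, 3600, 6, 3600, 10), secs(20))
	fmt.Println(ok, why)
}
```

A real wait loop would evaluate a check like this on every poll and give up once `timeout` elapses.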

func (*Session) WaitUntil ¶

func (s *Session) WaitUntil(timeout time.Duration, desc string, cond func() bool)

WaitUntil polls cond every 500ms and returns when cond returns true or when timeout elapses. On timeout it fatals the test. desc is used in the timeout message and in the 15-second progress nudge so silent waits don't look like a frozen test runner.
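The polling behaviour described here is a plain deadline loop. The following stand-alone sketch shows the general pattern with a configurable interval. Unlike the real method it returns false on timeout instead of failing the test, and it omits the progress nudge; waitUntil and ms are names chosen for this example.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntil polls cond every interval until it returns true or timeout
// elapses. It reports whether cond became true in time.
func waitUntil(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// ms converts whole milliseconds to a Duration.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

func main() {
	calls := 0
	ok := waitUntil(ms(500), ms(10), func() bool {
		calls++
		return calls >= 3 // stands in for a check like s.Status().Iteration >= 1
	})
	fmt.Println(ok, calls)
}
```

Checking the condition before the first sleep means an already-satisfied condition returns immediately.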

type StatusInfo ¶

type StatusInfo struct {
	Iteration int    `json:"iteration"`
	Threads   int    `json:"threads"`
	Paused    bool   `json:"paused"`
	Mode      string `json:"mode"`
	Rate      string `json:"rate"`
	Model     string `json:"model"`
}

StatusInfo is the minimum subset of the /status payload tests care about. Exposed so tests can write conditions like `s.Status().Iteration >= 3` without depending on core types.

type TelemetryEvent ¶

type TelemetryEvent struct {
	ID       string                 `json:"id"`
	ThreadID string                 `json:"thread_id"`
	Type     string                 `json:"type"`
	Time     string                 `json:"time"`
	Data     map[string]interface{} `json:"data"`
}

TelemetryEvent is the subset of a stored telemetry row tests look at: type (e.g. "tool.call"), thread id, time, and the free-form data map provided by the emitter. Tests typically grep Data for {"name": "pushover_send_notification", ...} or similar.

type ThreadContextUsage ¶

type ThreadContextUsage struct {
	ThreadID   string
	PeakIn     int // max tokens_in seen
	LastIn     int // most recent tokens_in
	ContextMax int // model's max_context_tokens (0 if unknown)
	PeakMsgs   int // max context_msgs
	PeakChars  int // max context_chars
	Iters      int // # llm.done events
}

ThreadContextUsage summarises context-window usage for one thread across a telemetry slice. Peaks come from llm.done events.
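A peak-per-thread rollup of this kind can be sketched as a single pass over llm.done events. The version below tracks only PeakIn, LastIn, and Iters, assumes the slice is ordered oldest-first, and reads tokens_in as a float64 the way encoding/json would decode it. The function name, the trimmed struct, and the sample numbers are all illustrative.

```go
package main

import "fmt"

// TelemetryEvent restates only the fields this sketch needs.
type TelemetryEvent struct {
	ThreadID string
	Type     string
	Data     map[string]interface{}
}

// ThreadContextUsage is a trimmed copy of the struct above.
type ThreadContextUsage struct {
	ThreadID string
	PeakIn   int // max tokens_in seen
	LastIn   int // most recent tokens_in
	Iters    int // number of llm.done events
}

// usageByThread folds llm.done events (assumed oldest-first) into one
// summary per thread.
func usageByThread(events []TelemetryEvent) map[string]ThreadContextUsage {
	out := make(map[string]ThreadContextUsage)
	for _, ev := range events {
		if ev.Type != "llm.done" {
			continue
		}
		f, _ := ev.Data["tokens_in"].(float64)
		in := int(f)
		u := out[ev.ThreadID]
		u.ThreadID = ev.ThreadID
		if in > u.PeakIn {
			u.PeakIn = in
		}
		u.LastIn = in // later events overwrite earlier ones
		u.Iters++
		out[ev.ThreadID] = u
	}
	return out
}

// sample returns made-up events in chronological order.
func sample() []TelemetryEvent {
	done := func(thread string, in float64) TelemetryEvent {
		return TelemetryEvent{ThreadID: thread, Type: "llm.done",
			Data: map[string]interface{}{"tokens_in": in}}
	}
	return []TelemetryEvent{
		done("main", 1000),
		done("main", 3000),
		{ThreadID: "main", Type: "tool.call"},
		done("worker-1", 400),
		done("main", 1500),
	}
}

func main() {
	for id, u := range usageByThread(sample()) {
		fmt.Printf("%s peak=%d last=%d iters=%d\n", id, u.PeakIn, u.LastIn, u.Iters)
	}
}
```

A gap between PeakIn and LastIn, as on main in the sample, would indicate that the context shrank at some point, for example after a reset.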
