Move integration-test runbook and authoring guide into the component READMEs so the content sits next to the code it describes.
11 KiB
hi — Headscale Integration test runner
hi wraps Docker container orchestration around the tests in
../../integration and extracts debugging artefacts
(logs, database snapshots, MapResponse protocol captures) for post-mortem
analysis.
Read this file in full before running any hi command. The test
runner has sharp edges — wrong flags produce stale containers, lost
artefacts, or hung CI.
For test-authoring patterns (scenario setup, EventuallyWithT,
IntegrationSkip, helper variants), read
../../integration/README.md.
Quick Start
# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor
# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"
# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres
# Pattern matching
go run ./cmd/hi run "TestSubnet*"
Run doctor before the first run in any new environment. Tests
generate ~100 MB of logs per run in control_logs/; doctor verifies
there is enough space and that the required Docker images are available.
Commands
| Command | Purpose |
|---|---|
run [pattern] |
Execute the test(s) matching pattern |
doctor |
Verify system requirements |
clean networks |
Prune unused Docker networks |
clean images |
Clean old test images |
clean containers |
Kill all test containers (dangerous — see below) |
clean cache |
Clean Go module cache volume |
clean all |
Run all cleanup operations |
Flags
Defaults are tuned for single-test development runs. Review before changing.
| Flag | Default | Purpose |
|---|---|---|
--timeout |
120m |
Total test timeout. Use the built-in flag — never wrap with bash timeout. |
--postgres |
false |
Use PostgreSQL instead of SQLite |
--failfast |
true |
Stop on first test failure |
--go-version |
auto | Detected from go.mod (currently 1.26.1) |
--clean-before |
true |
Clean stale (stopped/exited) containers before starting |
--clean-after |
true |
Clean this run's containers after completion |
--keep-on-failure |
false |
Preserve containers for manual inspection on failure |
--logs-dir |
control_logs |
Where to save run artefacts |
--verbose |
false |
Verbose output |
--stats |
false |
Collect container resource-usage stats |
--hs-memory-limit |
0 |
Fail if any headscale container exceeds N MB (0 = disabled) |
--ts-memory-limit |
0 |
Fail if any tailscale container exceeds N MB |
Timeout guidance
The default 120m is generous for a single test. If you must tune it,
these are realistic floors by category:
| Test type | Minimum | Examples |
|---|---|---|
| Basic functionality / CLI | 900s (15m) | TestPingAllByIP, TestCLI* |
| Route / ACL | 1200s (20m) | TestSubnet*, TestACL* |
| HA / failover | 1800s (30m) | TestHASubnetRouter* |
| Long-running | 2100s (35m) | TestNodeOnlineStatus (~12 min body) |
| Full suite | 45m | go test ./integration -timeout 45m |
Never use the shell timeout command around hi. It kills the
process mid-cleanup and leaves stale containers:
timeout 300 go run ./cmd/hi run "TestName" # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s # correct
Concurrent Execution
Multiple hi run invocations can run simultaneously on the same Docker
daemon. Each invocation gets a unique Run ID (format
YYYYMMDD-HHMMSS-6charhash, e.g. 20260409-104215-mdjtzx).
- Container names include the short run ID:
ts-mdjtzx-1-74-fgdyls - Docker labels:
hi.run-id={runID}on every container - Port allocation: dynamic — kernel assigns free ports, no conflicts
- Cleanup isolation: each run cleans only its own containers
- Log directories:
control_logs/{runID}/
# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
Safety rules for concurrent runs
- ✅ Your run cleans only containers labelled with its own
hi.run-id - ✅
--clean-beforeremoves only stopped/exited containers - ❌ Never run
docker rm -f $(docker ps -q --filter name=hs-)— this destroys other agents' live test sessions - ❌ Never run
docker system prune -fwhile any tests are running - ❌ Never run
hi clean containers/hi clean allwhile other tests are running — both kill all test containers on the daemon
To identify your own containers:
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
The run ID appears at the top of the hi run output — copy it from
there rather than trying to reconstruct it.
Artefacts
Every run saves debugging artefacts under control_logs/{runID}/:
control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log # headscale server errors
├── hs-<test>-<hash>.stdout.log # headscale server output
├── hs-<test>-<hash>.db # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/ # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log # tailscale client errors
├── ts-<client>-<hash>.stdout.log # tailscale client output
└── ts-<client>-<hash>_status.json # client network-status dump
Artefacts persist after cleanup. Old runs accumulate fast — delete unwanted directories to reclaim disk.
Debugging workflow
When a test fails, read the artefacts in this order:
-
hs-*.stderr.log— headscale server errors, panics, policy evaluation failures. Most issues originate server-side.grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log -
ts-*.stderr.log— authentication failures, connectivity issues, DNS resolution problems on the client side. -
MapResponse JSON in
hs-*-mapresponses/— protocol-level debugging for network map generation, peer visibility, route distribution, policy evaluation results.ls control_logs/*/hs-*-mapresponses/ jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \ control_logs/*/hs-*-mapresponses/001.json -
*_status.json— client peer-connectivity state. -
hs-*.db— SQLite snapshot for post-mortem consistency checks.sqlite3 control_logs/<runID>/hs-*.db sqlite> .tables sqlite> .schema nodes sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%'; -
*_metrics.txt— Prometheus dumps for latency, NodeStore operation timing, database query performance, memory usage.
Heuristic: infrastructure vs code
Before blaming Docker, disk, or network: read hs-*.stderr.log in
full. In practice, well over 99% of failures are code bugs (policy
evaluation, NodeStore sync, route approval) rather than infrastructure.
Actual infrastructure failures have signature error messages:
| Signature | Cause | Fix |
|---|---|---|
failed to resolve "hs-...": no DNS fallback candidates remain |
Docker DNS | Reset Docker networking |
container creation timeout, no progress >2 min |
Resource exhaustion | docker system prune -f (when no other tests running), retry |
| OOM kills, slow Docker daemon | Too many concurrent tests | Reduce concurrency, wait for completion |
no space left on device |
Disk full | Delete old control_logs/ |
If you don't see a signature error, assume it's a code regression — do not retry hoping the flake goes away.
Common failure patterns (code bugs)
Route advertisement timing
Test asserts route state before the client has finished propagating its
Hostinfo update. Symptom: nodes[0].GetAvailableRoutes() empty when
the test expects a route.
- Wrong fix:
time.Sleep(5 * time.Second)— fragile and slow. - Right fix: wrap the assertion in
EventuallyWithT. See../../integration/README.md.
NodeStore sync issues
Route changes not reflected in the NodeStore snapshot. Symptom: route advertisements in logs but no tracking updates in subsequent reads.
The sync point is State.UpdateNodeFromMapRequest() in
hscontrol/state/state.go. If you added a new kind of client state
update, make sure it lands here.
HA failover: routes disappearing on disconnect
TestHASubnetRouterFailover fails because approved routes vanish when
a subnet router goes offline. This is a bug, not expected behaviour.
Route approval must not be coupled to client connectivity — routes
stay approved; only the primary-route selection is affected by
connectivity.
Policy evaluation race
Symptom: tests that change policy and immediately assert peer visibility fail intermittently. Policy changes trigger async recomputation.
- See recent fixes in
git log -- hscontrol/state/for examples (e.g. thePolicyChangetrigger on every Connect/Disconnect).
SQLite vs PostgreSQL timing differences
Some race conditions only surface on one backend. If a test is flaky,
try the other backend with --postgres:
go run ./cmd/hi run "TestName" --postgres --verbose
PostgreSQL generally has more consistent timing; SQLite can expose races during rapid writes.
Keeping containers for inspection
If you need to inspect a failed test's state manually:
go run ./cmd/hi run "TestName" --keep-on-failure
# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>
# clean up manually when done
go run ./cmd/hi clean all # only when no other tests are running