headscale/cmd/hi/README.md
Kristoffer Dalby 70b622fc68 docs: expand cmd/hi and integration READMEs
Move integration-test runbook and authoring guide into the component
READMEs so the content sits next to the code it describes.
2026-04-10 12:30:07 +01:00


# hi — Headscale Integration test runner

hi wraps Docker container orchestration around the tests in ../../integration and extracts debugging artefacts (logs, database snapshots, MapResponse protocol captures) for post-mortem analysis.

Read this file in full before running any hi command. The test runner has sharp edges — wrong flags produce stale containers, lost artefacts, or hung CI.

For test-authoring patterns (scenario setup, EventuallyWithT, IntegrationSkip, helper variants), read ../../integration/README.md.

## Quick Start

```bash
# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor

# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"

# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres

# Pattern matching
go run ./cmd/hi run "TestSubnet*"
```

Run doctor before the first run in any new environment. Tests generate ~100 MB of logs per run in control_logs/; doctor verifies there is enough space and that the required Docker images are available.
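`doctor` performs these checks automatically; if you want to eyeball the raw numbers yourself, a rough manual equivalent (not doctor's exact logic or thresholds) is:

```bash
# Spot-check the data doctor looks at (thresholds here are not doctor's own)
df -h .                                        # free disk on this volume
du -sh control_logs/ 2>/dev/null || echo "no control_logs/ yet"
```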

## Commands

| Command | Purpose |
|---|---|
| `run [pattern]` | Execute the test(s) matching pattern |
| `doctor` | Verify system requirements |
| `clean networks` | Prune unused Docker networks |
| `clean images` | Clean old test images |
| `clean containers` | Kill all test containers (dangerous — see below) |
| `clean cache` | Clean Go module cache volume |
| `clean all` | Run all cleanup operations |

## Flags

Defaults are tuned for single-test development runs. Review before changing.

| Flag | Default | Purpose |
|---|---|---|
| `--timeout` | `120m` | Total test timeout. Use the built-in flag — never wrap with bash `timeout`. |
| `--postgres` | `false` | Use PostgreSQL instead of SQLite |
| `--failfast` | `true` | Stop on first test failure |
| `--go-version` | auto | Detected from `go.mod` (currently 1.26.1) |
| `--clean-before` | `true` | Clean stale (stopped/exited) containers before starting |
| `--clean-after` | `true` | Clean this run's containers after completion |
| `--keep-on-failure` | `false` | Preserve containers for manual inspection on failure |
| `--logs-dir` | `control_logs` | Where to save run artefacts |
| `--verbose` | `false` | Verbose output |
| `--stats` | `false` | Collect container resource-usage stats |
| `--hs-memory-limit` | `0` | Fail if any headscale container exceeds N MB (0 = disabled) |
| `--ts-memory-limit` | `0` | Fail if any tailscale container exceeds N MB |

### Timeout guidance

The default 120m is generous for a single test. If you must tune it, these are realistic floors by category:

| Test type | Minimum | Examples |
|---|---|---|
| Basic functionality / CLI | 900s (15m) | TestPingAllByIP, TestCLI* |
| Route / ACL | 1200s (20m) | TestSubnet*, TestACL* |
| HA / failover | 1800s (30m) | TestHASubnetRouter* |
| Long-running | 2100s (35m) | TestNodeOnlineStatus (~12 min body) |
| Full suite | 45m | `go test ./integration -timeout 45m` |

Never use the shell timeout command around hi. It kills the process mid-cleanup and leaves stale containers:

```bash
timeout 300 go run ./cmd/hi run "TestName"     # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s  # correct
```

## Concurrent Execution

Multiple hi run invocations can run simultaneously on the same Docker daemon. Each invocation gets a unique Run ID (format YYYYMMDD-HHMMSS-6charhash, e.g. 20260409-104215-mdjtzx).

- Container names include the short run ID: `ts-mdjtzx-1-74-fgdyls`
- Docker labels: `hi.run-id={runID}` on every container
- Port allocation: dynamic — kernel assigns free ports, no conflicts
- Cleanup isolation: each run cleans only its own containers
- Log directories: `control_logs/{runID}/`

```bash
# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
```
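For scripting around run IDs, the shape can be reproduced like this — note the random-suffix scheme below is an illustrative assumption, not hi's actual generator:

```bash
# Sketch of the YYYYMMDD-HHMMSS-6charhash shape (suffix scheme is hypothetical)
ts=$(date +%Y%m%d-%H%M%S)
suffix=$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c6)
echo "${ts}-${suffix}"
```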

### Safety rules for concurrent runs

- Your run cleans only containers labelled with its own `hi.run-id`
- `--clean-before` removes only stopped/exited containers
- Never run `docker rm -f $(docker ps -q --filter name=hs-)` — this destroys other agents' live test sessions
- Never run `docker system prune -f` while any tests are running
- Never run `hi clean containers` / `hi clean all` while other tests are running — both kill all test containers on the daemon

To identify your own containers:

```bash
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
```

The run ID appears at the top of the hi run output — copy it from there rather than trying to reconstruct it.
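If the ID has already scrolled out of your terminal, the most recently modified artefact directory is usually the newest run — a heuristic only, and unreliable when several runs started close together:

```bash
# Newest run directory first; a fallback, not a substitute for the printed run ID
ls -dt control_logs/*/ | head -1
```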

## Artefacts

Every run saves debugging artefacts under control_logs/{runID}/:

```
control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log        # headscale server errors
├── hs-<test>-<hash>.stdout.log        # headscale server output
├── hs-<test>-<hash>.db                # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt       # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/     # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log      # tailscale client errors
├── ts-<client>-<hash>.stdout.log      # tailscale client output
└── ts-<client>-<hash>_status.json     # client network-status dump
```

Artefacts persist after cleanup. Old runs accumulate fast — delete unwanted directories to reclaim disk.
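A minimal cleanup sketch — the 7-day cutoff is an arbitrary example, not a hi default:

```bash
# Remove run directories older than 7 days (adjust -mtime to your disk budget)
find control_logs/ -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
```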

## Debugging workflow

When a test fails, read the artefacts in this order:

1. `hs-*.stderr.log` — headscale server errors, panics, policy evaluation failures. Most issues originate server-side.

   ```bash
   grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
   ```

2. `ts-*.stderr.log` — authentication failures, connectivity issues, DNS resolution problems on the client side.

3. MapResponse JSON in `hs-*-mapresponses/` — protocol-level debugging for network map generation, peer visibility, route distribution, policy evaluation results.

   ```bash
   ls control_logs/*/hs-*-mapresponses/
   jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
       control_logs/*/hs-*-mapresponses/001.json
   ```

4. `*_status.json` — client peer-connectivity state.

5. `hs-*.db` — SQLite snapshot for post-mortem consistency checks.

   ```
   $ sqlite3 control_logs/<runID>/hs-*.db
   sqlite> .tables
   sqlite> .schema nodes
   sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';
   ```

6. `*_metrics.txt` — Prometheus dumps for latency, NodeStore operation timing, database query performance, memory usage.
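For the metrics dumps, a quick way to surface timing-related series — the name patterns below are guesses, so check the dump for the exact metric names:

```bash
# Pull duration/latency series out of the Prometheus dumps
grep -hE 'duration_seconds|latency' control_logs/*/hs-*_metrics.txt | sort | head -20
```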

### Heuristic: infrastructure vs code

Before blaming Docker, disk, or network: read hs-*.stderr.log in full. In practice, well over 99% of failures are code bugs (policy evaluation, NodeStore sync, route approval) rather than infrastructure.

Actual infrastructure failures have signature error messages:

| Signature | Cause | Fix |
|---|---|---|
| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS | Reset Docker networking |
| Container creation timeout, no progress >2 min | Resource exhaustion | `docker system prune -f` (when no other tests running), retry |
| OOM kills, slow Docker daemon | Too many concurrent tests | Reduce concurrency, wait for completion |
| `no space left on device` | Disk full | Delete old `control_logs/` |

If you don't see a signature error, assume it's a code regression — do not retry hoping the flake goes away.

## Common failure patterns (code bugs)

### Route advertisement timing

Test asserts route state before the client has finished propagating its Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when the test expects a route.

- Wrong fix: `time.Sleep(5 * time.Second)` — fragile and slow.
- Right fix: wrap the assertion in `EventuallyWithT`. See ../../integration/README.md.

### NodeStore sync issues

Route changes not reflected in the NodeStore snapshot. Symptom: route advertisements in logs but no tracking updates in subsequent reads.

The sync point is `State.UpdateNodeFromMapRequest()` in `hscontrol/state/state.go`. If you added a new kind of client state update, make sure it lands here.

### HA failover: routes disappearing on disconnect

TestHASubnetRouterFailover fails because approved routes vanish when a subnet router goes offline. This is a bug, not expected behaviour. Route approval must not be coupled to client connectivity — routes stay approved; only the primary-route selection is affected by connectivity.

### Policy evaluation race

Symptom: tests that change policy and immediately assert peer visibility fail intermittently. Policy changes trigger async recomputation.

- See recent fixes in `git log -- hscontrol/state/` for examples (e.g. the PolicyChange trigger on every Connect/Disconnect).

### SQLite vs PostgreSQL timing differences

Some race conditions only surface on one backend. If a test is flaky, try the other backend with --postgres:

```bash
go run ./cmd/hi run "TestName" --postgres --verbose
```

PostgreSQL generally has more consistent timing; SQLite can expose races during rapid writes.

### Keeping containers for inspection

If you need to inspect a failed test's state manually:

```bash
go run ./cmd/hi run "TestName" --keep-on-failure
# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>
# clean up manually when done
go run ./cmd/hi clean all   # only when no other tests are running
```