mirror of https://github.com/juanfont/headscale.git synced 2026-04-11 11:37:03 +02:00

Kristoffer Dalby 70b622fc68 docs: expand cmd/hi and integration READMEs

Move integration-test runbook and authoring guide into the component
READMEs so the content sits next to the code it describes.

2026-04-10 12:30:07 +01:00

11 KiB


hi — Headscale Integration test runner

hi wraps Docker container orchestration around the tests in ../../integration and extracts debugging artefacts (logs, database snapshots, MapResponse protocol captures) for post-mortem analysis.

Read this file in full before running any hi command. The test runner has sharp edges — wrong flags produce stale containers, lost artefacts, or hung CI.

For test-authoring patterns (scenario setup, EventuallyWithT, IntegrationSkip, helper variants), read ../../integration/README.md.

Quick Start

# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor

# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"

# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres

# Pattern matching
go run ./cmd/hi run "TestSubnet*"

Run doctor before the first run in any new environment. Tests generate ~100 MB of logs per run in control_logs/; doctor verifies there is enough space and that the required Docker images are available.

Commands

Command	Purpose
`run [pattern]`	Execute the test(s) matching `pattern`
`doctor`	Verify system requirements
`clean networks`	Prune unused Docker networks
`clean images`	Clean old test images
`clean containers`	Kill all test containers (dangerous — see below)
`clean cache`	Clean Go module cache volume
`clean all`	Run all cleanup operations

Flags

Defaults are tuned for single-test development runs. Review before changing.

Flag	Default	Purpose
`--timeout`	`120m`	Total test timeout. Use the built-in flag — never wrap with bash `timeout`.
`--postgres`	`false`	Use PostgreSQL instead of SQLite
`--failfast`	`true`	Stop on first test failure
`--go-version`	auto	Detected from `go.mod` (currently 1.26.1)
`--clean-before`	`true`	Clean stale (stopped/exited) containers before starting
`--clean-after`	`true`	Clean this run's containers after completion
`--keep-on-failure`	`false`	Preserve containers for manual inspection on failure
`--logs-dir`	`control_logs`	Where to save run artefacts
`--verbose`	`false`	Verbose output
`--stats`	`false`	Collect container resource-usage stats
`--hs-memory-limit`	`0`	Fail if any headscale container exceeds N MB (0 = disabled)
`--ts-memory-limit`	`0`	Fail if any tailscale container exceeds N MB (0 = disabled)

Timeout guidance

The default 120m is generous for a single test. If you must tune it, these are realistic floors by category:

Test type	Minimum	Examples
Basic functionality / CLI	900s (15m)	`TestPingAllByIP`, `TestCLI*`
Route / ACL	1200s (20m)	`TestSubnet`, `TestACL`
HA / failover	1800s (30m)	`TestHASubnetRouter*`
Long-running	2100s (35m)	`TestNodeOnlineStatus` (~12 min body)
Full suite	45m	`go test ./integration -timeout 45m`

Never use the shell timeout command around hi. It kills the process mid-cleanup and leaves stale containers:

timeout 300 go run ./cmd/hi run "TestName"   # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s  # correct

Concurrent Execution

Multiple hi run invocations can run simultaneously on the same Docker daemon. Each invocation gets a unique Run ID (format YYYYMMDD-HHMMSS-6charhash, e.g. 20260409-104215-mdjtzx).

Container names include the short run ID: ts-mdjtzx-1-74-fgdyls
Docker labels: hi.run-id={runID} on every container
Port allocation: dynamic — kernel assigns free ports, no conflicts
Cleanup isolation: each run cleans only its own containers
Log directories: control_logs/{runID}/

# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
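
Backgrounded runs return control to the shell immediately, so a wrapper script has to wait for them and notice failures itself. The following is a minimal sketch, assuming hi exits non-zero when a test fails (this README does not state its exit codes). `wait_all` is a hypothetical helper, not part of hi:

```bash
# wait_all PID...: wait for every given background job.
# Returns 1 if any job exited non-zero, 0 otherwise.
# Hypothetical helper; assumes hi reports test failure via its exit status.
wait_all() {
    wait_all_rc=0
    for pid in "$@"; do
        wait "$pid" || wait_all_rc=1
    done
    return "$wait_all_rc"
}

# Usage with the three runs above:
#   go run ./cmd/hi run "TestPingAllByIP" & p1=$!
#   go run ./cmd/hi run "TestACLAllowUserDst" & p2=$!
#   go run ./cmd/hi run "TestOIDCAuthenticationPingAll" & p3=$!
#   wait_all "$p1" "$p2" "$p3" || echo "at least one run failed; check control_logs/"
```

Waiting on each PID individually keeps the per-run exit status, which a bare `wait` with no arguments would discard.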

Safety rules for concurrent runs

✅ Your run cleans only containers labelled with its own hi.run-id
✅ --clean-before removes only stopped/exited containers
❌ Never run docker rm -f $(docker ps -q --filter name=hs-) — this destroys other agents' live test sessions
❌ Never run docker system prune -f while any tests are running
❌ Never run hi clean containers / hi clean all while other tests are running — both kill all test containers on the daemon

To identify your own containers:

docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"

The run ID appears at the top of the hi run output — copy it from there rather than trying to reconstruct it.
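
If a script captures the run ID from the output, it may help to sanity-check it before passing it to Docker. A small sketch of two hypothetical helpers, assuming the hash segment is six lowercase letters or digits as in the example above (the README does not specify the alphabet):

```bash
# is_run_id ID: succeed if ID looks like YYYYMMDD-HHMMSS-<6 chars>.
# The [a-z0-9] character class is an assumption based on the example ID.
is_run_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{8}-[0-9]{6}-[a-z0-9]{6}$'
}

# short_run_id ID: print the final segment, which is the short ID
# embedded in container names such as ts-mdjtzx-1-74-fgdyls.
short_run_id() {
    printf '%s\n' "${1##*-}"
}

# Usage:
#   is_run_id "$RUN_ID" && docker ps --filter "label=hi.run-id=$RUN_ID"
```

The label filter stays the more reliable way to select containers; the short ID is mainly useful for reading container names.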

Artefacts

Every run saves debugging artefacts under control_logs/{runID}/:

control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log        # headscale server errors
├── hs-<test>-<hash>.stdout.log        # headscale server output
├── hs-<test>-<hash>.db                # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt       # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/     # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log      # tailscale client errors
├── ts-<client>-<hash>.stdout.log      # tailscale client output
└── ts-<client>-<hash>_status.json     # client network-status dump

Artefacts persist after cleanup. Old runs accumulate fast — delete unwanted directories to reclaim disk.
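
One way to keep the directory from growing unbounded is to prune run directories by age. This is a sketch only: `prune_old_runs` is a hypothetical helper, the 7-day default is an arbitrary assumption, and it relies on `find` supporting `-mindepth`/`-maxdepth` (GNU and BSD find both do):

```bash
# prune_old_runs [DIR] [DAYS]: delete run directories directly under DIR
# whose modification time is more than DAYS days old.
# Defaults (control_logs, 7 days) are illustrative assumptions.
prune_old_runs() {
    prune_dir=${1:-control_logs}
    prune_days=${2:-7}
    [ -d "$prune_dir" ] || return 0
    find "$prune_dir" -mindepth 1 -maxdepth 1 -type d \
        -mtime +"$prune_days" -exec rm -rf {} +
}

# Usage:
#   prune_old_runs control_logs 14
```

A directory's modification time changes when files are added to it, so an age threshold of several days should leave in-progress runs alone.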

Debugging workflow

When a test fails, read the artefacts in this order:

1. hs-*.stderr.log: headscale server errors, panics, policy evaluation failures. Most issues originate server-side.
   ```
   grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
   ```
2. ts-*.stderr.log: authentication failures, connectivity issues, DNS resolution problems on the client side.
3. MapResponse JSON in hs-*-mapresponses/: protocol-level debugging for network map generation, peer visibility, route distribution, policy evaluation results.
   ```
   ls control_logs/*/hs-*-mapresponses/
   jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
       control_logs/*/hs-*-mapresponses/001.json
   ```
4. *_status.json: client peer-connectivity state.
5. hs-*.db: SQLite snapshot for post-mortem consistency checks.

   sqlite3 control_logs/<runID>/hs-*.db
   sqlite> .tables
   sqlite> .schema nodes
   sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';

6. *_metrics.txt: Prometheus dumps for latency, NodeStore operation timing, database query performance, memory usage.
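
The first steps of this order can be scripted for a single run directory. A minimal sketch: `triage_run` is a hypothetical helper, and reusing the `ERROR|panic|FATAL` pattern for the tailscale client logs is an assumption, since the README only gives that pattern for headscale logs:

```bash
# triage_run RUN_DIR: print error lines from headscale logs first, then
# tailscale client logs, then list the MapResponse captures.
# Hypothetical helper; the client-log pattern is an assumption.
triage_run() {
    triage_dir=$1
    echo "== headscale stderr =="
    grep -HnE 'ERROR|panic|FATAL' "$triage_dir"/hs-*.stderr.log 2>/dev/null || true
    echo "== tailscale stderr =="
    grep -HnE 'ERROR|panic|FATAL' "$triage_dir"/ts-*.stderr.log 2>/dev/null || true
    echo "== MapResponse captures =="
    ls "$triage_dir"/hs-*-mapresponses/ 2>/dev/null || true
}

# Usage:
#   triage_run control_logs/20260409-104215-mdjtzx
```

Scoping the search to one run directory avoids mixing output from older runs, which the `control_logs/*/` globs above would include.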

Heuristic: infrastructure vs code

Before blaming Docker, disk, or network: read hs-*.stderr.log in full. In practice, well over 99% of failures are code bugs (policy evaluation, NodeStore sync, route approval) rather than infrastructure.

Actual infrastructure failures have signature error messages:

Signature	Cause	Fix
`failed to resolve "hs-...": no DNS fallback candidates remain`	Docker DNS	Reset Docker networking
`container creation timeout`, no progress >2 min	Resource exhaustion	`docker system prune -f` (only when no other tests are running), then retry
OOM kills, slow Docker daemon	Too many concurrent tests	Reduce concurrency, wait for completion
`no space left on device`	Disk full	Delete old `control_logs/`

If you don't see a signature error, assume it's a code regression — do not retry hoping the flake goes away.
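
The signature check can be automated against a saved copy of the run output. This is a sketch under assumptions: `classify_failure` is a hypothetical helper, it only knows the three literal signatures from the table above, and where those messages actually appear (hi output versus container logs) is not stated here:

```bash
# classify_failure LOGFILE: print the infrastructure cause if LOGFILE
# contains one of the signature strings from the table, otherwise say
# that no signature was found. Hypothetical helper, not part of hi.
classify_failure() {
    classify_log=$1
    if grep -qF 'no DNS fallback candidates remain' "$classify_log"; then
        echo "infrastructure: Docker DNS"
    elif grep -qF 'container creation timeout' "$classify_log"; then
        echo "infrastructure: resource exhaustion"
    elif grep -qF 'no space left on device' "$classify_log"; then
        echo "infrastructure: disk full"
    else
        echo "no infrastructure signature: treat as code regression"
    fi
}

# Usage:
#   go run ./cmd/hi run "TestName" 2>&1 | tee run.log
#   classify_failure run.log
```

OOM kills and a slow Docker daemon have no literal signature in the table, so the sketch does not try to detect them.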

Common failure patterns (code bugs)

Route advertisement timing

The test asserts route state before the client has finished propagating its Hostinfo update. Symptom: nodes[0].GetAvailableRoutes() is empty when the test expects a route.

Wrong fix: time.Sleep(5 * time.Second) — fragile and slow.
Right fix: wrap the assertion in EventuallyWithT. See ../../integration/README.md.

NodeStore sync issues

Route changes not reflected in the NodeStore snapshot. Symptom: route advertisements in logs but no tracking updates in subsequent reads.

The sync point is State.UpdateNodeFromMapRequest() in hscontrol/state/state.go. If you added a new kind of client state update, make sure it lands here.

HA failover: routes disappearing on disconnect

TestHASubnetRouterFailover fails because approved routes vanish when a subnet router goes offline. This is a bug, not expected behaviour. Route approval must not be coupled to client connectivity — routes stay approved; only the primary-route selection is affected by connectivity.

Policy evaluation race

Symptom: tests that change policy and immediately assert peer visibility fail intermittently. Policy changes trigger async recomputation.

See recent fixes in git log -- hscontrol/state/ for examples (e.g. the PolicyChange trigger on every Connect/Disconnect).

SQLite vs PostgreSQL timing differences

Some race conditions only surface on one backend. If a test is flaky, try the other backend with --postgres:

go run ./cmd/hi run "TestName" --postgres --verbose

PostgreSQL generally has more consistent timing; SQLite can expose races during rapid writes.
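
To compare backends for a suspect test, the two runs can be scripted back to back. A minimal sketch: `run_both_backends` and the `HI` variable are hypothetical conveniences introduced here, and the exit-status handling assumes hi exits non-zero on failure:

```bash
# HI is a hypothetical override for the hi command line, used unquoted
# on purpose so that "go run ./cmd/hi" splits into separate words.
: "${HI:=go run ./cmd/hi}"

# run_both_backends TEST: run TEST on SQLite (default), then on
# PostgreSQL. Returns 1 if either run failed.
run_both_backends() {
    both_test=$1
    both_rc=0
    $HI run "$both_test" || both_rc=1
    $HI run "$both_test" --postgres || both_rc=1
    return "$both_rc"
}

# Usage:
#   run_both_backends "TestName"
```

Running sequentially rather than in parallel keeps the two runs from competing for resources, so timing differences are more likely to come from the backend. Each run still gets its own control_logs/{runID}/ directory.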

Keeping containers for inspection

If you need to inspect a failed test's state manually:

go run ./cmd/hi run "TestName" --keep-on-failure
# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>
# clean up manually when done
go run ./cmd/hi clean all   # only when no other tests are running
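
When other tests are still running, the kept containers can be removed by label instead, which matches the concurrency safety rules above. A sketch, assuming the preserved containers still carry the hi.run-id label; `remove_run_containers` is a hypothetical helper and it removes containers only, not networks or volumes:

```bash
# remove_run_containers RUN_ID: force-remove every container (running or
# stopped) labelled hi.run-id=RUN_ID. Hypothetical helper, not part of hi.
remove_run_containers() {
    rm_run_id=$1
    rm_ids=$(docker ps -aq --filter "label=hi.run-id=$rm_run_id")
    if [ -z "$rm_ids" ]; then
        echo "no containers for run $rm_run_id"
        return 0
    fi
    # Unquoted on purpose: each container ID must be a separate argument.
    docker rm -f $rm_ids
}

# Usage:
#   remove_run_containers 20260409-104215-mdjtzx
```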
