docs: expand cmd/hi and integration READMEs

Move integration-test runbook and authoring guide into the component
READMEs so the content sits next to the code it describes.
Kristoffer Dalby
2026-04-09 15:39:15 +00:00
parent 742878d172
commit 70b622fc68
2 changed files with 584 additions and 17 deletions


@@ -1,6 +1,262 @@
# hi — Headscale Integration test runner
`hi` wraps Docker container orchestration around the tests in
[`../../integration`](../../integration) and extracts debugging artefacts
(logs, database snapshots, MapResponse protocol captures) for post-mortem
analysis.
**Read this file in full before running any `hi` command.** The test
runner has sharp edges — wrong flags produce stale containers, lost
artefacts, or hung CI.
For test-authoring patterns (scenario setup, `EventuallyWithT`,
`IntegrationSkip`, helper variants), read
[`../../integration/README.md`](../../integration/README.md).
## Quick Start
```bash
# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor
# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"
# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres
# Pattern matching
go run ./cmd/hi run "TestSubnet*"
```
Run `doctor` before the first `run` in any new environment. Tests
generate ~100 MB of logs per run in `control_logs/`; `doctor` verifies
there is enough space and that the required Docker images are available.
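To see how much space accumulated runs already consume (assuming the default `control_logs` location), a standard `du` one-liner works:
```bash
# biggest run directories last; delete old ones to reclaim space
du -sh control_logs/*/ 2>/dev/null | sort -h | tail -n 5
```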
## Commands
| Command | Purpose |
| ------------------ | ---------------------------------------------------- |
| `run [pattern]` | Execute the test(s) matching `pattern` |
| `doctor` | Verify system requirements |
| `clean networks` | Prune unused Docker networks |
| `clean images` | Clean old test images |
| `clean containers` | Kill **all** test containers (dangerous — see below) |
| `clean cache` | Clean Go module cache volume |
| `clean all` | Run all cleanup operations |
## Flags
Defaults are tuned for single-test development runs. Review before
changing.

| Flag | Default | Purpose |
| ------------------- | -------------- | --------------------------------------------------------------------------- |
| `--timeout` | `120m` | Total test timeout. Use the built-in flag — never wrap with bash `timeout`. |
| `--postgres` | `false` | Use PostgreSQL instead of SQLite |
| `--failfast` | `true` | Stop on first test failure |
| `--go-version` | auto | Detected from `go.mod` (currently 1.26.1) |
| `--clean-before` | `true` | Clean stale (stopped/exited) containers before starting |
| `--clean-after` | `true` | Clean this run's containers after completion |
| `--keep-on-failure` | `false` | Preserve containers for manual inspection on failure |
| `--logs-dir` | `control_logs` | Where to save run artefacts |
| `--verbose` | `false` | Verbose output |
| `--stats` | `false` | Collect container resource-usage stats |
| `--hs-memory-limit` | `0` | Fail if any headscale container exceeds N MB (0 = disabled) |
| `--ts-memory-limit` | `0` | Fail if any tailscale container exceeds N MB |
### Timeout guidance
The default `120m` is generous for a single test. If you must tune it,
these are realistic floors by category:

| Test type | Minimum | Examples |
| ------------------------- | ----------- | ------------------------------------- |
| Basic functionality / CLI | 900s (15m) | `TestPingAllByIP`, `TestCLI*` |
| Route / ACL | 1200s (20m) | `TestSubnet*`, `TestACL*` |
| HA / failover | 1800s (30m) | `TestHASubnetRouter*` |
| Long-running | 2100s (35m) | `TestNodeOnlineStatus` (~12 min body) |
| Full suite | 45m | `go test ./integration -timeout 45m` |
**Never** use the shell `timeout` command around `hi`. It kills the
process mid-cleanup and leaves stale containers:
```bash
timeout 300 go run ./cmd/hi run "TestName" # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s # correct
```
## Concurrent Execution
Multiple `hi run` invocations can run simultaneously on the same Docker
daemon. Each invocation gets a unique **Run ID** (format
`YYYYMMDD-HHMMSS-6charhash`, e.g. `20260409-104215-mdjtzx`).
- **Container names** include the short run ID: `ts-mdjtzx-1-74-fgdyls`
- **Docker labels**: `hi.run-id={runID}` on every container
- **Port allocation**: dynamic — kernel assigns free ports, no conflicts
- **Cleanup isolation**: each run cleans only its own containers
- **Log directories**: `control_logs/{runID}/`
```bash
# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
```
### Safety rules for concurrent runs
- ✅ Your run cleans only containers labelled with its own `hi.run-id`
- ✅ `--clean-before` removes only stopped/exited containers
- ❌ **Never** run `docker rm -f $(docker ps -q --filter name=hs-)`;
  this destroys other agents' live test sessions
- ❌ **Never** run `docker system prune -f` while any tests are running
- ❌ **Never** run `hi clean containers` / `hi clean all` while other
  tests are running — both kill all test containers on the daemon
To identify your own containers:
```bash
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
```
The run ID appears at the top of the `hi run` output — copy it from
there rather than trying to reconstruct it.
## Artefacts
Every run saves debugging artefacts under `control_logs/{runID}/`:
```
control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log # headscale server errors
├── hs-<test>-<hash>.stdout.log # headscale server output
├── hs-<test>-<hash>.db # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/ # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log # tailscale client errors
├── ts-<client>-<hash>.stdout.log # tailscale client output
└── ts-<client>-<hash>_status.json # client network-status dump
```
Artefacts persist after cleanup. Old runs accumulate fast — delete
unwanted directories to reclaim disk.
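One way to do that (a housekeeping sketch, assuming the default `control_logs` location; adjust the `-mtime` age cutoff to taste):
```bash
# remove run directories not touched in the last 7 days
find control_logs -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
```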
## Debugging workflow
When a test fails, read the artefacts **in this order**:
1. **`hs-*.stderr.log`** — headscale server errors, panics, policy
   evaluation failures. Most issues originate server-side.

   ```bash
   grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
   ```

2. **`ts-*.stderr.log`** — authentication failures, connectivity issues,
   DNS resolution problems on the client side.

3. **MapResponse JSON** in `hs-*-mapresponses/` — protocol-level
   debugging for network map generation, peer visibility, route
   distribution, policy evaluation results.

   ```bash
   ls control_logs/*/hs-*-mapresponses/
   jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
     control_logs/*/hs-*-mapresponses/001.json
   ```

4. **`*_status.json`** — client peer-connectivity state.

5. **`hs-*.db`** — SQLite snapshot for post-mortem consistency checks.

   ```bash
   sqlite3 control_logs/<runID>/hs-*.db
   sqlite> .tables
   sqlite> .schema nodes
   sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';
   ```

6. **`*_metrics.txt`** — Prometheus dumps for latency, NodeStore
   operation timing, database query performance, memory usage.
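The metrics dump uses the standard Prometheus text exposition format, so ordinary text tools apply; for instance, stripping the `# HELP`/`# TYPE` comment lines leaves just the samples:
```bash
grep -v '^#' control_logs/<runID>/hs-*_metrics.txt | sort
```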
## Heuristic: infrastructure vs code
**Before blaming Docker, disk, or network: read `hs-*.stderr.log` in
full.** In practice, well over 99% of failures are code bugs (policy
evaluation, NodeStore sync, route approval) rather than infrastructure.
Actual infrastructure failures have signature error messages:

| Signature | Cause | Fix |
| --------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------- |
| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS | Reset Docker networking |
| `container creation timeout`, no progress >2 min | Resource exhaustion | `docker system prune -f` (when no other tests running), retry |
| OOM kills, slow Docker daemon | Too many concurrent tests | Reduce concurrency, wait for completion |
| `no space left on device` | Disk full | Delete old `control_logs/` |
If you don't see a signature error, **assume it's a code regression** —
do not retry hoping the flake goes away.
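The signature table above can be turned into a quick triage check (an illustrative one-liner, not an `hi` feature); if it prints nothing, treat the failure as a code regression:
```bash
grep -hE 'no DNS fallback candidates remain|no space left on device|container creation timeout' \
  control_logs/*/hs-*.stderr.log
```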
## Common failure patterns (code bugs)
### Route advertisement timing
Test asserts route state before the client has finished propagating its
Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when
the test expects a route.
- **Wrong fix**: `time.Sleep(5 * time.Second)` — fragile and slow.
- **Right fix**: wrap the assertion in `EventuallyWithT`. See
[`../../integration/README.md`](../../integration/README.md).
### NodeStore sync issues
Route changes not reflected in the NodeStore snapshot. Symptom: route
advertisements in logs but no tracking updates in subsequent reads.
The sync point is `State.UpdateNodeFromMapRequest()` in
`hscontrol/state/state.go`. If you added a new kind of client state
update, make sure it lands here.
### HA failover: routes disappearing on disconnect
`TestHASubnetRouterFailover` fails because approved routes vanish when
a subnet router goes offline. **This is a bug, not expected behaviour.**
Route approval must not be coupled to client connectivity — routes
stay approved; only the primary-route selection is affected by
connectivity.
### Policy evaluation race
Symptom: tests that change policy and immediately assert peer visibility
fail intermittently. Policy changes trigger async recomputation.
- See recent fixes in `git log -- hscontrol/state/` for examples (e.g.
the `PolicyChange` trigger on every Connect/Disconnect).
### SQLite vs PostgreSQL timing differences
Some race conditions only surface on one backend. If a test is flaky,
try the other backend with `--postgres`:
```bash
go run ./cmd/hi run "TestName" --postgres --verbose
```
PostgreSQL generally has more consistent timing; SQLite can expose
races during rapid writes.
## Keeping containers for inspection
If you need to inspect a failed test's state manually:
```bash
go run ./cmd/hi run "TestName" --keep-on-failure
# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>
# clean up manually when done
go run ./cmd/hi clean all # only when no other tests are running
```


@@ -1,25 +1,336 @@
# Integration testing
Headscale's integration tests start a real Headscale server and run
scenarios against real Tailscale clients across supported versions, all
inside Docker. They are the safety net that keeps us honest about
Tailscale protocol compatibility.
This file documents **how to write** integration tests. For **how to
run** them, see [`../cmd/hi/README.md`](../cmd/hi/README.md).
Tests live in files ending with `_test.go`; the framework lives in the
rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the
`hsic/`, `tsic/`, `dockertestutil/` packages).
## Running integration tests locally
For local runs, use [`cmd/hi`](../cmd/hi):
```bash
go run ./cmd/hi doctor
go run ./cmd/hi run "TestPingAllByIP"
```
Alternatively, [`act`](https://github.com/nektos/act) runs the GitHub
Actions workflow locally:
```bash
act pull_request -W .github/workflows/test-integration.yaml
```
The `docker run` command in each GitHub workflow file can also be used directly.
Each test runs as a separate workflow on GitHub Actions. To add a new
test, run `go generate` inside `../cmd/gh-action-integration-generator/`
and commit the generated workflow file.
## Framework overview
The integration framework has four layers:
- **`scenario.go`** — `Scenario` orchestrates a test environment: a
Headscale server, one or more users, and a collection of Tailscale
clients. `NewScenario(spec)` returns a ready-to-use environment.
- **`hsic/`** — "Headscale Integration Container": wraps a Headscale
server in Docker. Options for config, DB backend, DERP, OIDC, etc.
- **`tsic/`** — "Tailscale Integration Container": wraps a single
Tailscale client. Options for version, hostname, auth method, etc.
- **`dockertestutil/`** — low-level Docker helpers (networks, container
lifecycle, `IsRunningInContainer()` detection).
Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv`
rather than calling Docker directly.
## Required scaffolding
### `IntegrationSkip(t)`
**Every** integration test function must call `IntegrationSkip(t)` as
its first statement. Without it, the test runs in the wrong environment
and fails with confusing errors.
```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	// ... rest of the test
}
```
`IntegrationSkip` is defined in `integration/scenario_test.go:15` and:
- skips the test when not running inside the Docker test container
(`dockertestutil.IsRunningInContainer()`),
- skips when `-short` is passed to `go test`.
### Scenario setup
The canonical setup creates users, clients, and the Headscale server in
one shot:
```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"alice", "bob"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv(
		[]tsic.Option{tsic.WithSSH()},
		hsic.WithTestName("myscenario"),
	)
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)

	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// ... assertions
}
```
Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the
full option set (DERP, OIDC, policy files, DB backend, ACL grants,
exit-node config, etc.).
## The `EventuallyWithT` pattern
Integration tests operate on a distributed system with real async
propagation: clients advertise state, the server processes it, updates
stream to peers. Direct assertions after state changes fail
intermittently. Wrap external calls in `assert.EventuallyWithT`:
```go
assert.EventuallyWithT(t, func(c *assert.CollectT) {
	status, err := client.Status()
	assert.NoError(c, err)

	for _, peerKey := range status.Peers() {
		peerStatus := status.Peer[peerKey]
		requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes)
	}
}, 10*time.Second, 500*time.Millisecond, "client should see expected routes")
```
### External calls that need wrapping
These read distributed state and may reflect stale data until
propagation completes:
- `headscale.ListNodes()`
- `client.Status()`
- `client.Curl()`
- `client.Traceroute()`
- `client.Execute()` when the command reads state
### Blocking operations that must NOT be wrapped
State-mutating commands run exactly once and either succeed or fail
immediately — not eventually. Wrapping them in `EventuallyWithT` hides
real failures behind retry.
Use `client.MustStatus()` when you only need an ID for a blocking call:
```go
// CORRECT — mutation runs once
for _, client := range allClients {
	status := client.MustStatus()
	_, _, err := client.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=" + expectedRoutes[string(status.Self.ID)],
	})
	require.NoErrorf(t, err, "failed to advertise route: %s", err)
}
```
Typical blocking operations: any `tailscale set` (routes, exit node,
accept-routes, ssh), node registration via the CLI, user creation via
gRPC.
### The four rules
1. **One external call per `EventuallyWithT` block.** Related assertions
   on the result of a single call go together in the same block.

   **Loop exception**: iterating over a collection of clients (or peers)
   and calling `Status()` on each inside a single block is allowed — it
   is the same logical "check all clients" operation. The rule applies
   to distinct calls like `ListNodes()` + `Status()`, which must be
   split into separate blocks.

2. **Never nest `EventuallyWithT` calls.** A nested retry loop
   multiplies timing windows and makes failures impossible to diagnose.

3. **Use `*WithCollect` helper variants** inside the block. Regular
   helpers use `require` and abort on the first failed assertion,
   preventing retry.

4. **Always provide a descriptive final message** — it appears on
   failure and is your only clue about what the test was waiting for.
### Variable scoping
Variables used across multiple `EventuallyWithT` blocks must be declared
at function scope. Inside the block, assign with `=`, not `:=`; a `:=`
creates a shadow invisible to the outer scope:
```go
var nodes []*v1.Node
var err error

assert.EventuallyWithT(t, func(c *assert.CollectT) {
	nodes, err = headscale.ListNodes() // = not :=
	assert.NoError(c, err)
	assert.Len(c, nodes, 2)
	requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2)
}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes")

// nodes is usable here because it was declared at function scope
```
### Helper functions
Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so
assertion failures restart the wait loop instead of failing the test
immediately:
- `requirePeerSubnetRoutesWithCollect(c, status, expected)`
`integration/route_test.go:2941`
- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)`
`integration/route_test.go:2958`
- `assertTracerouteViaIPWithCollect(c, traceroute, ip)`
`integration/route_test.go:2898`
When you write a new helper to be called inside `EventuallyWithT`, it
must accept `*assert.CollectT` as its first parameter, not `*testing.T`.
## Identifying nodes by property, not position
The order of `headscale.ListNodes()` is not stable. Tests that index
`nodes[0]` will break when node ordering changes. Look nodes up by ID,
hostname, or tag:
```go
// WRONG — relies on array position
require.Len(t, nodes[0].GetAvailableRoutes(), 1)

// CORRECT — find the node that should have the route
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
	nodeIDStr := fmt.Sprintf("%d", node.GetId())
	if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
		assert.Contains(t, node.GetAvailableRoutes(), route)
	}
}
```
## Full example: advertising and approving a route
```go
func TestRouteAdvertisementBasic(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"user1"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route"))
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)
	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// --- Blocking: advertise the route on one client ---
	router := allClients[0]
	_, _, err = router.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=10.33.0.0/16",
	})
	require.NoErrorf(t, err, "advertising route: %s", err)

	// --- Eventually: headscale should see the announced route ---
	var nodes []*v1.Node
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		nodes, err = headscale.ListNodes()
		assert.NoError(c, err)
		assert.Len(c, nodes, 2)
		for _, node := range nodes {
			if node.GetName() == router.Hostname() {
				requireNodeRouteCountWithCollect(c, node, 1, 0, 0)
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "route should be announced")

	// --- Blocking: approve the route via headscale CLI ---
	var routerNode *v1.Node
	for _, node := range nodes {
		if node.GetName() == router.Hostname() {
			routerNode = node
			break
		}
	}
	require.NotNil(t, routerNode)

	_, err = headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"})
	require.NoError(t, err)

	// --- Eventually: a peer should see the approved route ---
	peer := allClients[1]
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		status, err := peer.Status()
		assert.NoError(c, err)
		for _, peerKey := range status.Peers() {
			if peerKey == router.PublicKey() {
				requirePeerSubnetRoutesWithCollect(c,
					status.Peer[peerKey],
					[]netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")})
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "peer should see approved route")
}
```
## Common pitfalls
- **Forgetting `IntegrationSkip(t)`**: the test runs outside Docker and
fails in confusing ways. Always the first line.
- **Using `require` inside `EventuallyWithT`**: aborts after the first
iteration instead of retrying. Use `assert.*` + the `*WithCollect`
helpers.
- **Mixing mutation and query in one `EventuallyWithT`**: hides real
failures. Keep mutation outside, query inside.
- **Assuming node ordering**: look up by property.
- **Ignoring `err` from `client.Status()`**: retry only retries the
whole block; don't silently drop errors from mid-block calls.
- **Timeouts too tight**: 5s is reasonable for local state, 10s for
state that must propagate through the map poll cycle. Don't go lower
to "speed up the test" — you just make it flaky.
## Debugging failing tests
Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them
in this order: server stderr, client stderr, MapResponse JSON, database
snapshot. The full debugging workflow, heuristics, and failure patterns
are documented in [`../cmd/hi/README.md`](../cmd/hi/README.md).