mirror of https://github.com/juanfont/headscale.git (synced 2026-04-17 06:19:51 +02:00)

docs: expand cmd/hi and integration READMEs

Move integration-test runbook and authoring guide into the component READMEs so the content sits next to the code it describes.

cmd/hi/README.md:
# hi — Headscale Integration test runner

`hi` wraps Docker container orchestration around the tests in
[`../../integration`](../../integration) and extracts debugging artefacts
(logs, database snapshots, MapResponse protocol captures) for post-mortem
analysis.

**Read this file in full before running any `hi` command.** The test
runner has sharp edges — wrong flags produce stale containers, lost
artefacts, or hung CI.

For test-authoring patterns (scenario setup, `EventuallyWithT`,
`IntegrationSkip`, helper variants), read
[`../../integration/README.md`](../../integration/README.md).

## Quick Start

```bash
# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor

# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"

# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres

# Pattern matching
go run ./cmd/hi run "TestSubnet*"
```

Run `doctor` before the first `run` in any new environment. Tests
generate ~100 MB of logs per run in `control_logs/`; `doctor` verifies
there is enough space and that the required Docker images are available.

## Commands

| Command            | Purpose                                              |
| ------------------ | ---------------------------------------------------- |
| `run [pattern]`    | Execute the test(s) matching `pattern`               |
| `doctor`           | Verify system requirements                           |
| `clean networks`   | Prune unused Docker networks                         |
| `clean images`     | Clean old test images                                |
| `clean containers` | Kill **all** test containers (dangerous — see below) |
| `clean cache`      | Clean Go module cache volume                         |
| `clean all`        | Run all cleanup operations                           |
## Flags

Defaults are tuned for single-test development runs. Review before
changing.

| Flag                | Default        | Purpose                                                                      |
| ------------------- | -------------- | ---------------------------------------------------------------------------- |
| `--timeout`         | `120m`         | Total test timeout. Use the built-in flag — never wrap with bash `timeout`.  |
| `--postgres`        | `false`        | Use PostgreSQL instead of SQLite                                             |
| `--failfast`        | `true`         | Stop on first test failure                                                   |
| `--go-version`      | auto           | Detected from `go.mod` (currently 1.26.1)                                    |
| `--clean-before`    | `true`         | Clean stale (stopped/exited) containers before starting                      |
| `--clean-after`     | `true`         | Clean this run's containers after completion                                 |
| `--keep-on-failure` | `false`        | Preserve containers for manual inspection on failure                         |
| `--logs-dir`        | `control_logs` | Where to save run artefacts                                                  |
| `--verbose`         | `false`        | Verbose output                                                               |
| `--stats`           | `false`        | Collect container resource-usage stats                                       |
| `--hs-memory-limit` | `0`            | Fail if any headscale container exceeds N MB (0 = disabled)                  |
| `--ts-memory-limit` | `0`            | Fail if any tailscale container exceeds N MB                                 |

### Timeout guidance

The default `120m` is generous for a single test. If you must tune it,
these are realistic floors by category:

| Test type                 | Minimum     | Examples                              |
| ------------------------- | ----------- | ------------------------------------- |
| Basic functionality / CLI | 900s (15m)  | `TestPingAllByIP`, `TestCLI*`         |
| Route / ACL               | 1200s (20m) | `TestSubnet*`, `TestACL*`             |
| HA / failover             | 1800s (30m) | `TestHASubnetRouter*`                 |
| Long-running              | 2100s (35m) | `TestNodeOnlineStatus` (~12 min body) |
| Full suite                | 45m         | `go test ./integration -timeout 45m`  |

**Never** use the shell `timeout` command around `hi`. It kills the
process mid-cleanup and leaves stale containers:

```bash
timeout 300 go run ./cmd/hi run "TestName"     # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s  # correct
```
## Concurrent Execution

Multiple `hi run` invocations can run simultaneously on the same Docker
daemon. Each invocation gets a unique **Run ID** (format
`YYYYMMDD-HHMMSS-6charhash`, e.g. `20260409-104215-mdjtzx`).

- **Container names** include the short run ID: `ts-mdjtzx-1-74-fgdyls`
- **Docker labels**: `hi.run-id={runID}` on every container
- **Port allocation**: dynamic — kernel assigns free ports, no conflicts
- **Cleanup isolation**: each run cleans only its own containers
- **Log directories**: `control_logs/{runID}/`

```bash
# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
```

### Safety rules for concurrent runs

- ✅ Your run cleans only containers labelled with its own `hi.run-id`
- ✅ `--clean-before` removes only stopped/exited containers
- ❌ **Never** run `docker rm -f $(docker ps -q --filter name=hs-)` —
  this destroys other agents' live test sessions
- ❌ **Never** run `docker system prune -f` while any tests are running
- ❌ **Never** run `hi clean containers` / `hi clean all` while other
  tests are running — both kill all test containers on the daemon

To identify your own containers:

```bash
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
```

The run ID appears at the top of the `hi run` output — copy it from
there rather than trying to reconstruct it.
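
If the output has scrolled away, the artefact directory names double as
run IDs. A minimal sketch, assuming the default `--logs-dir`:

```bash
# Most recent run IDs, newest first
ls -1t control_logs 2>/dev/null | head -n 5
```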
## Artefacts

Every run saves debugging artefacts under `control_logs/{runID}/`:

```
control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log      # headscale server errors
├── hs-<test>-<hash>.stdout.log      # headscale server output
├── hs-<test>-<hash>.db              # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt     # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/   # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log    # tailscale client errors
├── ts-<client>-<hash>.stdout.log    # tailscale client output
└── ts-<client>-<hash>_status.json   # client network-status dump
```

Artefacts persist after cleanup. Old runs accumulate fast — delete
unwanted directories to reclaim disk.
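
One way to prune old runs, as a sketch; the 7-day retention window is an
arbitrary choice, not project policy:

```bash
# Delete artefact directories older than 7 days (GNU find)
find control_logs -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} + 2>/dev/null || true
```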
## Debugging workflow

When a test fails, read the artefacts **in this order**:

1. **`hs-*.stderr.log`** — headscale server errors, panics, policy
   evaluation failures. Most issues originate server-side.

   ```bash
   grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
   ```

2. **`ts-*.stderr.log`** — authentication failures, connectivity issues,
   DNS resolution problems on the client side.

3. **MapResponse JSON** in `hs-*-mapresponses/` — protocol-level
   debugging for network map generation, peer visibility, route
   distribution, policy evaluation results.

   ```bash
   ls control_logs/*/hs-*-mapresponses/
   jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
     control_logs/*/hs-*-mapresponses/001.json
   ```

4. **`*_status.json`** — client peer-connectivity state.

5. **`hs-*.db`** — SQLite snapshot for post-mortem consistency checks.

   ```bash
   sqlite3 control_logs/<runID>/hs-*.db
   sqlite> .tables
   sqlite> .schema nodes
   sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';
   ```

6. **`*_metrics.txt`** — Prometheus dumps for latency, NodeStore
   operation timing, database query performance, memory usage.
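
For a first pass over a metrics dump, plain `grep` goes a long way. The
name patterns below are illustrative guesses, not the actual metric names:

```bash
# Surface timing-related metrics from the dumps
grep -Ei "latency|duration|seconds" control_logs/*/hs-*_metrics.txt 2>/dev/null || true
```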
## Heuristic: infrastructure vs code

**Before blaming Docker, disk, or network: read `hs-*.stderr.log` in
full.** In practice, well over 99% of failures are code bugs (policy
evaluation, NodeStore sync, route approval) rather than infrastructure.

Actual infrastructure failures have signature error messages:

| Signature                                                       | Cause                     | Fix                                                           |
| --------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------- |
| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS                | Reset Docker networking                                       |
| `container creation timeout`, no progress >2 min                | Resource exhaustion       | `docker system prune -f` (when no other tests running), retry |
| OOM kills, slow Docker daemon                                   | Too many concurrent tests | Reduce concurrency, wait for completion                       |
| `no space left on device`                                       | Disk full                 | Delete old `control_logs/`                                    |

If you don't see a signature error, **assume it's a code regression** —
do not retry hoping the flake goes away.
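
Before filing a failure under "disk full", a quick sanity check (a sketch):

```bash
# Free space on the working filesystem, and the artefacts' footprint
df -h .
du -sh control_logs 2>/dev/null || true
```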
## Common failure patterns (code bugs)

### Route advertisement timing

Test asserts route state before the client has finished propagating its
Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when
the test expects a route.

- **Wrong fix**: `time.Sleep(5 * time.Second)` — fragile and slow.
- **Right fix**: wrap the assertion in `EventuallyWithT`. See
  [`../../integration/README.md`](../../integration/README.md).

### NodeStore sync issues

Route changes not reflected in the NodeStore snapshot. Symptom: route
advertisements in logs but no tracking updates in subsequent reads.

The sync point is `State.UpdateNodeFromMapRequest()` in
`hscontrol/state/state.go`. If you added a new kind of client state
update, make sure it lands here.

### HA failover: routes disappearing on disconnect

`TestHASubnetRouterFailover` fails because approved routes vanish when
a subnet router goes offline. **This is a bug, not expected behaviour.**
Route approval must not be coupled to client connectivity — routes
stay approved; only the primary-route selection is affected by
connectivity.

### Policy evaluation race

Symptom: tests that change policy and immediately assert peer visibility
fail intermittently. Policy changes trigger async recomputation.

- See recent fixes in `git log -- hscontrol/state/` for examples (e.g.
  the `PolicyChange` trigger on every Connect/Disconnect).

### SQLite vs PostgreSQL timing differences

Some race conditions only surface on one backend. If a test is flaky,
try the other backend with `--postgres`:

```bash
go run ./cmd/hi run "TestName" --postgres --verbose
```

PostgreSQL generally has more consistent timing; SQLite can expose
races during rapid writes.
## Keeping containers for inspection

If you need to inspect a failed test's state manually:

```bash
go run ./cmd/hi run "TestName" --keep-on-failure

# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>

# clean up manually when done
go run ./cmd/hi clean all   # only when no other tests are running
```

integration/README.md:
# Integration testing

Headscale's integration tests start a real Headscale server and run
scenarios against real Tailscale clients across supported versions, all
inside Docker. They are the safety net that keeps us honest about
Tailscale protocol compatibility.

This file documents **how to write** integration tests. For **how to
run** them, see [`../cmd/hi/README.md`](../cmd/hi/README.md).

Tests live in files ending with `_test.go`; the framework lives in the
rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the
`hsic/`, `tsic/`, `dockertestutil/` packages).

## Running tests

For local runs, use [`cmd/hi`](../cmd/hi):

```bash
go run ./cmd/hi doctor
go run ./cmd/hi run "TestPingAllByIP"
```

Alternatively, [`act`](https://github.com/nektos/act) runs the GitHub
Actions workflow locally:

```bash
act pull_request -W .github/workflows/test-integration.yaml
```

Each test runs as a separate workflow on GitHub Actions. To add a new
test, run `go generate` inside `../cmd/gh-action-integration-generator/`
and commit the generated workflow file.
## Framework overview

The integration framework has four layers:

- **`scenario.go`** — `Scenario` orchestrates a test environment: a
  Headscale server, one or more users, and a collection of Tailscale
  clients. `NewScenario(spec)` returns a ready-to-use environment.
- **`hsic/`** — "Headscale Integration Container": wraps a Headscale
  server in Docker. Options for config, DB backend, DERP, OIDC, etc.
- **`tsic/`** — "Tailscale Integration Container": wraps a single
  Tailscale client. Options for version, hostname, auth method, etc.
- **`dockertestutil/`** — low-level Docker helpers (networks, container
  lifecycle, `IsRunningInContainer()` detection).

Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv`
rather than calling Docker directly.
## Required scaffolding

### `IntegrationSkip(t)`

**Every** integration test function must call `IntegrationSkip(t)` as
its first statement. Without it, the test runs in the wrong environment
and fails with confusing errors.

```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	// ... rest of the test
}
```

`IntegrationSkip` is defined in `integration/scenario_test.go:15` and:

- skips the test when not running inside the Docker test container
  (`dockertestutil.IsRunningInContainer()`),
- skips when `-short` is passed to `go test`.
### Scenario setup

The canonical setup creates users, clients, and the Headscale server in
one shot:

```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"alice", "bob"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv(
		[]tsic.Option{tsic.WithSSH()},
		hsic.WithTestName("myscenario"),
	)
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)

	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// ... assertions
}
```

Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the
full option set (DERP, OIDC, policy files, DB backend, ACL grants,
exit-node config, etc.).
## The `EventuallyWithT` pattern

Integration tests operate on a distributed system with real async
propagation: clients advertise state, the server processes it, updates
stream to peers. Direct assertions after state changes fail
intermittently. Wrap external calls in `assert.EventuallyWithT`:

```go
assert.EventuallyWithT(t, func(c *assert.CollectT) {
	status, err := client.Status()
	assert.NoError(c, err)
	for _, peerKey := range status.Peers() {
		peerStatus := status.Peer[peerKey]
		requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes)
	}
}, 10*time.Second, 500*time.Millisecond, "client should see expected routes")
```
### External calls that need wrapping

These read distributed state and may reflect stale data until
propagation completes:

- `headscale.ListNodes()`
- `client.Status()`
- `client.Curl()`
- `client.Traceroute()`
- `client.Execute()` when the command reads state
### Blocking operations that must NOT be wrapped

State-mutating commands run exactly once and either succeed or fail
immediately — not eventually. Wrapping them in `EventuallyWithT` hides
real failures behind retry.

Use `client.MustStatus()` when you only need an ID for a blocking call:

```go
// CORRECT — mutation runs once
for _, client := range allClients {
	status := client.MustStatus()
	_, _, err := client.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=" + expectedRoutes[string(status.Self.ID)],
	})
	require.NoErrorf(t, err, "failed to advertise route: %s", err)
}
```

Typical blocking operations: any `tailscale set` (routes, exit node,
accept-routes, ssh), node registration via the CLI, user creation via
gRPC.
### The four rules

1. **One external call per `EventuallyWithT` block.** Related assertions
   on the result of a single call go together in the same block.

   **Loop exception**: iterating over a collection of clients (or peers)
   and calling `Status()` on each inside a single block is allowed — it
   is the same logical "check all clients" operation. The rule applies
   to distinct calls like `ListNodes()` + `Status()`, which must be
   split into separate blocks.

2. **Never nest `EventuallyWithT` calls.** A nested retry loop
   multiplies timing windows and makes failures impossible to diagnose.

3. **Use `*WithCollect` helper variants** inside the block. Regular
   helpers use `require` and abort on the first failed assertion,
   preventing retry.

4. **Always provide a descriptive final message** — it appears on
   failure and is your only clue about what the test was waiting for.
### Variable scoping

Variables used across multiple `EventuallyWithT` blocks must be declared
at function scope. Inside the block, assign with `=`, not `:=` — `:=`
creates a shadow invisible to the outer scope:

```go
var nodes []*v1.Node
var err error
assert.EventuallyWithT(t, func(c *assert.CollectT) {
	nodes, err = headscale.ListNodes() // = not :=
	assert.NoError(c, err)
	assert.Len(c, nodes, 2)
	requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2)
}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes")

// nodes is usable here because it was declared at function scope
```
### Helper functions

Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so
assertion failures restart the wait loop instead of failing the test
immediately:

- `requirePeerSubnetRoutesWithCollect(c, status, expected)` —
  `integration/route_test.go:2941`
- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)` —
  `integration/route_test.go:2958`
- `assertTracerouteViaIPWithCollect(c, traceroute, ip)` —
  `integration/route_test.go:2898`

When you write a new helper to be called inside `EventuallyWithT`, it
must accept `*assert.CollectT` as its first parameter, not `*testing.T`.
## Identifying nodes by property, not position

The order of `headscale.ListNodes()` is not stable. Tests that index
`nodes[0]` will break when node ordering changes. Look nodes up by ID,
hostname, or tag:

```go
// WRONG — relies on array position
require.Len(t, nodes[0].GetAvailableRoutes(), 1)

// CORRECT — find the node that should have the route
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
	nodeIDStr := fmt.Sprintf("%d", node.GetId())
	if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
		assert.Contains(t, node.GetAvailableRoutes(), route)
	}
}
```
## Full example: advertising and approving a route

```go
func TestRouteAdvertisementBasic(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"user1"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route"))
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)

	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// --- Blocking: advertise the route on one client ---
	router := allClients[0]
	_, _, err = router.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=10.33.0.0/16",
	})
	require.NoErrorf(t, err, "advertising route: %s", err)

	// --- Eventually: headscale should see the announced route ---
	var nodes []*v1.Node
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		nodes, err = headscale.ListNodes()
		assert.NoError(c, err)
		assert.Len(c, nodes, 2)

		for _, node := range nodes {
			if node.GetName() == router.Hostname() {
				requireNodeRouteCountWithCollect(c, node, 1, 0, 0)
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "route should be announced")

	// --- Blocking: approve the route via headscale CLI ---
	var routerNode *v1.Node
	for _, node := range nodes {
		if node.GetName() == router.Hostname() {
			routerNode = node
			break
		}
	}
	require.NotNil(t, routerNode)

	_, err = headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"})
	require.NoError(t, err)

	// --- Eventually: a peer should see the approved route ---
	peer := allClients[1]
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		status, err := peer.Status()
		assert.NoError(c, err)
		for _, peerKey := range status.Peers() {
			if peerKey == router.PublicKey() {
				requirePeerSubnetRoutesWithCollect(c,
					status.Peer[peerKey],
					[]netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")})
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "peer should see approved route")
}
```
## Common pitfalls

- **Forgetting `IntegrationSkip(t)`**: the test runs outside Docker and
  fails in confusing ways. Always the first line.
- **Using `require` inside `EventuallyWithT`**: aborts after the first
  iteration instead of retrying. Use `assert.*` + the `*WithCollect`
  helpers.
- **Mixing mutation and query in one `EventuallyWithT`**: hides real
  failures. Keep mutation outside, query inside.
- **Assuming node ordering**: look up by property.
- **Ignoring `err` from `client.Status()`**: retry only retries the
  whole block; don't silently drop errors from mid-block calls.
- **Timeouts too tight**: 5s is reasonable for local state, 10s for
  state that must propagate through the map poll cycle. Don't go lower
  to "speed up the test" — you just make it flaky.
## Debugging failing tests

Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them
in this order: server stderr, client stderr, MapResponse JSON, database
snapshot. The full debugging workflow, heuristics, and failure patterns
are documented in [`../cmd/hi/README.md`](../cmd/hi/README.md).