docs: expand cmd/hi and integration READMEs

Move integration-test runbook and authoring guide into the component
READMEs so the content sits next to the code it describes.
This commit is contained in:
Kristoffer Dalby
2026-04-09 15:39:15 +00:00
parent 742878d172
commit 70b622fc68
2 changed files with 584 additions and 17 deletions

# Integration testing
Headscale's integration tests start a real Headscale server and run
scenarios against real Tailscale clients across supported versions, all
inside Docker. They are the safety net that keeps us honest about
Tailscale protocol compatibility.
This file documents **how to write** integration tests. For **how to
run** them, see [`../cmd/hi/README.md`](../cmd/hi/README.md).
Tests live in files ending with `_test.go`; the framework lives in the
rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the
`hsic/`, `tsic/`, `dockertestutil/` packages).
## Running integration tests locally
For local runs, use [`cmd/hi`](../cmd/hi):
```bash
go run ./cmd/hi doctor
go run ./cmd/hi run "TestPingAllByIP"
```
Alternatively, [`act`](https://github.com/nektos/act) runs the GitHub
Actions workflow locally:
```bash
act pull_request -W .github/workflows/test-integration.yaml
```
Each test runs as a separate workflow on GitHub Actions. To add a new
test, run `go generate` inside `../cmd/gh-action-integration-generator/`
and commit the generated workflow file.
## Framework overview
The integration framework has four layers:
- **`scenario.go`** — `Scenario` orchestrates a test environment: a
Headscale server, one or more users, and a collection of Tailscale
clients. `NewScenario(spec)` returns a ready-to-use environment.
- **`hsic/`** — "Headscale Integration Container": wraps a Headscale
server in Docker. Options for config, DB backend, DERP, OIDC, etc.
- **`tsic/`** — "Tailscale Integration Container": wraps a single
Tailscale client. Options for version, hostname, auth method, etc.
- **`dockertestutil/`** — low-level Docker helpers (networks, container
lifecycle, `IsRunningInContainer()` detection).
Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv`
rather than calling Docker directly.
## Required scaffolding
### `IntegrationSkip(t)`
**Every** integration test function must call `IntegrationSkip(t)` as
its first statement. Without it, the test runs in the wrong environment
and fails with confusing errors.
```go
func TestMyScenario(t *testing.T) {
    IntegrationSkip(t)
    // ... rest of the test
}
```
`IntegrationSkip` is defined in `integration/scenario_test.go:15` and:
- skips the test when not running inside the Docker test container
(`dockertestutil.IsRunningInContainer()`),
- skips when `-short` is passed to `go test`.
### Scenario setup
The canonical setup creates users, clients, and the Headscale server in
one shot:
```go
func TestMyScenario(t *testing.T) {
    IntegrationSkip(t)
    t.Parallel()

    spec := ScenarioSpec{
        NodesPerUser: 2,
        Users:        []string{"alice", "bob"},
    }

    scenario, err := NewScenario(spec)
    require.NoError(t, err)
    defer scenario.ShutdownAssertNoPanics(t)

    err = scenario.CreateHeadscaleEnv(
        []tsic.Option{tsic.WithSSH()},
        hsic.WithTestName("myscenario"),
    )
    require.NoError(t, err)

    allClients, err := scenario.ListTailscaleClients()
    require.NoError(t, err)

    headscale, err := scenario.Headscale()
    require.NoError(t, err)

    // ... assertions
}
```
Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the
full option set (DERP, OIDC, policy files, DB backend, ACL grants,
exit-node config, etc.).
## The `EventuallyWithT` pattern
Integration tests operate on a distributed system with real async
propagation: clients advertise state, the server processes it, updates
stream to peers. Direct assertions after state changes fail
intermittently. Wrap external calls in `assert.EventuallyWithT`:
```go
assert.EventuallyWithT(t, func(c *assert.CollectT) {
    status, err := client.Status()
    assert.NoError(c, err)

    for _, peerKey := range status.Peers() {
        peerStatus := status.Peer[peerKey]
        requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes)
    }
}, 10*time.Second, 500*time.Millisecond, "client should see expected routes")
```
### External calls that need wrapping
These read distributed state and may reflect stale data until
propagation completes:
- `headscale.ListNodes()`
- `client.Status()`
- `client.Curl()`
- `client.Traceroute()`
- `client.Execute()` when the command reads state
### Blocking operations that must NOT be wrapped
State-mutating commands run exactly once and either succeed or fail
immediately — not eventually. Wrapping them in `EventuallyWithT` hides
real failures behind retry.
Use `client.MustStatus()` when you only need an ID for a blocking call:
```go
// CORRECT — mutation runs once
for _, client := range allClients {
    status := client.MustStatus()
    _, _, err := client.Execute([]string{
        "tailscale", "set",
        "--advertise-routes=" + expectedRoutes[string(status.Self.ID)],
    })
    require.NoErrorf(t, err, "failed to advertise route: %s", err)
}
```
Typical blocking operations: any `tailscale set` (routes, exit node,
accept-routes, ssh), node registration via the CLI, user creation via
gRPC.
### The four rules
1. **One external call per `EventuallyWithT` block.** Related assertions
on the result of a single call go together in the same block.
**Loop exception**: iterating over a collection of clients (or peers)
and calling `Status()` on each inside a single block is allowed — it
is the same logical "check all clients" operation. The rule applies
to distinct calls like `ListNodes()` + `Status()`, which must be
split into separate blocks.
2. **Never nest `EventuallyWithT` calls.** A nested retry loop
multiplies timing windows and makes failures impossible to diagnose.
3. **Use `*WithCollect` helper variants** inside the block. Regular
helpers use `require` and abort on the first failed assertion,
preventing retry.
4. **Always provide a descriptive final message** — it appears on
failure and is your only clue about what the test was waiting for.
### Variable scoping
Variables used across multiple `EventuallyWithT` blocks must be declared
at function scope. Inside the block, assign with `=`, not `:=`, because
`:=` declares a fresh shadow variable invisible to the outer scope:
```go
var nodes []*v1.Node
var err error

assert.EventuallyWithT(t, func(c *assert.CollectT) {
    nodes, err = headscale.ListNodes() // = not :=
    assert.NoError(c, err)
    assert.Len(c, nodes, 2)
    requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2)
}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes")

// nodes is usable here because it was declared at function scope
```
### Helper functions
Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so
assertion failures restart the wait loop instead of failing the test
immediately:
- `requirePeerSubnetRoutesWithCollect(c, status, expected)`
`integration/route_test.go:2941`
- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)`
`integration/route_test.go:2958`
- `assertTracerouteViaIPWithCollect(c, traceroute, ip)`
`integration/route_test.go:2898`
When you write a new helper to be called inside `EventuallyWithT`, it
must accept `*assert.CollectT` as its first parameter, not `*testing.T`.
## Identifying nodes by property, not position
The order of `headscale.ListNodes()` is not stable. Tests that index
`nodes[0]` will break when node ordering changes. Look nodes up by ID,
hostname, or tag:
```go
// WRONG — relies on array position
require.Len(t, nodes[0].GetAvailableRoutes(), 1)

// CORRECT — find the node that should have the route
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
    nodeIDStr := fmt.Sprintf("%d", node.GetId())
    if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
        assert.Contains(t, node.GetAvailableRoutes(), route)
    }
}
```
## Full example: advertising and approving a route
```go
func TestRouteAdvertisementBasic(t *testing.T) {
    IntegrationSkip(t)
    t.Parallel()

    spec := ScenarioSpec{
        NodesPerUser: 2,
        Users:        []string{"user1"},
    }

    scenario, err := NewScenario(spec)
    require.NoError(t, err)
    defer scenario.ShutdownAssertNoPanics(t)

    err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route"))
    require.NoError(t, err)

    allClients, err := scenario.ListTailscaleClients()
    require.NoError(t, err)

    headscale, err := scenario.Headscale()
    require.NoError(t, err)

    // --- Blocking: advertise the route on one client ---
    router := allClients[0]
    _, _, err = router.Execute([]string{
        "tailscale", "set",
        "--advertise-routes=10.33.0.0/16",
    })
    require.NoErrorf(t, err, "advertising route: %s", err)

    // --- Eventually: headscale should see the announced route ---
    var nodes []*v1.Node
    assert.EventuallyWithT(t, func(c *assert.CollectT) {
        nodes, err = headscale.ListNodes()
        assert.NoError(c, err)
        assert.Len(c, nodes, 2)
        for _, node := range nodes {
            if node.GetName() == router.Hostname() {
                requireNodeRouteCountWithCollect(c, node, 1, 0, 0)
            }
        }
    }, 10*time.Second, 500*time.Millisecond, "route should be announced")

    // --- Blocking: approve the route via headscale CLI ---
    var routerNode *v1.Node
    for _, node := range nodes {
        if node.GetName() == router.Hostname() {
            routerNode = node
            break
        }
    }
    require.NotNil(t, routerNode)

    _, err = headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"})
    require.NoError(t, err)

    // --- Eventually: a peer should see the approved route ---
    peer := allClients[1]
    assert.EventuallyWithT(t, func(c *assert.CollectT) {
        status, err := peer.Status()
        assert.NoError(c, err)
        for _, peerKey := range status.Peers() {
            if peerKey == router.PublicKey() {
                requirePeerSubnetRoutesWithCollect(c,
                    status.Peer[peerKey],
                    []netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")})
            }
        }
    }, 10*time.Second, 500*time.Millisecond, "peer should see approved route")
}
```
## Common pitfalls
- **Forgetting `IntegrationSkip(t)`**: the test runs outside Docker and
fails in confusing ways. Always the first line.
- **Using `require` inside `EventuallyWithT`**: aborts after the first
iteration instead of retrying. Use `assert.*` + the `*WithCollect`
helpers.
- **Mixing mutation and query in one `EventuallyWithT`**: hides real
failures. Keep mutation outside, query inside.
- **Assuming node ordering**: look up by property.
- **Ignoring `err` from `client.Status()`**: retry only retries the
whole block; don't silently drop errors from mid-block calls.
- **Timeouts too tight**: 5s is reasonable for local state, 10s for
state that must propagate through the map poll cycle. Don't go lower
to "speed up the test" — you just make it flaky.
## Debugging failing tests
Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them
in this order: server stderr, client stderr, MapResponse JSON, database
snapshot. The full debugging workflow, heuristics, and failure patterns
are documented in [`../cmd/hi/README.md`](../cmd/hi/README.md).