diff --git a/cmd/hi/README.md b/cmd/hi/README.md index 17324219..7678111d 100644 --- a/cmd/hi/README.md +++ b/cmd/hi/README.md @@ -1,6 +1,262 @@ -# hi +# hi — Headscale Integration test runner -hi (headscale integration runner) is an entirely "vibe coded" wrapper around our -[integration test suite](../integration). It essentially runs the docker -commands for you with some added benefits of extracting resources like logs and -databases. +`hi` wraps Docker container orchestration around the tests in +[`../../integration`](../../integration) and extracts debugging artefacts +(logs, database snapshots, MapResponse protocol captures) for post-mortem +analysis. + +**Read this file in full before running any `hi` command.** The test +runner has sharp edges — wrong flags produce stale containers, lost +artefacts, or hung CI. + +For test-authoring patterns (scenario setup, `EventuallyWithT`, +`IntegrationSkip`, helper variants), read +[`../../integration/README.md`](../../integration/README.md). + +## Quick Start + +```bash +# Verify system requirements (Docker, Go, disk space, images) +go run ./cmd/hi doctor + +# Run a single test (the default flags are tuned for development) +go run ./cmd/hi run "TestPingAllByIP" + +# Run a database-heavy test against PostgreSQL +go run ./cmd/hi run "TestExpireNode" --postgres + +# Pattern matching +go run ./cmd/hi run "TestSubnet*" +``` + +Run `doctor` before the first `run` in any new environment. Tests +generate ~100 MB of logs per run in `control_logs/`; `doctor` verifies +there is enough space and that the required Docker images are available. 
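Since artefacts accumulate at roughly 100 MB per run, periodic housekeeping helps. A minimal sketch using standard `du`/`find`, assuming the default `--logs-dir` of `control_logs/`; the 7-day cutoff is an arbitrary example, not a project policy:

```bash
# Harmless if the directory already exists; makes the snippet safe on a fresh checkout.
mkdir -p control_logs

# Show total space used by saved artefacts.
du -sh control_logs/

# Delete run directories older than 7 days; each run lives in its own
# top-level directory named after its run ID.
find control_logs/ -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
```

Scope the `find` to top-level run directories only, as above, so a partially written run directory is removed or kept as a unit rather than having individual log files deleted out from under it.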
+ +## Commands + +| Command | Purpose | +| ------------------ | ---------------------------------------------------- | +| `run [pattern]` | Execute the test(s) matching `pattern` | +| `doctor` | Verify system requirements | +| `clean networks` | Prune unused Docker networks | +| `clean images` | Clean old test images | +| `clean containers` | Kill **all** test containers (dangerous — see below) | +| `clean cache` | Clean Go module cache volume | +| `clean all` | Run all cleanup operations | + +## Flags + +Defaults are tuned for single-test development runs. Review before +changing. + +| Flag | Default | Purpose | +| ------------------- | -------------- | --------------------------------------------------------------------------- | +| `--timeout` | `120m` | Total test timeout. Use the built-in flag — never wrap with bash `timeout`. | +| `--postgres` | `false` | Use PostgreSQL instead of SQLite | +| `--failfast` | `true` | Stop on first test failure | +| `--go-version` | auto | Detected from `go.mod` (currently 1.26.1) | +| `--clean-before` | `true` | Clean stale (stopped/exited) containers before starting | +| `--clean-after` | `true` | Clean this run's containers after completion | +| `--keep-on-failure` | `false` | Preserve containers for manual inspection on failure | +| `--logs-dir` | `control_logs` | Where to save run artefacts | +| `--verbose` | `false` | Verbose output | +| `--stats` | `false` | Collect container resource-usage stats | +| `--hs-memory-limit` | `0` | Fail if any headscale container exceeds N MB (0 = disabled) | +| `--ts-memory-limit` | `0` | Fail if any tailscale container exceeds N MB | + +### Timeout guidance + +The default `120m` is generous for a single test. 
If you must tune it, +these are realistic floors by category: + +| Test type | Minimum | Examples | +| ------------------------- | ----------- | ------------------------------------- | +| Basic functionality / CLI | 900s (15m) | `TestPingAllByIP`, `TestCLI*` | +| Route / ACL | 1200s (20m) | `TestSubnet*`, `TestACL*` | +| HA / failover | 1800s (30m) | `TestHASubnetRouter*` | +| Long-running | 2100s (35m) | `TestNodeOnlineStatus` (~12 min body) | +| Full suite | 45m | `go test ./integration -timeout 45m` | + +**Never** use the shell `timeout` command around `hi`. It kills the +process mid-cleanup and leaves stale containers: + +```bash +timeout 300 go run ./cmd/hi run "TestName" # WRONG — orphaned containers +go run ./cmd/hi run "TestName" --timeout=900s # correct +``` + +## Concurrent Execution + +Multiple `hi run` invocations can run simultaneously on the same Docker +daemon. Each invocation gets a unique **Run ID** (format +`YYYYMMDD-HHMMSS-6charhash`, e.g. `20260409-104215-mdjtzx`). + +- **Container names** include the short run ID: `ts-mdjtzx-1-74-fgdyls` +- **Docker labels**: `hi.run-id={runID}` on every container +- **Port allocation**: dynamic — kernel assigns free ports, no conflicts +- **Cleanup isolation**: each run cleans only its own containers +- **Log directories**: `control_logs/{runID}/` + +```bash +# Start three tests in parallel — each gets its own run ID +go run ./cmd/hi run "TestPingAllByIP" & +go run ./cmd/hi run "TestACLAllowUserDst" & +go run ./cmd/hi run "TestOIDCAuthenticationPingAll" & +``` + +### Safety rules for concurrent runs + +- ✅ Your run cleans only containers labelled with its own `hi.run-id` +- ✅ `--clean-before` removes only stopped/exited containers +- ❌ **Never** run `docker rm -f $(docker ps -q --filter name=hs-)` — + this destroys other agents' live test sessions +- ❌ **Never** run `docker system prune -f` while any tests are running +- ❌ **Never** run `hi clean containers` / `hi clean all` while other + tests are running — 
both kill all test containers on the daemon + +To identify your own containers: + +```bash +docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx" +``` + +The run ID appears at the top of the `hi run` output — copy it from +there rather than trying to reconstruct it. + +## Artefacts + +Every run saves debugging artefacts under `control_logs/{runID}/`: + +``` +control_logs/20260409-104215-mdjtzx/ +├── hs--.stderr.log # headscale server errors +├── hs--.stdout.log # headscale server output +├── hs--.db # database snapshot (SQLite) +├── hs--_metrics.txt # Prometheus metrics dump +├── hs---mapresponses/ # MapResponse protocol captures +├── ts--.stderr.log # tailscale client errors +├── ts--.stdout.log # tailscale client output +└── ts--_status.json # client network-status dump +``` + +Artefacts persist after cleanup. Old runs accumulate fast — delete +unwanted directories to reclaim disk. + +## Debugging workflow + +When a test fails, read the artefacts **in this order**: + +1. **`hs-*.stderr.log`** — headscale server errors, panics, policy + evaluation failures. Most issues originate server-side. + + ```bash + grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log + ``` + +2. **`ts-*.stderr.log`** — authentication failures, connectivity issues, + DNS resolution problems on the client side. + +3. **MapResponse JSON** in `hs-*-mapresponses/` — protocol-level + debugging for network map generation, peer visibility, route + distribution, policy evaluation results. + + ```bash + ls control_logs/*/hs-*-mapresponses/ + jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \ + control_logs/*/hs-*-mapresponses/001.json + ``` + +4. **`*_status.json`** — client peer-connectivity state. + +5. **`hs-*.db`** — SQLite snapshot for post-mortem consistency checks. + + ```bash + sqlite3 control_logs//hs-*.db + sqlite> .tables + sqlite> .schema nodes + sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%'; + ``` + +6. 
**`*_metrics.txt`** — Prometheus dumps for latency, NodeStore + operation timing, database query performance, memory usage. + +## Heuristic: infrastructure vs code + +**Before blaming Docker, disk, or network: read `hs-*.stderr.log` in +full.** In practice, well over 99% of failures are code bugs (policy +evaluation, NodeStore sync, route approval) rather than infrastructure. + +Actual infrastructure failures have signature error messages: + +| Signature | Cause | Fix | +| --------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------- | +| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS | Reset Docker networking | +| `container creation timeout`, no progress >2 min | Resource exhaustion | `docker system prune -f` (when no other tests running), retry | +| OOM kills, slow Docker daemon | Too many concurrent tests | Reduce concurrency, wait for completion | +| `no space left on device` | Disk full | Delete old `control_logs/` | + +If you don't see a signature error, **assume it's a code regression** — +do not retry hoping the flake goes away. + +## Common failure patterns (code bugs) + +### Route advertisement timing + +Test asserts route state before the client has finished propagating its +Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when +the test expects a route. + +- **Wrong fix**: `time.Sleep(5 * time.Second)` — fragile and slow. +- **Right fix**: wrap the assertion in `EventuallyWithT`. See + [`../../integration/README.md`](../../integration/README.md). + +### NodeStore sync issues + +Route changes not reflected in the NodeStore snapshot. Symptom: route +advertisements in logs but no tracking updates in subsequent reads. + +The sync point is `State.UpdateNodeFromMapRequest()` in +`hscontrol/state/state.go`. If you added a new kind of client state +update, make sure it lands here. 
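The "one sync point" shape can be sketched in isolation. Everything below (`node`, `mapRequest`, `updateNodeFromMapRequest`) is a hypothetical stand-in for illustration only, not headscale's real types or the actual signature of `State.UpdateNodeFromMapRequest()`:

```go
package main

import "fmt"

// mapRequest stands in for the client-supplied state in a map request.
type mapRequest struct {
	Hostname         string
	AdvertisedRoutes []string
}

// node stands in for the server's stored snapshot of a client.
type node struct {
	Hostname string
	Routes   []string
}

// updateNodeFromMapRequest is the single place where every kind of
// client state update is folded into the stored snapshot. Handling a
// new update kind anywhere else lets the snapshot drift out of sync.
func updateNodeFromMapRequest(n *node, req mapRequest) {
	n.Hostname = req.Hostname
	n.Routes = append([]string(nil), req.AdvertisedRoutes...)
	// A new kind of client state update belongs here, in this one function.
}

func main() {
	n := &node{Hostname: "old-name"}
	updateNodeFromMapRequest(n, mapRequest{
		Hostname:         "router-1",
		AdvertisedRoutes: []string{"10.33.0.0/16"},
	})
	fmt.Println(n.Hostname, n.Routes) // router-1 [10.33.0.0/16]
}
```

The value of the shape: because every field a client can change funnels through one function, any snapshot read after that call is guaranteed to reflect the whole request, which is exactly the property the flaky symptom above violates.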
+ +### HA failover: routes disappearing on disconnect + +`TestHASubnetRouterFailover` fails because approved routes vanish when +a subnet router goes offline. **This is a bug, not expected behaviour.** +Route approval must not be coupled to client connectivity — routes +stay approved; only the primary-route selection is affected by +connectivity. + +### Policy evaluation race + +Symptom: tests that change policy and immediately assert peer visibility +fail intermittently. Policy changes trigger async recomputation. + +- See recent fixes in `git log -- hscontrol/state/` for examples (e.g. + the `PolicyChange` trigger on every Connect/Disconnect). + +### SQLite vs PostgreSQL timing differences + +Some race conditions only surface on one backend. If a test is flaky, +try the other backend with `--postgres`: + +```bash +go run ./cmd/hi run "TestName" --postgres --verbose +``` + +PostgreSQL generally has more consistent timing; SQLite can expose +races during rapid writes. + +## Keeping containers for inspection + +If you need to inspect a failed test's state manually: + +```bash +go run ./cmd/hi run "TestName" --keep-on-failure +# containers survive — inspect them +docker exec -it ts--<...> /bin/sh +docker logs hs--<...> +# clean up manually when done +go run ./cmd/hi clean all # only when no other tests are running +``` diff --git a/integration/README.md b/integration/README.md index 56247c52..5511a113 100644 --- a/integration/README.md +++ b/integration/README.md @@ -1,25 +1,336 @@ # Integration testing -Headscale relies on integration testing to ensure we remain compatible with Tailscale. +Headscale's integration tests start a real Headscale server and run +scenarios against real Tailscale clients across supported versions, all +inside Docker. They are the safety net that keeps us honest about +Tailscale protocol compatibility. -This is typically performed by starting a Headscale server and running a test "scenario" -with an array of Tailscale clients and versions. 
+This file documents **how to write** integration tests. For **how to +run** them, see [`../cmd/hi/README.md`](../cmd/hi/README.md). -Headscale's test framework and the current set of scenarios are defined in this directory. +Tests live in files ending with `_test.go`; the framework lives in the +rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the +`hsic/`, `tsic/`, `dockertestutil/` packages). -Tests are located in files ending with `_test.go` and the framework are located in the rest. +## Running tests -## Running integration tests locally - -The easiest way to run tests locally is to use [act](https://github.com/nektos/act), a local GitHub Actions runner: +For local runs, use [`cmd/hi`](../cmd/hi): +```bash +go run ./cmd/hi doctor +go run ./cmd/hi run "TestPingAllByIP" ``` + +Alternatively, [`act`](https://github.com/nektos/act) runs the GitHub +Actions workflow locally: + +```bash act pull_request -W .github/workflows/test-integration.yaml ``` -Alternatively, the `docker run` command in each GitHub workflow file can be used. +Each test runs as a separate workflow on GitHub Actions. To add a new +test, run `go generate` inside `../cmd/gh-action-integration-generator/` +and commit the generated workflow file. -## Running integration tests on GitHub Actions +## Framework overview -Each test currently runs as a separate workflows in GitHub actions, to add new test, run -`go generate` inside `../cmd/gh-action-integration-generator/` and commit the result. +The integration framework has four layers: + +- **`scenario.go`** — `Scenario` orchestrates a test environment: a + Headscale server, one or more users, and a collection of Tailscale + clients. `NewScenario(spec)` returns a ready-to-use environment. +- **`hsic/`** — "Headscale Integration Container": wraps a Headscale + server in Docker. Options for config, DB backend, DERP, OIDC, etc. +- **`tsic/`** — "Tailscale Integration Container": wraps a single + Tailscale client. 
Options for version, hostname, auth method, etc. +- **`dockertestutil/`** — low-level Docker helpers (networks, container + lifecycle, `IsRunningInContainer()` detection). + +Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv` +rather than calling Docker directly. + +## Required scaffolding + +### `IntegrationSkip(t)` + +**Every** integration test function must call `IntegrationSkip(t)` as +its first statement. Without it, the test runs in the wrong environment +and fails with confusing errors. + +```go +func TestMyScenario(t *testing.T) { + IntegrationSkip(t) + // ... rest of the test +} +``` + +`IntegrationSkip` is defined in `integration/scenario_test.go:15` and: + +- skips the test when not running inside the Docker test container + (`dockertestutil.IsRunningInContainer()`), +- skips when `-short` is passed to `go test`. + +### Scenario setup + +The canonical setup creates users, clients, and the Headscale server in +one shot: + +```go +func TestMyScenario(t *testing.T) { + IntegrationSkip(t) + t.Parallel() + + spec := ScenarioSpec{ + NodesPerUser: 2, + Users: []string{"alice", "bob"}, + } + scenario, err := NewScenario(spec) + require.NoError(t, err) + defer scenario.ShutdownAssertNoPanics(t) + + err = scenario.CreateHeadscaleEnv( + []tsic.Option{tsic.WithSSH()}, + hsic.WithTestName("myscenario"), + ) + require.NoError(t, err) + + allClients, err := scenario.ListTailscaleClients() + require.NoError(t, err) + + headscale, err := scenario.Headscale() + require.NoError(t, err) + + // ... assertions +} +``` + +Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the +full option set (DERP, OIDC, policy files, DB backend, ACL grants, +exit-node config, etc.). + +## The `EventuallyWithT` pattern + +Integration tests operate on a distributed system with real async +propagation: clients advertise state, the server processes it, updates +stream to peers. Direct assertions after state changes fail +intermittently. 
Wrap external calls in `assert.EventuallyWithT`: + +```go +assert.EventuallyWithT(t, func(c *assert.CollectT) { + status, err := client.Status() + assert.NoError(c, err) + for _, peerKey := range status.Peers() { + peerStatus := status.Peer[peerKey] + requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes) + } +}, 10*time.Second, 500*time.Millisecond, "client should see expected routes") +``` + +### External calls that need wrapping + +These read distributed state and may reflect stale data until +propagation completes: + +- `headscale.ListNodes()` +- `client.Status()` +- `client.Curl()` +- `client.Traceroute()` +- `client.Execute()` when the command reads state + +### Blocking operations that must NOT be wrapped + +State-mutating commands run exactly once and either succeed or fail +immediately — not eventually. Wrapping them in `EventuallyWithT` hides +real failures behind retry. + +Use `client.MustStatus()` when you only need an ID for a blocking call: + +```go +// CORRECT — mutation runs once +for _, client := range allClients { + status := client.MustStatus() + _, _, err := client.Execute([]string{ + "tailscale", "set", + "--advertise-routes=" + expectedRoutes[string(status.Self.ID)], + }) + require.NoErrorf(t, err, "failed to advertise route: %s", err) +} +``` + +Typical blocking operations: any `tailscale set` (routes, exit node, +accept-routes, ssh), node registration via the CLI, user creation via +gRPC. + +### The four rules + +1. **One external call per `EventuallyWithT` block.** Related assertions + on the result of a single call go together in the same block. + + **Loop exception**: iterating over a collection of clients (or peers) + and calling `Status()` on each inside a single block is allowed — it + is the same logical "check all clients" operation. The rule applies + to distinct calls like `ListNodes()` + `Status()`, which must be + split into separate blocks. + +2. 
**Never nest `EventuallyWithT` calls.** A nested retry loop + multiplies timing windows and makes failures impossible to diagnose. + +3. **Use `*WithCollect` helper variants** inside the block. Regular + helpers use `require` and abort on the first failed assertion, + preventing retry. + +4. **Always provide a descriptive final message** — it appears on + failure and is your only clue about what the test was waiting for. + +### Variable scoping + +Variables used across multiple `EventuallyWithT` blocks must be declared +at function scope. Inside the block, assign with `=`, not `:=` — `:=` +creates a shadow invisible to the outer scope: + +```go +var nodes []*v1.Node +var err error +assert.EventuallyWithT(t, func(c *assert.CollectT) { + nodes, err = headscale.ListNodes() // = not := + assert.NoError(c, err) + assert.Len(c, nodes, 2) + requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2) +}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes") + +// nodes is usable here because it was declared at function scope +``` + +### Helper functions + +Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so +assertion failures restart the wait loop instead of failing the test +immediately: + +- `requirePeerSubnetRoutesWithCollect(c, status, expected)` — + `integration/route_test.go:2941` +- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)` — + `integration/route_test.go:2958` +- `assertTracerouteViaIPWithCollect(c, traceroute, ip)` — + `integration/route_test.go:2898` + +When you write a new helper to be called inside `EventuallyWithT`, it +must accept `*assert.CollectT` as its first parameter, not `*testing.T`. + +## Identifying nodes by property, not position + +The order of `headscale.ListNodes()` is not stable. Tests that index +`nodes[0]` will break when node ordering changes. 
Look nodes up by ID, +hostname, or tag: + +```go +// WRONG — relies on array position +require.Len(t, nodes[0].GetAvailableRoutes(), 1) + +// CORRECT — find the node that should have the route +expectedRoutes := map[string]string{"1": "10.33.0.0/16"} +for _, node := range nodes { + nodeIDStr := fmt.Sprintf("%d", node.GetId()) + if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute { + assert.Contains(t, node.GetAvailableRoutes(), route) + } +} +``` + +## Full example: advertising and approving a route + +```go +func TestRouteAdvertisementBasic(t *testing.T) { + IntegrationSkip(t) + t.Parallel() + + spec := ScenarioSpec{ + NodesPerUser: 2, + Users: []string{"user1"}, + } + scenario, err := NewScenario(spec) + require.NoError(t, err) + defer scenario.ShutdownAssertNoPanics(t) + + err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route")) + require.NoError(t, err) + + allClients, err := scenario.ListTailscaleClients() + require.NoError(t, err) + + headscale, err := scenario.Headscale() + require.NoError(t, err) + + // --- Blocking: advertise the route on one client --- + router := allClients[0] + _, _, err = router.Execute([]string{ + "tailscale", "set", + "--advertise-routes=10.33.0.0/16", + }) + require.NoErrorf(t, err, "advertising route: %s", err) + + // --- Eventually: headscale should see the announced route --- + var nodes []*v1.Node + assert.EventuallyWithT(t, func(c *assert.CollectT) { + nodes, err = headscale.ListNodes() + assert.NoError(c, err) + assert.Len(c, nodes, 2) + + for _, node := range nodes { + if node.GetName() == router.Hostname() { + requireNodeRouteCountWithCollect(c, node, 1, 0, 0) + } + } + }, 10*time.Second, 500*time.Millisecond, "route should be announced") + + // --- Blocking: approve the route via headscale CLI --- + var routerNode *v1.Node + for _, node := range nodes { + if node.GetName() == router.Hostname() { + routerNode = node + break + } + } + require.NotNil(t, routerNode) + + _, err = 
headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"}) + require.NoError(t, err) + + // --- Eventually: a peer should see the approved route --- + peer := allClients[1] + assert.EventuallyWithT(t, func(c *assert.CollectT) { + status, err := peer.Status() + assert.NoError(c, err) + for _, peerKey := range status.Peers() { + if peerKey == router.PublicKey() { + requirePeerSubnetRoutesWithCollect(c, + status.Peer[peerKey], + []netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")}) + } + } + }, 10*time.Second, 500*time.Millisecond, "peer should see approved route") +} +``` + +## Common pitfalls + +- **Forgetting `IntegrationSkip(t)`**: the test runs outside Docker and + fails in confusing ways. Always the first line. +- **Using `require` inside `EventuallyWithT`**: aborts after the first + iteration instead of retrying. Use `assert.*` + the `*WithCollect` + helpers. +- **Mixing mutation and query in one `EventuallyWithT`**: hides real + failures. Keep mutation outside, query inside. +- **Assuming node ordering**: look up by property. +- **Ignoring `err` from `client.Status()`**: retry only retries the + whole block; don't silently drop errors from mid-block calls. +- **Timeouts too tight**: 5s is reasonable for local state, 10s for + state that must propagate through the map poll cycle. Don't go lower + to "speed up the test" — you just make it flaky. + +## Debugging failing tests + +Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them +in this order: server stderr, client stderr, MapResponse JSON, database +snapshot. The full debugging workflow, heuristics, and failure patterns +are documented in [`../cmd/hi/README.md`](../cmd/hi/README.md).