mirror of https://github.com/juanfont/headscale.git (synced 2026-04-17 06:19:51 +02:00)

docs: expand cmd/hi and integration READMEs

Move integration-test runbook and authoring guide into the component READMEs so the content sits next to the code it describes.

cmd/hi/README.md:
# hi — Headscale Integration test runner

`hi` wraps Docker container orchestration around the tests in
[`../../integration`](../../integration) and extracts debugging artefacts
(logs, database snapshots, MapResponse protocol captures) for post-mortem
analysis.

**Read this file in full before running any `hi` command.** The test
runner has sharp edges — wrong flags produce stale containers, lost
artefacts, or hung CI.

For test-authoring patterns (scenario setup, `EventuallyWithT`,
`IntegrationSkip`, helper variants), read
[`../../integration/README.md`](../../integration/README.md).

## Quick Start

```bash
# Verify system requirements (Docker, Go, disk space, images)
go run ./cmd/hi doctor

# Run a single test (the default flags are tuned for development)
go run ./cmd/hi run "TestPingAllByIP"

# Run a database-heavy test against PostgreSQL
go run ./cmd/hi run "TestExpireNode" --postgres

# Pattern matching
go run ./cmd/hi run "TestSubnet*"
```

Run `doctor` before the first `run` in any new environment. Tests
generate ~100 MB of logs per run in `control_logs/`; `doctor` verifies
there is enough space and that the required Docker images are available.

## Commands

| Command            | Purpose                                              |
| ------------------ | ---------------------------------------------------- |
| `run [pattern]`    | Execute the test(s) matching `pattern`               |
| `doctor`           | Verify system requirements                           |
| `clean networks`   | Prune unused Docker networks                         |
| `clean images`     | Clean old test images                                |
| `clean containers` | Kill **all** test containers (dangerous — see below) |
| `clean cache`      | Clean Go module cache volume                         |
| `clean all`        | Run all cleanup operations                           |
## Flags

Defaults are tuned for single-test development runs. Review before
changing.

| Flag                | Default        | Purpose                                                                      |
| ------------------- | -------------- | ---------------------------------------------------------------------------- |
| `--timeout`         | `120m`         | Total test timeout. Use the built-in flag — never wrap with bash `timeout`.  |
| `--postgres`        | `false`        | Use PostgreSQL instead of SQLite                                             |
| `--failfast`        | `true`         | Stop on first test failure                                                   |
| `--go-version`      | auto           | Detected from `go.mod` (currently 1.26.1)                                    |
| `--clean-before`    | `true`         | Clean stale (stopped/exited) containers before starting                      |
| `--clean-after`     | `true`         | Clean this run's containers after completion                                 |
| `--keep-on-failure` | `false`        | Preserve containers for manual inspection on failure                         |
| `--logs-dir`        | `control_logs` | Where to save run artefacts                                                  |
| `--verbose`         | `false`        | Verbose output                                                               |
| `--stats`           | `false`        | Collect container resource-usage stats                                       |
| `--hs-memory-limit` | `0`            | Fail if any headscale container exceeds N MB (0 = disabled)                  |
| `--ts-memory-limit` | `0`            | Fail if any tailscale container exceeds N MB                                 |

### Timeout guidance

The default `120m` is generous for a single test. If you must tune it,
these are realistic floors by category:

| Test type                 | Minimum     | Examples                              |
| ------------------------- | ----------- | ------------------------------------- |
| Basic functionality / CLI | 900s (15m)  | `TestPingAllByIP`, `TestCLI*`         |
| Route / ACL               | 1200s (20m) | `TestSubnet*`, `TestACL*`             |
| HA / failover             | 1800s (30m) | `TestHASubnetRouter*`                 |
| Long-running              | 2100s (35m) | `TestNodeOnlineStatus` (~12 min body) |
| Full suite                | 45m         | `go test ./integration -timeout 45m`  |

**Never** use the shell `timeout` command around `hi`. It kills the
process mid-cleanup and leaves stale containers:

```bash
timeout 300 go run ./cmd/hi run "TestName"     # WRONG — orphaned containers
go run ./cmd/hi run "TestName" --timeout=900s  # correct
```
## Concurrent Execution

Multiple `hi run` invocations can run simultaneously on the same Docker
daemon. Each invocation gets a unique **Run ID** (format
`YYYYMMDD-HHMMSS-6charhash`, e.g. `20260409-104215-mdjtzx`).

- **Container names** include the short run ID: `ts-mdjtzx-1-74-fgdyls`
- **Docker labels**: `hi.run-id={runID}` on every container
- **Port allocation**: dynamic — kernel assigns free ports, no conflicts
- **Cleanup isolation**: each run cleans only its own containers
- **Log directories**: `control_logs/{runID}/`

```bash
# Start three tests in parallel — each gets its own run ID
go run ./cmd/hi run "TestPingAllByIP" &
go run ./cmd/hi run "TestACLAllowUserDst" &
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
```

### Safety rules for concurrent runs

- ✅ Your run cleans only containers labelled with its own `hi.run-id`
- ✅ `--clean-before` removes only stopped/exited containers
- ❌ **Never** run `docker rm -f $(docker ps -q --filter name=hs-)` —
  this destroys other agents' live test sessions
- ❌ **Never** run `docker system prune -f` while any tests are running
- ❌ **Never** run `hi clean containers` / `hi clean all` while other
  tests are running — both kill all test containers on the daemon

To identify your own containers:

```bash
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
```

The run ID appears at the top of the `hi run` output — copy it from
there rather than trying to reconstruct it.
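
If the output has scrolled away, the artefact directory names double as
run IDs. A minimal sketch, assuming the default `--logs-dir`:

```bash
# Most recent run IDs, newest first
ls -1t control_logs 2>/dev/null | head -n 5
```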
## Artefacts

Every run saves debugging artefacts under `control_logs/{runID}/`:

```
control_logs/20260409-104215-mdjtzx/
├── hs-<test>-<hash>.stderr.log      # headscale server errors
├── hs-<test>-<hash>.stdout.log      # headscale server output
├── hs-<test>-<hash>.db              # database snapshot (SQLite)
├── hs-<test>-<hash>_metrics.txt     # Prometheus metrics dump
├── hs-<test>-<hash>-mapresponses/   # MapResponse protocol captures
├── ts-<client>-<hash>.stderr.log    # tailscale client errors
├── ts-<client>-<hash>.stdout.log    # tailscale client output
└── ts-<client>-<hash>_status.json   # client network-status dump
```

Artefacts persist after cleanup. Old runs accumulate fast — delete
unwanted directories to reclaim disk.
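
One way to prune old runs, as a sketch; the 7-day retention window is an
arbitrary choice, not project policy:

```bash
# Delete artefact directories older than 7 days (GNU find)
find control_logs -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} + 2>/dev/null || true
```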
## Debugging workflow

When a test fails, read the artefacts **in this order**:

1. **`hs-*.stderr.log`** — headscale server errors, panics, policy
   evaluation failures. Most issues originate server-side.

   ```bash
   grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
   ```

2. **`ts-*.stderr.log`** — authentication failures, connectivity issues,
   DNS resolution problems on the client side.

3. **MapResponse JSON** in `hs-*-mapresponses/` — protocol-level
   debugging for network map generation, peer visibility, route
   distribution, policy evaluation results.

   ```bash
   ls control_logs/*/hs-*-mapresponses/
   jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
     control_logs/*/hs-*-mapresponses/001.json
   ```

4. **`*_status.json`** — client peer-connectivity state.

5. **`hs-*.db`** — SQLite snapshot for post-mortem consistency checks.

   ```bash
   sqlite3 control_logs/<runID>/hs-*.db
   sqlite> .tables
   sqlite> .schema nodes
   sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';
   ```

6. **`*_metrics.txt`** — Prometheus dumps for latency, NodeStore
   operation timing, database query performance, memory usage.
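
For a first pass over a metrics dump, plain `grep` goes a long way. The
name patterns below are illustrative guesses, not the actual metric names:

```bash
# Surface timing-related metrics from the dumps
grep -Ei "latency|duration|seconds" control_logs/*/hs-*_metrics.txt 2>/dev/null || true
```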
## Heuristic: infrastructure vs code

**Before blaming Docker, disk, or network: read `hs-*.stderr.log` in
full.** In practice, well over 99% of failures are code bugs (policy
evaluation, NodeStore sync, route approval) rather than infrastructure.

Actual infrastructure failures have signature error messages:

| Signature                                                       | Cause                     | Fix                                                           |
| --------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------- |
| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS                | Reset Docker networking                                       |
| `container creation timeout`, no progress >2 min                | Resource exhaustion       | `docker system prune -f` (when no other tests running), retry |
| OOM kills, slow Docker daemon                                   | Too many concurrent tests | Reduce concurrency, wait for completion                       |
| `no space left on device`                                       | Disk full                 | Delete old `control_logs/`                                    |

If you don't see a signature error, **assume it's a code regression** —
do not retry hoping the flake goes away.
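
Before filing a failure under "disk full", a quick sanity check (a sketch):

```bash
# Free space on the working filesystem, and the artefacts' footprint
df -h .
du -sh control_logs 2>/dev/null || true
```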
## Common failure patterns (code bugs)

### Route advertisement timing

Test asserts route state before the client has finished propagating its
Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when
the test expects a route.

- **Wrong fix**: `time.Sleep(5 * time.Second)` — fragile and slow.
- **Right fix**: wrap the assertion in `EventuallyWithT`. See
  [`../../integration/README.md`](../../integration/README.md).

### NodeStore sync issues

Route changes not reflected in the NodeStore snapshot. Symptom: route
advertisements in logs but no tracking updates in subsequent reads.

The sync point is `State.UpdateNodeFromMapRequest()` in
`hscontrol/state/state.go`. If you added a new kind of client state
update, make sure it lands here.

### HA failover: routes disappearing on disconnect

`TestHASubnetRouterFailover` fails because approved routes vanish when
a subnet router goes offline. **This is a bug, not expected behaviour.**
Route approval must not be coupled to client connectivity — routes
stay approved; only the primary-route selection is affected by
connectivity.

### Policy evaluation race

Symptom: tests that change policy and immediately assert peer visibility
fail intermittently. Policy changes trigger async recomputation.

- See recent fixes in `git log -- hscontrol/state/` for examples (e.g.
  the `PolicyChange` trigger on every Connect/Disconnect).

### SQLite vs PostgreSQL timing differences

Some race conditions only surface on one backend. If a test is flaky,
try the other backend with `--postgres`:

```bash
go run ./cmd/hi run "TestName" --postgres --verbose
```

PostgreSQL generally has more consistent timing; SQLite can expose
races during rapid writes.
## Keeping containers for inspection

If you need to inspect a failed test's state manually:

```bash
go run ./cmd/hi run "TestName" --keep-on-failure

# containers survive — inspect them
docker exec -it ts-<runID>-<...> /bin/sh
docker logs hs-<runID>-<...>

# clean up manually when done
go run ./cmd/hi clean all   # only when no other tests are running
```

integration/README.md:
# Integration testing

Headscale's integration tests start a real Headscale server and run
scenarios against real Tailscale clients across supported versions, all
inside Docker. They are the safety net that keeps us honest about
Tailscale protocol compatibility.

This file documents **how to write** integration tests. For **how to
run** them, see [`../cmd/hi/README.md`](../cmd/hi/README.md).

Tests live in files ending with `_test.go`; the framework lives in the
rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the
`hsic/`, `tsic/`, `dockertestutil/` packages).

## Running tests

For local runs, use [`cmd/hi`](../cmd/hi):

```bash
go run ./cmd/hi doctor
go run ./cmd/hi run "TestPingAllByIP"
```

Alternatively, [`act`](https://github.com/nektos/act) runs the GitHub
Actions workflow locally:

```bash
act pull_request -W .github/workflows/test-integration.yaml
```

Each test runs as a separate workflow on GitHub Actions. To add a new
test, run `go generate` inside `../cmd/gh-action-integration-generator/`
and commit the generated workflow file.
## Framework overview

The integration framework has four layers:

- **`scenario.go`** — `Scenario` orchestrates a test environment: a
  Headscale server, one or more users, and a collection of Tailscale
  clients. `NewScenario(spec)` returns a ready-to-use environment.
- **`hsic/`** — "Headscale Integration Container": wraps a Headscale
  server in Docker. Options for config, DB backend, DERP, OIDC, etc.
- **`tsic/`** — "Tailscale Integration Container": wraps a single
  Tailscale client. Options for version, hostname, auth method, etc.
- **`dockertestutil/`** — low-level Docker helpers (networks, container
  lifecycle, `IsRunningInContainer()` detection).

Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv`
rather than calling Docker directly.
## Required scaffolding

### `IntegrationSkip(t)`

**Every** integration test function must call `IntegrationSkip(t)` as
its first statement. Without it, the test runs in the wrong environment
and fails with confusing errors.

```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	// ... rest of the test
}
```

`IntegrationSkip` is defined in `integration/scenario_test.go:15` and:

- skips the test when not running inside the Docker test container
  (`dockertestutil.IsRunningInContainer()`),
- skips when `-short` is passed to `go test`.
### Scenario setup

The canonical setup creates users, clients, and the Headscale server in
one shot:

```go
func TestMyScenario(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"alice", "bob"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv(
		[]tsic.Option{tsic.WithSSH()},
		hsic.WithTestName("myscenario"),
	)
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)

	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// ... assertions
}
```

Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the
full option set (DERP, OIDC, policy files, DB backend, ACL grants,
exit-node config, etc.).
## The `EventuallyWithT` pattern

Integration tests operate on a distributed system with real async
propagation: clients advertise state, the server processes it, updates
stream to peers. Direct assertions after state changes fail
intermittently. Wrap external calls in `assert.EventuallyWithT`:

```go
assert.EventuallyWithT(t, func(c *assert.CollectT) {
	status, err := client.Status()
	assert.NoError(c, err)
	for _, peerKey := range status.Peers() {
		peerStatus := status.Peer[peerKey]
		requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes)
	}
}, 10*time.Second, 500*time.Millisecond, "client should see expected routes")
```
### External calls that need wrapping

These read distributed state and may reflect stale data until
propagation completes:

- `headscale.ListNodes()`
- `client.Status()`
- `client.Curl()`
- `client.Traceroute()`
- `client.Execute()` when the command reads state
### Blocking operations that must NOT be wrapped

State-mutating commands run exactly once and either succeed or fail
immediately — not eventually. Wrapping them in `EventuallyWithT` hides
real failures behind retry.

Use `client.MustStatus()` when you only need an ID for a blocking call:

```go
// CORRECT — mutation runs once
for _, client := range allClients {
	status := client.MustStatus()
	_, _, err := client.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=" + expectedRoutes[string(status.Self.ID)],
	})
	require.NoErrorf(t, err, "failed to advertise route: %s", err)
}
```

Typical blocking operations: any `tailscale set` (routes, exit node,
accept-routes, ssh), node registration via the CLI, user creation via
gRPC.
### The four rules

1. **One external call per `EventuallyWithT` block.** Related assertions
   on the result of a single call go together in the same block.

   **Loop exception**: iterating over a collection of clients (or peers)
   and calling `Status()` on each inside a single block is allowed — it
   is the same logical "check all clients" operation. The rule applies
   to distinct calls like `ListNodes()` + `Status()`, which must be
   split into separate blocks.

2. **Never nest `EventuallyWithT` calls.** A nested retry loop
   multiplies timing windows and makes failures impossible to diagnose.

3. **Use `*WithCollect` helper variants** inside the block. Regular
   helpers use `require` and abort on the first failed assertion,
   preventing retry.

4. **Always provide a descriptive final message** — it appears on
   failure and is your only clue about what the test was waiting for.
### Variable scoping

Variables used across multiple `EventuallyWithT` blocks must be declared
at function scope. Inside the block, assign with `=`, not `:=` — `:=`
creates a shadow invisible to the outer scope:

```go
var nodes []*v1.Node
var err error
assert.EventuallyWithT(t, func(c *assert.CollectT) {
	nodes, err = headscale.ListNodes() // = not :=
	assert.NoError(c, err)
	assert.Len(c, nodes, 2)
	requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2)
}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes")

// nodes is usable here because it was declared at function scope
```
### Helper functions

Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so
assertion failures restart the wait loop instead of failing the test
immediately:

- `requirePeerSubnetRoutesWithCollect(c, status, expected)` —
  `integration/route_test.go:2941`
- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)` —
  `integration/route_test.go:2958`
- `assertTracerouteViaIPWithCollect(c, traceroute, ip)` —
  `integration/route_test.go:2898`

When you write a new helper to be called inside `EventuallyWithT`, it
must accept `*assert.CollectT` as its first parameter, not `*testing.T`.
## Identifying nodes by property, not position

The order of `headscale.ListNodes()` is not stable. Tests that index
`nodes[0]` will break when node ordering changes. Look nodes up by ID,
hostname, or tag:

```go
// WRONG — relies on array position
require.Len(t, nodes[0].GetAvailableRoutes(), 1)

// CORRECT — find the node that should have the route
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
	nodeIDStr := fmt.Sprintf("%d", node.GetId())
	if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
		assert.Contains(t, node.GetAvailableRoutes(), route)
	}
}
```
## Full example: advertising and approving a route

```go
func TestRouteAdvertisementBasic(t *testing.T) {
	IntegrationSkip(t)
	t.Parallel()

	spec := ScenarioSpec{
		NodesPerUser: 2,
		Users:        []string{"user1"},
	}
	scenario, err := NewScenario(spec)
	require.NoError(t, err)
	defer scenario.ShutdownAssertNoPanics(t)

	err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route"))
	require.NoError(t, err)

	allClients, err := scenario.ListTailscaleClients()
	require.NoError(t, err)

	headscale, err := scenario.Headscale()
	require.NoError(t, err)

	// --- Blocking: advertise the route on one client ---
	router := allClients[0]
	_, _, err = router.Execute([]string{
		"tailscale", "set",
		"--advertise-routes=10.33.0.0/16",
	})
	require.NoErrorf(t, err, "advertising route: %s", err)

	// --- Eventually: headscale should see the announced route ---
	var nodes []*v1.Node
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		nodes, err = headscale.ListNodes()
		assert.NoError(c, err)
		assert.Len(c, nodes, 2)

		for _, node := range nodes {
			if node.GetName() == router.Hostname() {
				requireNodeRouteCountWithCollect(c, node, 1, 0, 0)
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "route should be announced")

	// --- Blocking: approve the route via headscale CLI ---
	var routerNode *v1.Node
	for _, node := range nodes {
		if node.GetName() == router.Hostname() {
			routerNode = node
			break
		}
	}
	require.NotNil(t, routerNode)

	_, err = headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"})
	require.NoError(t, err)

	// --- Eventually: a peer should see the approved route ---
	peer := allClients[1]
	assert.EventuallyWithT(t, func(c *assert.CollectT) {
		status, err := peer.Status()
		assert.NoError(c, err)
		for _, peerKey := range status.Peers() {
			if peerKey == router.PublicKey() {
				requirePeerSubnetRoutesWithCollect(c,
					status.Peer[peerKey],
					[]netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")})
			}
		}
	}, 10*time.Second, 500*time.Millisecond, "peer should see approved route")
}
```
## Common pitfalls

- **Forgetting `IntegrationSkip(t)`**: the test runs outside Docker and
  fails in confusing ways. Always the first line.
- **Using `require` inside `EventuallyWithT`**: aborts after the first
  iteration instead of retrying. Use `assert.*` + the `*WithCollect`
  helpers.
- **Mixing mutation and query in one `EventuallyWithT`**: hides real
  failures. Keep mutation outside, query inside.
- **Assuming node ordering**: look up by property.
- **Ignoring `err` from `client.Status()`**: retry only retries the
  whole block; don't silently drop errors from mid-block calls.
- **Timeouts too tight**: 5s is reasonable for local state, 10s for
  state that must propagate through the map poll cycle. Don't go lower
  to "speed up the test" — you just make it flaky.
## Debugging failing tests

Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them
in this order: server stderr, client stderr, MapResponse JSON, database
snapshot. The full debugging workflow, heuristics, and failure patterns
are documented in [`../cmd/hi/README.md`](../cmd/hi/README.md).