headscale/integration/README.md
Kristoffer Dalby 70b622fc68 docs: expand cmd/hi and integration READMEs
Move integration-test runbook and authoring guide into the component
READMEs so the content sits next to the code it describes.
2026-04-10 12:30:07 +01:00


# Integration testing

Headscale's integration tests start a real Headscale server and run scenarios against real Tailscale clients across supported versions, all inside Docker. They are the safety net that keeps us honest about Tailscale protocol compatibility.

This file documents how to write integration tests. For how to run them, see `../cmd/hi/README.md`.

Tests live in files ending with `_test.go`; the framework lives in the rest of this directory (`scenario.go`, `tailscale.go`, helpers, and the `hsic/`, `tsic/`, `dockertestutil/` packages).

## Running tests

For local runs, use `cmd/hi`:

```shell
go run ./cmd/hi doctor
go run ./cmd/hi run "TestPingAllByIP"
```

Alternatively, `act` runs the GitHub Actions workflow locally:

```shell
act pull_request -W .github/workflows/test-integration.yaml
```

Each test runs as a separate workflow on GitHub Actions. To add a new test, run `go generate` inside `../cmd/gh-action-integration-generator/` and commit the generated workflow file.

## Framework overview

The integration framework has four layers:

- `scenario.go` — `Scenario` orchestrates a test environment: a Headscale server, one or more users, and a collection of Tailscale clients. `NewScenario(spec)` returns a ready-to-use environment.
- `hsic/` — "Headscale Integration Container": wraps a Headscale server in Docker. Options for config, DB backend, DERP, OIDC, etc.
- `tsic/` — "Tailscale Integration Container": wraps a single Tailscale client. Options for version, hostname, auth method, etc.
- `dockertestutil/` — low-level Docker helpers (networks, container lifecycle, `IsRunningInContainer()` detection).

Tests compose these pieces via `ScenarioSpec` and `CreateHeadscaleEnv` rather than calling Docker directly.

## Required scaffolding

### `IntegrationSkip(t)`

Every integration test function must call `IntegrationSkip(t)` as its first statement. Without it, the test runs in the wrong environment and fails with confusing errors.

```go
func TestMyScenario(t *testing.T) {
    IntegrationSkip(t)
    // ... rest of the test
}
```

`IntegrationSkip` is defined in `integration/scenario_test.go:15` and:

- skips the test when not running inside the Docker test container (`dockertestutil.IsRunningInContainer()`),
- skips when `-short` is passed to `go test`.
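The two conditions reduce to a small decision function. A runnable sketch (illustrative only — `shouldSkip` is not a real helper in the codebase; the real `IntegrationSkip` calls `t.Skip` directly):

```go
package main

import "fmt"

// shouldSkip mirrors the two conditions IntegrationSkip checks — a sketch,
// not the real implementation.
func shouldSkip(inContainer, short bool) (bool, string) {
	if !inContainer {
		return true, "not running inside the Docker test container"
	}
	if short {
		return true, "-short was passed to go test"
	}
	return false, ""
}

func main() {
	skip, reason := shouldSkip(false, false)
	fmt.Println(skip, reason) // true not running inside the Docker test container

	skip, reason = shouldSkip(true, true)
	fmt.Println(skip, reason) // true -short was passed to go test

	skip, _ = shouldSkip(true, false)
	fmt.Println(skip) // false
}
```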

### Scenario setup

The canonical setup creates users, clients, and the Headscale server in one shot:

```go
func TestMyScenario(t *testing.T) {
    IntegrationSkip(t)
    t.Parallel()

    spec := ScenarioSpec{
        NodesPerUser: 2,
        Users:        []string{"alice", "bob"},
    }
    scenario, err := NewScenario(spec)
    require.NoError(t, err)
    defer scenario.ShutdownAssertNoPanics(t)

    err = scenario.CreateHeadscaleEnv(
        []tsic.Option{tsic.WithSSH()},
        hsic.WithTestName("myscenario"),
    )
    require.NoError(t, err)

    allClients, err := scenario.ListTailscaleClients()
    require.NoError(t, err)

    headscale, err := scenario.Headscale()
    require.NoError(t, err)

    // ... assertions
}
```

Review `scenario.go` and `hsic/options.go` / `tsic/options.go` for the full option set (DERP, OIDC, policy files, DB backend, ACL grants, exit-node config, etc.).

## The `EventuallyWithT` pattern

Integration tests operate on a distributed system with real async propagation: clients advertise state, the server processes it, and updates stream to peers. Direct assertions made immediately after a state change fail intermittently. Wrap external calls in `assert.EventuallyWithT`:

```go
assert.EventuallyWithT(t, func(c *assert.CollectT) {
    status, err := client.Status()
    assert.NoError(c, err)
    for _, peerKey := range status.Peers() {
        peerStatus := status.Peer[peerKey]
        requirePeerSubnetRoutesWithCollect(c, peerStatus, expectedRoutes)
    }
}, 10*time.Second, 500*time.Millisecond, "client should see expected routes")
```

### External calls that need wrapping

These read distributed state and may reflect stale data until propagation completes:

- `headscale.ListNodes()`
- `client.Status()`
- `client.Curl()`
- `client.Traceroute()`
- `client.Execute()` when the command reads state

### Blocking operations that must NOT be wrapped

State-mutating commands run exactly once and either succeed or fail immediately — not eventually. Wrapping them in `EventuallyWithT` hides real failures behind retry.

Use `client.MustStatus()` when you only need an ID for a blocking call:

```go
// CORRECT — mutation runs once
for _, client := range allClients {
    status := client.MustStatus()
    _, _, err := client.Execute([]string{
        "tailscale", "set",
        "--advertise-routes=" + expectedRoutes[string(status.Self.ID)],
    })
    require.NoErrorf(t, err, "failed to advertise route: %s", err)
}
```

Typical blocking operations: any `tailscale set` (routes, exit node, accept-routes, ssh), node registration via the CLI, user creation via gRPC.

### The four rules

1. One external call per `EventuallyWithT` block. Related assertions on the result of a single call go together in the same block.

   Loop exception: iterating over a collection of clients (or peers) and calling `Status()` on each inside a single block is allowed — it is the same logical "check all clients" operation. The rule applies to distinct calls like `ListNodes()` + `Status()`, which must be split into separate blocks.

2. Never nest `EventuallyWithT` calls. A nested retry loop multiplies timing windows and makes failures impossible to diagnose.

3. Use `*WithCollect` helper variants inside the block. Regular helpers use `require` and abort on the first failed assertion, preventing retry.

4. Always provide a descriptive final message — it appears on failure and is your only clue about what the test was waiting for.

### Variable scoping

Variables used across multiple `EventuallyWithT` blocks must be declared at function scope. Inside the block, assign with `=`, not `:=` — `:=` creates a shadow invisible to the outer scope:

```go
var nodes []*v1.Node
var err error
assert.EventuallyWithT(t, func(c *assert.CollectT) {
    nodes, err = headscale.ListNodes()   // = not :=
    assert.NoError(c, err)
    assert.Len(c, nodes, 2)
    requireNodeRouteCountWithCollect(c, nodes[0], 2, 2, 2)
}, 10*time.Second, 500*time.Millisecond, "nodes should have expected routes")

// nodes is usable here because it was declared at function scope
```
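The shadowing behaviour is plain Go scoping, demonstrable outside the framework:

```go
package main

import "fmt"

// outerAfterShadow shows that := inside a closure declares a new variable:
// the outer x is untouched by the := branch but updated by the = branch.
func outerAfterShadow() (afterDeclare, afterAssign int) {
	x := 1
	func() {
		x := 2 // := — a NEW x, invisible outside this closure
		_ = x
	}()
	afterDeclare = x // still 1

	func() {
		x = 3 // = — assigns to the outer x
	}()
	afterAssign = x // now 3
	return
}

func main() {
	a, b := outerAfterShadow()
	fmt.Println(a, b) // 1 3
}
```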

## Helper functions

Inside `EventuallyWithT` blocks, use the `*WithCollect` variants so assertion failures restart the wait loop instead of failing the test immediately:

- `requirePeerSubnetRoutesWithCollect(c, status, expected)` — `integration/route_test.go:2941`
- `requireNodeRouteCountWithCollect(c, node, announced, approved, subnet)` — `integration/route_test.go:2958`
- `assertTracerouteViaIPWithCollect(c, traceroute, ip)` — `integration/route_test.go:2898`

When you write a new helper to be called inside `EventuallyWithT`, it must accept `*assert.CollectT` as its first parameter, not `*testing.T`.
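The reason the parameter type matters is that the helper must report failures through the collector rather than abort the test. A dependency-free sketch of the idea — `reporter` captures the one method shape involved, and `requireRouteCount` is a hypothetical helper, not one from the repo:

```go
package main

import "fmt"

// reporter captures the method shared by *testing.T and testify's
// *assert.CollectT. A helper written against it can report failures in
// both contexts (the repo's helpers accept *assert.CollectT concretely).
type reporter interface {
	Errorf(format string, args ...any)
}

// fakeReporter records failures instead of failing a test, so the helper
// can be exercised standalone.
type fakeReporter struct{ failures []string }

func (f *fakeReporter) Errorf(format string, args ...any) {
	f.failures = append(f.failures, fmt.Sprintf(format, args...))
}

// requireRouteCount is a hypothetical helper in the *WithCollect style:
// it records a failure rather than aborting, so the retry loop continues.
func requireRouteCount(t reporter, got, want int) {
	if got != want {
		t.Errorf("route count: got %d, want %d", got, want)
	}
}

func main() {
	f := &fakeReporter{}
	requireRouteCount(f, 1, 2)
	fmt.Println(len(f.failures)) // 1
	fmt.Println(f.failures[0])   // route count: got 1, want 2
}
```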

## Identifying nodes by property, not position

The order of `headscale.ListNodes()` is not stable. Tests that index `nodes[0]` will break when node ordering changes. Look nodes up by ID, hostname, or tag:

```go
// WRONG — relies on array position
require.Len(t, nodes[0].GetAvailableRoutes(), 1)

// CORRECT — find the node that should have the route
expectedRoutes := map[string]string{"1": "10.33.0.0/16"}
for _, node := range nodes {
    nodeIDStr := fmt.Sprintf("%d", node.GetId())
    if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute {
        assert.Contains(t, node.GetAvailableRoutes(), route)
    }
}
```

## Full example: advertising and approving a route

```go
func TestRouteAdvertisementBasic(t *testing.T) {
    IntegrationSkip(t)
    t.Parallel()

    spec := ScenarioSpec{
        NodesPerUser: 2,
        Users:        []string{"user1"},
    }
    scenario, err := NewScenario(spec)
    require.NoError(t, err)
    defer scenario.ShutdownAssertNoPanics(t)

    err = scenario.CreateHeadscaleEnv([]tsic.Option{}, hsic.WithTestName("route"))
    require.NoError(t, err)

    allClients, err := scenario.ListTailscaleClients()
    require.NoError(t, err)

    headscale, err := scenario.Headscale()
    require.NoError(t, err)

    // --- Blocking: advertise the route on one client ---
    router := allClients[0]
    _, _, err = router.Execute([]string{
        "tailscale", "set",
        "--advertise-routes=10.33.0.0/16",
    })
    require.NoErrorf(t, err, "advertising route: %s", err)

    // --- Eventually: headscale should see the announced route ---
    var nodes []*v1.Node
    assert.EventuallyWithT(t, func(c *assert.CollectT) {
        nodes, err = headscale.ListNodes()
        assert.NoError(c, err)
        assert.Len(c, nodes, 2)

        for _, node := range nodes {
            if node.GetName() == router.Hostname() {
                requireNodeRouteCountWithCollect(c, node, 1, 0, 0)
            }
        }
    }, 10*time.Second, 500*time.Millisecond, "route should be announced")

    // --- Blocking: approve the route via headscale CLI ---
    var routerNode *v1.Node
    for _, node := range nodes {
        if node.GetName() == router.Hostname() {
            routerNode = node
            break
        }
    }
    require.NotNil(t, routerNode)

    _, err = headscale.ApproveRoutes(routerNode.GetId(), []string{"10.33.0.0/16"})
    require.NoError(t, err)

    // --- Eventually: a peer should see the approved route ---
    peer := allClients[1]
    assert.EventuallyWithT(t, func(c *assert.CollectT) {
        status, err := peer.Status()
        assert.NoError(c, err)
        for _, peerKey := range status.Peers() {
            if peerKey == router.PublicKey() {
                requirePeerSubnetRoutesWithCollect(c,
                    status.Peer[peerKey],
                    []netip.Prefix{netip.MustParsePrefix("10.33.0.0/16")})
            }
        }
    }, 10*time.Second, 500*time.Millisecond, "peer should see approved route")
}
```

## Common pitfalls

- Forgetting `IntegrationSkip(t)`: the test runs outside Docker and fails in confusing ways. Always the first line.
- Using `require` inside `EventuallyWithT`: aborts after the first iteration instead of retrying. Use `assert.*` plus the `*WithCollect` helpers.
- Mixing mutation and query in one `EventuallyWithT`: hides real failures. Keep mutation outside, query inside.
- Assuming node ordering: look up by property.
- Ignoring `err` from `client.Status()`: the retry only retries the whole block; don't silently drop errors from mid-block calls.
- Timeouts too tight: 5s is reasonable for local state, 10s for state that must propagate through the map poll cycle. Don't go lower to "speed up the test" — you just make it flaky.

## Debugging failing tests

Tests save comprehensive artefacts to `control_logs/{runID}/`. Read them in this order: server stderr, client stderr, MapResponse JSON, database snapshot. The full debugging workflow, heuristics, and failure patterns are documented in `../cmd/hi/README.md`.