mirror of
https://github.com/juanfont/headscale.git
synced 2026-04-11 03:27:20 +02:00
docs: expand cmd/hi and integration READMEs
Move integration-test runbook and authoring guide into the component READMEs so the content sits next to the code it describes.
This commit is contained in:
266
cmd/hi/README.md
266
cmd/hi/README.md
@@ -1,6 +1,262 @@
|
||||
# hi
|
||||
# hi — Headscale Integration test runner
|
||||
|
||||
hi (headscale integration runner) is an entirely "vibe coded" wrapper around our
|
||||
[integration test suite](../integration). It essentially runs the docker
|
||||
commands for you with some added benefits of extracting resources like logs and
|
||||
databases.
|
||||
`hi` wraps Docker container orchestration around the tests in
|
||||
[`../../integration`](../../integration) and extracts debugging artefacts
|
||||
(logs, database snapshots, MapResponse protocol captures) for post-mortem
|
||||
analysis.
|
||||
|
||||
**Read this file in full before running any `hi` command.** The test
|
||||
runner has sharp edges — wrong flags produce stale containers, lost
|
||||
artefacts, or hung CI.
|
||||
|
||||
For test-authoring patterns (scenario setup, `EventuallyWithT`,
|
||||
`IntegrationSkip`, helper variants), read
|
||||
[`../../integration/README.md`](../../integration/README.md).
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Verify system requirements (Docker, Go, disk space, images)
|
||||
go run ./cmd/hi doctor
|
||||
|
||||
# Run a single test (the default flags are tuned for development)
|
||||
go run ./cmd/hi run "TestPingAllByIP"
|
||||
|
||||
# Run a database-heavy test against PostgreSQL
|
||||
go run ./cmd/hi run "TestExpireNode" --postgres
|
||||
|
||||
# Pattern matching
|
||||
go run ./cmd/hi run "TestSubnet*"
|
||||
```
|
||||
|
||||
Run `doctor` before the first `run` in any new environment. Tests
|
||||
generate ~100 MB of logs per run in `control_logs/`; `doctor` verifies
|
||||
there is enough space and that the required Docker images are available.
|
||||
|
||||
## Commands
|
||||
|
||||
| Command | Purpose |
|
||||
| ------------------ | ---------------------------------------------------- |
|
||||
| `run [pattern]` | Execute the test(s) matching `pattern` |
|
||||
| `doctor` | Verify system requirements |
|
||||
| `clean networks` | Prune unused Docker networks |
|
||||
| `clean images` | Clean old test images |
|
||||
| `clean containers` | Kill **all** test containers (dangerous — see below) |
|
||||
| `clean cache` | Clean Go module cache volume |
|
||||
| `clean all` | Run all cleanup operations |
|
||||
|
||||
## Flags
|
||||
|
||||
Defaults are tuned for single-test development runs. Review before
|
||||
changing.
|
||||
|
||||
| Flag | Default | Purpose |
|
||||
| ------------------- | -------------- | --------------------------------------------------------------------------- |
|
||||
| `--timeout` | `120m` | Total test timeout. Use the built-in flag — never wrap with bash `timeout`. |
|
||||
| `--postgres` | `false` | Use PostgreSQL instead of SQLite |
|
||||
| `--failfast` | `true` | Stop on first test failure |
|
||||
| `--go-version` | auto | Detected from `go.mod` (currently 1.26.1) |
|
||||
| `--clean-before` | `true` | Clean stale (stopped/exited) containers before starting |
|
||||
| `--clean-after` | `true` | Clean this run's containers after completion |
|
||||
| `--keep-on-failure` | `false` | Preserve containers for manual inspection on failure |
|
||||
| `--logs-dir` | `control_logs` | Where to save run artefacts |
|
||||
| `--verbose` | `false` | Verbose output |
|
||||
| `--stats` | `false` | Collect container resource-usage stats |
|
||||
| `--hs-memory-limit` | `0` | Fail if any headscale container exceeds N MB (0 = disabled) |
|
||||
| `--ts-memory-limit` | `0` | Fail if any tailscale container exceeds N MB |
|
||||
|
||||
### Timeout guidance
|
||||
|
||||
The default `120m` is generous for a single test. If you must tune it,
|
||||
these are realistic floors by category:
|
||||
|
||||
| Test type | Minimum | Examples |
|
||||
| ------------------------- | ----------- | ------------------------------------- |
|
||||
| Basic functionality / CLI | 900s (15m) | `TestPingAllByIP`, `TestCLI*` |
|
||||
| Route / ACL | 1200s (20m) | `TestSubnet*`, `TestACL*` |
|
||||
| HA / failover | 1800s (30m) | `TestHASubnetRouter*` |
|
||||
| Long-running | 2100s (35m) | `TestNodeOnlineStatus` (~12 min body) |
|
||||
| Full suite | 45m | `go test ./integration -timeout 45m` |
|
||||
|
||||
**Never** use the shell `timeout` command around `hi`. It kills the
|
||||
process mid-cleanup and leaves stale containers:
|
||||
|
||||
```bash
|
||||
timeout 300 go run ./cmd/hi run "TestName" # WRONG — orphaned containers
|
||||
go run ./cmd/hi run "TestName" --timeout=900s # correct
|
||||
```
|
||||
|
||||
## Concurrent Execution
|
||||
|
||||
Multiple `hi run` invocations can run simultaneously on the same Docker
|
||||
daemon. Each invocation gets a unique **Run ID** (format
|
||||
`YYYYMMDD-HHMMSS-6charhash`, e.g. `20260409-104215-mdjtzx`).
|
||||
|
||||
- **Container names** include the short run ID: `ts-mdjtzx-1-74-fgdyls`
|
||||
- **Docker labels**: `hi.run-id={runID}` on every container
|
||||
- **Port allocation**: dynamic — kernel assigns free ports, no conflicts
|
||||
- **Cleanup isolation**: each run cleans only its own containers
|
||||
- **Log directories**: `control_logs/{runID}/`
|
||||
|
||||
```bash
|
||||
# Start three tests in parallel — each gets its own run ID
|
||||
go run ./cmd/hi run "TestPingAllByIP" &
|
||||
go run ./cmd/hi run "TestACLAllowUserDst" &
|
||||
go run ./cmd/hi run "TestOIDCAuthenticationPingAll" &
|
||||
```
|
||||
|
||||
### Safety rules for concurrent runs
|
||||
|
||||
- ✅ Your run cleans only containers labelled with its own `hi.run-id`
|
||||
- ✅ `--clean-before` removes only stopped/exited containers
|
||||
- ❌ **Never** run `docker rm -f $(docker ps -q --filter name=hs-)` —
|
||||
this destroys other agents' live test sessions
|
||||
- ❌ **Never** run `docker system prune -f` while any tests are running
|
||||
- ❌ **Never** run `hi clean containers` / `hi clean all` while other
|
||||
tests are running — both kill all test containers on the daemon
|
||||
|
||||
To identify your own containers:
|
||||
|
||||
```bash
|
||||
docker ps --filter "label=hi.run-id=20260409-104215-mdjtzx"
|
||||
```
|
||||
|
||||
The run ID appears at the top of the `hi run` output — copy it from
|
||||
there rather than trying to reconstruct it.
|
||||
|
||||
## Artefacts
|
||||
|
||||
Every run saves debugging artefacts under `control_logs/{runID}/`:
|
||||
|
||||
```
|
||||
control_logs/20260409-104215-mdjtzx/
|
||||
├── hs-<test>-<hash>.stderr.log # headscale server errors
|
||||
├── hs-<test>-<hash>.stdout.log # headscale server output
|
||||
├── hs-<test>-<hash>.db # database snapshot (SQLite)
|
||||
├── hs-<test>-<hash>_metrics.txt # Prometheus metrics dump
|
||||
├── hs-<test>-<hash>-mapresponses/ # MapResponse protocol captures
|
||||
├── ts-<client>-<hash>.stderr.log # tailscale client errors
|
||||
├── ts-<client>-<hash>.stdout.log # tailscale client output
|
||||
└── ts-<client>-<hash>_status.json # client network-status dump
|
||||
```
|
||||
|
||||
Artefacts persist after cleanup. Old runs accumulate fast — delete
|
||||
unwanted directories to reclaim disk.
|
||||
|
||||
## Debugging workflow
|
||||
|
||||
When a test fails, read the artefacts **in this order**:
|
||||
|
||||
1. **`hs-*.stderr.log`** — headscale server errors, panics, policy
|
||||
evaluation failures. Most issues originate server-side.
|
||||
|
||||
```bash
|
||||
grep -E "ERROR|panic|FATAL" control_logs/*/hs-*.stderr.log
|
||||
```
|
||||
|
||||
2. **`ts-*.stderr.log`** — authentication failures, connectivity issues,
|
||||
DNS resolution problems on the client side.
|
||||
|
||||
3. **MapResponse JSON** in `hs-*-mapresponses/` — protocol-level
|
||||
debugging for network map generation, peer visibility, route
|
||||
distribution, policy evaluation results.
|
||||
|
||||
```bash
|
||||
ls control_logs/*/hs-*-mapresponses/
|
||||
jq '.Peers[] | {Name, Tags, PrimaryRoutes}' \
|
||||
control_logs/*/hs-*-mapresponses/001.json
|
||||
```
|
||||
|
||||
4. **`*_status.json`** — client peer-connectivity state.
|
||||
|
||||
5. **`hs-*.db`** — SQLite snapshot for post-mortem consistency checks.
|
||||
|
||||
```bash
|
||||
sqlite3 control_logs/<runID>/hs-*.db
|
||||
sqlite> .tables
|
||||
sqlite> .schema nodes
|
||||
sqlite> SELECT id, hostname, user_id, tags FROM nodes WHERE hostname LIKE '%problematic%';
|
||||
```
|
||||
|
||||
6. **`*_metrics.txt`** — Prometheus dumps for latency, NodeStore
|
||||
operation timing, database query performance, memory usage.
|
||||
|
||||
## Heuristic: infrastructure vs code
|
||||
|
||||
**Before blaming Docker, disk, or network: read `hs-*.stderr.log` in
|
||||
full.** In practice, well over 99% of failures are code bugs (policy
|
||||
evaluation, NodeStore sync, route approval) rather than infrastructure.
|
||||
|
||||
Actual infrastructure failures have signature error messages:
|
||||
|
||||
| Signature | Cause | Fix |
|
||||
| --------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------- |
|
||||
| `failed to resolve "hs-...": no DNS fallback candidates remain` | Docker DNS | Reset Docker networking |
|
||||
| `container creation timeout`, no progress >2 min | Resource exhaustion | `docker system prune -f` (when no other tests running), retry |
|
||||
| OOM kills, slow Docker daemon | Too many concurrent tests | Reduce concurrency, wait for completion |
|
||||
| `no space left on device` | Disk full | Delete old `control_logs/` |
|
||||
|
||||
If you don't see a signature error, **assume it's a code regression** —
|
||||
do not retry hoping the flake goes away.
|
||||
|
||||
## Common failure patterns (code bugs)
|
||||
|
||||
### Route advertisement timing
|
||||
|
||||
Test asserts route state before the client has finished propagating its
|
||||
Hostinfo update. Symptom: `nodes[0].GetAvailableRoutes()` empty when
|
||||
the test expects a route.
|
||||
|
||||
- **Wrong fix**: `time.Sleep(5 * time.Second)` — fragile and slow.
|
||||
- **Right fix**: wrap the assertion in `EventuallyWithT`. See
|
||||
[`../../integration/README.md`](../../integration/README.md).
|
||||
|
||||
### NodeStore sync issues
|
||||
|
||||
Route changes not reflected in the NodeStore snapshot. Symptom: route
|
||||
advertisements in logs but no tracking updates in subsequent reads.
|
||||
|
||||
The sync point is `State.UpdateNodeFromMapRequest()` in
|
||||
`hscontrol/state/state.go`. If you added a new kind of client state
|
||||
update, make sure it lands here.
|
||||
|
||||
### HA failover: routes disappearing on disconnect
|
||||
|
||||
`TestHASubnetRouterFailover` fails because approved routes vanish when
|
||||
a subnet router goes offline. **This is a bug, not expected behaviour.**
|
||||
Route approval must not be coupled to client connectivity — routes
|
||||
stay approved; only the primary-route selection is affected by
|
||||
connectivity.
|
||||
|
||||
### Policy evaluation race
|
||||
|
||||
Symptom: tests that change policy and immediately assert peer visibility
|
||||
fail intermittently. Policy changes trigger async recomputation.
|
||||
|
||||
- See recent fixes in `git log -- hscontrol/state/` for examples (e.g.
|
||||
the `PolicyChange` trigger on every Connect/Disconnect).
|
||||
|
||||
### SQLite vs PostgreSQL timing differences
|
||||
|
||||
Some race conditions only surface on one backend. If a test is flaky,
|
||||
try the other backend with `--postgres`:
|
||||
|
||||
```bash
|
||||
go run ./cmd/hi run "TestName" --postgres --verbose
|
||||
```
|
||||
|
||||
PostgreSQL generally has more consistent timing; SQLite can expose
|
||||
races during rapid writes.
|
||||
|
||||
## Keeping containers for inspection
|
||||
|
||||
If you need to inspect a failed test's state manually:
|
||||
|
||||
```bash
|
||||
go run ./cmd/hi run "TestName" --keep-on-failure
|
||||
# containers survive — inspect them
|
||||
docker exec -it ts-<runID>-<...> /bin/sh
|
||||
docker logs hs-<runID>-<...>
|
||||
# clean up manually when done
|
||||
go run ./cmd/hi clean all # only when no other tests are running
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user