3958 Commits

Author SHA1 Message Date
Kristoffer Dalby
bc9877ce28 policy/v2: use bare IPs in autogroup:self DstPorts
Use ip.String() instead of netip.PrefixFrom(ip, ip.BitLen()).String()
when building DstPorts for autogroup:self destinations. This produces
bare IPs like "100.90.199.68" instead of CIDR notation like
"100.90.199.68/32", matching the Tailscale FilterRule wire format.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
e3ab288351 policy/v2: remove resolved grant skip categories
Remove 91 entries from grantSkipReasons that are now passing:
- 90 MISSING_IPV6_ADDRS entries (identity aliases now include IPv6)
- 1 RAW_IPV6_ADDR_EXPANSION entry (address aliases no longer expand)

Move GRANT-P09_12B from the removed MISSING_IPV6_ADDRS category to
SUBNET_ROUTE_FILTER_RULES, which is its remaining failure mode.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
1a5ed2c7ca policy/policyutil: update ReduceFilterRules test expectations for IPv6
Now that AppendToIPSet includes both IPv4 and IPv6, tests with
nodes that have IPv6 addresses produce additional entries in SrcIPs
and DstPorts. Update the expected values accordingly.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
ccade49742 policy: include IPv6 in identity-based alias resolution
AppendToIPSet now adds both IPv4 and IPv6 addresses for nodes,
matching Tailscale's FilterRule wire format where identity-based
aliases (tags, users, groups, autogroups) resolve to both address
families.

Address-based aliases (raw IPs, host names) are unchanged: they
resolve to exactly the literal prefix. The appendIfNodeHasIP helper
that incorrectly expanded address aliases to include the matching
node's other IPs is removed, fixing the RAW_IPV6_ADDR_EXPANSION
bug where a raw fd7a: IPv6 address would incorrectly include the
node's IPv4.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
91aac1ceb2 hscontrol/policy/v2: replace routes golden data with Tailscale SaaS captures
Replace the headscale-adapted routes golden files with authoritative
captures from Tailscale SaaS using the 12-node topology (8 original
grant nodes + 4 new route-specific nodes: ha-router1, ha-router2,
big-router, multi-router).

The golden data was captured via debug-packet-filter-rules from all
12 nodes. The routes driver now falls back to the standard 3-user
setup when topology.users is absent (matching the SaaS capture
format) and converts @passkey/@dalby.cc emails to @example.com.

92 test cases captured, all valid JSON, all from Tailscale SaaS.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
162e1dc35b hscontrol/policy/v2: replace ACL golden data with Tailscale SaaS captures
Replace the headscale-adapted ACL golden files with authoritative
captures from Tailscale SaaS using the 8-node grant topology.

The golden data was captured via debug-packet-filter-rules (FilterRule
wire format) from each of the 8 nodes after pushing each ACL policy
to the Tailscale API. This gives us the exact format Tailscale sends
to clients:

- SrcIPs use IP ranges (100.64.0.0-100.115.91.255) not CIDRs
- SrcIPs include subnet routes (10.33.0.0/16) for wildcard sources
- IPProto is omitted for default all-protocol rules
- DstPorts use bare IPs without /32 suffix
- Identity aliases include both IPv4 and IPv6 addresses

The test driver is updated to use the 8-node topology (3 users,
5 tagged nodes) matching the grant compat tests, with the same
email conversion (kratail2tid@passkey -> @example.com).

215 test cases: 199 success + 16 error (captured from API 400s).
All captured from Tailscale SaaS, no headscale-adapted values.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
d83697186a hscontrol/policy/v2: convert routes compat tests to JSON-driven format
Replace 8,286 lines of inline Go struct test expectations in
tailscale_routes_compat_test.go with 92 JSON golden files in
testdata/routes_results/ROUTES-*.json and a ~300-line Go driver in
tailscale_routes_data_compat_test.go.

Unlike the ACL and grants compat tests which use shared hardcoded node
topologies, the routes driver builds nodes from JSON topology data.
Each test file embeds its full topology including routable_ips and
approved_routes, making test files self-contained. This naturally
handles the IPv6 tests which use a different 4-node topology from the
standard 9-node setup.

Test count is preserved: 92 test cases across 19 original test
functions (SubnetBasics, ExitNodes, HARouters, FilterPlacement,
RouteCoverage, Overlapping, TagResolution, ProtocolPort, IPv6,
EdgeCases, AutoApprover, and additional variants).

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
7e71d1b58f hscontrol/policy/v2: convert ACL compat tests to JSON-driven format
Replace 9,937 lines of inline Go struct test expectations in
tailscale_acl_compat_test.go with 215 JSON golden files in
testdata/acl_results/ACL-*.json and a ~400-line Go driver in
tailscale_acl_data_compat_test.go.

This matches the pattern used by the grants compat tests
(testdata/grant_results/GRANT-*.json + tailscale_grants_compat_test.go)
and the SSH compat tests (testdata/ssh_results/SSH-*.json +
tailscale_ssh_data_compat_test.go).

The JSON golden files contain the same test expectations as the
original Go file, preserving the Tailscale SaaS reference data.
The expectations are NOT adapted to match headscale's current output;
they represent the target behavior.

Test count is preserved: 215 test cases (203 success + 12 error).

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
0562bd85f4 hscontrol/policy/v2: fix test helpers to match production pipeline
- TestTagUserMutualExclusivity and TestUserToTagCrossIdentityGrant:
  add ReduceFilterRules after compileFilterRulesForNode to match the
  production filter pipeline in filterForNodeLocked. The compilation
  step produces global rules for all ACLs; ReduceFilterRules strips
  them down to only rules where the node is a destination.

- containsSrcIP/containsIP helpers: use util.ParseIPSet to handle
  IP range strings like "100.64.0.1-100.64.0.3" produced by
  ipSetToStrings when contiguous IPs are coalesced.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
5830eabf09 hscontrol/policy: fix test assertions and expectations
Fix several test issues exposed by the ResolvedAddresses refactor:

- TestTagUserMutualExclusivity: remove incorrect ACL rule that was
  testing the wrong invariant. The test now correctly validates that
  without an explicit cross-identity grant, user-owned nodes cannot
  reach tagged nodes. Add TestUserToTagCrossIdentityGrant to verify
  that explicit user@ -> tag:X ACL rules produce valid filter rules.

- TestResolvePolicy/wildcard-alias: update expected prefixes to match
  the CGNAT range minus ChromeOS VM range (multiple prefixes instead
  of the encompassing 100.64.0.0/10).

- TestApproveRoutesWithPolicy: fix user Name fields from "testuser@"
  to "testuser" to match how resolveUser trims the @ suffix before
  comparing against stored names.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
5f3bddc663 hscontrol/policy/v2: fix nil dereferences in alias resolution
Fix three nil dereference issues in the policy resolution code:

- newResolvedAddresses: preserve partial IP results when errors occur
  instead of discarding valid IPSets. Callers already handle errors
  and nil results independently, so returning both allows partial
  resolution (e.g. groups with phantom users) to work correctly.

- resolveTagOwners: guard against nil ResolvedAddresses before calling
  Prefixes(), since Resolve may return nil when resolution fails.

- Asterix.resolve: guard against nil *Policy pointer, which occurs
  when resolving wildcards without a policy context (e.g. in tests).

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
0c6ac28b04 hscontrol/policy/v2: recategorize grants skip list from SRCIPS_FORMAT into granular root causes
Replace the monolithic SRCIPS_FORMAT skip category (125 tests) with 7
specific subcategories based on analysis of actual test failures:

  MISSING_IPV6_ADDRS          - 90 tests: identity aliases resolve to IPv4 only
  SUBNET_ROUTE_FILTER_RULES   - 10 tests: no rules for subnet-routed CIDRs
  AUTOGROUP_SELF_CIDR_FORMAT  -  4 tests: /32 and /128 suffix on DstPorts IPs
  USER_PASSKEY_WILDCARD       -  2 tests: user:*@passkey unresolvable
  RAW_IPV6_ADDR_EXPANSION     -  2 tests: raw IPv6 expanded to include IPv4
  SRCIPS_WILDCARD_NODE_DEDUP  -  1 test:  wildcard+specific node IP dedup

Also reclassify tests that moved between categories after the CGNAT
split range fix (4 tests now passing, others recategorized into
CAPGRANT_COMPILATION, ERROR_VALIDATION_GAP, VIA_COMPILATION, etc).

Total: 207 skipped, 30 passing (was 193 skipped, 19 passing).
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
6f32dcf6f9 maybe only return ipv4? not always?
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
f01052c85f speculative new datastruct, fix ip range return
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
646a6e8266 hscontrol/policy/v2: add skip entries for 25 v2 gap-filling grant tests
Update the grants compatibility test skip list with 23 new entries for
the V-series tests (V07 and V24 pass without skipping).

New skip categories introduced:
- VIA_COMPILATION (3): via routes with specific src identities where
  there is no SrcIPs format issue (V11, V12, V13)
- Additional VIA_COMPILATION_AND_SRCIPS_FORMAT (3): via with wildcard
  src (V17, V21, V23)
- Additional CAPGRANT_COMPILATION (6): app grants on specific tags,
  drive cap, autogroup:self app (V02, V03, V06, V19, V20, V25)
- Additional CAPGRANT_COMPILATION_AND_SRCIPS_FORMAT (2): mixed ip+app
  on specific tags rejected by headscale (V09, V10)
- Additional ERROR_VALIDATION_GAP (9): autogroup:internet + app,
  raw 0.0.0.0/0 and ::/0 as grant dst (V01, V04, V05, V08, V14-V16,
  V18, V22)

Test totals: 237 total, 21 pass, 216 skip, 0 fail.

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
aa68fbafc0 hscontrol/policy/v2: add 25 v2 gap-filling grant testdata files
Add GRANT-V01 through GRANT-V25 JSON files captured from Tailscale SaaS
to fill coverage gaps in the grants compatibility test suite.

These tests cover:
- App grants on specific tags (not just wildcards)
- Mixed ip+app grants on specific tags
- Via routes with specific src identities (tags, groups, members)
- Via with multiple dst subnets and multiple via tags
- Drive cap with reverse drive-sharer generation
- autogroup:self with app grants
- autogroup:internet rejection with app grants
- Raw default route CIDR (0.0.0.0/0, ::/0) rejection as grant dst

Updates #2180
2026-03-25 15:17:23 +00:00
Kristoffer Dalby
2446158191 hscontrol/policy/v2: add data-driven grants compatibility test
Add TestGrantsCompat, a data-driven test that validates headscale's
grants implementation against 212 test cases captured from Tailscale
SaaS. Each test case loads a GRANT-*.json file from testdata/, applies
the policy through headscale's engine, and compares the resulting
packet filter rules against Tailscale's actual output.

Currently 19 tests pass and 193 are skipped with documented reasons:
- SRCIPS_FORMAT (125): IP range formatting differences
- CAPGRANT_COMPILATION (41): app capability grants not yet compiled
- ERROR_VALIDATION_GAP (14): validation strictness differences
- CAPGRANT_AND_SRCIPS_FORMAT (9): combined ip+app grant issues
- VIA_AND_SRCIPS_FORMAT (4): via route compilation not implemented
- AUTOGROUP_DANGER_ALL (3): autogroup:danger-all not supported
- VALIDATION_STRICTNESS (2): empty src/dst array handling

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
f1756f4d12 hscontrol/policy/v2: add grants compatibility testdata (212 JSON files)
Add 212 GRANT-*.json test files captured from Tailscale SaaS to
testdata/grant_results/. Each file contains a policy with grants,
the expected packet_filter_rules for 8 test nodes, and the topology
used during capture.

These files serve as the ground truth for the data-driven grants
compatibility test.

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
ca2081a44f hscontrol/policy/v2: rename tailscale_compat_test.go to tailscale_acl_compat_test.go
Rename the ACL compatibility test file to include 'acl' in the name,
making room for the upcoming grants compatibility test file.

Also fix a godoclint issue by adding a blank line between the file
header comment and the package declaration.

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
90c9555876 hscontrol/policy/v2: add ProtocolPort.MarshalJSON for Grant serialization
Implement ProtocolPort.MarshalJSON to produce string format matching
UnmarshalJSON expectations (e.g. "tcp:443", "udp:10000-20000", "*").

Add comprehensive TestGrantMarshalJSON with 10 test cases:
- IP-based grants with TCP, UDP, ICMP, and wildcard protocols
- Single ports, port ranges, and wildcard ports
- Capability-based grants using app field
- Grants with both ip and app fields
- Grants with via field for route filtering
- Testing omitempty behavior for ip, app, and via fields
- JSON round-trip validation (marshal → unmarshal → compare)

Add omitempty tag to Grant.InternetProtocols to avoid marshaling
null when field is empty.

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
1c31f04fab hscontrol/policy/v2: add TestACLToGrants
Add test for aclToGrants() function that converts ACL rules to Grant
format. Tests conversion of:
- Single-port TCP rules
- Multiple ACL entries to multiple Grants
- Port ranges and multiple ports in a single rule
- Wildcard protocols
- UDP, ICMP, and other protocol types

Ensures backward compatibility by verifying that ACL rules are correctly
transformed to the new Grant format.

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
31c0ecbd68 hscontrol/policy/v2: add TestUnmarshalGrants
Add comprehensive tests for Grant unmarshaling covering:
- Valid grants with ip field (network access)
- Valid grants with app field (capabilities)
- Wildcard port handling
- Port range parsing
- Error cases (missing fields, conflicting fields)

Updates #2180
2026-03-25 15:17:22 +00:00
Kristoffer Dalby
3ffdb4280a hscontrol/policy/v2: add Grant policy format support
Add support for the Grant policy format as an alternative to ACL format,
following Tailscale's policy v2 specification. Grants provide a more
structured way to define network access rules with explicit separation
of IP-based and capability-based permissions.

Key changes:

- Add Grant struct with Sources, Destinations, InternetProtocols (ip),
  and App (capabilities) fields
- Add ProtocolPort type for unmarshaling protocol:port strings
- Add Grant validation in Policy.validate() to enforce:
  - Mutual exclusivity of ip and app fields
  - Required ip or app field presence
  - Non-empty sources and destinations
- Refactor compileFilterRules to support both ACLs and Grants
- Convert ACLs to Grants internally via aclToGrants() for unified
  processing
- Extract destinationsToNetPortRange() helper for cleaner code
- Rename parseProtocol() to toIANAProtocolNumbers() for clarity
- Add ProtocolNumberToName mapping for reverse lookups

The Grant format allows policies to be written using either the legacy
ACL format or the new Grant format. ACLs are converted to Grants
internally, ensuring backward compatibility while enabling the new
format's benefits.

Updates #2180
2026-03-25 15:17:22 +00:00
Florian Preinstorfer
efd83da14e Explicitly mention that a headscale username should *not* end with @
See: #3149
2026-03-20 19:44:33 +01:00
Tanayk07
568baf3d02 fix: align banner right-side border to consistent 64-char width
2026-03-19 07:08:35 +01:00
Tanayk07
5105033224 feat: add prominent warning banner for non-standard IP prefixes
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.

The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.

Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.

Closes #3055
2026-03-19 07:08:35 +01:00
Kristoffer Dalby
3d53f97c82 hscontrol/servertest: fix test expectations for eventual consistency
Three corrections to issue tests that had wrong assumptions about
when data becomes available:

1. initial_map_should_include_peer_online_status: use WaitForCondition
   instead of checking the initial netmap. Online status is set by
   Connect() which sends a PeerChange patch after the initial
   RegisterResponse, so it may not be present immediately.

2. disco_key_should_propagate_to_peers: use WaitForCondition. The
   DiscoKey is sent in the first MapRequest (not RegisterRequest),
   so peers may not see it until a subsequent map update.

3. approved_route_without_announcement: invert the test expectation.
   Tailscale uses a strict advertise-then-approve model -- routes are
   only distributed when the node advertises them (Hostinfo.RoutableIPs)
   AND they are approved. An approval without advertisement is a dormant
   pre-approval. The test now asserts the route does NOT appear in
   AllowedIPs, matching upstream Tailscale semantics.

Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
1053fbb16b hscontrol/state: fix online status reset during re-registration
Two fixes to how online status is handled during registration:

1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
   longer resets IsOnline to false. Online status is managed exclusively
   by Connect()/Disconnect() in the poll session lifecycle. The reset
   caused a false offline blip: the auth handler's change notification
   triggered a map regeneration showing the node as offline to peers,
   even though Connect() would set it back to true moments later.

2. New node creation (createAndSaveNewNode) now explicitly sets
   IsOnline=false instead of leaving it nil. This ensures peers always
   receive a known online status rather than an ambiguous nil/unknown.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
b09af3846b hscontrol/poll,state: fix grace period disconnect TOCTOU race
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.

Fix with a monotonic per-node generation counter:

- State.Connect() increments the counter and returns the current
  generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
  rejects the call if a newer generation exists, making stale
  disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
  it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
  owns the batcher slot (reconnect happened), the old session skips
  the grace period entirely.

Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.

Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
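The generation-counter fix in b09af3846b can be illustrated with a small standalone sketch. The `tracker` type and its fields are hypothetical stand-ins, not headscale's `State`; only the idea (Connect hands out a generation, Disconnect ignores stale ones) is taken from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical stand-in for per-node online state.
type tracker struct {
	mu     sync.Mutex
	gen    map[uint64]uint64 // node ID -> latest connect generation
	online map[uint64]bool
}

func newTracker() *tracker {
	return &tracker{gen: map[uint64]uint64{}, online: map[uint64]bool{}}
}

// Connect marks the node online and returns the new session's generation.
func (t *tracker) Connect(id uint64) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.gen[id]++
	t.online[id] = true
	return t.gen[id]
}

// Disconnect marks the node offline only if no newer generation exists.
// It reports whether the disconnect was applied.
func (t *tracker) Disconnect(id, gen uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.gen[id] > gen {
		return false // a newer session connected; the stale call is a no-op
	}
	t.online[id] = false
	return true
}

// IsOnline reports the node's current online flag.
func (t *tracker) IsOnline(id uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.online[id]
}

func main() {
	t := newTracker()
	old := t.Connect(1) // first session
	t.Connect(1)        // node reconnects within the grace period
	applied := t.Disconnect(1, old)
	fmt.Println(applied, t.IsOnline(1)) // false true
}
```

Because the check and the write happen under the same lock, the old session's deferred cleanup cannot overwrite the new session's status in this sketch.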
Kristoffer Dalby
00c41b6422 hscontrol/servertest: add race, stress, and poll race tests
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:

- race_test.go: 14 tests exercising concurrent mutations, session
  replacement, batcher contention, NodeStore access, and map response
  delivery during disconnect. All pass the Go race detector.

- poll_race_test.go: 8 tests targeting the poll.go grace period
  interleaving. These confirm a logical TOCTOU race: when a node
  disconnects and reconnects within the grace period, the old
  session's deferred Disconnect() can overwrite the new session's
  Connect(), leaving IsOnline=false despite an active poll session.

- stress_test.go: sustained churn, rapid mutations, rolling
  replacement, data integrity checks under load, and verification
  that rapid reconnects do not leak false-offline notifications.

Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ab4e205ce7 hscontrol/servertest: expand issue tests to 24 scenarios, surface 4 issues
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.

Issues surfaced (4 failing tests):

1. initial_map_should_include_peer_online_status: Initial MapResponse
   has Online=nil for peers. Online status only arrives later via
   PeersChangedPatch.

2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
   is not visible to peers. Peers see zero disco key.

3. approved_route_without_announcement_is_visible: Server-side route
   approval without client-side announcement silently produces empty
   SubnetRoutes (intersection of empty announced + approved = empty).

4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
   cycles, NodeStore reports node as offline despite having an active
   poll session. The connect/disconnect grace period interleaving
   leaves IsOnline in an incorrect state.

Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
f87b08676d hscontrol/servertest: add policy, route, ephemeral, and content tests
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses

New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
  update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
  updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
  mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
  advertised routes via hostinfo, CGNAT range validation

Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ca7362e9aa hscontrol/servertest: add control plane lifecycle and consistency tests
Add three test files exercising the servertest harness:

- lifecycle_test.go: connection, disconnection, reconnection, session
  replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
  address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
  with various delays, concurrent reconnects, and scale tests.

All tests use table-driven patterns with subtests.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
0288614bdf hscontrol: add servertest harness for in-process control plane testing
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.

The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency

Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.

This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
82c7efccf8 mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-19 07:05:58 +01:00
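The bundling approach in 82c7efccf8 can be sketched roughly as follows. The types `nodeState` and `workItem` and the function `process` are hypothetical; the real batcher has more moving parts. The sketch only shows the two ingredients named in the message: one work item per node per tick, and a per-node mutex held while that item is processed.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeState is a hypothetical per-node record; workMu serializes delivery.
type nodeState struct {
	workMu sync.Mutex
	sent   []string
}

// workItem bundles every pending change for one node from one batch tick.
type workItem struct {
	node    *nodeState
	changes []string
}

// process delivers a bundle in order while holding the node's workMu, so a
// worker handling a later tick for the same node waits for this one to finish.
func process(w workItem) {
	w.node.workMu.Lock()
	defer w.node.workMu.Unlock()
	for _, c := range w.changes {
		w.node.sent = append(w.node.sent, c)
	}
}

func main() {
	n := &nodeState{}
	queue := make(chan workItem)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // several workers pull from the same channel
		wg.Add(1)
		go func() {
			defer wg.Done()
			for w := range queue {
				process(w)
			}
		}()
	}

	// All three changes travel as one item, so one worker handles them in order.
	queue <- workItem{node: n, changes: []string{"a", "b", "c"}}
	close(queue)
	wg.Wait()

	fmt.Println(n.sent) // [a b c]
}
```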
Kristoffer Dalby
81b871c9b5 integration/acl: replace custom entrypoints with WithPackages
Replace inline WithDockerEntrypoint shell scripts in
TestACLTagPropagation and TestACLTagPropagationPortSpecific with
the standard WithPackages and WithWebserver options.

The custom entrypoints used fragile fixed sleeps and lacked the
robust network/cert readiness waits that buildEntrypoint provides.

Updates #3139
2026-03-16 03:57:05 -07:00
Kristoffer Dalby
e5ebe3205a integration: standardize test infrastructure options
Make embedded DERP server and TLS the default configuration for all
integration tests, replacing the per-test opt-in model that led to
inconsistent and flaky test behavior.

Infrastructure changes:
- DefaultConfigEnv() includes embedded DERP server settings
- New() auto-generates a proper CA + server TLS certificate pair
- CA cert is installed into container trust stores and returned by
  GetCert() so clients and internal tools (curl) trust the server
- CreateCertificate() now returns (caCert, cert, key) instead of
  discarding the CA certificate
- Add WithPublicDERP() and WithoutTLS() opt-out options
- Remove WithTLS(), WithEmbeddedDERPServerOnly(), and WithDERPAsIP()
  since all their behavior is now the default or unnecessary

Test cleanup:
- Remove all redundant WithTLS/WithEmbeddedDERPServerOnly/WithDERPAsIP
  calls from test files
- Give every test a unique WithTestName by parameterizing aclScenario,
  sshScenario, and derpServerScenario helpers
- Add WithTestName to tests that were missing it
- Document all non-standard options with inline comments explaining
  why each is needed

Updates #3139
2026-03-16 03:57:05 -07:00
Kristoffer Dalby
87b8507ac9 mapper/batcher: replace connected map with per-node disconnectedAt
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:

  - nil value:    node is connected
  - non-nil time: node disconnected at that timestamp
  - key missing:  node was never seen

This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Now()) was used instead of &now, producing a zero time.

Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.

Changes:
  - Add disconnectedAt field and helpers (markConnected, markDisconnected,
    isConnected, offlineDuration) to multiChannelNodeConn
  - Remove the connected field from Batcher
  - Simplify IsConnected from two map lookups to one
  - Simplify ConnectedMap and Debug from two-map iteration to one
  - Rewrite cleanupOfflineNodes to scan b.nodes directly
  - Remove the markDisconnectedIfNoConns helper
  - Update all tests and benchmarks

Fixes #3141
2026-03-16 02:22:56 -07:00
Kristoffer Dalby
60317064fd mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-16 02:22:46 -07:00
Juan Font
4d427cfe2a noise: limit request body size to prevent unauthenticated OOM
The Noise handshake accepts any machine key without checking
registration, so all endpoints behind the Noise router are reachable
without credentials. Three handlers used io.ReadAll without size
limits, allowing an attacker to OOM-kill the server.

Fix:
- Add http.MaxBytesReader middleware (1 MiB) on the Noise router.
- Replace io.ReadAll + json.Unmarshal with json.NewDecoder in
  PollNetMapHandler and RegistrationHandler.
- Stop reading the body in NotImplementedHandler entirely.
2026-03-16 09:28:31 +01:00
Kristoffer Dalby
afd3a6acbc mapper/batcher: remove disabled X-prefixed test functions
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and were
dead code. The functionality they covered is exercised by the
active test suite.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
feaf85bfbc mapper/batcher: clean up test constants and output
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.

L10: Fix misleading assert message that said "1337" while checking
for region ID 999.

L12: Remove emoji from test log output to avoid encoding issues
in CI environments.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
86e279869e mapper/batcher: minor production code cleanup
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.

L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.

L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.

L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.
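
As an illustration of the L1 change, a process-local ID generator built on an atomic counter might look like this. The nextConnectionID name and the "conn-" prefix are assumptions made for the sketch; the actual ID format is not stated here.

```go
package main

import (
	"strconv"
	"sync/atomic"
)

// connIDCounter is a process-wide monotonic counter (hypothetical name).
var connIDCounter atomic.Uint64

// nextConnectionID returns an identifier that is unique within this
// process. It needs no entropy source, only one atomic add per call.
func nextConnectionID() string {
	return "conn-" + strconv.FormatUint(connIDCounter.Add(1), 10)
}
```

Such IDs are unique only within one process lifetime, which is sufficient for identifiers that never leave the process.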

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
7881f65358 mapper: extract node connection types to node_conn.go
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.

Pure move — no logic changes.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2d549e579f mapper/batcher: add regression tests for M1, M3, M7 fixes
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
  Start() returns promptly now that done is initialized in NewBatcher.

- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
  returns via the done channel after Close(), even without Start().

- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
  returns false after AddNode fails and removes the last connection.

- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
  array slot is nil-ed after removal to avoid retaining pointers.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
50e8b21471 mapper/batcher: fix pointer retention, done-channel init, and connected-map races
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.

M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.

M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.

M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.
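
The M7 pattern can be shown in isolation. This is a sketch with a hypothetical conn type and removeAt helper, not the code of removeConnectionAtIndexLocked itself:

```go
package main

// conn stands in for *connectionEntry (hypothetical, for illustration).
type conn struct{ id int }

// removeAt deletes conns[i], preserving order, and nils the vacated
// trailing slot so the backing array no longer references the removed
// entry and the GC can collect it.
func removeAt(conns []*conn, i int) []*conn {
	copy(conns[i:], conns[i+1:])
	last := len(conns) - 1
	conns[last] = nil // without this, the old pointer stays reachable
	return conns[:last]
}
```

The returned slice shares the backing array with the input, so the nil assignment is what actually releases the removed element.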

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
8e26651f2c mapper/batcher: add regression tests for timer leak and Close lifecycle
Add four unit tests guarding fixes introduced in recent commits:

- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
  time.NewTimer fix (H1) does not leak goroutines after many
  fast-path sends on a buffered channel.

- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
  worker goroutines exit (H3), preventing sends on torn-down channels.

- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
  contract; Start() after Close() must not spawn new goroutines.

- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
  ticker to prevent resource leaks.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57a38b5678 mapper/batcher: reduce hot-path log verbosity
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.

These methods run on every MapResponse delivery, so the savings
compound quickly under load.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
051a38a4c4 mapper/batcher: track worker goroutines and stop ticker on Close
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.

Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.

Document that a Batcher must not be reused after Close().
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3276bda0c0 mapper/batcher: replace time.After with NewTimer to avoid timer leak
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick adds 1000
short-lived timers to the heap, creating continuous GC pressure.

Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
2026-03-14 02:52:28 -07:00