Commit Graph

529 Commits

Author SHA1 Message Date
Tanayk07
568baf3d02 fix: align banner right-side border to consistent 64-char width 2026-03-19 07:08:35 +01:00
Tanayk07
5105033224 feat: add prominent warning banner for non-standard IP prefixes
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.

The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.

Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.

Closes #3055
2026-03-19 07:08:35 +01:00
Kristoffer Dalby
3d53f97c82 hscontrol/servertest: fix test expectations for eventual consistency
Three corrections to issue tests that had wrong assumptions about
when data becomes available:

1. initial_map_should_include_peer_online_status: use WaitForCondition
   instead of checking the initial netmap. Online status is set by
   Connect() which sends a PeerChange patch after the initial
   RegisterResponse, so it may not be present immediately.

2. disco_key_should_propagate_to_peers: use WaitForCondition. The
   DiscoKey is sent in the first MapRequest (not RegisterRequest),
   so peers may not see it until a subsequent map update.

3. approved_route_without_announcement: invert the test expectation.
   Tailscale uses a strict advertise-then-approve model -- routes are
   only distributed when the node advertises them (Hostinfo.RoutableIPs)
   AND they are approved. An approval without advertisement is a dormant
   pre-approval. The test now asserts the route does NOT appear in
   AllowedIPs, matching upstream Tailscale semantics.

Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
1053fbb16b hscontrol/state: fix online status reset during re-registration
Two fixes to how online status is handled during registration:

1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
   longer resets IsOnline to false. Online status is managed exclusively
   by Connect()/Disconnect() in the poll session lifecycle. The reset
   caused a false offline blip: the auth handler's change notification
   triggered a map regeneration showing the node as offline to peers,
   even though Connect() would set it back to true moments later.

2. New node creation (createAndSaveNewNode) now explicitly sets
   IsOnline=false instead of leaving it nil. This ensures peers always
   receive a known online status rather than an ambiguous nil/unknown.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
b09af3846b hscontrol/poll,state: fix grace period disconnect TOCTOU race
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.

Fix with a monotonic per-node generation counter:

- State.Connect() increments the counter and returns the current
  generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
  rejects the call if a newer generation exists, making stale
  disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
  it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
  owns the batcher slot (reconnect happened), the old session skips
  the grace period entirely.

Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.

Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
00c41b6422 hscontrol/servertest: add race, stress, and poll race tests
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:

- race_test.go: 14 tests exercising concurrent mutations, session
  replacement, batcher contention, NodeStore access, and map response
  delivery during disconnect. All pass the Go race detector.

- poll_race_test.go: 8 tests targeting the poll.go grace period
  interleaving. These confirm a logical TOCTOU race: when a node
  disconnects and reconnects within the grace period, the old
  session's deferred Disconnect() can overwrite the new session's
  Connect(), leaving IsOnline=false despite an active poll session.

- stress_test.go: sustained churn, rapid mutations, rolling
  replacement, data integrity checks under load, and verification
  that rapid reconnects do not leak false-offline notifications.

Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ab4e205ce7 hscontrol/servertest: expand issue tests to 24 scenarios, surface 4 issues
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.

Issues surfaced (4 failing tests):

1. initial_map_should_include_peer_online_status: Initial MapResponse
   has Online=nil for peers. Online status only arrives later via
   PeersChangedPatch.

2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
   is not visible to peers. Peers see zero disco key.

3. approved_route_without_announcement_is_visible: Server-side route
   approval without client-side announcement silently produces empty
   SubnetRoutes (intersection of empty announced + approved = empty).

4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
   cycles, NodeStore reports node as offline despite having an active
   poll session. The connect/disconnect grace period interleaving
   leaves IsOnline in an incorrect state.

Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
f87b08676d hscontrol/servertest: add policy, route, ephemeral, and content tests
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses

New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
  update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
  updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
  mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
  advertised routes via hostinfo, CGNAT range validation

Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ca7362e9aa hscontrol/servertest: add control plane lifecycle and consistency tests
Add three test files exercising the servertest harness:

- lifecycle_test.go: connection, disconnection, reconnection, session
  replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
  address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
  with various delays, concurrent reconnects, and scale tests.

All tests use table-driven patterns with subtests.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
0288614bdf hscontrol: add servertest harness for in-process control plane testing
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.

The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency

Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.

This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
82c7efccf8 mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
87b8507ac9 mapper/batcher: replace connected map with per-node disconnectedAt
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:

  - nil value:    node is connected
  - non-nil time: node disconnected at that timestamp
  - key missing:  node was never seen

This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Now()) was used instead of &now, producing a zero time.

Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.

Changes:
  - Add disconnectedAt field and helpers (markConnected, markDisconnected,
    isConnected, offlineDuration) to multiChannelNodeConn
  - Remove the connected field from Batcher
  - Simplify IsConnected from two map lookups to one
  - Simplify ConnectedMap and Debug from two-map iteration to one
  - Rewrite cleanupOfflineNodes to scan b.nodes directly
  - Remove the markDisconnectedIfNoConns helper
  - Update all tests and benchmarks

Fixes #3141
2026-03-16 02:22:56 -07:00
Kristoffer Dalby
60317064fd mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-16 02:22:46 -07:00
Juan Font
4d427cfe2a noise: limit request body size to prevent unauthenticated OOM
The Noise handshake accepts any machine key without checking
registration, so all endpoints behind the Noise router are reachable
without credentials. Three handlers used io.ReadAll without size
limits, allowing an attacker to OOM-kill the server.

Fix:
- Add http.MaxBytesReader middleware (1 MiB) on the Noise router.
- Replace io.ReadAll + json.Unmarshal with json.NewDecoder in
  PollNetMapHandler and RegistrationHandler.
- Stop reading the body in NotImplementedHandler entirely.
2026-03-16 09:28:31 +01:00
Kristoffer Dalby
afd3a6acbc mapper/batcher: remove disabled X-prefixed test functions
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
feaf85bfbc mapper/batcher: clean up test constants and output
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.

L10: Fix misleading assert message that said "1337" while checking
for region ID 999.

L12: Remove emoji from test log output to avoid encoding issues
in CI environments.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
86e279869e mapper/batcher: minor production code cleanup
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.

L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.

L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.

L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
7881f65358 mapper: extract node connection types to node_conn.go
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.

Pure move — no logic changes.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2d549e579f mapper/batcher: add regression tests for M1, M3, M7 fixes
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
  Start() returns promptly now that done is initialized in NewBatcher.

- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
  returns via the done channel after Close(), even without Start().

- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
  returns false after AddNode fails and removes the last connection.

- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
  array slot is nil-ed after removal to avoid retaining pointers.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
50e8b21471 mapper/batcher: fix pointer retention, done-channel init, and connected-map races
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.

M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.

M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.

M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
8e26651f2c mapper/batcher: add regression tests for timer leak and Close lifecycle
Add four unit tests guarding fixes introduced in recent commits:

- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
  time.NewTimer fix (H1) does not leak goroutines after many
  fast-path sends on a buffered channel.

- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
  worker goroutines exit (H3), preventing sends on torn-down channels.

- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
  contract; Start() after Close() must not spawn new goroutines.

- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
  ticker to prevent resource leaks.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57a38b5678 mapper/batcher: reduce hot-path log verbosity
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.

These methods run on every MapResponse delivery, so the savings
compound quickly under load.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
051a38a4c4 mapper/batcher: track worker goroutines and stop ticker on Close
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.

Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.

Document that a Batcher must not be reused after Close().
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3276bda0c0 mapper/batcher: replace time.After with NewTimer to avoid timer leak
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.

Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2058343ad6 mapper: remove Batcher interface, rename to Batcher struct
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.

Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
9b24a39943 mapper/batcher: add scale benchmarks
Add benchmarks that systematically test node counts from 100 to
50,000 to identify scaling limits and validate performance under
load.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3ebe4d99c1 mapper/batcher: reduce lock contention with two-phase send
Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity

Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
da33795e79 mapper/batcher: fix race conditions in cleanup and lookups
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.

Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57070680a5 mapper/batcher: restructure internals for correctness
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.

Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.

Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.

Add comprehensive concurrency tests validating these changes.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
21e02e5d1f mapper/batcher: add unit tests and benchmarks
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.

Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.

Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3e0a96ec3a all: fix test flakiness and improve test infrastructure
Buffer the AuthRequest verdict channel to prevent a race where the
sender blocks indefinitely if the receiver has already timed out, and
increase the auth followup test timeout from 100ms to 5s to prevent
spurious failures under load.

Skip postgres-backed tests when the postgres server is unavailable
instead of calling t.Fatal, which was preventing the rest of the test
suite from running.

Add TestMain to db, types, and policy/v2 packages to chdir to the
source directory before running tests. This ensures relative testdata/
paths resolve correctly when the test binary is executed from an
arbitrary working directory (e.g., via "go tool stress").
2026-03-14 02:52:28 -07:00
DM
fffc58b5d0 poll: fix poll test linter violations 2026-03-12 01:27:34 -07:00
DM
4aca9d6568 poll: stop stale map sessions through an explicit teardown hook
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
2026-03-12 01:27:34 -07:00
DM
3daf45e88a mapper: close stale map channels after send timeouts
When the batcher timed out sending to a node, it removed the channel from multiChannelNodeConn but left the old serveLongPoll goroutine running on that channel. That left a live stale session behind: it no longer received new updates, but it could still keep the stream open and block shutdown.

Close the pruned channel when stale-send cleanup removes it so the old map session exits after draining any buffered update.
2026-03-12 01:27:34 -07:00
DM
b81d6c734d mapper: handle RemoveNode after channel cleanup
A connection can already be removed from multiChannelNodeConn by the stale-send cleanup path before serveLongPoll reaches its deferred RemoveNode call. In that case RemoveNode used to return early on "channel not found" and never updated the node's connected state.

Drop that early return so RemoveNode still checks whether any active connections remain and marks the node disconnected when the last one is gone.
2026-03-12 01:27:34 -07:00
DM
8423af2732 Swap favicon for updated version 2026-03-03 05:59:40 +01:00
Florian Preinstorfer
9baa795ddb Update docs for auth-id changes
- Replace "headscale nodes register" with "headscale auth register"
- Update from registration key to Auth ID
- Fix API example to register a node
2026-03-01 13:38:22 +01:00
Kristoffer Dalby
6c59d3e601 policy/v2: add SSH compatibility testdata from Tailscale SaaS
Add 39 test fixtures captured from Tailscale SaaS API responses
to validate SSH policy compilation parity. Each JSON file contains
the SSH policy section and expected compiled SSHRule arrays for 5
test nodes (3 user-owned, 2 tagged).

Test series: SSH-A (basic), SSH-B (specific sources), SSH-C
(destination combos), SSH-D (localpart), SSH-E (edge cases),
SSH-F (multi-rule), SSH-G (acceptEnv).

The data-driven TestSSHDataCompat harness uses cmp.Diff with
principal order tolerance but strict rule ordering (first-match-wins
semantics require exact order).

Updates #3049
2026-02-28 05:14:11 -08:00
Kristoffer Dalby
0acf09bdd2 policy/v2: add localpart:*@domain SSH user compilation
Add support for localpart:*@<domain> entries in SSH policy users.
When a user SSHes into a target, their email local-part becomes the
OS username (e.g. alice@example.com → OS user alice).

Type system (types.go):
- SSHUser.IsLocalpart() and ParseLocalpart() for validation
- SSHUsers.LocalpartEntries(), NormalUsers(), ContainsLocalpart()
- Enforces format: localpart:*@<domain> (wildcard-only)
- UserWildcard.Resolve for user:*@domain SSH source aliases
- acceptEnv passthrough for SSH rules

Compilation (filter.go):
- resolveLocalparts: pure function mapping users to local-parts
  by email domain. No node walking, easy to test.
- groupSourcesByUser: single walk producing per-user principals
  with sorted user IDs, and tagged principals separately.
- ipSetToPrincipals: shared helper replacing 6 inline copies.
- selfPrincipalsForNode: self-access using pre-computed byUser.

The approach separates data gathering from rule assembly. Localpart
rules are interleaved per source user to match Tailscale SaaS
first-match-wins ordering.

Updates #3049
2026-02-28 05:14:11 -08:00
QEDeD
414d3bbbd8 Fix typo in comment about fsnotify behavior
Correct loose (opposite of tight) to lose (opposite of keep).
2026-02-27 15:23:06 +01:00
DM
610c1daa4d types: avoid NodeView clone in CanAccess
NodeView.CanAccess called node2.AsStruct() on every check. In peer-map construction we run CanAccess in O(n^2) pair scans (often twice per pair), so that per-call clone multiplied into large heap churn
2026-02-26 19:15:07 -08:00
Kristoffer Dalby
3db0a483ed integration: add SSH check mode tests
Add ReadLog method to headscale integration container for log
inspection. Split SSH check mode tests into CLI and OIDC variants
and add comprehensive test coverage:

- TestSSHOneUserToOneCheckModeCLI: basic check mode with CLI approval
- TestSSHOneUserToOneCheckModeOIDC: check mode with OIDC approval
- TestSSHCheckModeUnapprovedTimeout: rejection on cache expiry
- TestSSHCheckModeCheckPeriodCLI: session expiry and re-auth
- TestSSHCheckModeAutoApprove: auto-approval within check period
- TestSSHCheckModeNegativeCLI: explicit rejection via CLI

Update existing integration tests to use headscale auth register.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
7bab8da366 state, policy, noise: implement SSH check period auto-approval
Add SSH check period tracking so that recently authenticated users
are auto-approved without requiring manual intervention each time.

Introduce SSHCheckPeriod type with validation (min 1m, max 168h,
"always" for every request) and encode the compiled check period
as URL query parameters in the HoldAndDelegate URL.

The SSHActionHandler checks recorded auth times before creating a
new HoldAndDelegate flow. Auth timestamps are stored in-memory:
- Default period (no explicit checkPeriod): auth covers any
  destination, keyed by source node with Dst=0 sentinel
- Explicit period: auth covers only that specific destination,
  keyed by (source, destination) pair

Auth times are cleared on policy changes.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
48cc98b787 hscontrol, cli: add auth register and approve commands
Implement AuthRegister and AuthApprove gRPC handlers and add
corresponding CLI commands (headscale auth register, approve, reject)
for managing pending auth requests including SSH check approvals.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
107c2f2f70 policy, noise: implement SSH check action
Implement the SSH "check" action which requires additional
verification before allowing SSH access. The policy compiler generates
a HoldAndDelegate URL that the Tailscale client calls back to
headscale. The SSHActionHandler creates an auth session and waits for
approval via the generalised auth flow.

Sort check (HoldAndDelegate) rules before accept rules to match
Tailscale's first-match-wins evaluation order.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
4a7e1475c0 templates: generalise auth templates for web and OIDC
Extract shared HTML/CSS design into a common template and create
generalised auth success and web auth templates that work for both
node registration and SSH check authentication flows.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
cb3b6949ea auth: generalise auth flow and introduce AuthVerdict
Generalise the registration pipeline to a more general auth pipeline
supporting both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.

Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
30338441c1 app: switch from gorilla to chi mux
Replace gorilla/mux with go-chi/chi as the HTTP router and add a
custom zerolog-based request logger to replace chi's default
stdlib-based middleware.Logger, consistent with the rest of the
application.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
8048f10d13 hscontrol/state: extract findExistingNodeForPAK to reduce complexity
Extract the existing-node lookup logic from HandleNodeFromPreAuthKey
into a separate method. This reduces the cyclomatic complexity from
32 to 28, below the gocyclo limit of 30.

Updates #3077
2026-02-20 21:51:00 +01:00
Kristoffer Dalby
1e4fc3f179 hscontrol: add tests for deleting users with tagged nodes
Test the tagged-node-survives-user-deletion scenario at two layers:

DB layer (users_test.go):
- success_user_only_has_tagged_nodes: tagged nodes with nil
  user_id do not block user deletion and survive it
- error_user_has_tagged_and_owned_nodes: user-owned nodes
  still block deletion even when tagged nodes coexist

App layer (grpcv1_test.go):
- TestDeleteUser_TaggedNodeSurvives: full registration flow
  with tagged PreAuthKey verifies nil UserID after registration,
  absence from nodesByUser index, user deletion succeeds, and
  tagged node remains in global node list

Also update auth_tags_test.go assertions to expect nil UserID
on tagged nodes, consistent with the new invariant.

Updates #3077
2026-02-20 21:51:00 +01:00