Commit Graph

128 Commits

Author SHA1 Message Date
Kristoffer Dalby
afd3a6acbc mapper/batcher: remove disabled X-prefixed test functions
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
feaf85bfbc mapper/batcher: clean up test constants and output
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.

L10: Fix misleading assert message that said "1337" while checking
for region ID 999.

L12: Remove emoji from test log output to avoid encoding issues
in CI environments.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
86e279869e mapper/batcher: minor production code cleanup
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.

L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.

L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.

L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
7881f65358 mapper: extract node connection types to node_conn.go
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.

Pure move — no logic changes.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2d549e579f mapper/batcher: add regression tests for M1, M3, M7 fixes
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
  Start() returns promptly now that done is initialized in NewBatcher.

- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
  returns via the done channel after Close(), even without Start().

- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
  returns false after AddNode fails and removes the last connection.

- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
  array slot is nil-ed after removal to avoid retaining pointers.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
50e8b21471 mapper/batcher: fix pointer retention, done-channel init, and connected-map races
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.

M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.

M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.

M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
8e26651f2c mapper/batcher: add regression tests for timer leak and Close lifecycle
Add four unit tests guarding fixes introduced in recent commits:

- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
  time.NewTimer fix (H1) does not leak goroutines after many
  fast-path sends on a buffered channel.

- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
  worker goroutines exit (H3), preventing sends on torn-down channels.

- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
  contract; Start() after Close() must not spawn new goroutines.

- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
  ticker to prevent resource leaks.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57a38b5678 mapper/batcher: reduce hot-path log verbosity
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.

These methods run on every MapResponse delivery, so the savings
compound quickly under load.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
051a38a4c4 mapper/batcher: track worker goroutines and stop ticker on Close
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.

Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.

Document that a Batcher must not be reused after Close().
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3276bda0c0 mapper/batcher: replace time.After with NewTimer to avoid timer leak
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.

Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2058343ad6 mapper: remove Batcher interface, rename to Batcher struct
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.

Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
9b24a39943 mapper/batcher: add scale benchmarks
Add benchmarks that systematically test node counts from 100 to
50,000 to identify scaling limits and validate performance under
load.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3ebe4d99c1 mapper/batcher: reduce lock contention with two-phase send
Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity

Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
da33795e79 mapper/batcher: fix race conditions in cleanup and lookups
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.

Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57070680a5 mapper/batcher: restructure internals for correctness
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.

Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.

Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.

Add comprehensive concurrency tests validating these changes.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
21e02e5d1f mapper/batcher: add unit tests and benchmarks
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.

Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.

Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
2026-03-14 02:52:28 -07:00
DM
4aca9d6568 poll: stop stale map sessions through an explicit teardown hook
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
2026-03-12 01:27:34 -07:00
DM
3daf45e88a mapper: close stale map channels after send timeouts
When the batcher timed out sending to a node, it removed the channel from multiChannelNodeConn but left the old serveLongPoll goroutine running on that channel. That left a live stale session behind: it no longer received new updates, but it could still keep the stream open and block shutdown.

Close the pruned channel when stale-send cleanup removes it so the old map session exits after draining any buffered update.
2026-03-12 01:27:34 -07:00
DM
b81d6c734d mapper: handle RemoveNode after channel cleanup
A connection can already be removed from multiChannelNodeConn by the stale-send cleanup path before serveLongPoll reaches its deferred RemoveNode call. In that case RemoveNode used to return early on "channel not found" and never updated the node's connected state.

Drop that early return so RemoveNode still checks whether any active connections remain and marks the node disconnected when the last one is gone.
2026-03-12 01:27:34 -07:00
Kristoffer Dalby
cb3b6949ea auth: generalise auth flow and introduce AuthVerdict
Generalise the registration pipeline to a more general auth pipeline
supporting both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.

Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.

Updates #1850
2026-02-25 21:28:05 +01:00
Kristoffer Dalby
0f6d312ada all: upgrade to Go 1.26rc2 and modernize codebase
This commit upgrades the codebase from Go 1.25.5 to Go 1.26rc2 and
adopts new language features.

Toolchain updates:
- go.mod: go 1.25.5 → go 1.26rc2
- flake.nix: buildGo125Module → buildGo126Module, go_1_25 → go_1_26
- flake.nix: build golangci-lint from source with Go 1.26
- Dockerfile.integration: golang:1.25-trixie → golang:1.26rc2-trixie
- Dockerfile.tailscale-HEAD: golang:1.25-alpine → golang:1.26rc2-alpine
- Dockerfile.derper: golang:alpine → golang:1.26rc2-alpine
- .goreleaser.yml: go mod tidy -compat=1.25 → -compat=1.26
- cmd/hi/run.go: fallback Go version 1.25 → 1.26rc2
- .pre-commit-config.yaml: simplify golangci-lint hook entry

Code modernization using Go 1.26 features:
- Replace tsaddr.SortPrefixes with slices.SortFunc + netip.Prefix.Compare
- Replace ptr.To(x) with new(x) syntax
- Replace errors.As with errors.AsType[T]

Lint rule updates:
- Add forbidigo rules to prevent regression to old patterns
2026-02-08 12:35:23 +01:00
Kristoffer Dalby
ce580f8245 all: fix golangci-lint issues (#3064) 2026-02-06 21:45:32 +01:00
Kristoffer Dalby
3acce2da87 errors: rewrite errors to follow go best practices
Errors should not start capitalised and they should not contain the word error
or state that they "failed" as we already know it is an error

Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2026-02-06 07:40:29 +01:00
Kristoffer Dalby
4a9a329339 all: use lowercase log messages
Go style recommends that log messages and error strings should not be
capitalized (unless beginning with proper nouns or acronyms) and should
not end with punctuation.

This change normalizes all zerolog .Msg() and .Msgf() calls to start
with lowercase letters, following Go conventions and making logs more
consistent across the codebase.
2026-02-06 07:40:29 +01:00
Kristoffer Dalby
53cdeff129 hscontrol/mapper: use sub-loggers and zf constants
Add sub-logger patterns to worker(), AddNode(), RemoveNode() and
multiChannelNodeConn to eliminate repeated field calls. Use zf.*
constants for consistent field naming.

Changes in batcher_lockfree.go:
- Add wlog sub-logger in worker() with worker.id context
- Add log field to multiChannelNodeConn struct
- Initialize mc.log with node.id in newMultiChannelNodeConn()
- Add nlog sub-loggers in AddNode() and RemoveNode()
- Update all connection methods to use mc.log

Changes in batcher.go:
- Use zf.NodeID and zf.Reason in handleNodeChange()
2026-02-06 07:40:29 +01:00
Kristoffer Dalby
91730e2a1d hscontrol: use EmbedObject for node logging
Replace manual Uint64("node.id")/Str("node.name") field patterns with
EmbedObject(node) which automatically includes all standard node fields
(id, name, machine key, node key, online status, tags, user).

This reduces code repetition and ensures consistent logging across:
- state.go: Connect/Disconnect, persistNodeToDB, AutoApproveRoutes
- auth.go: handleLogout, handleRegisterWithAuthKey
2026-02-06 07:40:29 +01:00
Kristoffer Dalby
ce7c256d1e state: set User pointer during tagged→user-owned conversion
processReauthTags sets UserID when converting a tagged node to
user-owned, but does not set the User pointer. When the node was
registered with a tags-only PreAuthKey (User: nil), the in-memory
NodeStore cache holds a node with User=nil. The mapper's
generateUserProfiles then calls node.Owner().Model().ID, which
dereferences the nil pointer and panics.

Set node.User alongside node.UserID in processReauthTags. Also add
defensive nil checks in generateUserProfiles to gracefully handle
nodes with invalid owners rather than panicking.

Fixes #3038
2026-02-04 15:44:55 +01:00
Shourya Gautam
4e1834adaf db: use PolicyManager for RequestTags migration
Refactor the RequestTags migration (202601121700-migrate-hostinfo-request-tags)
to use PolicyManager.NodeCanHaveTag() instead of reimplementing tag validation.

Changes:
- NewHeadscaleDatabase now accepts *types.Config to allow migrations
  access to policy configuration
- Add loadPolicyBytes helper to load policy from file or DB based on config
- Add standalone GetPolicy(tx *gorm.DB) for use during migrations
- Replace custom tag validation logic with PolicyManager

Benefits:
- Full HuJSON parsing support (not just JSON)
- Proper group expansion via PolicyManager
- Support for nested tags and autogroups
- Works with both file and database policy modes
- Single source of truth for tag validation


Co-Authored-By: Shourya Gautam <shouryamgautam@gmail.com>
2026-01-21 15:10:29 +01:00
Kristoffer Dalby
424e26d636 db: migrate tests from check.v1 to testify
Migrate all database tests from gopkg.in/check.v1 Suite-based testing
to standard Go tests with testify assert/require.

Changes:
- Remove empty Suite files (hscontrol/suite_test.go, hscontrol/mapper/suite_test.go)
- Convert hscontrol/db/suite_test.go to modern helpers only
- Convert 6 Suite test methods in node_test.go to standalone tests
- Convert 5 Suite test methods in api_key_test.go to standalone tests
- Fix stale global variable reference in db_test.go

The legacy TestListPeers Suite method was renamed to TestListPeersManyNodes
to avoid conflict with the existing modern TestListPeers function, as they
test different aspects (basic peer listing vs ID filtering).
2026-01-20 15:41:33 +01:00
Kristoffer Dalby
3b4b9a4436 hscontrol: fix tag updates not propagating to node self view
When SetNodeTags changed a node's tags, the node's self view wasn't
updated. The bug manifested as: the first SetNodeTags call updates
the server but the client's self view doesn't update until a second
call with the same tag.

Root cause: Three issues combined to prevent self-updates:

1. SetNodeTags returned PolicyChange which doesn't set OriginNode,
   so the mapper's self-update check failed.

2. The Change.Merge function didn't preserve OriginNode, so when
   changes were batched together, OriginNode was lost.

3. generateMapResponse checked OriginNode only in buildFromChange(),
   but PolicyChange uses RequiresRuntimePeerComputation which
   bypasses that code path entirely and calls policyChangeResponse()
   instead.

The fix addresses all three:
- state.go: Set OriginNode on the returned change
- change.go: Preserve OriginNode (and TargetNode) during merge
- batcher.go: Pass isSelfUpdate to policyChangeResponse so the
  origin node gets both self info AND packet filters
- mapper.go: Add includeSelf parameter to policyChangeResponse

Fixes #2978
2026-01-20 10:13:47 +01:00
Kristoffer Dalby
72fcb93ef3 cli: ensure tagged-devices is included in profile list (#2991) 2026-01-09 16:31:23 +01:00
Justin Angel
7be20912f5 oidc: make email verification configurable
Co-authored-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-12-18 11:42:32 +00:00
Kristoffer Dalby
82d4275c3b mapper: correct some variable names missed from change
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2025-12-17 13:19:26 +01:00
Kristoffer Dalby
f3767dddf8 batcher: ensure removal from batcher
Fixes #2924

Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2025-12-17 13:19:26 +01:00
Kristoffer Dalby
5767ca5085 change: smarter change notifications
This commit replaces the ChangeSet with a simpler bool based
change model that can be directly used in the map builder to
build the appropriate map response based on the change that
has occured. Previously, we fell back to sending full maps
for a lot of changes as that was consider "the safe" thing to
do to ensure no updates were missed.

This was slightly problematic as a node that already has a list
of peers will only do full replacement of the peers if the list
is non-empty, meaning that it was not possible to remove all
nodes (if for example policy changed).

Now we will keep track of last seen nodes, so we can send remove
ids, but also we are much smarter on how we send smaller, partial
maps when needed.

Fixes #2389

Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
2025-12-16 10:12:36 +01:00
Kristoffer Dalby
616c0e895d batcher: fix closed panic
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-12-15 16:28:27 +01:00
Kristoffer Dalby
642073f4b8 types: add option to disable taildrop, improve tests (#2955) 2025-12-12 11:35:16 +01:00
Kristoffer Dalby
87bd67318b golangci-lint: use forbidigo to block time.Sleep (#2946) 2025-12-10 16:45:59 +00:00
Kristoffer Dalby
0e1673041c all: remove deadcode (#2952) 2025-12-10 15:55:15 +01:00
Kristoffer Dalby
c8376e44a2 mapper: move tail node conversion to node type (#2950) 2025-12-10 09:16:22 +01:00
Kristoffer Dalby
22ee2bfc9c tags: process tags on registration, simplify policy (#2931)
This PR investigates, adds tests and aims to correctly implement Tailscale's model for how Tags should be accepted, assigned and used to identify nodes in the Tailscale access and ownership model.

When evaluating in Headscale's policy, Tags are now only checked against a nodes "tags" list, which defines the source of truth for all tags for a given node. This simplifies the code for dealing with tags greatly, and should help us have less access bugs related to nodes belonging to tags or users.

A node can either be owned by a user, or a tag.

Next, to ensure the tags list on the node is correctly implemented, we first add tests for every registration scenario and combination of user, pre auth key and pre auth key with tags with the same registration expectation as observed by trying them all with the Tailscale control server. This should ensure that we implement the correct behaviour and that it does not change or break over time.

Lastly, the missing parts of the auth has been added, or changed in the cases where it was wrong. This has in large parts allowed us to delete and simplify a lot of code.
Now, tags can only be changed when a node authenticates or if set via the CLI/API. Tags can only be fully overwritten/replaced and any use of either auth or CLI will replace the current set if different.

A user owned device can be converted to a tagged device, but it cannot be changed back. A tagged device can never remove the last tag either, it has to have a minimum of one.
2025-12-08 18:51:07 +01:00
Kristoffer Dalby
eb788cd007 make tags first class node owner (#2885)
This PR changes tags to be something that exists on nodes in addition to users, to being its own thing. It is part of moving our tags support towards the correct tailscale compatible implementation.

There are probably rough edges in this PR, but the intention is to get it in, and then start fixing bugs from 0.28.0 milestone (long standing tags issue) to discover what works and what doesnt.

Updates #2417
Closes #2619
2025-12-02 12:01:25 +01:00
Kristoffer Dalby
eec196d200 modernize: run gopls modernize to bring up to 1.25 (#2920) 2025-12-01 19:40:25 +01:00
Kristoffer Dalby
db293e0698 hscontrol/state: make NodeStore batch configuration tunable (#2886) 2025-11-28 16:38:29 +01:00
Kristoffer Dalby
7fb0f9a501 batcher: send endpoint and derp only updates. (#2856) 2025-11-13 20:38:49 +01:00
Kristoffer Dalby
d7a43a7cf1 state: use AllApprovedRoutes instead of SubnetRoutes
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-11-02 13:19:59 +01:00
Kristoffer Dalby
2bf1200483 policy: fix autogroup:self propagation and optimize cache invalidation (#2807) 2025-10-23 17:57:41 +02:00
Florian Preinstorfer
46477b8021 Downgrade completed broadcast message to debug 2025-10-18 07:56:59 +02:00
Vitalij Dovhanyc
c2a58a304d feat: add autogroup:self (#2789) 2025-10-16 12:59:52 +02:00
Kristoffer Dalby
ed3a9c8d6d mapper: send change instead of full update (#2775) 2025-09-17 14:23:21 +02:00