Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity
Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.
Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.
Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.
Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.
Add comprehensive concurrency tests validating these changes.
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.
Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.
Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
Buffer the AuthRequest verdict channel to prevent a race where the
sender blocks indefinitely if the receiver has already timed out, and
increase the auth followup test timeout from 100ms to 5s to prevent
spurious failures under load.
Skip postgres-backed tests when the postgres server is unavailable
instead of calling t.Fatal, which was preventing the rest of the test
suite from running.
Add TestMain to db, types, and policy/v2 packages to chdir to the
source directory before running tests. This ensures relative testdata/
paths resolve correctly when the test binary is executed from an
arbitrary working directory (e.g., via "go tool stress").
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
When the batcher timed out sending to a node, it removed the channel from multiChannelNodeConn but left the old serveLongPoll goroutine running on that channel. That left a live stale session behind: it no longer received new updates, but it could still keep the stream open and block shutdown.
Close the pruned channel when stale-send cleanup removes it so the old map session exits after draining any buffered update.
A connection can already be removed from multiChannelNodeConn by the stale-send cleanup path before serveLongPoll reaches its deferred RemoveNode call. In that case RemoveNode used to return early on "channel not found" and never updated the node's connected state.
Drop that early return so RemoveNode still checks whether any active connections remain and marks the node disconnected when the last one is gone.
Add support for localpart:*@<domain> entries in SSH policy users.
When a user SSHes into a target, their email local-part becomes the
OS username (e.g. alice@example.com → OS user alice).
Type system (types.go):
- SSHUser.IsLocalpart() and ParseLocalpart() for validation
- SSHUsers.LocalpartEntries(), NormalUsers(), ContainsLocalpart()
- Enforces format: localpart:*@<domain> (wildcard-only)
- UserWildcard.Resolve for user:*@domain SSH source aliases
- acceptEnv passthrough for SSH rules
Compilation (filter.go):
- resolveLocalparts: pure function mapping users to local-parts
by email domain. No node walking, easy to test.
- groupSourcesByUser: single walk producing per-user principals
with sorted user IDs, and tagged principals separately.
- ipSetToPrincipals: shared helper replacing 6 inline copies.
- selfPrincipalsForNode: self-access using pre-computed byUser.
The approach separates data gathering from rule assembly. Localpart
rules are interleaved per source user to match Tailscale SaaS
first-match-wins ordering.
Updates #3049
NodeView.CanAccess called node2.AsStruct() on every check. In peer-map construction we run CanAccess in O(n^2) pair scans (often twice per pair), so that per-call clone multiplied into large heap churn
Add ReadLog method to headscale integration container for log
inspection. Split SSH check mode tests into CLI and OIDC variants
and add comprehensive test coverage:
- TestSSHOneUserToOneCheckModeCLI: basic check mode with CLI approval
- TestSSHOneUserToOneCheckModeOIDC: check mode with OIDC approval
- TestSSHCheckModeUnapprovedTimeout: rejection on cache expiry
- TestSSHCheckModeCheckPeriodCLI: session expiry and re-auth
- TestSSHCheckModeAutoApprove: auto-approval within check period
- TestSSHCheckModeNegativeCLI: explicit rejection via CLI
Update existing integration tests to use headscale auth register.
Updates #1850
Add SSH check period tracking so that recently authenticated users
are auto-approved without requiring manual intervention each time.
Introduce SSHCheckPeriod type with validation (min 1m, max 168h,
"always" for every request) and encode the compiled check period
as URL query parameters in the HoldAndDelegate URL.
The SSHActionHandler checks recorded auth times before creating a
new HoldAndDelegate flow. Auth timestamps are stored in-memory:
- Default period (no explicit checkPeriod): auth covers any
destination, keyed by source node with Dst=0 sentinel
- Explicit period: auth covers only that specific destination,
keyed by (source, destination) pair
Auth times are cleared on policy changes.
Updates #1850
Implement the SSH "check" action which requires additional
verification before allowing SSH access. The policy compiler generates
a HoldAndDelegate URL that the Tailscale client calls back to
headscale. The SSHActionHandler creates an auth session and waits for
approval via the generalised auth flow.
Sort check (HoldAndDelegate) rules before accept rules to match
Tailscale's first-match-wins evaluation order.
Updates #1850
Extract shared HTML/CSS design into a common template and create
generalised auth success and web auth templates that work for both
node registration and SSH check authentication flows.
Updates #1850
Generalise the registration pipeline to a more general auth pipeline
supporting both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.
Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.
Updates #1850
Replace gorilla/mux with go-chi/chi as the HTTP router and add a
custom zerolog-based request logger to replace chi's default
stdlib-based middleware.Logger, consistent with the rest of the
application.
Updates #1850
Extract the existing-node lookup logic from HandleNodeFromPreAuthKey
into a separate method. This reduces the cyclomatic complexity from
32 to 28, below the gocyclo limit of 30.
Updates #3077
Test the tagged-node-survives-user-deletion scenario at two layers:
DB layer (users_test.go):
- success_user_only_has_tagged_nodes: tagged nodes with nil
user_id do not block user deletion and survive it
- error_user_has_tagged_and_owned_nodes: user-owned nodes
still block deletion even when tagged nodes coexist
App layer (grpcv1_test.go):
- TestDeleteUser_TaggedNodeSurvives: full registration flow
with tagged PreAuthKey verifies nil UserID after registration,
absence from nodesByUser index, user deletion succeeds, and
tagged node remains in global node list
Also update auth_tags_test.go assertions to expect nil UserID
on tagged nodes, consistent with the new invariant.
Updates #3077
Tagged nodes are owned by their tags, not a user. Enforce this
invariant at every write path:
- createAndSaveNewNode: do not set UserID for tagged PreAuthKey
registration; clear UserID when advertise-tags are applied
during OIDC/CLI registration
- SetNodeTags: clear UserID/User when tags are assigned
- processReauthTags: clear UserID/User when tags are applied
during re-authentication
- validateNodeOwnership: reject tagged nodes with non-nil UserID
- NodeStore: skip nodesByUser indexing for tagged nodes since
they have no owning user
- HandleNodeFromPreAuthKey: add fallback lookup for tagged PAK
re-registration (tagged nodes indexed under UserID(0)); guard
against nil User deref for tagged nodes in different-user check
Since tagged nodes now have user_id = NULL, ListNodesByUser
will not return them and DestroyUser naturally allows deleting
users whose nodes have all been tagged. The ON DELETE CASCADE
FK cannot reach tagged nodes through a NULL foreign key.
Also tone down shouty comments throughout state.go.
Fixes#3077
Tagged nodes are owned by their tags, not a user. Previously
user_id was kept as "created by" tracking, but this prevents
deleting users whose nodes have all been tagged, and the
ON DELETE CASCADE FK would destroy the tagged nodes.
Add a migration that sets user_id = NULL on all existing tagged
nodes. Subsequent commits enforce this invariant at write time.
Updates #3077
Add --disable flag to "headscale nodes expire" CLI command and
disable_expiry field handling in the gRPC API to allow disabling
key expiry for nodes. When disabled, the node's expiry is set to
NULL and IsExpired() returns false.
The CLI follows the new grpcRunE/RunE/printOutput patterns
introduced in the recent CLI refactor.
Also fix NodeSetExpiry to persist directly to the database instead
of going through persistNodeToDB which omits the expiry field.
Fixes#2681
Co-authored-by: Marco Santos <me@marcopsantos.com>
Add end-to-end test cases to TestUnmarshalPolicy that verify bracketed
IPv6 addresses are correctly parsed through the full policy pipeline
(JSON unmarshal -> splitDestinationAndPort -> parseAlias -> parsePortRange)
and survive JSON round-trips.
Cover single port, multiple ports, wildcard port, CIDR prefix, port
range, bracketed IPv4, and hostname rejection.
Updates #2754
Headscale rejects IPv6 addresses with square brackets in ACL policy
destinations (e.g. "[fd7a:115c:a1e0::87e1]:80,443"), while Tailscale
SaaS accepts them. The root cause is that splitDestinationAndPort uses
strings.LastIndex(":") which leaves brackets on the destination string,
and netip.ParseAddr does not accept brackets.
Add a bracket-handling branch at the top of splitDestinationAndPort that
uses net.SplitHostPort for RFC 3986 parsing when input starts with "[".
The extracted host is validated with netip.ParseAddr/ParsePrefix to
ensure brackets are only accepted around IP addresses and CIDR prefixes,
not hostnames or other alias types like tags and groups.
Fixes#2754
Use new(users["name"]) instead of extracting to intermediate
variables that staticcheck does not recognise as used with
Go 1.26 new(value) syntax.
Updates #3058
Fix issues found by the upgraded golangci-lint:
- wsl_v5: add required whitespace in CLI files
- staticcheck SA4006: replace new(var.Field) with &localVar
pattern since staticcheck does not recognize Go 1.26
new(value) as a use of the variable
- staticcheck SA5011: use t.Fatal instead of t.Error for
nil guard checks so execution stops
- unused: remove dead ptrTo helper function
Use GORM AutoMigrate instead of raw SQL to create the
database_versions table, since PostgreSQL does not support the
datetime type used in the raw SQL (it requires timestamp).
Add a version check that runs before database migrations to ensure
users do not skip minor versions or downgrade. This protects database
migrations and allows future cleanup of old migration code.
Rules enforced:
- Same minor version: always allowed (patch changes either way)
- Single minor upgrade (e.g. 0.27 -> 0.28): allowed
- Multi-minor upgrade (e.g. 0.25 -> 0.28): blocked with guidance
- Any minor downgrade: blocked
- Major version change: blocked
- Dev builds: warn but allow, preserve stored version
The version is stored in a purpose-built database_versions table
after migrations succeed. The table is created with raw SQL before
gormigrate runs to avoid circular dependencies.
Updates #3058
This commit upgrades the codebase from Go 1.25.5 to Go 1.26rc2 and
adopts new language features.
Toolchain updates:
- go.mod: go 1.25.5 → go 1.26rc2
- flake.nix: buildGo125Module → buildGo126Module, go_1_25 → go_1_26
- flake.nix: build golangci-lint from source with Go 1.26
- Dockerfile.integration: golang:1.25-trixie → golang:1.26rc2-trixie
- Dockerfile.tailscale-HEAD: golang:1.25-alpine → golang:1.26rc2-alpine
- Dockerfile.derper: golang:alpine → golang:1.26rc2-alpine
- .goreleaser.yml: go mod tidy -compat=1.25 → -compat=1.26
- cmd/hi/run.go: fallback Go version 1.25 → 1.26rc2
- .pre-commit-config.yaml: simplify golangci-lint hook entry
Code modernization using Go 1.26 features:
- Replace tsaddr.SortPrefixes with slices.SortFunc + netip.Prefix.Compare
- Replace ptr.To(x) with new(x) syntax
- Replace errors.As with errors.AsType[T]
Lint rule updates:
- Add forbidigo rules to prevent regression to old patterns
Errors should not start capitalised and they should not contain the word error
or state that they "failed" as we already know it is an error
Signed-off-by: Kristoffer Dalby <kristoffer@dalby.cc>
Go style recommends that log messages and error strings should not be
capitalized (unless beginning with proper nouns or acronyms) and should
not end with punctuation.
This change normalizes all zerolog .Msg() and .Msgf() calls to start
with lowercase letters, following Go conventions and making logs more
consistent across the codebase.
Replace raw string field names with zf constants in state.go and
db/node.go for consistent, type-safe logging.
state.go changes:
- User creation, hostinfo validation, node registration
- Tag processing during reauth (processReauthTags)
- Auth path and PreAuthKey handling
- Route auto-approval and MapRequest processing
db/node.go changes:
- RegisterNodeForTest logging
- Invalid hostname replacement logging
Add sub-logger patterns to worker(), AddNode(), RemoveNode() and
multiChannelNodeConn to eliminate repeated field calls. Use zf.*
constants for consistent field naming.
Changes in batcher_lockfree.go:
- Add wlog sub-logger in worker() with worker.id context
- Add log field to multiChannelNodeConn struct
- Initialize mc.log with node.id in newMultiChannelNodeConn()
- Add nlog sub-loggers in AddNode() and RemoveNode()
- Update all connection methods to use mc.log
Changes in batcher.go:
- Use zf.NodeID and zf.Reason in handleNodeChange()
Replace manual field extraction with EmbedObject for node logging
in gRPC handlers. Use zf.* constants for consistent field naming.
Changes:
- RegisterNode: use EmbedObject(node), zf.RegistrationKey, etc.
- SetTags: use EmbedObject(node)
- ExpireNode: use EmbedObject(node), zf.ExpiresAt
- RenameNode: use EmbedObject(node), zf.NewName
- SetApprovedRoutes: use zf.NodeID
Add sub-logger pattern to SetRoutes() to eliminate repeated node.id
field calls. Replace raw strings with zf.* constants throughout
the primary routes code for consistent field naming.
Changes:
- Add nlog sub-logger in SetRoutes() with node.id context
- Replace "prefix" with zf.Prefix
- Replace "changed" with zf.Changes
- Replace "newState" with zf.NewState
- Replace "finalState" with zf.FinalState
Replace the helper functions (logf, infof, tracef, errf) with a
zerolog sub-logger initialized in newMapSession(). The sub-logger
is pre-populated with session context (component, node, omitPeers,
stream) eliminating repeated field calls throughout the code.
Changes:
- Add log field to mapSession struct
- Initialize sub-logger with EmbedObject(node) and request context
- Remove logf/infof/tracef/errf helper functions
- Update all callers to use m.log.Level().Caller()... pattern
- Update noise.go to use sess.log instead of sess.tracef
This reduces code by ~20 lines and eliminates ~15 repeated field
calls per log statement.
Add hscontrol/util/zlog package with:
- zf subpackage: field name constants for compile-time safety
- SafeHostinfo: wrapper that redacts device fingerprinting data
- SafeMapRequest: wrapper that redacts client endpoints
The zf (zerolog fields) subpackage provides short constant names
(e.g., zf.NodeID instead of inline "node.id" strings) ensuring
consistent field naming across all log statements.
Security considerations:
- SafeHostinfo never logs: OSVersion, DeviceModel, DistroName
- SafeMapRequest only logs endpoint counts, not actual IPs
According to Tailscale SaaS behavior, autogroup:internet is handled
by exit node routing via AllowedIPs, not by packet filtering. ACL
rules with autogroup:internet as destination should produce no
filter rules for any node.
Previously, Headscale expanded autogroup:internet to public CIDR
ranges and distributed filters to exit nodes (because 0.0.0.0/0
"covers" internet destinations). This was incorrect.
Add detection for AutoGroupInternet in filter compilation to skip
filter generation for this autogroup. Update test expectations
accordingly.
Fix two compatibility issues discovered in Tailscale SaaS testing:
1. Wildcard DstPorts format: Headscale was expanding wildcard
destinations to CGNAT ranges (100.64.0.0/10, fd7a:115c:a1e0::/48)
while Tailscale uses {IP: "*"} directly. Add detection for
wildcard (Asterix) alias type in filter compilation to use the
correct format.
2. proto:icmp handling: The "icmp" protocol name was returning both
ICMPv4 (1) and ICMPv6 (58), but Tailscale only returns ICMPv4.
Users should use "ipv6-icmp" or protocol number 58 explicitly
for IPv6 ICMP.
Update all test expectations accordingly. This significantly reduces
test file line count by replacing duplicated CGNAT range patterns
with single wildcard entries.