[Feature] In-Memory State Management #1020

Closed
opened 2025-12-29 02:27:34 +01:00 by adam · 4 comments

Originally created by @CodingTil on GitHub (May 15, 2025).

Use case

Under certain loads, using a database for state management in headscale represents a performance bottleneck. To boost performance while ensuring database persistence, we propose to implement an in-memory state management layer.

Description

Profiling the server under certain loads revealed that most CPU time is spent on database operations, causing unresponsiveness and even node disconnections. This was observed on commit 0d3134720b.

![Profiling screenshot](https://github.com/user-attachments/assets/a0e4a943-66ae-4410-a432-6d072047fa09)
![Profiling screenshot](https://github.com/user-attachments/assets/c5de6579-1088-4853-9f37-9f5a2a31172a)
![Profiling screenshot](https://github.com/user-attachments/assets/7f9d26e4-6e20-4a74-bbe5-7b8c9464d859)
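
For context (this is not necessarily how the profiles above were captured), a common way to take such a CPU profile of a running Go server is the standard net/http/pprof debug endpoint:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this endpoint running inside the server, a 30-second CPU
	// profile can be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```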

Our previous attempts to accelerate DB read operations include:

  • Increasing the SQLite connection pool (#2571; a rough sketch follows this list)
  • Replacing GORM's default JSONSerializer with a faster one using sonicJSON (references #2513)
  • Caching peers in the Mapper (a hack, only eventually consistent with the DB state)
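
As a rough sketch of the first item (not the exact change in #2571), GORM exposes the underlying database/sql pool, whose limits can be widened; the values below are illustrative:

```go
import (
	"time"

	"gorm.io/gorm"
)

// tunePool widens the database/sql connection pool that sits underneath
// an existing *gorm.DB. The limits here are illustrative; they are not
// the values actually used in #2571.
func tunePool(gormDB *gorm.DB) error {
	sqlDB, err := gormDB.DB() // access the underlying *sql.DB
	if err != nil {
		return err
	}
	sqlDB.SetMaxOpenConns(10)           // more connections for concurrent readers
	sqlDB.SetMaxIdleConns(10)           // keep connections warm between requests
	sqlDB.SetConnMaxLifetime(time.Hour) // recycle long-lived connections
	return nil
}
```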

Although these changes had a positive impact, the overall outcome was still unsatisfactory. Inspired by your in-code comments and proposals on other issues (e.g., https://github.com/juanfont/headscale/blob/d7a503a34effa188e9bb27cb6b0fad2002112fb0/hscontrol/app.go#L529 and https://github.com/juanfont/headscale/issues/2571#issuecomment-2858966975), we decided to reevaluate the database access setup.

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

Our proposal concerns a new in-memory state component, sitting between the server and the database. This component should have the following properties:

  • Hold in memory all data relevant to headscale server operation that currently resides in the database
  • Persist state:
    • Initialize state from the DB
    • Write to the DB when necessary (to be determined)
  • Allow concurrent reads of the state

Example struct:

```go
// State holds the global state of the headscale server.
type State struct {
	// state is persisted in the database
	db *db.HSDatabase

	// concurrent reads are permitted
	mutex sync.RWMutex

	// ground truth data
	preauthKeys []types.PreAuthKey
	nodes       types.Nodes
	users       types.Users
}
```
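
As a sketch of the concurrency property, accessors could share the read lock for lookups and take the exclusive lock for mutations. The method names below are ours, not part of the proposal, and we assume types.Nodes is a slice of *types.Node:

```go
// ListNodes returns a shallow copy of the in-memory node list; any
// number of readers may hold the read lock at the same time.
func (s *State) ListNodes() types.Nodes {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	nodes := make(types.Nodes, len(s.nodes))
	copy(nodes, s.nodes) // copy the slice so later writers cannot reorder it under the caller
	return nodes
}

// AddNode mutates the state under the write lock; the corresponding DB
// write happens separately, per whichever update scheme below is chosen.
func (s *State) AddNode(node *types.Node) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.nodes = append(s.nodes, node)
}
```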

We propose one of the following DB-Update schemes:

  1. Immediate DB Updates: Update the DB immediately when the server updates the state.
  2. Scheduled DB Updates: Update the in-memory state immediately, and dump it to the DB every X seconds (e.g., every minute or two) or on SIGTERM.

While individual immediate DB updates are smaller and faster, they may result in a larger number of updates over time. Scheduled DB updates might be simpler to implement and result in more readable code, making them our preferred choice.
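
A minimal sketch of the scheduled scheme, assuming a hypothetical persistToDB method on the State struct above that writes the current in-memory snapshot back to the database through GORM:

```go
import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runFlusher periodically dumps the in-memory state to the database and
// performs a final dump on SIGTERM (scheme 2).
func (s *State) runFlusher(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	for {
		select {
		case <-ticker.C:
			s.persistToDB() // periodic dump, e.g. every minute or two
		case <-sigs:
			s.persistToDB() // final dump before shutdown
			return
		}
	}
}
```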

If you agree with our proposal, we (@aergus-tng @Enkelmann @CMS-TNG @JanisCasper and me, and potentially a few more colleagues) would like to implement this.

adam added the enhancement and no-stale-bot labels 2025-12-29 02:27:34 +01:00
adam closed this issue 2025-12-29 02:27:34 +01:00

@Codelica commented on GitHub (May 15, 2025):

Interesting. It seems that Web UIs (like Headscale-Admin) that poll the API continually for state changes would benefit as well.


@kradalby commented on GitHub (May 16, 2025):

> Although these changes had a positive impact, the overall outcome was still unsatisfactory.

I appreciate your help so far and I am sorry that it is still unsatisfactory.

We are aware that the database is a large bottleneck, and it is in our plans to address this. For now, we are on a very good trajectory for correcting a lot of inconsistencies and faults in Headscale, and the next items on the roadmap will continue to focus on this.

I have plans to start work on this in the future, but for now we will continue with our plan to "get things right, then make it faster", as there are a lot of moving parts involved and parts of the other changes will make this easier to achieve.

I imagine that we will get to it in 3-4 releases, after tags, autogroups, and tls/serve.

For now, this is an effort that we really need to be on top of ourselves: there are lots of moving parts, we have to maintain it over time, and we need to design it, so we will keep this on hold.
I am very happy for your efforts and will happily accept other smaller changes as you find them.

And when the time comes for this implementation, feedback and code review would be greatly appreciated!


@github-actions[bot] commented on GitHub (Aug 15, 2025):

This issue is stale because it has been open for 90 days with no activity.


@kradalby commented on GitHub (Sep 9, 2025):

An initial version of this has been merged in #2670. I have not benchmarked it, so it could potentially have moved the bottleneck around, but it is a step in the right direction.

Reference: starred/headscale#1020