[Feature] In-Memory State Management #1020

Closed
opened 2025-12-29 02:27:34 +01:00 by adam · 4 comments

Originally created by @CodingTil on GitHub (May 15, 2025).

Use case

Under certain loads, using a database for state management in headscale represents a performance bottleneck. To boost performance while ensuring database persistence, we propose to implement an in-memory state management layer.

Description

Profiling the server under certain loads revealed that most CPU time is spent on database operations, causing unresponsiveness and even node disconnections. This was observed on commit 0d3134720b.

![Profiling screenshot](https://github.com/user-attachments/assets/a0e4a943-66ae-4410-a432-6d072047fa09)
![Profiling screenshot](https://github.com/user-attachments/assets/c5de6579-1088-4853-9f37-9f5a2a31172a)
![Profiling screenshot](https://github.com/user-attachments/assets/7f9d26e4-6e20-4a74-bbe5-7b8c9464d859)
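
For context (this is not necessarily how the profiles above were captured), a common way to take such a CPU profile of a running Go server is the standard net/http/pprof debug endpoint:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this endpoint running inside the server, a 30-second CPU
	// profile can be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```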

Our previous attempts to accelerate DB read operations include:

  • Increasing the SQLite connection pool (#2571; a rough sketch follows this list)
  • Replacing GORM's default JSONSerializer with a faster one using sonicJSON (references #2513)
  • Caching peers in the Mapper (a hack, only eventually consistent with the DB state)
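
As a rough sketch of the first item (not the exact change in #2571), GORM exposes the underlying database/sql pool, whose limits can be widened; the values below are illustrative:

```go
import (
	"time"

	"gorm.io/gorm"
)

// tunePool widens the database/sql connection pool that sits underneath
// an existing *gorm.DB. The limits here are illustrative; they are not
// the values actually used in #2571.
func tunePool(gormDB *gorm.DB) error {
	sqlDB, err := gormDB.DB() // access the underlying *sql.DB
	if err != nil {
		return err
	}
	sqlDB.SetMaxOpenConns(10)           // more connections for concurrent readers
	sqlDB.SetMaxIdleConns(10)           // keep connections warm between requests
	sqlDB.SetConnMaxLifetime(time.Hour) // recycle long-lived connections
	return nil
}
```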

Although these changes had a positive impact, the overall outcome was still unsatisfactory. Inspired by your in-code comments and proposals on other issues (e.g., https://github.com/juanfont/headscale/blob/d7a503a34effa188e9bb27cb6b0fad2002112fb0/hscontrol/app.go#L529 and https://github.com/juanfont/headscale/issues/2571#issuecomment-2858966975), we decided to reevaluate the database access setup.

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

Our proposal concerns a new in-memory state component, sitting between the server and the database. This component should have the following properties:

  • Hold in memory all data relevant to headscale server operation that currently resides in the database
  • Persist state:
    • Initialize state from the DB
    • Write to the DB when necessary (to be determined)
  • Allow concurrent reads of the state

Example struct:

```go
// State holds the global state of the headscale server.
type State struct {
	// state is persisted in the database
	db *db.HSDatabase

	// concurrent reads are permitted
	mutex sync.RWMutex

	// ground truth data
	preauthKeys []types.PreAuthKey
	nodes       types.Nodes
	users       types.Users
}
```
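
As a sketch of the concurrency property, accessors could share the read lock for lookups and take the exclusive lock for mutations. The method names below are ours, not part of the proposal, and we assume types.Nodes is a slice of *types.Node:

```go
// ListNodes returns a shallow copy of the in-memory node list; any
// number of readers may hold the read lock at the same time.
func (s *State) ListNodes() types.Nodes {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	nodes := make(types.Nodes, len(s.nodes))
	copy(nodes, s.nodes) // copy the slice so later writers cannot reorder it under the caller
	return nodes
}

// AddNode mutates the state under the write lock; the corresponding DB
// write happens separately, per whichever update scheme below is chosen.
func (s *State) AddNode(node *types.Node) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.nodes = append(s.nodes, node)
}
```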

We propose one of the following DB-Update schemes:

  1. Immediate DB Updates: Update the DB immediately when the server updates the state.
  2. Scheduled DB Updates: Update the in-memory state immediately, and dump it to the DB every X seconds (e.g., every minute or two) or on SIGTERM.

While individual immediate DB updates are smaller and faster, they may result in a larger number of updates over time. Scheduled DB updates might be simpler to implement and result in more readable code, making them our preferred choice.
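
A minimal sketch of the scheduled scheme, assuming a hypothetical persistToDB method on the State struct above that writes the current in-memory snapshot back to the database through GORM:

```go
import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runFlusher periodically dumps the in-memory state to the database and
// performs a final dump on SIGTERM (scheme 2).
func (s *State) runFlusher(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	for {
		select {
		case <-ticker.C:
			s.persistToDB() // periodic dump, e.g. every minute or two
		case <-sigs:
			s.persistToDB() // final dump before shutdown
			return
		}
	}
}
```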

If you agree with our proposal, we (@aergus-tng @Enkelmann @CMS-TNG @JanisCasper and me, and potentially a few more colleagues) would like to implement this.

adam added the enhancement and no-stale-bot labels 2025-12-29 02:27:34 +01:00
adam closed this issue 2025-12-29 02:27:34 +01:00

@Codelica commented on GitHub (May 15, 2025):

Interesting. It seems that Web UIs (like Headscale-Admin) that poll the API continually for state changes would benefit as well.


@kradalby commented on GitHub (May 16, 2025):

> Although these changes had a positive impact, the overall outcome was still unsatisfactory.

I appreciate your help so far and I am sorry that it is still unsatisfactory.

We are aware that the database is a large bottleneck, and it is in our plans to address this. For now, we are on a very good trajectory for correcting a lot of inconsistencies and faults in Headscale, and the next items on the roadmap will continue to focus on this.

I have plans to start work on this in the future, but for now we will continue with our plan to "get things right, then make it faster", as there are a lot of moving parts involved and parts of the other changes will make this easier to achieve.

I imagine that we will get to it in 3-4 releases, after tags, autogroups, and tls/serve.

For now, this is an effort that we really need to be on top of ourselves: there are lots of moving parts, we have to maintain it over time, and we need to design it, so we will keep this on hold.
I am very happy for your efforts and will happily accept other smaller changes as you find them.

And when the time comes for this implementation, feedback and code review would be greatly appreciated!


@github-actions[bot] commented on GitHub (Aug 15, 2025):

This issue is stale because it has been open for 90 days with no activity.


@kradalby commented on GitHub (Sep 9, 2025):

An initial version of this has been merged in #2670. I have not benchmarked it, so it could potentially have moved the bottleneck around, but it is a step in the right direction.

Reference: starred/headscale#1020