[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #738

New Issue

adam · 2025-12-29T02:23:04+01:00

adam commented

2025-12-29 02:23:04 +01:00

Originally created by @nadongjun on GitHub (Jul 7, 2024).

Use case

Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale.

The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results.
The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results.
Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases.

Description

Benchmark Environment

A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances.
- Headscale Instance: t2.large instance deployed in the ap-northeast-2 region
- Tailscale Clients: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client)
  - The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).

Comparison Metrics and Results

The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate.

Total Client (600) Creation Time and Error Occurrence Rate
- SQLite: sqlite-node-list.json
  - Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145)
  - Successful Client Creations: 442
  - Errors Encountered: 158 (ERR: context deadline exceeded)
- PostgreSQL: postgres-node-list.json
  - Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120)
  - Successful Client Creations: 501
  - Errors Encountered: 99 (ERR: context deadline exceeded)
Error Occurrences
The following provides a verification of the errors that occurred in Issue #1966, including their occurrence rates:
- ERROR-1: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the Cannot create user: context deadline exceeded error.
- ERROR-2: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. This was not related to Headscale itself. With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline.
Profiling Results
- The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data.

Profiling Results: cpu

postgres

sqlite

Profiling Results: memory

postgres

sqlite

Conclusion

Client Creation Time: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations.
Error Occurrence Frequency: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite.
Profiling Performance: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite.

Recommendation

Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and config.yaml to include guidelines for selecting the appropriate database based on the use case.

Attachments

Headscale Node Lists:
- postgres-node-list.json
- sqlite-node-list.json
pprof Profiling and Heap Measurement Results:
- cpu-postgres.png
- cpu-sqlite.png
- heap-postgres.png
- heap-sqlite.png

Contribution

I can write the design doc for this feature
I can contribute this feature

How can it be implemented?

No response

Originally created by @nadongjun on GitHub (Jul 7, 2024). ### Use case Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale. - The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results. - The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results. - Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases. ### Description ### Benchmark Environment - A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances. - **Headscale Instance**: t2.large instance deployed in the ap-northeast-2 region - **Tailscale Clients**: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client) **- The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).** ### Comparison Metrics and Results The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate. 1. **Total Client (600) Creation Time and Error Occurrence Rate** - **SQLite**: [sqlite-node-list.json](https://github.com/user-attachments/files/16118216/sqlite-node-list.json) - Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145) - Successful Client Creations: 442 - Errors Encountered: 158 (ERR: context deadline exceeded) - **PostgreSQL**: [postgres-node-list.json](https://github.com/user-attachments/files/16118214/postgres-node-list.json) - Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120) - Successful Client Creations: 501 - Errors Encountered: 99 (ERR: context deadline exceeded) 2. **Error Occurrences** The following provides a verification of the errors that occurred in Issue #1966, including their occurrence rates: - **ERROR-1**: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the `Cannot create user: context deadline exceeded` error. - **ERROR-2**: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. **This was not related to Headscale itself.** With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline. 3. **Profiling Results** - The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data. ### Profiling Results: cpu #### postgres ![cpu-postgres](https://github.com/juanfont/headscale/assets/40987943/6e0fbf55-9746-427f-9461-9a25031e8088) #### sqlite ![cpu-sqlite](https://github.com/juanfont/headscale/assets/40987943/fe881ebe-0763-417d-afab-251f2946ccd0) ### Profiling Results: memory #### postgres ![heap-postgres](https://github.com/juanfont/headscale/assets/40987943/96728f78-9fe1-4544-bedb-0bb42b61bc5c) #### sqlite ![heap-sqlite](https://github.com/juanfont/headscale/assets/40987943/399f9bd1-ece2-4177-b43d-55c394bf074d) ### Conclusion - **Client Creation Time**: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations. - **Error Occurrence Frequency**: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite. - **Profiling Performance**: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite. ### Recommendation Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and `config.yaml` to include guidelines for selecting the appropriate database based on the use case. ### Attachments - **Headscale Node Lists**: - `postgres-node-list.json` - `sqlite-node-list.json` - **pprof Profiling and Heap Measurement Results**: - `cpu-postgres.png` - `cpu-sqlite.png` - `heap-postgres.png` - `heap-sqlite.png` ### Contribution - [X] I can write the design doc for this feature - [X] I can contribute this feature ### How can it be implemented? _No response_

adam added the enhancement stale labels 2025-12-29 02:23:04 +01:00

adam closed this issue

2025-12-29 02:23:04 +01:00

adam commented

2025-12-29 02:23:05 +01:00

@chriswiggins commented on GitHub (Jul 11, 2024):

Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)?

I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?

@chriswiggins commented on GitHub (Jul 11, 2024): Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)? I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?

adam commented

2025-12-29 02:23:05 +01:00

@CNLHC commented on GitHub (Jul 11, 2024):

Thanks for your meticulous benchmark.
It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔
According to your benchmark results, the gorm overhead seems unexpectedly high.

BTW, if you have raw profile data, maybe flamegraph will provide more insight.

@CNLHC commented on GitHub (Jul 11, 2024): Thanks for your meticulous benchmark. It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔 According to your benchmark results, the gorm overhead seems unexpectedly high. BTW, if you have raw profile data, maybe flamegraph will provide more insight.

adam commented

2025-12-29 02:23:05 +01:00

@nadongjun commented on GitHub (Jul 11, 2024):

@CNLHC

The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results.

Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that.

@nadongjun commented on GitHub (Jul 11, 2024): @CNLHC The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results. Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that. ![sqlite](https://github.com/juanfont/headscale/assets/40987943/353b8424-734d-456e-99d8-13b62b60cf7d)

adam commented

2025-12-29 02:23:06 +01:00

@kradalby commented on GitHub (Jul 13, 2024):

I've posted a comment touching on some of these topics here: https://github.com/juanfont/headscale/issues/1993#issuecomment-2226829944

@kradalby commented on GitHub (Jul 13, 2024): I've posted a comment touching on some of these topics here: https://github.com/juanfont/headscale/issues/1993#issuecomment-2226829944

adam commented

2025-12-29 02:23:06 +01:00

@github-actions[bot] commented on GitHub (Dec 26, 2024):

This issue is stale because it has been open for 90 days with no activity.

@github-actions[bot] commented on GitHub (Dec 26, 2024): This issue is stale because it has been open for 90 days with no activity.

adam commented

2025-12-29 02:23:06 +01:00

@github-actions[bot] commented on GitHub (Jan 2, 2025):

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions[bot] commented on GitHub (Jan 2, 2025): This issue was closed because it has been inactive for 14 days since being marked as stale.

adam referenced this issue

2025-12-29 02:30:56 +01:00

[PR #738] [MERGED] Implement TS2021 protocol in headscale #1610

Sign in to join this conversation.

Branches Tags

main

update_flake_lock_action

gh-pages

kradalby/3038-reg-panic

kradalby/release-v0.27.2

dependabot/go_modules/golang.org/x/crypto-0.45.0

dependabot/go_modules/github.com/opencontainers/runc-1.3.3

copilot/investigate-headscale-issue-2788

copilot/investigate-visibility-issue-2788

copilot/investigate-issue-2833

copilot/debug-issue-2846

copilot/fix-issue-2847

dependabot/go_modules/github.com/go-viper/mapstructure/v2-2.4.0

dependabot/go_modules/github.com/docker/docker-28.3.3incompatible

kradalby/cli-experiement3

doc/0.26.1

doc/0.25.1

doc/0.25.0

doc/0.24.3

doc/0.24.2

doc/0.24.1

doc/0.24.0

kradalby/build-docker-on-pr

topic/docu-versioning

topic/docker-kos

juanfont/fix-crash-node-id

juanfont/better-disclaimer

update-contributors

topic/prettier

revert-1893-add-test-stage-to-docs

add-test-stage-to-docs

remove-node-check-interval

fix-empty-prefix

fix-ephemeral-reusable

bug_report-debuginfo

autogroups

logs-to-stderr

revert-1414-topic/fix_unix_socket

rename-machine-node

port-embedded-derp-tests-v2

port-derp-tests

duplicate-word-linter

update-tailscale-1.36

warn-against-apache

ko-fi-link

more-acl-tests

fix-typo-standalone

parallel-nolint

tparallel-fix

rerouting

ssh-changelog-docs

oidc-cleanup

web-auth-flow-tests

kradalby-gh-runner

fix-proto-lint

remove-funding-links

go-1.19

enable-1.30-in-tests

0.16.x

cosmetic-changes-integration

tmp-fix-integration-docker

fix-integration-docker

configurable-update-interval

show-nodes-online

hs2021

acl-syntax-fixes

ts2021-implementation

fix-spurious-updates

unstable-integration-tests

mandatory-stun

embedded-derp

prtemplate-fix

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: starred/headscale#738