[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #738

Closed
opened 2025-12-29 02:23:04 +01:00 by adam · 6 comments
Owner

Originally created by @nadongjun on GitHub (Jul 7, 2024).

Use case

Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale.

  • The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results.
  • The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results.
  • Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases.

Description

Benchmark Environment

  • A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances.
    • Headscale Instance: t2.large instance deployed in the ap-northeast-2 region
    • Tailscale Clients: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client)
      - The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).

Comparison Metrics and Results

The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate.

  1. Total Client (600) Creation Time and Error Occurrence Rate

    • SQLite: sqlite-node-list.json
      • Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145)
      • Successful Client Creations: 442
      • Errors Encountered: 158 (ERR: context deadline exceeded)
    • PostgreSQL: postgres-node-list.json
      • Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120)
      • Successful Client Creations: 501
      • Errors Encountered: 99 (ERR: context deadline exceeded)
  2. Error Occurrences
    The following provides a verification of the errors that occurred in Issue #1966, including their occurrence rates:

    • ERROR-1: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the Cannot create user: context deadline exceeded error.
    • ERROR-2: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. This was not related to Headscale itself. With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline.
  3. Profiling Results

    • The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data.

Profiling Results: cpu

postgres

cpu-postgres

sqlite

cpu-sqlite

Profiling Results: memory

postgres

heap-postgres

sqlite

heap-sqlite

Conclusion

  • Client Creation Time: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations.
  • Error Occurrence Frequency: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite.
  • Profiling Performance: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite.

Recommendation

Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and config.yaml to include guidelines for selecting the appropriate database based on the use case.

Attachments

  • Headscale Node Lists:
    • postgres-node-list.json
    • sqlite-node-list.json
  • pprof Profiling and Heap Measurement Results:
    • cpu-postgres.png
    • cpu-sqlite.png
    • heap-postgres.png
    • heap-sqlite.png

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

No response

Originally created by @nadongjun on GitHub (Jul 7, 2024). ### Use case Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale. - The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results. - The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results. - Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases. ### Description ### Benchmark Environment - A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances. - **Headscale Instance**: t2.large instance deployed in the ap-northeast-2 region - **Tailscale Clients**: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client) **- The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).** ### Comparison Metrics and Results The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate. 1. **Total Client (600) Creation Time and Error Occurrence Rate** - **SQLite**: [sqlite-node-list.json](https://github.com/user-attachments/files/16118216/sqlite-node-list.json) - Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145) - Successful Client Creations: 442 - Errors Encountered: 158 (ERR: context deadline exceeded) - **PostgreSQL**: [postgres-node-list.json](https://github.com/user-attachments/files/16118214/postgres-node-list.json) - Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120) - Successful Client Creations: 501 - Errors Encountered: 99 (ERR: context deadline exceeded) 2. **Error Occurrences** The following provides a verification of the errors that occurred in Issue #1966, including their occurrence rates: - **ERROR-1**: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the `Cannot create user: context deadline exceeded` error. - **ERROR-2**: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. **This was not related to Headscale itself.** With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline. 3. **Profiling Results** - The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data. ### Profiling Results: cpu #### postgres ![cpu-postgres](https://github.com/juanfont/headscale/assets/40987943/6e0fbf55-9746-427f-9461-9a25031e8088) #### sqlite ![cpu-sqlite](https://github.com/juanfont/headscale/assets/40987943/fe881ebe-0763-417d-afab-251f2946ccd0) ### Profiling Results: memory #### postgres ![heap-postgres](https://github.com/juanfont/headscale/assets/40987943/96728f78-9fe1-4544-bedb-0bb42b61bc5c) #### sqlite ![heap-sqlite](https://github.com/juanfont/headscale/assets/40987943/399f9bd1-ece2-4177-b43d-55c394bf074d) ### Conclusion - **Client Creation Time**: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations. - **Error Occurrence Frequency**: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite. - **Profiling Performance**: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite. ### Recommendation Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and `config.yaml` to include guidelines for selecting the appropriate database based on the use case. ### Attachments - **Headscale Node Lists**: - `postgres-node-list.json` - `sqlite-node-list.json` - **pprof Profiling and Heap Measurement Results**: - `cpu-postgres.png` - `cpu-sqlite.png` - `heap-postgres.png` - `heap-sqlite.png` ### Contribution - [X] I can write the design doc for this feature - [X] I can contribute this feature ### How can it be implemented? _No response_
adam added the enhancementstale labels 2025-12-29 02:23:04 +01:00
adam closed this issue 2025-12-29 02:23:04 +01:00
Author
Owner

@chriswiggins commented on GitHub (Jul 11, 2024):

Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)?

I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?

@chriswiggins commented on GitHub (Jul 11, 2024): Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)? I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?
Author
Owner

@CNLHC commented on GitHub (Jul 11, 2024):

Thanks for your meticulous benchmark.
It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔
According to your benchmark results, the gorm overhead seems unexpectedly high.

BTW, if you have raw profile data, maybe flamegraph will provide more insight.

@CNLHC commented on GitHub (Jul 11, 2024): Thanks for your meticulous benchmark. It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔 According to your benchmark results, the gorm overhead seems unexpectedly high. BTW, if you have raw profile data, maybe flamegraph will provide more insight.
Author
Owner

@nadongjun commented on GitHub (Jul 11, 2024):

@CNLHC

The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results.

Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that.

sqlite

@nadongjun commented on GitHub (Jul 11, 2024): @CNLHC The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results. Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that. ![sqlite](https://github.com/juanfont/headscale/assets/40987943/353b8424-734d-456e-99d8-13b62b60cf7d)
Author
Owner

@kradalby commented on GitHub (Jul 13, 2024):

I've posted a comment touching on some of these topics here: https://github.com/juanfont/headscale/issues/1993#issuecomment-2226829944

@kradalby commented on GitHub (Jul 13, 2024): I've posted a comment touching on some of these topics here: https://github.com/juanfont/headscale/issues/1993#issuecomment-2226829944
Author
Owner

@github-actions[bot] commented on GitHub (Dec 26, 2024):

This issue is stale because it has been open for 90 days with no activity.

@github-actions[bot] commented on GitHub (Dec 26, 2024): This issue is stale because it has been open for 90 days with no activity.
Author
Owner

@github-actions[bot] commented on GitHub (Jan 2, 2025):

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions[bot] commented on GitHub (Jan 2, 2025): This issue was closed because it has been inactive for 14 days since being marked as stale.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/headscale#738