Netbox API: request failed with code 500 - Cannot assing <SimpleLazyObject: <django.contrib.auth.models.AnonymousUser>> #5568

New Issue

@tyler-8 commented on GitHub (Nov 1, 2021):

Threading in general should improve performance for most users by allowing gunicorn/NetBox to serve more users at a time more efficiently - especially on the HTTP GET side of things. This change would only impact write-operations (POST/PATCH/PUT), GETs are unaffected. Right now, anyone running gunicorn with threads = 1 would see zero difference, no matter the number of workers.

For users running threads > 1 they shouldn't see the error noted in OP (and other GitHub issues like #7696 ) but like all Python threading, there's likely a thread-count/per-worker of diminishing returns or where performance would suffer (though this is no different than before my change really). 2-4 threads per-worker seems like a good number since generally NetBox isn't very CPU intensive (at least on the gunicorn side).

My rough understanding is that signals and middleware have quirks when using a WSGI server with threaded workers. This change adds a lock around the change_logging context manager's tasks to make sure one request doesn't stomp on another somehow (which seems to be what was happening before). When there are concurrent write-operations happening, a lock introduces a slowdown, by design, to keep data accurate. Whether or not that difference is noticeable depends on the number of concurrent client connections doing write operations.

In the other two similar plugins I linked earlier, both rely on threading.Local() to ensure each thread has its own local storage for request data, which may also work, and require more complexity than I've gone with here. More testing is needed for sure though.

@tyler-8 commented on GitHub (Nov 1, 2021): Threading in general should _improve_ performance for most users by allowing gunicorn/NetBox to serve more users at a time more efficiently - especially on the HTTP GET side of things. This change would only impact write-operations (POST/PATCH/PUT), GETs are unaffected. Right now, anyone running gunicorn with `threads = 1` would see zero difference, no matter the number of workers. For users running threads > 1 they shouldn't see the error noted in OP (and other GitHub issues like #7696 ) but like all Python threading, there's likely a thread-count/per-worker of diminishing returns or where performance would suffer (though this is no different than before my change really). 2-4 threads per-worker seems like a good number since generally NetBox isn't very CPU intensive (at least on the gunicorn side). My rough understanding is that signals and middleware have quirks when using a WSGI server with threaded workers. This change adds a lock around the `change_logging` context manager's tasks to make sure one request doesn't stomp on another somehow (which _seems_ to be what was happening before). When there are concurrent write-operations happening, a lock introduces a slowdown, by design, to keep data accurate. Whether or not that difference is noticeable depends on the number of concurrent client connections doing write operations. In the other two similar plugins I linked earlier, both rely on `threading.Local()` to ensure each thread has its own local storage for request data, which may also work, and require more complexity than I've gone with here. More testing is needed for sure though.

adam commented

@jeremystretch commented on GitHub (Nov 1, 2021):

I apologize for not keeping up this as well as I should have. (I've been focused on v3.1 feature work lately.) Allow me to summarize the conversation so far.

Is the current change logging middleware implementation thread-safe?

The working assumption is that no, it is not, hence the error reported above. @tyler-8's work appears to confirm this.

Can the current implementation be made thread-safe?

Per @tyler-8's proposed modification, probably.

Would this change have a significant negative performance impact?

My understanding is that wrapping the request processing inside threading.Lock() essentially effects single-threaded operation, but I may be mistaken. I need to do some reading on this topic. (Thanks for the links above, Tyler!)

Would an alternative solution using thread-local storage (TLS) be preferable?

IIRC we originally used TLS for the change logging middleware but moved away from it to resolve a similar bug. However, a lot has changed since then, and it may have been an error in the implementation. It's certainly worth taking another look at. I'm happy to draw inspiration from django-crum as a proven solution, though I'm not sure it makes sense to take on as a dependency, primarily because we also need to store the webhook queue and likely similar per-request data in the future.

(Please correct me if any of the above is wrong.)

@jeremystretch commented on GitHub (Nov 1, 2021): I apologize for not keeping up this as well as I should have. (I've been focused on v3.1 feature work lately.) Allow me to summarize the conversation so far. 1. Is the current change logging middleware implementation thread-safe? The working assumption is that **no**, it is not, hence the error reported above. @tyler-8's work appears to confirm this. 2. Can the current implementation be made thread-safe? Per @tyler-8's [proposed modification](https://github.com/netbox-community/netbox/compare/develop...tyler-8:objectchange_threadsafe), probably. 3. Would this change have a significant negative performance impact? My understanding is that wrapping the request processing inside `threading.Lock()` essentially effects single-threaded operation, but I may be mistaken. I need to do some reading on this topic. (Thanks for the links above, Tyler!) 4. Would an alternative solution using thread-local storage (TLS) be preferable? IIRC we originally used TLS for the change logging middleware but moved away from it to resolve a similar bug. However, a lot has changed since then, and it may have been an error in the implementation. It's certainly worth taking another look at. I'm happy to draw inspiration from `django-crum` as a proven solution, though I'm not sure it makes sense to take on as a dependency, primarily because we also need to store the webhook queue and likely similar per-request data in the future. (Please correct me if any of the above is wrong.)

adam commented

@tyler-8 commented on GitHub (Nov 1, 2021):

I think it may be two truly separate issues related to threading; this one (and the previous ones like it "SimpleLazyObject") , and #7696 - which might be made thread-safe by simply using unique dispatch_uid's like `post_save.connect(handle_changed_object, dispatch_uid=f'handle_changed_object_{request.id}') - I just started testing with that.

It's possible that the threading.Lock() helped the mismatched author problem by ensuring only one series of signals were used at a time using the handle_changed_object name. If the UID is made truly unique, that issue looks like it may go away even with threads enabled and no Lock.

I certainly think so, just need to find the right combination. One solution isn't going to fix both problems I suspect.
The lock is a bottleneck, but wrapped around a potentially non-thread-safe operation, so the goal is to keep the amount of 'locked' work minimal. In this case it's only write operations that invoke the object change workflow. Even then - it's only for that one worker process, threads in other work processes don't share with threads in the "current" process. And as noted above, potentially solved another way for at least one of the two issues.
For the "SimpleLazyObject" issue, TLS may be required still. I need to do more test runs with the lock in place. "We" know how to re-create it (perform lots of concurrent write-operations with server threading enabled) but it doesn't happen 100% of the time - just due to the way the requests are being "load balanced" across the workers and threads.

@tyler-8 commented on GitHub (Nov 1, 2021): 1. I think it may be two truly separate issues related to threading; this one (and the previous ones like it "SimpleLazyObject") , and #7696 - which might be made thread-safe by simply using unique `dispatch_uid`'s like `post_save.connect(handle_changed_object, dispatch_uid=f'handle_changed_object_{request.id}') - I just started testing with that. It's possible that the `threading.Lock()` helped the mismatched author problem by ensuring only one series of signals were used at a time using the `handle_changed_object` name. If the UID is made truly unique, that issue looks like it may go away even with threads enabled and no Lock. 2. I certainly think so, just need to find the right combination. One solution isn't going to fix both problems I suspect. 3. The lock is a bottleneck, but wrapped around a potentially non-thread-safe operation, so the goal is to keep the amount of 'locked' work minimal. In this case it's only write operations that invoke the object change workflow. Even then - it's only for that one worker process, threads in other work processes don't share with threads in the "current" process. And as noted above, potentially solved another way for at least one of the two issues. 4. For the "SimpleLazyObject" issue, TLS may be required still. I need to do more test runs with the lock in place. "We" know how to re-create it (perform lots of concurrent write-operations with server threading enabled) but it doesn't happen 100% of the time - just due to the way the requests are being "load balanced" across the workers and threads.

adam commented

@tyler-8 commented on GitHub (Nov 1, 2021):

Posting here as well:
https://github.com/netbox-community/netbox/issues/7696#issuecomment-956396714

Initially, using truly unique dispatch_uid seemed to work through a couple test runs of the test_case.py script, but now I'm seeing the user-mixups. On the other hand the original branch https://github.com/tyler-8/netbox/tree/objectchange_threadsafe is still solving the issue consistently.

Locking seems to solve the mixed user IDs in the change log, still trying to get a script that can more consistently invoke the error in the OP of this issue so I can see if my branch (with locking) makes any difference.

@tyler-8 commented on GitHub (Nov 1, 2021): Posting here as well: https://github.com/netbox-community/netbox/issues/7696#issuecomment-956396714 > Initially, using truly unique `dispatch_uid` seemed to work through a couple test runs of the `test_case.py` script, but now I'm seeing the user-mixups. On the other hand the original branch https://github.com/tyler-8/netbox/tree/objectchange_threadsafe is still solving the issue consistently. Locking seems to solve the mixed user IDs in the change log, still trying to get a script that can more consistently invoke the error in the OP of this issue so I can see if my branch (with locking) makes any difference.

adam commented

@DanSheps commented on GitHub (Nov 1, 2021):

Got to say, thank you @tyler-8 for all the hard work. I do not understand threading in Python so it is definitely appreciated.

@DanSheps commented on GitHub (Nov 1, 2021): Got to say, thank you @tyler-8 for all the hard work. I do not understand threading in Python so it is definitely appreciated.

adam commented

@tyler-8 commented on GitHub (Nov 1, 2021):

Originally I was thinking #7696 was distinct from this one because while the test_case.py script from that issue showed the behavior of wrong users' being populated in the changelog, it didn't cause an HTTP 500 at any point. But it's been noted in #5142 & #5609 that the error itself is only raised in rare occasions. The real problem of wrong users being attributed to changes is happening whether or not the HTTP 500 is being raised.

Perhaps in the future a more elegant solution will be found, but based on the benchmarks, I think the addition of a Lock allows users to gain the GET performance boost/efficiency offered by threads and still maintain good changelog history. In the end, we're not losing much since the workaround has always been to set threads = 1 anyway.

@tyler-8 commented on GitHub (Nov 1, 2021): Originally I was thinking #7696 was distinct from this one because while the `test_case.py` script from that issue showed the behavior of wrong users' being populated in the changelog, it didn't cause an HTTP 500 at any point. But it's been noted in #5142 & #5609 that the error itself is only raised in rare occasions. The real problem of wrong users being attributed to changes is happening whether or not the HTTP 500 is being raised. Perhaps in the future a more elegant solution will be found, but based on the benchmarks, I think the addition of a Lock allows users to gain the GET performance boost/efficiency offered by threads and still maintain good changelog history. In the end, we're not losing much since the workaround has always been to set `threads = 1` anyway.

adam commented

@robertpenz commented on GitHub (Nov 2, 2021):

Thx for the quick fix, we'll test it and get back to you.

@robertpenz commented on GitHub (Nov 2, 2021): Thx for the quick fix, we'll test it and get back to you.

adam commented

@alinmear commented on GitHub (Nov 3, 2021):

We patched our Installation with changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe and the threadlock. We updated our worker / thread gunicorn config:

workers = 20
threads = 20

The problem could not be reproduced so far. We are keeping an eye on it.

@alinmear commented on GitHub (Nov 3, 2021): We patched our Installation with changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe and the threadlock. We updated our worker / thread gunicorn config: ``` workers = 20 threads = 20 ``` The problem could not be reproduced so far. We are keeping an eye on it.

adam commented

@tyler-8 commented on GitHub (Nov 3, 2021):

Re-posting here since it's relevant.

Benchmarks using a modified test_case.py (from https://github.com/netbox-community/netbox/issues/7696#issue-1040078126) (ThreadPoolExecutor instead of ProcessPoolExecutor). Six executions of the script with each config.

Test Scenario 1
Client: 7 threads, 1 user per thread
Server: 8 processes (essentially, no thread-locks)
1:09.33
1:03.45
1:00.36
1:00.26
0:55.067
1:00.82

Test Scenario 2
Client: 7 threads, 1 user per thread
Server: 2 processes, 5 threads
1:40.41
1:40.45
1:43.61
1:41.18
1:39.32
1:40.59

Test Scenario 3
Client: 7 threads, 1 user per thread
Server: 5 processes, 3 threads
1:19.74
1:04.45
1:05.68
1:10.65
1:08.53
1:04.65

Test Scenario 4
Client: 7 threads, 1 user per thread
Server: 4 processes, 4 threads
1:02.83
1:02.15
1:44.20
0:59.19
1:22.28
1:00.38

@tyler-8 commented on GitHub (Nov 3, 2021): Re-posting here since it's relevant. Benchmarks using a modified `test_case.py` (from https://github.com/netbox-community/netbox/issues/7696#issue-1040078126) (ThreadPoolExecutor instead of ProcessPoolExecutor). Six executions of the script with each config. ``` Test Scenario 1 Client: 7 threads, 1 user per thread Server: 8 processes (essentially, no thread-locks) 1:09.33 1:03.45 1:00.36 1:00.26 0:55.067 1:00.82 Test Scenario 2 Client: 7 threads, 1 user per thread Server: 2 processes, 5 threads 1:40.41 1:40.45 1:43.61 1:41.18 1:39.32 1:40.59 Test Scenario 3 Client: 7 threads, 1 user per thread Server: 5 processes, 3 threads 1:19.74 1:04.45 1:05.68 1:10.65 1:08.53 1:04.65 Test Scenario 4 Client: 7 threads, 1 user per thread Server: 4 processes, 4 threads 1:02.83 1:02.15 1:44.20 0:59.19 1:22.28 1:00.38 ```

adam commented

@DanSheps commented on GitHub (Nov 3, 2021):

Correct me if I am wrong, that seems to indicate a significant hit to performance if threads are enabled.

@DanSheps commented on GitHub (Nov 3, 2021): Correct me if I am wrong, that seems to indicate a significant hit to performance if threads are enabled.

adam commented

@tyler-8 commented on GitHub (Nov 3, 2021):

2 processes, 5 threads (2p5t) is pretty weak-sauce in terms of a deployment scale, I wouldn't go that route unless you had a really good reason. I used it more to force a thread-heavy workflow.

The results for 5p3t and 4p4t are nearly the same times as no threads, with a couple hiccups with much higher times. They're also more reasonable and common deployment configurations I would imagine (the former being the default recommended config). The benchmark also doesn't cover requests/per-second. If you had heavy GET-concurrency you're going to see lots of benefit from using threads. The benchmark script is almost entirely write operations, on the same app.model.

What I didn't benchmark is 5p3t without a lock at all. Because 5p3t with no lock leads to inaccurate changelog data so isn't a reasonable comparison, IMO. Broken and faster < correct and slower

Using threads is going to make for more efficient memory & CPU usage. If your machine had 8 CPUs, you could run 8 server processes with 3-4 threads and would be able to handle an absurd amount of concurrent requests.

I'm sure there's a more elegant way long-term solution using threading.locals but that's going to require much more complexity to implement. This lock at the very least ensures changelog data is accurate without changing much code, which could be a huge concern to some users.

@tyler-8 commented on GitHub (Nov 3, 2021): 2 processes, 5 threads (2p5t) is pretty weak-sauce in terms of a deployment scale, I wouldn't go that route unless you had a really good reason. I used it more to force a thread-heavy workflow. The results for 5p3t and 4p4t are nearly the same times as no threads, with a couple hiccups with much higher times. They're also more reasonable and common deployment configurations I would imagine (the former being the [default recommended config](https://github.com/netbox-community/netbox/blob/7b1335825be66074ef98db7bd96a5f85d8b42810/contrib/gunicorn.py#L6)). The benchmark also doesn't cover requests/per-second. If you had heavy GET-concurrency you're going to see lots of benefit from using threads. The benchmark script is almost entirely write operations, on the same app.model. What I didn't benchmark is 5p3t without a lock at all. Because 5p3t with no lock leads to inaccurate changelog data so isn't a reasonable comparison, IMO. ` Broken and faster < correct and slower` Using threads is going to make for more efficient memory & CPU usage. If your machine had 8 CPUs, you could run 8 server processes with 3-4 threads and would be able to handle an absurd amount of concurrent requests. I'm sure there's a more elegant way long-term solution using `threading.locals` but that's going to require much more complexity to implement. This lock at the very least ensures changelog data is accurate without changing much code, which could be a huge concern to some users.

adam commented

@DanSheps commented on GitHub (Nov 3, 2021):

Yes, but in almost all instances, it seems 7 workers, 1 thread seems to beat out the threading as far as performance.

@DanSheps commented on GitHub (Nov 3, 2021): Yes, but in almost all instances, it seems 7 workers, 1 thread seems to beat out the threading as far as performance.

adam commented

@kkthxbye-code commented on GitHub (Nov 3, 2021):

It is to be expected that multiprocessing is faster than threading in Python, but with higher memory consumption.

@kkthxbye-code commented on GitHub (Nov 3, 2021): It is to be expected that multiprocessing is faster than threading in Python, but with higher memory consumption.

adam commented

@tyler-8 commented on GitHub (Nov 4, 2021):

Yes, but in almost all instances, it seems 7 workers, 1 thread seems to beat out the threading as far as performance.

I tweaked the benchmark labeling a little bit more for clarity. The "7" refers to the numbers of threads being used by the client, not the server.

Client: 7 threads, 1 user per thread
Server: 2 processes, 5 threads

Means that the server was receiving ~7 requests at a time, and had 10 threads available (2 processes, 5 threads, or 2p5t shorthand) to generate a response. 5p3t means 15 threads available to respond, 4p4t means 16 threads available. In Test Scenario 1, the server only had 8 threads (8p1t) available to respond.

So with the client load being equal in all four scenarios, Test Scenarios 3 & 4 both performed very close to the times seen with Test Scenario 1 - and still provided the same safe & accurate changelog data. Those couple instances where the times spiked up in 3 & 4 could've been other stuff (security and what not) running on my machine - these tests were all done locally on my laptop.

8 processes and 1 thread per-process is terribly inefficient - both in CPU usage and a higher memory footprint as @kkthxbye-code noted. And because NetBox's web-facing portion (running under gunicorn) is mostly network IO (between server and database, server and redis, and server and client) not using threading would be a "waste" of resources.

@tyler-8 commented on GitHub (Nov 4, 2021): > Yes, but in almost all instances, it seems 7 workers, 1 thread seems to beat out the threading as far as performance. I tweaked the benchmark labeling a little bit more for clarity. The "7" refers to the numbers of threads being used by the _client_, not the server. > Client: 7 threads, 1 user per thread > Server: 2 processes, 5 threads Means that the server was receiving ~7 requests at a time, and had 10 threads available (2 processes, 5 threads, or 2p5t shorthand) to generate a response. 5p3t means 15 threads available to respond, 4p4t means 16 threads available. In Test Scenario 1, the server only had 8 threads (8p1t) available to respond. So with the client load being equal in all four scenarios, Test Scenarios 3 & 4 both performed very close to the times seen with Test Scenario 1 - and still provided the same safe & accurate changelog data. Those couple instances where the times spiked up in 3 & 4 could've been other stuff (security and what not) running on my machine - these tests were all done locally on my laptop. 8 processes and 1 thread per-process is terribly inefficient - both in CPU usage and a higher memory footprint as @kkthxbye-code noted. And because NetBox's web-facing portion (running under gunicorn) is mostly network IO (between server and database, server and redis, and server and client) _not_ using threading would be a "waste" of resources.

adam commented

@alinmear commented on GitHub (Nov 11, 2021):

Just for the record: we are running our installation with the changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe till now and have no problems regarding threading so far.

workers = 20
threads = 20

@alinmear commented on GitHub (Nov 11, 2021): Just for the record: we are running our installation with the changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe till now and have no problems regarding threading so far. ``` workers = 20 threads = 20 ```

adam commented

@jeremystretch commented on GitHub (Nov 17, 2021):

Finally found some time to really dig into this today. I moved the request object and webhook queue into thread-local storage, which should resolve the "cross talk" among threads without significantly impacting multi-threaded performance.

I've opened draft PR #7861 to propose my changes. Please test and let me know if you encounter any issues.

And a huge thank you to @alinmear, @drygdryg, @tyler-8, @robertpenz, and everyone else who helped identify and resolve this bug!

@jeremystretch commented on GitHub (Nov 17, 2021): Finally found some time to really dig into this today. I moved the request object and webhook queue into thread-local storage, which should resolve the "cross talk" among threads without significantly impacting multi-threaded performance. I've opened draft PR #7861 to propose my changes. Please test and let me know if you encounter any issues. And a huge thank you to @alinmear, @drygdryg, @tyler-8, @robertpenz, and everyone else who helped identify and resolve this bug!

adam commented

@tyler-8 commented on GitHub (Nov 22, 2021):

I did some testing against #7861 (commit) using test_case.py (source). Surprisingly, thread-local storage was significantly slower. I also tried increasing the number of threads to see if that narrowed the gap any and the results stayed about the same. Lowering the processes to 4 and keeping 6 threads widened the delta even more.

Test Scenario 5
Client: 7 threads, 1 user per thread
Server: 5 processes, 3 threads

Thread local storage
02:35.46
02:32.61
02:36.14

Locking
01:05.09
01:25.48
01:09.82

---------------------------
Test Scenario 6
Client: 7 threads, 1 user per thread
Server: 5 processes, 6 threads

Thread local storage
02:27.68
02:33.90
02:23.75

Locking
01:05.33
01:06.68
01:20.28

---------------------------
Test Scenario 7
Client: 7 threads, 1 user per thread
Server: 4 processes, 6 threads

Thread local storage
3:07.36

Locking
01:05.68

@tyler-8 commented on GitHub (Nov 22, 2021): I did some testing against #7861 ([commit](https://github.com/netbox-community/netbox/pull/7861/commits/0d600995883ce3ec5758f850d5fc7fc2f823e0d1)) using `test_case.py` ([source](https://github.com/netbox-community/netbox/issues/7696#issue-1040078126)). Surprisingly, thread-local storage was significantly slower. I also tried increasing the number of threads to see if that narrowed the gap any and the results stayed about the same. Lowering the processes to 4 and keeping 6 threads widened the delta even more. ``` Test Scenario 5 Client: 7 threads, 1 user per thread Server: 5 processes, 3 threads Thread local storage 02:35.46 02:32.61 02:36.14 Locking 01:05.09 01:25.48 01:09.82 --------------------------- Test Scenario 6 Client: 7 threads, 1 user per thread Server: 5 processes, 6 threads Thread local storage 02:27.68 02:33.90 02:23.75 Locking 01:05.33 01:06.68 01:20.28 --------------------------- Test Scenario 7 Client: 7 threads, 1 user per thread Server: 4 processes, 6 threads Thread local storage 3:07.36 Locking 01:05.68 ```

adam commented

@jeremystretch commented on GitHub (Nov 22, 2021):

I did a quick local test using four Gunicorn workers with six threads each, and ran the test script five times per branch. Here are the average times for each:

develop (current release): 52.082s
7657-threadsafe-changelog (TLS): 47.602s
objectchange_threadsafe (thread locking): 58.10s

However, I noticed remarkable deviation between test runs within branches, so I'm going to take some time soon to set up a dedicated testing instance and repeat these tests. Will report back once that's been completed.

@jeremystretch commented on GitHub (Nov 22, 2021): I did a quick local test using four Gunicorn workers with six threads each, and ran the test script five times per branch. Here are the average times for each: `develop` (current release): 52.082s `7657-threadsafe-changelog` (TLS): 47.602s `objectchange_threadsafe` (thread locking): 58.10s However, I noticed remarkable deviation between test runs within branches, so I'm going to take some time soon to set up a dedicated testing instance and repeat these tests. Will report back once that's been completed.

adam commented

@tyler-8 commented on GitHub (Nov 23, 2021):

I spun a couple external VMs as I'd been running all of my tests on my Macbook - which doesn't exactly have a ton of CPU cores.

VM1 has 12 CPU Cores, 12GB RAM and is running NetBox, Postgresql, and Redis.
VM2 has 4 CPU Cores, 4GB of RAM and is only running the test_case.py script.

I did 10 test runs against each branch, here are the results

@tyler-8 commented on GitHub (Nov 23, 2021): I spun a couple external VMs as I'd been running all of my tests on my Macbook - which doesn't exactly have a ton of CPU cores. VM1 has 12 CPU Cores, 12GB RAM and is running NetBox, Postgresql, and Redis. VM2 has 4 CPU Cores, 4GB of RAM and is only running the test_case.py script. I did 10 test runs against each branch, here are the results ![image](https://user-images.githubusercontent.com/17618971/142950273-0a14119b-9128-4b0b-92f7-749cbaad88d0.png)

adam commented

@kkthxbye-code commented on GitHub (Nov 23, 2021):

I was interested in how the performance looked in netbox-docker with the changes. Not sure if useful but here they are:

Preface: netbox-docker uses nginx unit as an application server instead of uwsgi. The default configuration is 4 processes and 1 process per thread.

Testing was done using tyler-8's script and timed with perf with 10 runs. The output is the average runtime and deviation. Run on a i7-9700 (8 cores no HT).

4 processes 1 threads per process:

Locking:
5,2421 +- 0,0892 seconds time elapsed ( +- 1,70% )

TLS:
5,2254 +- 0,0476 seconds time elapsed ( +- 0,91% )

Develop:
5,2388 +- 0,0488 seconds time elapsed ( +- 0,93% )

5 processes 6 workers per process:

Locking:
6,501 +- 0,574 seconds time elapsed ( +- 8,83% )

TLS:
7,610 +- 0,634 seconds time elapsed ( +- 8,33% )

Develop:
8,466 +- 0,527 seconds time elapsed ( +- 6,22% )

Not sure what can be gleaned from this. I don't think people really change the default processes/threads in netbox-docker as it is not exposed, and you have to map your own nginx-unit configuration in. The default setup in netbox-docker doesn't seem to be affected and seems more stable then using a high number of threads.

Personally we are not memory constrained in our netbox deployment, so using threads seem to be pure downside for us anyway, at least when using the docker image.

@kkthxbye-code commented on GitHub (Nov 23, 2021): I was interested in how the performance looked in netbox-docker with the changes. Not sure if useful but here they are: Preface: netbox-docker uses nginx unit as an application server instead of uwsgi. The default configuration is 4 processes and 1 process per thread. Testing was done using tyler-8's script and timed with perf with 10 runs. The output is the average runtime and deviation. Run on a i7-9700 (8 cores no HT). ### 4 processes 1 threads per process: Locking: 5,2421 +- 0,0892 seconds time elapsed ( +- 1,70% ) TLS: 5,2254 +- 0,0476 seconds time elapsed ( +- 0,91% ) Develop: 5,2388 +- 0,0488 seconds time elapsed ( +- 0,93% ) ### 5 processes 6 workers per process: Locking: 6,501 +- 0,574 seconds time elapsed ( +- 8,83% ) TLS: 7,610 +- 0,634 seconds time elapsed ( +- 8,33% ) Develop: 8,466 +- 0,527 seconds time elapsed ( +- 6,22% ) Not sure what can be gleaned from this. I don't think people really change the default processes/threads in netbox-docker as it is not exposed, and you have to map your own nginx-unit configuration in. The default setup in netbox-docker doesn't seem to be affected and seems more stable then using a high number of threads. Personally we are not memory constrained in our netbox deployment, so using threads seem to be pure downside for us anyway, at least when using the docker image.

adam commented

@jeremystretch commented on GitHub (Nov 23, 2021):

The issue with the thread locking approach is that is effects single-thread operation for most of the request:

with changelog_lock:
    # Bulk of the request is processed here

To illustrate this, I added time.sleep(5) immediately before the yield statement in the change logging context manager to simulate a compute-intensive request. I ran my tests from earlier with Gunicorn configured for a single worker with six threads to maximize the impact of single-threaded operation.

Under the thread-local storage approach, a test run took 2m16s. With thread locking, it took 13m23, because each request was processed essentially one at a time.

After changing the Gunicorn configuration to 6 workers with a single thread each, both approaches take around 2m15s, which is to be expected.

Obviously there are a lot of variables to consider, such as the ration of WSGI worker processes to threads per worker, as well as the nature of each individual request. However I think these results are sufficient to rule out thread locking as a viable solution. Of course, I'd be interested to hear if anyone finds differently with a similar test.

@jeremystretch commented on GitHub (Nov 23, 2021): The issue with the thread locking approach is that is effects single-thread operation for most of the request: ```python with changelog_lock: # Bulk of the request is processed here ``` To illustrate this, I added `time.sleep(5)` immediately before the `yield` statement in the change logging context manager to simulate a compute-intensive request. I ran my tests from earlier with Gunicorn configured for a single worker with six threads to maximize the impact of single-threaded operation. Under the thread-local storage approach, a test run took 2m16s. With thread locking, it took 13m23, because each request was processed essentially one at a time. After changing the Gunicorn configuration to 6 workers with a single thread each, both approaches take around 2m15s, which is to be expected. Obviously there are a lot of variables to consider, such as the ration of WSGI worker processes to threads per worker, as well as the nature of each individual request. However I think these results are sufficient to rule out thread locking as a viable solution. Of course, I'd be interested to hear if anyone finds differently with a similar test.

adam commented

@tyler-8 commented on GitHub (Nov 23, 2021):

Given the relatively small difference in times, TLS is absolutely the superior solution - even if it comes at a tiny performance hit under a write-heavy workload. It will scale better for different write workloads.

My original tests against the TLS branch were somewhat faulty since my Macbook was basically already oversubscribed on processes vs CPU cores.

@tyler-8 commented on GitHub (Nov 23, 2021): Given the relatively small difference in times, TLS is absolutely the superior solution - even if it comes at a tiny performance hit under a write-heavy workload. It will scale better for different write workloads. My original tests against the TLS branch were somewhat faulty since my Macbook was basically already oversubscribed on processes vs CPU cores.

adam commented

@jeremystretch commented on GitHub (Nov 24, 2021):

I'm going to proceed with the TLS implementation for now, but will be paying careful attention to any feedback from the change. This also cleans up a fair bit of hoop-jumping within the middleware, which is nice.

@jeremystretch commented on GitHub (Nov 24, 2021): I'm going to proceed with the TLS implementation for now, but will be paying careful attention to any feedback from the change. This also cleans up a fair bit of hoop-jumping within the middleware, which is nice.

adam commented