Mirror of https://github.com/netbox-community/netbox.git (synced 2026-01-14 14:23:32 +01:00)
Netbox API: request failed with code 500 - Cannot assign <SimpleLazyObject: <django.contrib.auth.models.AnonymousUser>> #5568
Closed · opened 2025-12-29 19:29:29 +01:00 by adam · 33 comments
Originally created by @alinmear on GitHub (Oct 27, 2021).
Originally assigned to: @jeremystretch on GitHub.
NetBox version: 2.11.6
Python version: 3.7
Steps to Reproduce
We have multiple scripts running against our NetBox instance. Some of them use read-only tokens, others read/write tokens. From time to time the following exception occurs:
Afterwards we see changes on objects whose change_user is the user of the read-only token. Maybe there is some kind of race condition. Does anybody have an idea how to investigate this problem further?
Expected Behavior
No exception occurs, and change users are not mixed up with the read-only user.
Observed Behavior
The exception occurs, and the change_user is set to the read-only token's user.
@jeremystretch commented on GitHub (Oct 27, 2021):
This keeps coming up, and unfortunately has proven quite difficult to squash. How are you running NetBox? With gunicorn or uWSGI? Can you share your WSGI configuration?
The current understanding (I think) is that this bug can be side-stepped by limiting each worker to a single thread. @tyler-8 please correct me if that's wrong.
@robertpenz commented on GitHub (Oct 27, 2021):
We are running on gunicorn with Apache as a reverse proxy terminating TLS.
gunicorn is running with the following:
workers = 9
threads = 9
Should we switch to uWSGI?
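For reference, the setup described above corresponds to a gunicorn configuration file roughly like the following sketch. Only the workers/threads values come from this thread; the file name and bind address are assumptions.

```python
# gunicorn.conf.py -- minimal sketch of the setup described above (not the actual config).
bind = "127.0.0.1:8001"   # assumed: Apache terminates TLS and proxies requests here
workers = 9               # WSGI worker processes
threads = 9               # threads per worker -- the combination under which the bug appears
```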
@jeremystretch commented on GitHub (Oct 27, 2021):
Try setting threads = 1.
@tyler-8 commented on GitHub (Oct 27, 2021):
Correct - one thread per worker-process on the NetBox server seems to resolve the issue for multi-threaded clients.
My hunch is that there's something that isn't thread-safe in the handling of either user API tokens for creating objects, or the object changelog process that's initiated from doing a write operation. At one point I was trying to test with uwsgi vs gunicorn on this issue to see if it made any difference, but haven't had a chance to pick up that thread (lol) again.
@robertpenz commented on GitHub (Oct 28, 2021):
Thx, we will try that. We're quite able to reproduce the error with a stress test. We have one script that loops over all devices/VMs and puts them into a queue. The queue is worked on by 5-10 threads, which first check whether the device should be monitored and, if so, loop over all interfaces of that switch/router and import them into the monitoring system. (It does a little bit more, but that sums it up nicely.) That script on its own works without a problem, but when another script is changing something at the same time, we run into the problem. The easiest way to trigger it was to run a single-threaded script that just changes the dns_name or description of all IP addresses.
When you do that, you get two possible outcomes.
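To make the reproduction scenario concrete, here is a rough sketch of the kind of concurrent write load described above, using plain requests against the standard NetBox REST API. The URL, token, thread count, and field values are placeholders, and the real scripts do considerably more than this.

```python
# Hypothetical stress script: many client threads PATCHing objects concurrently,
# the kind of load that exposes the changelog race described above.
from concurrent.futures import ThreadPoolExecutor

import requests

NETBOX_URL = "https://netbox.example.com"               # placeholder
RW_TOKEN = "0123456789abcdef0123456789abcdef01234567"   # placeholder read/write token


def make_session(token: str) -> requests.Session:
    s = requests.Session()
    s.headers.update({"Authorization": f"Token {token}", "Accept": "application/json"})
    return s


def update_description(ip: dict) -> None:
    """PATCH one IP address, mirroring the 'change the description of all IPs' script."""
    s = make_session(RW_TOKEN)
    r = s.patch(
        f"{NETBOX_URL}/api/ipam/ip-addresses/{ip['id']}/",
        json={"description": f"stress test {ip['id']}"},
    )
    r.raise_for_status()


if __name__ == "__main__":
    s = make_session(RW_TOKEN)
    ips = s.get(f"{NETBOX_URL}/api/ipam/ip-addresses/?limit=500").json()["results"]
    # 5-10 worker threads, as in the scenario described above
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(update_description, ips))
```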
@tyler-8 commented on GitHub (Oct 31, 2021):
@alinmear @robertpenz would you both try this branch and see if you can still re-create the issue?
https://github.com/tyler-8/netbox/tree/objectchange_threadsafe
[Edit] Doesn't look like it's working as consistently as I initially thought.
@tyler-8 commented on GitHub (Oct 31, 2021):
I think the only way forward to fixing this for real is to use threading.local(), similar to https://github.com/jazzband/django-auditlog or https://github.com/ninemoreminutes/django-crum/ (or just use django-crum itself, depending on if @jeremystretch wants to take on a dependency).
Related reading I went through while digging into this:
@tyler-8 commented on GitHub (Nov 1, 2021):
Okay - try the branch again with the current commit https://github.com/tyler-8/netbox/tree/objectchange_threadsafe . I've implemented a thread-lock for the change_logging context manager. Signal connections themselves aren't thread-safe, at least not in the way they're being used for the object change log.
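To illustrate what a lock around the change_logging context manager amounts to, here is a generic sketch, not the code from the branch above; the receiver body and lock placement are simplified for illustration.

```python
import threading
from contextlib import contextmanager

from django.db.models.signals import post_save

_change_logging_lock = threading.Lock()


def handle_changed_object(sender, instance, **kwargs):
    """Placeholder receiver; the real handler records an ObjectChange entry."""


@contextmanager
def change_logging(request):
    # Serialize signal wiring (and everything that happens while the handlers
    # are connected) behind a process-wide lock, so concurrent requests in
    # different threads can't stomp on each other's handlers.
    with _change_logging_lock:
        post_save.connect(handle_changed_object, dispatch_uid="handle_changed_object")
        try:
            # The lock is held while the caller's request body runs -- the
            # single-threading trade-off discussed later in this thread.
            yield
        finally:
            post_save.disconnect(dispatch_uid="handle_changed_object")
```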
@DanSheps commented on GitHub (Nov 1, 2021):
@tyler-8 What effect might this have on NetBox overall (performance, etc.)?
@tyler-8 commented on GitHub (Nov 1, 2021):
Threading in general should improve performance for most users by allowing gunicorn/NetBox to serve more users at a time more efficiently - especially on the HTTP GET side of things. This change would only impact write operations (POST/PATCH/PUT); GETs are unaffected. Right now, anyone running gunicorn with threads = 1 would see zero difference, no matter the number of workers. Users running threads > 1 shouldn't see the error noted in the OP (and other GitHub issues like #7696), but like all Python threading, there's likely a per-worker thread count beyond which returns diminish or performance suffers (though this is no different than before my change, really). 2-4 threads per worker seems like a good number, since generally NetBox isn't very CPU-intensive (at least on the gunicorn side).
My rough understanding is that signals and middleware have quirks when using a WSGI server with threaded workers. This change adds a lock around the change_logging context manager's tasks to make sure one request doesn't stomp on another somehow (which seems to be what was happening before). When there are concurrent write operations happening, a lock introduces a slowdown, by design, to keep data accurate. Whether or not that difference is noticeable depends on the number of concurrent client connections doing write operations. The other two similar plugins I linked earlier both rely on threading.local() to ensure each thread has its own local storage for request data, which may also work, but requires more complexity than I've gone with here. More testing is needed for sure, though.
@jeremystretch commented on GitHub (Nov 1, 2021):
I apologize for not keeping up with this as well as I should have. (I've been focused on v3.1 feature work lately.) Allow me to summarize the conversation so far.
Is the existing change logging implementation thread-safe? The working assumption is that no, it is not, hence the error reported above. @tyler-8's work appears to confirm this.
Can it be made thread-safe? Per @tyler-8's proposed modification, probably.
My understanding is that wrapping the request processing inside threading.Lock() essentially effects single-threaded operation, but I may be mistaken. I need to do some reading on this topic. (Thanks for the links above, Tyler!)
IIRC we originally used TLS for the change logging middleware but moved away from it to resolve a similar bug. However, a lot has changed since then, and it may have been an error in the implementation. It's certainly worth taking another look at. I'm happy to draw inspiration from django-crum as a proven solution, though I'm not sure it makes sense to take it on as a dependency, primarily because we also need to store the webhook queue and likely similar per-request data in the future.
(Please correct me if any of the above is wrong.)
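For readers unfamiliar with the pattern, a thread-local request store generally looks something like the minimal, generic sketch below. The names are illustrative only; NetBox's actual implementation is the one that eventually landed via PR #7861 later in this thread.

```python
import threading

# Each thread sees its own independent set of attributes on this object.
_thread_state = threading.local()


def set_current_request(request):
    _thread_state.request = request


def get_current_request():
    """Return the request bound to the current thread, or None."""
    return getattr(_thread_state, "request", None)


class CurrentRequestMiddleware:
    """Bind the active request to the handling thread for the duration of the
    request, so signal receivers can look it up without cross-thread mix-ups."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        set_current_request(request)
        try:
            return self.get_response(request)
        finally:
            # Clear it so a recycled worker thread never sees a stale request
            # (the source of the wrong change_user / AnonymousUser errors above).
            set_current_request(None)
```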
@tyler-8 commented on GitHub (Nov 1, 2021):
dispatch_uids like `post_save.connect(handle_changed_object, dispatch_uid=f'handle_changed_object_{request.id}')` - I just started testing with that.
It's possible that the threading.Lock() helped the mismatched-author problem by ensuring only one series of signals was used at a time under the handle_changed_object name. If the UID is made truly unique, that issue looks like it may go away even with threads enabled and no lock.
I certainly think so, just need to find the right combination. One solution isn't going to fix both problems, I suspect.
The lock is a bottleneck, but it's wrapped around a potentially non-thread-safe operation, so the goal is to keep the amount of 'locked' work minimal. In this case it's only write operations that invoke the object change workflow. Even then, it's only for that one worker process; threads in other worker processes don't share with threads in the "current" process. And as noted above, this is potentially solved another way for at least one of the two issues.
For the "SimpleLazyObject" issue, TLS may still be required. I need to do more test runs with the lock in place. "We" know how to re-create it (perform lots of concurrent write operations with server threading enabled) but it doesn't happen 100% of the time, just due to the way the requests are being "load balanced" across the workers and threads.
@tyler-8 commented on GitHub (Nov 1, 2021):
Posting here as well:
https://github.com/netbox-community/netbox/issues/7696#issuecomment-956396714
Locking seems to solve the mixed user IDs in the change log; I'm still trying to get a script that can more consistently invoke the error in the OP of this issue so I can see if my branch (with locking) makes any difference.
@DanSheps commented on GitHub (Nov 1, 2021):
Got to say, thank you @tyler-8 for all the hard work. I do not understand threading in Python so it is definitely appreciated.
@tyler-8 commented on GitHub (Nov 1, 2021):
Originally I was thinking #7696 was distinct from this one because, while the test_case.py script from that issue showed the behavior of wrong users being populated in the changelog, it didn't cause an HTTP 500 at any point. But it's been noted in #5142 & #5609 that the error itself is raised only on rare occasions. The real problem of wrong users being attributed to changes is happening whether or not the HTTP 500 is being raised.
Perhaps in the future a more elegant solution will be found, but based on the benchmarks, I think the addition of a lock allows users to gain the GET performance boost/efficiency offered by threads and still maintain good changelog history. In the end, we're not losing much, since the workaround has always been to set threads = 1 anyway.
@robertpenz commented on GitHub (Nov 2, 2021):
Thx for the quick fix, we'll test it and get back to you.
@alinmear commented on GitHub (Nov 3, 2021):
We patched our installation with the changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe (including the thread lock) and updated our worker/thread gunicorn config:
The problem could not be reproduced so far. We are keeping an eye on it.
@tyler-8 commented on GitHub (Nov 3, 2021):
Re-posting here since it's relevant.
Benchmarks using a modified test_case.py (from https://github.com/netbox-community/netbox/issues/7696#issue-1040078126), with ThreadPoolExecutor instead of ProcessPoolExecutor. Six executions of the script with each config.
@DanSheps commented on GitHub (Nov 3, 2021):
Correct me if I am wrong, but that seems to indicate a significant performance hit when threads are enabled.
@tyler-8 commented on GitHub (Nov 3, 2021):
2 processes, 5 threads (2p5t) is pretty weak-sauce in terms of deployment scale; I wouldn't go that route unless you had a really good reason. I used it more to force a thread-heavy workflow.
The results for 5p3t and 4p4t are nearly the same times as no threads, with a couple of hiccups with much higher times. They're also more reasonable and common deployment configurations, I would imagine (the former being the default recommended config). The benchmark also doesn't cover requests per second. If you had heavy GET concurrency, you're going to see lots of benefit from using threads. The benchmark script is almost entirely write operations, on the same app.model.
What I didn't benchmark is 5p3t without a lock at all, because 5p3t with no lock leads to inaccurate changelog data, so it isn't a reasonable comparison, IMO.
Broken and faster < correct and slower. Using threads is going to make for more efficient memory & CPU usage. If your machine had 8 CPUs, you could run 8 server processes with 3-4 threads each and would be able to handle an absurd amount of concurrent requests.
I'm sure there's a more elegant long-term solution using threading.local, but that's going to require much more complexity to implement. This lock at the very least ensures changelog data is accurate without changing much code, which could be a huge concern to some users.
@DanSheps commented on GitHub (Nov 3, 2021):
Yes, but in almost all instances it seems 7 workers with 1 thread beats out the threaded configurations as far as performance goes.
@kkthxbye-code commented on GitHub (Nov 3, 2021):
It is to be expected that multiprocessing is faster than threading in Python, but with higher memory consumption.
@tyler-8 commented on GitHub (Nov 4, 2021):
I tweaked the benchmark labeling a little bit more for clarity. The "7" refers to the number of threads being used by the client, not the server.
That means the server was receiving ~7 requests at a time and had 10 threads available (2 processes, 5 threads, or 2p5t shorthand) to generate a response. 5p3t means 15 threads available to respond, 4p4t means 16 threads available. In Test Scenario 1, the server only had 8 threads (8p1t) available to respond.
So with the client load being equal in all four scenarios, Test Scenarios 3 & 4 both performed very close to the times seen with Test Scenario 1 - and still provided the same safe & accurate changelog data. Those couple instances where the times spiked up in 3 & 4 could've been other stuff (security and what not) running on my machine - these tests were all done locally on my laptop.
8 processes and 1 thread per process is terribly inefficient - both in CPU usage and in memory footprint, as @kkthxbye-code noted. And because NetBox's web-facing portion (running under gunicorn) is mostly network I/O (between server and database, server and Redis, and server and client), not using threading would be a "waste" of resources.
@alinmear commented on GitHub (Nov 11, 2021):
Just for the record: we have been running our installation with the changes from https://github.com/tyler-8/netbox/tree/objectchange_threadsafe until now and have had no threading-related problems so far.
@jeremystretch commented on GitHub (Nov 17, 2021):
Finally found some time to really dig into this today. I moved the request object and webhook queue into thread-local storage, which should resolve the "cross talk" among threads without significantly impacting multi-threaded performance.
I've opened draft PR #7861 to propose my changes. Please test and let me know if you encounter any issues.
And a huge thank you to @alinmear, @drygdryg, @tyler-8, @robertpenz, and everyone else who helped identify and resolve this bug!
@tyler-8 commented on GitHub (Nov 22, 2021):
I did some testing against #7861 (commit) using test_case.py (source). Surprisingly, thread-local storage was significantly slower. I also tried increasing the number of threads to see if that narrowed the gap any, and the results stayed about the same. Lowering the processes to 4 and keeping 6 threads widened the delta even more.
@jeremystretch commented on GitHub (Nov 22, 2021):
I did a quick local test using four Gunicorn workers with six threads each, and ran the test script five times per branch. Here are the average times for each:
develop (current release): 52.082s
7657-threadsafe-changelog (TLS): 47.602s
objectchange_threadsafe (thread locking): 58.10s
However, I noticed remarkable deviation between test runs within branches, so I'm going to take some time soon to set up a dedicated testing instance and repeat these tests. Will report back once that's been completed.
@tyler-8 commented on GitHub (Nov 23, 2021):
I spun up a couple of external VMs, as I'd been running all of my tests on my MacBook - which doesn't exactly have a ton of CPU cores.
VM1 has 12 CPU Cores, 12GB RAM and is running NetBox, Postgresql, and Redis.
VM2 has 4 CPU Cores, 4GB of RAM and is only running the test_case.py script.
I did 10 test runs against each branch; here are the results:
@kkthxbye-code commented on GitHub (Nov 23, 2021):
I was interested in how the performance looked in netbox-docker with the changes. Not sure if it's useful, but here are the results:
Preface: netbox-docker uses NGINX Unit as an application server instead of uWSGI. The default configuration is 4 processes and 1 thread per process.
Testing was done using tyler-8's script and timed with perf over 10 runs. The output is the average runtime and deviation. Run on an i7-9700 (8 cores, no HT).
4 processes, 1 thread per process:
Locking:
5,2421 +- 0,0892 seconds time elapsed ( +- 1,70% )
TLS:
5,2254 +- 0,0476 seconds time elapsed ( +- 0,91% )
Develop:
5,2388 +- 0,0488 seconds time elapsed ( +- 0,93% )
5 processes, 6 threads per process:
Locking:
6,501 +- 0,574 seconds time elapsed ( +- 8,83% )
TLS:
7,610 +- 0,634 seconds time elapsed ( +- 8,33% )
Develop:
8,466 +- 0,527 seconds time elapsed ( +- 6,22% )
Not sure what can be gleaned from this. I don't think people really change the default processes/threads in netbox-docker, as it is not exposed and you have to map in your own nginx-unit configuration. The default setup in netbox-docker doesn't seem to be affected and seems more stable than using a high number of threads.
Personally, we are not memory-constrained in our NetBox deployment, so using threads seems to be a pure downside for us anyway, at least when using the Docker image.
@jeremystretch commented on GitHub (Nov 23, 2021):
The issue with the thread locking approach is that it effects single-threaded operation for most of the request.
To illustrate this, I added time.sleep(5) immediately before the yield statement in the change logging context manager to simulate a compute-intensive request. I ran my tests from earlier with Gunicorn configured for a single worker with six threads to maximize the impact of single-threaded operation.
Under the thread-local storage approach, a test run took 2m16s. With thread locking, it took 13m23s, because each request was processed essentially one at a time.
After changing the Gunicorn configuration to 6 workers with a single thread each, both approaches take around 2m15s, which is to be expected.
Obviously there are a lot of variables to consider, such as the ratio of WSGI worker processes to threads per worker, as well as the nature of each individual request. However, I think these results are sufficient to rule out thread locking as a viable solution. Of course, I'd be interested to hear if anyone finds differently with a similar test.
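The arithmetic behind those numbers can be seen in a toy version of the experiment (a sketch under the assumptions above, not the actual test code; both function names are hypothetical):

```python
import threading
import time
from contextlib import contextmanager

_lock = threading.Lock()


@contextmanager
def change_logging_with_lock(request):
    # Locked variant: the artificial 5-second delay (and the caller's whole
    # request body after `yield`) runs while the lock is held, so with one
    # worker and six threads the requests are processed one at a time.
    with _lock:
        time.sleep(5)
        yield


@contextmanager
def change_logging_with_tls(request):
    # Thread-local variant: no lock, so six threads can sleep concurrently
    # and the same test run finishes roughly six times sooner.
    time.sleep(5)
    yield
```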
@tyler-8 commented on GitHub (Nov 23, 2021):
Given the relatively small difference in times, TLS is absolutely the superior solution - even if it comes at a tiny performance hit under a write-heavy workload. It will scale better for different write workloads.
My original tests against the TLS branch were somewhat faulty, since my MacBook was basically already oversubscribed on processes vs. CPU cores.
@jeremystretch commented on GitHub (Nov 24, 2021):
I'm going to proceed with the TLS implementation for now, but will be paying careful attention to any feedback from the change. This also cleans up a fair bit of hoop-jumping within the middleware, which is nice.
@jeremystretch commented on GitHub (Nov 24, 2021):
So happy to have this finally addressed. Thanks again to everyone who contributed!