Netbox workers exhibit slow memory leak #2948

Closed
opened 2025-12-29 18:23:52 +01:00 by adam · 6 comments

Originally created by @ajknv on GitHub (Oct 11, 2019).

Environment

  • Python version: 3.6.8
  • NetBox version: 2.6.3

Steps to Reproduce

(I did this as an A/B test on two servers in a clustered HA configuration)

  1. Deploy server A with NetBox configured with no max_requests value for its gunicorn workers (i.e. unlimited). For what it's worth, I deploy via the community Docker image, though backed by RDS and ElastiCache rather than the bundled DB containers.
  2. Deploy server B with NetBox configured to set a max_requests value for its gunicorn workers; I set it to 2500.
  3. Observe the memory use of both servers over the course of several days to a few weeks. Server A's free memory slowly drops, in my case from an initial high of ~12GB (the VM has 16GB allocated) to a current value of ~336K, with individual gunicorn worker processes reported as consuming gigabytes of memory (the current title holder is at 3.7GB). Server B holds more or less steady on free memory.
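
For reference, a minimal sketch of what the server B gunicorn settings might look like; the file name, bind address, and worker count here are illustrative assumptions, and only max_requests is the setting under test:

```python
# gunicorn_config.py -- illustrative values; only max_requests is the setting under test
bind = "0.0.0.0:8001"   # address NetBox is served on (example)
workers = 4             # number of worker processes (example)
timeout = 120           # worker timeout in seconds (example)

# Recycle each worker after it has served this many requests, releasing any
# memory it has accumulated. Server A omits this setting entirely (gunicorn's
# default of 0 means workers are never restarted).
max_requests = 2500
```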

Expected Behavior

While the gunicorn max_requests configuration does provide a workaround, ideally a server application like NetBox shouldn't slowly drain all of the host's memory until it seizes up.

Observed Behavior

I first started tracing and experimenting with performance parameters like gunicorn's max_requests while debugging why both instances in the HA cluster became completely unresponsive, seizing up the VMs so badly that they had to be hard-terminated in AWS and replaced. I'm continuing to observe server A for a similar total failure to fully verify that root cause, but I feel fairly confident that this memory-leaking behavior is the primary smoking gun.

adam closed this issue 2025-12-29 18:23:52 +01:00

@DanSheps commented on GitHub (Oct 11, 2019):

I am going to go out on a limb and say this is going to be an upstream issue.

There are four layers here:

  • Python
  • Gunicorn
  • Django
  • Netbox

Most functions are abstracted by the first two, so NetBox itself does not have a lot of interaction with the operating system at all. The memory leak, if you determine there is one, could come from any layer.


@jeremystretch commented on GitHub (Oct 12, 2019):

Agreed. Unless you can attribute the leak to something particular inside NetBox, I don't think there's much we can do. That said, it might be worth adding max_requests to the reference gunicorn config in the docs to mitigate the issue.


@ajknv commented on GitHub (Oct 14, 2019):

It is possible it could be upstream, no doubt, though given the popularity of all of those layers, that conclusion would seem to imply that every project built on these widely adopted frameworks suffers from a fairly aggressive memory leak. IMO that doesn't really pass the Occam's razor test as an explanation.

> NetBox itself does not have a lot of interaction with the operating system at all

What I mean by a memory leak in this context is something hanging on to and accumulating references to Python objects such that they can't be garbage collected, i.e. a memory leak in the sense typical of higher-level languages, not a raw malloc request straight to the OS. Something like retained DB result sets, for example.

> The memory leak, if you determine there is one

I'm fairly puzzled: setting aside doubts about which layer it's coming from, does the description in the issue not even convince you that there is a memory leak at all?

> Unless you can attribute the leak to something particular inside NetBox, I don't think there's much we can do.

I can try to find the time to help, but I was hoping that, as the maintainers of the project, you might consider it worthwhile to run some tools (static analysis/linters, memory analysis, etc.) or even just do an inspection pass over the data-handling code to look for possible culprits within NetBox.
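
As a starting point for that kind of investigation, here is a minimal, standalone sketch (not part of NetBox, and not anything the maintainers have endorsed) of how accumulating Python objects could be spotted inside a worker using only the standard-library tracemalloc module; comparing snapshots taken over a worker's lifetime shows which source lines keep allocating memory that is never released:

```python
# leak_probe.py -- illustrative sketch using only the Python standard library
import tracemalloc

tracemalloc.start(25)                      # record up to 25 stack frames per allocation
_baseline = tracemalloc.take_snapshot()    # snapshot taken once, early in the worker's life

def report_growth(limit=10):
    """Print the source lines whose allocations have grown the most since the baseline."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(_baseline, "lineno")[:limit]:
        print(stat)

# In a Django deployment this could be wired into a middleware or a periodic
# job so that report_growth() runs every few thousand requests; steadily
# growing entries point at the code (NetBox, Django, or a third-party
# library) that is holding the references.
```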


@jeremystretch commented on GitHub (Oct 15, 2019):

> I can try to find the time to help, but I was hoping that, as the maintainers of the project, you might consider it worthwhile to run some tools

I'm sure it's worthwhile, but so are the 186 other issues currently open that our four (part-time) maintainers are tasked with managing. Open source projects depend on contributions from their users to survive. @ajknv, if you're willing to commit to doing this work I'll leave this issue open. Otherwise, I think adding a line to the docs suggesting a max_requests limit in the gunicorn configuration is a very acceptable workaround.


@tyler-8 commented on GitHub (Oct 16, 2019):

I agree that max_requests is really the way to go. Memory leaks can be difficult and time-intensive to pin down, and even then may eventually be attributed to any number of third-party libraries in use (if the issue truly lies in the NetBox code to begin with).

I would add that max_requests should be paired with max_requests_jitter in the suggested configuration, which helps prevent all of the gunicorn workers from being restarted at nearly the same time. I'm using uwsgi myself, but it has similar configuration parameters.
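
In gunicorn config terms, the combined recommendation would look something like the sketch below (the values are examples, not tuned recommendations):

```python
# gunicorn_config.py -- recycle workers to cap memory growth (example values)
max_requests = 2500        # restart a worker after it has served this many requests
max_requests_jitter = 500  # add a random 0-500 extra requests to each worker's limit
                           # so the workers do not all restart at the same moment
```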

Further general recommendations: https://adamj.eu/tech/2019/09/19/working-around-memory-leaks-in-your-django-app/


@jeremystretch commented on GitHub (Nov 1, 2019):

Going to close this out as #3658 seems to provide an adequate workaround in the absence of a known root issue.
