Clearing out expired ObjectChanges can cause HTTP worker memory consumption to balloon #4702

Closed
opened 2025-12-29 19:19:38 +01:00 by adam · 2 comments

Originally created by @mpalmer on GitHub (Mar 26, 2021).

Originally assigned to: @jeremystretch on GitHub.

NetBox version

v2.10.6

Python version

3.8

Steps to Reproduce

  1. Use NetBox as normal.
  2. Make a significant number of changing requests.
  3. At some random time (with 0.1% probability), [this line of code](https://github.com/netbox-community/netbox/blob/develop/netbox/extras/signals.py#L55) will be executed (a hedged sketch of the surrounding logic follows this list):

     ```python
     ObjectChange.objects.filter(time__lt=cutoff).delete()
     ```

  4. If there are a lot of objects whose `time` is before the cutoff, then the memory required to load all those objects before deleting them will cause the HTTP worker process to use a *lot* of memory, potentially causing all manner of unpleasantness.
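For context, the trimming runs inline in the change-logging signal handler. The following is a hedged reconstruction of that logic, not a verbatim copy of `signals.py`: only the 0.1% probability and the final queryset call are stated in the report; the `CHANGELOG_RETENTION` setting and the exact probability check are assumptions.

```python
import random
from datetime import timedelta

from django.conf import settings
from django.utils import timezone

from extras.models import ObjectChange  # NetBox's changelog model


def trim_changelog_inline():
    """Hedged sketch of the request-path housekeeping described in step 3."""
    # Roughly 1 in 1000 requests pays the cleanup cost ("0.1% probability");
    # the exact check in signals.py may differ.
    if settings.CHANGELOG_RETENTION and random.random() < 0.001:
        cutoff = timezone.now() - timedelta(days=settings.CHANGELOG_RETENTION)
        # The reported line: delete() here can load every matching
        # ObjectChange into the HTTP worker's memory before deleting.
        ObjectChange.objects.filter(time__lt=cutoff).delete()
```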

Expected Behavior

Preferably, log trimming would be moved out of the HTTP service path altogether, because running a potentially expensive query (only sometimes) when you're trying to respond quickly to HTTP requests is a great way to make your p99 stats look *really* bad. At the very least, though, the query needs to be a straight-up `DELETE FROM ... WHERE ...`, rather than a load-then-delete.
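As a rough illustration of the "straight-up `DELETE FROM ... WHERE ...`" idea, a single statement issued outside the ORM's deletion collector would keep memory flat. This sketch assumes the default Django table name `extras_objectchange` and is not the fix that shipped:

```python
from datetime import timedelta

from django.db import connection
from django.utils import timezone


def trim_changelog_raw(retention_days: int) -> int:
    """Delete expired ObjectChange rows with one DELETE statement.

    Bypasses the ORM deletion collector entirely, so no rows are loaded
    into the worker's memory. The table name assumes Django's default
    naming for the extras.ObjectChange model.
    """
    cutoff = timezone.now() - timedelta(days=retention_days)
    with connection.cursor() as cursor:
        cursor.execute(
            "DELETE FROM extras_objectchange WHERE time < %s", [cutoff]
        )
        return cursor.rowcount
```

Running something like this from a scheduled management command or background worker, rather than with a random chance inside the request path, would also address the p99 concern.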

Observed Behavior

Much memory. Very OOM.

adam added the type: bug, status: accepted labels 2025-12-29 19:19:38 +01:00
adam closed this issue 2025-12-29 19:19:38 +01:00

@davekempe commented on GitHub (Mar 26, 2021):

Oh, nice work @mpalmer!

Ahh, I think this explains why we needed this script:

```bash
#!/bin/bash
# $1: %MEM threshold (from ps) above which to restart NetBox
# $2: poll interval in seconds
threshold=$1
seconds=$2
re='^[0-9]+([.][0-9]+)?$'
while true; do
        # Highest gunicorn worker memory usage (%MEM column from ps)
        mem=`ps aux | grep gunicorn | grep -v grep | awk '{print $4}' | sort -n | tail -n1`
        echo -n "`date` - "
        if [[ $mem =~ $re ]] ; then
                echo $mem
                if [ `printf "%.0f" "$mem"` -gt $threshold ]; then
                        logger "Netbox restarted with memory value $mem"
                        service netbox restart
                        echo "Netbox restarted with memory value $mem"
                        sleep 120
                else
                        echo "Netbox is doing fine ($mem)"
                fi
        else
                echo "Number not found ($mem)"
                echo `ps aux | grep gunicorn | grep -v grep`
        fi
        sleep $seconds
done
```

Hopefully we can get this one sorted out and not need to restart NetBox regularly when it gets stuck. Note that ours is a large NetBox instance (120K+ devices with many, many interfaces), and it limps along with the script keeping it going.


@jeremystretch commented on GitHub (Mar 26, 2021):

AFAICT `delete()` should just be executing a single `DELETE` SQL query on the matching objects; however, debugging shows that it's loading all matching objects first and then deleting them by unique ID through a series of `DELETE` queries. Per the [Django docs](https://docs.djangoproject.com/en/3.1/ref/models/querysets/#delete):

> Django needs to fetch objects into memory to send signals and handle cascades. However, if there are no cascades and no signals, then Django may take a fast-path and delete objects without fetching into memory. For large deletes this can result in significantly reduced memory usage. The amount of executed queries can be reduced, too.

It's possible to side-step this by fetching _only_ the relevant PKs ourselves (to avoid loading objects into memory) and deleting them directly. However, I'd like to figure out why Django isn't taking the fast-path automatically.
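For illustration, a PK-only batching approach along the lines described above might look like the sketch below; the helper name and batch size are invented here, and this is not necessarily the fix that landed. (One common reason Django skips the fast path is a `pre_delete`/`post_delete` receiver registered for the model, including receivers connected without a specific sender, which keeps the collector on the fetch-then-delete route.)

```python
from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # NetBox's changelog model


def trim_changelog_in_batches(retention_days: int, batch_size: int = 1000) -> int:
    """Delete expired ObjectChange rows without holding them all in memory.

    Fetches only primary keys and deletes in bounded batches, so even when
    Django cannot take the fast-delete path, at most `batch_size` objects
    are materialised at a time. The batch size is illustrative.
    """
    cutoff = timezone.now() - timedelta(days=retention_days)
    deleted = 0
    while True:
        pks = list(
            ObjectChange.objects.filter(time__lt=cutoff)
            .values_list("pk", flat=True)[:batch_size]
        )
        if not pks:
            return deleted
        count, _ = ObjectChange.objects.filter(pk__in=pks).delete()
        deleted += count
```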
