Clearing out expired ObjectChanges can cause HTTP worker memory consumption to balloon #4702

Closed
opened 2025-12-29 19:19:38 +01:00 by adam · 2 comments

Originally created by @mpalmer on GitHub (Mar 26, 2021).

Originally assigned to: @jeremystretch on GitHub.

NetBox version

v2.10.6

Python version

3.8

Steps to Reproduce

  1. Use NetBox as normal.
  2. Make a significant number of changing requests.
  3. At some random time (with 0.1% probability), [this line of code](https://github.com/netbox-community/netbox/blob/develop/netbox/extras/signals.py#L55) will be executed (a hedged sketch of the surrounding logic follows this list):

     ```python
     ObjectChange.objects.filter(time__lt=cutoff).delete()
     ```

  4. If there are a lot of objects whose `time` is before the cutoff, then the memory required to load all those objects before deleting them will cause the HTTP worker process to use a *lot* of memory, potentially causing all manner of unpleasantness.
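For context, the trimming runs inline in the change-logging signal handler. The following is a hedged reconstruction of that logic, not a verbatim copy of `signals.py`: only the 0.1% probability and the final queryset call are stated in the report; the `CHANGELOG_RETENTION` setting and the exact probability check are assumptions.

```python
import random
from datetime import timedelta

from django.conf import settings
from django.utils import timezone

from extras.models import ObjectChange  # NetBox's changelog model


def trim_changelog_inline():
    """Hedged sketch of the request-path housekeeping described in step 3."""
    # Roughly 1 in 1000 requests pays the cleanup cost ("0.1% probability");
    # the exact check in signals.py may differ.
    if settings.CHANGELOG_RETENTION and random.random() < 0.001:
        cutoff = timezone.now() - timedelta(days=settings.CHANGELOG_RETENTION)
        # The reported line: delete() here can load every matching
        # ObjectChange into the HTTP worker's memory before deleting.
        ObjectChange.objects.filter(time__lt=cutoff).delete()
```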

Expected Behavior

Preferably, log trimming would be moved out of the HTTP service path altogether, because running a potentially expensive query (only sometimes) when you're trying to respond quickly to HTTP requests is a great way to make your p99 stats look *really* bad. At the very least, though, the query needs to be a straight-up `DELETE FROM ... WHERE ...`, rather than a load-then-delete.
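As a rough illustration of the "straight-up `DELETE FROM ... WHERE ...`" idea, a single statement issued outside the ORM's deletion collector would keep memory flat. This sketch assumes the default Django table name `extras_objectchange` and is not the fix that shipped:

```python
from datetime import timedelta

from django.db import connection
from django.utils import timezone


def trim_changelog_raw(retention_days: int) -> int:
    """Delete expired ObjectChange rows with one DELETE statement.

    Bypasses the ORM deletion collector entirely, so no rows are loaded
    into the worker's memory. The table name assumes Django's default
    naming for the extras.ObjectChange model.
    """
    cutoff = timezone.now() - timedelta(days=retention_days)
    with connection.cursor() as cursor:
        cursor.execute(
            "DELETE FROM extras_objectchange WHERE time < %s", [cutoff]
        )
        return cursor.rowcount
```

Running something like this from a scheduled management command or background worker, rather than with a random chance inside the request path, would also address the p99 concern.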

Observed Behavior

Much memory. Very OOM.

adam added the type: bug, status: accepted labels 2025-12-29 19:19:38 +01:00
adam closed this issue 2025-12-29 19:19:38 +01:00

@davekempe commented on GitHub (Mar 26, 2021):

Oh, nice work @mpalmer!

Ahh, I think this explains why we needed this script:

```bash
#!/bin/bash
# $1: %MEM threshold (from ps) above which to restart NetBox
# $2: poll interval in seconds
threshold=$1
seconds=$2
re='^[0-9]+([.][0-9]+)?$'
while true; do
        # Highest gunicorn worker memory usage (%MEM column from ps)
        mem=`ps aux | grep gunicorn | grep -v grep | awk '{print $4}' | sort -n | tail -n1`
        echo -n "`date` - "
        if [[ $mem =~ $re ]] ; then
                echo $mem
                if [ `printf "%.0f" "$mem"` -gt $threshold ]; then
                        logger "Netbox restarted with memory value $mem"
                        service netbox restart
                        echo "Netbox restarted with memory value $mem"
                        sleep 120
                else
                        echo "Netbox is doing fine ($mem)"
                fi
        else
                echo "Number not found ($mem)"
                echo `ps aux | grep gunicorn | grep -v grep`
        fi
        sleep $seconds
done
```

Hopefully we can get this one sorted out and not need to restart NetBox regularly when it gets stuck. Note that ours is a large NetBox instance (120K+ devices with many, many interfaces), and it limps along with the script keeping it going.


@jeremystretch commented on GitHub (Mar 26, 2021):

AFAICT `delete()` should just be executing a single `DELETE` SQL query on the matching objects; however, debugging shows that it's loading all matching objects first and then deleting them by unique ID through a series of `DELETE` queries. Per the [Django docs](https://docs.djangoproject.com/en/3.1/ref/models/querysets/#delete):

> Django needs to fetch objects into memory to send signals and handle cascades. However, if there are no cascades and no signals, then Django may take a fast-path and delete objects without fetching into memory. For large deletes this can result in significantly reduced memory usage. The amount of executed queries can be reduced, too.

It's possible to side-step this by fetching _only_ the relevant PKs ourselves (to avoid loading objects into memory) and deleting them directly. However, I'd like to figure out why Django isn't taking the fast-path automatically.
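For illustration, a PK-only batching approach along the lines described above might look like the sketch below; the helper name and batch size are invented here, and this is not necessarily the fix that landed. (One common reason Django skips the fast path is a `pre_delete`/`post_delete` receiver registered for the model, including receivers connected without a specific sender, which keeps the collector on the fetch-then-delete route.)

```python
from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # NetBox's changelog model


def trim_changelog_in_batches(retention_days: int, batch_size: int = 1000) -> int:
    """Delete expired ObjectChange rows without holding them all in memory.

    Fetches only primary keys and deletes in bounded batches, so even when
    Django cannot take the fast-delete path, at most `batch_size` objects
    are materialised at a time. The batch size is illustrative.
    """
    cutoff = timezone.now() - timedelta(days=retention_days)
    deleted = 0
    while True:
        pks = list(
            ObjectChange.objects.filter(time__lt=cutoff)
            .values_list("pk", flat=True)[:batch_size]
        )
        if not pks:
            return deleted
        count, _ = ObjectChange.objects.filter(pk__in=pks).delete()
        deleted += count
```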
