Add configuration to prevent inaccurate prometheus metrics #3081

Closed
opened 2025-12-29 18:25:26 +01:00 by adam · 10 comments

Originally created by @tyler-8 on GitHub (Dec 19, 2019).

Environment

  • Python version: 3.6.8
  • NetBox version: 2.6.7

Proposed Functionality

When running django-prometheus in a multiprocessed environment (as with gunicorn or uwsgi), the metrics files are generated using the pid of each worker process. This becomes an issue as worker processes recycle (for whatever reason) and receive new pids while the old metrics files persist - and those stale numbers are still scraped by prometheus_client. This topic has been discussed further in the main prometheus_client repo: https://github.com/prometheus/client_python/issues/275 & https://github.com/prometheus/client_python/issues/204
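
For context, here is a minimal sketch of how prometheus_client's multiprocess mode aggregates those per-PID files at scrape time (generic prometheus_client usage, not NetBox code; the directory path is a made-up example):

import os
from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# Example/assumed path; in a real deployment this is the directory pointed to
# by the prometheus_multiproc_dir environment variable.
metrics_dir = "/tmp/netbox_metrics"
os.makedirs(metrics_dir, exist_ok=True)
os.environ.setdefault("prometheus_multiproc_dir", metrics_dir)

# MultiProcessCollector merges every *.db file found in that directory,
# including files left behind by workers that have since exited - which is
# why stale per-PID files keep influencing the scraped values.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())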

I have combined the two fixes documented (in https://github.com/prometheus/client_python/issues/275#issuecomment-504755024 and https://github.com/korfuri/django-prometheus/blob/master/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes-globally). The block below would be added near the top of the NetBox settings.py. There is no impact to users not running with metrics enabled, and for users who do have metrics enabled, the block is only executed if the prometheus_multiproc_dir environment variable (https://netbox.readthedocs.io/en/stable/additional-features/prometheus-metrics/) is set, which is essentially mandatory to get accurate metrics with a multiprocessed WSGI server.

if "prometheus_multiproc_dir" in os.environ:
    try:
        import prometheus_client
        import uwsgi
        prometheus_client.values.ValueClass = prometheus_client.values.MultiProcessValue(
            _pidFunc=uwsgi.worker_id)
    except ImportError:
        pass  # not running in uwsgi

Use Case

Any user who wants to take advantage of the Prometheus metrics will eventually see inaccurate metrics and performance issues as the number of stale metrics files grows.

Database Changes

N/A

External Dependencies

N/A

adam added the status: accepted and type: documentation labels 2025-12-29 18:25:26 +01:00
adam closed this issue 2025-12-29 18:25:26 +01:00

@kobayashi commented on GitHub (Dec 24, 2019):

Thanks for giving us this useful tip! Could we describe this under Multi Processing Notes (https://netbox.readthedocs.io/en/stable/additional-features/prometheus-metrics/#multi-processing-notes)?


@tyler-8 commented on GitHub (Dec 24, 2019):

I think this configuration snippet may actually work in configuration.py instead of settings.py - in which case users can add the relevant parts on their own (and therefore this would just be a documentation update). I need to tinker with that more.


@lampwins commented on GitHub (Dec 30, 2019):

I agree this should work in configuration.py. Furthermore, core NetBox should really steer clear of any implementation that makes assumptions about the runtime environment, like uwsgi specifics. I know you have covered your bases here by accounting for both situations, but I wouldn't want to set a precedent with this.

That being said, I am fine with accepting this as a documentation addition.


@tyler-8 commented on GitHub (Dec 30, 2019):

"steer clear of any implementation that makes assumptions about the runtime environment like uwsgi specifics"

Agreed. I think suggestions/examples in the documentation are the way to go here.

As I continue to dig into the issue, however, it seems more complicated for gunicorn. gunicorn doesn't have any sort of static worker ID like uwsgi, so the workaround in the OP doesn't translate.

From https://github.com/prometheus/client_python#multiprocess-mode-gunicorn:

This directory must be wiped between Gunicorn runs (before startup is recommended).

I'm not sure what the best approach for this would be. A parameter in the systemd service file that wipes the directory on service startup/restart?

I also considered an external Python script that wipes the directory of unused PID files every couple of minutes but that doesn't seem ideal.

Needless to say, it's complicated.
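
For what it's worth, here is a rough sketch of how both pieces (wiping the directory on startup and cleaning up after dead workers) could live in a gunicorn config file instead of a systemd unit. This is a hypothetical gunicorn.conf.py using gunicorn's documented server hooks and prometheus_client's mark_process_dead; note that mark_process_dead only removes gauge files, so it does not solve the counter accumulation described below:

# gunicorn.conf.py (hypothetical sketch)
import glob
import os

from prometheus_client import multiprocess

METRICS_DIR = os.environ.get("prometheus_multiproc_dir", "/tmp/netbox_metrics")

def on_starting(server):
    # Wipe the metrics directory when the gunicorn master starts, as the
    # prometheus_client docs recommend doing between runs.
    for path in glob.glob(os.path.join(METRICS_DIR, "*.db")):
        os.remove(path)

def child_exit(server, worker):
    # Remove gauge files for a worker that has exited; counter/histogram
    # files are intentionally kept by prometheus_client and still accumulate.
    multiprocess.mark_process_dead(worker.pid)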


@tyler-8 commented on GitHub (Jan 22, 2020):

I've been digging through this on and off and I haven't really arrived at a solution for gunicorn. Without some heavy custom code (found by digging around https://github.com/prometheus/client_python/issues) that may not translate across environments, I don't know that there's actually a way to have clean, accurate metrics with prometheus_client and gunicorn.

With uwsgi the solution is rather simple (noted in OP) and that keeps the metric data clean and accurate. I'm happy to write up the documentation piece but it'd be for uwsgi only, which is of course counter to what the official docs specify (gunicorn).


@jeremystretch commented on GitHub (Feb 21, 2020):

@tyler-8 Have you made any further progress with gunicorn? What do you want to do with this issue?


@tyler-8 commented on GitHub (Feb 24, 2020):

Without fairly invasive changes to NetBox that are specific to a gunicorn implementation, I don't think there is a workaround for gunicorn.

What's happening here is:

When gunicorn workers run, each process creates a counter_<pid>.db file that maintains the metrics for that pid. The prometheus-client library (used by django-prometheus) then collects all the metrics from all the .db files in the metrics file path for viewing at /metrics. Up to this point, everything is fine.

The problems arise when workers recycle and get new PIDs and/or when the metrics file path is cleaned up. prometheus-client continues accounting for the metrics in the now-stale counter_<oldpid>.db files. If those old PID files are cleaned up (which they should be, because otherwise they keep accumulating and make metric calculation more expensive), then from the perspective of your metrics system (Prometheus, ELK, Wavefront, etc.) the counter is lower than the previously reported number and a reset to 0 is assumed - even though none happened, because the count is simply all_db_files_counts minus the now-missing db file(s).

This will cause huge spikes in the metrics to appear in the graphs, which makes it difficult to form any kind of alerting or thresholding around the metrics.
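
To make the spike mechanics concrete, here is a toy illustration with made-up numbers:

# Hypothetical numbers only: how deleting stale per-PID files makes the
# aggregated counter appear to reset.
stale_worker_files = {"counter_1001.db": 4200, "counter_1002.db": 3800}  # retired PIDs
current_worker_files = {"counter_1193.db": 150}

before_cleanup = sum(stale_worker_files.values()) + sum(current_worker_files.values())  # 8150
after_cleanup = sum(current_worker_files.values())                                      # 150

# Prometheus expects counters to be monotonic, so a drop from 8150 to 150 is
# interpreted as a reset to zero, producing a large artificial jump in rate().
print(before_cleanup, after_cleanup)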

If a worker reuses a PID later down the line, it will just tack on to the existing counter_<pid>.db, but this event is rare and definitely not reliable for accurate metrics.

uwsgi gets around all of this by having a relatively simple way to use the worker ID (a stable ID from 0 up to the number of workers) rather than the worker PID, so the workers always reattach to their specific metrics file. There's no need for cleanup (unless rebooting the system, in which case the metrics should be cleaned up anyway and it's no problem, particularly if your metrics path is in /run).
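
As an illustration of why the worker ID works where the PID does not (a hypothetical snippet that only runs inside a uWSGI worker):

import os

import uwsgi  # only importable when running under uWSGI

# The PID changes whenever a worker process is recycled, so a new metrics
# file is created each time; worker_id() identifies the worker slot instead
# and stays the same across recycles.
print("pid:", os.getpid())
print("worker slot:", uwsgi.worker_id())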

From reading through a number of prometheus-client issues, it doesn't seem like there is any clear implementation or changes in the works to help resolve this issue with gunicorn and multiprocessing.

All this to say, I'm not sure where to leave this issue, other than maybe putting something similar to my above text as a major caveat of using the metrics with gunicorn.


@lampwins commented on GitHub (Feb 24, 2020):

Yeah, I think we can draw a line here with documentation. Perhaps a very brief version of your summary above and a note saying that if metrics are very important to you, then use uwsgi. To be honest, the people for whom this will matter are likely already doing this.


@tyler-8 commented on GitHub (Feb 24, 2020):

I'm happy to piece something together for proposal if the issue is accepted.


@tyler-8 commented on GitHub (Mar 13, 2020):

I'm going to tackle this PR this weekend. Sorry for the delay!

Reference: starred/netbox#3081