Add configuration to prevent inaccurate prometheus metrics #3081

Closed
opened 2025-12-29 18:25:26 +01:00 by adam · 10 comments

Originally created by @tyler-8 on GitHub (Dec 19, 2019).

Environment

  • Python version: 3.6.8
  • NetBox version: 2.6.7

Proposed Functionality

When running django-prometheus in a multiprocessed environment (as with gunicorn or uwsgi), the metrics files are generated using the pid of each worker process. This becomes an issue as worker processes recycle (for whatever reason) and receive new pids while the old metrics files persist - and those stale numbers are still scraped by prometheus_client. This topic has been discussed further in the main prometheus_client repo: https://github.com/prometheus/client_python/issues/275 & https://github.com/prometheus/client_python/issues/204
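
For context, here is a minimal sketch of how prometheus_client's multiprocess mode aggregates those per-PID files at scrape time (generic prometheus_client usage, not NetBox code; the directory path is a made-up example):

import os
from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# Example/assumed path; in a real deployment this is the directory pointed to
# by the prometheus_multiproc_dir environment variable.
metrics_dir = "/tmp/netbox_metrics"
os.makedirs(metrics_dir, exist_ok=True)
os.environ.setdefault("prometheus_multiproc_dir", metrics_dir)

# MultiProcessCollector merges every *.db file found in that directory,
# including files left behind by workers that have since exited - which is
# why stale per-PID files keep influencing the scraped values.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())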

I have combined the two fixes documented (in https://github.com/prometheus/client_python/issues/275#issuecomment-504755024 and https://github.com/korfuri/django-prometheus/blob/master/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes-globally). The block below would be added near the top of the NetBox settings.py. There is no impact to users not running with metrics enabled, and for users who do have metrics enabled, the block is only executed if the prometheus_multiproc_dir environment variable (https://netbox.readthedocs.io/en/stable/additional-features/prometheus-metrics/) is set, which is essentially mandatory to get accurate metrics with a multiprocessed WSGI server.

if "prometheus_multiproc_dir" in os.environ:
    try:
        import prometheus_client
        import uwsgi
        prometheus_client.values.ValueClass = prometheus_client.values.MultiProcessValue(
            _pidFunc=uwsgi.worker_id)
    except ImportError:
        pass  # not running in uwsgi

Use Case

Any user who wants to take advantage of the Prometheus metrics will eventually see inaccurate metrics and performance issues as the number of stale metrics files grows.

Database Changes

N/A

External Dependencies

N/A

adam added the status: accepted and type: documentation labels 2025-12-29 18:25:26 +01:00
adam closed this issue 2025-12-29 18:25:26 +01:00

@kobayashi commented on GitHub (Dec 24, 2019):

Thanks for giving us this useful tip! Could we describe this under Multi Processing Notes (https://netbox.readthedocs.io/en/stable/additional-features/prometheus-metrics/#multi-processing-notes)?


@tyler-8 commented on GitHub (Dec 24, 2019):

I think this configuration snippet may actually work in configuration.py instead of settings.py - in which case users can add the relevant parts on their own (and therefore this would just be a documentation update). I need to tinker with that more.


@lampwins commented on GitHub (Dec 30, 2019):

I agree this should work in configuration.py. Furthermore, core NetBox should really steer clear of any implementation that makes assumptions about the runtime environment, like uwsgi specifics. I know you have covered your bases here by accounting for both situations, but I wouldn't want to set a precedent with this.

That being said, I am fine with accepting this as a documentation addition.


@tyler-8 commented on GitHub (Dec 30, 2019):

"steer clear of any implementation that makes assumptions about the runtime environment like uwsgi specifics"

Agreed. I think suggestions/examples in the documentation are the way to go here.

As I continue to dig into the issue, however, it seems more complicated for gunicorn. gunicorn doesn't have any sort of static worker ID like uwsgi, so the workaround in the OP doesn't translate.

From https://github.com/prometheus/client_python#multiprocess-mode-gunicorn:

This directory must be wiped between Gunicorn runs (before startup is recommended).

I'm not sure what the best approach for this would be. A parameter in the systemd service file that wipes the directory on service startup/restart?

I also considered an external Python script that wipes the directory of unused PID files every couple of minutes but that doesn't seem ideal.

Needless to say, it's complicated.
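
For what it's worth, here is a rough sketch of how both pieces (wiping the directory on startup and cleaning up after dead workers) could live in a gunicorn config file instead of a systemd unit. This is a hypothetical gunicorn.conf.py using gunicorn's documented server hooks and prometheus_client's mark_process_dead; note that mark_process_dead only removes gauge files, so it does not solve the counter accumulation described below:

# gunicorn.conf.py (hypothetical sketch)
import glob
import os

from prometheus_client import multiprocess

METRICS_DIR = os.environ.get("prometheus_multiproc_dir", "/tmp/netbox_metrics")

def on_starting(server):
    # Wipe the metrics directory when the gunicorn master starts, as the
    # prometheus_client docs recommend doing between runs.
    for path in glob.glob(os.path.join(METRICS_DIR, "*.db")):
        os.remove(path)

def child_exit(server, worker):
    # Remove gauge files for a worker that has exited; counter/histogram
    # files are intentionally kept by prometheus_client and still accumulate.
    multiprocess.mark_process_dead(worker.pid)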


@tyler-8 commented on GitHub (Jan 22, 2020):

I've been digging through this on and off and I haven't really arrived at a solution for gunicorn. Without some heavy custom code (found by digging around https://github.com/prometheus/client_python/issues) that may not translate across environments, I don't know that there's actually a way to have clean, accurate metrics with prometheus_client and gunicorn.

With uwsgi the solution is rather simple (noted in OP) and that keeps the metric data clean and accurate. I'm happy to write up the documentation piece but it'd be for uwsgi only, which is of course counter to what the official docs specify (gunicorn).


@jeremystretch commented on GitHub (Feb 21, 2020):

@tyler-8 Have you made any further progress with gunicorn? What do you want to do with this issue?


@tyler-8 commented on GitHub (Feb 24, 2020):

Without fairly invasive changes to NetBox that are specific to a gunicorn implementation, I don't think there is a workaround for gunicorn.

What's happening here is:

When gunicorn workers run, each process creates a counter_<pid>.db file that maintains the metrics for that pid. The prometheus-client library (used by django-prometheus) then collects all the metrics from all the .db files in the metrics file path for viewing at /metrics. Up to this point, everything is fine.

The problems arise when workers recycle and get new PIDs and/or when the metrics file path is cleaned up. prometheus-client continues accounting for the metrics in the now-stale counter_<oldpid>.db files. If those old PID files are cleaned up (which they should be, because otherwise they keep accumulating and make metric calculation more expensive), then from the perspective of your metrics system (Prometheus, ELK, Wavefront, etc.) the counter is lower than the previously reported number and a reset to 0 is assumed - even though none happened, because the count is simply all_db_files_counts minus the now-missing db file(s).

This will cause huge spikes in the metrics to appear in the graphs, which makes it difficult to form any kind of alerting or thresholding around the metrics.
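
To make the spike mechanics concrete, here is a toy illustration with made-up numbers:

# Hypothetical numbers only: how deleting stale per-PID files makes the
# aggregated counter appear to reset.
stale_worker_files = {"counter_1001.db": 4200, "counter_1002.db": 3800}  # retired PIDs
current_worker_files = {"counter_1193.db": 150}

before_cleanup = sum(stale_worker_files.values()) + sum(current_worker_files.values())  # 8150
after_cleanup = sum(current_worker_files.values())                                      # 150

# Prometheus expects counters to be monotonic, so a drop from 8150 to 150 is
# interpreted as a reset to zero, producing a large artificial jump in rate().
print(before_cleanup, after_cleanup)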

If a worker reuses a PID later down the line, it will just tack on to the existing counter_<pid>.db, but this event is rare and definitely not reliable for accurate metrics.

uwsgi gets around all of this by having a relatively simple way to use the worker ID (a stable ID from 0 up to the number of workers) rather than the worker PID, so the workers always reattach to their specific metrics file. There's no need for cleanup (unless rebooting the system, in which case the metrics should be cleaned up anyway and it's no problem, particularly if your metrics path is in /run).
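
As an illustration of why the worker ID works where the PID does not (a hypothetical snippet that only runs inside a uWSGI worker):

import os

import uwsgi  # only importable when running under uWSGI

# The PID changes whenever a worker process is recycled, so a new metrics
# file is created each time; worker_id() identifies the worker slot instead
# and stays the same across recycles.
print("pid:", os.getpid())
print("worker slot:", uwsgi.worker_id())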

From reading through a number of prometheus-client issues, it doesn't seem like there is any clear implementation or changes in the works to help resolve this issue with gunicorn and multiprocessing.

All this to say, I'm not sure where to leave this issue, other than maybe putting something similar to my above text as a major caveat of using the metrics with gunicorn.


@lampwins commented on GitHub (Feb 24, 2020):

Yeah, I think we can draw a line here with documentation. Perhaps a very brief version of your summary above and a note saying that if metrics are very important to you, then use uwsgi. To be honest, the people for whom this will matter are likely already doing this.


@tyler-8 commented on GitHub (Feb 24, 2020):

I'm happy to piece something together for proposal if the issue is accepted.


@tyler-8 commented on GitHub (Mar 13, 2020):

I'm going to tackle this PR this weekend. Sorry for the delay!

Reference: starred/netbox#3081