mirror of
https://github.com/netbox-community/netbox.git
synced 2026-01-11 21:10:29 +01:00
Add configuration to prevent inaccurate prometheus metrics #3081
Closed
opened 2025-12-29 18:25:26 +01:00 by adam
·
10 comments
Originally created by @tyler-8 on GitHub (Dec 19, 2019).
Environment
Proposed Functionality
When running django-prometheus in a multiprocessed environment (as with gunicorn or uwsgi), the metrics files are generated using the pid of each worker process. This becomes an issue as worker processes recycle (for whatever reason) and get a new pid while the old metrics files persist, and those stale numbers are still scraped by prometheus_client. This topic has been discussed further in the main prometheus_client repo: https://github.com/prometheus/client_python/issues/275 & https://github.com/prometheus/client_python/issues/204
I have combined the two fixes documented (from here and here). The below block would be added near the top of the Netbox settings.py. There's no impact to users not running with metrics enabled. And if users do have metrics enabled, the below block is only executed if the prometheus_multiproc_dir environment variable is set (which is essentially mandatory to get accurate metrics with a multiprocessed wsgi server).
Use Case
Any user that wants to take advantage of the prometheus metrics will eventually have performance issues (as the number of metrics files grows) and inaccurate metrics.
Database Changes
N/A
External Dependencies
N/A
@kobayashi commented on GitHub (Dec 24, 2019):
Thanks for sharing this useful tip! Could we describe this under Multi Processing Notes?
@tyler-8 commented on GitHub (Dec 24, 2019):
I think this configuration snippet may actually work in configuration.py instead of settings.py, in which case users can add the relevant parts on their own (and therefore would just be a documentation update). I need to tinker around with that more.
@lampwins commented on GitHub (Dec 30, 2019):
I agree this should work in configuration.py. Furthermore, core netbox should really steer clear of any implementation that makes assumptions about the runtime environment, like uwsgi specifics. I know you have covered your bases here by accounting for both situations, but I wouldn't want to set a precedent with this.
That being said, I am fine with accepting this as a documentation addition.
@tyler-8 commented on GitHub (Dec 30, 2019):
Agreed. I think suggestions/examples in the documentation are the way to go here.
As I continue to dig into the issue, however, it seems more complicated for gunicorn. gunicorn doesn't have any sort of static worker ID like uwsgi, so the workaround in the OP doesn't translate.
From https://github.com/prometheus/client_python#multiprocess-mode-gunicorn:
I'm not sure what the best approach for this would be. A parameter in the systemd service file that wipes the directory on service startup/restart?
I also considered an external Python script that wipes the directory of unused PID files every couple of minutes but that doesn't seem ideal.
Needless to say, it's complicated.
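For reference, the "external Python script" idea mentioned above could look roughly like the sketch below. It is only an illustration, not part of Netbox or prometheus_client: the function names are hypothetical, and it assumes metric files end in _<pid>.db (as with the counter_<pid>.db files discussed in this thread). Note that deleting files this way still makes the aggregated counters drop, which is the very problem described in this issue.

```python
import os
import re
from pathlib import Path
from typing import Callable, List

# Assumption: per-process metric files end in "_<pid>.db", e.g. counter_<pid>.db.
PID_FILE_RE = re.compile(r"_(\d+)\.db$")


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_metric_files(
    directory: str, is_alive: Callable[[int], bool] = pid_is_alive
) -> List[str]:
    """Delete metric files whose PID is no longer running; return their names."""
    removed = []
    for path in sorted(Path(directory).glob("*.db")):
        match = PID_FILE_RE.search(path.name)
        if match and not is_alive(int(match.group(1))):
            path.unlink()
            removed.append(path.name)
    return removed
```

Such a function would be run periodically (for example from cron or a systemd timer) against the directory named by the multiprocess environment variable.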
@tyler-8 commented on GitHub (Jan 22, 2020):
I've been digging through this on and off and I haven't really come around to a solution for gunicorn. Without some heavy custom code (found digging around https://github.com/prometheus/client_python/issues) that may not translate across environments, I don't know that there's actually a way to have clean, accurate metrics with prometheus_client and gunicorn.
With uwsgi the solution is rather simple (noted in OP) and that keeps the metric data clean and accurate. I'm happy to write up the documentation piece but it'd be for uwsgi only, which is of course counter to what the official docs specify (gunicorn).
@jeremystretch commented on GitHub (Feb 21, 2020):
@tyler-8 Have you made any further progress with gunicorn? What do you want to do with this issue?
@tyler-8 commented on GitHub (Feb 24, 2020):
Without fairly invasive changes to Netbox that are specific to a gunicorn implementation, I don't think there is a workaround for gunicorn.
What's happening here is:
gunicorn workers run, and each process creates a counter_<pid>.db file that maintains the metrics for that pid. The prometheus-client (used by django-prometheus) then collects all the metrics from all the .db files in the metrics file path for viewing at /metrics. Up to this point, everything is fine.
The problems arise when workers recycle and get new PIDs and/or when the metrics file path is cleaned up. prometheus-client continues accounting for the metrics in the now "old" counter_<oldpid>.db files, but if those old PID files are cleaned up (which they should be, because otherwise new files keep accumulating and make metric calculation more intensive) then from the perspective of your metrics system (prometheus, ELK, Wavefront, etc.), the counter is less than the previously reported number and it assumes a reset to 0 has happened (even though it hasn't, because the count is just all_db_files_counts - the_now_missing_db_file(s)).
This will cause huge spikes in the metrics to appear in the graphs, which makes it difficult to form any kind of alerting or thresholding around the metrics.
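The arithmetic behind those spikes can be sketched with a toy model. This is not prometheus_client code: the PIDs, counts, and function names are made up, and the reset handling is a simplified version of what a reset-aware system such as Prometheus does when a counter sample is lower than the previous one.

```python
from typing import Dict, List


def scrape(files: Dict[int, int]) -> int:
    """Aggregate a counter as described above: sum over all per-PID files."""
    return sum(files.values())


def increase_seen_by_monitor(samples: List[int]) -> int:
    """Total increase as a reset-aware monitoring system would compute it.

    A drop between two samples is treated as a reset to 0, so the whole
    new value is counted as fresh increase.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


files = {101: 500, 102: 300}      # two workers, 800 requests so far
samples = [scrape(files)]         # first scrape

files[103] = 50                   # worker 101 recycled as PID 103, 50 new requests
samples.append(scrape(files))     # old file for PID 101 is still counted

del files[101]                    # cleanup removes the stale file
samples.append(scrape(files))     # the sum drops, which looks like a reset

real_increase = 50
reported_increase = increase_seen_by_monitor(samples)
print(samples, real_increase, reported_increase)
```

In this model only 50 new requests happened, but the monitoring side reports an increase of 400 because the post-cleanup sample is interpreted as a counter that restarted from zero.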
If a worker reuses a PID later down the line, it will just tack on to the existing counter_<pid>.db but this event is rare and definitely not reliable for accurate metrics.
uwsgi gets around all of this by having a relatively simple way to use the worker ID (0 through X number of workers) rather than the worker PID, so the workers always reattach to their specific metrics file. There's no need for cleanup (unless rebooting the system, in which case the metrics should be cleaned up and it's no problem, particularly if your metrics path is in /run).
From reading through a number of prometheus-client issues, it doesn't seem like there is any clear implementation or changes in the works to help resolve this issue with gunicorn and multiprocessing.
All this to say, I'm not sure where to leave this issue, other than maybe putting something similar to my above text as a major caveat of using the metrics with gunicorn.
@lampwins commented on GitHub (Feb 24, 2020):
Yeah, I think we can draw a line here with documentation. Perhaps a very brief version of your above summary and a note to say that if metrics are super important to you, then use uwsgi. To be honest, the people for whom this will matter are likely already doing this.
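To make the uwsgi point concrete, here is a small toy model of how many metric files accumulate when file names are keyed on the PID versus a stable worker ID. It does not use uwsgi, gunicorn, or prometheus_client; the function name, the starting PID, and the worker and recycle counts are all hypothetical.

```python
from typing import Callable, Set


def files_after_recycles(
    key_for: Callable[[int, int], int], workers: int, recycles: int
) -> Set[str]:
    """Return the metric files created over several worker generations.

    key_for(slot, pid) picks the identifier embedded in the file name.
    """
    files: Set[str] = set()
    next_pid = 1000  # hypothetical starting PID
    for _generation in range(recycles + 1):
        for slot in range(workers):
            files.add(f"counter_{key_for(slot, next_pid)}.db")
            next_pid += 1  # every (re)started worker gets a fresh PID
    return files


# Keyed on PID: each recycle leaves another set of files behind.
by_pid = files_after_recycles(lambda slot, pid: pid, workers=4, recycles=10)
# Keyed on a stable worker ID: workers reattach to the same files.
by_worker_id = files_after_recycles(lambda slot, pid: slot, workers=4, recycles=10)
print(len(by_pid), len(by_worker_id))
```

With four workers recycled ten times, the PID-keyed scheme ends up with 44 files while the worker-ID scheme stays at 4, so there is nothing to clean up and no counter drop.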
@tyler-8 commented on GitHub (Feb 24, 2020):
I'm happy to piece something together for proposal if the issue is accepted.
@tyler-8 commented on GitHub (Mar 13, 2020):
I'm going to tackle this PR this weekend. Sorry for the delay!