Add Prometheus metrics about cache #8401

Closed
opened 2025-12-29 20:36:16 +01:00 by adam · 9 comments

Originally created by @tobikris on GitHub (Aug 2, 2023).

NetBox version

v3.5.7

Feature type

Change to existing functionality

Proposed functionality

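Currently the Prometheus metrics do not include metrics about the cache. The documentation suggests otherwise: https://github.com/netbox-community/netbox/blob/0b10131564dc16138b9b7c7cd869d705771c229e/docs/integrations/prometheus-metrics.md?plain=1#L18

I could imagine that the [change of the cache backend](https://github.com/netbox-community/netbox/pull/6716) removed this as a side-effect.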
As those metrics are pretty useful, I propose to add them back.

Use case

Without the metrics it is pretty hard to determine the effectiveness of the caching. I would like to see if the cache is working as expected or if the general performance is worse than it should be.

Database changes

No response

External dependencies

No response

adam added the type: feature label 2025-12-29 20:36:16 +01:00
adam closed this issue 2025-12-29 20:36:16 +01:00

@kkthxbye-code commented on GitHub (Aug 2, 2023):

What caching though, there's basically none in use. Specifically what do you want to monitor?

@jsenecal commented on GitHub (Aug 2, 2023):

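> What caching though, there's basically none in use. Specifically what do you want to monitor?

> Cache hit, miss, and invalidation counters

These guys: [Diff](https://github.com/netbox-community/netbox/pull/6716/files#diff-15ee226a2efe0a66ca7c3e44694d85c323157f8a38d64e681032f926de642ab0L147-L164)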
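
For anyone unsure what those three counters measure, here is a minimal sketch in plain Python. It is not NetBox code and not the code removed in the linked diff; the `CountingCache` class and its method names are made up for illustration. A real setup would export the counts through a Prometheus client instead of keeping them as plain integers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class CountingCache:
    """Hypothetical in-memory cache that counts hits, misses, and invalidations."""

    hits: int = 0
    misses: int = 0
    invalidations: int = 0
    _store: Dict[str, Any] = field(default_factory=dict)

    def get_or_set(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value, computing and storing it on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key; only invalidations that removed an entry are counted."""
        if key in self._store:
            del self._store[key]
            self.invalidations += 1

    def hit_ratio(self) -> float:
        """Fraction of lookups served from the cache (0.0 if there were none)."""
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0
```

A low hit ratio combined with many invalidations would suggest the cache is not doing much, which is the kind of question the original request was trying to answer.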

@kkthxbye-code commented on GitHub (Aug 2, 2023):

@jsenecal - We don't really cache anything, which is why I asked.

@jsenecal commented on GitHub (Aug 2, 2023):

@kkthxbye-code yeah, I get that, I was replying to the "what". Perhaps the question to @tobikris is more "Why" :)

@kkthxbye-code commented on GitHub (Aug 2, 2023):

I just want to know what specific objects he needs the cache hit, miss and invalidation counters for, because we cache the release check, config revisions and the RSS widget and that's pretty much it. Nothing where metrics would actually be of use, imo.

I'm assuming the why is because he thinks we still cache stuff, so that's not really interesting.

@tobikris commented on GitHub (Aug 2, 2023):

Your assumption is correct - based on the docs I was expecting to see cache hits etc. for modelled objects.
We are having some issues with long response times and wanted to dig into different aspects. This is why we tried to use the Prometheus metrics and realized the cache metrics mentioned in the docs were missing.

Thanks for the explanation. I guess this feature request can be closed again as you are right - those metrics are not important if only minor things are cached anyway.

@kkthxbye-code commented on GitHub (Aug 2, 2023):

@tobikris - Long response times are _usually_ because of high pagination size. Page load time scales pretty linearly with pagination size, so if someone set it to 1000, which is the default max, page load time can be multiple seconds.

Other things that slow down load times include custom link columns, custom fields of the object type and, for prefixes specifically, the utilization column.

@tobikris commented on GitHub (Aug 3, 2023):

Thanks again. We are aware of the page size and its impact on response times. However, we are a little bit forced to load all elements in one request. This is because the current pagination implementation without cursors does not ensure correctness in case of parallel changes.

But even in our tests using pagination with different page sizes, it took about 40 seconds in total to retrieve all 7000 IP addresses. The sweet spot was at about 250 items per request.
Does that sound about right/expected? Please note that we are currently running an older version (v3.2.8), not the latest one, for unrelated reasons.
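
For reference, a rough sketch of how such a paginated retrieval with per-page timing could look. It assumes the usual NetBox REST list response with `results` and `next` fields, the `limit` query parameter, and token authentication. The hostname, the token placeholder, and the helper names `http_get_json` and `fetch_all` are made up for illustration. It does not address the consistency problem with offset pagination under parallel changes mentioned above.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Tuple


def http_get_json(url: str, token: str) -> Dict:
    """GET a URL with a NetBox-style token header and decode the JSON body."""
    req = urllib.request.Request(url, headers={
        "Authorization": f"Token {token}",
        "Accept": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_all(first_url: str,
              get_json: Callable[[str], Dict]) -> Tuple[List[Dict], List[float]]:
    """Follow `next` links until exhausted; return all items and per-page seconds."""
    results: List[Dict] = []
    timings: List[float] = []
    url = first_url
    while url:
        start = time.perf_counter()
        page = get_json(url)
        timings.append(time.perf_counter() - start)
        results.extend(page["results"])
        url = page.get("next")  # None on the last page ends the loop
    return results, timings


# Example call (hypothetical host and token, not executed here):
# url = "https://netbox.example.com/api/ipam/ip-addresses/?limit=250"
# items, timings = fetch_all(url, lambda u: http_get_json(u, "<token>"))
# print(len(items), sum(timings))
```

Running this with a few different `limit` values and comparing `sum(timings)` is one way to reproduce the page-size comparison described above.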

@jsenecal commented on GitHub (Aug 3, 2023):

It obviously depends on how fast your CPU cores are, but yeah, 7000 IPs is a lot to retrieve. Maybe there is a better way, but a [discussion](https://github.com/netbox-community/netbox/discussions) would be better suited for this.

Reference: starred/netbox#8401