Cache the Config Context in the Database #3615

Closed
opened 2025-12-29 18:30:10 +01:00 by adam · 11 comments

Originally created by @dstarner on GitHub (Apr 27, 2020).

Environment

  • Python version: 3.7
  • NetBox version: 2.7.10

Proposed Functionality

[As discussed in Slack](https://networktocode.slack.com/archives/C3DQ6MZ0Q/p1587745271315900?thread_ts=1587743396.300200&cid=C3DQ6MZ0Q), it may be beneficial to cache the generated/merged config context in the database to reduce load times on bulk device list API requests.

Use Case

Fetching 1000 devices _with_ config context returned takes ~60s. _Without_ config context, the response takes ~5 seconds. This large overhead was determined to come from the generation and merging of the config contexts that apply to devices and virtual machines, which is redone on every request. Pre-caching the generated data in the database would avoid that repeated work.

Database Changes

This would require adding a `computed_config_context` field to [`ConfigContextModel`](https://github.com/netbox-community/netbox/blob/develop/netbox/extras/models.py#L841) that would contain the computed config context. The naming of this field is open to debate if something better is proposed.

The field would hold the pre-cached computed config context, and it would be rebuilt whenever an associated [`ConfigContext`](https://github.com/netbox-community/netbox/blob/develop/netbox/extras/models.py#L754) is updated, or during the `save()`/`post_save` handling of the [`ConfigContextModel`](https://github.com/netbox-community/netbox/blob/develop/netbox/extras/models.py#L841) itself.
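To make the proposal concrete, here is a minimal pure-Python sketch of the idea (plain classes standing in for the Django models; `deep_merge`, the `(weight, data)` context representation, and the rebuild hook are all hypothetical illustrations, not NetBox's actual implementation):

```python
def deep_merge(base, extra):
    """Recursively merge ``extra`` into a copy of ``base``; ``extra`` wins on conflicts."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


class Device:
    """Plain-Python stand-in for a ConfigContextModel subclass."""

    def __init__(self, contexts, local_context_data=None):
        # Applicable contexts as (weight, data) pairs -- a simplification.
        self.contexts = contexts
        self.local_context_data = local_context_data or {}
        self.computed_config_context = None  # the proposed cache field

    def rebuild_config_context(self):
        # Merge lowest weight first so higher weights win; local data last.
        data = {}
        for _, ctx_data in sorted(self.contexts, key=lambda c: c[0]):
            data = deep_merge(data, ctx_data)
        data = deep_merge(data, self.local_context_data)
        self.computed_config_context = data

    def save(self):
        # In Django, this rebuild would live in save() or a post_save receiver.
        self.rebuild_config_context()
```

Serializers could then return `computed_config_context` directly instead of re-running the merge on every request.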

External Dependencies

NA

adam added the status: revisions needed label 2025-12-29 18:30:10 +01:00
adam closed this issue 2025-12-29 18:30:10 +01:00

@jeremystretch commented on GitHub (Apr 27, 2020):

This proposal needs to be fleshed out more. What is the specific workflow being proposed? E.g. when do config contexts get calculated if not on-demand? How do we detect and resolve discrepancies? Is the calculation being performed in the background? What are the performance implications? Please be specific in your proposal.


@dstarner commented on GitHub (Apr 27, 2020):

I'll post a small design when I get a moment. If anyone sees this and wants to post their thoughts / comments before I flesh it out, feel free to.


@ebusto commented on GitHub (Apr 27, 2020):

This sounds great.

We are starting to use config contexts rather extensively, and have a few external services which relentlessly hammer the NetBox API. Eliminating the overhead of rendering the config context on demand would help scalability quite a bit.


@tyler-8 commented on GitHub (Apr 27, 2020):

I imagine this will look something like:

  1. Save signal on a ConfigContextModel, Device, or VM
  2. RQ Worker picks up task
  3. Call `.prerender_config_context()` method on all associated Devices/VMs (which uses the `.get_config_context()` method)
  4. Populate/overwrite `computed_config_context` field on the device or VM with the output

Serializers and views would be updated to show the contents of `computed_config_context` instead of calling `.get_config_context()` directly.
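The four steps above can be sketched with an in-memory queue standing in for RQ (`FakeQueue`, `handle_post_save`, and the trivial `get_config_context()` body are all hypothetical stand-ins):

```python
from typing import Callable, List


class FakeQueue:
    """Stand-in for an RQ queue: jobs run when a worker drains the queue."""

    def __init__(self):
        self.jobs: List[Callable[[], None]] = []

    def enqueue(self, fn, *args):
        self.jobs.append(lambda: fn(*args))

    def work(self):
        while self.jobs:
            self.jobs.pop(0)()


queue = FakeQueue()


class Device:
    def __init__(self, name):
        self.name = name
        self.computed_config_context = None

    def get_config_context(self):
        # Placeholder for NetBox's existing merge logic.
        return {"hostname": self.name}

    def prerender_config_context(self):
        # Step 4: populate/overwrite the cached field.
        self.computed_config_context = self.get_config_context()


def handle_post_save(device):
    # Steps 1-2: the save signal enqueues a background job.
    queue.enqueue(Device.prerender_config_context, device)
```

Note that until `queue.work()` runs, the cached field is stale, which is exactly the asynchrony concern raised below.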


@ebusto commented on GitHub (Apr 27, 2020):

Would the asynchronous nature of the RQ worker approach cause issues? That is, if I update a few attributes of a device, I'd expect to be able to click save and flip over to the config context tab and see the updated configuration.

Using just signals might be a better approach.


@ebusto commented on GitHub (Apr 27, 2020):

I implemented very similar functionality in one of our NetBox plugins, which allows you to group devices (and virtual machines) into services, such as "all cron servers owned by Unix Support" and "all SaltStack masters owned by Network Engineering".

Since manually adding and removing systems to and from services is tedious, it supports contexts, where you can specify criteria that are virtually the same as a configuration context's, with the addition of hostname patterns. A system that no longer meets the criteria is automatically removed from the service, and a system that was created or updated and now meets the criteria is automatically added to the service.

In order to keep the performance acceptable, I took the approach of building a query set representing the set of devices that need to be evaluated, and then evaluating each service against the query set, moving all of the work to the database.

When a `ServiceContext` is created, updated, or deleted, the query set is simply all devices (`Device.objects.all()`), but only against the specific service that was changed.

For devices that are created or updated, a signal receiver tracks the PK, and constructs the query set from all PKs once the transaction is committed. We often perform bulk updates of hundreds of devices at a time, so this query set can represent a large number of devices.

Our NetBox instance has 50k devices and 10k VMs, and with this approach the overhead is negligible.
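A rough pure-Python sketch of the PK-tracking pattern described above (the `ChangeTracker` class and `evaluate` callback are hypothetical; in Django the flush would be registered via `transaction.on_commit`):

```python
class ChangeTracker:
    """Collects PKs of saved devices and evaluates them in one pass at
    commit time, rather than doing per-device query work in the signal."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # callable taking a set of device PKs
        self.pending = set()

    def device_saved(self, pk):
        # Signal receiver: just record the PK -- cheap even for bulk updates.
        self.pending.add(pk)

    def on_commit(self):
        # One queryset over all changed devices; the work moves to the DB.
        if self.pending:
            self.evaluate(frozenset(self.pending))
            self.pending.clear()
```

Because the receiver only records PKs, a bulk update of hundreds of devices still triggers a single evaluation pass.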


@sdktr commented on GitHub (Apr 27, 2020):

> Would the asynchronous nature of the RQ worker approach cause issues? That is, if I update a few attributes of a device, I'd expect to be able to click save and flip over to the config context tab and see the updated configuration.
>
> Using just signals might be a better approach.

Agree on that. We can't trust the async update process to have finished before the next consumer requests the config_context and needs the update reflected there. But I also agree that we can't wait on a recompute of all affected data during a save.

The update signal on config_context (but also on a lot of other changes, like site/region/device, that can affect memberships!) should, IMO, do both of the following:

  1. invalidate the queryset (OR: invalidate the computed_config_context on all affected devices (old and new members of the queryset!))
  2. schedule the async task of regenerating

If the config_context of a device is requested, there should be a check on whether a valid computed config context is available. If not we should generate it on request (as current behavior).
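The invalidate-then-regenerate flow with an on-demand fallback can be sketched as follows (the `CachedContext` class and its method names are hypothetical illustrations of the proposal, not NetBox code):

```python
class CachedContext:
    """Sketch: invalidate on change, regenerate async, fall back on read."""

    def __init__(self, compute):
        self.compute = compute  # the expensive merge, injected for testing
        self.cached = None
        self.valid = False

    def invalidate(self):
        # Step 1: a ConfigContext (or site/region/device) change lands here.
        self.valid = False

    def regenerate(self):
        # Step 2: the scheduled async task eventually repopulates the cache.
        self.cached = self.compute()
        self.valid = True

    def get(self):
        # On request: serve the cache only if valid; otherwise compute now,
        # matching the current on-demand behavior.
        if not self.valid:
            self.regenerate()
        return self.cached
```

A consumer that reads between invalidation and the async rebuild simply pays the on-demand cost once, rather than seeing stale data.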

I hope we will in the future have a way of merging config_context on entities other than devices/VMs (e.g. at the site level). Storing the `computed_config_context` not with the device/VM but in a separate table (or cache) would be a future-proof way to handle this?

What should be the role of the Redis cache vs. the database for storing each of these properties (computed config context, (in)valid querysets, etc.)? The computed config context is in fact a cached/optimized copy of data, which might make it more suitable for storage in cache. On the other hand, storing it as a JSONB datatype in Postgres would open up some really cool possibilities for indexing/searching/querying _within_ the context.


@lampwins commented on GitHub (Apr 27, 2020):

I think this is an overcomplicated solution to the root problem, which is that we are not efficiently querying the ConfigContext model instances.

The reason this is so inefficient today is that [this method](https://github.com/netbox-community/netbox/blob/develop/netbox/dcim/api/serializers.py#L428) gets called on each serialized device instance, which in turn calls [this complex query logic](https://github.com/netbox-community/netbox/blob/d8cb58c74653da88d9aade40548b146a9311f5e6/netbox/extras/querysets.py#L24) _for every device_.

We should look to optimize this before going down the denormalization route.
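The optimization direction hinted at here can be illustrated in pure Python: fetch the applicable contexts once per request and evaluate membership in a single pass, instead of issuing one complex query per device (the `annotate_config_contexts` helper and its `(predicate, data)` representation are hypothetical simplifications):

```python
def annotate_config_contexts(devices, contexts):
    """Batch sketch: one pass over pre-fetched contexts for all devices,
    rather than one complex queryset evaluation per serialized device.

    ``devices`` is a list of dicts; ``contexts`` is a list of
    (applies_to, data) pairs standing in for ConfigContext filters.
    """
    results = {}
    for device in devices:
        merged = {}
        for applies_to, data in contexts:
            if applies_to(device):
                merged.update(data)  # simplification of the weighted merge
        results[device["name"]] = merged
    return results
```

In real Django terms this would mean pulling the `ConfigContext` rows once (with their assignment criteria) and annotating or joining them against the device queryset, rather than calling the per-object query path inside the serializer.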


@jeremystretch commented on GitHub (Apr 29, 2020):

@lampwins Do you want to close this issue in favor of a new one then? I want to make sure we keep the issue proposal consistent with the eventual solution to avoid confusion.


@dstarner commented on GitHub (Apr 30, 2020):

@jeremystretch I do not mind creating the new issue tied more closely to the query optimization. Feel free to close this one in the meantime.


@dstarner commented on GitHub (Apr 30, 2020):

@lampwins Interestingly enough, when I played around with this, Django Debug Toolbar noted that my SQL performance was generally okay, but that it was actually CPU time that was causing the slowness. I'm not sure at what level this CPU slowness is occurring.


Reference: starred/netbox#3615