Implement an improved global search engine #7070

Closed
opened 2025-12-29 19:48:33 +01:00 by adam · 10 comments

Originally created by @jeremystretch on GitHub (Oct 4, 2022).

Originally assigned to: @jeremystretch on GitHub.

NetBox version

v3.3.4

Feature type

Change to existing functionality

Proposed functionality

This issue is an evolution of the proposal initially outlined in #7016 to improve NetBox's global search functionality, in conjunction with dynamic registration per #8927. This proposal suggests the introduction of a new global search cache, allowing all results to be returned with a single database query per search. As this is a fairly complex topic, I've outlined some core areas of focus below. While not a complete implementation plan, it should be sufficient to get started and generate additional discussion.

Caching

Each registered model will declare which of its fields should be cached for search. (This could be done similarly to what we currently do with clone_fields and the clone() method.) Fields would be specified by name and a numeric weight. For example:

class Book(NetBoxModel):
    title = models.CharField(max_length=200)
    isbn = models.PositiveBigIntegerField()
    author = models.CharField(max_length=100)
    description = models.CharField(max_length=500)

    search_fields = (
        ('title', 200),
        ('isbn', 200),
        ('author', 150),
        ('description', 100),
    )
    
    def cache(self):
        data = []
        for field_name, weight in getattr(self, 'search_fields', []):
            field = self._meta.get_field(field_name)
            value = field.value_from_object(self)
            if value not in (None, ''):
                data.append(SearchResult(object=self, field=field_name, value=value, weight=weight))
        return data

On save(), the model's cache() method (name TBD) would be called via a post_save signal handler to generate cache data. A new low-priority background task would then be created to feed this data into the search results table. (Any existing results for the referenced object would first be deleted.) Similarly, search results will be automatically deleted in response to a post_delete signal.
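
As a rough sketch of this wiring, assuming the SearchResult model outlined under Database Schema below and NetBox's existing django-rq queues (the receiver and task names here are illustrative, not actual NetBox code):

import django_rq
from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

def cache_search_entries(instance):
    # Background task: replace any existing cached rows for this object
    ct = ContentType.objects.get_for_model(instance)
    SearchResult.objects.filter(object_type=ct, object_id=instance.pk).delete()
    SearchResult.objects.bulk_create(instance.cache())

@receiver(post_save)
def update_search_cache(sender, instance, **kwargs):
    # Only models which declare search_fields participate in the cache
    if getattr(instance, 'search_fields', None):
        # Defer the write to a low-priority background worker
        django_rq.get_queue('low').enqueue(cache_search_entries, instance)

@receiver(post_delete)
def delete_search_cache(sender, instance, **kwargs):
    if getattr(instance, 'search_fields', None):
        ct = ContentType.objects.get_for_model(instance)
        SearchResult.objects.filter(object_type=ct, object_id=instance.pk).delete()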

Database Schema

Cached search data from all models would be written to a single table:

| Field | Type | Description |
|-------------|-----------------|-------------------------------------------|
| timestamp | Datetime | Timestamp of most recent update |
| object_type | FK(ContentType) | GenericForeignKey component |
| object_id | Integer | GenericForeignKey component |
| field | Char | Name of the field/attribute being cached |
| value | Char | Cached value |
| weight | Integer | Numeric weight assigned to the field |

The object_type and object_id fields would serve a GenericForeignKey named object, which references the cached object.
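
For illustration, this schema might map to a Django model along these lines (field types, lengths, and the index are assumptions, not a final design):

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models

class SearchResult(models.Model):
    timestamp = models.DateTimeField(auto_now=True)
    object_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveBigIntegerField()
    object = GenericForeignKey('object_type', 'object_id')
    field = models.CharField(max_length=100)
    value = models.CharField(max_length=2000)
    weight = models.PositiveSmallIntegerField(default=100)

    class Meta:
        indexes = [
            # Supports per-object replacement and deletion of cached rows
            models.Index(fields=['object_type', 'object_id']),
        ]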

A populated table might look like this:

| timestamp | object_type | object_id | field | value | weight |
|-----------------|-------------|-----------|------------------|------------------------|--------|
| 2022-09-15 1:23 | dcim.Device | 441 | name | akron-rtr1 | 200 |
| 2022-09-15 1:23 | dcim.Device | 441 | serial | A4890274 | 180 |
| 2022-09-15 1:23 | dcim.Device | 441 | asset_tag | H302R8E | 180 |
| 2022-09-15 1:23 | dcim.Device | 441 | comments | Some text goes here | 50 |
| 2022-09-15 3:08 | dcim.Site | 17 | name | Akron | 200 |
| 2022-09-15 3:08 | dcim.Site | 17 | facility | us-oh-akron01 | 150 |
| 2022-09-15 3:08 | dcim.Site | 17 | description | Primary DC for US-East | 50 |
| 2022-09-15 3:08 | dcim.Site | 17 | physical_address | 123 Fake St Akron OH | 80 |

Searching for "akron" would return four rows. We can append .distinct('object_type', 'object_id') to ensure only a single row is returned per object, and order on '-weight' to favor the most important result for each object. (Note that PostgreSQL's DISTINCT ON requires the ordering to begin with the distinct fields, and we might further order by object type for consistency among objects with identical weights.)
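
Put together, such a query might look like this (against the hypothetical SearchResult model above):

results = SearchResult.objects.filter(
    value__icontains='akron'
).order_by(
    # DISTINCT ON fields must lead the ordering; '-weight' then selects
    # the highest-weight row for each object
    'object_type', 'object_id', '-weight'
).distinct(
    'object_type', 'object_id'
)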

Matching Logic

We could potentially add an exact boolean column to the table, indicating whether each result requires an exact (vs. partial) match. This could be useful for e.g. integer values, where partial matching is typically of little value. For example, we might only want to find exact matches for a device's serial or asset_tag values. Such a query would look like this:

SearchResult.objects.filter(Q(value__iexact='foo') | Q(exact=False, value__icontains='foo'))

It remains to be seen what the performance penalty of this approach looks like. We could also expose exactness as a toggle, enabling the user to search only for exact matches.
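
One hypothetical way to capture this at declaration time would be an optional third element in each search_fields tuple, which cache() would unpack and store in the exact column:

search_fields = (
    ('title', 200),
    ('isbn', 200, True),     # integer value: require an exact match
    ('author', 150),
    ('description', 100),
)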

Displaying Results

Each matching result will include several attributes:

  • The object referenced (with a link)
  • The field name on which the match occurred
  • The field value, or matched portion of the value

These can be displayed to the user to convey a succinct understanding of why each object was included in the results. Although resolving each object requires a GenericForeignKey lookup, this should be automatically reduced via prefetch_related() to a single additional query per type of object returned.
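
For example, again assuming the hypothetical SearchResult model:

results = SearchResult.objects.filter(
    value__icontains='akron'
).prefetch_related('object')  # one additional query per ContentType returned

for result in results:
    # result.object is the matched NetBox object; result.field and
    # result.value convey why it was included
    print(result.object, result.field, result.value)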

Handling Model Migrations

Some housekeeping will be necessary to delete cached search results that reference fields which have been removed from their models. We should be able to hook into the post_migrate signal to detect when migrations have been applied, and bulk delete any entries referencing a field that is no longer listed under its model's search_fields attribute. A similar approach may be used to detect removed models (e.g. because a plugin was uninstalled).
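
A rough sketch of such a handler, again assuming the hypothetical SearchResult model (a real implementation would likely also scope its work to the app being migrated):

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_migrate
from django.dispatch import receiver

@receiver(post_migrate)
def prune_search_cache(sender, **kwargs):
    for ct in ContentType.objects.all():
        model = ct.model_class()
        if model is None:
            # The model itself is gone (e.g. its plugin was uninstalled)
            SearchResult.objects.filter(object_type=ct).delete()
            continue
        valid_fields = [name for name, *_ in getattr(model, 'search_fields', [])]
        # Drop entries for fields no longer declared under search_fields
        SearchResult.objects.filter(
            object_type=ct
        ).exclude(field__in=valid_fields).delete()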

Considerations

Advantages

  • No new PostgreSQL extensions or other dependencies are introduced.
  • Both native and plugin models can be registered automatically.
  • All results can be returned via a single SQL query, and rendered as a single table.
  • An unlimited number of results can be paginated.
  • The search logic to be applied can be specified by the user (e.g. exact match vs. partial match vs. "starts with," etc.).
  • Employing a background worker ensures that the caching of new results does not impact real-time operations.
  • The entire search cache can be rebuilt offline if needed.
  • The timestamp column can be compared against an object's last_updated time to identify stale results.

Disadvantages

  • We're essentially inventing our own search engine (humble as it may be).
  • This approach does not solve for matching by related objects; however, this is generally not a feature of the current search function either, so it is likely a reasonable compromise.
  • Using a single large table for all results may degrade search performance over time as object counts increase.

Use case

The overall goal here is to provide more robust, general-purpose search for both core NetBox models and those delivered via plugins. Performance, while important, is probably less important than implementing a consistent, flexible search engine.

Database changes

Defined above for clarity

External dependencies

None

adam added the status: accepted and type: feature labels 2025-12-29 19:48:33 +01:00
adam closed this issue 2025-12-29 19:48:34 +01:00

@cpmills1975 commented on GitHub (Oct 4, 2022):

I know next to nothing about Redis other than that it's an "in-memory database," but would holding this table, or parts of it, in Redis rather than in a PostgreSQL table offer anything?


@ITJamie commented on GitHub (Oct 5, 2022):

Seconding the Redis comment: Redis tends to be very good at this type of lookup, and it would also have the advantage of not growing the "main" database's size (assuming that this new caching table would exist in the same DB).

If the decision is made to build a caching table, would it make sense to fork one of Django's existing caching backend modules to leverage this (e.g. removing the MAX_ENTRIES limits)?
https://docs.djangoproject.com/en/4.1/topics/cache/#database-caching


@jeremystretch commented on GitHub (Oct 5, 2022):

Potentially. At this early stage I'm primarily interested in proving viability of the approach, and less concerned with performance. Once we have a working prototype in place using PostgreSQL, we should have a better idea of whether it's feasible to swap in Redis for the cache and what the trade-offs will be.


@ghost commented on GitHub (Oct 6, 2022):

Just curious: rather than inventing your own search, what is the reluctance to leverage an existing extension, dependency, or feature set from another tool to accomplish this?

In the last community call you mentioned this global search and discussed the fact that some users have environments that prevent them from installing additional dependencies. If that is the case, why should the entire user base not be able to take advantage because some environments are more restrictive than others?

Is there any way to make this feature something that can be 'turned on' or 'turned off' either by settings, or the detection of these additional dependencies? If you don't have them, search continues to function the way it does now. If you do, then you get to take advantage of global search?


@jeremystretch commented on GitHub (Oct 19, 2022):

I've made excellent progress with the new search implementation, and performance testing has been very encouraging. (Searching against 1+ million cached values for a partial match takes ~140ms on my local instance.) For now at least, we'll keep the cache in PostgreSQL as this affords advanced querying functionality not available with Redis.

Still to do:

  • Optimize the reindex management command
  • Extend the reindex management command to allow partial reindexing by app/model
  • Consider moving the caching routine to a background task
  • Highlight the portion of the cached value which matches the query
  • Add initial global search tests

@ghost commented on GitHub (Oct 19, 2022):

How might the search work, syntax-wise?
Will wildcards and/or regex be supported?
Will there be a character minimum? I can't just search for 'a' and have it return every single value that contains an 'a'?


@ITJamie commented on GitHub (Oct 19, 2022):

As an addendum to the above. Would glob or regex based searching be possible?


@jeremystretch commented on GitHub (Oct 21, 2022):

The initial implementation (pre-beta) does not introduce any sort of advanced query language. The search backend supports specification of a general lookup logic (e.g. partial vs. exact match) but nothing beyond that. While we can certainly consider something, it would be best to do so under a separate FR.


@jeremystretch commented on GitHub (Oct 21, 2022):

Closing this out as the initial implementation of the new backend has been completed. We're likely to continue iterating on it up to and throughout the beta release.


@PieterL75 commented on GitHub (Oct 21, 2022):

Looking forward to this!
Also the power it can add to type-ahead, predictive search, etc.
