Slow queries in GraphQL after upgrading to v3.4.1 #7411

Originally created by @glesys-andreas on GitHub (Dec 22, 2022).

Originally assigned to: @arthanson on GitHub.

NetBox version

v3.4.1

Python version

3.9

Steps to Reproduce

import/create >= 800 vlans
[{"vid": "1", "name": "vlan_1", "status": "active"} ..... {"vid": "800", "name": "vlan_800", "status": "active"}]

run query:

query vlan {
	vlan_list {
		prefixes {
			prefix
			description
			tenant {
				name
			}
			vlan {
				group {
					name
				}
				vid
			}
		}
	}
}

Expected Behavior

Response in less than 10 seconds.

Observed Behavior

In version 3.3 this query took less than 10s, after upgrading to 3.4 this takes between 30-40 seconds.
This is reproduced at the demo site: https://demo.netbox.dev/ipam/vlans/

Originally created by @glesys-andreas on GitHub (Dec 22, 2022). Originally assigned to: @arthanson on GitHub. ### NetBox version v3.4.1 ### Python version 3.9 ### Steps to Reproduce import/create >= 800 vlans [{"vid": "1", "name": "vlan_1", "status": "active"} ..... {"vid": "800", "name": "vlan_800", "status": "active"}] run query: ``` query vlan { vlan_list { prefixes { prefix description tenant { name } vlan { group { name } vid } } } } ``` ### Expected Behavior Response in less than 10 seconds. ### Observed Behavior In version 3.3 this query took less than 10s, after upgrading to 3.4 this takes between 30-40 seconds. This is reproduced at the demo site: https://demo.netbox.dev/ipam/vlans/

adam added the status: accepted type: feature labels 2025-12-29 20:23:05 +01:00

adam closed this issue

adam commented

@arthanson commented on GitHub (Dec 28, 2022):

This is more of a feature request than a bug, The issue is we aren't doing any prefetch_related or select_related on the GraphQL queries. I replicated the above and 2046 queries are done. I'm not certain why it got worse from 3.3 to 3.4 but the root of the issue is the DB calls.

The bigger issue with GraphQL is we don't know what will be included in the query, so you can either optimize for most common queries or dynamically optimize queries. There is a package https://github.com/tfoxy/graphene-django-optimizer to dynamically optimize queries, unfortunately it doesn't work with the latest Graphene.

@arthanson commented on GitHub (Dec 28, 2022): This is more of a feature request than a bug, The issue is we aren't doing any prefetch_related or select_related on the GraphQL queries. I replicated the above and 2046 queries are done. I'm not certain why it got worse from 3.3 to 3.4 but the root of the issue is the DB calls. The bigger issue with GraphQL is we don't know what will be included in the query, so you can either optimize for most common queries or dynamically optimize queries. There is a package https://github.com/tfoxy/graphene-django-optimizer to dynamically optimize queries, unfortunately it doesn't work with the latest Graphene.

adam commented

@emil-nasso commented on GitHub (Dec 31, 2022):

I think this is a standard problem in graphql APIs that is usually solved using batching/data loaders. Basically, it's an n+1 problem/lazy loading problem. There is one query to the database to fetch all prefixes and then when it is time to load the VLAN for a prefix, there is no batching implemented, so it performs lazy-loading/one query to the database per prefix.

The solution is to batch it. When trying to load VLANs, instead of directly loading them, their keys are "cached" and finally, a list of all prefixes that we need to load VLANs for is sent to the batch method and we can load them all at once, using one query to the database.

I have implemented this in several graphql APIs but I'm not "fluent in python or graphene". :)

It seems like graphene has a solution for it though: https://docs.graphene-python.org/en/latest/execution/dataloader

It seems like it could possibly be a solution for this issue to use that.

@emil-nasso commented on GitHub (Dec 31, 2022): I think this is a standard problem in graphql APIs that is usually solved using batching/data loaders. Basically, it's an n+1 problem/lazy loading problem. There is one query to the database to fetch all prefixes and then when it is time to load the VLAN for a prefix, there is no batching implemented, so it performs lazy-loading/one query to the database per prefix. The solution is to batch it. When trying to load VLANs, instead of directly loading them, their keys are "cached" and finally, a list of all prefixes that we need to load VLANs for is sent to the batch method and we can load them all at once, using one query to the database. I have implemented this in several graphql APIs but I'm not "fluent in python or graphene". :) It seems like graphene has a solution for it though: https://docs.graphene-python.org/en/latest/execution/dataloader It seems like it could possibly be a solution for this issue to use that.

adam commented

@neuro42 commented on GitHub (Jan 23, 2023):

After upgrading to 3.4.3 from 3.3.6, we've also noticed our graphql query jump from ~1s to ~7s. Looking at the CPU usage with py-spy, it doesn't seem like the DB queries are being slow, but the django bits around them that are:

2.74s   _fetch_all (django/db/models/query.py:1867)
1.36s   _deepcopy_dict (copy.py:230)
1.38s   deepcopy (copy.py:146)
2.33s   filter_queryset (django_filters/filterset.py:213)
3.25s   __iter__ (django/db/models/query.py)

I haven't had a chance yet to rerun the query against 3.3.6 to see what it looked like there.

@neuro42 commented on GitHub (Jan 23, 2023): After upgrading to 3.4.3 from 3.3.6, we've also noticed our graphql query jump from ~1s to ~7s. Looking at the CPU usage with py-spy, it doesn't seem like the DB queries are being slow, but the django bits around them that are: ``` 2.74s _fetch_all (django/db/models/query.py:1867) 1.36s _deepcopy_dict (copy.py:230) 1.38s deepcopy (copy.py:146) 2.33s filter_queryset (django_filters/filterset.py:213) 3.25s __iter__ (django/db/models/query.py) ``` I haven't had a chance yet to rerun the query against 3.3.6 to see what it looked like there.

adam commented

@v0tti commented on GitHub (Feb 16, 2023):

This issue is causing some trouble for us because most of our queries are taking so long we run into timeouts.

Example query we are using:

{
      device_list(has_primary_ip: true, status__n: "planned") {
        comments
        custom_fields
        device_role {
          name
        }
        id
        name
        platform {
          slug
        }
        primary_ip4 {
          address
          dns_name
        }
        primary_ip6 {
          address
          dns_name
        }
        site {
          slug
        }
        status
        interfaces {
          bridge {
            name
          }
          ip_addresses {
            address
          }
          lag {
            name
          }
          name
          mac_address
          mgmt_only
          parent {
            name
          }
          tagged_vlans {
            vid
          }
          type
          untagged_vlan {
            vid
          }
        }
      }
}

@v0tti commented on GitHub (Feb 16, 2023): This issue is causing some trouble for us because most of our queries are taking so long we run into timeouts. Example query we are using: ``` { device_list(has_primary_ip: true, status__n: "planned") { comments custom_fields device_role { name } id name platform { slug } primary_ip4 { address dns_name } primary_ip6 { address dns_name } site { slug } status interfaces { bridge { name } ip_addresses { address } lag { name } name mac_address mgmt_only parent { name } tagged_vlans { vid } type untagged_vlan { vid } } } } ```

adam commented

@v0tti commented on GitHub (Feb 21, 2023):

I have identified commit 99cf1b1671 as responsible for the massive slowdown of the GraphQL queries.
For our worst query it slows down the execution time from about 20 seconds to over 2 minutes.

@v0tti commented on GitHub (Feb 21, 2023): I have identified commit 99cf1b16718ca8bc037f546c41a9258bcc89b495 as responsible for the massive slowdown of the GraphQL queries. For our worst query it slows down the execution time from about 20 seconds to over 2 minutes.

adam commented

@v0tti commented on GitHub (Feb 22, 2023):

My fork reverts the commit on top of the current NetBox release. GraphQL is now as fast as prior to v3.4. As to be expected, filtering does not work on all levels, but we do not use that at the moment.

@v0tti commented on GitHub (Feb 22, 2023): [My fork](https://github.com/v0tti/netbox/releases/tag/v3.4.5-fix) reverts the commit on top of the current NetBox release. GraphQL is now as fast as prior to v3.4. As to be expected, filtering does not work on all levels, but we do not use that at the moment.

adam commented

@florianschendel commented on GitHub (Feb 28, 2023):

Hi guys,

I have a similar issue. Since the upgrade to 3.4.x the graphql query ends with a timeout

{
  prefix_list {
    description
    id
    last_updated
    site {
      name
    }
    vrf {
      name
    }
    tenant {
      name
    }
    role {
      name
    }
    tags {
      name
    }
    custom_fields
    prefix
    vlan {
      vid
      id
      name
      description
      status
      last_updated
    }
  }
}

Less complex queries still work without any issues.

Best regards
Florian

@florianschendel commented on GitHub (Feb 28, 2023): Hi guys, I have a similar issue. Since the upgrade to 3.4.x the graphql query ends with a timeout ``` { prefix_list { description id last_updated site { name } vrf { name } tenant { name } role { name } tags { name } custom_fields prefix vlan { vid id name description status last_updated } } } ``` Less complex queries still work without any issues. Best regards Florian

adam commented

@arthanson commented on GitHub (Feb 28, 2023):

Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those.

@arthanson commented on GitHub (Feb 28, 2023): Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those.

adam commented

@florianschendel commented on GitHub (Feb 28, 2023):

Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those.

thanks a lot!

@florianschendel commented on GitHub (Feb 28, 2023): > Am going to take another look at this later this week, see if there are some tweaks that can be made. Thanks for the sample queries, will test with those. thanks a lot!

adam commented

@florianschendel commented on GitHub (Feb 28, 2023):

I split the one script into three scripts as workaround

I created one for the region "emea", "apac" and "americas". We created super networks/summary networks with the VLAN ID 4094.

Actually we don´t need it in our graphql report, we need only VLAN/Prefixes that are really configured on our switches, routers ....

Maybe its also a workaround for some user people.

{
  prefix_list(vlan_vid__lt: 4094, region: "emea") {
    description
    id
    last_updated
    site {
      name
    }
    vrf {
      name
    }
    tenant {
      name
    }
    role {
      name
    }
    tags {
      name
    }
    custom_fields
    prefix
    vlan {
      vid
      id
      name
      description
      status
      last_updated
    }
  }
}

@florianschendel commented on GitHub (Feb 28, 2023): I split the one script into three scripts as workaround I created one for the region "emea", "apac" and "americas". We created super networks/summary networks with the VLAN ID 4094. Actually we don´t need it in our graphql report, we need only VLAN/Prefixes that are really configured on our switches, routers .... Maybe its also a workaround for some user people. ``` { prefix_list(vlan_vid__lt: 4094, region: "emea") { description id last_updated site { name } vrf { name } tenant { name } role { name } tags { name } custom_fields prefix vlan { vid id name description status last_updated } } } ```

adam commented

@arthanson commented on GitHub (Mar 9, 2023):

Traced down the issue: https://github.com/netbox-community/netbox/blob/develop/netbox/netbox/filtersets.py#L249 This code is causing custom-fields to be queried even if they aren't in GraphQL request. You don't see this in REST API as all fields are returned anyways and there are prefetch_related calls in-place, but in GraphQL only a specific subset of fields are returned and there is currently no query optimization. This basically causes the query count to double from previous versions as filters are applied correctly now.

I made a bunch of VLANs with the query above, here are the query counts:
old code (before filtering at all levels): 985 queries
current code (with filtering at all levels): 1852 queries
with query optimizer: 84 queries

This will need to be fixed separately from a query optimizer, either the code referenced above needs to be changed so the custom fields aren't added to the query (preferred), or the query optimizer would need to detect the filter is a subclass of NetBoxModelFilterSet and auto add custom field prefetches.

@arthanson commented on GitHub (Mar 9, 2023): Traced down the issue: https://github.com/netbox-community/netbox/blob/develop/netbox/netbox/filtersets.py#L249 This code is causing custom-fields to be queried even if they aren't in GraphQL request. You don't see this in REST API as all fields are returned anyways and there are prefetch_related calls in-place, but in GraphQL only a specific subset of fields are returned and there is currently no query optimization. This basically causes the query count to double from previous versions as filters are applied correctly now. I made a bunch of VLANs with the query above, here are the query counts: old code (before filtering at all levels): 985 queries current code (with filtering at all levels): 1852 queries with query optimizer: 84 queries This will need to be fixed separately from a query optimizer, either the code referenced above needs to be changed so the custom fields aren't added to the query (preferred), or the query optimizer would need to detect the filter is a subclass of NetBoxModelFilterSet and auto add custom field prefetches.

adam commented