Incorrect API result "virtual_disk_count" for some VM #11859

Open
opened 2025-12-29 21:50:47 +01:00 by adam · 13 comments

Originally created by @stavr666 on GitHub (Nov 21, 2025).

Originally assigned to: @jeremystretch on GitHub.

NetBox Edition

NetBox Community

NetBox Version

v4.4.6

Python Version

3.11

Steps to Reproduce

Encountered on v4.3.4. Reproducible after upgrading to v4.4.6. Not all (new) VMs affected for some reason.

  1. Create VM
  2. Create virtual disk for VM

Expected Behavior

  1. The VM shows the number of disks on its "Virtual Disks" tab
  2. The API returns the correct "virtual_disk_count" for the VM

Observed Behavior

The disk count shows zero in both the UI and the API response.

*(Screenshots attached.)*

@jnovinger commented on GitHub (Nov 21, 2025):

Thanks for the report, @stavr666. I'm not able to reproduce your STR exactly, but I do see something very similar.

In the API representation of my new VM (from both the `/api/virtualization/virtual-machines/` list endpoint and the `/api/virtualization/virtual-machines/541/` detail endpoint), the nested `role` object shows zeros for `device_count` and `virtualmachine_count` (which is inaccurate on its face, since it's nested in a VM with that role!).

*(Screenshot attached.)*

However, when I view the detail of that device role (`/api/dcim/device-roles/7/`), the counts match what is displayed in the web UI.
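A mismatch like this can be surfaced mechanically by diffing the counter fields of the nested object against the detail endpoint's response for the same object. The helper below is a hypothetical sketch (the function name and sample payloads are invented for illustration; it operates on already-fetched JSON dicts, not on NetBox itself):

```python
def find_count_mismatches(nested_obj: dict, detail_obj: dict,
                          fields=("device_count", "virtualmachine_count")) -> dict:
    """Compare counter fields between a nested representation (e.g. the
    `role` object embedded in a VM response) and the authoritative detail
    endpoint for the same object. Returns {field: (nested, detail)} for
    every field whose values disagree."""
    return {
        f: (nested_obj.get(f), detail_obj.get(f))
        for f in fields
        if nested_obj.get(f) != detail_obj.get(f)
    }

# Example with values like those described above: the nested role reports
# zeros, while the detail endpoint reports the true counts.
nested_role = {"id": 7, "device_count": 0, "virtualmachine_count": 0}
detail_role = {"id": 7, "device_count": 3, "virtualmachine_count": 5}
print(find_count_mismatches(nested_role, detail_role))
# → {'device_count': (0, 3), 'virtualmachine_count': (0, 5)}
```

Running this over every nested object in a paginated listing would make it easy to tell whether all nested counts are zeroed or only some.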


@stavr666 commented on GitHub (Nov 21, 2025):

> However, when I view the detail of that device role (`/api/dcim/device-roles/7/`), the counts match what is displayed in the web UI.

Yes, the fact that the UI and the API agree on the wrong values also seems strange to me. I've encountered this rare UI bug before (I never reported it, since it wasn't critical). But now, several days after trying to fix it by upgrading to v4.6.4, it's causing our automation to break in the "compare by disk count" pipeline. We can rewrite our scripts to work around it, but that would make them slow again.

Can I somehow collect any technical details that would help diagnose the source of the error? An SQL query or something?


@jnovinger commented on GitHub (Nov 21, 2025):

> Can I somehow collect any technical details that would help diagnose the source of the error? An SQL query or something?

Python stack traces (in the case of unhandled exceptions), SQL queries, version numbers, and things along those lines are the most useful for really isolating where a problem originates.

Although, in this case, I suspect it has something to do with our `CounterCacheField` and how it's being handled by the API serializers.
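For anyone wanting to capture those SQL queries: one generic Django approach (an assumption about the deployment, not NetBox-specific documentation) is to enable the `django.db.backends` logger, which emits one DEBUG record per executed SQL statement while `DEBUG = True`:

```python
# Hypothetical logging snippet for a Django settings module; logs every
# SQL query Django executes to the console. Queries are only recorded
# when DEBUG is enabled, so don't leave this on in production.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        # Django emits one DEBUG record per executed SQL statement here.
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
```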


@stavr666 commented on GitHub (Nov 21, 2025):

> it has something to do with our `CounterCacheField`

Is there any way to bump its refresh manually?


@jnovinger commented on GitHub (Nov 21, 2025):

I don't actually believe the count itself is wrong, so much as the serializer isn't reading the value correctly and is defaulting to zero. But that's just speculation; I haven't had any time to dig into this one yet.


@jnovinger commented on GitHub (Nov 21, 2025):

Seems like this and #19976 are likely related.


@stavr666 commented on GitHub (Nov 21, 2025):

It's not that critical for now, so I'll keep track of the mentioned issue.


@pheus commented on GitHub (Nov 21, 2025):

I might be wrong here, but to me this looks like two slightly different problems.

The API nested object counts (e.g. the `device_count` on `role`) seem to be coming from queryset annotations. Those I can reproduce pretty easily, including on the public demo.

By contrast, the `virtual_disk_count` field is a cached integer field (a `CounterCacheField`) on the model. While working on #19523 I ran into similar problems with cached counters and opened #20697 to track a `CounterCacheField` double-counting bug. In that investigation I managed to push some counters into negative values (for example `-2` devices) when the initial value was `0` and a related `Device` was deleted. With the `CounterCacheField` mechanism in place, every creation bumps the counter by `+1` and every deletion by `-1`, so if the counter ever gets out of sync, it can drift into odd values.

There is a management command that recalculates all `CounterCacheField` values:

```bash
python3 netbox/manage.py calculate_cached_counts
```

@stavr666 Could you try running this on your instance and then repeat the steps you used to trigger the issue? It would be helpful to know whether the problem persists after the counters have been rebuilt, or if it only affected stale values from before.

If I'm misunderstanding the root cause here, please feel free to correct me. Just sharing what I've seen while working with the counter fields recently. 🙌
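The drift mechanism described above can be illustrated outside NetBox with a toy counter; the `CounterCache` class below is purely hypothetical and only mimics the +1/-1 bookkeeping and the full-recount rebuild step:

```python
class CounterCache:
    """Toy model of a delta-maintained counter cache: creations add 1,
    deletions subtract 1, and the stored value is never re-derived from
    the real objects unless rebuild() is called."""

    def __init__(self, initial: int = 0):
        self.cached = initial      # what the API/UI would report
        self.objects = set()       # the actual related objects

    def create(self, obj):
        self.objects.add(obj)
        self.cached += 1

    def delete(self, obj):
        self.objects.discard(obj)
        self.cached -= 1

    def rebuild(self):
        """Analogue of `calculate_cached_counts`: recount from scratch."""
        self.cached = len(self.objects)


# Imagine a missed creation event left the cache one low before we start:
c = CounterCache(initial=-1)       # out-of-sync starting point
c.create("disk1")
c.create("disk2")
print(c.cached, len(c.objects))    # → 1 2  (the error persists through deltas)

c.rebuild()
print(c.cached)                    # → 2  (rebuilding fixes it)
```

This is why a rebuild repairs existing objects but cannot prevent new drift if some create/delete path bypasses the delta updates.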

@stavr666 commented on GitHub (Nov 24, 2025):

> There is a management command that recalculates all `CounterCacheField` values:
>
> `python3 netbox/manage.py calculate_cached_counts`

It helped. Both the UI and the API show correct values now:

*(Screenshots attached.)*

@pheus commented on GitHub (Nov 26, 2025):

Thanks for confirming, @stavr666! Glad to hear the values look correct now! 🙌

If you have a moment, could you try to repeat the steps that originally triggered the mismatch (both with existing objects and with newly created ones) and see whether you can still reproduce the issue?

That would help a lot to confirm whether this was just a case of stale cached counters or if there’s still an underlying bug we should keep digging into. No pressure if you don’t have time right away, of course 🙂


@stavr666 commented on GitHub (Nov 28, 2025):

@pheus
New VMs show the same caching issue:

*(Screenshots attached.)*

@jeremystretch commented on GitHub (Dec 17, 2025):

@stavr666 I'm not able to reproduce this on NetBox v4.4.8. If you're still encountering this issue after upgrading, could you please share updated reproduction steps?


@github-actions[bot] commented on GitHub (Dec 25, 2025):

This is a reminder that additional information is needed in order to further triage this issue. If the requested details are not provided, the issue will soon be closed automatically.

Reference: starred/netbox#11859