Nondeterministic ordering of interfaces with a "zero" channel or subinterface #3453

Closed
opened 2025-12-29 18:29:15 +01:00 by adam · 13 comments
Owner

Originally created by @gaige on GitHub (Mar 9, 2020).

Originally assigned to: @jeremystretch on GitHub.

Environment

  • Python version: 3.7.5
  • NetBox version: 2.7.9

Steps to Reproduce

  1. Create a device to which you can attach interfaces
  2. Create 2 interfaces on this device, named ge-0/0/26 and ge-0/0/26.0 filling in any other required fields.
  3. The sub-interface (ge-0/0/26.0) should be listed after the main interface (ge-0/0/26), and it may be on your screen.
  4. In the postgres database, the _name column of the dcim_interface table for both entries will be the same.

Note: the problem that I had was trickier to find originally, and resulted in inconsistent behavior when requesting a multi-page group of interfaces via the API. If the ge-0/0/26 and ge-0/0/26.0 interfaces were to be the 50th and 51st items to be returned by the API, there was a very consistent behavior where 'ge-0/0/26' was returned both as the last item in the first call (limit=50 no offset), and the first item in the second call (limit=50&offset=50) and that the ge-0/0/26.0 item would not be returned at all. By running a larger limit, I could see that the problem didn't happen, which indicated a variance in what was being returned by the database query, which led me to the _name field ordering in the dcim_interface table.

Expected Behavior

Each of the interfaces should have a unique _name key in the database.

Observed Behavior

The _name key for an item with no .0 and an item with a .0 were identical. This is due to the ordering algorithm using 000000 as the empty placeholder as well as the value for 0. To the left of the interface name, this is not a problem, because the slot numbers use 9999 as placeholders instead of 0, and thus do not collide with the 0's.

I've got a patch for this which I'll be submitting in a pull request forthwith.

Originally created by @gaige on GitHub (Mar 9, 2020). Originally assigned to: @jeremystretch on GitHub. <!-- NOTE: IF YOUR ISSUE DOES NOT FOLLOW THIS TEMPLATE, IT WILL BE CLOSED. This form is only for reproducible bugs. If you need assistance with NetBox installation, or if you have a general question, DO NOT open an issue. Instead, post to our mailing list: https://groups.google.com/forum/#!forum/netbox-discuss Please describe the environment in which you are running NetBox. Be sure that you are running an unmodified instance of the latest stable release before submitting a bug report. --> ### Environment * Python version: 3.7.5 * NetBox version: 2.7.9 <!-- Describe in detail the exact steps that someone else can take to reproduce this bug using the current stable release of NetBox (or the current beta release where applicable). Begin with the creation of any necessary database objects and call out every operation being performed explicitly. If reporting a bug in the REST API, be sure to reconstruct the raw HTTP request(s) being made: Don't rely on a wrapper like pynetbox. --> ### Steps to Reproduce 1. Create a device to which you can attach interfaces 2. Create 2 interfaces on this device, named `ge-0/0/26` and `ge-0/0/26.0` filling in any other required fields. 3. The sub-interface (ge-0/0/26.0) should be listed after the main interface (ge-0/0/26), and it may be on your screen. 4. In the postgres database, the _name column of the dcim_interface table for both entries will be the same. Note: the problem that I had was trickier to find originally, and resulted in inconsistent behavior when requesting a multi-page group of interfaces via the API. If the ge-0/0/26 and ge-0/0/26.0 interfaces were to be the 50th and 51st items to be returned by the API, there was a very consistent behavior where 'ge-0/0/26' was returned both as the last item in the first call (limit=50 no offset), and the first item in the second call (limit=50&offset=50) and that the `ge-0/0/26.0` item would not be returned at all. By running a larger limit, I could see that the problem didn't happen, which indicated a variance in what was being returned by the database query, which led me to the _name field ordering in the dcim_interface table. <!-- What did you expect to happen? --> ### Expected Behavior Each of the interfaces should have a unique _name key in the database. <!-- What happened instead? --> ### Observed Behavior The _name key for an item with no `.0` and an item with a `.0` were identical. This is due to the ordering algorithm using `000000` as the empty placeholder as well as the value for 0. To the left of the interface name, this is not a problem, because the slot numbers use `9999` as placeholders instead of 0, and thus do not collide with the 0's. I've got a patch for this which I'll be submitting in a pull request forthwith.
adam added the type: bugstatus: accepted labels 2025-12-29 18:29:15 +01:00
adam closed this issue 2025-12-29 18:29:15 +01:00
Author
Owner

@gaige commented on GitHub (Mar 9, 2020):

Note that I do have a pull request ready to go, but in accordance with how I'm interpreting the policy, I won't submit it until after the bug has been accepted. For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-0 semaphor to indicate a missing element. This both preserves order and differentiates between 0 and missing elements.

My branch containing the fix is available at:
https://github.com/cluetrust/netbox/tree/4366-interface-collision

@gaige commented on GitHub (Mar 9, 2020): Note that I do have a pull request ready to go, but in accordance with how I'm interpreting the policy, I won't submit it until after the bug has been accepted. For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-`0` semaphor to indicate a missing element. This both preserves order and differentiates between `0` and missing elements. My branch containing the fix is available at: https://github.com/cluetrust/netbox/tree/4366-interface-collision
Author
Owner

@jeremystretch commented on GitHub (Mar 10, 2020):

Yeah, the old regex approach would treat a missing ID/channel as NULL and order appropriately. We didn't fully replicate this in #3799, instead setting a zeroized value.

For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-0 semaphor to indicate a missing element.

This is the right track, though I'd prefer to use a non-whitespace character such as a dot for better readability. Also, I want to avoid adding random examples to the tests as they become very difficult to manage over time.

@jeremystretch commented on GitHub (Mar 10, 2020): Yeah, the old regex approach would treat a missing ID/channel as NULL and order appropriately. We didn't fully replicate this in #3799, instead setting a zeroized value. > For reference, my proposed mechanism is to use a 4-space semaphore instead of a 4-0 semaphor to indicate a missing element. This is the right track, though I'd prefer to use a non-whitespace character such as a dot for better readability. Also, I want to avoid adding random examples to the tests as they become very difficult to manage over time.
Author
Owner

@gaige commented on GitHub (Mar 10, 2020):

I'll go ahead and put a dot in there. Although I'm half-tempted to suggest a - (HYPHEN-MINUS) to prevent problems with . potentially being a pain in regular expressions. But, I'm happy to go either way.

How about the following for the test additions (with the appropriate changes as above):

    ('vlan0', '9999999999999999vlan000000            '),
    ('vlan0.0', '9999999999999999vlan000000      000000'),
    ('ge-0/0/1', '0000000099999999ge-000001            '),
    ('ge-0/0/1.0', '0000000099999999ge-000001      000000'),
    ('ge-0/0/1.2', '0000000099999999ge-000001      000002'),
    ('ge-1/2/3.4', '0001000299999999xe-000003      000004'),

In particular, this tests the two sub-interface types (hyphen-separated and not hyphen-separated) for Junos and sticks with the formula used previously in the Cisco-style interfaces for increasing digits.

I'd also replace the not equal verification to be (which I realized was incorrect):

        self.assertNotEqual(naturalize_interface('ge-0/0/1', max_length=100),
                             naturalize_interface('ge-0/0/1.0', max_length=100))

Let me know and I'll make the changes, test, and do the pull request.

@gaige commented on GitHub (Mar 10, 2020): I'll go ahead and put a dot in there. Although I'm half-tempted to suggest a `-` (HYPHEN-MINUS) to prevent problems with `.` potentially being a pain in regular expressions. But, I'm happy to go either way. How about the following for the test additions (with the appropriate changes as above): ```` ('vlan0', '9999999999999999vlan000000 '), ('vlan0.0', '9999999999999999vlan000000 000000'), ('ge-0/0/1', '0000000099999999ge-000001 '), ('ge-0/0/1.0', '0000000099999999ge-000001 000000'), ('ge-0/0/1.2', '0000000099999999ge-000001 000002'), ('ge-1/2/3.4', '0001000299999999xe-000003 000004'), ```` In particular, this tests the two sub-interface types (hyphen-separated and not hyphen-separated) for Junos and sticks with the formula used previously in the Cisco-style interfaces for increasing digits. I'd also replace the not equal verification to be (which I realized was incorrect): ```` self.assertNotEqual(naturalize_interface('ge-0/0/1', max_length=100), naturalize_interface('ge-0/0/1.0', max_length=100)) ```` Let me know and I'll make the changes, test, and do the pull request.
Author
Owner

@jeremystretch commented on GitHub (Mar 10, 2020):

@gaige Thanks but there's no need. I've made the changes and adapted the tests accordingly. Will be pushed soon; I'm trying to work around an issue with pycodestyle atm.

@jeremystretch commented on GitHub (Mar 10, 2020): @gaige Thanks but there's no need. I've made the changes and adapted the tests accordingly. Will be pushed soon; I'm trying to work around an issue with pycodestyle atm.
Author
Owner

@jeremystretch commented on GitHub (Mar 10, 2020):

Unfortunately this is not going to be such a simple fix. Because PosgtreSQL is using localized ordering (e.g. en_US for me), we cannot assume deterministic ordering of non-alphanumeric characters. For instance, PostgreSQL ignores the period and hyphen characters when ordering _name values.

Worse, this doesn't seem to be something we can fix with a schema migration: The entire database (or at least the table?) would likely need to be rebuilt in order to change the locale, which would not be practical. IMO we have two options:

  1. Identify a "placeholder" character indicating null values which will always be sorted before zero; or
  2. Force collation to plain ASCII ordering on the column when sorting.

Option 2 is possible but we'll need to look more into the implementation details of enforcing the collation for the model's default ordering. IIRC there's a bug when using F() to wrap firelds in Meta.order_by that isn't fixed until Django 3.0, so this might need to wait for NetBox v2.8.

For now, I've reverted my earlier patch, but kept the modified tests in place, which appear to be passing anyway.

@jeremystretch commented on GitHub (Mar 10, 2020): Unfortunately this is not going to be such a simple fix. Because PosgtreSQL is using localized ordering (e.g. `en_US` for me), we cannot assume deterministic ordering of non-alphanumeric characters. For instance, PostgreSQL ignores the period and hyphen characters when ordering `_name` values. Worse, this doesn't seem to be something we can fix with a schema migration: The entire database (or at least the table?) would likely need to be rebuilt in order to change the locale, which would not be practical. IMO we have two options: 1. Identify a "placeholder" character indicating null values which will _always_ be sorted before zero; or 2. Force collation to plain ASCII ordering on the column when sorting. Option 2 is [possible](https://stackoverflow.com/questions/22841132/django-making-a-query-with-custom-collation/36423002#36423002) but we'll need to look more into the implementation details of enforcing the collation for the model's default ordering. IIRC there's a bug when using `F()` to wrap firelds in `Meta.order_by` that isn't fixed until Django 3.0, so this _might_ need to wait for NetBox v2.8. For now, I've reverted my earlier patch, but kept the modified tests in place, which appear to be passing anyway.
Author
Owner

@gaige commented on GitHub (Mar 10, 2020):

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file). However, there was a false-positive in my original conflict test. I fixed it in the version that I did later but haven't pushed (see the comment earlier in this thread).

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API. If you have 2 devices with the same _name the order they are returned can be different based on the specific query to the database. In this case, it causes 2 copies of a single interface and no copies of the second interface to be returned in the interface list retrieval if they happen to sit astride the page boundary. Since the python API doesn't allow you to adjust the page boundary, there's no way to guarantee that these interfaces will be returned (and not duplicated) in the interface list. Had this not shown up, I never would have bothered to create the fix (the ordering in the visuals isn't that important to me). It's made worse by the fact that any Junos-based switch with VLANs on it will have the "XX-NN/MM/OO" and "XX-NN/MM/OO.0" interfaces, and thus potentially trigger the issue.

@gaige commented on GitHub (Mar 10, 2020): Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file). However, there was a false-positive in my original conflict test. I fixed it in the version that I did later but haven't pushed (see the comment earlier in this thread). *However* the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API. If you have 2 devices with the same `_name` the order they are returned can be different based on the specific query to the database. In this case, it causes 2 copies of a single interface and no copies of the second interface to be returned in the interface list retrieval if they happen to sit astride the page boundary. Since the python API doesn't allow you to adjust the page boundary, there's no way to guarantee that these interfaces will be returned (and not duplicated) in the interface list. Had this not shown up, I never would have bothered to create the fix (the ordering in the visuals isn't that important to me). It's made worse by the fact that any Junos-based switch with VLANs on it will have the "XX-NN/MM/OO" and "XX-NN/MM/OO.0" interfaces, and thus potentially trigger the issue.
Author
Owner

@jeremystretch commented on GitHub (Mar 10, 2020):

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file)

I injected a similar condition into the existing test instead. No need for a separate test.

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API

The same ordering logic is used for both.

Since the python API doesn't allow you to adjust the page boundary

What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size.

@jeremystretch commented on GitHub (Mar 10, 2020): > Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file) I [injected a similar condition](https://github.com/netbox-community/netbox/blob/develop/netbox/utilities/tests/test_ordering.py#L35-L36) into the existing test instead. No need for a separate test. > However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API The same ordering logic is used for both. > Since the python API doesn't allow you to adjust the page boundary What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size.
Author
Owner

@gaige commented on GitHub (Mar 10, 2020):

Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file)

I injected a similar condition into the existing test instead. No need for a separate test.

I'm not sure I'm seeing the test condition you're referring to. I did see these:

            ('Gi1', '9999999999999999Gi000001000000000000'),
            ('Gi1.0', '9999999999999999Gi000001000000000000'),
            ('Gi1.1', '9999999999999999Gi000001000000000001'),
            ('Gi1:0', '9999999999999999Gi000001000000000000'),
            ('Gi1:0.0', '9999999999999999Gi000001000000000000'),

But, they show that the .0 interfaces and the non-.0 interfaces have the same normalized value.

My separate test was designed to verify that the two items with different interface names created different normalized values. I'm not sure how that's represented here.

However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API

The same ordering logic is used for both.

Right, but as a user, you only see one page at a time. If you have 2 pages of interfaces, you probably: a) wouldn't notice the missing interface; or b) would search for it using the search box and you would find it.

However, as an API user, when I need to retrieve the list of all interfaces on an individual device (we've got an auditing program that validates the interfaces and bridge interfaces against the devices via SNMP), when I request items 0-49 and 50-99, I expect to get 100 different items representing the entire data set. And, if I set the limit in the API query to 0 or some sufficiently large number (using Paw or the Swagger tools), I get a good list. However, if I page the list (which is enforced at 50 in the pynetbox package), then I get 2 of one and 0 of the other.

Since the python API doesn't allow you to adjust the page boundary

What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size.

The Python library for accessing the API (pynetbox). Sorry if that wasn't clear. I could patch that to allow retrieving of all of the items, but that seemed the wrong approach given that the API was returning errant information.

Alternatively, the API bug could be fixed by returning the items in a "less pretty" order, such as by the id or the name, but right now, with the _name field being not unique, the database won't necessarily return the same data every time. Just an alternate thought. It would also fix any problems that we don't know about in the algorithm.

@gaige commented on GitHub (Mar 10, 2020): > > Yes, the reason the modified tests are working is that you didn't take the conflict test with (it was in a separate test function in the same test file) > > I [injected a similar condition](https://github.com/netbox-community/netbox/blob/develop/netbox/utilities/tests/test_ordering.py#L35-L36) into the existing test instead. No need for a separate test. I'm not sure I'm seeing the test condition you're referring to. I did see these: ```` ('Gi1', '9999999999999999Gi000001000000000000'), ('Gi1.0', '9999999999999999Gi000001000000000000'), ('Gi1.1', '9999999999999999Gi000001000000000001'), ('Gi1:0', '9999999999999999Gi000001000000000000'), ('Gi1:0.0', '9999999999999999Gi000001000000000000'), ```` But, they show that the `.0` interfaces and the non-`.0` interfaces have the same normalized value. My separate test was designed to verify that the two items with different interface names created different normalized values. I'm not sure how that's represented here. > > However the bigger problem is not the ordering for the UI, but the bug that I mentioned in the API > > The same ordering logic is used for both. Right, but as a user, you only see one page at a time. If you have 2 pages of interfaces, you probably: a) wouldn't notice the missing interface; or b) would search for it using the search box and you would find it. However, as an API user, when I need to retrieve the list of all interfaces on an individual device (we've got an auditing program that validates the interfaces and bridge interfaces against the devices via SNMP), when I request items 0-49 and 50-99, I expect to get 100 different items representing the entire data set. And, if I set the `limit` in the API query to 0 or some sufficiently large number (using Paw or the Swagger tools), I get a good list. However, if I page the list (which is enforced at 50 in the `pynetbox` package), then I get 2 of one and 0 of the other. > > Since the python API doesn't allow you to adjust the page boundary > > What are you talking about, the REST API or the Django ORM? Both allow you to adjust the page size. The Python library for accessing the API (pynetbox). Sorry if that wasn't clear. I could patch that to allow retrieving of all of the items, but that seemed the wrong approach given that the API was returning errant information. Alternatively, the API bug could be fixed by returning the items in a "less pretty" order, such as by the `id` or the `name`, but right now, with the `_name` field being not unique, the database won't necessarily return the same data every time. Just an alternate thought. It would also fix any problems that we don't know about in the algorithm.
Author
Owner

@jeremystretch commented on GitHub (Mar 10, 2020):

But, they show that the .0 interfaces and the non-.0 interfaces have the same normalized value.

Yes, because as noted we're still using the original zero-based approach until we come up with a solution. However, the test also evaluates the actual ordering of the two interfaces.

I will come back to this issue when I have more time to dedicate to it.

@jeremystretch commented on GitHub (Mar 10, 2020): > But, they show that the .0 interfaces and the non-.0 interfaces have the same normalized value. Yes, because as noted we're still using the original zero-based approach until we come up with a solution. However, the test also evaluates the actual ordering of the two interfaces. I will come back to this issue when I have more time to dedicate to it.
Author
Owner

@gaige commented on GitHub (Mar 10, 2020):

I will come back to this issue when I have more time to dedicate to it.

I respect that. You need to prioritize items as you must. In my case, this bug in the API (duplicate and missing items) cost me quite a bit of time in diagnosing other issues, so we'll have to keep a separate branch until this gets addressed in some fashion that makes the data returned by the API correct.

Thanks,
-Gaige

@gaige commented on GitHub (Mar 10, 2020): > > I will come back to this issue when I have more time to dedicate to it. > I respect that. You need to prioritize items as you must. In my case, this bug in the API (duplicate and missing items) cost me quite a bit of time in diagnosing other issues, so we'll have to keep a separate branch until this gets addressed in some fashion that makes the data returned by the API correct. Thanks, -Gaige
Author
Owner

@hSaria commented on GitHub (Mar 11, 2020):

Can we just add the PK as the last ordering field to act as a tie-breaker? Sure, it might not order these collisions as you'd want, but at least it wouldn't break pagination by introducing a duplicate record and omitting the other when the collision is at a page break.

I imagine changes to natural ordering will only reduce the likelihood of non-deterministic ordering; some combination of input will eventually raise the same issue as this one.

@hSaria commented on GitHub (Mar 11, 2020): Can we just add the PK as the last ordering field to act as a tie-breaker? Sure, it might not order these collisions as you'd want, but at least it wouldn't break pagination by introducing a duplicate record and omitting the other when the collision is at a page break. I imagine changes to natural ordering will only reduce the likelihood of non-deterministic ordering; some combination of input will eventually raise the same issue as this one.
Author
Owner

@gaige commented on GitHub (Mar 11, 2020):

Good thought! It also protects against future potential collisions. Although it'd be nice for the display to show the "shorter" interfaces first, my primary concern is the API contract violation, which this would definitely solve.

@gaige commented on GitHub (Mar 11, 2020): Good thought! It also protects against future potential collisions. Although it'd be nice for the display to show the "shorter" interfaces first, my primary concern is the API contract violation, which this would definitely solve.
Author
Owner

@jeremystretch commented on GitHub (Mar 11, 2020):

Can we just add the PK as the last ordering field to act as a tie-breaker?

No, we need to find a complete solution to enforce proper ordering.

@jeremystretch commented on GitHub (Mar 11, 2020): > Can we just add the PK as the last ordering field to act as a tie-breaker? No, we need to find a complete solution to enforce proper ordering.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/netbox#3453