[PR #2186] [MERGED] Fuzzy matching continued #3668

Closed
opened 2026-04-25 00:16:35 +02:00 by adam · 0 comments

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

## 📋 Pull Request Information

**Original PR:** https://github.com/advplyr/audiobookshelf/pull/2186
**Author:** [@mikiher](https://github.com/mikiher)
**Created:** 10/5/2023
**Status:** ✅ Merged
**Merged:** 10/8/2023
**Merged by:** [@advplyr](https://github.com/advplyr)

**Base:** `master` ← **Head:** `Fuzzy-Matching-Continued`

---

### 📝 Commits (10+)

- [`1d3ad38`](https://github.com/advplyr/audiobookshelf/commit/1d3ad38187708ca0c6efefce2d04b82820f19522) [cleanup] refactor OpenLib sort into getOpenLibResult
- [`46b0b3a`](https://github.com/advplyr/audiobookshelf/commit/46b0b3a6efb7f31ac7d67ee5fff6dcbd2ff28542) [cleanup] Refactor candidates logic to separate class
- [`5d7c197`](https://github.com/advplyr/audiobookshelf/commit/5d7c197c893d10277f59c753e2d324837185a78f) [fix] Add back toLowerCase to cleanAuthor/Title (required by other uses)
- [`10f5bc8`](https://github.com/advplyr/audiobookshelf/commit/10f5bc8cbeeacd3c47f7115f387dd7d5817982e7) [cleanup] Make original title/author check with more readable
- [`752bfff`](https://github.com/advplyr/audiobookshelf/commit/752bfffb1109e8fadf87775ecacf588365608b03) [enhamcement] Only add title candidate before and after all transforms
- [`8979586`](https://github.com/advplyr/audiobookshelf/commit/8979586404a1ca4a46b0eff3d1cc23582ffbfbb5) [enhancement] Improve candidate sorting
- [`9eff471`](https://github.com/advplyr/audiobookshelf/commit/9eff471afaa87572bfcb312af64d756511fde2a3) [enhancement] AuthorCandidates, author validation
- [`b2acdad`](https://github.com/advplyr/audiobookshelf/commit/b2acdadcea6fa52636d816166beac24cb370e127) [enhancement] Added a couple title transformers
- [`f3555a1`](https://github.com/advplyr/audiobookshelf/commit/f3555a12ceff25d328b7dd1637668874e181946e) [enhancement] Handle initials in author normalization
- [`bf9f389`](https://github.com/advplyr/audiobookshelf/commit/bf9f3895db17f2172cda4e32caab559eda9c05a1) [enhancement] Treat underscores as title part separators

### 📊 Changes

**1 file changed** (+173 additions, -73 deletions)

<details>
<summary>View changed files</summary>

📝 `server/finders/BookFinder.js` (+173 -73)

</details>

### 📄 Description

This is a continuation of [Fuzzy Matching V1](https://github.com/advplyr/audiobookshelf/pull/2099).
This includes some cleanups and refactoring, a few improvements, and one major enhancement.

Cleanups, refactoring, and small fixes:

- Refactor title candidates logic (addition, variants, sorting) into a separate class (TitleCandidates) (https://github.com/advplyr/audiobookshelf/commit/46b0b3a6efb7f31ac7d67ee5fff6dcbd2ff28542)
- (minor) Rewrite the logic that makes sure we don't run the original Title/Author search twice, to make it more readable (https://github.com/advplyr/audiobookshelf/commit/10f5bc8cbeeacd3c47f7115f387dd7d5817982e7)
- (minor) Refactor OpenLib-specific sorting into getOpenLibResult (https://github.com/advplyr/audiobookshelf/commit/1d3ad38187708ca0c6efefce2d04b82820f19522)
- (minor) Add back lower-casing in the cleanTitle/Author methods, since they are called from other places in the code (https://github.com/advplyr/audiobookshelf/commit/5d7c197c893d10277f59c753e2d324837185a78f)

Enhancements & Improvements:

- (major) Move Author candidates logic into its own class (AuthorCandidates), and introduce author extraction and validation (from both author and title parts) using parallel requests to Audnexus. This helps when the author field includes additional data, or when the author hides in one of the title parts.
  Fuzzy search logic now has an external loop that goes over author candidates (including an empty author at the end), and an internal loop that goes over title candidates (https://github.com/advplyr/audiobookshelf/commit/9eff471afaa87572bfcb312af64d756511fde2a3)
- Handle initials in author normalization (separate initials, and remove middle initials, as they sometimes mismatch with providers) (https://github.com/advplyr/audiobookshelf/commit/f3555a12ceff25d328b7dd1637668874e181946e)
- Added/fixed a couple of title transformer regular expressions. (https://github.com/advplyr/audiobookshelf/commit/b2acdadcea6fa52636d816166beac24cb370e127)
- Improved title candidate sorting (preferring transformed title parts over original ones, and title parts in their order of appearance) (https://github.com/advplyr/audiobookshelf/commit/8979586404a1ca4a46b0eff3d1cc23582ffbfbb5)
- Add just one title variant after all transformers have been applied, and not after each transformer (https://github.com/advplyr/audiobookshelf/commit/752bfffb1109e8fadf87775ecacf588365608b03)
- Treat underscores as title part separators (improves some corner cases) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/bf9f3895db17f2172cda4e32caab559eda9c05a1)
- Reduce spurious Audnexus author matching (by reducing the max Levenshtein distance, and looking only at the top 10 results) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/b0b7a0a61817671b15e2687a32399aea6f0bdb51)
- If no authors have been validated, use an aggressively cleaned version of the author field (in many cases, it is better than nothing) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/f44b7ed1d0f8ba538e194632f98660893d9206a6)

The code is now more robust, and handles various hard corner cases it didn't handle before.

It fixes one case in the previous eval set, and keeps a 98% found rate and 96% found@1 in a new 50 title/author pairs eval set.
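
The two-level candidate search mentioned in the major enhancement (an external loop over author candidates with the empty author tried last, and an internal loop over title candidates) might look roughly like the sketch below. This is not the code from `server/finders/BookFinder.js`: the name `fuzzySearch`, its parameters, the synchronous `searchProvider` stub, and the rule of stopping at the first pair that returns any result are all assumptions made for illustration. The real provider requests are network calls and would be asynchronous.

```javascript
// Hypothetical sketch, not the actual BookFinder.js code: the function name,
// parameters, and stop-at-first-hit rule are assumptions for illustration.
function fuzzySearch(authorCandidates, titleCandidates, searchProvider) {
  // Try every non-empty author candidate first, then the empty author last.
  const authors = [...authorCandidates.filter((a) => a !== ''), '']
  let numSearches = 0

  for (const author of authors) {
    // Title candidates are assumed to be already sorted by preference.
    for (const title of titleCandidates) {
      numSearches++
      const results = searchProvider(title, author)
      // Assumed stopping rule: the first pair that returns anything wins.
      if (results.length > 0) {
        return { results, title, author, numSearches }
      }
    }
  }
  return { results: [], title: null, author: null, numSearches }
}
```

Returning `numSearches` alongside the results makes it easy to track how many fuzzy search requests a lookup needed, which is the quantity averaged in the eval figures.
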
In the new eval set, I also measured the average number of fuzzy searches: 1.18. (Note that the eval sets are picked from an unstructured torrents folder, where the initial search with the original title and author fields almost always fails. 1.18 means that in most cases, only one fuzzy search request is needed.)

The additional parallel author validation requests to Audnexus (usually between 2 and 4) seem to run very quickly, and most of the network time seems to be spent in the search provider requests.
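
For readers who want to reproduce figures of this kind, the three quoted numbers (found rate, found@1, average fuzzy searches) could be computed from per-case eval records along these lines. The record shape and the name `summarizeEval` are hypothetical, and the metric definitions are assumptions: "found" is taken to mean the correct book appears anywhere in the results, and "found@1" that it is the first result.

```javascript
// Hypothetical eval record shape: { rank: number | null, numFuzzySearches: number }
// rank is the 1-based position of the correct book in the results, or null if absent.
function summarizeEval(cases) {
  const total = cases.length
  if (total === 0) throw new Error('empty eval set')

  const found = cases.filter((c) => c.rank !== null).length
  const foundAt1 = cases.filter((c) => c.rank === 1).length
  const searches = cases.reduce((sum, c) => sum + c.numFuzzySearches, 0)

  return {
    foundRate: found / total,
    foundAt1Rate: foundAt1 / total,
    avgFuzzySearches: searches / total
  }
}
```

Under these definitions, and assuming the reported figures are exact rather than rounded, a 50-pair set with a 98% found rate and 96% found@1 would correspond to 49 and 48 cases respectively, and an average of 1.18 to 59 fuzzy searches in total.
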
adam added the pull-request label 2026-04-25 00:16:35 +02:00
adam closed this issue 2026-04-25 00:16:35 +02:00

Reference: starred/audiobookshelf#3668