[PR #2186] [MERGED] Fuzzy matching continued #3668

adam commented

2026-04-25 00:16:35 +02:00

📋 Pull Request Information

Original PR: https://github.com/advplyr/audiobookshelf/pull/2186
Author: @mikiher
Created: 10/5/2023
Status: ✅ Merged
Merged: 10/8/2023
Merged by: @advplyr

Base: master ← Head: Fuzzy-Matching-Continued

📝 Commits (10+)

1d3ad38 [cleanup] refactor OpenLib sort into getOpenLibResult
46b0b3a [cleanup] Refactor candidates logic to separate class
5d7c197 [fix] Add back toLowerCase to cleanAuthor/Title (required by other uses)
10f5bc8 [cleanup] Make original title/author check with more readable
752bfff [enhamcement] Only add title candidate before and after all transforms
8979586 [enhancement] Improve candidate sorting
9eff471 [enhancement] AuthorCandidates, author validation
b2acdad [enhancement] Added a couple title transformers
f3555a1 [enhancement] Handle initials in author normalization
bf9f389 [enhancement] Treat underscores as title part separators

📊 Changes

1 file changed (+173 additions, -73 deletions)

📝 server/finders/BookFinder.js (+173 -73)

📄 Description

This is a continuation of Fuzzy Matching V1 (https://github.com/advplyr/audiobookshelf/pull/2099).
This includes some cleanups and refactoring, a few improvements, and one major enhancement.

Cleanups, refactoring, and small fixes:

Refactor title candidates logic (addition, variants, sorting) into a separate class (TitleCandidates) (https://github.com/advplyr/audiobookshelf/commit/46b0b3a6efb7f31ac7d67ee5fff6dcbd2ff28542)
(minor) Rewrite the logic that makes sure we don't run the original Title/Author search twice, to make it more readable (https://github.com/advplyr/audiobookshelf/commit/10f5bc8cbeeacd3c47f7115f387dd7d5817982e7)
(minor) Refactor OpenLib-specific sorting into getOpenLibResult (https://github.com/advplyr/audiobookshelf/commit/1d3ad38187708ca0c6efefce2d04b82820f19522)
(minor) Add back lower-casing in the cleanTitle/Author methods, since they are called from other places in the code (https://github.com/advplyr/audiobookshelf/commit/5d7c197c893d10277f59c753e2d324837185a78f)

Enhancements & Improvements:

(major) Move Author candidates logic into its own class (AuthorCandidates), and introduce author extraction and validation (from both author and title parts) using parallel requests to Audnexus. This helps when the author field includes additional data, or when the author is hidden in one of the title parts. Fuzzy search logic now has an outer loop that goes over author candidates (including an empty author at the end), and an inner loop that goes over title candidates (https://github.com/advplyr/audiobookshelf/commit/9eff471afaa87572bfcb312af64d756511fde2a3)
Handle initials in author normalization (separate initials, and remove middle initials, as they sometimes mismatch with providers) (https://github.com/advplyr/audiobookshelf/commit/f3555a12ceff25d328b7dd1637668874e181946e)
Added/fixed a couple of title transformer regular expressions. (https://github.com/advplyr/audiobookshelf/commit/b2acdadcea6fa52636d816166beac24cb370e127)
Improved title candidate sorting (preferring transformed title parts over original ones, and title parts in their order of appearance) (https://github.com/advplyr/audiobookshelf/commit/8979586404a1ca4a46b0eff3d1cc23582ffbfbb5)
Add just one title variant after all transformers have been applied, rather than one after each transformer (https://github.com/advplyr/audiobookshelf/commit/752bfffb1109e8fadf87775ecacf588365608b03)
Treat underscores as title part separators (improves some corner cases) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/bf9f3895db17f2172cda4e32caab559eda9c05a1)
Reduce spurious Audnexus author matching (by reducing the max Levenshtein distance, and looking only at the top 10 results) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/b0b7a0a61817671b15e2687a32399aea6f0bdb51)
If no authors have been validated, use an aggressively cleaned version of the author field (in many cases, it is better than nothing) (https://github.com/advplyr/audiobookshelf/pull/2186/commits/f44b7ed1d0f8ba538e194632f98660893d9206a6)
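
To make the loop structure described above concrete, here is a minimal sketch of an outer loop over author candidates with an inner loop over title candidates. It is not the code from BookFinder.js: the function name `fuzzySearch`, the injected `search` callback, and the choice to stop at the first non-empty result are all assumptions made for illustration.

```javascript
// Sketch only. `search(title, author)` is a hypothetical provider lookup that
// resolves to an array of results; it is NOT the real BookFinder API.
async function fuzzySearch(titleCandidates, authorCandidates, search) {
  // Try the given author candidates first, then an empty author at the end.
  const authors = [...authorCandidates, '']
  let attempts = 0
  for (const author of authors) {
    // Outer loop: author candidates.
    for (const title of titleCandidates) {
      // Inner loop: title candidates, assumed to be sorted already.
      attempts++
      const results = await search(title, author)
      // Assumption: the first non-empty result set wins.
      if (results.length > 0) return { results, title, author, attempts }
    }
  }
  return { results: [], title: null, author: null, attempts }
}
```

With two title candidates and two author candidates where only the last title/author pair matches, this sketch makes four search calls before returning. Because the empty author comes last, a title-only search is attempted only after every author candidate has failed.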

The code is now more robust, and handles various hard corner cases it didn't handle before.
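
Two of the corner cases listed above, initials in author names and underscores in titles, can be sketched as follows. The function names and regular expressions are illustrative assumptions, not the expressions used in the PR; in particular, splitting on a spaced hyphen in addition to underscores is a guess at what other separators might be in play.

```javascript
// Sketch only: hypothetical helpers, not the PR's actual regular expressions.

// Lower-case an author name, separate glued initials, and drop middle initials.
function normalizeAuthor(author) {
  let s = author.trim().toLowerCase().replace(/\s+/g, ' ')
  // Separate initials: "george r.r. martin" -> "george r. r. martin"
  s = s.replace(/([a-z])\.(?=[a-z])/g, '$1. ')
  // Remove middle initials that sit between two longer name parts:
  // "george r. r. martin" -> "george martin"
  s = s.replace(/(?<=[a-z]{2})( [a-z]\.?)+(?= [a-z]{2})/g, '')
  return s
}

// Split a raw title into parts, treating underscores as separators.
// Assumption: a hyphen surrounded by spaces is also a separator.
function splitTitleParts(title) {
  return title
    .split(/_+|\s+-\s+/)
    .map((part) => part.trim())
    .filter(Boolean)
}
```

In this sketch, leading initials are kept (a name such as "J.R.R. Tolkien" only has its initials separated), because the removal step requires a name part of at least two letters on both sides.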

It fixes one case in the previous eval set, and keeps a 98% found rate and 96% found@1 on a new eval set of 50 title/author pairs.
In the new eval set, I also measured the average number of fuzzy searches: 1.18. (Note that the eval sets are picked from an unstructured torrents folder, where the initial search with the original title and author almost always fails, so 1.18 means that in most cases only one fuzzy search request is needed.)
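
For readers who want to reproduce numbers like these, the sketch below shows one way the three metrics could be computed. The record shape (`rank`, `fuzzySearches`) and the metric definitions are assumptions: "found" is taken to mean the expected book appears anywhere in the results, and "found@1" that it is the first result. The sample data is synthetic, chosen only so that the totals agree with the figures reported above (49 of 50 found, 48 of 50 at rank 1, 59 fuzzy searches in total).

```javascript
// Sketch only: the record shape and metric definitions are assumptions.
// rank: 1-based position of the expected book in the results, or null if missing.
// fuzzySearches: number of fuzzy search requests issued for that pair.
function summarize(evalResults) {
  const n = evalResults.length
  const found = evalResults.filter((r) => r.rank !== null).length
  const foundAt1 = evalResults.filter((r) => r.rank === 1).length
  const totalSearches = evalResults.reduce((sum, r) => sum + r.fuzzySearches, 0)
  return {
    foundRate: found / n,
    foundAt1Rate: foundAt1 / n,
    avgFuzzySearches: totalSearches / n
  }
}

// Synthetic sample (NOT the real eval set): 48 hits at rank 1, one hit at a
// lower rank, one miss; 41 pairs needing one fuzzy search and 9 needing two.
const sample = Array.from({ length: 50 }, (_, i) => ({
  rank: i < 48 ? 1 : i === 48 ? 3 : null,
  fuzzySearches: i < 41 ? 1 : 2
}))

console.log(summarize(sample))
```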

The additional parallel author validation requests to Audnexus (usually 2 to 4) seem to run very quickly, and most of the network time seems to be spent in the search provider requests.
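
The parallel validation step might be sketched like this. Everything here is hypothetical: `lookupAuthors` stands in for whatever request is made to Audnexus (its real API is not shown in this PR description), and the default `maxDistance` of 2 is a placeholder, since the description says only that the maximum Levenshtein distance was reduced, not what it was set to. The top-10 cutoff does come from the description.

```javascript
// Sketch only. Classic single-row Levenshtein edit distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0] // value of the cell up and to the left
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost)
      diag = above
    }
  }
  return row[b.length]
}

// `lookupAuthors(name)` is a hypothetical async function resolving to an array
// of author name strings. `maxDistance` is an assumed value.
async function validateAuthors(candidates, lookupAuthors, maxDistance = 2, topN = 10) {
  // Fire all lookups at once so total latency is roughly that of the slowest one.
  const checks = candidates.map(async (candidate) => {
    const names = (await lookupAuthors(candidate)).slice(0, topN)
    const ok = names.some(
      (name) => levenshtein(candidate.toLowerCase(), name.toLowerCase()) <= maxDistance
    )
    return ok ? candidate : null
  })
  const settled = await Promise.all(checks)
  return settled.filter((c) => c !== null)
}
```

Note that `Promise.all` rejects as soon as any lookup rejects; a caller that prefers to ignore individual failures could swap in `Promise.allSettled` and keep only the fulfilled entries.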

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


adam added the pull-request label 2026-04-25 00:16:35 +02:00

adam closed this issue

2026-04-25 00:16:35 +02:00

adam referenced this issue

2026-04-25 00:18:07 +02:00

[PR #3670] [MERGED] Fix:Remove authors with no books when a books is removed #3668 #4048

Reference: starred/audiobookshelf#3668