mirror of
https://github.com/advplyr/audiobookshelf.git
synced 2026-05-30 23:40:40 +02:00
[Enhancement]: Diacritic-insensitive search #1771
Closed
opened 2026-04-24 23:57:42 +02:00 by adam
·
47 comments
Originally created by @tululum on GitHub (Feb 27, 2024).
Describe the issue
A lot of non-English languages have accented letters/diacritics (https://en.wikipedia.org/wiki/Diacritic). It is quite common to not include the accent when writing text online, especially on mobile phones (which make it difficult and slow to write accented letters). Therefore, it is common for search engines to drop the accent before performing the search (pretty much every single search engine does this). Unfortunately, this is not the case for ABS.
For example all the following searches should find the book "Černá lilie", but they don't:





Also, note that the last search example, černá, is the same as Černá except for the case of the first letter. That also proves that the search is not case-insensitive if accented letters are present.
I never wrote a line of code in nodejs, so unfortunately I don't feel confident to propose a fix myself. But I tested that this nodejs code can be used to drop all accents in my language (Czech), so perhaps that could be used somehow to fix this issue.
Probably a better way to fix this would be directly in the SQLite query, maybe something like this could work. However, there could be some issues with that too (https://www.sqlite.org/faq.html#q18).
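The nodejs snippet linked above is not reproduced in this mirror. As a rough sketch of what such accent stripping could look like, the following uses Unicode NFD normalization and removes the combining marks in the U+0300 to U+036F range. The function names removeDiacritics and looseIncludes are made up for illustration and are not part of ABS.

```javascript
// Hypothetical helper, not the snippet linked in the original issue.
function removeDiacritics(text) {
  // NFD splits "Č" into "C" + U+030C and "á" into "a" + U+0301;
  // the regex then drops those combining marks.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Case- and accent-insensitive substring match done in application code.
function looseIncludes(haystack, needle) {
  const fold = (s) => removeDiacritics(s).toLowerCase();
  return fold(haystack).includes(fold(needle));
}

const title = 'Černá lilie';
for (const query of ['cerna', 'černa', 'cerná', 'černá', 'Cerna']) {
  console.log(query, looseIncludes(title, query));
}
```

This only covers letters that have a canonical decomposition. Letters such as ł or ø do not decompose under NFD, so they would pass through unchanged.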
Steps to reproduce the issue
Černá
cerna, černa, cerná, černá or Cerna.
Audiobookshelf version
2.7.2
How are you running audiobookshelf?
Docker
@nichwall commented on GitHub (Feb 27, 2024):
Related to (currently closed) https://github.com/advplyr/audiobookshelf/issues/2187
@advplyr commented on GitHub (Feb 27, 2024):
In #2187 I mention a sqlite extension we could look at using to support case insensitive queries for non-ASCII
@mikiher commented on GitHub (Jul 22, 2024):
OK, so here's roughly the solution I was thinking of:
remove_diacritics(x), which returns a diacritics-free version of x
t with a column x for which we want to support diacritic-insensitive search, we add a column normalized_x
t: normalized_x = remove_diacritics(x) != x ? remove_diacritics(x) : null
t, we set normalized_x as above (we can use hooks for this).
query, we perform an OR of two conditions (pseudo-code below)
x like %${query}%
normalized_x like %${remove_diacritics(query)}%
x values in t are unchanged by remove_diacritics(), then both the performance and db size overheads are probably reasonable.
@advplyr does this seem like a reasonable approach? I don't think there are many normalized_x columns we'll need to create. This approach will of course work with any kind of normalization which we don't currently support (e.g. lowercasing non-ascii capitals, removing Hebrew and Arabic diacritic marks, etc...)
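The proposal above can be sketched in plain JavaScript without a database. This is only an in-memory illustration: removeDiacritics stands in for remove_diacritics(), asciiLikeContains is a rough emulation of SQLite's default LIKE (which folds case for ASCII letters only) and ignores the % and _ wildcards, and all helper names are assumptions rather than ABS code.

```javascript
// Assumed stand-in for remove_diacritics(): NFD, then strip combining marks.
function removeDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Value to store in normalized_x: null when x contains no diacritics.
function normalizedValue(x) {
  const stripped = removeDiacritics(x);
  return stripped !== x ? stripped : null;
}

// Rough emulation of "value LIKE %q%" with ASCII-only case folding.
function asciiLikeContains(value, q) {
  const foldAscii = (s) => s.replace(/[A-Z]/g, (c) => c.toLowerCase());
  return foldAscii(value).includes(foldAscii(q));
}

// x like %query% OR normalized_x like %remove_diacritics(query)%
function matches(row, query) {
  if (asciiLikeContains(row.x, query)) return true;
  return (
    row.normalized_x !== null &&
    asciiLikeContains(row.normalized_x, removeDiacritics(query))
  );
}

const rows = ['Černá lilie', 'Dune'].map((x) => ({
  x,
  normalized_x: normalizedValue(x),
}));
console.log(rows);
console.log(rows.filter((row) => matches(row, 'cerna')).map((row) => row.x));
```

Rows without diacritics keep normalized_x as null, which is where the expected saving in database size comes from.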
@advplyr commented on GitHub (Jul 22, 2024):
I think this approach would be difficult to scale if we wanted to expand what is currently searchable. Currently we would need a normalized_x for:
Books: author, narrator, series, title, subtitle, tags, genres.
Podcasts: title, author, tags (should add genres and episode title)
Genres and tags are currently JSON arrays on each book, so normalized_x would need to be a JSON array for those.
If we wanted to expand to allow global search to search on anything else (like playlists, collections, descriptions, etc) we would have to write a script that backfills the new normalized_x column.
I think we should exhaust the possibility of doing this with sqlite first. If we have to do it this way we might be able to get away with only 1 column per table with a normalizedSearchKeys column or something like that.
Using a hook we normalize all the searchable values to put in a single column. Scaling that would still require backfilling but easier at least.
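A minimal sketch of the single-column idea, assuming a hook that receives the searchable values of a record. Only the column name normalizedSearchKeys comes from the comment above. The function name, the newline separator, and the sample record are invented for illustration.

```javascript
// Assumed normalization: NFD, strip combining marks, then lowercase.
function normalizeForSearch(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Hypothetical hook body: fold every searchable value (plain strings and
// arrays such as tags or genres) into one newline-separated string.
function buildNormalizedSearchKeys(values) {
  return values
    .flat()
    .filter((v) => typeof v === 'string' && v.length > 0)
    .map(normalizeForSearch)
    .join('\n');
}

// Made-up sample record.
const book = { title: 'Černá lilie', subtitle: null, tags: ['Román', 'Česky'] };
const normalizedSearchKeys = buildNormalizedSearchKeys([
  book.title,
  book.subtitle,
  book.tags,
]);
console.log(JSON.stringify(normalizedSearchKeys));
```

Joining with a newline keeps a query from accidentally matching across two adjacent values, as long as queries themselves never contain a newline.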
@mikiher commented on GitHub (Jul 23, 2024):
According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like:
I'm worried about the performance implications of this approach, but we can try this.
@mikiher commented on GitHub (Jul 23, 2024):
Unfortunately, sqlite3 on node.js does not seem to support UDFs :(
@advplyr commented on GitHub (Jul 23, 2024):
It looks like the only option is to build the sqlite3 binaries ourselves.
https://github.com/TryGhost/node-sqlite3/issues/70#issuecomment-25844364
https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md
@mikiher commented on GitHub (Jul 24, 2024):
OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db, so the unaccent and other functions there became available to use in queries (it was quite hard to figure out how to do this, because the prescribed way of accessing it, hooking to afterConnect, doesn't work for sqlite due to some bug in Sequelize).
So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB)
Now I'm progressing with modifying all the relevant queries. Stay tuned...
@advplyr commented on GitHub (Aug 10, 2024):
Reverted in v2.12.3
@devnoname120 commented on GitHub (Aug 10, 2024):
I wonder at this point if it wouldn't be easier to just to do. I don't have all the context but it seems to me that sqlite3 officially supports icu so it looks safer to me to rely on this one.
Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:
What do you think?
Edit: It looks like building ICU degrades performances in general:
https://sqlite.org/forum/forumpost/524c146fbf
@mikiher commented on GitHub (Aug 10, 2024):
I haven't tried, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just unicode aware case functions (lower, upper). We need both.
Yeah. Amusingly enough, it was deprecated a couple of weeks ago, right after I integrated it in my PR...
The problem is that the suggested replacement extension (text) does not seem to support unaccent() yet. I will be able to move from unicode to text when it supports that.
@advplyr commented on GitHub (Aug 10, 2024):
I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that https://github.com/advplyr/audiobookshelf/issues/2187
@devnoname120 commented on GitHub (Aug 11, 2024):
I'm not an expert in any way regarding these topics but apparently MariaDB has out-of-the-box support for the utf8mb4 charset with the modern uca1400_ai_ci collation (unicode collate algorithm 14.0.0, accent insensitive, case insensitive, see the naming convention). See also the documentation of supported character sets.
About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...
I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?
Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.
@mikiher commented on GitHub (Aug 11, 2024):
Replacing the underlying db for supporting unaccent seems like a bit of an overkill to me.
Yes, for the record (after some digging) it looks like using the SQLite ICU + ICU extension can work as well (at least in theory).
When you have that extension loaded, you can run:
SELECT icu_load_collation('root', 'aici', 'PRIMARY');
This creates a custom collation called aici which works on root (i.e. applies to all locales) and uses UCOL_PRIMARY collation strength (which means case and accent insensitive).
You can then (theoretically) use this collation for comparisons and indexing, and an unaccent() function is not required.
This does seem to require both building our own SQLite icu-enabled version, and building the icu extension, for all supported platforms and architectures, plus deploying them.
I am quite reluctant to go that way.
I haven't looked into this in detail, but yes, it does seem to provide unaccent.
However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed.
@mikiher commented on GitHub (Aug 14, 2024):
Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it.
So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself.
This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future.
@advplyr commented on GitHub (Aug 14, 2024):
I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this.
@mikiher commented on GitHub (Aug 14, 2024):
Are there any open source ones where I can look at the code?
@advplyr commented on GitHub (Aug 14, 2024):
Jellyfin is open source and uses sqlite. They use 2 db files and I just looked at the one named library.db; in the ItemValues table they store "Value" and "CleanValue", which is unaccented.
It looks like that table stores everything that would be searchable. My movies and tv series I scanned in created about 125k rows in that table.
@advplyr commented on GitHub (Aug 14, 2024):
So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table.
@mikiher commented on GitHub (Aug 14, 2024):
OK, I'll look into this direction as well.
I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic.
If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there's breaking changes in SQLite itself), barring bug fixes.
@advplyr commented on GitHub (Aug 14, 2024):
Thanks, I agree. It would be much better if we can handle this with sqlite
@advplyr commented on GitHub (Aug 14, 2024):
I looked at the plex sqlite db and they are doing something different. I can't open the icu tables in DBeaver.
Quick search led me to https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3
@mikiher commented on GitHub (Aug 14, 2024):
The discussion indicates that Plex uses the fts4 extension for implementing full text search.
I think FTS is likely an overkill for our purposes, though it may include the functionality we need. Do you think Audiobookshelf requires full-text-search capabilities?
@advplyr commented on GitHub (Aug 14, 2024):
I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later.
@devnoname120 commented on GitHub (Aug 14, 2024):
As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner, see this excerpt from ext/icu/README.txt:
For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments).
Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken in account. Reading the official Unicode Collation Algorithm (UCA) is enough to convince oneself of that.
Although it handles ordering rather than just equality, it's 66 pages long and it keeps evolving. This official algorithm also requires the Default Unicode Collation Element Table (DUCET), which is even bigger.
IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group, which is what ext/icu/icu.c relies on under the hood. Maintaining a custom solution may prove a lot of maintenance down the road in order to account for all the diacritic-related search issues that users may report.
What do you think?
@mikiher commented on GitHub (Aug 15, 2024):
Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work.
I'm guessing there are good reasons why people are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine.
I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that.
@devnoname120 commented on GitHub (Aug 17, 2024):
@mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default).
I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).
I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?
Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.
For example our Docker containers currently rely on alpine v3.20 (base image of alpine-node20) whose apk package manager provides an icu-libs package that supports all architectures: https://pkgs.alpinelinux.org/packages?name=icu-libs&branch=v3.20&repo=main&arch=&maintainer=
For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).
Note that icu.c doesn't have any dependencies so compiling it is very low-risk in terms of flakiness. As long as there is a bare C compiler/linker around it should compile without issues.
@mikiher commented on GitHub (Aug 18, 2024):
At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension.
Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those.
Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that).
I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation.
In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on github workflows or a similar infrastructure, and make them available as release assets, or just check in the pre-built extension binaries.
@mikiher commented on GitHub (Aug 19, 2024):
OK, so here's some update on my experiments.
SELECT icu_load_collation('root', 'aici', 'PRIMARY')
SELECT 'Árbol' = 'arbol' collate aici AS result
SELECT 'Árbol' LIKE '%arb%' collate aici AS result
It turns out that LIKE in SQLite ignores collations (unlike other databases, which seem to support this syntax)
Now since we need LIKE for searching, we're back to square one, at least for this specific solution.
The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet.
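The gap between collated equality and substring matching can be illustrated outside SQLite. The sketch below is an analogy in JavaScript using Intl.Collator, not the SQLite ICU extension: sensitivity 'base' ignores case and accents, in the same spirit as a primary-strength collation. The helper collatedIncludes is hypothetical and deliberately naive; it assumes the accented and unaccented forms have the same length, which does not hold for decomposed text.

```javascript
// sensitivity: 'base' treats letters that differ only in case or accents as equal.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// Whole-string equality works through the collator.
const equal = collator.compare('Árbol', 'arbol') === 0;

// A plain substring search knows nothing about the collator,
// so it fails on the accented text.
const plainSubstring = 'Árbol'.includes('arb');

// Naive workaround: slide a window over the text and compare each slice.
function collatedIncludes(text, query) {
  for (let i = 0; i + query.length <= text.length; i++) {
    if (collator.compare(text.slice(i, i + query.length), query) === 0) {
      return true;
    }
  }
  return false;
}

console.log({ equal, plainSubstring, windowed: collatedIncludes('Árbol', 'arb') });
```

The same distinction is what makes a collation alone insufficient here: search needs substring matching, so either the matching operator has to be collation-aware or the text has to be normalized before matching.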
@devnoname120 commented on GitHub (Aug 20, 2024):
@mikiher Thanks for the tests. With regard to the LIKE, something must be wrong because ext/icu/icu.c explicitly mentions the following (emphasis mine):
See icuLikeFunc() in the same file for the implementation.
Note: I was able to reproduce your issue, and here is what I did for my future reference:
I'm not sure if we are doing this right in order to make LIKE work with the collation.
@mikiher commented on GitHub (Aug 20, 2024):
We are doing it right.
The LIKE implementation is just hard-coded to doing what it's written to do (Unicode aware case insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward.
The SQL syntax is correct, otherwise there would be a syntax error. The COLLATE part is just ignored.
If you find another suitable SQL expression to make the collation work with substring matching, please let me know.
@devnoname120 commented on GitHub (Aug 20, 2024):
@mikiher Ah indeed. I took a look at the current Docker image and it's currently 564 MB. According to your estimations we would add 36 MB which is a 6% increase in image size. Definitely not negligible, but not a deal-breaker IMO especially if it means very little maintenance and all the edge cases and things we haven't thought about nicely working (so we won't have to get back to it and debug possible issues).
I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows it shouldn't be that bad.
As an exercise I made a quick implementation of an ICU-based unaccent() function for SQLite and here are the results:
The code (this is an old version, I'm almost done cleaning it up, using sqlite_malloc() instead of malloc(), etc etc. and it uses unnecessary dynamic allocations):
I could definitely integrate it in icu.c, and if wanted apply unaccent() in the LIKE operator.
@mikiher commented on GitHub (Aug 21, 2024):
This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters).
At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you.
From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general.
I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains.
@advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this.
I think we have some users with tens of thousands of books in their library.
Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested.
@devnoname120 commented on GitHub (Aug 24, 2024):
@mikiher Thank you for your thoughtful comment. After digging into alekseyt/nunicode I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better/performant/resilient than I could do building my own SQLite extension using ICU.
Upsides:
Downsides:
COLLATION, but it has an unaccent() function.
NU1300_NOCASE + unaccent() if we feel the need to later on.
LIKE operator is (surprisingly, at least for me) overridden to perform case-folding aka expansion-insensitive comparisons (see example SELECT 'masse' = 'maße'; below).
Now for the tests:
@mikiher Let me know your thoughts!
@mikiher commented on GitHub (Sep 22, 2024):
Hi, sorry for the long time it took - had many other things on my plate.
Wanted to give an update on my experiments with nunicode.
The first thing I needed to do was to build it. The project as it is currently only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So:
The experiments are at https://github.com/mikiher/nunicode-sqlite
I tested the built binaries:
I ran the same rudimentary tests as you did above with the sqlite3 CLI. The Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: it loads the library successfully and runs some of the test queries successfully, but fails on the following:
I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear.
@devnoname120 commented on GitHub (Sep 22, 2024):
@mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though.
Which version of sqlite3 do you use? Which options are enabled?
As far as I know you don't need gcompat for this one because it doesn't rely on glibc.
I just tested your example and here is what I see on my end:
I did this in order to compile the sqlite extension:
@mikiher commented on GitHub (Sep 22, 2024):
SQLite version 3.45.3 2024-04-15 13:34:05
It's the one installed by apk add sqlite. I haven't enabled or disabled any particular options.
It doesn't? So it doesn't matter which compiler I used for cross compilation?
This is roughly what I did as well. You can see all of the actions I ran in the github workflow
As I said there were a few changes I had to make:
nunicode/cmake
find_package(Sqlite3) directive (which sets SQLITE3_INCLUDE_DIR to /usr/include), and the SQLITE3_FOUND condition. Instead, I copied the sqlite3 headers to nunicode/include, and set SQLITE3_INCLUDE_DIR to that. I had to do this because otherwise the cross-compilers would be trying to use headers from /usr/include when compiling the extension, which is not desired.
I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used aarch64-linux-gnu-gcc. I mentioned gcompat because I know gcc uses glibc by default.
@devnoname120 commented on GitHub (Sep 23, 2024):
@mikiher Hmm it seems that the .so is indeed linked against glibc which is not optimal. I'd suggest trying again to compile it but from Alpine this time around. I cannot dig further into it right now unfortunately; the next two weeks will be quite busy on my side.
Tangentially I think that it would make sense at some point to reach out to @alekseyt directly (aleksey.tulinov@gmail.com). The README of nunicode provides this email address and explicitly encourages people to contact him.
@mikiher commented on GitHub (Sep 23, 2024):
Thanks, I will also try to cross compile for musl and see if that helps.
The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners.
@mikiher commented on GitHub (Sep 24, 2024):
So it turned out all of the issues were due to the crappy QEMU console, that seems to mess up non-ascii input :(
Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away.
I checked all 3 linux aarch64 libraries I built:
They all pass the simple tests above.
I think I'll stick with the original version I built because it's easiest to set-up the cross-compiler to build it.
Next I'm going to work on integrating the extensions in ABS and testing.
@devnoname120 commented on GitHub (Sep 29, 2024):
@mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build and not rely on gcompat or cross-compilation at all for those.
See here for an example of a multi-platform build. Docker makes it surprisingly easy, and you don't have to bother with QEMU by yourself:
https://github.com/code-inflation/cfspeedtest/pull/94/files
And the doc for multi-stage builds in case you are not familiar with them:
https://docs.docker.com/build/building/multi-stage/#use-multi-stage-builds
I can cobble up a PR if it helps!
@mikiher commented on GitHub (Sep 29, 2024):
But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?
In the future we can consider providing Nunicode as an apk (I admit the additional code in the dockerfile is somewhat ugly).
@devnoname120 commented on GitHub (Oct 5, 2024):
As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc.
IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose.
For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image.
@mikiher commented on GitHub (Oct 6, 2024):
I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image.
@devnoname120 commented on GitHub (Oct 7, 2024):
@mikiher Hmm why only the arm64 flavor?
@mikiher commented on GitHub (Oct 7, 2024):
Sorry, was a bit distracted. I meant both arm64 and amd64.
@mikiher commented on GitHub (Oct 7, 2024):
Submitted in PR #3488