[Enhancement]: Diacritic-insensitive search #1771

@mikiher commented on GitHub (Jul 23, 2024):

According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like:

    sequelize.fn('remove_diacritics', sequelize.col('x')),
    {
      [Op.like]: `%${remove_diacritics(query)}%`
    }

I'm worried about the performance implications of this approach, but we can try this.

@mikiher commented on GitHub (Jul 23, 2024): According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like: ``` sequelize.fn('remove_diacritics', sequelize.col('x')), { [Op.like]: `%${remove_diacritics(query)}%` } ``` I'm worried about the performance implications of this approach, but we can try this.

adam commented

@mikiher commented on GitHub (Jul 23, 2024):

Unfortunately, sqlite3 on node.js does not seem to support UDFs :(

@mikiher commented on GitHub (Jul 23, 2024): Unfortunately, sqlite3 on node.js does not seem to support UDFs :(

adam commented

https://github.com/TryGhost/node-sqlite3/issues/70#issuecomment-25844364
https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md

@advplyr commented on GitHub (Jul 23, 2024):

It looks like the only option is to build the sqlite3 binaries ourselves.

@advplyr commented on GitHub (Jul 23, 2024): It looks like the only option is to build the sqlite3 binaries ourselves. https://github.com/TryGhost/node-sqlite3/issues/70#issuecomment-25844364 https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md

adam commented

2026-04-24 23:57:46 +02:00

@mikiher commented on GitHub (Jul 24, 2024):

OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db, so the unaccent and other functions there became available to use in queries (it was quite hard to figure out how to do this, because the prescribed way of accessing it, hooking to afterConnect, doesn't work for sqlite due to some bug in Sequelize).

So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB)

Now I'm progressing with modifying all the relevant queries. Stay tuned...

@mikiher commented on GitHub (Jul 24, 2024): OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db, so the `unaccent` and other functions there became available to use in queries (it was quite hard to figure out how to do this, because the prescribed way of accessing it, hooking to `afterConnect`, doesn't work for sqlite due to some bug in Sequelize). So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB) Now I'm progressing with modifying all the relevant queries. Stay tuned...

adam commented

@advplyr commented on GitHub (Aug 10, 2024):

Reverted in v2.12.3

@advplyr commented on GitHub (Aug 10, 2024): Reverted in [v2.12.3](https://github.com/advplyr/audiobookshelf/releases/tag/v2.12.3)

adam commented

2026-04-24 23:57:46 +02:00

@devnoname120 commented on GitHub (Aug 10, 2024):

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just to do. I don't have all the context but it seems to me that sqlite3 officially supports icu so it looks safer to me to rely on this one.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?

Edit: It looks like building ICU degrades performances in general:

The icu extension doesn't really help, because it replaces the SQL like() function which automatically disables the use of any index for the LIKE operator, regardless of the collation, so this will slow down with large data sets. As it affects every LIKE operator, not just specific ones, it might have unwanted side effects.

https://sqlite.org/forum/forumpost/524c146fbf

@devnoname120 commented on GitHub (Aug 10, 2024): > It looks like the only option is to build the sqlite3 binaries ourselves. I wonder at this point if it wouldn't be easier to just to do. I don't have all the context but it seems to me that sqlite3 [officially supports icu](https://sqlite.org/src/dir/ext/icu) so it looks safer to me to rely on this one. Additionally, the README.md of the unicode extension of sqlean [mentions the following](https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md) which isn't particularly reassuring: > ⛔️ This extension is deprecated. Use text instead. What do you think? ---- **Edit**: It looks like building ICU degrades performances in general: > The icu extension doesn't really help, because it replaces the SQL like() function which automatically disables the use of any index for the LIKE operator, regardless of the collation, so this will slow down with large data sets. As it affects every LIKE operator, not just specific ones, it might have unwanted side effects. https://sqlite.org/forum/forumpost/524c146fbf

adam commented

2026-04-24 23:57:46 +02:00

@mikiher commented on GitHub (Aug 10, 2024):

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just to do. I don't have all the context but it seems to me that sqlite3 officially supports icu so it looks safer to me to rely on this one.

I haven't tried, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just unicode aware case functions (lower, upper). We need both.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?

Yeah. Amusingly enoguh, it was deprecated a couple of weeks ago, right after I integrated it in my PR...
The problem is that the suggested replacement extension (text), does not seem to support unaccent() yet. I will be able to move to from unicode to text when it supports that.

@mikiher commented on GitHub (Aug 10, 2024): > > It looks like the only option is to build the sqlite3 binaries ourselves. > > I wonder at this point if it wouldn't be easier to just to do. I don't have all the context but it seems to me that sqlite3 [officially supports icu](https://sqlite.org/src/dir/ext/icu) so it looks safer to me to rely on this one. I haven't tried, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just unicode aware case functions (lower, upper). We need both. > > Additionally, the README.md of the unicode extension of sqlean [mentions the following](https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md) which isn't particularly reassuring: > > > ⛔️ This extension is deprecated. Use text instead. > > What do you think? Yeah. Amusingly enoguh, it was deprecated a couple of weeks ago, right after I integrated it in my PR... The problem is that the suggested replacement extension (text), does not seem to support unaccent() yet. I will be able to move to from unicode to text when it supports that.

adam commented

@advplyr commented on GitHub (Aug 10, 2024):

I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that https://github.com/advplyr/audiobookshelf/issues/2187

@advplyr commented on GitHub (Aug 10, 2024): I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that https://github.com/advplyr/audiobookshelf/issues/2187

adam commented

@devnoname120 commented on GitHub (Aug 11, 2024):

I'm not an expert in any way regarding these topics but apparently MariaDB has out-of-the-box support for the utfmb4 charset with the modern uca1400_ai_ci collation (unicode collate algorithm 14.0.0, accent insensitive, case insensitive, see the naming convention). See also the documentation of supported character sets.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

@devnoname120 commented on GitHub (Aug 11, 2024): I'm not an expert in any way regarding these topics but apparently MariaDB has out-of-the-box support for the `utfmb4` charset with the modern [`uca1400_ai_ci` collation](https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-27009) ([unicode collate algorithm 14.0.0](https://www.unicode.org/reports/tr10/tr10-45.html), accent insensitive, case insensitive, see the [naming convention](https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-27210)). See also the [documentation](https://mariadb.com/kb/en/supported-character-sets-and-collations) of supported character sets. About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements... I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features? **Edit**: [nunicode](https://bitbucket.org/alekseyt/nunicode/src/master/#markdown-header-sqlite3-extension) looks like a very interesting alternative to Sqlean's unicode plugin.

adam commented

@mikiher commented on GitHub (Aug 11, 2024):

I'm not an expert in any way regarding these topics but apparently MariaDB has out-of-the-box support for the utfmb4 charset with the modern uca1400_ai_ci collation (unicode collate algorithm 14.0.0, accent insensitive, case insensitive, see the naming convention). See also the documentation of supported character sets.

Replacing the underlying db for supporting unaccent seems like a bit of an overkill to me.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

Yes, for the record (after some digging) it looks like using the SQLite ICU + ICU extension can work as well (at least in theory).
When you have that extension loaded, you can run:
SELECT icu_load_collation('root', 'aici', 'PRIMARY');
This creates a custom collation called aici which works on root (i.e. applies to all locales) and uses UCOL_PRIMARY collation strength (which means case and accent insensitive).
You can then (therretically) use this collation for comparisons and indexing, and an unaccent() function is not required.

This does seem to require both building our own SQLite icu-enabled version, and building the the icu extension, for all supported platforms and architectures, plus deploying them.

I am quite reluctant to go that way.

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

I haven't looked into this in detail, but yes, it does seem to provide unaccent.

However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed.

@mikiher commented on GitHub (Aug 11, 2024): > I'm not an expert in any way regarding these topics but apparently MariaDB has out-of-the-box support for the `utfmb4` charset with the modern [`uca1400_ai_ci` collation](https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-27009) ([unicode collate algorithm 14.0.0](https://www.unicode.org/reports/tr10/tr10-45.html), accent insensitive, case insensitive, see the [naming convention](https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-27210)). See also the [documentation](https://mariadb.com/kb/en/supported-character-sets-and-collations) of supported character sets. > Replacing the underlying db for supporting unaccent seems like a bit of an overkill to me. > About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements... > Yes, for the record (after some digging) it looks like using the SQLite ICU + ICU extension can work as well (at least in theory). When you have that extension loaded, you can run: `SELECT icu_load_collation('root', 'aici', 'PRIMARY');` This creates a custom collation called `aici` which works on `root` (i.e. applies to all locales) and uses UCOL_PRIMARY collation strength (which means case and accent insensitive). You can then (therretically) use this collation for comparisons and indexing, and an unaccent() function is not required. This does seem to require both building our own SQLite icu-enabled version, **and** building the the icu extension, for all supported platforms and architectures, plus deploying them. I am quite reluctant to go that way. > I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features? > > **Edit**: [nunicode](https://bitbucket.org/alekseyt/nunicode/src/master/#markdown-header-sqlite3-extension) looks like a very interesting alternative to Sqlean's unicode plugin. I haven't looked into this in detail, but yes, it does seem to provide unaccent. However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed.

adam commented

2026-04-24 23:57:48 +02:00

@mikiher commented on GitHub (Aug 14, 2024):

Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it.

So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself.

This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future.

@mikiher commented on GitHub (Aug 14, 2024): Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it. So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself. This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future.

adam commented

@advplyr commented on GitHub (Aug 14, 2024):

I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this.

@advplyr commented on GitHub (Aug 14, 2024): I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this.

adam commented

2026-04-24 23:57:48 +02:00

@mikiher commented on GitHub (Aug 14, 2024):

Are there any open source ones where I can look at the code?

On Wed, Aug 14, 2024 at 6:02 PM advplyr @.***> wrote:

I haven't looked into this yet but I know that most of the other media
servers use Sqlite so I wonder how they are handling this.

—
Reply to this email directly, view it on GitHub
https://github.com/advplyr/audiobookshelf/issues/2678#issuecomment-2289059051,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AFMDFVXDQPBHFN5WI7FYNSLZRNWRJAVCNFSM6AAAAABD4T3Q5GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBZGA2TSMBVGE
.
You are receiving this because you commented.Message ID:
@.***>

@mikiher commented on GitHub (Aug 14, 2024): Are there any open source ones where I can look at the code? On Wed, Aug 14, 2024 at 6:02 PM advplyr ***@***.***> wrote: > I haven't looked into this yet but I know that most of the other media > servers use Sqlite so I wonder how they are handling this. > > — > Reply to this email directly, view it on GitHub > <https://github.com/advplyr/audiobookshelf/issues/2678#issuecomment-2289059051>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AFMDFVXDQPBHFN5WI7FYNSLZRNWRJAVCNFSM6AAAAABD4T3Q5GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBZGA2TSMBVGE> > . > You are receiving this because you commented.Message ID: > ***@***.***> >

adam commented

2026-04-24 23:57:48 +02:00

@advplyr commented on GitHub (Aug 14, 2024):

Jellyfin is open source and uses sqlite. They use 2 db files and I just looked at the one named library.db in the ItemValues table they store "Value" and "CleanValue" that is unaccented.

It looks like that table stores everything that would be searchable. My movies and tv series I scanned in created about 125k rows in that table.

@advplyr commented on GitHub (Aug 14, 2024): Jellyfin is open source and uses sqlite. They use 2 db files and I just looked at the one named `library.db` in the `ItemValues` table they store "Value" and "CleanValue" that is unaccented. It looks like that table stores everything that would be searchable. My movies and tv series I scanned in created about 125k rows in that table.

adam commented

2026-04-24 23:57:49 +02:00

@advplyr commented on GitHub (Aug 14, 2024):

So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table.

@advplyr commented on GitHub (Aug 14, 2024): So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table.

adam commented

2026-04-24 23:57:49 +02:00

@mikiher commented on GitHub (Aug 14, 2024):

OK, I'll look into this direction as well.

I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic.

If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there's breaking changes in SQLite itself), barring bug fixes.

@mikiher commented on GitHub (Aug 14, 2024): OK, I'll look into this direction as well. I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic. If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there's breaking changes in SQLite itself), barring bug fixes.

adam commented

2026-04-24 23:57:49 +02:00

@advplyr commented on GitHub (Aug 14, 2024):

Thanks, I agree. It would be much better if we can handle this with sqlite

@advplyr commented on GitHub (Aug 14, 2024): Thanks, I agree. It would be much better if we can handle this with sqlite

adam commented

2026-04-24 23:57:50 +02:00

@advplyr commented on GitHub (Aug 14, 2024):

I looked at the plex sqlite db and they are doing something different. I can't open the icu tables in DBeaver.

Quick search led me https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3

@advplyr commented on GitHub (Aug 14, 2024): I looked at the plex sqlite db and they are doing something different. I can't open the `icu` tables in DBeaver. ![image](https://github.com/user-attachments/assets/79df8620-84d6-43e9-b626-0905a4da7608) Quick search led me https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3

adam commented

2026-04-24 23:57:50 +02:00

@mikiher commented on GitHub (Aug 14, 2024):

The discussion indicates the Plex uses the fts4 extension for implementing full text search.
I think FTS is likely an overkill for our purposes, though it may include the functionality we need. Do you think Audiobookshelf requires full-text-search capabilities?

@mikiher commented on GitHub (Aug 14, 2024): The discussion indicates the Plex uses the [fts4](https://www.sqlite.org/fts3.html) extension for implementing full text search. I think FTS is likely an overkill for our purposes, though it may include the functionality we need. Do you think Audiobookshelf requires full-text-search capabilities?

adam commented

2026-04-24 23:57:50 +02:00

@advplyr commented on GitHub (Aug 14, 2024):

I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later.

@advplyr commented on GitHub (Aug 14, 2024): I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later.

adam commented

@devnoname120 commented on GitHub (Aug 14, 2024):

This does seem to require both building our own SQLite icu-enabled version, and building the the icu extension, for all supported platforms and architectures, plus deploying them.

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner, see this excerpt from ext/icu/README.txt:

The easiest way to compile and use the ICU extension is to build
and use it as a dynamically loadable SQLite extension. To do this
using gcc on *nix:
gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` \
-o libSqliteIcu.so
You may need to add "-I" flags so that gcc can find sqlite3ext.h
and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
loaded into sqlite in the same way as any other dynamically loadable
extension.

For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments).

If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding

Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken in account. Reading the official Unicode Collation Algorithm (UCA) is enough to convince oneself of that.

Although it handles ordering rather than just equality, it's 66 pages long and it keeps evolving. This official algorithm also requires the Default Unicode Collation Element Table (DUCET), which is even bigger.

IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group, which is what ext/icu/icu.c relies on under the hood. Maintaining a custom solution may prove a lot of maintenance down the road in order to account for all the diacritic-related search issues that users may report.

What do you think?

@devnoname120 commented on GitHub (Aug 14, 2024): > This does seem to require both building our own SQLite icu-enabled version, and building the the icu extension, for all supported platforms and architectures, plus deploying them. As far as I understand it's an either, not an and. You can compile `icu.c` standalone with a one-liner, see this excerpt from [`ext/icu/README.txt`](https://sqlite.org/src/file?name=ext/icu/README.txt): > The easiest way to compile and use the ICU extension is to build > and use it as a dynamically loadable SQLite extension. To do this > using gcc on *nix: > > gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` \ > -o libSqliteIcu.so > > You may need to add "-I" flags so that gcc can find sqlite3ext.h > and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be > loaded into sqlite in the same way as any other dynamically loadable > extension. For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments). --- > If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken in account. Reading the official [Unicode Collation Algorithm](https://www.unicode.org/reports/tr10/) (UCA) is enough to convince oneself of that. Although it handles ordering rather than just equality, it's 66 pages long and it [keeps evolving](https://www.unicode.org/reports/tr10/proposed.html). This official algorithm also requires the [Default Unicode Collation Element Table](https://www.unicode.org/Public/UCA/latest/allkeys.txt) (DUCET), which is even bigger. IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group, which is what `ext/icu/icu.c` relies on under the hood. Maintaining a custom solution may prove a lot of maintenance down the road in order to account for all the diacritic-related search issues that users may report. What do you think?

adam commented

@mikiher commented on GitHub (Aug 15, 2024):

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner

Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work.

What do you think?

I'm guessing there are good reasons why prople are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine.

I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that.

@mikiher commented on GitHub (Aug 15, 2024): > As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work. > What do you think? I'm guessing there are good reasons why prople are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine. I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that.

adam commented

@devnoname120 commented on GitHub (Aug 17, 2024):

@mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default).

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

For example our Docker containers currently rely on alpine v3.20 (base image of alpine-node20) whose apk package manager provides an icu-libs package that supports all architectures: https://pkgs.alpinelinux.org/packages?name=icu-libs&branch=v3.20&repo=main&arch=&maintainer=

For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

Note that icu.c doesn't have any dependencies so compiling it very low-risk in terms of flakiness. As long as there is a bare C compiler/linker around it should compile without issues.

@devnoname120 commented on GitHub (Aug 17, 2024): @mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default). I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js). I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update? Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere. For example our Docker containers currently rely on `alpine` v3.20 (base image of `alpine-node20`) whose `apk` package manager provides an `icu-libs` package that supports all architectures: https://pkgs.alpinelinux.org/packages?name=icu-libs&branch=v3.20&repo=main&arch=&maintainer= ----------------- For SQLite's `icu.c` we could create a lightweight Node.js package that uses `node-gyp` to compile it (we [already use `node-gyp`?](https://github.com/search?q=repo%3Aadvplyr%2Faudiobookshelf%20node-gyp&type=code)). Note that `icu.c` doesn't have any dependencies so compiling it very low-risk in terms of flakiness. As long as there is a bare C compiler/linker around it should compile without issues.

adam commented

@mikiher commented on GitHub (Aug 18, 2024):

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension.

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those.

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that).

For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation.

In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on github workflows or a similar infrastructure, and make it available as release assets. or just check-in the pre-built extension binaries.

@mikiher commented on GitHub (Aug 18, 2024): > I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js). At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension. >I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update? Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those. > Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere. Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that). > For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we [already use node-gyp?](https://github.com/search?q=repo%3Aadvplyr%2Faudiobookshelf%20node-gyp&type=code)). I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation. In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on github workflows or a similar infrastructure, and make it available as release assets. or just check-in the pre-built extension binaries.

adam commented

@mikiher commented on GitHub (Aug 19, 2024):

OK, so here's some update on my experiments.

I compiled the icu extension and loaded it successfully. I was then also able to create a custom collation with:
SELECT icu_load_extension('root', 'aici', 'PRIMARY')
I was also able to test that comparison based on this collation was indeed accent- and case-insensitive with:
SELECT 'Árbol' = 'arbol' collate aici AS result
However, I was not successful testing that the following also returns 1:
SELECT 'Árbol' LIKE '%arb%' collate aici AS result
It turns out that LIKE in SQLite ignores collations (unlike other databases, which seem to support this syntax)

Now since we need LIKE for searching, we're back to square one, at least for this specific solution.

The other thing I noted about icu, is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet.

@mikiher commented on GitHub (Aug 19, 2024): OK, so here's some update on my experiments. - I compiled the icu extension and loaded it successfully. I was then also able to create a custom collation with: `SELECT icu_load_extension('root', 'aici', 'PRIMARY')` - I was also able to test that comparison based on this collation was indeed accent- and case-insensitive with: `SELECT 'Árbol' = 'arbol' collate aici AS result` - However, I was not successful testing that the following also returns 1: `SELECT 'Árbol' LIKE '%arb%' collate aici AS result` It turns out that `LIKE` in SQLite ignores collations (unlike other databases, which seem to support this syntax) Now since we need `LIKE` for searching, we're back to square one, at least for this specific solution. The other thing I noted about icu, is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet.

adam commented

@devnoname120 commented on GitHub (Aug 20, 2024):

@mikiher Thanks for the tests. With regard to the LIKE something must be wrong because ext/icu/icu.c explicitly mentions the following (emphasis mine):

This file implements an integration between the ICU library
("International Components for Unicode", an open-source library
for handling unicode data) and SQLite. The integration uses
ICU to provide the following to SQLite:

* An implementation of the SQL regexp() function (and hence REGEXP
operator) using the ICU uregex_XX() APIs.

* Implementations of the SQL scalar upper() and lower() functions
for case mapping.

* Integration of ICU and SQLite collation sequences.

* An implementation of the LIKE operator that uses ICU to
provide case-independent matching.

See icuLikeFunc() in the same file for the implementation.

Note: I was able to reproduce your issue, and here is what I did for my future reference:

brew install sqlite3 icu4c

export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig:/opt/homebrew/opt/sqlite/lib/pkgconfig"

# Note: I first tried -o libSqliteIcu.dylib but it broke .load with this error:
# Error: dlsym(0x83836050, sqlite3_sqliteicu_init): symbol not found
gcc -fPIC -dynamiclib icu.c `pkg-config --libs --cflags icu-uc icu-io sqlite3` -o icu.dylib

export PATH="/opt/homebrew/opt/sqlite/bin:$PATH"

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load icu.dylib
sqlite> SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici AS result;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici AS result;
0

I'm not sure if we are doing this right in order to make LIKE work with the collation.

@devnoname120 commented on GitHub (Aug 20, 2024): @mikiher Thanks for the tests. With regard to the `LIKE` something must be wrong because [`ext/icu/icu.c`](https://sqlite.org/src/file?name=ext/icu/icu.c) explicitly mentions the following (emphasis mine): > This file implements an integration between the ICU library > ("International Components for Unicode", an open-source library > for handling unicode data) and SQLite. The integration uses > ICU to provide the following to SQLite: > > \* An implementation of the SQL regexp() function (and hence REGEXP > operator) using the ICU uregex_XX() APIs. > > \* Implementations of the SQL scalar upper() and lower() functions > for case mapping. > > \* Integration of ICU and SQLite collation sequences. > > **\* An implementation of the LIKE operator that uses ICU to** > **provide case-independent matching.** See `icuLikeFunc()` in the same file for the implementation. ----------- Note: I was able to reproduce your issue, and here is what I did for my future reference: ```shell brew install sqlite3 icu4c export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig:/opt/homebrew/opt/sqlite/lib/pkgconfig" # Note: I first tried -o libSqliteIcu.dylib but it broke .load with this error: # Error: dlsym(0x83836050, sqlite3_sqliteicu_init): symbol not found gcc -fPIC -dynamiclib icu.c `pkg-config --libs --cflags icu-uc icu-io sqlite3` -o icu.dylib ``` ```shell export PATH="/opt/homebrew/opt/sqlite/bin:$PATH" ❯ sqlite3 SQLite version 3.46.1 2024-08-13 09:16:08 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> .load icu.dylib sqlite> SELECT icu_load_collation('root', 'aici', 'PRIMARY'); sqlite> SELECT 'Árbol' = 'arbol' collate aici AS result; 1 sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici AS result; 0 ``` I'm not sure if we are doing this right in order to make `LIKE` work with the collation.

adam commented

@mikiher commented on GitHub (Aug 20, 2024):

We are doing it right.

The LIKE implementation is just hard-coded to doing what it's written to do (Unicode aware case insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward.

The SQL syntax is correct, otherwise there would be a syntax error. the COLLATE part is just ignored.

If you find another suitable SQL expression to make the collation work with substring matching, please let me know.

@mikiher commented on GitHub (Aug 20, 2024): We are doing it right. The LIKE implementation is just hard-coded to doing what it's written to do (Unicode aware case insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward. The SQL syntax is correct, otherwise there would be a syntax error. the COLLATE part is just ignored. If you find another suitable SQL expression to make the collation work with substring matching, please let me know.

adam commented

2026-04-24 23:57:53 +02:00

@devnoname120 commented on GitHub (Aug 20, 2024):

The other thing I noted about icu, is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms in terms of size, compared to the ~100KB that the unicode extension weighs

@mikiher Ah indeed. I took a look at the current Docker image and it's currently 564 MB. According to your estimations we would add 36 MB which is a 6% increase in image size. Definitely not negligible, but not a deal-breaker IMO especially if it means very little maintenance and all the edge cases and things we haven't thought about nicely working (so we won't have to get back to it and debug possible issues).

I'm curious to see how either fares in terms of performances. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performances IMO. But since the tables usually have (I guess?) just a few thousands of rows it shouldn't be that bad.

As an exercise I made a quick implementation of an ICU-based unaccent() function for SQLite and here are the results:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load unaccent.dylib
sqlite> SELECT 'Árbol' = 'arbol' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NOCASE;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' COLLATE NOCASE;
1
sqlite> .load icu.dylib
SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' collate aici;
1

The code (this is an old version, I'm almost done cleaning it up, using sqlite_malloc() instead of malloc(), etc etc. and it uses unnecessary dynamic allocations):

#include <sqlite3ext.h>

#include <unicode/ustring.h>
#include <unicode/unorm2.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
#include <stdlib.h>
#include <string.h>

SQLITE_EXTENSION_INIT1

void unaccent(const UChar *input, UChar **output, UErrorCode *status) {
    const UNormalizer2 *normalizer = unorm2_getNFDInstance(status);
    if (U_FAILURE(*status)) {
        return;
    }

    int32_t inputLength = u_strlen(input);

    /* Maximum output of NFD transformation is a factor of 4.
    ** See: https://unicode.org/faq/normalization.html#12
    */
    int32_t maxNormalizedLength = 4 * inputLength;
    UChar *normalized = (UChar *)malloc(maxNormalizedLength * sizeof(UChar));
    if (!normalized) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        return;
    }

    int32_t normalizedLength = unorm2_normalize(normalizer, input, -1, normalized, maxNormalizedLength, status);
    if (U_FAILURE(*status)) {
        free(normalized);
        return;
    }

    *output = (UChar *)malloc((normalizedLength + 1) * sizeof(UChar));
    if (!(*output)) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        free(normalized);
        return;
    }

    int32_t resultIndex = 0;
    for (int32_t i = 0; i < normalizedLength;) {
        UChar32 c;
        U16_NEXT(normalized, i, normalizedLength, c);
        /* See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a6a2dbc531efce8d77fdb4c314e7fc25e */
        if (u_charType(c) != U_COMBINING_SPACING_MARK && u_charType(c) != U_NON_SPACING_MARK && u_charType(c) != U_ENCLOSING_MARK) {
            U16_APPEND(*output, resultIndex, normalizedLength, c, *status);
        }
    }
    (*output)[resultIndex] = 0;

    free(normalized);
}

static void unaccent_sqlite(sqlite3_context *context, int argc, sqlite3_value **argv) {
    if (argc != 1) {
        sqlite3_result_error(context, "Invalid number of arguments", -1);
        return;
    }

    const char *input = (const char *)sqlite3_value_text(argv[0]);
    if (input == NULL) {
        sqlite3_result_null(context);
        return;
    }

    /* FIXME: do a dynamic allocation like the for the output */
    UChar u_input[256];
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(u_input, sizeof(u_input)/sizeof(u_input[0]), NULL, input, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        return;
    }

    UChar *u_output = NULL;
    unaccent(u_input, &u_output, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        if (u_output) free(u_output);
        return;
    }

    char *output = (char *)malloc((u_strlen(u_output) * 4 + 1) * sizeof(char));
    if (!output) {
        sqlite3_result_error(context, "Memory allocation error", -1);
        free(u_output);
        return;
    }

    u_strToUTF8(output, u_strlen(u_output) * 4 + 1, NULL, u_output, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        free(output);
        free(u_output);
        return;
    }

    sqlite3_result_text(context, output, -1, SQLITE_TRANSIENT);

    free(output);
    free(u_output);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_unaccent_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
    SQLITE_EXTENSION_INIT2(pApi);
    return sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL, unaccent_sqlite, NULL, NULL);
}

I could definitely integrate it in icu.c, and if wanted apply unaccent() in the LIKE operator.

@devnoname120 commented on GitHub (Aug 20, 2024): > The other thing I noted about icu, is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms in terms of size, compared to the ~100KB that the unicode extension weighs @mikiher Ah indeed. I took a look at the current Docker image and it's currently `564 MB`. According to your estimations we would add `36 MB` which is a 6% increase in image size. Definitely not negligible, but not a deal-breaker IMO especially if it means very little maintenance and all the edge cases and things we haven't thought about nicely working (so we won't have to get back to it and debug possible issues). I'm curious to see how either fares in terms of performances. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performances IMO. But since the tables usually have (I guess?) just a few thousands of rows it shouldn't be that bad. ------------ As an exercise I made a quick implementation of an ICU-based `unaccent()` function for SQLite and here are the results: ``` ❯ sqlite3 SQLite version 3.46.1 2024-08-13 09:16:08 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> .load unaccent.dylib sqlite> SELECT 'Árbol' = 'arbol' COLLATE NOCASE; 0 sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NOCASE; 1 sqlite> SELECT 'Árbol' LIKE '%arb%' COLLATE NOCASE; 0 sqlite> SELECT unaccent('Árbol') LIKE '%arb%' COLLATE NOCASE; 1 sqlite> .load icu.dylib SELECT icu_load_collation('root', 'aici', 'PRIMARY'); sqlite> SELECT 'Árbol' = 'arbol' collate aici; 1 sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici; 0 sqlite> SELECT unaccent('Árbol') LIKE '%arb%' collate aici; 1 ``` ---------------- The code (this is an old version, I'm almost done cleaning it up, using `sqlite_malloc()` instead of `malloc()`, etc etc. and it uses unnecessary dynamic allocations): ```C #include <sqlite3ext.h> #include <unicode/ustring.h> #include <unicode/unorm2.h> #include <unicode/uchar.h> #include <unicode/utypes.h> #include <stdlib.h> #include <string.h> SQLITE_EXTENSION_INIT1 void unaccent(const UChar *input, UChar **output, UErrorCode *status) { const UNormalizer2 *normalizer = unorm2_getNFDInstance(status); if (U_FAILURE(*status)) { return; } int32_t inputLength = u_strlen(input); /* Maximum output of NFD transformation is a factor of 4. ** See: https://unicode.org/faq/normalization.html#12 */ int32_t maxNormalizedLength = 4 * inputLength; UChar *normalized = (UChar *)malloc(maxNormalizedLength * sizeof(UChar)); if (!normalized) { *status = U_MEMORY_ALLOCATION_ERROR; return; } int32_t normalizedLength = unorm2_normalize(normalizer, input, -1, normalized, maxNormalizedLength, status); if (U_FAILURE(*status)) { free(normalized); return; } *output = (UChar *)malloc((normalizedLength + 1) * sizeof(UChar)); if (!(*output)) { *status = U_MEMORY_ALLOCATION_ERROR; free(normalized); return; } int32_t resultIndex = 0; for (int32_t i = 0; i < normalizedLength;) { UChar32 c; U16_NEXT(normalized, i, normalizedLength, c); /* See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a6a2dbc531efce8d77fdb4c314e7fc25e */ if (u_charType(c) != U_COMBINING_SPACING_MARK && u_charType(c) != U_NON_SPACING_MARK && u_charType(c) != U_ENCLOSING_MARK) { U16_APPEND(*output, resultIndex, normalizedLength, c, *status); } } (*output)[resultIndex] = 0; free(normalized); } static void unaccent_sqlite(sqlite3_context *context, int argc, sqlite3_value **argv) { if (argc != 1) { sqlite3_result_error(context, "Invalid number of arguments", -1); return; } const char *input = (const char *)sqlite3_value_text(argv[0]); if (input == NULL) { sqlite3_result_null(context); return; } /* FIXME: do a dynamic allocation like the for the output */ UChar u_input[256]; UErrorCode status = U_ZERO_ERROR; u_strFromUTF8(u_input, sizeof(u_input)/sizeof(u_input[0]), NULL, input, -1, &status); if (U_FAILURE(status)) { sqlite3_result_error(context, u_errorName(status), -1); return; } UChar *u_output = NULL; unaccent(u_input, &u_output, &status); if (U_FAILURE(status)) { sqlite3_result_error(context, u_errorName(status), -1); if (u_output) free(u_output); return; } char *output = (char *)malloc((u_strlen(u_output) * 4 + 1) * sizeof(char)); if (!output) { sqlite3_result_error(context, "Memory allocation error", -1); free(u_output); return; } u_strToUTF8(output, u_strlen(u_output) * 4 + 1, NULL, u_output, -1, &status); if (U_FAILURE(status)) { sqlite3_result_error(context, u_errorName(status), -1); free(output); free(u_output); return; } sqlite3_result_text(context, output, -1, SQLITE_TRANSIENT); free(output); free(u_output); } #ifdef _WIN32 __declspec(dllexport) #endif int sqlite3_unaccent_init( sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi ){ SQLITE_EXTENSION_INIT2(pApi); return sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL, unaccent_sqlite, NULL, NULL); } ``` I could definitely integrate it in `icu.c`, and if wanted apply `unaccent()` in the `LIKE` operator.

adam commented

@mikiher commented on GitHub (Aug 21, 2024):

This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters).

At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you.

From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general.

I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains.

@advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this.

I'm curious to see how either fares in terms of performances. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performances IMO. But since the tables usually have (I guess?) just a few thousands of rows it shouldn't be that bad.

I think we have some users with tens of thousands of books in their library.

Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested.

@mikiher commented on GitHub (Aug 21, 2024): This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters). At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you. From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general. I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains. @advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this. > I'm curious to see how either fares in terms of performances. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performances IMO. But since the tables usually have (I guess?) just a few thousands of rows it shouldn't be that bad. I think we have some users with tens of thousands of books in their library. Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested.

adam commented

2026-04-24 23:57:53 +02:00

@devnoname120 commented on GitHub (Aug 24, 2024):

@mikiher Thank you for your thoughtful comment. After digging into alekseyt/nunicode I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better/performant/resilient than I could do building my own SQLite extension using ICU.

Upsides:

Great performances.
Clear/clean code.
No dependencies.
300 KB.

Downsides:

No accent-insensitive COLLATION, but it has an unaccent() function.
- Should not be too complicated though to create a new collation that is the combination of NU1300_NOCASE + unaccent() if we feel the need to later on.
Not necessarily a downside but the LIKE operator is (surprisingly — at least for me) overridden to perform case-folding aka expansion-insensitive comparisons (see example SELECT 'masse' = 'maße'; below).
- Quote from the README:
  
  The LIKE operation in nunicode extension support cases when strings might grow in size during case transformation. On demand this extension may be modified to be compatible with ICU extension.
No commits for 4 years.
- But it looks mature/feature-complete.
- We might want to shoot an email at the maintainer to gauge his interest into maintaining it if audiobookshelf uses it.

Now for the tests:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
0

sqlite> .load libnusqlite3.dylib

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300;
0
sqlite> SELECT 'masse' LIKE 'maße' COLLATE NU1300;
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300_NOCASE;
1

sqlite> SELECT 'árbol' = 'arbol';
0
sqlite> SELECT unaccent('árbol') = 'arbol';
1
sqlite> SELECT unaccent('Árbol') = 'arbol';
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) = 'arbol';
1

sqlite> SELECT 'árbol' LIKE 'arbol';
0
sqlite> SELECT unaccent('árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300;
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) LIKE 'arbol';
1

@mikiher Let me know your thoughts!

@devnoname120 commented on GitHub (Aug 24, 2024): @mikiher Thank you for your thoughtful comment. After digging into [alekseyt/nunicode](https://bitbucket.org/alekseyt/nunicode) I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better/performant/resilient than I could do building my own SQLite extension using ICU. **Upsides**: - Great performances. - Clear/clean code. - No dependencies. - 300 KB. **Downsides**: - No accent-insensitive `COLLATION`, but it has an `unaccent()` function. - Should not be too complicated though to create a new collation that is the combination of `NU1300_NOCASE` + `unaccent()` if we feel the need to later on. - Not necessarily a downside but the `LIKE` operator is (surprisingly — at least for me) overridden to perform case-folding aka expansion-insensitive comparisons (see example `SELECT 'masse' = 'maße';` below). - Quote from the README: > The LIKE operation in nunicode extension support cases when strings might grow in size during case transformation. On demand this extension may be modified to be compatible with ICU extension. - No commits for 4 years. - But it looks mature/feature-complete. - We might want to shoot an email at the maintainer to gauge his interest into maintaining it if audiobookshelf uses it. **Now for the tests**: ``` ❯ sqlite3 SQLite version 3.46.1 2024-08-13 09:16:08 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> SELECT 'masse' = 'Masse'; 0 sqlite> SELECT 'masse' LIKE 'Masse'; 1 sqlite> SELECT 'masse' = 'maße'; 0 sqlite> SELECT 'masse' LIKE 'maße'; 0 sqlite> .load libnusqlite3.dylib sqlite> SELECT 'masse' = 'Masse'; 0 sqlite> SELECT 'masse' LIKE 'Masse'; 1 sqlite> SELECT 'masse' = 'maße'; 0 sqlite> SELECT 'masse' LIKE 'maße'; 1 sqlite> SELECT 'masse' = 'maße' COLLATE NU1300; 0 sqlite> SELECT 'masse' LIKE 'maße' COLLATE NU1300; 1 sqlite> SELECT 'masse' = 'maße' COLLATE NU1300_NOCASE; 1 sqlite> SELECT 'árbol' = 'arbol'; 0 sqlite> SELECT unaccent('árbol') = 'arbol'; 1 sqlite> SELECT unaccent('Árbol') = 'arbol'; 0 sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300; 0 sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300_NOCASE; 1 sqlite> SELECT lower(unaccent('Árbol')) = 'arbol'; 1 sqlite> SELECT 'árbol' LIKE 'arbol'; 0 sqlite> SELECT unaccent('árbol') LIKE 'arbol'; 1 sqlite> SELECT unaccent('Árbol') LIKE 'arbol'; 1 sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300; 1 sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300_NOCASE; 1 sqlite> SELECT lower(unaccent('Árbol')) LIKE 'arbol'; 1 ``` @mikiher Let me know your thoughts!

adam commented

2026-04-24 23:57:53 +02:00

@mikiher commented on GitHub (Sep 22, 2024):

Hi, sorry for the long time it took - had many other things on my plate.

Wanted to give an update on my experiments with nunicode.

The first thing I needed to do was to built it. The project as it is currently only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So:

I wrote toolchain files for cross compiling to win64 and arm64.
I had to slightly modify the main cmake configuration of the project
Wrote a github workflow to run the build on a linux machine and upload the generated libraries

The experiments are at https://github.com/mikiher/nunicode-sqlite

I tested the built binaries:

On my Windows machine
On WSL on my Windows (which has Ubuntu installed)
On a QEMU arm64 emulator with Alpine Linux (with gcompat)

I ran the same rudimentary tests as you did above with the sqlite3 CLI. the Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: It loads the library successfully and runs some of the test queries successfully, but fails on the following:

sqlite> SELECT 'masse' LIKE 'maße';
0
sqlite> SELECT upper('maße');
MAßÑא≈אאא

I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear.

@mikiher commented on GitHub (Sep 22, 2024): Hi, sorry for the long time it took - had many other things on my plate. Wanted to give an update on my experiments with nunicode. The first thing I needed to do was to built it. The project as it is currently only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So: - I wrote toolchain files for cross compiling to win64 and arm64. - I had to slightly modify the main cmake configuration of the project - Wrote a github workflow to run the build on a linux machine and upload the generated libraries The experiments are at https://github.com/mikiher/nunicode-sqlite I tested the built binaries: - On my Windows machine - On WSL on my Windows (which has Ubuntu installed) - On a QEMU arm64 emulator with Alpine Linux (with gcompat) I ran the same rudimentary tests as you did above with the sqlite3 CLI. the Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: It loads the library successfully and runs some of the test queries successfully, but fails on the following: ``` sqlite> SELECT 'masse' LIKE 'maße'; 0 sqlite> SELECT upper('maße'); MAßÑא≈אאא ``` I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear.

adam commented

@devnoname120 commented on GitHub (Sep 22, 2024):

@mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though.

Which version of sqlite3 do you use? Which options are enabled?
As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

I just tested your example and here is what I see on my end:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load libnusqlite3.dylib
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT upper('maße');
MASSE

I did this in order to compile the sqlite extension:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make

@devnoname120 commented on GitHub (Sep 22, 2024): @mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though. Which version of sqlite3 do you use? Which options are enabled? As far as I know you don't need gcompat for this one because it doesn't rely on glibc. I just tested your example and here is what I see on my end: ``` ❯ sqlite3 SQLite version 3.46.1 2024-08-13 09:16:08 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> .load libnusqlite3.dylib sqlite> SELECT 'masse' LIKE 'maße'; 1 sqlite> SELECT upper('maße'); MASSE ``` I did this in order to compile the sqlite extension: ```shell mkdir build cd build cmake .. -DCMAKE_BUILD_TYPE=Release make ```

adam commented

@mikiher commented on GitHub (Sep 22, 2024):

Which version of sqlite3 do you use?

SQLite version 3.45.3 2024-04-15 13:34:05
It's the one installed by apk add sqlite. I haven't enabled or disabled any particular options.

As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

It doesn't? so it doesn't matter which compiler I used for cross compilation?

I did this in order to compile the sqlite extension

This is roughly what I did as well. You can see all of the actions I ran in the github workflow

As I said there were a few changes I had to make:

Added toolchain files for cross compiling to win64 and arm64 in nunicode/cmake
Slightly modified the main cmake configuration of the project
- I removed the find_package(Sqlite3) directive (which sets SQLITE3_INCLUDE_DIR to /usr/include), and the SQLITE3_FOUND condition. Instead, I copied the sqlite3 headers to nunicode/include, and set SQLITE3_INCLUDE_DIR to that. I had to do this because otherwise the cross-compilers would be trying to use headers from /usr/include when compiling the extension, which is not desired.

I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used aarch64-linux-gnu-gcc. I mentioend gcompat because I know gcc uses glibc by default.

@mikiher commented on GitHub (Sep 22, 2024): > Which version of sqlite3 do you use? `SQLite version 3.45.3 2024-04-15 13:34:05` It's the one installed by `apk add sqlite`. I haven't enabled or disabled any particular options. > As far as I know you don't need gcompat for this one because it doesn't rely on glibc. It doesn't? so it doesn't matter which compiler I used for cross compilation? >I did this in order to compile the sqlite extension This is roughly what I did as well. You can see all of the actions I ran in the [github workflow](https://github.com/mikiher/nunicode-sqlite/blob/main/.github/workflows/build-binaries.yaml) As I said there were a few changes I had to make: - Added toolchain files for cross compiling to win64 and arm64 in `nunicode/cmake` - Slightly modified the main cmake configuration of the project - I removed the `find_package(Sqlite3)` directive (which sets `SQLITE3_INCLUDE_DIR` to `/usr/include`), and the `SQLITE3_FOUND` condition. Instead, I copied the sqlite3 headers to `nunicode/include`, and set `SQLITE3_INCLUDE_DIR` to that. I had to do this because otherwise the cross-compilers would be trying to use headers from `/usr/include` when compiling the extension, which is not desired. I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used `aarch64-linux-gnu-gcc`. I mentioend gcompat because I know gcc uses glibc by default.

adam commented

@devnoname120 commented on GitHub (Sep 23, 2024):

@mikiher Hmm it seems that the .so is indeed linked against glibc which is not optimal. I'd suggest trying again to compile it but from Alpine this time around. I cannot dig further into it right now unfortunately — the next two weeks will be quite busy on my side.

Tangentially I think that it would make sense at some point to reach out to @alekseyt directly (aleksey.tulinov@gmail.com). The README of nunicode provides this email address and explicitly encourages people to contact him.

@devnoname120 commented on GitHub (Sep 23, 2024): @mikiher Hmm it seems that the `.so` is indeed linked against glibc which is not optimal. I'd suggest trying again to compile it but from Alpine this time around. I cannot dig further into it right now unfortunately — the next two weeks will be quite busy on my side. Tangentially I think that it would make sense at some point to reach out to @alekseyt directly (aleksey.tulinov@gmail.com). The README of nunicode provides this email address and explicitly encourages people to contact him.

adam commented

@mikiher commented on GitHub (Sep 23, 2024):

Thanks, I will also try to cross compile for musl and see if that helps.

The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners.

@mikiher commented on GitHub (Sep 23, 2024): Thanks, I will also try to cross compile for musl and see if that helps. The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners.

adam commented

@mikiher commented on GitHub (Sep 24, 2024):

So it turned out all of the issues were due to the crappy QEMU console, that seems to mess up non-ascii input :(

Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away.
I checked all 3 linux aarch64 libraries I built:

cross-compiled with glibc
cross-compiled with musl
native-compiled (on QEMU aarch64 emulator running Alpine) with musl

They all pass the simple tests above.

I think I'll stick with the original version I built because it's easiest to set-up the cross-compiler to build it.

Next I'm going to work on integrating the extenstions in ABS and testing.

@mikiher commented on GitHub (Sep 24, 2024): So it turned out all of the issues were due to the crappy QEMU console, that seems to mess up non-ascii input :( Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away. I checked all 3 linux aarch64 libraries I built: - cross-compiled with glibc - cross-compiled with musl - native-compiled (on QEMU aarch64 emulator running Alpine) with musl They all pass the simple tests above. I think I'll stick with the original version I built because it's easiest to set-up the cross-compiler to build it. Next I'm going to work on integrating the extenstions in ABS and testing.

adam commented

@devnoname120 commented on GitHub (Sep 29, 2024):

@mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build and not rely on gcompat or cross-compilation at all for those.

See here for an example of a multi-platform build. Docker makes it surprisingly easy, and you don't have to bother with QEMU by yourself:
https://github.com/code-inflation/cfspeedtest/pull/94/files

And the doc for multi-stage builds in case you are not familiar with them:
https://docs.docker.com/build/building/multi-stage/#use-multi-stage-builds

I can cobble up a PR if it helps!

@devnoname120 commented on GitHub (Sep 29, 2024): @mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build and not rely on `gcompat` or cross-compilation at all for those. See here for an example of a multi-platform build. Docker makes it [surprisingly easy](https://docs.docker.com/build/building/multi-platform/#build-multi-platform-images), and you don't have to bother with QEMU by yourself: https://github.com/code-inflation/cfspeedtest/pull/94/files And the doc for multi-stage builds in case you are not familiar with them: https://docs.docker.com/build/building/multi-stage/#use-multi-stage-builds I can cobble up a PR if it helps!

adam commented

@mikiher commented on GitHub (Sep 29, 2024):

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

In the future we can consider providing Nunicode as an apk (I admit the additional code in the dockerfile is somewhat ugly).

@mikiher commented on GitHub (Sep 29, 2024): But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available? In the future we can consider providing Nunicode as an apk (I admit the additional code in the dockerfile is somewhat ugly).

adam commented

@devnoname120 commented on GitHub (Oct 5, 2024):

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc.

IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose.

For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image.

@devnoname120 commented on GitHub (Oct 5, 2024): > But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available? As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc. IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose. For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image.

adam commented

@mikiher commented on GitHub (Oct 6, 2024):

I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image.

@mikiher commented on GitHub (Oct 6, 2024): I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image.

adam commented