mirror of
https://github.com/advplyr/audiobookshelf.git
synced 2026-05-30 23:40:40 +02:00
[enhancement] sync ebooks and audiobooks via processing audiobook to text (Pie in the sky idea) #99
Open
opened 2026-04-24 22:58:24 +02:00 by adam
·
36 comments
Originally created by @zombiehoffa on GitHub (Nov 17, 2021).
Once ebooks are a lot more mature, it would be awesome to be able to identify when an ebook and an audiobook are the same book and automagically run speech-to-text on the audiobook so that the audiobook and the ebook can be kept in sync.
@gelsas commented on GitHub (Dec 30, 2021):
So basically a self-made version of Amazon's Whispersync feature.
That would be a game changer!
@jrhbcn commented on GitHub (Apr 7, 2022):
I cannot give more +1 to this. For me it would be the killer feature of audiobookshelf as soon as the ebook reader is more mature.
As a reference these libraries might help implementing this: afaligner and aeneas.
@DDriggs00 commented on GitHub (Oct 11, 2022):
While I agree that this would be an incredible feature, it is definitely a very long-term goal, and would require an incredible amount of work.
@andrewls commented on GitHub (Dec 19, 2022):
This project also seems relevant. I haven't tried it out yet but I've been meaning to. I'll report back on what I find if I do end up trying it out in the next couple of months. A huge issue with this feature is going to be incorporating support for a reading experience of some kind. For that we could probably look at porting Epub3 Media Overlay functionality out from minstrel but all of that code is pretty dated and therefore likely not in the best of shape, and it also locks you into requiring users to create an EPUB3 file with a media overlay instead of any other possible format we might choose. I've definitely looked at implementing something like this in the past and then didn't keep up on it because I didn't have anywhere near enough free time to dedicate to something of this scale. I agree though, this would be an absolutely incredible feature.
@zombiehoffa commented on GitHub (Dec 20, 2022):
andrewls, wow, that makes this seem a lot more possible than the pie in the sky idea I thought it was.
@pbozzay commented on GitHub (Feb 10, 2023):
+1, this would be the killer feature
@donkevlar commented on GitHub (Mar 1, 2023):
Would love to see this as well!
@jonasrk commented on GitHub (Oct 21, 2023):
Just found out about audiobookshelf googling for "Whispersync for Voice open source alternatives". Would be so cool to make this happen somehow.
@sphars commented on GitHub (Dec 24, 2023):
Came across this on Hacker News this morning, wonder if it's something that could be integrated, or use the epubs that it creates?
From their docs: It's a self-hosted platform for taking an audiobook (either as an m4b/mp4 file, or as a zip of mp3 files) and an ebook (as an epub file) and producing a new epub file with synced narration support. This follows the media overlay spec for epubs.
@FreedomBen commented on GitHub (Dec 24, 2023):
I've been experimenting locally with using whisper.cpp to make transcripts of my audiobooks. The reason transcripts rather than just an epub version is that it includes timestamps, which can be easily used to:
I suspect it wouldn't be terribly hard to build a "whispersync" type of thing on top of this (once it exists of course).
If somebody wants to implement this sooner than I have availability, I'm happy to yield it. Let me know and I'll try to knowledge dump what I have. Also happy to brainstorm the idea. I'm @freedomben in the Matrix chat
@smoores-dev commented on GitHub (Dec 25, 2023):
This is actually how Media Overlays work, as well (I'm the author of Storyteller, the project that @sphars linked to). A Media Overlay is just an XML file that maps XHTML elements to segments of audio files. The Storyteller reader apps can (and do!), for example, highlight the current sentence while it's being read:
And they could also allow you to find the written text based on the timestamp (that's essentially the premise that the Storyteller reader apps are predicated on)! For any given timestamp, you can always find the location in the EPUB text that corresponds to it.
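The timestamp-to-text lookup described above can be sketched as follows. This is a minimal illustration, not Storyteller's code: it assumes a single EPUB 3 Media Overlay (SMIL) document whose `par` elements each pair one `text` with one `audio` clip from the same audio file. The sample document and the helper names `parse_clock`, `load_overlay` and `text_at` are made up for the example.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right

SMIL_NS = {"smil": "http://www.w3.org/ns/SMIL"}

def parse_clock(value: str) -> float:
    """Convert a SMIL clock value ('0:01:02.5', '62.5s', '62.5', ...) to seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000
    if value.endswith("min"):
        return float(value[:-3]) * 60
    if value.endswith("h"):
        return float(value[:-1]) * 3600
    if value.endswith("s"):
        return float(value[:-1])
    seconds = 0.0
    for part in value.split(":"):  # [hh:]mm:ss(.fff) or a bare number
        seconds = seconds * 60 + float(part)
    return seconds

def load_overlay(smil_xml: str):
    """Return a sorted list of (clip_begin, clip_end, text_src) tuples."""
    entries = []
    for par in ET.fromstring(smil_xml).iter("{http://www.w3.org/ns/SMIL}par"):
        text = par.find("smil:text", SMIL_NS)
        audio = par.find("smil:audio", SMIL_NS)
        if text is None or audio is None:
            continue
        begin = parse_clock(audio.get("clipBegin", "0"))
        clip_end = audio.get("clipEnd")
        end = parse_clock(clip_end) if clip_end else float("inf")
        entries.append((begin, end, text.get("src")))
    entries.sort()
    return entries

def text_at(entries, seconds: float):
    """Return the text fragment reference being narrated at `seconds`, if any."""
    i = bisect_right([begin for begin, _, _ in entries], seconds) - 1
    if i >= 0 and seconds < entries[i][1]:
        return entries[i][2]
    return None

# Hypothetical two-sentence overlay, for illustration only.
SAMPLE = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="s1"><text src="ch1.xhtml#s1"/><audio src="ch1.mp3" clipBegin="0:00:00.000" clipEnd="0:00:04.200"/></par>
    <par id="s2"><text src="ch1.xhtml#s2"/><audio src="ch1.mp3" clipBegin="0:00:04.200" clipEnd="0:00:09.750"/></par>
  </body>
</smil>"""

entries = load_overlay(SAMPLE)
print(text_at(entries, 5.0))
```

A real overlay usually spans several audio files, so a fuller version would key the lookup on the audio `src` as well as the time.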
@gelsas commented on GitHub (Dec 27, 2023):
Is it also possible to fine-tune the highlighting even more? I think Amazon's Whispersync highlights word by word. I am so used to that by now that I wondered if it would be possible to do that as well with Storyteller.
@smoores-dev commented on GitHub (Dec 28, 2023):
It's possible! Storyteller has word-level timestamps available, but its reliance on fuzzy search for alignment (to account for inaccuracies in the transcription) might make word-level highlights challenging to get right.
If it's a feature you're interested in, feel free to make an Issue on the Storyteller project! It's on GitLab (gitlab.com/smoores/storyteller), but there's a mirror on GitHub if you don't have a GitLab account; I'll copy any Issues created there over to GitLab.
@mr-ransel commented on GitHub (Dec 29, 2023):
I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST.
Essentially the flow would look like:
An extension would be to handle conversion of non epubs to epub transparently as well for convenience.
Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly and auto-pair all the audio and ebooks so no manual pairing is necessary.
@smoores-dev commented on GitHub (Dec 29, 2023):
That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.
@MxMarx commented on GitHub (Feb 12, 2024):
I was playing around with Storyteller, and it looks amazing for this! Media overlays don't look super easy to access with epub.js, although there's a pull request for that, but something like this snippet, inserted here, can extract the timestamp-to-CFI mappings from the epubs output by Storyteller.
Since the current epub reader needs the whole epub to be sent to the client, it might be a good idea to use either the original epub since the marked up epub includes embedded audio files, or strip the audio files from Storyteller output.
If using the existing audio files instead of embedding them, another consideration is that the timestamps generated by Storyteller are relative to the audiobook chapters instead of the whole audio. If going down that path, I'm not sure if it would make more sense to modify Storyteller to include some metadata to map the chapter offsets back to the original file, or have audiobookshelf do some post processing after running Storyteller.
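As a rough sketch of the post-processing option, converting between chapter-relative and whole-book timestamps only needs the chapter durations. The durations below are invented, and `chapter_offsets`, `to_absolute` and `to_chapter` are hypothetical helper names, not functions from Storyteller or audiobookshelf.

```python
from bisect import bisect_right
from itertools import accumulate

def chapter_offsets(durations):
    """Start time of each chapter within the whole audiobook, in seconds."""
    return [0.0, *accumulate(durations)][:-1]

def to_absolute(chapter_index, chapter_seconds, durations):
    """Chapter-relative timestamp -> position in the whole audiobook."""
    return chapter_offsets(durations)[chapter_index] + chapter_seconds

def to_chapter(absolute_seconds, durations):
    """Whole-audiobook position -> (chapter_index, chapter-relative seconds)."""
    offsets = chapter_offsets(durations)
    index = max(bisect_right(offsets, absolute_seconds) - 1, 0)
    return index, absolute_seconds - offsets[index]

# Hypothetical chapter lengths in seconds.
durations = [600.0, 900.0, 750.0]
print(to_absolute(1, 30.5, durations))   # 30.5 s into the second chapter
print(to_chapter(1500.0, durations))     # start of the third chapter
```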
@stassinari commented on GitHub (Mar 12, 2024):
With the latest iOS 17.4 update, Apple introduced a new transcript feature which is useful and quite intuitive.
I know it's not exactly what this issue is about, but there might be interesting ideas, especially in terms of UX.
@sevenlayercookie commented on GitHub (Mar 30, 2024):
Have you experimented with live transcription using Whisper? As in, using whisper to transcribe what is currently being played and "buffering" 30 seconds ahead or so. Even using CPU alone, it sounds like faster-whisper can easily outpace an audiobook playing at original speed (1x). Would essentially be Immersive Reading (and would localize to the individual word as well, rather than just the whole sentence). And I suppose this transcription could be cached for future use and fed into the fuzzy search to attempt to sync with an ebook as well.
Basically an on-demand, live transcription version of Storyteller, cutting out need for pre-processing.
@Astorsoft commented on GitHub (May 9, 2024):
This idea would be amazing, and outsourcing the sync to a dedicated tool like Storyteller is a great idea. If you want to go down the route of an internal service, however, I've already mentioned this on Storyteller's project, but I think https://github.com/echogarden-project/echogarden is an amazing backend for speech-to-transcript alignment that works with many more languages than English. I did some tests on Swedish and they were very conclusive; based on their docs it can go down to word-level alignment with great accuracy.
Audiobook/epub alignment is always better than TTS, as the reader often makes a great effort to change their tone of voice for each character and does a good job of expressing the characters' feelings. Maybe one day whisper will reach this stage but we're not there yet.
Lastly, good luck on the player part. It's a nightmare to find a good epub reader with media overlay support, at least on android. Some don't work with specific file format (like ogg vorbis), some add weird delay in the playback, making you think the alignment is off while it is in fact perfect when checked on other platforms like windows.
@Bothari commented on GitHub (May 25, 2024):
I have written a local system which transcribes an audiobook to text, converts an epub to text, and then performs matching on the two pieces of text to match timestamps in the audiobook to a "percentage" in the epub text. I do not have an understanding of an accurate way to reference a location in the epub, which is restricting my ability to do anything better than this.
On my server - a pretty low powered NUC - it will perform the matching at approx 15x the speed of the audiobook, meaning a 15 hour audiobook would take around an hour to process. I haven't spent any time trying to optimise this, it's just a first pass.
I see this being an on demand tool that a user could perform on an item, much like the "Embed Metadata" tool which exists for audiobooks.
@megawubs commented on GitHub (Jun 6, 2024):
Chiming in here because this flow is exactly like what I'm looking for.
What are the steps needed to make this work?
@iamhenry commented on GitHub (Jul 12, 2024):
Snipd just released audiobook transcriptions. Would love to see this in ABS.
https://x.com/snipd_app/status/1811024587292864948
@CoryGH commented on GitHub (Nov 3, 2024):
+1
@toonvank commented on GitHub (Jan 16, 2025):
Bump
@megawubs commented on GitHub (Jan 30, 2025):
Isn't the first step towards an integration to at least sync the progress to ABS from Storyteller? This way, I can pick up where I left off on a different device that doesn't have the Storyteller app. I only use the Storyteller server to align the books, not to manage my books. That's what ABS is for.
@smoores-dev commented on GitHub (Jan 30, 2025):
Yup, this is exactly what we're going to do: add support in Storyteller to sync progress to external services, like KOReader and ABS. If any ABS devs would be interested in helping out with this, I'd love to collaborate! Otherwise I'll get to this as soon as I have a chance.
@ScyllPoesis commented on GitHub (Jan 30, 2025):
Have been following the Storyteller progress for a while now but haven't used it much, still awesome work! A native integration with ABS would be incredible and be a perfect resolution to this issue and likely end up making my household incredibly happy (often we switch on-and-off between listening to the audiobook while working, then reading between breaks at much faster speed).
@zombiehoffa commented on GitHub (Jan 31, 2025):
You read faster than an audiobook can play at 2-2.5x?
@weckere commented on GitHub (Feb 24, 2025):
I don't know about ScyllPoesis, but I certainly do! Even if I didn't, anything faster than 1.25x is really too fast for me to process as an audio stream. I prefer to read visually as I can go much, much faster. Playing audio at the same time is only a garbled distraction at my normal reading speed.
When reading a story, I typically will scan back and forth over the text as I read. If I misunderstand something, scanning the text again is nearly instantaneous. With an audiobook, trying to skip back to hear a missed word, especially at high-speed, is a significant interruption in my flow.
I used to refuse to listen to audiobooks, but my life has become busier, and if I want to continue enjoying books, I need to take advantage of time when I'm already using my hands or eyes for another activity. I used to buy paper (because I find paper easier to read than a screen), ebook, and audiobook copies of the same title so that I could do this, but it was very expensive, and even more importantly, time-consuming to sync my progress between mediums.
Creating epub3 files with media overlays using Storyteller has basically solved this problem for me. I still maintain a library in ABS, but I find myself using only the Storyteller app to read or listen to my books now because it syncs my progress whether I choose to listen or read.
I've also discovered that using a read-along feature helps me stay focused when reading very dry material like textbooks or manuals. I used to copy text out of a PDF and have Siri read it to me on my mac to accomplish this. Although this is something I don't do often, for me and other folks with ADHD or information processing difficulties, this can be extremely helpful. Storyteller does this one better by highlighting sentences as it reads, taking the onus off me to scroll along, and making it much easier to find my place if I get lost.
If ABS was able to implement or hook-in to this functionality, that would be a massive benefit to me, as every other aspect of ABS is unparalleled in its functionality, polish, and ease-of-use. Absolutely fantastic, this app is best-in-class.
@sevenlayercookie commented on GitHub (Feb 25, 2025):
advplyr has said a priority feature that is in the works is a plugin system -- could be a good avenue for integrating Storyteller. Not sure how far along the plugin system is.
@JudasSleeze commented on GitHub (Apr 21, 2025):
I have been using ebook2audiobook for making audiobooks out of my epub books and using Storyteller to sync them. A plugin system would be great so we can have these all in one amazing useful place. Thank you for this consideration.
@vindex10 commented on GitHub (May 25, 2025):
Hi! I was investigating the topic to find a simple workaround with currently existing tools. I just want to share my findings :)
The following combination seems to be viable:
Kosync alternatives:
@J-Lich commented on GitHub (Nov 25, 2025):
Hi All - I built a bridge between KoSync and ABS. Feel free to check it out - I have been testing it for a few weeks. Code needs work but functions just fine for now (I'm sure edge cases will arise!).
Will try to get a Docker Hub image up soon. Edit: Dockerhub
ABS-KoSync-Bridge
@kevincox commented on GitHub (Feb 10, 2026):
I've been thinking about this and I don't think it is too technically challenging to implement. It would require quite a bit of work to glue the pieces together, but I think it wouldn't require any new research or clever algorithms to do so. Unfortunately I am unlikely to have time to do this soon, so I figured I would write my thoughts down to hopefully make the problem appear more approachable to anyone who is able to implement it.
Fundamentally I have realized that this problem is basically a diff algorithm (like a text diff). So if we make the following assumptions:
Then you can do a process like this:
That's it. Now you can link (most of) the words in the audiobook to the words in the ebook. There will be a lot of small mismatches due to slight differences in the text as well as transcription errors, but that doesn't matter much. While getting every word absolutely perfect would be ideal, this feature would be very valuable even if you only have a few synchronization points per paragraph (or even page). In practice I suspect it would be much better than that. I suspect you can get well under sentence-level accuracy.
For the non-perfect matches you can decide what you want to do. You can do some basic guesses like just linearly map the ebook words across the time gap in the audio book. But I suspect in most cases it would be best to group these into a gap and just ignore it.
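The "linearly map" guess could look something like the sketch below. It assumes the gap's audio start and end times are already known from the neighbouring matched words; `interpolate_gap` is a hypothetical helper name.

```python
def interpolate_gap(ebook_words, start, end):
    """Spread the unmatched ebook words evenly across the audio range [start, end).

    Returns (word, guessed_start, guessed_end) tuples. The timings are only a
    guess: real narration does not give every word the same duration.
    """
    if not ebook_words:
        return []
    step = (end - start) / len(ebook_words)
    return [
        (word, start + i * step, start + (i + 1) * step)
        for i, word in enumerate(ebook_words)
    ]

# Three unmatched ebook words between 10 s and 13 s of audio.
for word, begin, finish in interpolate_gap(["not", "to", "be"], 10.0, 13.0):
    print(f"{word}: {begin:.1f}-{finish:.1f}")
```

Ignoring the gap instead simply means keeping the single `(start, end)` range for the whole block of words.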
To avoid small errors it would probably be beneficial to do some preprocessing on both sides like stemming, case conversion, spelling normalization (colour vs color). But even without this I suspect the result would be sufficient.
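A minimal sketch of such preprocessing, using only the standard library, might look like this. It covers case conversion, punctuation stripping and a spelling table, but not stemming, which would need an extra library. The `SPELLING` table is a tiny made-up example, and `normalize` and `tokenize` are hypothetical names.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny spelling table; a real one would be much larger.
SPELLING = {"colour": "color", "grey": "gray"}

def normalize(word: str) -> str:
    """Reduce a word to a comparable form before diffing."""
    word = unicodedata.normalize("NFKC", word).casefold()
    word = word.replace("\u2019", "'")      # curly apostrophe -> straight
    word = re.sub(r"[^\w']+", "", word)     # drop punctuation, keep apostrophes
    word = word.strip("'")                  # drop quote marks at the edges
    return SPELLING.get(word, word)

def tokenize(text: str):
    """Split text on whitespace and normalize each word, dropping empty results."""
    return [w for w in map(normalize, text.split()) if w]

print(tokenize('The Colour, he said, was "GREY".'))
```

The original, unnormalized words would still need to be kept alongside the normalized ones so that positions can be mapped back to the ebook text.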
Here is a short example of two chunks of text. I diffed them with
diff -U999 /tmp/{audiobook,ebook}
You then just group up the bunches of - and + sections into "Non-Match" groups.
So most words would map 1-1, but you have some brief mismatches, for example a missed word in the transcription or slight errors. This group would still have a time range which matches a block of words, however the exact timing within would be unknown. You may also have cases where the sentences were reordered or something, and those would just show up as a block of unknown as well. I think the key point is to treat each "section" as a small range. So you can be inside the "pot to me" range and you know that that range of audio maps to the "not to be" range of text in the ebook. Then you can seek back and forth to make the ranges overlap in both.
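The grouping described above can be sketched with Python's standard `difflib` in place of the command-line `diff`. This is only an illustration under stated assumptions: the transcript is a list of `(word, start_seconds, end_seconds)` tuples with invented timings, and `align` is a hypothetical function name. Note that `SequenceMatcher` may split a mismatch more finely than the "pot to me" range above, because the shared word "to" in the middle counts as a match of its own.

```python
from difflib import SequenceMatcher

def align(transcript, ebook_words):
    """Diff transcript words against ebook words and return sections.

    Each section records whether it matched, the half-open range of ebook
    word indices it covers, and the audio time range it corresponds to.
    """
    audio_words = [word for word, _, _ in transcript]
    matcher = SequenceMatcher(None, audio_words, ebook_words, autojunk=False)
    sections = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if a1 == a2:
            # Words that only exist in the ebook: zero-length audio range.
            start = end = transcript[a1 - 1][2] if a1 else 0.0
        else:
            start, end = transcript[a1][1], transcript[a2 - 1][2]
        sections.append({"match": tag == "equal", "ebook": (b1, b2), "audio": (start, end)})
    return sections

# Invented example: every transcribed word lasts half a second.
heard = "to be or pot to me that is the question".split()
transcript = [(word, i * 0.5, (i + 1) * 0.5) for i, word in enumerate(heard)]
ebook_words = "to be or not to be that is the question".split()

sections = align(transcript, ebook_words)
for section in sections:
    print(section)
```

Adjacent non-matching sections separated by a very short match could be merged afterwards if coarser ranges are preferred.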
Performance
Unfortunately many diff algorithms are O(NM), where N and M are the sizes of the left and right sides of the diff. In this case those are both roughly the same, so O(n^2). I suspect this isn't much of an issue in practice: diff algorithms have gotten very good at diffing large files, and since the sides of the diff are pretty similar I think the heuristics will do a good job finding the common text quickly. This should be pretty easy to evaluate by dumping a book and transcription to files and using the command line diff tool. The worst case may be slightly different depending on exactly how similar the texts are, but I think we should be able to get a ballpark fairly easily. Since this runs just once on the server at import time I'm not too worried.
This could also be made a lot better if we choose to trust chapters. In that case we would only have to run the diff on a fraction of the book at a time and that O(n^2) worst case shrinks dramatically.
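As a back-of-the-envelope illustration of that shrinkage, assume a book of n words split into k chapters of roughly equal length. The figures n = 100,000 and k = 40 are invented for the example.

```latex
% Worst-case cost of diffing chapter by chapter instead of the whole book at once
\[
  \underbrace{k \cdot \left(\frac{n}{k}\right)^{2}}_{\text{per-chapter diffs}}
  = \frac{n^{2}}{k}
  \qquad\text{versus}\qquad
  \underbrace{n^{2}}_{\text{whole-book diff}}
\]
% Example: n = 100\,000 words, k = 40 chapters of 2\,500 words each
\[
  40 \cdot 2\,500^{2} = 2.5 \times 10^{8}
  \qquad\text{versus}\qquad
  100\,000^{2} = 10^{10}
\]
```

So trusting chapters cuts the worst case by a factor of k, here 40, provided the chapter boundaries in the audio and the ebook really do line up.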
UX
I don't want to get too far into UX because I'm not a UX expert. However, I think something very simple such as "whenever the user makes progress in one, move the other" would be sufficient. So if the user turns the page of the ebook, the audiobook would be progressed towards the first word of the new page (if forward) or the last word of the previous page (if back). Similarly, as the audiobook plays, the page will be turned to whichever has the currently active word. (Of course this probably isn't actively done on every page turn or second of sound, but whenever the user next opens the other media.)
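A minimal sketch of "move the other" could be a pair of lookups over sorted sync points. It assumes alignment has already produced `(ebook_word_index, audio_seconds)` pairs that increase together; the sample points and the names `audio_for_word` and `word_for_audio` are made up for illustration.

```python
from bisect import bisect_right

def audio_for_word(points, word_index):
    """Audio position of the last sync point at or before `word_index`."""
    i = bisect_right([word for word, _ in points], word_index) - 1
    return points[max(i, 0)][1]

def word_for_audio(points, seconds):
    """Ebook word index of the last sync point at or before `seconds`."""
    i = bisect_right([time for _, time in points], seconds) - 1
    return points[max(i, 0)][0]

# Hypothetical sync points: (ebook word index, audio seconds).
points = [(0, 0.0), (120, 45.0), (260, 98.5), (410, 150.0)]

print(audio_for_word(points, 300))   # user turned to a page starting at word 300
print(word_for_audio(points, 50.0))  # user stopped listening at 50 s
```

The reader would then open the page containing the returned word index, and the player would seek to the returned time, whenever the user next opens the other medium.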
@smoores-dev commented on GitHub (Feb 10, 2026):
Yup, you're describing (in broad strokes) the Storyteller alignment algorithm. It works pretty well! I wouldn't exactly say it was "not too challenging," but it's certainly possible (and there are certainly opportunities for improvement).
If it's something that ABS devs are actually interested in, we would be happy to collaborate on splitting out the Storyteller aligner into a standalone command line tool (I've always wanted to do this, anyway), and adding output options other than EPUB3, in case ABS wanted to just store the mapping between text and audio positions and use that for converting between the two.
@NikoKS commented on GitHub (Mar 27, 2026):
@J-Lich, If sync between ABS audiobook and Koreader ebook is possible, what's stopping you from syncing ABS audiobook and ABS ebook?