[enhancement] sync ebooks and audiobooks via processing audiobook to text (Pie in the sky idea) #99

Open
opened 2026-04-24 22:58:24 +02:00 by adam · 36 comments

Originally created by @zombiehoffa on GitHub (Nov 17, 2021).

Once ebook support is a lot more mature, it would be awesome to be able to identify when an ebook and an audiobook are the same book and automagically run speech-to-text on the audiobook so that the audiobook and the ebook can be kept in sync.

adam added the enhancement and ebooks labels 2026-04-24 22:58:24 +02:00

@gelsas commented on GitHub (Dec 30, 2021):

So basically a self-made version of Amazon's Whispersync feature.
That would be a game changer!

@jrhbcn commented on GitHub (Apr 7, 2022):

I cannot give more +1 to this. For me it would be *the* killer feature of audiobookshelf as soon as the ebook reader is more mature.

As a reference, these libraries might help with implementing this: [afaligner](https://github.com/r4victor/afaligner) and [aeneas](https://github.com/readbeyond/aeneas).

@DDriggs00 commented on GitHub (Oct 11, 2022):

While I agree that this would be an incredible feature, it is definitely a very long-term goal, and would require an incredible amount of work.

@andrewls commented on GitHub (Dec 19, 2022):

[This project](https://github.com/r4victor/syncabook) also seems relevant. I haven't tried it out yet but I've been meaning to. I'll report back on what I find if I do end up trying it out in the next couple of months.

A huge issue with this feature is going to be incorporating support for a reading experience of some kind. For that we could probably look at porting EPUB3 Media Overlay functionality out from [minstrel](https://github.com/readbeyond/minstrel/), but all of that code is pretty dated and therefore likely not in the best of shape. It also locks you into requiring users to create an EPUB3 file with a media overlay instead of any other possible format we might choose.

I've definitely looked at implementing something like this in the past and then didn't keep up on it because I didn't have anywhere near enough free time to dedicate to something of this scale. I agree though, this would be an absolutely incredible feature.

@zombiehoffa commented on GitHub (Dec 20, 2022):

andrewls, wow, that makes this seem a lot more possible than the pie in the sky idea I thought it was.

@pbozzay commented on GitHub (Feb 10, 2023):

+1, this would be **the** killer feature

@donkevlar commented on GitHub (Mar 1, 2023):

Would love to see this as well!

@jonasrk commented on GitHub (Oct 21, 2023):

Just found out about audiobookshelf googling for "Whispersync for Voice open source alternatives". Would be so cool to make this happen somehow.

@sphars commented on GitHub (Dec 24, 2023):

Came across [this on Hacker News](https://news.ycombinator.com/item?id=38747710) this morning. I wonder if it's something that could be integrated, or whether ABS could use the epubs that it creates?

From [their docs](https://smoores.gitlab.io/storyteller/docs/how-it-works/the-algorithm): It's a self-hosted platform for taking an audiobook (either as an m4b/mp4 file, or as a zip of mp3 files) and an ebook (as an epub file) and producing a new epub file with synced narration support. This follows the [media overlay spec for epubs](https://www.w3.org/TR/epub-33/#sec-media-overlays).

@FreedomBen commented on GitHub (Dec 24, 2023):

I've been experimenting locally with using whisper.cpp to make transcripts of my audiobooks. The reason transcripts rather than just an epub version is that it includes timestamps, which can be easily used to:

  1. Display "subtitles" while playing the book. This is actually even cooler than I thought it would be. Right now my prototype is hacked together with VLC player, but I eventually plan to open a PR for the web and mobile players so they can display "subtitles" if they exist for the book (and if the feature is enabled). With whisper it's possible to have ABS run a periodic job to auto-generate these transcript files for books where they don't yet exist. It will need to be disabled by default because it uses a ton of CPU, but IMHO it would be a super awesome feature.
  2. Easily find the written text based on a timestamp. I often find myself wanting to look up quotes and things that I heard and want to preserve for later.

I suspect it wouldn't be terribly hard to build a "whispersync" type of thing on top of this (once it exists of course).

If somebody wants to implement this sooner than I have availability, I'm happy to yield it. Let me know and I'll try to knowledge dump what I have. Also happy to brainstorm the idea. I'm @freedomben in the Matrix chat
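
A minimal sketch of the timestamp lookup described in point 2 above, assuming the transcript is stored as a list of segments sorted by start time. The segment shape (`start`, `end`, `text`) and the `segmentAt` helper are hypothetical; they are not whisper.cpp's or ABS's actual format.

```javascript
// Hypothetical transcript shape: segments sorted by start time, times in seconds.
const transcript = [
  { start: 0.0, end: 4.2, text: 'Chapter one.' },
  { start: 4.2, end: 9.8, text: 'The rain stopped at noon.' },
  { start: 9.8, end: 15.1, text: 'She opened the door slowly.' }
]

// Binary search for the segment whose [start, end) range contains `time`.
function segmentAt(segments, time) {
  let lo = 0
  let hi = segments.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    const seg = segments[mid]
    if (time < seg.start) hi = mid - 1
    else if (time >= seg.end) lo = mid + 1
    else return seg
  }
  return null // time falls in a gap or outside the transcript
}

const current = segmentAt(transcript, 5)
console.log(current ? current.text : '(no text at this position)')
```

The same lookup could drive both the "subtitles" display (call it on every player time update) and the quote lookup (call it once with a bookmarked time).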

@smoores-dev commented on GitHub (Dec 25, 2023):

> The reason transcripts rather than just an epub version is that it includes timestamps

This is actually how Media Overlays work, as well (I'm the author of Storyteller, the project that @sphars linked to). A Media Overlay is just an XML file that maps XHTML elements to segments of audio files. The Storyteller reader apps can (and do!), for example, highlight the current sentence while it's being read:

![](https://is1-ssl.mzstatic.com/image/thumb/PurpleSource116/v4/ec/1b/8c/ec1b8c29-0b19-43e7-f9c2-6bfe76c2096f/0a2518b9-ce46-4242-a3dc-d4d6760a90bb_IMG_2030.png/300x0w.webp)

And they could also allow you to find the written text based on the timestamp (that's essentially the premise that the Storyteller reader apps are predicated on)! For any given timestamp, you can always find the location in the EPUB text that corresponds to it.

@gelsas commented on GitHub (Dec 27, 2023):

Is it also possible to fine-tune the highlighting even more? I think Amazon Whispersync highlights the text word by word. I am so used to that by now, so I wondered if it would be possible to do that as well with Storyteller.

@smoores-dev commented on GitHub (Dec 28, 2023):

It's possible! Storyteller has word-level timestamps available, but its reliance on fuzzy search for alignment (to account for inaccuracies in the transcription) might make word-level highlights challenging to get right.

If it's a feature you're interested in, feel free to make an Issue on the Storyteller project! It's on GitLab ([gitlab.com/smoores/storyteller](https://gitlab.com/smoores/storyteller)), but there's a [mirror on GitHub](https://github.com/smoores-dev/storyteller) if you don't have a GitLab account; I'll copy any Issues created there over to GitLab.

@mr-ransel commented on GitHub (Dec 29, 2023):

I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST.

Essentially the flow would look like:

  1. User "pairs" an ebook and audiobook in ABS
  2. ABS reaches out to ST over the API, and triggers the generation of an updated epub file, sending the user-defined chapter demarcations as well
  3. ST parses the audiobook tracks, preferably by filesystem reference instead of a wasteful upload, uses the chapter times to assist the algorithm, and generates new marked up epubs
  4. The new epub gets synced back to ABS via either the API or just a filesystem write replacing/adding a duplicate of the existing epubs, but now with the marked up files

An extension would be to handle conversion of non epubs to epub transparently as well for convenience.

Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly and auto-pair all the audio and ebooks so no manual pairing is necessary.
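
A rough sketch of what the auto-pairing step could look like, assuming both libraries expose a title and author per item. All field names and helpers here are hypothetical, and the match is an exact comparison of normalized keys, a conservative stand-in for real fuzzy matching. Anything it cannot pair would fall back to manual pairing.

```javascript
// Hypothetical library entries; field names are illustrative only.
const audiobooks = [
  { id: 'a1', title: 'The Hobbit (Unabridged)', author: 'J.R.R. Tolkien' },
  { id: 'a2', title: 'Dune', author: 'Frank Herbert' }
]
const ebooks = [
  { id: 'e1', title: 'Dune', author: 'Frank Herbert' },
  { id: 'e2', title: 'The Hobbit', author: 'J. R. R. Tolkien' },
  { id: 'e3', title: 'Emma', author: 'Jane Austen' }
]

// Lowercase, drop bracketed qualifiers such as "(Unabridged)", then strip
// everything except ASCII letters and digits. Non-ASCII titles would need
// a more careful normalization than this.
function normalize(s) {
  return s
    .toLowerCase()
    .replace(/\([^)]*\)|\[[^\]]*\]/g, '')
    .replace(/[^a-z0-9]/g, '')
}

// Pair each audiobook with the ebook that has the same normalized title and author.
function pairLibrary(audiobookItems, ebookItems) {
  const byKey = new Map()
  for (const e of ebookItems) {
    byKey.set(normalize(e.title) + '|' + normalize(e.author), e)
  }
  const pairs = []
  for (const a of audiobookItems) {
    const match = byKey.get(normalize(a.title) + '|' + normalize(a.author))
    if (match) pairs.push({ audiobookId: a.id, ebookId: match.id })
  }
  return pairs
}

const pairs = pairLibrary(audiobooks, ebooks)
console.log(pairs)
```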

@smoores-dev commented on GitHub (Dec 29, 2023):

That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

@MxMarx commented on GitHub (Feb 12, 2024):

I was playing around with Storyteller, and it looks amazing for this! Media overlays don't look super easy to access with epub.js, although there's a [pull request](https://github.com/futurepress/epub.js/pull/1284) for that, but something like this snippet, inserted [here](https://github.com/advplyr/audiobookshelf/blob/ce7f81d67679bb9f86f39be9de7f28212dfa511c/client/components/readers/EpubReader.vue#L383), can extract the timestamp-to-CFI mappings from the epubs output by Storyteller:

```
var manifestItem = this.book.packaging.manifest[item.idref]
var overlay = manifestItem && this.book.packaging.manifest[manifestItem.overlay]

if (overlay) {
  const href = resolveURL(overlay.href, basePath)
  // Arrow functions keep `this` bound to the reader component inside the callbacks.
  this.book.load(href).then((overlayXml) => {
    var doc = new DOMParser().parseFromString(overlayXml, 'text/xml')

    doc.querySelectorAll('par').forEach((par) => {
      var audio = par.getElementsByTagName('audio')[0]
      // Assumes each <par> id matches the id of the text element it refers to.
      var element = item.document.getElementById(par.getAttribute('id'))
      if (!audio || !element) return

      this.audioMapping.push({
        cfi: item.cfiFromElement(element),
        // parseFloat handles values such as "12.5s" but not clock values like "0:00:12.5".
        clipBegin: parseFloat(audio.getAttribute('clipBegin')),
        clipEnd: parseFloat(audio.getAttribute('clipEnd'))
      })
    })
  })
}
```

Since the current epub reader needs the whole epub to be sent to the client, it might be a good idea either to use the original epub, since the marked-up epub includes embedded audio files, or to strip the audio files from the Storyteller output.

If using the existing audio files instead of embedding them, another consideration is that the timestamps generated by Storyteller are relative to the audiobook chapters instead of the whole audio. If going down that path, I'm not sure if it would make more sense to modify Storyteller to include some metadata to map the chapter offsets back to the original file, or have audiobookshelf do some post-processing after running Storyteller.
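
A small sketch of the post-processing option mentioned above: converting chapter-relative clip times into positions in the whole audiobook by adding each chapter's start offset. The `chapterStarts` table and the clip shape are assumptions for illustration, not Storyteller's or audiobookshelf's actual data model.

```javascript
// Hypothetical chapter table: where each chapter starts in the full audiobook, in seconds.
const chapterStarts = [0, 1805.5, 3920.0]

// Clips as they might be read from per-chapter overlays (times relative to the chapter).
const clips = [
  { chapter: 0, clipBegin: 12.0, clipEnd: 15.5 },
  { chapter: 1, clipBegin: 0.0, clipEnd: 3.25 },
  { chapter: 2, clipBegin: 100.0, clipEnd: 104.0 }
]

// Shift a chapter-relative clip by its chapter's start offset.
function toAbsolute(clip, starts) {
  const offset = starts[clip.chapter]
  if (offset === undefined) throw new RangeError(`unknown chapter ${clip.chapter}`)
  return { begin: offset + clip.clipBegin, end: offset + clip.clipEnd }
}

const absolute = clips.map((c) => toAbsolute(c, chapterStarts))
console.log(absolute)
```

Whether the offsets come from metadata written by Storyteller or from the chapter list audiobookshelf already stores is the open question raised above; the arithmetic is the same either way.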

@stassinari commented on GitHub (Mar 12, 2024):

With the latest iOS 17.4 update, Apple introduced a [new transcript feature](https://www.apple.com/newsroom/2024/03/apple-introduces-transcripts-for-apple-podcasts/) which is useful and quite intuitive.

I know it's not exactly what this issue is about, but there might be interesting ideas there, especially in terms of UX.

@sevenlayercookie commented on GitHub (Mar 30, 2024):

> That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start.

Have you experimented with live transcription using Whisper? As in, using Whisper to transcribe what is currently being played and "buffering" 30 seconds ahead or so. Even using CPU alone, it sounds like faster-whisper can easily outpace an audiobook playing at original speed (1x). It would essentially be Immersive Reading (and would localize to the individual word as well, rather than just the whole sentence). And I suppose this transcription could be cached for future use and fed into the fuzzy search to attempt to sync with an ebook as well.

Basically an on-demand, live transcription version of Storyteller, cutting out the need for pre-processing.

@Astorsoft commented on GitHub (May 9, 2024):

This idea would be amazing, and outsourcing the sync to a dedicated tool like Storyteller is a great idea. If you want to go down the route of an internal service, however, I've already mentioned this on Storyteller's project, but I think https://github.com/echogarden-project/echogarden is an amazing backend for speech-to-transcript alignment that works with many more languages than English. I did some tests on Swedish and the results were very conclusive. Based on their docs, it can go down to word-level alignment with great accuracy.

Audiobook/epub alignment is always better than TTS, as the narrator often makes a great effort to change their tone of voice for each character and does a good job of expressing the characters' feelings. Maybe one day whisper will reach this stage but we're not there yet.

Lastly, good luck on the player part. It's a nightmare to find a good epub reader with media overlay support, at least on Android. Some don't work with specific file formats (like Ogg Vorbis), and some add a weird delay in the playback, making you think the alignment is off while it is in fact perfect when checked on other platforms like Windows.

@Bothari commented on GitHub (May 25, 2024):

I have written a local system which transcribes an audiobook to text, converts an epub to text, and then performs matching on the two pieces of text to match timestamps in the audiobook to a "percentage" in the epub text. I do not have an understanding of an accurate way to reference a location in the epub, which is restricting my ability to do anything better than this.

On my server - a pretty low powered NUC - it will perform the matching at approx 15x the speed of the audiobook, meaning a 15 hour audiobook would take around an hour to process. I haven't spent any time trying to optimise this, it's just a first pass.

I see this being an on demand tool that a user could perform on an item, much like the "Embed Metadata" tool which exists for audiobooks.
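
A simplified sketch of the timestamp-to-percentage matching described above, under the assumption that the transcript is split into short segments with start times. It uses exact word-sequence matching, so it is only an illustration: a real transcript contains recognition errors and would need fuzzy matching. All names and data shapes here are hypothetical.

```javascript
// Lowercase and split into words so punctuation differences do not matter.
function words(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || []
}

// Find the first index where the word sequence `needle` occurs in `haystack`.
function indexOfSequence(haystack, needle) {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer
    }
    return i
  }
  return -1
}

// Map a transcript segment to a fractional position (0..1) in the ebook text,
// or null when the segment's words are not found verbatim.
function positionInBook(bookText, segmentText) {
  const book = words(bookText)
  const idx = indexOfSequence(book, words(segmentText))
  return idx === -1 ? null : idx / book.length
}

const bookText = 'The rain stopped at noon. She opened the door, slowly. Nobody was there.'
// Hypothetical transcript segments with start times in seconds.
const segments = [
  { start: 0.0, text: 'the rain stopped at noon' },
  { start: 2.1, text: 'she opened the door slowly' },
  { start: 4.6, text: 'nobody was there' }
]
const mapping = segments.map((s) => ({
  start: s.start,
  position: positionInBook(bookText, s.text)
}))
console.log(mapping)
```

For a more precise location than a percentage, the matched word index could later be translated into an element id or EPUB CFI, which is what the media overlay approach discussed earlier in the thread relies on.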

@megawubs commented on GitHub (Jun 6, 2024):

> I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST.
>
> Essentially the flow would look like:
>
>   1. User "pairs" an ebook and audiobook in ABS
>   2. ABS reaches out to ST over the API, and triggers the generation of an updated epub file, sending the user-defined chapter demarcations as well
>   3. ST parses the audiobook tracks, preferably by filesystem reference instead of a wasteful upload, uses the chapter times to assist the algorithm, and generates new marked up epubs
>   4. The new epub gets synced back to ABS via either the API or just a filesystem write replacing/adding a duplicate of the existing epubs, but now with the marked up files
>
> An extension would be to handle conversion of non epubs to epub transparently as well for convenience.
>
> Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly and auto-pair all the audio and ebooks so no manual pairing is necessary.

Chiming in here because this flow is exactly what I'm looking for.

What are the steps needed to make this work?

@iamhenry commented on GitHub (Jul 12, 2024):

Snipd just released audiobook transcriptions. Would love to see this in ABS.
https://x.com/snipd_app/status/1811024587292864948

@CoryGH commented on GitHub (Nov 3, 2024):

+1

@toonvank commented on GitHub (Jan 16, 2025):

Bump

@megawubs commented on GitHub (Jan 30, 2025):

Isn't the first step to an integration to at least sync the progress to ABS from Storyteller? This way, I can pick up where I left off on a different device that doesn't have the Storyteller app. I only use the Storyteller server to align the books, not to manage my books. That's what ABS is for.

@smoores-dev commented on GitHub (Jan 30, 2025):

Yup, this is exactly what we're going to do; add support in Storyteller to sync progress to external services, like KOReader and ABS. If any ABS devs would be interested in helping out with this, I'd love to collaborate! Otherwise I'll get to this as soon as I have a chance

@ScyllPoesis commented on GitHub (Jan 30, 2025):

Have been following the Storyteller progress for a while now but haven't used it much, still awesome work! A native integration with ABS would be incredible and be a perfect resolution to this issue and likely end up making my household incredibly happy (often we switch on-and-off between listening to the audiobook while working, then reading between breaks at much faster speed).

@zombiehoffa commented on GitHub (Jan 31, 2025):

> Have been following the Storyteller progress for a while now but haven't used it much, still awesome work! A native integration with ABS would be incredible and be a perfect resolution to this issue and likely end up making my household incredibly happy (often we switch on-and-off between listening to the audiobook while working, then reading between breaks at much faster speed).

You read faster than an audiobook can play at 2-2.5x?

@weckere commented on GitHub (Feb 24, 2025):

> > Have been following the Storyteller progress for a while now but haven't used it much, still awesome work! A native integration with ABS would be incredible and be a perfect resolution to this issue and likely end up making my household incredibly happy (often we switch on-and-off between listening to the audiobook while working, then reading between breaks at much faster speed).
>
> You read faster than an audiobook can play at 2-2.5x?

I don't know about ScyllPoesis, but I certainly do! Even if I didn't, anything faster than 1.25x is really too fast for me to process as an audio stream. I prefer to read visually as I can go much, much faster. Playing audio at the same time is only a garbled distraction at my normal reading speed.

When reading a story, I typically will scan back and forth over the text as I read. If I misunderstand something, scanning the text again is nearly instantaneous. With an audiobook, trying to skip back to hear a missed word, especially at high-speed, is a significant interruption in my flow.

I used to refuse to listen to audiobooks, but my life has become busier, and if I want to continue enjoying books, I need to take advantage of time when I'm already using my hands or eyes for another activity. I used to buy paper (because I find paper easier to read than a screen), ebook, and audiobook copies of the same title so that I could do this, but it was very expensive, and even more importantly, time-consuming to sync my progress between mediums.

Creating epub3 files with media overlays using Storyteller has basically solved this problem for me. I still maintain a library in ABS, but I find myself using only the Storyteller app to read or listen to my books now because it syncs my progress whether I choose to listen or read.

I've also discovered that using a read-along feature helps me stay focused when reading very dry material like textbooks or manuals. I used to copy text out of a PDF and have Siri read it to me on my mac to accomplish this. Although this is something I don't do often, for me and other folks with ADHD or information processing difficulties, this can be extremely helpful. Storyteller does this one better by highlighting sentences as it reads, taking the onus off me to scroll along, and making it much easier to find my place if I get lost.

If ABS were able to implement or hook into this functionality, that would be a massive benefit to me, as every other aspect of ABS is unparalleled in its functionality, polish, and ease of use. Absolutely fantastic, this app is best-in-class.


@sevenlayercookie commented on GitHub (Feb 25, 2025):

advplyr has said a priority feature that is in the works is a plugin system -- could be a good avenue for integrating Storyteller. Not sure how far along the plugin system is.


@JudasSleeze commented on GitHub (Apr 21, 2025):

I have been using [ebook2audiobook](https://github.com/DrewThomasson/ebook2audiobook) for making audiobooks out of my epub books and using [Storyteller](https://storyteller-platform.gitlab.io/storyteller/) to sync them. A plugin system would be great so we can have these all in one amazing useful place. Thank you for this consideration.

@vindex10 commented on GitHub (May 25, 2025):

Hi! I was investigating the topic to find a simple workaround with currently existing tools. I just want to share my findings :)

The following combination seems to be viable:

  • using ABS as a streaming server for audiobooks
  • using Koreader for eink devices (as hackable reader with strong open source community)
  • using one of several solutions for koreader syncing "kosync", see below for the options - it is a server that stores the state of progress for each book of the user.
  • the only missing part is getting the "streaming progress" stored in ABS aligned with the state maintained by kosync. For this, some plugin for ABS, or a utility running regularly to do the sync between ABS and kosync, could be used. (This might use Storyteller annotations to sync duration with the place in the text.)
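
As a rough illustration of what such a periodic sync utility would have to decide, here is a minimal sketch. It makes no network calls and assumes nothing about the real ABS or kosync APIs: the `Progress` record, the `reconcile` function and the `min_delta` threshold are all hypothetical names made up for this example. It also compares raw book fractions, which is only a crude stand-in for a proper audio-to-text alignment.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    """Hypothetical progress record as a sync utility might hold it."""
    fraction: float     # position in the book, 0.0 to 1.0
    updated_at: float   # Unix timestamp of the last update


def reconcile(abs_progress, kosync_progress, min_delta=0.005):
    """Return (side_to_update, new_fraction), or None if both sides agree.

    The most recently updated side wins; differences smaller than
    min_delta are ignored so the two services do not ping-pong.
    """
    if abs(abs_progress.fraction - kosync_progress.fraction) < min_delta:
        return None
    if abs_progress.updated_at >= kosync_progress.updated_at:
        return ("kosync", abs_progress.fraction)
    return ("abs", kosync_progress.fraction)
```

The "newest update wins" rule is just one possible policy; a real tool would probably also want to guard against a stale device overwriting newer progress.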

Kosync alternatives:

  • https://github.com/koreader/koreader-sync-server - lua
  • https://github.com/jberlyn/kosync-dotnet - dotnet
  • https://github.com/lzyor/kosync - rust
  • some open servers running kosync are discussed in this thread: https://github.com/koreader/koreader/issues/11154

@J-Lich commented on GitHub (Nov 25, 2025):

Hi All - I built a bridge between KoSync and ABS. Feel free to check it out - I have been testing it for a few weeks. Code needs work but functions just fine for now (I'm sure edge cases will arise!). ~~Will try to get a docker hub image up soon.~~

Edit: [Dockerhub](https://hub.docker.com/repository/docker/00jlich/abs-kosync-bridge/general)

[ABS-KoSync-Bridge](https://github.com/J-Lich/abs-kosync-bridge)

@kevincox commented on GitHub (Feb 10, 2026):

I've been thinking about this and I don't think it is too technically challenging to implement. It would require quite a bit of work to glue the pieces together but I think it wouldn't require any new research or clever algorithms to do so. Unfortunately I am unlikely to have time to do this soon so I figured I would write my thoughts down to hopefully make the problem appear more approachable to anyone who is able to implement it.

Fundamentally I have realized that this problem is basically a diff algorithm (like a text diff). So if we make the following assumptions:

  1. We have a transcript of the audio with timestamps.
  2. We have the text of the book.
  3. These two don't match, but are substantially similar.

Then you can do a process like this:

  1. Run a word diff against the two blocks of text.
  2. Things that are present in both are perfect matches. Each word in the left and right match one-to-one.
  3. We can mostly ignore the chunks present in only the left or right. (more thought later).

That's it. Now you can link most of the words in the audiobook to the words in the ebook. There will be a lot of small mismatches due to slight differences in the text as well as transcription errors, but that doesn't matter much. While getting every word absolutely perfect would be nice, this feature would be very valuable even if you only have a few synchronization points per paragraph (or even page). In practice I suspect it would be much better than that. I suspect you can get well under sentence-level granularity.
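
The diff step described above can be tried out in a few lines. This is only a sketch, assuming Python's standard `difflib.SequenceMatcher` as the word-diff engine (the comment does not name a specific tool for this step, and this is not the Storyteller implementation); `align_words` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def align_words(audio_words, ebook_words):
    """Return (kind, audio_range, ebook_range) chunks.

    Each range is a (start, end) pair of word indices, end exclusive.
    "match" chunks map one-to-one; everything else is a "non-match".
    """
    # autojunk=False stops difflib from discarding very frequent words
    # ("the", "and", ...) on long inputs.
    matcher = SequenceMatcher(None, audio_words, ebook_words, autojunk=False)
    chunks = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        kind = "match" if tag == "equal" else "non-match"
        chunks.append((kind, (a0, a1), (b0, b1)))
    return chunks


audio = "To be or pot to me that is question".split()
ebook = "To be or not to be that is the question".split()
chunks = align_words(audio, ebook)
```

Because this diffs word by word, it may split a mismatch into smaller pieces than a line-based diff would, so the chunk boundaries depend on how the text is tokenized.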

For the non-perfect matches you can decide what you want to do. You can do some basic guesses like just linearly map the ebook words across the time gap in the audio book. But I suspect in most cases it would be best to group these into a gap and just ignore it.
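
For what the "linearly map" guess could look like, here is a minimal sketch. `interpolate_gap` is a hypothetical helper, and it assumes the gap's start and end times (in seconds) are already known from the matched words on either side.

```python
def interpolate_gap(gap_start, gap_end, ebook_word_count):
    """Spread ebook_word_count words evenly over the audio gap.

    Returns one estimated start time (seconds) per ebook word.
    """
    if ebook_word_count <= 0:
        return []
    step = (gap_end - gap_start) / ebook_word_count
    return [gap_start + i * step for i in range(ebook_word_count)]
```

The estimates are only as good as the assumption that the narrator speaks at a constant pace inside the gap, which is one reason treating the gap as a single unknown range may be the safer choice.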

To avoid small errors it would probably be beneficial to do some preprocessing on both sides like stemming, case conversion, spelling normalization (colour vs color). But even without this I suspect the result would be sufficient.
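
A possible shape for that preprocessing, as a sketch only: `normalize` and the tiny `SPELLING` table are made up for illustration, and stemming is left out because it would need an extra library.

```python
import re
import unicodedata

# Tiny illustrative table; a real one would be much larger.
SPELLING = {"colour": "color", "grey": "gray"}


def normalize(word):
    """Case-fold, strip punctuation and apply spelling normalization."""
    word = unicodedata.normalize("NFKC", word).casefold()
    word = re.sub(r"[^\w']+", "", word)   # keep letters, digits, apostrophes
    return SPELLING.get(word, word)
```

The normalized words would only be used for diffing; the original words (and their positions) stay untouched for display.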

Here is a short example of two chunks of text. I diffed them with `diff -U999 /tmp/{audiobook,ebook}`

```patch
--- /tmp/audiobook
+++ /tmp/ebook
@@ -1,7 +1,8 @@
 To
 be
 or
-pot to me
+not to be
 that
 is
+the
 question
```

You then just group up the bunches of - and + sections into "Non-Match" groups.

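| Audiobook | Ebook | Chunk Type |
|-----------|-------|------------|
| To | To | Match |
| be | be | Match |
| or | or | Match |
| pot to me | not to be | Non-Match |
| that | that | Match |
| is | is | Match |
| | the | Non-Match |
| question | question | Match |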

So most words would map 1-1, but you have some brief mismatches for example a missed word in the transcription or slight errors. This group would still have a time range which matches a block of words, however the exact timing within would be unknown. You may also have cases where the sentences were reordered or something and those would just show up as a block of unknown as well. I think the key point is to treat each "section" as a small range. So you can be inside the "pot to me" range and you know that that range of audio maps to the "not to be" range of text in the ebook. Then you can seek back and forth to make the ranges overlap in both.
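
To make the "treat each section as a small range" idea concrete, here is a sketch of the lookup in both directions. The timestamps in `CHUNKS` are invented for the example, and the tuple layout and function names are assumptions, not an existing format.

```python
from bisect import bisect_right

# (audio_start_s, audio_end_s, ebook_first_word, ebook_end_word)
# Hypothetical timings for the example sentence; end indices are exclusive.
CHUNKS = [
    (0.0, 0.9, 0, 3),    # "To be or"                     match
    (0.9, 1.8, 3, 6),    # "pot to me" -> "not to be"     non-match
    (1.8, 2.6, 6, 9),    # "that is", plus the unspoken "the"
    (2.6, 3.1, 9, 10),   # "question"                     match
]


def ebook_range_at(seconds, chunks=CHUNKS):
    """Return the ebook word range that covers an audio position."""
    starts = [c[0] for c in chunks]
    i = max(bisect_right(starts, seconds) - 1, 0)
    return chunks[i][2], chunks[i][3]


def audio_range_for(word_index, chunks=CHUNKS):
    """Return the audio time range that covers an ebook word, if any."""
    for start_s, end_s, first, end in chunks:
        if first <= word_index < end:
            return start_s, end_s
    return None
```

Inside a non-match chunk the lookup still lands on the right few words; it just cannot say which word within the range is being spoken.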

## Performance

Unfortunately many diff algorithms are O(NM), where N and M are the sizes of the left and right sides of the diff. In this case those are roughly the same, so O(n^2). I suspect this isn't much of an issue in practice: diff algorithms have gotten very good at diffing large files, and since the two sides are pretty similar I think the heuristics will do a good job of finding the common text quickly. This should be pretty easy to evaluate by dumping a book and its transcription to files and running the command-line diff tool on them. The worst case may be slightly different depending on exactly how similar the texts are, but I think we should be able to get a ballpark fairly easily. Since this runs just once on the server at import time I'm not too worried.

This could also be made a lot better if we choose to trust chapters. In that case we would only have to run the diff on a fraction of the book at a time and that O(n^2) worst case shrinks dramatically.
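
A quick back-of-the-envelope check of how much trusting chapters helps, assuming a hypothetical 100,000-word book split into 40 equal chapters (both numbers are made up, and the transcript of each chapter is assumed to be about as long as its text):

```python
def worst_case_comparisons(chapter_word_counts):
    """Sum of n*n over chapters: the O(n^2) worst case per diff run."""
    return sum(n * n for n in chapter_word_counts)


whole_book = worst_case_comparisons([100_000])      # one diff over everything
by_chapter = worst_case_comparisons([2_500] * 40)   # forty small diffs
speedup = whole_book // by_chapter
```

With k equal chapters the worst case drops from n^2 to n^2/k, so the gain is roughly the number of chapters.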

## UX

I don't want to get too far into UX because I'm not a UX expert. However, I think something very simple such as "whenever the user makes progress in one, move the other" would be sufficient. So if the user turns the page of the ebook, the audiobook would be moved to the first word of the new page (if forward) or the last word of the previous page (if back). Similarly, as the audiobook plays, the page will be turned to whichever one has the currently active word. (Of course this probably isn't actively done on every page turn or second of sound, but whenever the user next opens the other media.)
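
The page-turn rule could be sketched like this. Everything here is hypothetical: `word_times` stands for whatever per-word mapping the alignment produced, with `None` for words that fell inside a non-match gap.

```python
def audio_target_for_page_turn(word_times, page_first, page_end, forward):
    """Pick the audio position (seconds) to seek to after a page turn.

    word_times[i] is the audio time of ebook word i, or None if that
    word is unaligned. The page now on screen covers words
    [page_first, page_end).
    """
    # Forward: first word of the new page. Back: last word of the page.
    index = page_first if forward else page_end - 1
    while index >= 0 and word_times[index] is None:
        index -= 1   # fall back to the nearest earlier aligned word
    return word_times[index] if index >= 0 else 0.0
```

Falling back to an earlier word means the listener may hear a few words again, which seems preferable to skipping ahead.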

@smoores-dev commented on GitHub (Feb 10, 2026):

Yup, you're describing (in broad strokes) the [Storyteller alignment algorithm](https://storyteller-platform.gitlab.io/storyteller/docs/the-algorithm). It works pretty well! I wouldn't exactly say it was "not too challenging," but it's certainly possible (and there are certainly opportunities for improvement).

If it's something that ABS devs are actually interested in, we would be happy to collaborate on splitting out the Storyteller aligner into a standalone command line tool (I've always wanted to do this, anyway), and adding output options other than EPUB3, in case ABS wanted to just store the mapping between text and audio positions and use that for converting between the two.

@NikoKS commented on GitHub (Mar 27, 2026):

> Hi All - I built a bridge between KoSync and ABS. Feel free to check it out - I have been testing it for a few weeks. Code needs work but functions just fine for now (I'm sure edge cases will arise!). ~~Will try to get a docker hub image up soon.~~
>
> Edit: [Dockerhub](https://hub.docker.com/repository/docker/00jlich/abs-kosync-bridge/general)
>
> [ABS-KoSync-Bridge](https://github.com/J-Lich/abs-kosync-bridge)

@J-Lich, if sync between ABS audiobook and Koreader ebook is possible, what's stopping you from syncing ABS audiobook and ABS ebook?
Reference: starred/audiobookshelf#99