## Introduction
The other day, I was wondering whether I could use AI to generate custom audiobooks based on existing documents like PDFs.
While professionally recorded audiobooks already exist for most books, I wanted something more customizable where you could, for example, generate audio from chapter summaries instead of the full chapters.
Also, I wanted the audiobooks to be in the `m4b` format, which comes as a single file but still supports chapters for easier navigation (unlike `wav` or `mp3`).
To achieve this, I had to solve the following sub-problems:
- Extract relevant content from the source text.
- Split the content into chapters.
- Create audio for each chapter.
- Merge all chapters into a single audiobook.
The final project can be found on GitHub: [floscha/gemini-audiobook-generator](https://github.com/floscha/gemini-audiobook-generator)
The goal of this post is not to walk you through the full implementation but rather discuss some things I found easier and some I found more challenging while developing this project.
## What Just Worked
### Gemini API
To use Google’s Gemini AI models from Python code, you need a Gemini API key. Creating this key through Google AI Studio was really straightforward and only required one or two clicks.
Similarly, writing the Python code to call the models was effortless thanks to the excellent developer documentation, which provides ready-made Python snippets for many practical use cases, including the two I needed: processing text from PDF files and converting text to audio. Because of this, I had the core AI logic working within about 5 minutes.
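To give a sense of how little code this needs, here is a minimal sketch of the two calls, loosely based on the snippets from the docs (the model names, voice, and audio parameters below are assumptions and not necessarily what the project uses):

```python
import pathlib
import wave

from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

# 1) Extract and restructure the text of a PDF document.
pdf_bytes = pathlib.Path("book.pdf").read_bytes()
text_response = client.models.generate_content(
    model="gemini-2.5-flash",  # assumed model name
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Split this book into chapters. Keep each heading in a single line.",
    ],
)
chapter_text = text_response.text

# 2) Convert a chapter's text to speech.
audio_response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed TTS model name
    contents=chapter_text,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
pcm = audio_response.candidates[0].content.parts[0].inline_data.data

# The TTS endpoint returns raw 16-bit PCM, so wrap it in a WAV container.
with wave.open("chapter_01.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```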
## Challenges
### Prompting
Writing the prompt to preprocess the source document into customized chapters turned out to be slightly more challenging than I initially thought.
For example, I use a simple heuristic where I take the first line of a chapter as the chapter name. Unfortunately, this does not work for certain formats, like the one below.
```
Chapter 1
Name of the Chapter
```
To make the heuristic work, I’ve added the following prompt:
```
Keep the heading in a single line.
```
This usually does the trick, but due to the probabilistic nature of LLMs, issues can still occur, for example dropping the `Chapter 1` part of the example above.
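For reference, the heuristic itself is only a few lines, roughly like the following sketch (simplified, not the exact project code):

```python
def split_chapter(chapter_text: str) -> tuple[str, str]:
    """Treat the first non-empty line as the chapter title and the rest as the body."""
    lines = chapter_text.strip().splitlines()
    title = lines[0].strip()
    body = "\n".join(lines[1:]).strip()
    return title, body

# With the two-line heading format shown above, only "Chapter 1" would end up
# as the title, which is why the prompt asks to keep headings in a single line.
```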
### Reusing Intermediate Files
While always re-running the whole script from the source document might work for other tools, here it is much more practical to store intermediate files from which the script can recover in case of failure, and which speed up experimentation during development.
Even though the Gemini free tier is really generous, a medium-sized PDF file can quickly result in 100k tokens per request, leading to higher-than-desirable token consumption. Creating the audio files, on the other hand, was not as token-expensive but took multiple minutes, which slowed down development.
The simple solution to this problem was to provide a `keep_intermediate_files` option which can be used during development.
When enabled, intermediate files are kept, and on subsequent runs the script recovers from those files rather than starting from scratch.
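Conceptually, each expensive step is wrapped in a small check, roughly like this sketch (the helper and file layout are hypothetical, for illustration only):

```python
from pathlib import Path
from typing import Callable

def cached_step(path: Path, generate: Callable[[], bytes], keep_intermediate_files: bool) -> bytes:
    """Reuse a stored intermediate file if it exists, otherwise generate and (optionally) store it."""
    if keep_intermediate_files and path.exists():
        return path.read_bytes()
    data = generate()
    if keep_intermediate_files:
        path.write_bytes(data)
    return data

# Example: skip the slow, token-hungry calls on re-runs.
# chapter_audio = cached_step(Path("out/chapter_01.wav"),
#                             lambda: synthesize(chapter),  # hypothetical TTS call
#                             keep_intermediate_files=True)
```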
### Merging Audio Files
While the AI-specific code was implemented rather quickly, writing the code to merge the individual chapters' audio files into a combined `m4b` file turned out to be much more tedious.
This is mainly because no Python library with this functionality exists.
Instead, I had to fall back to using ffmpeg through `subprocess` calls.
Even worse, this approach required generating lots of hacky metadata files through Python, which took me a couple of tries to get right.
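To give an idea of what this looks like, here is a condensed sketch of the approach (file names and chapter durations are hard-coded for illustration; the real script derives them from the generated audio):

```python
import subprocess

chapter_files = ["chapter_01.wav", "chapter_02.wav"]
chapter_titles = ["Introduction", "Chapter 1"]
durations_ms = [180_000, 420_000]  # would normally be read from the audio files

# Concat list for ffmpeg's concat demuxer.
with open("concat.txt", "w") as f:
    for path in chapter_files:
        f.write(f"file '{path}'\n")

# FFMETADATA file describing the chapter markers.
lines = [";FFMETADATA1"]
start = 0
for title, duration in zip(chapter_titles, durations_ms):
    end = start + duration
    lines += ["[CHAPTER]", "TIMEBASE=1/1000", f"START={start}", f"END={end}", f"title={title}"]
    start = end
with open("metadata.txt", "w") as f:
    f.write("\n".join(lines) + "\n")

# Concatenate the chapters, attach the chapter metadata, and encode to AAC in an m4b container.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", "concat.txt",
        "-i", "metadata.txt",
        "-map_metadata", "1", "-map_chapters", "1",
        "-c:a", "aac", "audiobook.m4b",
    ],
    check=True,
)
```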
Finally, I see the lack of a Python `m4b` converter as an opportunity to create one myself, which I might do in the near future.