Creating audiobooks with Azure AI Speech
Last updated on January 20, 2025
Storytime: In August 2024, four friends of mine and I started playing the Dungeons and Dragons: Descent into Avernus campaign. Three of us had previously played through Baldur’s Gate 3 together and now wanted to experience the D&D story that was the prequel to the best video game of all time.
If you are not familiar with Dungeons and Dragons, it is a fantasy tabletop role-playing game developed and published by Wizards of the Coast. I’ve also heard it described as a cooperative storytelling game, which I think is quite fitting. The Dungeon Master reads through the available campaign material (book), sets the scene, and describes the situations the players encounter. Each player creates their own character using the D&D rulebook and their imagination, and then, during the game, describes what their character does in each situation. Even though the published campaign material is always the same, each game ends up vastly different because the players, the characters, and the decisions they make (limited only by the players’ imagination) are different. And the dice can either roll in your favor or not.
For my character, I selected the sage background, which comes with a letter from a dead colleague posing a question you have not yet been able to answer as part of the starting equipment. Our dungeon master asked me to think about what that letter contained, which resulted in me writing a ten-page Word document of my character’s background. Finding such fiction writing to be quite fun, I started taking notes during our game sessions and writing fantasy-novel-style “recaps” based on those notes.
So far, we’ve had twelve 3-4 hour sessions, each resulting in a 10-20 page Word document. While I personally have no problem reading such lengthy stories, I’ve noticed that not all of our players find time for such a task. Thus, I figured I’d convert the writings into audio format so those players could instead listen to a recap of the previous events before the next session. And so, I started looking into the Azure AI Speech service.
Azure AI Speech and SSML files
There are two ways to convert text into speech using the Azure AI Speech service. In both cases, the text inputs must be either plain text or Speech Synthesis Markup Language (SSML). I highly recommend the latter for audiobooks because it allows you to adjust the speech rhythm, among other things.
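For reference, a minimal SSML document for the service looks roughly like this; the voice name is just one example from the Voice Gallery:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AvaMultilingualNeural">
    The party descended the stairs into the darkness below.
  </voice>
</speak>
```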
The first thing you need to decide is the voice. There are two types of voices: neural and neural HD. According to Microsoft documentation, “The HD voices provide a higher quality for more versatile scenarios.”
For selecting the voice, you can browse the Voice Gallery and listen to the different voices. You can find the SpeechSynthesisVoiceName on the Code tab. You can also browse the Microsoft documentation for all voice codes and HD voice codes.
Another variable you can adjust is the temperature. It is a float value ranging from 0 to 1, and it influences the randomness of the output. Lower temperature results in less randomness, leading to more predictable results, while higher temperature increases randomness, allowing for more diversity and versatility. The default temperature is set at 1.0 (max randomness).
You can also make adjustments to how the input text is spoken by adding SSML elements to the content. Personally, I found the default silence between paragraphs to be too short for my taste, which is why I added an additional break after each paragraph. Note that not all SSML elements are compatible with HD voices, so make sure to check the list of supported elements.
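As a sketch of that approach, the helper below (my own naming, not from the article’s code) wraps plain-text paragraphs in an SSML document and appends a break element after each one. The 750 ms pause is an arbitrary example value, not the one used in the article:

```python
from xml.sax.saxutils import escape


def paragraphs_to_ssml(text: str, voice: str, pause_ms: int = 750) -> str:
    """Wrap plain-text paragraphs in SSML, adding a pause after each paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "".join(
        f'<p>{escape(p)}</p><break time="{pause_ms}ms"/>' for p in paragraphs
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )
```

Escaping the paragraph text matters here: characters like `&` and `<` in the story would otherwise produce invalid SSML that the service rejects.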
Speech Synthesizer is for real-time conversion
At first, I looked at the Microsoft.CognitiveServices.Speech package and its SpeechSynthesizer class. Using it is pretty straightforward.
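The article refers to the .NET package, but the Speech SDK exposes the same SpeechSynthesizer class across languages. Here is a minimal sketch using the Python flavor (pip install azure-cognitiveservices-speech); the SPEECH_KEY/SPEECH_REGION environment variable names and the helper name are my own choices, not from the article:

```python
import os


def synthesize_to_file(ssml: str, out_path: str) -> None:
    """Synthesize one SSML document straight to a WAV file instead of the speakers."""
    # Imported lazily so the SDK is only required when actually synthesizing.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["SPEECH_KEY"],
        region=os.environ["SPEECH_REGION"],
    )
    # Writing to a file; omitting audio_config plays the result on the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )
    result = synthesizer.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
```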
However, using the SpeechSynthesizer posed two problems: the conversion is limited to 10 minutes of audio, and by default the result is played through the speakers (you can redirect the output to a file or a stream by passing a different AudioConfig, but the 10-minute limit remains).
Based on the above two points, it is quite clear that the SpeechSynthesizer is meant for real-time text-to-speech conversion, not for creating long audiobook-style files for later consumption. Of course, you could split the text into smaller chunks, call SpeakSsmlAsync multiple times, and then concatenate the resulting audio files. However, there is another service at our disposal, one that is meant to create a single, long audio file and does not play the conversion result out loud.
Use the asynchronous batch synthesis API
The asynchronous Batch synthesis API can be used to create audio files that are longer than 10 minutes, and it is well suited for generating long, audiobook-like files. The delay introduced by the asynchronous processing is not a problem because we do not need to play the results immediately. As an added bonus, we don’t have to listen to the generated speech through the computer speakers while the conversion is happening, or even keep our computer powered on.
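For orientation, a batch synthesis job is created by sending a single JSON document to the service’s batch synthesis endpoint. A rough sketch of the request body is shown below; the field names follow the batch synthesis REST reference as I understand it, and the output format value is just an example, so double-check against the current documentation:

```json
{
  "inputKind": "Ssml",
  "inputs": [
    { "content": "<speak version=\"1.0\" xml:lang=\"en-US\">...</speak>" }
  ],
  "properties": {
    "outputFormat": "audio-24khz-160kbitrate-mono-mp3"
  }
}
```

The service responds with a job resource whose status you poll until it reaches Succeeded or Failed, after which a download URL for the audio becomes available.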
I was super excited about this API at first, but then it repeatedly failed when I tried to synthesize “large” (50 KB) SSML files into audio. The annoying thing about these failures was that the API accepted the request with a success status code, and the synthesis status was shown as Running until, four hours later, it turned into the Failed status. There was no explanation available as to why the conversion had failed.
I started investigating if there was an issue with some part of the SSML file. I split the contents into multiple smaller SSML files and sent each of them to the service. All of them were synthesized successfully.
Then, I began to test whether the issue was the content length, and this turned out to be the problem. Based on my tests, there seems to be a limit somewhere around 25,000-30,000 characters (including the SSML elements). Thus, I still needed to split the content into multiple inputs, run the synthesis in batches, and finally concatenate the resulting audio files. There is an upside to splitting the content into smaller chunks, though: it allows the syntheses to run in parallel, reducing the total conversion time.
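The batching step can be sketched as a simple splitter that cuts at paragraph boundaries; the 25,000-character default mirrors the limit observed in my tests rather than any documented number, and the helper name is my own:

```python
def split_into_chunks(text: str, max_chars: int = 25_000) -> list[str]:
    """Split text at blank-line paragraph boundaries so each chunk stays under max_chars."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Assumes a single paragraph always fits within max_chars.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in its own SSML document and submitted as a separate batch job, letting the jobs run in parallel.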
If you are interested in looking at the code or giving the console application a whirl, you can find the project under my GitHub profile in the AzureAI.SpeechConcat repo.
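For completeness, the final concatenation step needs nothing beyond the Python standard library, assuming the chunks come back as WAV files sharing the same sample rate, sample width, and channel count (the real implementation lives in the repo mentioned above; this helper is my own sketch):

```python
import wave


def concat_wavs(input_paths: list[str], output_path: str) -> None:
    """Concatenate WAV files that share the same audio format into one file."""
    with wave.open(output_path, "wb") as out:
        for i, path in enumerate(input_paths):
            with wave.open(path, "rb") as part:
                if i == 0:
                    # Copy the format parameters from the first chunk.
                    out.setparams(part.getparams())
                out.writeframes(part.readframes(part.getnframes()))
```

If you request a compressed output format such as MP3 instead, you would need an external tool like ffmpeg for this step.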
The verdict
One thing we have not yet talked about is the synthesis results. Am I satisfied with the quality of how the AI narrates the story? Unfortunately, I am not. Hopefully, the results will improve in the future. Good news for voice actors, though: they won’t be out of a job just yet!
I will still convert the stories into audio format for my fellow players, but for the moment, I highly recommend that they read the original text instead. The feel of the story is just much better that way. Still, I’ll continue experimenting with these text-to-speech features. After all, we only just reached level 5, and the adventure will apparently take us to level 13, so we’ll most likely be playing the campaign for another 1-2 years. The Azure AI Speech service has plenty of opportunities to improve during that time.
Laura