Creating audiobooks with Azure AI Speech

Last updated on May 4, 2026

Storytime: In August 2024, four friends of mine and I started playing the Dungeons and Dragons: Descent into Avernus campaign. Three of us had previously played through Baldur’s Gate 3 together and now wanted to experience the D&D story that was the prequel to the best video game of all time.

If you are not familiar with Dungeons and Dragons, it is a fantasy tabletop role-playing game developed and published by Wizards of the Coast. I’ve also heard it described as a cooperative storytelling game, which I think is quite fitting. The Dungeon Master (DM) reads through the available campaign material (book), sets the scene, and describes the situations the players encounter. Each player creates their own character using the D&D rulebook and their imagination, and then during the game, tells what their character does in each situation. Even though the published campaign material is always the same, each game ends up being vastly different because the players, the characters, and the decisions they make–which are limited only by the players’ imagination–are different. And the dice can either roll in your favor or not.

One of us with previous dungeon master experience delved into the campaign material of Descent into Avernus, and we got a couple of new people involved, resulting in a total of four players + the DM. During the first session, we players created our characters for the adventure. For my character, I selected the sage background that comes with a letter from a dead colleague posing a question you have not yet been able to answer as a part of the starting equipment. Our dungeon master asked me to think about what that letter contained, which resulted in me writing a ten-page-long Word document of my character’s background. Finding such fiction writing to be quite fun, I started taking notes during our game sessions and writing fantasy-novel-style “recaps” based on those notes, which people could then read before our next session to remind themselves what happened last time.

Our sessions were 3-4 hours long, and each resulted in a 10-20 page-long Word document as the recap. Very soon, I noticed that even though I personally do not find reading such lengthy stories to be a problem, not all of our players find time for such a task. Thus, I figured I’d convert the writings into audio format so that those players could instead listen to the recap of the previous events before the next session. And so, I started looking into the Azure AI Speech service.

Comparison of the speech synthesis methods
1. The Speech Synthesizer is for real-time conversion
2. Use the batch synthesis API for long-form audio
Selecting the voice
Speech Synthesis Markup Language (SSML)
1. SSML elements and structure
2. What happens if there are errors in the SSML?
My experimentations with SSML
Generating the SSML with Azure AI services
1. How to get the generation to happen correctly
2. The format of the text also plays a role in success
Training a custom voice
Conclusion

Comparison of the speech synthesis methods

To get started with using the Azure AI Speech service, you first need to create a Speech service resource on Azure (duh). But what then? What would I need to do to be able to perform text-to-speech conversion as a part of my custom application?

I started browsing through the Microsoft documentation for this bit of information, and the first thing I ran into was the speech synthesis options Azure AI Speech service offers, of which there are two: speech synthesis and batch synthesis.

The Speech Synthesizer is for real-time conversion

At first, I looked at the Microsoft.CognitiveServices.Speech package and its SpeechSynthesizer. Using it is pretty straightforward. On the Overview blade of your Speech service resource, you can find the information you need to connect to the service from an application: the service region and its key. You use those to create a new SpeechConfig object, and then use that object to create the SpeechSynthesizer, which you can use to perform the speech synthesis.

However, using the SpeechSynthesizer posed two problems: The conversion is limited to 10 minutes of audio, and the result has to be played through the speakers before it can be saved as an audio file. There does not seem to be a way to turn off the feature.

Based on the above two points, it is pretty clear that the SpeechSynthesizer is meant for real-time conversion of small chunks of text into speech, not for creating long audiobook-like files for later consumption. Of course, you can split the text into smaller chunks, call the conversion method multiple times, and then concatenate the resulting audio files, but you’d still have to spend time listening through the entire thing before the result can be saved to a file. Luckily, there is another service at our disposal—one that is able to create a single, long audio file and does not require us to play through the conversion result.

Use the batch synthesis API for long-form audio

The Azure AI Speech service’s batch synthesis API can be used to create audio files that are longer than 10 minutes. It is well-suited for generating long audiobook-like files. The conversion delay of this asynchronous service is not a problem because we do not need to play the results immediately. As an added bonus, we don’t have to listen to the generated speech through the computer speakers while the conversion is happening or even necessarily keep our computer powered, because the conversion occurs as an asynchronous job on a server. The conversion also happens much, much faster!

The API works in the following manner. Please note that I’ve removed some of the properties from the examples to keep them relatively short. For a complete list of available properties, see the list on Microsoft documentation.

We call the API and include the text we wish to convert in the request body.
When the API receives our request, it creates a new batch synthesis job in the Azure AI Speech service for the conversion, and returns the job ID in the response body.
Our application can then use the job ID to poll the synthesis job status to check when the conversion has finished.
When the job is no longer running, the response contains a URL in the outputs/result property that can be used to download the synthesis result as a zip file. The package contains the following files:
- 0001.wav is the audio file
- 0001.debug.json contains information regarding the job
- summary.json contains information about the input and output
It is considered a good practice to delete the synthesis job after the zip has been downloaded. Otherwise, the job result will be retained for 168 hours (7 days) by default (can be increased up to 31 days when creating the job). You can delete the job by making a DELETE request to the same URL as you used to check the synthesis status.

If you are interested in looking at the code or giving the console application a whirl, you can find the project under my GitHub profile in the AzureAI.TextToSpeech repo.

When I initially started experimenting with the API in January 2025, it wasn’t in as good a state as it is today. Back then, it repeatedly failed when I tried to synthesize “large” (50KB) files into audio. And what made the situation even more annoying was that the request was accepted with a success status code by the API, and the synthesis status was shown as Running until it turned into Failed status four hours later. There was no explanation available as to why the conversion had failed. After checking the input file for syntax issues, I split it into smaller files and discovered that there was a limit somewhere around 25000-30000 characters. Thus, I still needed to split the content into multiple inputs, do the synthesis in batches, and finally concatenate the resulting audio files.

Luckily, the API has taken leaps forward during 2025 and can today tackle both of the things that were frustrating me earlier:

Today, it can convert the entire long file into audio all at once. It is no longer necessary to split the input into smaller batches.
If an error occurs during the conversion, an error message can be found in both the 0001.debug.json and summary.json files that are contained within the downloaded zip file. So far, the error message has always correctly described the problem and allowed me to fix it in no time.

Selecting the voice

In addition to delving into the differences of the speech synthesis methods, another thing I needed to look into in the early stages of my project was the voice I wanted to use. For finding a suitable voice, one can browse the Voice Gallery and listen to the different voice examples.

There are two types of voices to choose from: neural and neural HD (high-definition).

Neural (non-HD) voices

The neural (non-HD) voices do not understand the context. The tone of the spoken text can be adjusted via something called speaking styles and their style degree.

Speaking styles tell the speech synthesizer which emotion should be conveyed when speaking the text. Some examples of speaking styles are angry, cheerful, sad, excited, friendly, terrified, shouting, unfriendly, whispering, and hopeful. Check the Microsoft documentation for the complete list of supported speaking styles for each voice. Not all styles are available for all voices, so one needs to select a voice that supports the desired speaking style (or select the closest match among the available speaking styles for the used voice). You can see which styles are supported for a voice in the Voice Gallery and can filter voices based on the speaking styles they support.

Style degree is a float value between 0 and 2, and the higher the value, the more intensely the speaking style is applied.

Neural HD voices

The neural HD voices are context-aware. That means that when text is sent to the speech service for synthesis using an HD voice, the service aims to understand the context of the text and adjust the tone of the voice accordingly–with varying degrees of success.

When using an HD voice, you can also specify the temperature you want it to use. The temperature is a float value ranging from 0 to 1, and it influences the “randomness” of the output. Lower temperature results in less randomness, leading to flatter, consistent results, while higher temperature increases randomness, allowing for more diversity and variance. The default temperature is set at 1.0 (max randomness).

There are three different kinds of HD voices: DragonHD, DragonHDOmni and DragonHDFlash. DragonHD voices were the first to be introduced, and they can produce consistent results for purposes like customer service and commercials. DragonHDOmni is a more recent set of voices for more creative pursuits such as audiobooks and storytelling. And finally, DragonHDFlash is meant to produce a more natural sounding output but only works in Chinese and American English, with a limited set of voices and a narrow set of speaking styles. If your use case can fit the narrow parameters of DragonHDFlash voices, they are definitely worth a look.

As of March 2026, neural HD voices also support speaking styles. Previously speaking styles were only available for non-HD voices. This means that instead of giving the speech service’s context deduction complete control over the tone, you can now nudge it into right direction by also applying a speaking style. This is particularly useful if the same text can be interpreted to be spoken in multiple different tones (e.g., seriously or sarcastically). Note that at the time of this writing, HD voice speaking styles are only supported for voices whose primary language is English, i.e., they won’t work for a multilingual voices set to speak English if the voice’s primary language, for example, is Italian.

There are still some features that are not supported by HD voices, such as pitch, and thus non-HD voices can still be useful in certain situations.

Multilingual voices

There are a lot of voices available that support multiple (90+) languages. Each voice always has its primary language that is mentioned in the voice name, and then a list of other supported languages. The speech service always aims to automatically detect which language it should use to speak the provided text. Usually, it gets the language right, but at times I’ve also had it revert to using the primary language–perhaps because both languages have words with similar spelling, even though they are pronounced differently.

So, you do not necessarily need to do anything to have a multilingual voice, such as fr-FR-Remy:DragonHDLatestNeural (French) or ja-JP-Masaru:DragonHDLatestNeural (Japanese), to speak English. However, to avoid any potential issues with those voices, I’d still recommend wrapping the text within lang xml:lang="en-GB" tags to ensure that the text is spoken in English in case the speech service’s automatic language detection doesn’t get it right.

Also, be aware that multilingual voices with the primary language other than English can have an accent that is common for the speakers of their primary language. So, for example, Remy (fr-FR-Remy:DragonHDLatestNeural) speaks English with quite a strong French accent. The accent is more apparent when the content language is not specified explicitly by a language tag but deduced automatically, so if you wish to tone down the accent of the voice, make sure to use the language tags to specify the content language.

How to use the voice in practice

For actually using the voice, you need its voice name. You can find it in the Voice Gallery on the code tab as the SpeechSynthesisVoiceName configuration value. You can also browse Microsoft documentation for all voice names and HD voice names.

When using the real-time speech synthesis, you can simply specify the voice name as the SpeechConfig object’s SpeechSynthesisVoiceName attribute value. However, when using the batch synthesis API, we need to specify it in an SSML element. So, let’s talk about SSML next.

Speech Synthesis Markup Language (SSML)

The Speech Synthesis Markup Language (SSML) is an XML-based markup language that can be used for directing the speech synthesis (how the text is spoken). It can be used just as well with both the real-time speech synthesizer and the batch synthesis API. You don’t necessarily need to use any SSML with a speech synthesizer, because you can specify the voice as a config value, but when using the batch synthesis API, you need to specify the following tags at a minimum.

The speak and voice tags are mandatory. All other SSML tags are optional.

The speak tag has the following attributes.

The version attribute defines the version of the SSML specification that should be used to interpret the document markup.
The xmlns refers to a document that lists the element and attribute names that can appear within the document.
The xmlns:mstts attribute is required if you wish to use any of the tags that are prefixed with mstts; it points to the document describing those elements and attributes, similar to the xmlns attribute.
The xml:lang attribute just indicates that the natural language content within the xml is written in the specified language. It does not have any effect on the speech synthesis. If you wish to use a multilingual voice and specify the language instead of having the speech service deduce it automatically, you need to wrap the text within the voice tag in a separate lang element.

The voice tag is used to specify the voice you wish to use, and in the case of HD voices, also its temperature. Note that if you are using the real-time speech synthesizer and specifying the voice as a configuration value, you still need to specify the temperature via SSML because there is no configuration setting for specifying it in the SpeechConfig object.

Initially, I chose to use a single HD voice and temperature for reading out the entire story. At this point, I sent the actual story content simply as plain text to the service. However, the result was not that great in my opinion. One of the things that most bugged me was the length of the silence between paragraphs. It was too short in my opinion; I wanted the rhythm of the text to be slower. That’s what prompted me to start looking into injecting more SSML into the text to influence the duration of those breaks.

SSML elements and structure

There are a bunch of elements you can use to direct how the text is spoken. You can find the complete list of available SSML elements on Microsoft documentation. There are a couple of things you should note regarding them:

Just like with any XML document, there’s a certain order in which the elements need to be nested, and not all elements can be used with each other. Refer to the list linked above to check supported nesting.
Not all elements are compatible with HD voices. Even if the list of HD-compatible elements lists an element as compatible, I’d advise you to test its effect to be certain. For example, the prosody tag is listed as supported for HD voices, but adjusting the pitch with that element does not actually work for HD voices.

What happens if there are errors in the SSML?

There are three things that can happen if there are errors in the SSML that is sent to the batch synthesis API.

If the SSML is not in a correct format (e.g., you are missing a closing tag for an element) when you send it to the Speech Service, the batch synthesis API will immediately return an HTTP 400 Bad Request response.
If the request is initially accepted, but then an error occurs during the conversion, the job status will switch to Failed. When you then download the zip file, the 0001.debug.json and summary.json files contain a descriptive error message that explains what went wrong so that you can fix the SSML input. When the job fails, there will naturally not be an audio file within the zip, but otherwise the contents are the same as when the job succeeds.
If your SSML contains elements or attributes that are not supported by the used voice, no error is returned. The specified elements and their attribute values simply have no effect. That’s why it is always good to listen through the audio and verify that it matches what you are expecting.

My experimentations with SSML

At first, I only wanted to increase the silence duration after each paragraph. Turned out that it could be easily achieved with the break element. Still, adding a lot of those tags to the text by hand would have been super tedious, and I am a programmer after all. Thus, I implemented some basic logic to add a break tag after each paragraph.

After introducing the breaks, the “audiobooks” started to sound a bit better, but I was still not quite satisfied with them. I was curious to discover how much better I could make the resulting audio if I introduced more of the SSML elements in the text. So I started experimenting further.

Using different voices for narration and dialogue

In my opinion, the HD voice I was initially using for speaking through the entire story was good during dialogue, but too all-over-the-place for narration. Even when setting the temperature to 0 and using HD voices that were marked as tailored for narration, it was usually still the case, although there are some voices that have the calmness I was looking for (like Seraphina, if you don’t mind the slight German accent).

But then, I discovered something even better.

While I was browsing through the different voices available in the voice gallery, I noticed that one of the neural (non-HD) voices supported the story speaking style. This voice and speaking style turned out to be great for the kind of story narration I required. It was way too bland for dialogue, though, so I wanted to use the voice only for the narration and then different voices for the dialogue spoken by each of the characters in the story.

Originally, I wanted to use an HD voice for each of the characters, because I find the HD voices to be more lively and have more character. Additionally, being context aware, the tone of the voice can adjust to match what the character is feeling, if it is apparent based on the spoken words within that specific bit of dialogue. However, there are a few exceptions when I actually found non-HD voices to be more suitable.

Adjusting pitch does not work for HD voices

In our Dungeons & Dragons campaign, one of the players has a character that is a mapach, a small raccoon-like humanoid. I figured it’d be fun to have that character speak her lines in a squeaky voice. To accomplish this, I went to introduce a prosody pitch="30%" tag, which should have increased the pitch of the voice by 30 percent. However, I discovered that the pitch adjustment does not work for HD voices; it only works for the neural non-HD voices. And thus, I was forced to swap the voice of this character to a non-HD one to be able to accomplish the pitch alteration.

Later on, I ended up using the pitch alteration for other non-player characters (NPCs) we encountered during our adventure as well. Small characters used a higher pitch while monsters such as devils used a lower pitch, making them sound more menacing. Being able to adjust the pitch increased the number of different kinds of voices I could use in my story beyond the number of available voices on the Voice Gallery.

The tags I ended up utilizing in my project

The elements I ended up using to convert my story into audio were the following:

break time="ms" tags to introduce 450ms breaks in between paragraphs. In addition, certain punctuation characters were replaced with breaks of 300ms, and dinkus (***) with a 1800ms break.
Different voice tags for narration and dialogue lines, each character having their own specified voice.
mstts:express-as style="" styledegree="2" tag with the speaking style story for the narration lines. Also, for dialogue lines using a non-HD voice, the style could be set to convey an appropriate emotion.
prosody pitch="%" tag to increase the pitch of a non-HD voice for small characters, and to decrease the pitch with a negative value for devils and the like. The prosody element can also be used for adjusting the speaking rate and volume. See Microsoft documentation for a more thorough explanation of the different attributes.
phoneme element to tell the speech service how to pronounce a word or a name if it did not pronounce it correctly. E.g., the speech service kept on spelling out the name “Jynks” letter by letter at every occurrence of the character’s name. Thus, I needed to surround the name with a phoneme alphabet="ipa" ph="ʒinks" tag to have the name correctly pronounced.
lang xml:lang="en-GB" tag for multilingual voices because sometimes the automatic language detection does not work correctly (and allows me to specify the English text to be spoken with a British accent)

I also tested the mstts:backgroundaudio tag to play ambient sounds that could be heard in the kind of environment where the story was taking place. Sadly, found it to be too distracting for my purposes.

I was also curious about the “role” attribute available on the same express-as element as speaking style. It is meant to be used to define how old the speaker’s voice should sound. The value could be set to Boy, Girl, OlderAdultFemale, OlderAdultMale, SeniorFemale, SeniorMale, YoungAdultFemale, YoungAdultMale to adjust the voice to sound like someone of that age. However, it doesn’t seem to do anything, even when using one of the voices that are meant to support it.

Generating the SSML with Azure AI services

As you can probably already guess at this point, adding all those elements I described above using the character replacement method I was using before wouldn’t work. Introducing the different voices and speaking styles for the dialogue required semantic understanding. And I was definitely not going to do it manually, because it would have been super tedious. Thus, I started looking into using AI to incorporate those elements into my text.

I set up the Azure AI service originally with the gpt-4o model and started experimenting to see if I could have it reliably decorate my text with the desired SSML elements. Since then, I’ve upgraded the model to claude-opus-4-6, and at the time of this writing, the SSML generation works very well, especially when using Claude Opus 4.6 for the generation. The AI…

Applies a specific voice to all narration
Detects which character is speaking during dialogue and applies a voice tag using the character-specific voice
Includes a prosody tag with pitch if specified for the character
Deduces what emotion should be conveyed in the dialogue and applies an appropriate speaking style to the express-as element
Applies break and phoneme elements where appropriate
Includes all other necessary SSML elements (speak, lang) and ensures the result is valid

Usually, I only need to make some minor adjustments when checking through the SSML before it is sent to be converted into speech. However, it has not always been the case.

How to get the generation to happen correctly

In order to have high chances of the markup generation happening as expected, I needed to address quite a few issues first.

Initially, the AI…

Removed original text
Invented additional text
Changed text order
Messed up with the SSML syntax, resulting in an invalid format
Invented non-existent speaking styles
Duplicated dialogue
Did not correctly separate the spoken dialogue from adjacent narration
Pronounced some words wrong (speech service issue, fixed by phoneme tags)
Refused to process input that described a battle scene

The above issues could be fixed by either adjusting the system message or the Azure service configuration. In the case of the service refusing the process text describing a battle scene, I needed to create a custom filter with high tolerance instead of the default medium. But in the end, it mostly came down to having an extremely thorough system message.

A system message is a chunk of text that contains the instructions for the AI service to do whatever it is expected to do. The instructions are always sent along with the actual input to the Azure AI service for processing.

I found that it is not enough to simply write instructions on what you want the AI to do and what not to do. Even though I explicitly told it to behave in a certain way, it did not happen–until I introduced examples. They turned out to be super important in getting the results I was after. Every time I converted a piece of text into SSML, I created examples of the bits the AI got wrong by taking the input text and specifying the kind of SSML I was expecting to receive in return. Over time, I seemed to cover enough ground with the examples that the errors lessened. I also noticed that if the AI was conflicted between some examples, the last provided example in the system message was used to deduce how to format the SSML.

As I mentioned before, the system message and the kind of input I’m sending to the service work quite well together these days. And Anthropic’s Claude Opus 4.6 model works wonders compared to Open AI’s latest models (even Claude Sonnet 4.6 works better). But still, sometimes AI just makes mistakes. One time, it can provide the perfect outcome for a piece of text, and the next time you run it with the same text and system message, it might make mistakes. And that’s why a human being always needs to ensure the result is what we are expecting. To achieve this, the solution I’ve implemented for the conversion has a pause in between SSML generation and sending it to the speech service, so that I have time to check through the SSML and make any necessary fixes before the text is converted into audio.

The format of the text also plays a role in success

There were some issues that could not be solved by adjusting the system message or the service configuration.

When sending the text to Azure AI for processing, it needs to be split into batches because there is a token limit for the returned output. I’ve ensured that the point at which the content is cut into batches always occurs in between paragraphs for minimal context disruption. But still, sometimes the narration describing which character is speaking and with what emotion, and the actual lines of spoken dialogue can be split into separate batches. In this case, the AI cannot apply the correct tags for the dialogue.

To get around this issue, I needed to ensure that the spoken dialogue and the narration that described who was speaking and how were always presented in the same paragraph. This way, they’d always be sent to the Azure AI service in the same batch, and the service would have the necessary information to apply the tags correctly.

Another thing I noticed that caused problems was that if a character’s dialogue were split by narration, the conveyed feeling would change between the parts when using an HD voice. Additionally, having the voice to switch from character to narration and then back to character sounded a bit clunky. Thus, for achieving better results, I’ve taken up the practice of always writing a character’s dialogue in a longer chunk, and having the descriptive narration either before or after it, not in the middle of it. The more words you have inside the HD voice block, the better chance the service has of using the correct tone.

A similar issue occurs if text spoken by an HD voice contains a break tag. It breaks the context awareness. My solution was replacing em dashes and ellipses with break tags of 300 milliseconds to have a bit longer period of silence when those characters appeared in the text. However, I needed to stop using them in dialogue when the character in question was specified to have an HD voice. Instead, to achieve a longer silence, I began using a period. Sadly, the dialogue doesn’t look as nice in writing anymore.

I also looked at the silence element that can be used to specify breaks for certain situations and characters to get around the above context-loss issue. However, at the time of this writing, it does not support specifying the length of silence in between for characters like dashes or ellipses (nor dinkuses or ends of paragraphs).

Training a custom voice

At this point, I had a voice for the narrator and each of the characters in the story. But then, during our campaign, we encountered a character that also appears in Baldur’s Gate 3. It felt so wrong to use one of the readily available Azure AI Speech service voices when they make the character sound nothing like the one in the video game.

I started thinking, Hmm, what if I trained a custom voice based on the character’s voice lines that are available in the game, and used that voice for the character’s dialogue? It should be fine if I only use it for my private hobby purposes, right? I was curious to find out if it was a thing that could actually be done.

So, I went to the speech service resource on Azure and navigated my way to the Speech Studio to take a closer look at what it would require for me to train a custom voice.

Custom voice options

There are two types of custom voices you can train: personal and professional. Personal voices require you to record voice samples for preset phrases via a microphone, while professional voices can be trained by providing an existing audio sample. The latter option was more suitable for my purposes because it was not my own voice that I wanted to record, but a voice based on an audio file extracted from a YouTube video that played through all of the video game character’s voice lines.

When you are about to create a new professional voice project, you are presented with two options: Lite and Pro. The lite voice can only be used within the Speech Studio. It cannot be used for speech synthesis via the API, which is something we’d need to be able to do. Thus, the pro option was the way to go.

The first problem: Recorded statement

Then, I ran into the first problem. To add a “voice talent” into my project, I needed to be able to provide a recorded statement from the voice talent (the person whose voice was going to be used for training) saying the I [state your first and last name] am aware that recordings of my voice will be used by [state the name of the company] to create and use a synthetic version of my voice. And that voice would need to match the audio recording later provided for training the voice.

So, what could I do to get around this? I could use another AI service to train a voice sufficiently that it would be able to speak that phrase in high enough quality to fool the Azure AI Speech service’s validation mechanism. Hmm, hacky.

I decided to entertain myself a bit longer, and I found a website that offered a free trial for training a voice based on 10 minutes of audio, and allowed me to use the voice on the web page to speak the text written in a text box. I wrote the statement above into the box, recorded the spoken audio, and uploaded that file as the voice talent statement. And it worked. When I uploaded the actual audio file I wanted to use for training the voice, the speech service accepted it without a problem.

The second problem: Pricing (and ethics)

Training a professional voice requires the audio sample to contain 50-2000 utterances. The 45 minutes worth of voice lines stripped from the YouTube video contained 528 utterances, making the sample of sufficient size. That was not the problem. The second problem arose when I looked at the pricing page for what it’d cost me to actually train the voice.

Training a professional voice costs approximately 45€ per hour, and according to Microsoft documentation, it usually takes around 10 hours to train a single-style custom voice. In the past, the estimate used to be 20 to 40 compute hours to train a single-style voice, and around 90 compute hours to train a multi-style voice, so the time it takes to train a voice has come down considerably. I don’t seem to be able to find any black-on-white regarding today’s estimate for a multi-style voice, but if we assume it to take three times longer than a single style, it’d be about 30 hours today.

The Azure UI also gives you an estimate for how long it’d take to train a voice in your particular case (assuming based on the number of utterances). In my case, the estimate was 4 hours. So, based on that estimation, it’d cost me 200 euros, on average it’d be 500 euros, but it could also be 800 euros, which is the maximum, because only the first 18 hours of training get billed. This is a considerable improvement from before when my estimate was 15 hours, average 20-40 hours, and the billing cap 96 hours which would have ended up around 4300 euros. Yikes!

And what about using the voice? To be able to use the voice for speech synthesis, you need to deploy it as a custom endpoint. The custom endpoint is functionally identical to the standard endpoint that’s used for text-to-speech requests. The endpoint URL has the format https://region.voice.speech.microsoft.com/cognitiveservices/v1?deploymentId=12345 where the deployment ID parameter has your custom endpoint ID. Hosting the custom endpoint costs about 3,5 euros per hour, and you can always un-deploy it when you don’t need it.

Actually, using the custom voice for speech synthesis is about twice as expensive as using the standard voices. Using a neural voice costs approximately 20€ per million characters, while using an HD voice costs twice as much (40€ per million characters) when using the pay-as-you-go model.

It was not the cost of using a professional voice but the cost of training it that made me back off from experimenting with the feature any further. And there is, of course, also the ethical perspective. Even though the voice would have only been used in a private hobby project that would never be released anywhere publicly, would it still have been ok for me to train an AI voice using another person’s voice without their explicit permission? Perhaps not.

Conclusion

AI does not yet sound as good as a real person, even if we use SSML to direct the speech synthesis. HD voices can’t always correctly interpret the content for what kind of emotion should be conveyed, and non-HD voices sound quite flat. Especially with short pieces of text, there is typically not enough information for the HD voices to figure out what kind of tone should be used. Those voices most likely work much better with non-story content where the emotion is quite neutral, and long chunks of text are spoken by the same voice. For stories, you might be able to get better results if you manually adjust the very detailed SSML markup and spend a lot of time tweaking and testing it, but at least for me, that is way too time-consuming. And at the end of the day, when it comes to conveying an emotion through voice, a real person can adjust their voice to contain the smallest of subtleties, while AI, at this moment in time at least, can not.

Still, I think the AI speech synthesis results are good enough for projects similar to mine: when one does not have the resources to hire professional voice actors, and it is ok for the result not to be perfect. If it is ok for people to realize the audio has been synthesized by AI, and it is acceptable.

As of May 2026, we’ve played the Dungeons and Dragons campaign for nearly two years and have had 30 sessions. We are currently on level 10, and the campaign apparently ends at level 13. If the pace continues, I estimate us to play the campaign for another six months. I’ll definitely keep on experimenting with the Azure AI Speech service during this time, so keep an eye on this article for future updates!

Author: Laura Kokkarinen

Filed Under: Azure AI

Tags: Audiobooks, Azure AI Speech, text-to-speech