
Custom audio lessons with GPT and Azure AI Speech

Published: Mon Oct 23 2023

Ever felt like language apps weren't cutting it for you? Here's how I generated my own custom-tailored audio course for learning Japanese, and how you can too!

DALL-E 3: Subtly colored drawing of a robot in a podcast studio, microphone in front, with a moderately vibrant backdrop of famous Japanese landmarks. The robot has muted tones while pointing to a delicately colored Japanese phrase on a whiteboard.

Update, Jun 13 2024: This article was written before the release of GPT-4 Turbo and GPT-4o. These models support a much larger output size and are significantly cheaper, which should make it possible to generate much longer lessons for a far lower price.

TL;DR? Get the lessons or generate your own on GitHub.

What

I wanted to learn Japanese for an upcoming holiday trip. After racking up a Duolingo streak of over a hundred days, I was not satisfied with my progress. Completing lessons started feeling like a chore.

I've found audio lessons to be my preferred way to learn a language at a level where grammar isn't a priority, because they can be combined with other activities. However, I had picked up some knowledge from the time spent on Duolingo, and did not want to start from scratch with a beginner-level audiobook.

Several AI-powered language-learning applications exist [0][1][2], but they were too interactive for me. Text-to-speech systems have become impressively realistic. Could I use GPT and text-to-speech to generate audio lessons that suit my level?

How

Neural Text-to-speech

I wanted a text-to-speech service offering:

  • English and Japanese in the same text, preferably within the same sentence.
  • An easy-to-use API.
  • A nice, realistic-sounding voice.
  • A way to control pronunciation.

One such service (with a free tier) is Azure Text-to-speech. They recently released a neural voice with support for 41 different languages/accents. With this, I can use the same voice for all narration.

To control pronunciation, Azure Text-to-speech supports the Speech Synthesis Markup Language (SSML). SSML can be used to specify which language to speak, to slow down the Japanese parts, and to add breaks between words:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-RyanMultilingualNeural">
<lang xml:lang="en-US">
Now, let's use "water" in a sentence. "I want some water" is:
</lang>
</voice>
<voice name="en-US-RyanMultilingualNeural">
<lang xml:lang="ja-JP">
<mstts:prosody rate="-20%"><break time='1s' />
水が欲しいです
</mstts:prosody>
</lang>
</voice>
</speak>
Audio generated with Azure Text-to-speech from the above SSML
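For reference, synthesizing SSML like this into an audio file takes only a few lines with the Speech SDK for JavaScript. This is a minimal sketch rather than the exact code from the repository; the SPEECH_KEY and SPEECH_REGION environment variables and the file names are assumptions:

import * as sdk from "microsoft-cognitiveservices-speech-sdk";
import { readFileSync } from "fs";

// Assumed environment variables and file names; adjust to your own setup.
const speechConfig = sdk.SpeechConfig.fromSubscription(
  process.env.SPEECH_KEY!,
  process.env.SPEECH_REGION!
);
speechConfig.speechSynthesisOutputFormat =
  sdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3;

const audioConfig = sdk.AudioConfig.fromAudioFileOutput("lesson.mp3");
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

const ssml = readFileSync("lesson.ssml", "utf-8");
synthesizer.speakSsmlAsync(
  ssml,
  (result) => {
    if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
      console.log("Wrote lesson.mp3");
    } else {
      console.error("Synthesis failed:", result.errorDetails);
    }
    synthesizer.close();
  },
  (error) => {
    console.error(error);
    synthesizer.close();
  }
);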

Now that I was able to generate realistic sounding bilingual audio, I just needed to generate a set of personalized lessons in SSML.

Generating lesson transcripts

The task is split into multiple steps. First, generate a learning plan containing descriptions for a large number of short lessons. Then, for each lesson description, generate a transcript.

Generating a learning plan

Each lesson in the learning plan features a title and a description detailed enough to produce a transcript yet concise enough to adhere to GPT's token limitations. As the course advances, its complexity should escalate, for example by transitioning to more conversation-based lessons towards the end of the course.

I started out by asking for a learning plan with a JSON schema:

[{"title": string, "description": string}]

While easy to parse, this took a long time to generate and consumed a lot of tokens. To fix this, I switched to a learning plan in this format:

{title}:{description}\n

This reduced token usage by approximately 40%.

Even then, GPT was not generating a comprehensive learning plan covering all essential topics. When I asked for between 50 and 200 lessons, only 10 to 20 were generated, each covering too much material. The trick was to specify a fixed number of lessons (100) and to tell GPT to keep track by counting. Example output:

15 - Adverbs of place: Where, here, there
16 - Adverbs of place 2: Above, below, inside and outside
17 - Directions: Left, right, straight ahead and turn.
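For illustration, the learning-plan request might look roughly like the sketch below. The prompt is a paraphrase of the ideas described above, not the exact prompt from the repository, and the parsing assumes the compact {title}:{description} format:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateLearningPlan(): Promise<{ title: string; description: string }[]> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          "You are planning a language course. Output exactly 100 lessons, one per line, " +
          "in the format {number} - {title}:{description}. Count the lessons so you know " +
          "when you have reached 100. The lessons should gradually increase in difficulty.",
      },
      {
        role: "user",
        content:
          "Target language: Japanese\n" +
          "Prior knowledge: I've done 100 lessons on Duolingo\n" +
          "Target knowledge: Enough to be able to enjoy a three week vacation",
      },
    ],
  });

  // Parse the compact {title}:{description} format, one lesson per line.
  return (completion.choices[0].message.content ?? "")
    .split("\n")
    .filter((line) => line.includes(":"))
    .map((line) => {
      const [title, description] = line.split(/:(.+)/); // split on the first colon only
      return { title: title.trim(), description: description.trim() };
    });
}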

Generating transcripts

I initially had GPT generate SSML directly. By instructing GPT to use <prosody/> and <break/> to adjust the speaking style, I got rather natural-sounding audio, with pauses and slowed-down pronunciation of new words.

This unfortunately turned out to be a bad idea. For one, the SSML markup consumed a lot of tokens. Second, it turned out to be nearly impossible to make GPT actually wrap every Japanese part in <lang/>. It would always find a way to sneak some Japanese into the English parts, causing anglicized pronunciation.

Arigatou. For more bloopers, see bloopers.

To make the task easier, I instructed GPT to write the transcript as regular text, with the exception that Japanese parts should be wrapped in <lang/>:

First up is the word for 'left', which in Japanese is: <lang lang="ja-JP">左</lang>.
Once more, 'left' is: <lang lang="ja-JP">左</lang>.

I added instructions on what a lesson might contain, such as repetitions and sentences including the word, as well as example outputs for GPT to use as inspiration.
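To give an idea of what this looks like, the instructions might be phrased along these lines (a paraphrased sketch, not the exact prompt from the repository):

Write a transcript for the lesson below, to be read aloud by a text-to-speech system.
Wrap every Japanese word or sentence in <lang lang="ja-JP">...</lang>.
Introduce each new word, repeat it a few times, and use it in at least one full sentence.
Do not add placeholders or variables (such as ..., ~, [name], ___ or similar) to the transcript.

Lesson: Directions: Left, right, straight ahead and turn.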

I then created my own silly little parser and SSML generator. Finally, I used the Speech SDK to synthesize audio.
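The parser essentially needs to split the transcript into English narration and Japanese <lang/> segments, then wrap each in the appropriate SSML. A minimal sketch of that step follows; the function name and details are made up, and the real code lives in the GitHub repository:

// Hypothetical sketch of the transcript-to-SSML step.
const VOICE = "en-US-RyanMultilingualNeural";

function transcriptToSsml(transcript: string): string {
  // Split the transcript into plain English text and <lang>...</lang> segments.
  const parts = transcript.split(/(<lang[^>]*>[\s\S]*?<\/lang>)/g);

  const body = parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const japanese = part.match(/<lang[^>]*>([\s\S]*?)<\/lang>/);
      if (japanese) {
        // Japanese segment: slow it down and lead in with a short break,
        // mirroring the hand-written SSML example earlier in the article.
        return (
          `<voice name="${VOICE}"><lang xml:lang="ja-JP">` +
          `<mstts:prosody rate="-20%"><break time="1s" />${japanese[1].trim()}` +
          `</mstts:prosody></lang></voice>`
        );
      }
      // English narration.
      return `<voice name="${VOICE}"><lang xml:lang="en-US">${part}</lang></voice>`;
    })
    .join("\n");

  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">\n${body}\n</speak>`
  );
}

The hardcoded rate and one-second break mirror the earlier SSML example; as noted under the limitations below, letting GPT decide where the breaks go would likely sound more natural.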

Results

I went with GPT-4 as the LLM of choice for high-quality lessons. For my Japanese course, I asked for 100 lessons with the following input:

Target language: Japanese
Prior knowledge:
I've done 100 lessons on duolingo, so I know words like hello,
goodbye, some sentences like where is, my name is, some colors,
how to say where and there
Target knowledge: Enough to be able to enjoy a three week vacation

The resulting learning plan spanned a wide variety of topics, from basic greetings to historical contexts.

Generating the course set me back about $7 in OpenAI credits. Here is every lesson, concatenated into a single three-and-a-half-hour .mp3:

I also generated a learning plan for an Italian course at a slightly higher level:

Target language: Italian
Prior knowledge:
I am able to speak simple Italian, as I have been there
many times before and have studied spanish
Target knowledge: I want to be able to speak Italian comfortably

Intermediate Italian lesson 1
Review of Basic Italian: Greetings, introductions, and farewells.

As well as a beginner level Japanese course:

Target language: Japanese
Prior knowledge: I am a complete beginner
Target knowledge: The basics, e.g. sentence structure, how to order at a restaurant, simple stuff for a vacation

Beginner level Japanese lesson 1
Introduction to Japanese: Basic greetings and expressions.

Pretty good! The learning plans take prior knowledge into account: the beginner course explains the basics, the slightly higher-level Japanese course spends less time on them, and the intermediate-level Italian course explores higher-level concepts and conversations.

The lessons are pedagogical, repeating words and explaining sentences. Pronunciation is, for the most part, very good as far as I can tell.

Admittedly, there is room for improvement:

  • Despite various attempts at convincing it not to, GPT occasionally inserts placeholders in sentences, such as
    "Let's start with the phrase 'I am', which translates to 私は~です in Japanese. Here '~' is replaced by whatever activity you're currently doing.".
    My best attempt at fixing this so far has been to repeat this instruction several times in the prompt:
    Do not add placeholders or variables (such as ..., ~, [name], ___ or similar) to the transcript. Make up a suitable example instead.
  • Pronunciation isn't always straightforward, especially when a character's sound changes based on context. For instance, the character "は" is usually pronounced "ha", but when used as a topic marker, it is pronounced "wa". Azure Speech cannot infer the context-specific pronunciation when the character appears in isolation. Addressing this issue may be possible using phonetic alphabets within the SSML; see the sketch after this list.
  • Breaks sometimes feel unnatural, since for simplicity I hardcode a break whenever the language switches. I'm certain this can be improved by asking GPT to decide where the breaks should go instead.
  • At three and a half hours, the course is a bit too short. Although this could be improved by generating more lessons or by combining multiple prompts into a single lesson, I suspect the issue is hard to address properly with the current state of LLMs. GPT-4 is great, but not perfect at generating longer texts while adhering to all instructions.
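For the topic-marker issue mentioned above, the standard SSML <phoneme> element could be used to pin the reading down explicitly. An untested sketch, with an approximate IPA transcription, for the sentence わたしはがくせいです ("I am a student"):

<voice name="en-US-RyanMultilingualNeural">
  <lang xml:lang="ja-JP">
    <!-- Force the topic-marker reading "wa" instead of the default "ha" -->
    わたし<phoneme alphabet="ipa" ph="wa">は</phoneme>がくせいです
  </lang>
</voice>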

Concluding

Despite imperfections, the course offers a valuable learning experience and I find it worth listening to. Would I have been better off listening to an existing audiobook? Most likely. But once generated, the courses are freely available for everyone to use. It's scalable too: making a new course for another language or proficiency level is as straightforward as describing your current level and learning goal. With minor modifications, it should even be possible to create language courses narrated in any of the 41 supported languages!

For those interested in trying it out or contributing, the entire course, including lessons and source code, is freely available on GitHub.

Bloopers

Not everything went smoothly when generating audio. I experienced both some frustrating and some amusing mistakes:

  • Azure Text-to-speech has a feature for automatically detecting the language of the input text. I tried using this to avoid having to specify <lang/> in the SSML. That turned out not to work too well for sentences containing multiple languages.
    A confused language-detecting AI.
  • I had a bug in my GPT-to-SSML parser. I was using the wrong name for a property, causing the text of every sentence in the SSML to become undefined:
    Undefined. Undefinedo. Undefined. Undefinedo. Undefined. Undefinedo. Undefinedo.
  • Making GPT understand that the text was to be read directly by a text-to-speech system was surprisingly difficult. It would sometimes add placeholders for the narrator to replace.
    Replacing them is left as an exercise for the reader.
  • Adding <break/> in the SSML sometimes causes sentences and words to be repeated by the Text-to-speech engine.