Convert AWS Polly text-to-speech API result to HTTP streamed response
In previous articles, I’ve explored streaming implementations for Google’s Text-to-Speech API and Azure’s Text-to-Speech service. Continuing this series, let’s see how to implement streaming with AWS Polly, Amazon’s text-to-speech service.
Spoiler: It’s trivial.
AWS Polly JavaScript SDK
The command we will use is SynthesizeSpeechCommand from the AWS SDK for JavaScript v3. The response from SynthesizeSpeechCommand already includes an AudioStream. AudioStream has the methods transformToByteArray, transformToString, and transformToWebStream, which convert the audio stream into different representations. For HTTP streaming, we can simply call transformToWebStream() to convert the AWS SDK audio stream into a web-compatible ReadableStream.
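To make the difference between a buffered result and a web stream concrete, here is a minimal sketch of reading a web ReadableStream chunk by chunk. It does not call Polly: fakeAudioStream is a hypothetical stand-in for response.AudioStream.transformToWebStream(), and readChunkSizes is an illustrative helper, not part of the AWS SDK. It assumes a runtime with a global ReadableStream, such as Node.js 18 or later.

```typescript
// Hypothetical stand-in for response.AudioStream.transformToWebStream():
// a web ReadableStream that emits the given byte chunks one at a time.
function fakeAudioStream(chunks: Uint8Array[]): ReadableStream<Uint8Array> {
  let index = 0
  return new ReadableStream<Uint8Array>({
    pull(controller) {
      const chunk = chunks[index++]
      if (chunk !== undefined) {
        controller.enqueue(chunk)
      } else {
        // No chunks left: signal the end of the stream.
        controller.close()
      }
    },
  })
}

// Reads the stream incrementally, the way an HTTP layer does when it
// forwards data, and records the size of each chunk in arrival order.
async function readChunkSizes(
  stream: ReadableStream<Uint8Array>,
): Promise<number[]> {
  const reader = stream.getReader()
  const sizes: number[] = []
  while (true) {
    const { done, value } = await reader.read()
    if (done || value === undefined) break
    sizes.push(value.byteLength)
  }
  return sizes
}
```

Each chunk is available to the consumer as soon as it is read, which is what allows a server to forward audio before synthesis output has been fully received. By contrast, transformToByteArray() resolves only once the whole audio payload has been collected.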
Example code
The following code converts a text input to an Ogg Vorbis audio stream using AWS Polly’s SynthesizeSpeechCommand. The createError, setHeader, and sendStream helpers come from h3.
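import type {
  LanguageCode,
  VoiceId,
} from '@aws-sdk/client-polly'
import {
  PollyClient,
  SynthesizeSpeechCommand,
} from '@aws-sdk/client-polly'
// h3 helpers; Nitro and Nuxt auto-import these, so this line is optional there
import { createError, sendStream, setHeader } from 'h3'

// awsRegion, awsAccessKeyId and awsAccessKeySecret come from your own configuration
const client = new PollyClient({
  region: awsRegion,
  credentials: {
    accessKeyId: awsAccessKeyId,
    secretAccessKey: awsAccessKeySecret,
  },
})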
...
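// Inside the request handler, where `event` and `text` are available
const command = new SynthesizeSpeechCommand({
  Text: text,
  OutputFormat: 'ogg_vorbis',
  VoiceId: 'Ruth' as VoiceId,
  LanguageCode: 'en-US' as LanguageCode,
  Engine: 'neural',
  TextType: 'text',
})
const response = await client.send(command)
// AudioStream is optional in the response type, so guard against a missing body
if (!response.AudioStream) {
  throw createError({
    status: 500,
    message: 'SPEECH_SYNTHESIS_FAILED',
  })
}
// Convert the SDK stream into a web ReadableStream that h3 can send
const stream = response.AudioStream.transformToWebStream()
// The codec must match OutputFormat: 'ogg_vorbis' is Vorbis, not Opus
setHeader(event, 'content-type', 'audio/ogg; codecs=vorbis')
setHeader(event, 'cache-control', 'public, max-age=3600')
return sendStream(event, stream)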
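The content-type header should describe the format requested through OutputFormat, so that the browser picks the right decoder. One way to keep the two in sync is a small lookup like the sketch below. PollyOutputFormat and contentTypeFor are hypothetical names introduced for illustration and are not part of the AWS SDK. The sketch covers only four formats, and the values for mp3, pcm, and json follow the content types Polly documents for those formats.

```typescript
// Output formats covered by this sketch (illustrative type, not from the SDK).
type PollyOutputFormat = 'mp3' | 'ogg_vorbis' | 'pcm' | 'json'

// Hypothetical helper: returns a content-type header value that matches
// the OutputFormat passed to SynthesizeSpeechCommand.
function contentTypeFor(format: PollyOutputFormat): string {
  switch (format) {
    case 'mp3':
      return 'audio/mpeg'
    case 'ogg_vorbis':
      // Ogg container with the Vorbis codec
      return 'audio/ogg; codecs=vorbis'
    case 'pcm':
      return 'audio/pcm'
    case 'json':
      // Speech marks are returned as a stream of JSON lines
      return 'application/x-json-stream'
  }
}
```

With this in place, the header line could become setHeader(event, 'content-type', contentTypeFor('ogg_vorbis')). The Polly response also carries a ContentType field, so forwarding that value when it is present may be a simpler alternative.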
Conclusion
Because the AWS SDK already provides a stream-based response and helper methods for converting the audio stream, implementing HTTP streaming for AWS Polly is straightforward. Streaming lets the client begin playback as soon as the first audio chunks arrive instead of waiting for the whole file, which shortens the delay between the request and audible output.