AWS Polly text to speech API

In previous articles, I’ve explored streaming implementations for Google’s Text-to-Speech API and Azure’s Text-to-Speech service. Continuing this series, let’s see how to implement streaming with AWS Polly, Amazon’s text-to-speech service.

Spoiler: It’s trivial.

AWS Polly JavaScript SDK

The command we will use is SynthesizeSpeechCommand from the AWS SDK for JavaScript v3. Its response already includes an AudioStream, which exposes three helper methods: transformToByteArray, transformToString, and transformToWebStream. Each one converts the audio stream into a different form. For HTTP streaming, we can call transformToWebStream() to turn the AWS SDK audio stream into a standard web stream that can be piped straight into the HTTP response.
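
To make "web-compatible stream" concrete, here is a minimal sketch of how a consumer reads such a stream chunk by chunk. It uses only the standard Web Streams API and assumes a runtime with a global ReadableStream (Node 18+ or a browser). The names collectChunks and fakeAudioStream are hypothetical helpers made up for this illustration; fakeAudioStream merely stands in for what transformToWebStream() would return.

```typescript
// Read a web ReadableStream to the end and join its chunks.
// A streaming HTTP response forwards each chunk instead of joining them;
// joining here only makes the result easy to inspect.
async function collectChunks(
  stream: ReadableStream<Uint8Array>,
): Promise<Uint8Array> {
  const reader = stream.getReader()
  const chunks: Uint8Array[] = []
  let total = 0
  while (true) {
    const result = await reader.read()
    if (result.done) break
    chunks.push(result.value)
    total += result.value.byteLength
  }
  // Copy every chunk into one contiguous buffer.
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}

// Hypothetical stand-in for response.AudioStream.transformToWebStream():
// emits each inner array as one chunk, then closes the stream.
function fakeAudioStream(parts: number[][]): ReadableStream<Uint8Array> {
  return new ReadableStream<Uint8Array>({
    start(controller) {
      for (const part of parts) controller.enqueue(new Uint8Array(part))
      controller.close()
    },
  })
}
```

In the real handler there is no need to collect anything: the stream is handed to the HTTP layer, which forwards each chunk as it arrives.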

Example code

The following code converts a text input to an Ogg Vorbis audio stream using AWS Polly’s SynthesizeSpeechCommand.

import type {
  LanguageCode,
  VoiceId,
} from '@aws-sdk/client-polly'
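import {
  PollyClient,
  SynthesizeSpeechCommand,
} from '@aws-sdk/client-polly'
// createError, setHeader, and sendStream used below are h3 helpers.
// Assumption: import them explicitly unless your framework (e.g. Nuxt/Nitro)
// auto-imports them.
import { createError, sendStream, setHeader } from 'h3'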

const client = new PollyClient({
  region: awsRegion,
  credentials: {
    accessKeyId: awsAccessKeyId,
    secretAccessKey: awsAccessKeySecret,
  },
})

...

const command = new SynthesizeSpeechCommand({
  Text: text,
  OutputFormat: 'ogg_vorbis',
  VoiceId: 'Ruth' as VoiceId,
  LanguageCode: 'en-US' as LanguageCode,
  Engine: 'neural',
  TextType: 'text',
})
const response = await client.send(command)
if (!response.AudioStream) {
  throw createError({
    status: 500,
    message: 'SPEECH_SYNTHESIS_FAILED',
  })
}
const stream = response.AudioStream.transformToWebStream()
// The codec must match OutputFormat above: ogg_vorbis is Vorbis, not Opus.
setHeader(event, 'content-type', 'audio/ogg; codecs=vorbis')
setHeader(event, 'cache-control', 'public, max-age=3600')
return sendStream(event, stream)
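
The content-type header has to agree with the OutputFormat passed to Polly, so it can help to derive one from the other instead of hard-coding both. The sketch below is one way to do that. contentTypeFor and PollyOutputFormat are hypothetical names for this illustration, and the list covers only the four formats I am sure of (mp3, ogg_vorbis, pcm, json); check the Polly documentation if you use another format.

```typescript
// Output formats covered by this sketch (an assumption, not the full SDK enum).
type PollyOutputFormat = 'mp3' | 'ogg_vorbis' | 'pcm' | 'json'

// Map a Polly output format to the content type to send to the client.
function contentTypeFor(format: PollyOutputFormat): string {
  switch (format) {
    case 'mp3':
      return 'audio/mpeg'
    case 'ogg_vorbis':
      // The codecs parameter is optional; plain 'audio/ogg' also works.
      return 'audio/ogg; codecs=vorbis'
    case 'pcm':
      // Raw signed 16-bit little-endian mono samples, no container.
      return 'audio/pcm'
    case 'json':
      // Speech marks rather than audio.
      return 'application/x-json-stream'
  }
}
```

With this helper, the header line becomes setHeader(event, 'content-type', contentTypeFor('ogg_vorbis')). Another option is to forward response.ContentType, which Polly returns alongside the AudioStream.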

Conclusion

Since the AWS SDK already returns a stream-based response, along with helper methods to convert the audio stream, implementing HTTP streaming for AWS Polly is straightforward. Streaming shortens the delay between the request and the start of playback, because the client can begin playing audio as soon as the first chunks arrive instead of waiting for the whole file.