# Convert Azure Text-to-Speech API result to a web ReadableStream

## Introduction
Previously, we discussed how to use the Azure Text-to-Speech API and implement real-time audio streaming by converting the API’s output into a stream with Node.js `PassThrough`. In this tutorial, we will focus on converting the API’s output into a web-compatible `ReadableStream`.
## Why a web ReadableStream?
While the Node.js `stream` module would work for most API use cases, web streams provide some additional benefits:
- Web streams are the modern standard across all JavaScript environments. This means you can use the same stream handlers in server and client code. For example, `TransformStream` handlers can move between API and browser code without modification, allowing for greater flexibility when the design needs it.
- Compatibility with the Fetch API. The Fetch API natively returns a `ReadableStream`, so using web streams allows seamless integration with the Fetch API and other web platform features.
- A unified interface across different TTS providers. By using web streams, we can create a consistent interface over different TTS providers’ SDKs or APIs, since any output format available in JavaScript can be converted to a `ReadableStream` one way or another.
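To make the portability point concrete, here is a minimal, SDK-free sketch (the `upperCaser` transform and the `demo` helper are illustrative names, not part of any library): the same `TransformStream` and `ReadableStream` code runs unchanged in Node.js 18+, Deno, and browsers, and plugs straight into the Fetch API via `Response`.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// A TransformStream handler that could live in code shared between
// server and browser. (Single-chunk decode for simplicity.)
const upperCaser = new TransformStream({
  transform(chunk, controller) {
    controller.enqueue(encoder.encode(decoder.decode(chunk).toUpperCase()));
  },
});

const source = new ReadableStream({
  start(controller) {
    controller.enqueue(encoder.encode("hello streams"));
    controller.close();
  },
});

// Wrapping the stream in a Response mimics a fetch() result exactly.
async function demo() {
  const response = new Response(source.pipeThrough(upperCaser));
  return response.text();
}
```

Calling `demo()` resolves to `"HELLO STREAMS"`, showing the stream flowing through a shared transform and out through a Fetch-API surface.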
## Pull or Push?
Azure supports `PushAudioOutputStream` and `PullAudioOutputStream`, and both can be used to create a web-compatible `ReadableStream`. In the previous article we used `PushAudioOutputStream`, but let’s revisit the technical differences between the two.
- `PushAudioOutputStream`: created with `PushAudioOutputStream.create(callback)`, where the callback implements `write(dataBuffer: ArrayBuffer)` and `close()`. The SDK pushes audio chunks to us as they are synthesized.
- `PullAudioOutputStream`: created with `PullAudioOutputStream.create()`, with no callback required. We pull the audio out ourselves by calling its `read(dataBuffer: ArrayBuffer): Promise<number>` method, and `close()` when we are done.
For our use case, we need to mimic a `fetch` streamed response, which returns a web-standard `ReadableStream`. Comparing this with the `ReadableStream` underlying-source interface, which accepts `start`, `pull`, and `cancel` methods, we can see that `PullAudioOutputStream` is the better fit: its `read` method corresponds directly to the `pull` method of a `ReadableStream`.
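The difference in shape can be sketched without the Azure SDK. In this hypothetical example, `MockSource` stands in for a pull-style audio source (its `read` contract mirrors the one described above), and the two adapter functions show why the pull model maps more directly onto `ReadableStream`:

```javascript
// Illustrative stand-in for a pull-style source: the caller supplies a
// buffer, the source fills it and returns the number of bytes written.
class MockSource {
  constructor(chunks) {
    this.chunks = chunks;
  }
  read(buffer) {
    const next = this.chunks.shift();
    if (!next) return 0; // 0 bytes signals end of data
    new Uint8Array(buffer).set(next);
    return next.length;
  }
}

// Pull maps naturally onto ReadableStream's pull() callback:
function pullToReadable(source, chunkSize = 1024) {
  return new ReadableStream({
    pull(controller) {
      const buffer = new ArrayBuffer(chunkSize);
      const n = source.read(buffer);
      if (n > 0) controller.enqueue(new Uint8Array(buffer, 0, n));
      else controller.close();
    },
  });
}

// Push requires capturing the controller so that external write()/close()
// callbacks can feed the stream from the outside:
function pushToReadable(registerCallbacks) {
  let controller;
  const stream = new ReadableStream({
    start(c) {
      controller = c; // start() runs synchronously during construction
    },
  });
  registerCallbacks({
    write: (chunk) => controller.enqueue(new Uint8Array(chunk)),
    close: () => controller.close(),
  });
  return stream;
}
```

Both adapters work, but the pull adapter needs no escaped controller reference and naturally respects backpressure, since `pull` is only invoked when the consumer wants more data.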
## Implementation

For the implementation, we first set up a speech synthesizer as in our previous article. The only difference is that here we create a `PullAudioOutputStream`.
```javascript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const outputFormat =
  sdk.SpeechSynthesisOutputFormat.Audio16Khz128KBitRateMonoMp3;
const pullStream = sdk.PullAudioOutputStream.create();
const audioConfig = sdk.AudioConfig.fromStreamOutput(pullStream);
const speechConfig = sdk.SpeechConfig.fromSubscription(
  azureSubscriptionKey,
  azureServiceRegion
);
speechConfig.speechSynthesisVoiceName = voiceName;
speechConfig.speechSynthesisOutputFormat = outputFormat;
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

// Promisify the callback-based speakTextAsync API.
function speakTextAsync(synthesizer, text) {
  return new Promise((resolve, reject) => {
    synthesizer.speakTextAsync(
      text,
      function (result) {
        synthesizer.close();
        resolve(result);
      },
      function (err) {
        synthesizer.close();
        reject(err);
      }
    );
  });
}

await speakTextAsync(synthesizer, text);
```
Then we’ll create a `ReadableStream` from the `PullAudioOutputStream`. We only need to implement `pull` and `cancel`. Since the `read` method requires an `ArrayBuffer`, we’ll create a new buffer for each read operation.
```javascript
const readableStream = new ReadableStream({
  async pull(controller) {
    try {
      const buffer = new ArrayBuffer(1024);
      const bytesRead = await pullStream.read(buffer);
      if (bytesRead > 0) {
        const chunk = new Uint8Array(buffer, 0, bytesRead);
        controller.enqueue(chunk);
      } else {
        // read() resolving with 0 bytes signals the end of the audio data.
        pullStream.close();
        controller.close();
      }
    } catch (error) {
      console.error("[Speech] Error reading from pull stream:", error);
      controller.error(error);
      pullStream.close();
    }
  },
  cancel() {
    pullStream.close();
  },
});
return readableStream;
```
One minor optimization we can make is to use the `type: "bytes"` and `autoAllocateChunkSize` options on the `ReadableStream`. This turns it into a byte stream whose controller can hand `pull` a pre-allocated buffer via `controller.byobRequest`; we fill that buffer in place and respond with the byte count, avoiding an extra allocation and copy per chunk.
```javascript
const readableStream = new ReadableStream({
  type: "bytes",
  autoAllocateChunkSize: 1024,
  async pull(controller) {
    try {
      const byobRequest = controller.byobRequest;
      if (byobRequest?.view) {
        // Fill the stream-provided buffer in place. This assumes the view
        // starts at offset 0, which holds for auto-allocated buffers.
        const buffer = byobRequest.view.buffer;
        const bytesRead = await pullStream.read(buffer);
        if (bytesRead > 0) {
          byobRequest.respond(bytesRead);
        } else {
          pullStream.close();
          controller.close();
          // Release the outstanding BYOB request so the pending read
          // resolves as done.
          byobRequest.respond(0);
        }
      } else {
        // Fallback: allocate our own buffer.
        const buffer = new ArrayBuffer(1024);
        const bytesRead = await pullStream.read(buffer);
        if (bytesRead > 0) {
          controller.enqueue(new Uint8Array(buffer, 0, bytesRead));
        } else {
          pullStream.close();
          controller.close();
        }
      }
    } catch (error) {
      console.error("[Speech] Error reading from pull stream:", error);
      controller.error(error);
      pullStream.close();
    }
  },
  cancel() {
    pullStream.close();
  },
});
return readableStream;
```
Then we can proceed to read from the `readableStream` as needed.
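As an illustration, a small helper (`collectBytes` is our own name, not part of any SDK) can drain such a stream into a single buffer, or the stream can be handed directly to a Fetch-style `Response`:

```javascript
// Drain a ReadableStream of Uint8Array chunks into one contiguous buffer,
// e.g. to write the synthesized MP3 to disk.
async function collectBytes(stream) {
  const reader = stream.getReader();
  const chunks = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}

// In a Fetch-API-style HTTP handler, the stream can instead be returned
// directly as the response body, e.g.:
//   return new Response(readableStream, {
//     headers: { "Content-Type": "audio/mpeg" },
//   });
```

The `Response` route is usually preferable on a server, since it streams audio to the client as it is synthesized instead of buffering it all in memory.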
## Conclusion
In this blog post, we explored how to convert Azure Text-to-Speech output into a web `ReadableStream`. By leveraging `PullAudioOutputStream` and the `ReadableStream` API, we can create a flexible and efficient solution for handling audio data in web applications. This approach keeps buffer handling efficient and enhances maintainability by providing a clear separation of concerns between the TTS provider’s interface and the API response format.