Concatenating Ogg Vorbis (.ogg) audio files on the frontend
Background
When working with the Azure Text-to-Speech API, there is no size limit for the input text, but the resulting output audio is limited to a maximum length of 10 minutes. If the input exceeds this limit, the output audio will be truncated, causing unexpected issues. This limitation is particularly problematic for scenarios involving long texts, such as whole chapters from ePub ebooks, which often surpass the threshold and result in undesired truncation.
Working around the output limit
An easy workaround for this limitation is to split the input text into chunks that each produce less than 10 minutes of audio, call the Text-to-Speech service once per chunk, and concatenate the resulting audio files into a single file. However, calculating the exact output audio length for a given plaintext input can be challenging, especially when dealing with bilingual texts like Chinese and English. In English, words are separated by spaces, making it relatively straightforward to estimate the length by counting the number of words. In Chinese, however, characters and words are not separated by spaces, which makes the estimate more complex.
Fortunately, in this particular case, the input text consists of well-formatted paragraphs with new lines (or <br> and <p> tags in the case of HTML input). A simple approach would be to split the text by newline characters (\n) and merge the parts until the maximum character length is reached. We can set a more conservative maximum character length here, as having the audio files shorter than 10 minutes will not affect the final concatenated result.
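A minimal sketch of this split-and-merge step might look like the following. The function name splitIntoChunks and the maxChars parameter are my own illustrative choices, and the right character limit has to be found by experiment for your voice and language mix.

```javascript
// Split text on newlines and greedily merge paragraphs into chunks of at
// most maxChars characters. maxChars is an assumed, tunable limit that
// should stay well below whatever yields 10 minutes of audio.
function splitIntoChunks(text, maxChars) {
  const paragraphs = text
    .split('\n')
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks = [];
  let current = '';
  for (const paragraph of paragraphs) {
    // +1 accounts for the newline that joins two paragraphs.
    const mergedLength = current
      ? current.length + 1 + paragraph.length
      : paragraph.length;
    if (current && mergedLength > maxChars) {
      // Adding this paragraph would exceed the limit: close the chunk.
      chunks.push(current);
      current = paragraph;
    } else {
      current = current ? `${current}\n${paragraph}` : paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note that a single paragraph longer than maxChars still ends up as one oversized chunk in this sketch, so very long paragraphs would need an extra split (for example by sentence).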
By utilizing this method, the Azure Text-to-Speech API can be called with these split text sections. When making the API call, it’s recommended to select OGG as the audio output type, as OGG files are smaller in size compared to MP3 files. Following these steps, you will have a collection of OGG files, each shorter than 10 minutes.
Concatenating the audio on the frontend? A challenge
Concatenating audio files is relatively straightforward in a server environment, thanks to powerful tools like FFmpeg. It involves saving all the intermediate audio files on the server, merging them using FFmpeg, and then storing the resulting file until it is downloaded by the user.
However, in this project, a more challenging approach was taken: merging the OGG audio files on the frontend using JavaScript. This approach eliminates the need for server file storage, simplifies the architecture, reduces costs, and fits more seamlessly into the current API structure powered by Nuxt.js.
The challenge arises when trying to find a browser-side replacement for the powerful FFmpeg concat function.
Solution #1: Simple concatenation using cat
After conducting some research, it seems that the OGG specification allows for the direct concatenation of two OGG files to create a single OGG file with two logical streams. For example, in a Linux bash environment, one can simply use the cat command to achieve this.
One prerequisite of this approach is that each OGG file must have a unique stream serial number. Fortunately, it appears that the Azure Text-to-Speech API returns OGG files with randomized serial numbers. Kudos to Azure!
Implementing this in the browser is surprisingly simple using Blob.
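As a rough sketch, the browser equivalent of cat is a Blob built from the individual parts. The function name concatOggBlobs is illustrative, and the parts are assumed to be the Blob (or ArrayBuffer) responses already fetched from the Text-to-Speech calls.

```javascript
// Byte-for-byte concatenation of OGG parts, the browser equivalent of
// `cat part0.ogg part1.ogg > merged.ogg`. No decoding or remuxing happens.
function concatOggBlobs(parts) {
  return new Blob(parts, { type: 'audio/ogg' });
}

// Hypothetical usage in a page:
//   const merged = concatOggBlobs([firstBlob, secondBlob]);
//   audioElement.src = URL.createObjectURL(merged);
```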
However, despite the merged OGG file being playable, most audio players are unable to properly seek within this track. I tested Chrome, VLC player, and Audacity, and none of them could display the correct total duration or seek the track correctly. Since the ability to seek is crucial for our audio ebook use case, this solution is not acceptable.
Solution #2: Web Audio and MediaStream Recording API
Still looking for a way to concatenate OGG files, the next thing I did was ask ChatGPT, which suggested using the AudioContext provided by the Web Audio API. By using the decodeAudioData function to decode each input OGG file, the decoded audio can be concatenated using audioContext.createBuffer and AudioBuffer. It is important to note that this approach assumes that all audio files have the same number of channels and sample rate.
One issue remains: how to save the concatenated AudioBuffer as an OGG file. It turns out that another modern browser API, the MediaStream Recording API, comes in handy here. The AudioBuffer can be sent to a MediaRecorder, and the result can be saved as an OGG file after the recording ends. Here is a snippet of example code from MDN’s MediaRecorder page.
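The MDN snippet itself is not reproduced here. As a stand-in, this is my own rough sketch of how an AudioBuffer could be routed into a MediaRecorder; the function name recordAudioBuffer is hypothetical, and the code assumes a regular AudioContext (not an OfflineAudioContext) and a mimeType the browser actually supports.

```javascript
// Play an AudioBuffer into a MediaStream and record that stream.
// Recording happens in real time, so it takes as long as the audio itself.
function recordAudioBuffer(audioContext, audioBuffer, mimeType) {
  return new Promise((resolve, reject) => {
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;

    // Route the audio into a MediaStream instead of the speakers.
    const destination = audioContext.createMediaStreamDestination();
    source.connect(destination);

    const recorder = new MediaRecorder(destination.stream, { mimeType });
    const chunks = [];
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0) chunks.push(event.data);
    };
    recorder.onstop = () => {
      resolve(new Blob(chunks, { type: recorder.mimeType }));
    };
    recorder.onerror = (event) => reject(event.error);

    // Stop recording once the buffer has finished playing.
    source.onended = () => recorder.stop();

    recorder.start();
    source.start();
  });
}
```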
Unfortunately, Chrome does not support saving the file as OGG using MediaRecorder. In fact, only the .webm format is supported for audio. While sending an uncompressed .wav file is undesirable due to its large size, using .webm severely limits the compatibility of the resulting audio file. Chrome is too widely used for this limitation to be ignored, so I ultimately did not implement this approach.
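If you want to check this in your own target browsers, a small feature-detection sketch based on the static MediaRecorder.isTypeSupported method could look like this. The helper name pickRecorderMimeType and the candidate list in the comment are illustrative assumptions, and the result depends entirely on the browser it runs in.

```javascript
// Return the first MIME type the current browser can record, or null.
// isSupported defaults to MediaRecorder.isTypeSupported and is only a
// parameter so the logic can also be exercised outside a browser.
function pickRecorderMimeType(
  candidates,
  isSupported = (type) => MediaRecorder.isTypeSupported(type)
) {
  const match = candidates.find((type) => isSupported(type));
  return match === undefined ? null : match;
}

// Hypothetical usage in a page:
//   pickRecorderMimeType(['audio/ogg; codecs=opus', 'audio/webm; codecs=opus']);
```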
Final Solution: Bring FFmpeg to the browser side
Due to the absence of FFmpeg on the frontend, we attempted two alternative methods, both of which proved unsuccessful. However, what if we could actually use FFmpeg on the client side? This would resolve all the issues and provide a straightforward solution.
As it turns out, this is indeed possible with ffmpeg.wasm, thanks to powerful WebAssembly technologies that allow running FFmpeg in the browser.
By incorporating FFmpeg into the browser, the problem of concatenating OGG files simplifies to a single FFmpeg concat command. The resulting code would be similar to the following snippet. Please note that if you use vite as your build system (in my case, Nuxt 3), you may need to apply additional configuration for baseURL, coreURL, and wasmURL due to a CORS issue.
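The original snippet did not survive here, so the following is a reconstruction sketch rather than the exact code used in the project. It assumes the 0.12-style API of @ffmpeg/ffmpeg (writeFile, exec, readFile) and an ffmpeg instance that the caller has already created and loaded; the names concatOggFiles, buildConcatList, part0.ogg, list.txt, and output.ogg are all illustrative.

```javascript
// Build the list file consumed by FFmpeg's concat demuxer.
function buildConcatList(fileNames) {
  return fileNames.map((name) => `file '${name}'`).join('\n');
}

// `ffmpeg` is assumed to be an already loaded instance, for example:
//   import { FFmpeg } from '@ffmpeg/ffmpeg';
//   const ffmpeg = new FFmpeg();
//   await ffmpeg.load({ coreURL, wasmURL }); // URLs depend on your setup
async function concatOggFiles(ffmpeg, blobs) {
  const names = [];
  for (const [index, blob] of blobs.entries()) {
    const name = `part${index}.ogg`;
    // Copy each input into FFmpeg's in-memory file system.
    await ffmpeg.writeFile(name, new Uint8Array(await blob.arrayBuffer()));
    names.push(name);
  }
  await ffmpeg.writeFile('list.txt', buildConcatList(names));

  // Concat demuxer with stream copy (an assumption of this sketch);
  // drop '-c', 'copy' if the inputs turn out to need re-encoding.
  await ffmpeg.exec([
    '-f', 'concat', '-safe', '0',
    '-i', 'list.txt',
    '-c', 'copy',
    'output.ogg',
  ]);

  const data = await ffmpeg.readFile('output.ogg');
  return new Blob([data], { type: 'audio/ogg' });
}
```

The returned Blob can be handed to URL.createObjectURL to offer the merged file as a download.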
Once this function is called, the Web worker will handle the rest, and the concatenated OGG file will be available as a downloadable blob. Since FFmpeg recreates the metadata rather than simply concatenating the files, seeking and duration work correctly in the players I tested.
Additional remarks on FFmpeg license
It’s important to note the license issue associated with FFmpeg. FFmpeg is licensed under the LGPL by default, and builds that enable GPL-only components fall under the GPL instead. Shipping it to clients as WebAssembly is a form of distribution, so the obligations of whichever license applies to your build carry over to how your web application is distributed. While this may not be a significant concern for personal or open-source projects, it can be a blocker for closed-source products, so check the license of the exact build you ship.