Blueprints | Mozilla.ai

Focus

Decision

Rationale

Alternatives Considered

Trade-offs

Overall Motivation

Decision: Build a local-focused workflow for fine-tuning Speech-to-Text (STT) models on your own data or on the Common Voice dataset.

Rationale: Lets users fine-tune an STT model for their own needs or for low-resource languages while keeping their data private. The fine-tuned model can also be run locally and privately as an STT service.

Alternatives Considered: Models fine-tuned on low-resource languages already exist on Hugging Face. Users could download and try these, or use another STT service or tool, instead of fine-tuning on their own data or on Common Voice.

Trade-offs: Existing fine-tuned models may have been trained on larger, more diverse datasets, so they may perform better across different environments and use cases. However, not every language has a fine-tuned model, and the models that do exist may not perform well enough. Fine-tuning on your own voice data produces a more personalized, use-case-specific model that may perform better for that use case.

Model Selection

Decision: openai/Whisper

Rationale: Open source under the MIT license and easy to integrate, with a large community and good support around it. Among the top 5 on the Hugging Face ASR leaderboard as of February 2025. Available in multiple sizes, which makes it easy to switch models depending on the available hardware.

Alternatives Considered: facebook/w2v-bert-2.0, meta/mms

Trade-offs: Whisper models, especially the larger ones, require considerable computational resources and might not run efficiently on every local setup. Whisper-tiny and Whisper-small, however, are suited to low-compute machines.
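The size trade-off above can be sketched as a small selection helper. This is only an illustration, not part of the Blueprint: `WhisperCheckpoint` and `pick_checkpoint` are hypothetical names, the parameter counts are approximate, and the memory figures are rough inference-time estimates that should be treated as assumptions (fine-tuning needs noticeably more memory, which the `headroom` factor is meant to account for).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperCheckpoint:
    model_id: str          # Hugging Face Hub identifier
    approx_params_m: int   # approximate parameter count, in millions
    approx_mem_gb: float   # rough inference memory estimate (assumption)


# Ordered from smallest to largest; memory values are ballpark figures only.
CHECKPOINTS = [
    WhisperCheckpoint("openai/whisper-tiny", 39, 1.0),
    WhisperCheckpoint("openai/whisper-base", 74, 1.0),
    WhisperCheckpoint("openai/whisper-small", 244, 2.0),
    WhisperCheckpoint("openai/whisper-medium", 769, 5.0),
    WhisperCheckpoint("openai/whisper-large-v3", 1550, 10.0),
]


def pick_checkpoint(available_mem_gb: float, headroom: float = 1.0) -> WhisperCheckpoint:
    """Return the largest checkpoint whose estimate (times headroom) fits.

    Falls back to the smallest checkpoint when nothing fits.
    """
    best = CHECKPOINTS[0]
    for ckpt in CHECKPOINTS:
        if ckpt.approx_mem_gb * headroom <= available_mem_gb:
            best = ckpt
    return best
```

For example, `pick_checkpoint(4.0)` selects `openai/whisper-small`, while passing a larger `headroom` (say `3.0`, an arbitrary value) biases the choice toward smaller checkpoints when the memory budget also has to cover training.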

Voice Dataset for Fine-tuning

Decision: Common Voice

Rationale: An open-source, diverse collection of voice samples in multiple languages. One of the best STT datasets available for low-resource languages.

Alternatives Considered: None.

Trade-offs: n/a
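To make the dataset choice more concrete, here is a minimal sketch of pairing audio clips with their transcripts from a Common Voice-style TSV file. It assumes the file has `path` and `sentence` columns and that clips live in a `clips` directory, as in Common Voice downloads. The helper name `iter_clips` and the sample rows are hypothetical and are not taken from the Blueprint.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_clips(tsv_file: TextIO, clips_dir: str = "clips") -> Iterator[dict]:
    """Yield {"audio_path", "text"} pairs from a Common Voice-style TSV."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("sentence") or "").strip()
        path = (row.get("path") or "").strip()
        if not text or not path:
            continue  # skip rows with no transcript or no clip name
        yield {"audio_path": f"{clips_dir}/{path}", "text": text}


# Tiny in-memory sample standing in for a real TSV file (made-up rows).
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "abc\tclip_001.mp3\tHello world\n"
    "def\tclip_002.mp3\t\n"
)
pairs = list(iter_clips(sample))
```

The same shape (one audio path plus one transcript per example) would also work for your own recordings, which is why the sketch keeps the column names as the only dataset-specific assumption.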

Fine-tuning Framework

Decision: hf-transformers

Rationale: Hugging Face’s Transformers library provides well-documented, actively maintained fine-tuning support and works with most open-source pre-trained models.

Alternatives Considered: SpeechBrain; NeMo (by NVIDIA)

Trade-offs: SpeechBrain and NeMo both provide instructions for fine-tuning on Common Voice. However, they are not as actively maintained as Transformers, have a steeper learning curve for beginners, and might not support as broad a family of models.

User Interface

Decision: Gradio

Rationale: A good option for voice-recording integration, and it integrates well with Hugging Face Spaces.

Alternatives Considered: Streamlit

Trade-offs: Streamlit is another option, but its built-in voice recording feature does not work out of the box with the Hugging Face Transformers library: the audio input needs a specific transformation before it can be fed to the STT model.
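The kind of transformation mentioned above might look roughly like the following sketch. The source does not spell out the exact steps, so the details here are assumptions: that the recorder hands back 16-bit PCM WAV bytes, and that the model expects mono audio at 16 kHz (which is what Whisper is trained on). The function name `wav_bytes_to_mono_16k` is hypothetical, and the linear-interpolation resampler is a deliberately simple stand-in for a proper resampling library.

```python
import array
import io
import sys
import wave

TARGET_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def wav_bytes_to_mono_16k(wav_bytes: bytes) -> list[float]:
    """Decode 16-bit PCM WAV bytes into mono floats in [-1, 1] at 16 kHz."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian

    # Average the channels down to mono and scale int16 to [-1.0, 1.0).
    mono = [
        sum(pcm[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(pcm), channels)
    ]
    if rate == TARGET_RATE or not mono:
        return mono

    # Linear-interpolation resampling: simple, but has no anti-aliasing filter.
    out_len = int(len(mono) * TARGET_RATE / rate)
    step = rate / TARGET_RATE
    out = []
    for n in range(out_len):
        pos = n * step
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, len(mono) - 1)]
        out.append(mono[i] * (1.0 - frac) + nxt * frac)
    return out
```

In a real app the returned samples would typically be wrapped in a float32 NumPy array and passed to the model together with the sampling rate. A dedicated audio library would likely give better resampling quality than this sketch.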