Focus
Decision
Rationale
Alternatives Considered
Trade-offs
Overall Motivation
Decision: Build a local-focused workflow for fine-tuning speech-to-text (STT) models on your own data or the Common Voice dataset.
Rationale: Lets users fine-tune an STT model for their own needs or for low-resource languages while keeping their data private. Users can also run the resulting model as a local, private STT service.
Alternatives Considered: Models fine-tuned for low-resource languages already exist on Hugging Face. Users could download and try these, or use another STT service or tool, instead of fine-tuning on their own data or Common Voice.
Trade-offs: Existing fine-tuned models might be trained on bigger, more diverse datasets, so they might perform better across different environments and use cases. However, not every language has a fine-tuned model, and the models that exist might not perform well enough. Fine-tuning a model on your own voice data produces a more personalized, use-case-specific model that might perform better.
Model Selection
Decision: openai/Whisper
Rationale: Open source under the MIT license and easy to implement. It has a big community and good support, ranks in the top 5 of the HF ASR leaderboard as of Feb 2025, and comes in multiple sizes, which makes it easy to switch models depending on the available hardware.
Alternatives Considered: facebook/w2v-bert-2.0, meta/mms
Trade-offs: Whisper models, especially the larger ones, require considerable computational resources and might not run efficiently on every local setup. However, Whisper-tiny and Whisper-small are low-compute friendly.
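The size trade-off in this row can be illustrated with a small helper that picks the largest Whisper checkpoint fitting a memory budget. This is only a sketch under stated assumptions: the checkpoint names are the public Hugging Face ids, the memory figures are rough inference-time estimates assumed for illustration (fine-tuning needs considerably more), and `pick_whisper_checkpoint` is a hypothetical helper, not part of any library.

```python
# Rough memory needs per checkpoint in GB, ordered from smallest to largest.
# These numbers are assumptions for illustration only; measure on your own
# hardware before relying on them.
APPROX_MEMORY_GB = {
    "openai/whisper-tiny": 1,
    "openai/whisper-base": 1,
    "openai/whisper-small": 2,
    "openai/whisper-medium": 5,
    "openai/whisper-large-v3": 10,
}


def pick_whisper_checkpoint(available_gb: float) -> str:
    """Return the largest checkpoint whose estimated need fits the budget."""
    chosen = None
    # Dicts preserve insertion order, so this walks from smallest to largest
    # and keeps the last checkpoint that still fits.
    for name, needed_gb in APPROX_MEMORY_GB.items():
        if needed_gb <= available_gb:
            chosen = name
    if chosen is None:
        raise ValueError(f"No Whisper checkpoint fits in {available_gb} GB")
    return chosen


if __name__ == "__main__":
    print(pick_whisper_checkpoint(4))
```

The returned id could then be passed wherever the workflow expects a model name, for example as the `model` argument of a transformers `pipeline("automatic-speech-recognition", ...)` call.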
Voice Dataset for Fine-tuning
Decision: Common Voice
Rationale: Open-source, diverse collection of voice samples in multiple languages. One of the best STT datasets available for low-resource languages.
Alternatives Considered: Didn’t consider any alternatives.
Trade-offs: n/a
Fine-tuning Framework
Decision: hf-transformers
Rationale: Hugging Face’s transformers library is actively maintained, provides well-documented fine-tuning support, and works with most open-source pre-trained models.
User Interface
Decision: Gradio
Rationale: Integrates well with voice recording and with HF Spaces.
Alternatives Considered: Streamlit
Trade-offs: Streamlit is another option, but its built-in voice recording feature doesn’t work out of the box with the HF Transformers library: the audio input needs a specific transformation before it can be fed to the STT model.
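The source does not spell out which transformation the Streamlit audio input needs. One plausible reading, assumed here, is that the recorder hands back WAV bytes that have to be decoded into mono float samples plus a sampling rate. The sketch below shows that step using only the Python standard library; `wav_bytes_to_mono_floats` is a hypothetical helper name, and it only handles 16-bit PCM.

```python
import io
import struct
import wave


def wav_bytes_to_mono_floats(wav_bytes: bytes) -> tuple:
    """Decode 16-bit PCM WAV bytes into (mono samples in [-1, 1], sample rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("This sketch only handles 16-bit PCM audio")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Unpack little-endian signed 16-bit integers (interleaved by channel).
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)

    # Average the channels of each frame to get mono, then scale to floats.
    mono = [
        sum(ints[i:i + channels]) / channels / 32768.0
        for i in range(0, len(ints), channels)
    ]
    return mono, sample_rate
```

Under the same assumption, the samples could be wrapped in a NumPy float32 array and passed to a transformers ASR pipeline as `{"raw": array, "sampling_rate": sample_rate}`, which leaves any resampling to the pipeline.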