Last updated
2/18/2025
Get started
Finetune a speech recognition model for your voice
This Blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything can run locally, even on your laptop, so your data stays private. You can fine-tune a model using your own data or leverage the Common Voice dataset, a community-led project from Mozilla that supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.
Preview this Blueprint in action
Hosted demo
Step by step walkthrough
Tools used to create
Trusted open source tools used for this Blueprint

HuggingFace Transformers
Use HF Transformers to fine-tune the ASR model, and the HF Hub to load Common Voice.
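The fine-tuning flow can be sketched with these two libraries. This is a minimal sketch, not the Blueprint's exact code: the dataset id (mozilla-foundation/common_voice_17_0), the language code ("el"), the checkpoint, and all hyperparameters are illustrative assumptions, and the padding data collator a real run needs is omitted for brevity.

```python
# Sketch: load Common Voice, preprocess, and fine-tune Whisper with
# HF Transformers. Dataset id, language, and hyperparameters are
# illustrative assumptions, not the Blueprint's exact configuration.

def short_enough(example, max_seconds=30.0):
    """Whisper works on 30-second windows; drop clips longer than that."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= max_seconds

if __name__ == "__main__":
    # Heavy dependencies are imported here so the helper above stays standalone.
    from datasets import Audio, load_dataset
    from transformers import (
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        WhisperForConditionalGeneration,
        WhisperProcessor,
    )

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Common Voice audio ships at 48 kHz; Whisper expects 16 kHz.
    ds = load_dataset("mozilla-foundation/common_voice_17_0", "el", split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000)).filter(short_enough)

    def prepare(example):
        """Turn one clip + transcript into model inputs and label ids."""
        audio = example["audio"]
        example["input_features"] = processor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        example["labels"] = processor.tokenizer(example["sentence"]).input_ids
        return example

    ds = ds.map(prepare, remove_columns=ds.column_names)

    args = Seq2SeqTrainingArguments(
        output_dir="whisper-small-finetuned",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=1000,
    )
    # NB: a real run also needs a data collator that pads input_features
    # and labels to a uniform length; omitted here for brevity.
    Seq2SeqTrainer(model=model, args=args, train_dataset=ds).train()
```

The filter step matters because Whisper's feature extractor truncates everything to a 30-second log-mel window, so longer clips silently lose their tails.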
Choices
Insights into our motivations and key technical decisions throughout the development process.
Each choice below is listed with its focus, decision, rationale, alternatives considered, and trade-offs.

Focus: Overall Motivation
Decision: Build a local-focused workflow for fine-tuning Speech-to-Text models using your own data or the Common Voice dataset.
Rationale: Enables users to fine-tune an STT model for their own needs or for low-resource languages, while keeping their data private. It also lets users run the model as a local, private STT service.
Alternatives considered: Models fine-tuned on low-resource languages already exist on Hugging Face; users could download and try these, or use another STT service or tool, instead of fine-tuning on their own data or Common Voice.
Trade-offs: Existing fine-tuned models may have been trained on bigger, more diverse datasets, so they may perform better across different environments and use cases. However, not all languages have a fine-tuned model, and the ones that exist may not perform well. Fine-tuning a model on your own voice data produces a more personalized, use-case-specific model that may perform better for you.
Focus: Model Selection
Decision: openai/Whisper
Rationale: Open source with an MIT license and easy to implement. Large community and support around it. Top 5 on the HF ASR leaderboard as of February 2025. Multiple sizes available, making it easy to switch depending on available hardware.
Alternatives considered: facebook/w2v-bert-2.0; meta/mms
Trade-offs: Whisper models, especially the larger ones, require considerable computational resources and might not run efficiently on all local setups; however, Whisper tiny and small are low-compute friendly.
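The size-versus-hardware trade-off can be made concrete. pick_whisper_checkpoint below is a hypothetical helper, not part of the Blueprint, and its memory thresholds are rough assumptions; the transcription call itself uses the standard HF Transformers pipeline.

```python
# Hypothetical helper: map available memory to a Whisper checkpoint.
# The GB thresholds are rough assumptions, not official requirements.

def pick_whisper_checkpoint(memory_gb: float) -> str:
    if memory_gb < 2:
        return "openai/whisper-tiny"
    if memory_gb < 4:
        return "openai/whisper-small"
    if memory_gb < 8:
        return "openai/whisper-medium"
    return "openai/whisper-large-v3"

if __name__ == "__main__":
    # Transcribe a local file with the chosen checkpoint.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=pick_whisper_checkpoint(2.0),
    )
    print(asr("recording.wav")["text"])  # "recording.wav" is a placeholder path
```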
Focus: Voice Dataset for Fine-tuning
Decision: Common Voice
Rationale: Open-source, diverse collection of voice samples in multiple languages. One of the best STT datasets available for low-resource languages.
Alternatives considered: None were considered.
Trade-offs: n/a
Focus: Fine-tuning Framework
Decision: hf-transformers
Rationale: Hugging Face's transformers library provides well-documented fine-tuning support that is actively maintained and supports most open-source pre-trained models.
Alternatives considered: SpeechBrain; NVIDIA NeMo
Trade-offs: SpeechBrain and NeMo both provide instructions for fine-tuning on Common Voice; however, they are not as actively maintained as Transformers, have a steeper learning curve for beginners, and might not support as broad a family of models.
Focus: User Interface
Decision: Gradio
Rationale: Good option for voice recording and for integration with HF Spaces.
Alternatives considered: Streamlit
Trade-offs: Streamlit is another option, but its built-in voice recording feature doesn't work out of the box with the HF Transformers library: the audio input needs a specific transformation before being fed to the STT model.
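A minimal sketch of a Gradio microphone demo, assuming Gradio 4's gr.Audio(sources=["microphone"]) component. Gradio's microphone input returns a (sample_rate, int16 numpy array) tuple; the int16-to-float32 conversion in to_model_input is the kind of transformation referred to above.

```python
# Sketch: record from the microphone in Gradio and transcribe with Whisper.
# Gradio's numpy audio output is (sample_rate, int16 samples); the HF
# pipeline wants a float32 mono waveform in [-1, 1].
import numpy as np

def to_model_input(sample_rate, samples):
    """Normalize int16 PCM to float32 in [-1, 1] and downmix stereo to mono."""
    waveform = samples.astype(np.float32) / 32768.0
    if waveform.ndim == 2:  # (frames, channels) -> mono
        waveform = waveform.mean(axis=1)
    return {"sampling_rate": sample_rate, "raw": waveform}

if __name__ == "__main__":
    import gradio as gr
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

    def transcribe(audio):
        sample_rate, samples = audio
        return asr(to_model_input(sample_rate, samples))["text"]

    gr.Interface(
        fn=transcribe,
        inputs=gr.Audio(sources=["microphone"]),
        outputs="text",
    ).launch()
```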
Ready? Try it yourself!
Explore Blueprints Extensions
See examples of extended Blueprints that unlock new capabilities and adjusted configurations for tailored solutions, or try it yourself.