Vosk API Training
This directory contains scripts and tools for training speech recognition models using the Kaldi toolkit.
Overview
The scripts here train custom speech recognition models with Kaldi. They cover acoustic model training, language model creation, and decoding.
Directory Structure
.
├── cmd.sh # Command configuration for training and decoding
├── conf/
│ ├── mfcc.conf # Configuration for MFCC feature extraction
│ └── online_cmvn.conf # Online Cepstral Mean Variance Normalization (currently empty)
├── local/
│ ├── chain/
│ │ ├── run_ivector_common.sh # Script for i-vector extraction during chain model training
│ │ └── run_tdnn.sh # Script for training a TDNN model
│ ├── data_prep.sh # Data preparation script for creating Kaldi data directories
│ ├── download_and_untar.sh # Script for downloading and extracting datasets
│ ├── download_lm.sh # Downloads language models
│ ├── prepare_dict.sh # Prepares the pronunciation dictionary
│ └── score.sh # Scoring script for evaluation
├── path.sh # Script for setting Kaldi paths
├── RESULTS # Script for printing the best WER results
├── RESULTS.txt # Contains WER results from decoding
├── run.sh # Main script for the entire training pipeline
├── steps -> ../../wsj/s5/steps/ # Link to Kaldi’s WSJ steps for acoustic model training
└── utils -> ../../wsj/s5/utils/ # Link to Kaldi’s utility scripts
Key Files:
- cmd.sh: Defines commands for running training and decoding tasks.
- path.sh: Sets up paths for Kaldi binaries and scripts.
- run.sh: Main entry point for the training pipeline, running tasks in stages.
- RESULTS: Displays Word Error Rate (WER) for the trained models.
Installation
Prerequisites
- Kaldi: Kaldi toolkit must be installed and configured.
- Required tools: ffmpeg, sox, and sctk for data preparation and scoring.
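Before starting, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the recipe: the helper name missing_tools is hypothetical, and looking for sclite (the scoring binary shipped with SCTK) is an assumption about which SCTK command the scripts call.

```shell
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumption: sclite stands in for the SCTK toolkit here.
missing=$(missing_tools ffmpeg sox sclite)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are joined onto one line.
  echo "Missing tools:" $missing >&2
else
  echo "All required tools found."
fi
```

The check only reports what is missing and never aborts, so it is safe to run before any tool is installed.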
Steps
- Clone the Vosk API repository.
- Install Kaldi and ensure that KALDI_ROOT is correctly set in path.sh.
- Set environment variables using cmd.sh and path.sh.
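A quick sanity check on KALDI_ROOT can catch a wrong path before training starts. This sketch assumes only that a Kaldi checkout contains src/ and tools/ subdirectories. The function name is_kaldi_root and the fallback path /opt/kaldi are hypothetical and not taken from path.sh.

```shell
# Return success only if the given directory looks like a Kaldi checkout.
# Assumption: a Kaldi tree has src/ and tools/ subdirectories.
is_kaldi_root() {
  [ -d "$1/src" ] && [ -d "$1/tools" ]
}

# KALDI_ROOT is normally exported by path.sh; the fallback is a placeholder.
KALDI_ROOT=${KALDI_ROOT:-/opt/kaldi}
if is_kaldi_root "$KALDI_ROOT"; then
  echo "KALDI_ROOT looks valid: $KALDI_ROOT"
else
  echo "KALDI_ROOT does not look like a Kaldi tree: $KALDI_ROOT" >&2
fi
```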
Training Process
Data Preparation
Run the data preparation stage in run.sh:
bash run.sh --stage 0 --stop_stage 0
This stage downloads and prepares the LibriSpeech dataset.
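The --stage and --stop_stage flags select a range of stages, so passing the same number to both runs exactly one stage. The sketch below shows how such gating is commonly written in Kaldi-style recipes. Whether run.sh uses exactly this test is an assumption, and should_run is a hypothetical helper.

```shell
# Decide whether a numbered stage falls inside the requested range.
# $1 = stage number, $2 = --stage value, $3 = --stop_stage value
should_run() {
  [ "$2" -le "$1" ] && [ "$3" -ge "$1" ]
}

stage=0
stop_stage=0
for n in 0 1 2 3 4 5; do
  if should_run "$n" "$stage" "$stop_stage"; then
    echo "stage $n: run"
  else
    echo "stage $n: skip"
  fi
done
```

With stage=0 and stop_stage=0, only stage 0 is reported as run. Widening the range (for example stage=3, stop_stage=5) would select several consecutive stages under the same assumption.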
Dictionary Preparation
Prepare the pronunciation dictionary with:
bash run.sh --stage 1 --stop_stage 1
This step generates the necessary files for Kaldi's prepare_lang.sh script.
MFCC Feature Extraction
Run the MFCC extraction process:
bash run.sh --stage 2 --stop_stage 2
This step extracts Mel-frequency cepstral coefficient (MFCC) features and computes cepstral mean and variance normalization (CMVN) statistics.
Acoustic Model Training
Train monophone, LDA+MLLT, and SAT models:
bash run.sh --stage 3 --stop_stage 3
This stage trains GMM-based models and aligns the data for TDNN training.
TDNN Chain Model Training
Train a Time-Delay Neural Network (TDNN) chain model:
bash run.sh --stage 4 --stop_stage 4
The chain model uses i-vectors for speaker adaptation.
Decoding
After training, decode the test data:
bash run.sh --stage 5 --stop_stage 5
This step decodes using the trained model and evaluates the Word Error Rate (WER).
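For reference, WER is the number of word errors (insertions, deletions, and substitutions) divided by the number of words in the reference transcripts. The sketch below reproduces that arithmetic with awk. The helper name wer is hypothetical, and the counts come from the first sample line in the Results section of this README.

```shell
# WER (%) = 100 * (insertions + deletions + substitutions) / reference words
# $1 = insertions, $2 = deletions, $3 = substitutions, $4 = reference words
wer() {
  awk -v i="$1" -v d="$2" -v s="$3" -v n="$4" \
    'BEGIN { printf "%.2f\n", 100 * (i + d + s) / n }'
}

wer 214 487 2138 20138   # prints 14.10
```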
Results
Print the best WER results by running:
bash RESULTS
Example of RESULTS.txt:
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0
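To pick the lowest WER out of a file in this format, the second field of each %WER line can be compared numerically. This is a minimal sketch and not the RESULTS script itself. The function name best_wer is hypothetical, and it assumes every result line starts with %WER followed by the rate.

```shell
# Print the "%WER" line with the smallest error rate (field 2).
# Reads from the files given as arguments, or from stdin if there are none.
best_wer() {
  awk '$1 == "%WER" && (best == "" || $2 + 0 < min) { min = $2 + 0; best = $0 }
       END { if (best != "") print best }' "$@"
}

best_wer <<'EOF'
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0
EOF
```

On the two sample lines this prints the decode_test_rescore entry, since 12.67 is the lower rate. With a file on disk, the call would be best_wer RESULTS.txt.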