From e1597c9496404cf28aaabd243b7cbeeb66034cdf Mon Sep 17 00:00:00 2001
From: Brian McBrayer
Date: Fri, 21 Nov 2025 15:59:37 -0500
Subject: [PATCH] Added several typo fixes

---
 Language-Scaling/German/README.md | 2 +-
 Language-Scaling/German/data_preparation/README.md | 2 +-
 .../data_preparation/scripts/process_asr_text_tokenizer.py | 6 +++---
 .../German/training/1-train-acoustic-model.ipynb | 2 +-
 Language-Scaling/Hindi/README.md | 2 +-
 asr-improve-recognition-for-specific-words.md | 2 +-
 deploy-aks.md | 2 +-
 deploy-gke.md | 2 +-
 8 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/Language-Scaling/German/README.md b/Language-Scaling/German/README.md
index 59bf7a40..9c6b0c17 100755
--- a/Language-Scaling/German/README.md
+++ b/Language-Scaling/German/README.md
@@ -22,7 +22,7 @@ When adapting Riva to a whole new language, a large amount of high-quality trans
 
 For German, there are several significant sources of public datasets that we can readily leverage:
 
-- [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~ 26 Gbs.
+- [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~ 26 Gbs.
 - [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours, ~115 GBs.
 - [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours, 4.6 Gbs.
 
diff --git a/Language-Scaling/German/data_preparation/README.md b/Language-Scaling/German/data_preparation/README.md
index c288f330..86cb56d0 100755
--- a/Language-Scaling/German/data_preparation/README.md
+++ b/Language-Scaling/German/data_preparation/README.md
@@ -9,7 +9,7 @@ When adapting Riva to a whole new language, a large amount of high-quality trans
 
 For German, there are several significant sources of public datasets that we can readily leverage:
 
-- [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
+- [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
 - [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours
 - [VoxPopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours
 
diff --git a/Language-Scaling/German/data_preparation/scripts/process_asr_text_tokenizer.py b/Language-Scaling/German/data_preparation/scripts/process_asr_text_tokenizer.py
index 21255f3e..1289bfa6 100644
--- a/Language-Scaling/German/data_preparation/scripts/process_asr_text_tokenizer.py
+++ b/Language-Scaling/German/data_preparation/scripts/process_asr_text_tokenizer.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# USAGE: python process_asr_text_tokenizer.py --manifest=<path to train manifest files, seperated by commas> \
+# USAGE: python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
 # --data_root="" \
 # --vocab_size= \
 # --tokenizer=<"spe" or "wpe"> \
@@ -45,7 +45,7 @@
 # --tokenizer: Can be either spe or wpe . spe refers to the Google sentencepiece library tokenizer.
 # wpe refers to the HuggingFace BERT Word Piece tokenizer.
 #
-# --no_lower_case: When this flag is passed, it will force the tokenizer to create seperate tokens for
+# --no_lower_case: When this flag is passed, it will force the tokenizer to create separate tokens for
 # upper and lower case characters. By default, the script will turn all the text to lower case
 # before tokenization (and if upper case characters are passed during training/inference, the
 # tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the
@@ -65,7 +65,7 @@
 # positive integer. By default, any negative value (default = -1) will use the entire dataset.
 #
 # --spe_train_extremely_large_corpus: When training a sentencepiece tokenizer on very large amounts of text,
-# sometimes the tokenizer will run out of memory or wont be able to process so much data on RAM.
+# sometimes the tokenizer will run out of memory or won't be able to process so much data on RAM.
 # At some point you might receive the following error - "Input corpus too large, try with
 # train_extremely_large_corpus=true". If your machine has large amounts of RAM, it might still be possible
 # to build the tokenizer using the above flag. Will silently fail if it runs out of RAM.
diff --git a/Language-Scaling/German/training/1-train-acoustic-model.ipynb b/Language-Scaling/German/training/1-train-acoustic-model.ipynb
index 231a4457..915a92bd 100755
--- a/Language-Scaling/German/training/1-train-acoustic-model.ipynb
+++ b/Language-Scaling/German/training/1-train-acoustic-model.ipynb
@@ -17,7 +17,7 @@
 "\n",
 "![png](./imgs/german-transfer-learning.PNG)\n",
 "\n",
- "We first demonstrate the training process with NeMo on 1 GPU in this notebook. To speed up training, multiple GPUs should be leveraged using the more efficient DDP (distributed data parallel) protocol, which must run in a seperate [training script](./train.py).\n",
+ "We first demonstrate the training process with NeMo on 1 GPU in this notebook. To speed up training, multiple GPUs should be leveraged using the more efficient DDP (distributed data parallel) protocol, which must run in a separate [training script](./train.py).\n",
 "\n",
 "This notebook can be run from within the NeMo container, such as:\n",
 "\n",
diff --git a/Language-Scaling/Hindi/README.md b/Language-Scaling/Hindi/README.md
index 92e4b9a7..654a0225 100644
--- a/Language-Scaling/Hindi/README.md
+++ b/Language-Scaling/Hindi/README.md
@@ -139,7 +139,7 @@ In addition to evaluating our model on the train test split, we've also evaluate
 * [Hindi subtask-1](https://us.openslr.org/resources/103/subtask1_blindtest_wReadme.tar.gz) from [Interspeech MUCS Challenge](https://navana-tech.github.io/MUCS2021/challenge_details.html)
 * etc.
 
-We've observed very competetive WER, as low as 12.78, on these blind test sets.
+We've observed very competitive WER, as low as 12.78, on these blind test sets.
 
 ## 5. Riva Deployment
 
diff --git a/asr-improve-recognition-for-specific-words.md b/asr-improve-recognition-for-specific-words.md
index dadc3add..f86586eb 100755
--- a/asr-improve-recognition-for-specific-words.md
+++ b/asr-improve-recognition-for-specific-words.md
@@ -51,7 +51,7 @@ Word boosting provides a quick and temporary, on-the-spot adaptation for the mod
 
 You will have to explicitly specify the list of boosted words at every request. Other adaptation methods such as custom vocabulary and lexicon mapping provide a more permanent solution, which affects every subsequent request.
 
-Pay attention to the followings while implementing word boosting:
+Pay attention to the following while implementing word boosting:
 - Word boosting can improve the chance of recognition of the desired words, but at the same time can increase false positives. As such, start with a small positive weight and gradually increase till you see positive effects. As a general guide, start with a boosted score of 20 and increase up to 100 if needed.
 - Word boosting is most suitable as a temporary fix for a new situation. However, if you wish to use it as a permanent adaptation, you can attempt binary search for the boosted weights while monitoring the accuracy metrics on a test set. The accuracy metrics should include both the word error rate (WER) and/or a form of term error rate (TER) focusing on the terms of interest.
 - Word Boosting is supported only with flashlight decoder.
diff --git a/deploy-aks.md b/deploy-aks.md
index 255874ef..d69cadae 100644
--- a/deploy-aks.md
+++ b/deploy-aks.md
@@ -2,7 +2,7 @@
 
 # How do I Deploy Riva at Scale on Azure Cloud with AKS?
 
-This is an example of deploying and scaling Riva Speech Skills on Azure Cloud's Azure Kuberenetes Service (AKS)
+This is an example of deploying and scaling Riva Speech Skills on Azure Cloud's Azure Kubernetes Service (AKS)
 with Traefik-based load balancing. It includes the following steps:
 1. Creating the AKS cluster
 2. Deploying the Riva API service
diff --git a/deploy-gke.md b/deploy-gke.md
index 648631a3..633dc11b 100644
--- a/deploy-gke.md
+++ b/deploy-gke.md
@@ -2,7 +2,7 @@
 
 # How do I Deploy Riva at Scale on Google Cloud with GKE?
 
-This is an example of deploying and scaling Riva Speech Skills on Google Cloud (GCP) Google Kuberenetes Engine (GKE)
+This is an example of deploying and scaling Riva Speech Skills on Google Cloud (GCP) Google Kubernetes Engine (GKE)
 with Traefik-based load balancing. It includes the following steps:
 1. Creating the GKE cluster
 2. Deploying the Riva API service