Language-Scaling/German/README.md (2 changes: 1 addition & 1 deletion)
@@ -22,7 +22,7 @@ When adapting Riva to a whole new language, a large amount of high-quality trans

For German, there are several significant sources of public datasets that we can readily leverage:

- - [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~ 26 Gbs.
+ - [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours, ~ 26 Gbs.
- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours, ~115 GBs.
- [Voxpopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours, 4.6 Gbs.

Language-Scaling/German/data_preparation/README.md (2 changes: 1 addition & 1 deletion)
@@ -9,7 +9,7 @@ When adapting Riva to a whole new language, a large amount of high-quality trans

For German, there are several significant sources of public datasets that we can readily leverage:

- - [Mozila Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
+ - [Mozilla Common Voice](https://commonvoice.mozilla.org/en/datasets) (MCV) corpus 7.0, `DE` subset: 571 hours
- [Multilingual LibriSpeech](http://www.openslr.org/94/) (MLS), `DE` subset: 1918 hours
- [VoxPopuli](https://ai.facebook.com/blog/voxpopuli-the-largest-open-multilingual-speech-corpus-for-ai-translation-and-more/), `DE` subset: 214 hours

@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

- # USAGE: python process_asr_text_tokenizer.py --manifest=<path to train manifest files, seperated by commas> \
+ # USAGE: python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
# --data_root="<output directory>" \
# --vocab_size=<number of tokens in vocabulary> \
# --tokenizer=<"spe" or "wpe"> \
@@ -45,7 +45,7 @@
# --tokenizer: Can be either spe or wpe . spe refers to the Google sentencepiece library tokenizer.
# wpe refers to the HuggingFace BERT Word Piece tokenizer.
#
- # --no_lower_case: When this flag is passed, it will force the tokenizer to create seperate tokens for
+ # --no_lower_case: When this flag is passed, it will force the tokenizer to create separate tokens for
# upper and lower case characters. By default, the script will turn all the text to lower case
# before tokenization (and if upper case characters are passed during training/inference, the
# tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the
@@ -65,7 +65,7 @@
# positive integer. By default, any negative value (default = -1) will use the entire dataset.
#
# --spe_train_extremely_large_corpus: When training a sentencepiece tokenizer on very large amounts of text,
- # sometimes the tokenizer will run out of memory or wont be able to process so much data on RAM.
+ # sometimes the tokenizer will run out of memory or won't be able to process so much data on RAM.
# At some point you might receive the following error - "Input corpus too large, try with
# train_extremely_large_corpus=true". If your machine has large amounts of RAM, it might still be possible
# to build the tokenizer using the above flag. Will silently fail if it runs out of RAM.
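
The flags documented above combine into a single command. A hypothetical invocation is sketched below; the manifest paths, output directory, and vocabulary size are placeholders, not values from this repository.

```bash
# Sketch only: manifest paths, output directory, and vocabulary size are placeholders.
python process_asr_text_tokenizer.py \
    --manifest=/data/de/mcv_train_manifest.json,/data/de/mls_train_manifest.json \
    --data_root=/data/de/tokenizer \
    --vocab_size=1024 \
    --tokenizer=spe \
    --no_lower_case \
    --spe_train_extremely_large_corpus

# --spe_train_extremely_large_corpus is only needed when SentencePiece training on the
# combined manifests would otherwise exhaust RAM, as the comments above describe.
```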
@@ -17,7 +17,7 @@
"\n",
"![png](./imgs/german-transfer-learning.PNG)\n",
"\n",
"We first demonstrate the training process with NeMo on 1 GPU in this notebook. To speed up training, multiple GPUs should be leveraged using the more efficient DDP (distributed data parallel) protocol, which must run in a seperate [training script](./train.py).\n",
"We first demonstrate the training process with NeMo on 1 GPU in this notebook. To speed up training, multiple GPUs should be leveraged using the more efficient DDP (distributed data parallel) protocol, which must run in a separate [training script](./train.py).\n",
"\n",
"This notebook can be run from within the NeMo container, such as:\n",
"\n",
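Purely as an illustration of launching this notebook from a NeMo container (the image tag, mount path, and port below are assumptions, not values taken from the notebook):

```bash
# Illustrative only: substitute an image tag and mount path that match your environment.
docker run --gpus all -it --rm \
    --shm-size=8g \
    -p 8888:8888 \
    -v "$(pwd)":/workspace/asr-german \
    nvcr.io/nvidia/nemo:23.06 \
    jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser --notebook-dir=/workspace/asr-german
```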
Language-Scaling/Hindi/README.md (2 changes: 1 addition & 1 deletion)
@@ -139,7 +139,7 @@ In addition to evaluating our model on the train test split, we've also evaluate
* [Hindi subtask-1](https://us.openslr.org/resources/103/subtask1_blindtest_wReadme.tar.gz) from [Interspeech MUCS Challenge](https://navana-tech.github.io/MUCS2021/challenge_details.html)
* etc.

- We've observed very competetive WER, as low as 12.78, on these blind test sets.
+ We've observed very competitive WER, as low as 12.78, on these blind test sets.

## 5. Riva Deployment

asr-improve-recognition-for-specific-words.md (2 changes: 1 addition & 1 deletion)
@@ -51,7 +51,7 @@ Word boosting provides a quick and temporary, on-the-spot adaptation for the mod

You will have to explicitly specify the list of boosted words at every request. Other adaptation methods such as custom vocabulary and lexicon mapping provide a more permanent solution, which affects every subsequent request.

- Pay attention to the followings while implementing word boosting:
+ Pay attention to the following while implementing word boosting:
- Word boosting can improve the chance of recognition of the desired words, but at the same time can increase false positives. As such, start with a small positive weight and gradually increase till you see positive effects. As a general guide, start with a boosted score of 20 and increase up to 100 if needed.
- Word boosting is most suitable as a temporary fix for a new situation. However, if you wish to use it as a permanent adaptation, you can attempt binary search for the boosted weights while monitoring the accuracy metrics on a test set. The accuracy metrics should include both the word error rate (WER) and/or a form of term error rate (TER) focusing on the terms of interest.
- Word Boosting is supported only with flashlight decoder.
deploy-aks.md (2 changes: 1 addition & 1 deletion)
@@ -2,7 +2,7 @@

# How do I Deploy Riva at Scale on Azure Cloud with AKS?

- This is an example of deploying and scaling Riva Speech Skills on Azure Cloud's Azure Kuberenetes Service (AKS)
+ This is an example of deploying and scaling Riva Speech Skills on Azure Cloud's Azure Kubernetes Service (AKS)
with Traefik-based load balancing. It includes the following steps:
1. Creating the AKS cluster
2. Deploying the Riva API service
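As a sketch of the first step, creating the AKS cluster, the Azure CLI can provision GPU-capable nodes; the resource group, cluster name, node count, and VM size below are illustrative assumptions rather than values from this guide.

```bash
# Illustrative only: resource group, cluster name, node count, and VM size are placeholders.
az aks create \
    --resource-group riva-rg \
    --name riva-aks \
    --node-count 2 \
    --node-vm-size Standard_NC6s_v3 \
    --generate-ssh-keys

# Fetch kubeconfig credentials so kubectl can reach the cluster for the later steps.
az aks get-credentials --resource-group riva-rg --name riva-aks
```

GPU scheduling typically also requires the NVIDIA device plugin (for example via the gpu-operator Helm chart) before the Riva API pods can be placed.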
deploy-gke.md (2 changes: 1 addition & 1 deletion)
@@ -2,7 +2,7 @@

# How do I Deploy Riva at Scale on Google Cloud with GKE?

- This is an example of deploying and scaling Riva Speech Skills on Google Cloud (GCP) Google Kuberenetes Engine (GKE)
+ This is an example of deploying and scaling Riva Speech Skills on Google Cloud (GCP) Google Kubernetes Engine (GKE)
with Traefik-based load balancing. It includes the following steps:
1. Creating the GKE cluster
2. Deploying the Riva API service
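As a sketch of the first step, creating the GKE cluster, the gcloud CLI can provision GPU-capable nodes; the cluster name, zone, node count, machine type, and accelerator below are illustrative assumptions rather than values from this guide.

```bash
# Illustrative only: cluster name, zone, node count, machine type, and GPU type are placeholders.
gcloud container clusters create riva-gke \
    --zone us-central1-a \
    --num-nodes 2 \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-t4,count=1

# Fetch kubeconfig credentials so kubectl can reach the cluster for the later steps.
gcloud container clusters get-credentials riva-gke --zone us-central1-a
```

On GKE, the NVIDIA driver installer DaemonSet typically has to be applied before GPU workloads such as the Riva API service can schedule.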