Software:Common Voice

Common Voice
Developer(s)	Mozilla Foundation
Initial release	June 19, 2017
Repository	github.com/common-voice/common-voice
Available in	Multilingual (List of languages)
License	Creative Commons CC0
Website	commonvoice.mozilla.org

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project relies on volunteers, who record sample sentences with a microphone and review other users' recordings. The transcribed sentences are collected in a voice database released under CC0, a public domain dedication.^[1] This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs.

Aims

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.^[2]

Voice database

The first dataset was released in November 2017. It contained 500 hours of English sentences recorded by more than 20,000 users worldwide.^[3]

In February 2019, the first multilingual batch was released for use. It covered 18 languages, including widely spoken ones such as English, French, German and Mandarin Chinese as well as less prevalent languages such as Welsh and Kabyle. In total, it comprised almost 1,400 hours of recorded voice data from more than 42,000 contributors.^[4]

By July 2020 the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers.^[5]

In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili.^[6]

At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language. 2,000 hours of voice recordings were collected.^[7]

In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the database.^[8]

As of October 2022, Mozilla Common Voice officially collects voice data for the following languages:^[9]

Abkhaz
Arabic
Armenian
Assamese
Asturian
Bashkir
Basaa
Basque
Belarusian
Bengali
Breton
Bulgarian
Catalan
Chinese (Cantonese and Mandarin varieties)
Chuvash
Czech
Danish
Dhivehi
Dutch
English
Esperanto
Erzya
Finnish
French
Frisian
Galician
Georgian
German
Greek
Guaraní
Hausa
Hakha Chin
Hindi
Hungarian
Indonesian
Interlingua
Irish
Italian
Japanese
Kabyle
Kazakh
Kinyarwanda
Korean
Kurdish (Central and Kurmanji varieties)
Kyrgyz
Latvian
Luganda
Macedonian
Malayalam
Maltese
Marathi
Mari (Meadow and Hill varieties)
Moksha
Mongolian
Nepali
Norwegian (Nynorsk)
Odia
Pashto
Persian
Polish
Portuguese
Punjabi
Romanian
Romansh (Sursilvan and Vallader varieties)
Russian
Sakha
Santali
Saraiki
Sardinian
Serbian
Slovenian
Spanish
Swahili
Swedish
Taiwanese Hokkien
Tamil
Tatar
Thai
Tigre
Tigrinya
Toki Pona
Twi
Turkish
Upper Sorbian
Ukrainian
Urdu
Uyghur
Uzbek
Vietnamese
Votic
Welsh

References

Original source: https://en.wikipedia.org/wiki/Common_Voice.

[1] "Mozilla Common Voice". https://commonvoice.mozilla.org/en/datasets.

[2] "Why do we gender AI? Voice tech firms move to be more inclusive". The Guardian. 11 January 2020. https://www.theguardian.com/technology/2020/jan/11/why-do-we-gender-ai-voice-tech-firms-move-to-be-more-inclusive. Retrieved 19 April 2020.

[3] "Announcing the Initial Release of Mozilla’s Open Source Speech Recognition Model and Voice Dataset". 29 November 2017. https://blog.mozilla.org/blog/2017/11/29/announcing-the-initial-release-of-mozillas-open-source-speech-recognition-model-and-voice-dataset.

[4] "Mozilla updates Common Voice dataset with 1,400 hours of speech across 18 languages". 28 February 2019. https://venturebeat.com/2019/02/28/mozilla-updates-common-voice-dataset-with-1400-hours-of-speech-across-19-languages.

[5] "Mozilla Common Voice updates will help train the ‘Hey Firefox’ wakeword for voice-based web browsing". 1 July 2020. https://venturebeat.com/2020/07/01/mozilla-common-voice-updates-will-help-train-the-hey-firefox-wakeword-for-voice-based-web-browsing/.

[6] "Mozilla Common Voice Receives $3.4 Million Investment to Democratize and Diversify Voice Tech in East Africa". 25 May 2021. https://foundation.mozilla.org/en/blog/mozilla-common-voice-receives-34-million-investment-to-democratize-and-diversify-voice-tech-in-east-africa/.

[7] "Bengali.AI: Democratising AI research in Bangla". 23 December 2022. https://www.tbsnews.net/features/panorama/bengaliai-democratising-ai-research-bangla-556458.

[8] Onukwue, Alexander (23 September 2022). "Ghana’s most popular language is now on Mozilla Common Voice". https://qz.com/ghana-s-most-popular-language-will-be-available-to-more-1849572359.

[9] "Languages". https://commonvoice.mozilla.org/en/languages.
