Name: speechcolab/gigaspeech2
Creator: speechcolab
License: ["apache-2.0"]
Keywords: huggingface, task_categories:automatic-speech-recognition, multilinguality:multilingual, language:th, language:id, language:vi, license:apache-2.0, size_categories:10M<n<100M, format:webdataset, automatic-speech-recognition

Dataset Card for GigaSpeech 2

Dataset Description

GigaSpeech 2 is an evolving, large-scale, multi-domain, multilingual ASR corpus focusing on low-resource languages. GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refine consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese.

Repository: https://github.com/SpeechColab/GigaSpeech2
Paper:…

See the full description on the dataset page: https://huggingface.co/datasets/speechcolab/gigaspeech2.
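As a rough illustration of how the metadata on this card fits together, the sketch below groups the `key:value` tags from the Keywords line and totals the refined-subset hours quoted in the description. The helper `parse_tags` and the `REFINED_HOURS` table are hypothetical names introduced only for this example; they are not part of the GigaSpeech 2 tooling, and the hour figures are simply those stated above.

```python
from collections import defaultdict

# Tag string copied from the Keywords line of this card.
KEYWORDS = (
    "huggingface, task_categories:automatic-speech-recognition, "
    "multilinguality:multilingual, language:th, language:id, language:vi, "
    "license:apache-2.0, size_categories:10M<n<100M, format:webdataset, "
    "automatic-speech-recognition"
)


def parse_tags(keywords: str) -> dict[str, list[str]]:
    """Group 'key:value' tags by key; bare tags are stored under the key ''."""
    grouped = defaultdict(list)
    for tag in keywords.split(","):
        tag = tag.strip()
        if not tag:
            continue
        # Split on the first colon only, so values keep any later punctuation.
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped[""].append(tag)
    return dict(grouped)


# Refined-subset hours as stated in the description, keyed by the
# ISO 639-1 codes used in the tags (th = Thai, id = Indonesian, vi = Vietnamese).
REFINED_HOURS = {"th": 10_000, "id": 6_000, "vi": 6_000}

tags = parse_tags(KEYWORDS)
languages = tags["language"]
total_refined = sum(REFINED_HOURS[lang] for lang in languages)

print(languages)
print(total_refined)
```

Summing the three per-language figures gives the total size of the refined subset implied by the description, which can be compared against the roughly 30,000 hours quoted for the raw subset.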

Dataset details