speechcolab/gigaspeech2

Automatic Speech Recognition · speechcolab · 41.8K downloads
License: apache-2.0 · Size: 3.2 TB · Task categories: automatic-speech-recognition · Multilinguality: multilingual · Languages: th, id, vi

Dataset Card for GigaSpeech 2

Dataset Description

GigaSpeech 2 is an evolving, large-scale, multi-domain, and multilingual ASR corpus focusing on low-resource languages. GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refine consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese.

Repository: https://github.com/SpeechColab/GigaSpeech2
Paper:…

See the full description on the dataset page: https://huggingface.co/datasets/speechcolab/gigaspeech2.

# download instantly
mlforge datasets pull speechcolab/gigaspeech2
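
Because the listed size is 3.2 TB, it may be preferable to fetch one language at a time. The following is a minimal Python sketch of that idea, assuming the `huggingface_hub` package is installed and the dataset is pulled directly from its Hugging Face source. The `data/<lang>/*` glob is a hypothetical repository layout that this page does not confirm, and `build_download_kwargs` and `download_language` are illustrative helper names, not part of any library.

```python
from typing import Any

REPO_ID = "speechcolab/gigaspeech2"  # dataset id shown on this page
LANGUAGES = {"th", "id", "vi"}       # language codes listed in the tags


def build_download_kwargs(lang: str, local_dir: str = "gigaspeech2") -> dict[str, Any]:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    The allow_patterns glob is an ASSUMED directory layout; check the
    repository file listing and adjust it before downloading.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "allow_patterns": [f"data/{lang}/*"],  # hypothetical path pattern
        "local_dir": local_dir,
    }


def download_language(lang: str, local_dir: str = "gigaspeech2") -> str:
    """Download only the files matching one language; returns the local path."""
    # Imported lazily so build_download_kwargs works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(lang, local_dir))
```

Calling `download_language("th")` would then fetch only the files matching the Thai pattern, provided the assumed layout holds.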

Dataset details

Task
Automatic Speech Recognition
Language
th, id, vi
License
apache-2.0
Size
3.2 TB
Creator
speechcolab
Downloads
41.8K
Source
huggingface_datasets
Updated
2026-03-26