Task
Automatic Speech Recognition
~186k unlabeled Russian-language podcast episodes scraped from the web, packaged as Parquet shards with the audio bytes embedded. The audio has no transcripts, so this is a corpus for unsupervised or self-supervised use: ASR pre-training, speech-representation learning, TTS data mining, audio classification, and similar tasks.
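Because the audio is embedded as raw bytes, a first step is usually to check what a shard actually contains. The sketch below is a minimal illustration, not the dataset's official loader. It assumes `pyarrow` is installed, and it does not assume any column names (the schema is not documented here): it walks every column, collects any `bytes` values it finds (including those nested in struct columns), and guesses the audio container from well-known magic bytes. The function names and the shard path in the usage note are hypothetical.

```python
from collections import Counter
from typing import Any, Iterator, Optional


def sniff_audio_format(data: bytes) -> Optional[str]:
    """Guess the audio container from leading magic bytes; None if unknown."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    if data[4:8] == b"ftyp":  # ISO base media file (MP4 / M4A)
        return "mp4"
    if data[:3] == b"ID3":  # MP3 with an ID3v2 tag
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame sync
    return None


def bytes_in(value: Any) -> Iterator[bytes]:
    """Yield bytes values found in a cell, descending into struct (dict) cells."""
    if isinstance(value, bytes):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from bytes_in(child)


def summarize_shard(shard_path: str, batch_size: int = 4) -> Counter:
    """Count guessed audio formats in one Parquet shard.

    Assumption: any binary cell is treated as a candidate audio payload,
    since the real column layout is not known here.
    """
    import pyarrow.parquet as pq  # imported lazily; only needed for real shards

    counts: Counter = Counter()
    parquet_file = pq.ParquetFile(shard_path)
    # Small batches: each row may hold a full episode (many megabytes).
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        for column in batch.columns:
            for cell in column.to_pylist():
                for payload in bytes_in(cell):
                    counts[sniff_audio_format(payload) or "unknown"] += 1
    return counts
```

Calling `summarize_shard("shard-00000.parquet")` (a placeholder path) on a downloaded shard returns a `Counter` of guessed formats. Once the real audio column is known, passing `columns=[...]` to `iter_batches` avoids reading unrelated data.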