Symato/cc

General · Symato · 288.6K downloads
mit · 88 GB · language:vi · license:mit · size_categories:1K<n<10K · region:us

What is Symato CC? The goal is to download all WARC data from Common Crawl, then extract the Vietnamese content in Markdown and plaintext formats. About 1% of Common Crawl is Vietnamese, so extracting all of it should yield a lot of data (~10 TB of plaintext).
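The language-filtering step can be illustrated with a small sketch. This is not the Symato pipeline: it is a toy heuristic, assuming that text already extracted from a WARC record is available as a Python string. The names `vietnamese_score` and `is_vietnamese`, the choice of diacritics, and the 0.05 threshold are all hypothetical; a production filter would more likely use a trained language identifier.

```python
import unicodedata

# Combining marks that are rare outside Vietnamese (an assumption made for
# this sketch): horn (U+031B), hook above (U+0309), dot below (U+0323).
DISTINCTIVE_MARKS = {"\u031b", "\u0309", "\u0323"}


def vietnamese_score(text: str) -> float:
    """Return the fraction of letters that look specific to Vietnamese."""
    letters = 0
    hits = 0
    # Compose first so that each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text).lower():
        if not ch.isalpha():
            continue
        letters += 1
        # Decompose the letter to inspect its individual diacritics.
        marks = [c for c in unicodedata.normalize("NFD", ch)
                 if unicodedata.combining(c)]
        stacked = len(marks) >= 2  # e.g. circumflex plus a tone mark
        distinctive = any(m in DISTINCTIVE_MARKS for m in marks)
        if ch == "đ" or stacked or distinctive:
            hits += 1
    return hits / letters if letters else 0.0


def is_vietnamese(text: str, threshold: float = 0.05) -> bool:
    """Keep a document if enough of its letters look Vietnamese."""
    return vietnamese_score(text) >= threshold
```

For example, `is_vietnamese("Tiếng Việt là ngôn ngữ chính thức của Việt Nam")` is true, while English or French text scores zero because single accents such as é or ô are deliberately not counted. The heuristic will misclassify Vietnamese written without diacritics, which is one reason to treat it only as an illustration.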

# download instantly
mlforge datasets pull Symato/cc

Dataset details

Task: General
Language: vi
License: mit
Size: 88 GB
Creator: Symato
Downloads: 288.6K
Source: huggingface_datasets
Updated: 2023-07-11
