epfml/FineWeb2-HQ

Text Generation · epfml · 11.0K
odc-by · 5.5 TB · task_categories:text-generation · language:ru · language:zh · language:de · language:ja

# download instantly
mlforge datasets pull epfml/FineWeb2-HQ

Dataset details

Task: Text Generation
Language: ru
License: odc-by
Size: 5.5 TB
Rows / images: 380.1M
Creator: epfml
Downloads: 11.0K
Source: huggingface_datasets
Updated: 2025-02-19

About epfml/FineWeb2-HQ

FineWeb2-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb2 and spanning 20 languages. It enables around 6x faster pretraining compared to the base dataset. FineWeb2-HQ was created by selecting the top 10% of FineWeb2 documents by quality in each language, based on scores assigned by a deep learning classifier that uses XLM-RoBERTa embeddings and is trained to identify structured and knowledge-rich samples.
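
The per-language selection step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`language`, `score`, `text`), the function name `select_top_fraction`, and the rule of keeping at least one document per language are assumptions made for illustration, and the scores are assumed to have been computed already by some quality classifier.

```python
import math
from collections import defaultdict


def select_top_fraction(docs, fraction=0.10):
    """Keep the highest-scoring `fraction` of documents within each language.

    `docs` is an iterable of dicts with hypothetical keys "language" and
    "score". The threshold is applied per language, so a language with
    generally lower scores still contributes its own top share.
    """
    # Group documents by language so each language is ranked separately.
    by_language = defaultdict(list)
    for doc in docs:
        by_language[doc["language"]].append(doc)

    kept = []
    for group in by_language.values():
        ranked = sorted(group, key=lambda d: d["score"], reverse=True)
        # Assumption: keep at least one document for very small groups.
        n_keep = max(1, math.floor(len(ranked) * fraction))
        kept.extend(ranked[:n_keep])
    return kept


if __name__ == "__main__":
    # Toy corpus with made-up scores: 20 Russian and 10 German documents.
    corpus = [{"language": "ru", "score": i / 20, "text": f"ru-{i}"} for i in range(20)]
    corpus += [{"language": "de", "score": i / 10, "text": f"de-{i}"} for i in range(10)]
    for doc in select_top_fraction(corpus):
        print(doc["language"], doc["score"], doc["text"])
```

Ranking within each language, rather than applying one global cutoff, matches the "top 10% in each language" wording: the language mix of the subset then follows the language mix of the base dataset.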