epfml/FineWeb-HQ

Text Generation · epfml · 108.4K downloads
Tags: odc-by, 14 TB, task_categories:text-generation, language:en, license:odc-by, size_categories:1B<n<10B, format:parquet

FineWeb-HQ is a high-quality, model-filtered pretraining dataset that is a subset of FineWeb. It was created by selecting the top 10% of FineWeb documents, as ranked by a deep learning classifier trained to identify structured, knowledge-rich samples. The classifier scores documents using XLM-RoBERTa embeddings.
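
The "top 10% by classifier score" selection described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `select_top_fraction`, the toy documents, and the scores are all hypothetical, and the scores stand in for whatever the real XLM-RoBERTa-based classifier would produce.

```python
def select_top_fraction(docs, scores, fraction=0.10):
    """Return the docs whose scores fall in the top `fraction`, highest first."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round to the nearest whole document, keeping at least one.
    keep = max(1, round(len(docs) * fraction))
    # Rank document indices by score, highest first.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:keep]]


# Hypothetical scores standing in for the quality classifier's output.
docs = [f"doc-{i}" for i in range(20)]
scores = [i / 20 for i in range(20)]

top = select_top_fraction(docs, scores)
print(top)  # ['doc-19', 'doc-18']
```

At the scale of the full corpus, a ranking like this would more plausibly be done with a score threshold computed once and applied shard by shard, rather than an in-memory sort; the sketch only shows the selection rule itself.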

# download instantly
mlforge datasets pull epfml/FineWeb-HQ

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 14 TB
Rows / images: 2.4B
Creator: epfml
Downloads: 108.4K
Source: huggingface_datasets
Updated: 2025-09-30
