epfml/FineWeb2-HQ
FineWeb2-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb2, spanning 20 languages. It enables around 6x faster pretraining compared to the base dataset. FineWeb2-HQ was created by selecting the top 10% quality documents of FineWeb2 in each language, based on scores assigned by a deep learning classifier trained to identify structured and knowledge-rich sample
mlforge datasets pull epfml/FineWeb2-HQ
Dataset details
About epfml/FineWeb2-HQ
FineWeb2-HQ is a high-quality, model-filtered pretraining dataset derived as a subset of FineWeb2, spanning 20 languages. It enables around 6x faster pretraining compared to the base dataset. FineWeb2-HQ was created by selecting the top 10% quality documents of FineWeb2 in each language, based on scores assigned by a deep learning classifier trained to identify structured and knowledge-rich samples using XLM-RoBERTa embeddings.