airtrain-ai/fineweb-edu-fortified

Text Generation · airtrain-ai · 136.5K
odc-by · 1.6 TB · task_categories:text-generation · language:en · license:odc-by · size_categories:100M<n<1B · format:parquet

The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain

# download instantly
mlforge datasets pull airtrain-ai/fineweb-edu-fortified

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 1.6 TB
Rows / images: 322.3M
Creator: airtrain-ai
Downloads: 136.5K
Source: huggingface_datasets
Updated: 2024-08-08
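
Since the listed source is huggingface_datasets, one per-dump configuration can probably be streamed instead of pulling the full 1.6 TB. The following is a minimal sketch, not the publisher's documented method. It assumes the third-party Hugging Face `datasets` package is installed and that the column names carry underscores (`token_count`), which this page shows without them. The helper names `stream_config` and `preview` are hypothetical.

```python
from itertools import islice

REPO_ID = "airtrain-ai/fineweb-edu-fortified"
CONFIG = "CC-MAIN-2013-20"  # one of the per-dump configurations listed on this page


def stream_config(config=CONFIG):
    """Open one configuration lazily, without downloading it in full."""
    # Imported here so the rest of the module works without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=True)


def preview(rows, limit=3):
    """Return (id, token_count, score) for the first `limit` rows."""
    # Assumption: columns are named `id`, `token_count`, and `score`.
    return [(r["id"], r["token_count"], r["score"]) for r in islice(rows, limit)]
```

Calling `preview(stream_config())` would fetch only the first few rows over the network. `preview` accepts any iterable of dict-like rows, so it can be tried on local data first.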

About airtrain-ai/fineweb-edu-fortified

---
language:
- en
license: odc-by
task_categories:
- text-generation
dataset_info:
- config_name: CC-MAIN-2013-20
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 71683996286
    num_examples: 10800000
  download_size: 55571546426
  dataset_size: 71683996286
- config_name: CC-MAIN-2013-48
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 38878994623
    num_examples: 5800000
  download_size: 30087644388
  dataset_size: 38878994623
- config_name: CC-MAIN-2014-10
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count