airtrain-ai/fineweb-edu-fortified
The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain
mlforge datasets pull airtrain-ai/fineweb-edu-fortified
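For readers who prefer Python to the CLI, here is a minimal sketch of streaming one crawl configuration and previewing a few rows. It assumes the dataset is also published on the Hugging Face Hub under the same identifier and that the `datasets` library is installed. `preview_rows` and `stream_config` are hypothetical helper names, and the column names (`token_count`, `score`) are assumed to match the feature list in the card metadata, written with underscores.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Sequence


def preview_rows(
    rows: Iterable[Dict[str, Any]],
    n: int = 3,
    columns: Sequence[str] = ("id", "url", "token_count", "score"),
) -> List[Dict[str, Any]]:
    """Return the first n rows, keeping only the requested columns."""
    return [{c: row.get(c) for c in columns} for row in islice(rows, n)]


def stream_config(config_name: str = "CC-MAIN-2013-20"):
    """Lazily iterate over one crawl configuration without a full download.

    Assumption: the dataset is reachable through the Hugging Face Hub.
    """
    # Imported here so that preview_rows stays usable without the library.
    from datasets import load_dataset

    return load_dataset(
        "airtrain-ai/fineweb-edu-fortified",
        name=config_name,
        split="train",
        streaming=True,  # yields rows on demand instead of downloading everything
    )
```

Calling `preview_rows(stream_config())` would fetch only the first few records. Streaming is the safer default here, since the card metadata lists the CC-MAIN-2013-20 configuration alone at roughly 55 GB to download.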
Dataset details
About airtrain-ai/fineweb-edu-fortified
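As a quick sanity check on the size figures reported in this card's metadata, the sketch below derives the average bytes per row and the download-to-dataset size ratio for the first two configurations. The numbers are copied from the card. The `ConfigStats` class and its property names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigStats:
    name: str
    num_bytes: int       # size of the train split as reported in the card
    num_examples: int    # rows in the train split
    download_size: int   # bytes reported as the download size

    @property
    def bytes_per_example(self) -> float:
        return self.num_bytes / self.num_examples

    @property
    def download_ratio(self) -> float:
        return self.download_size / self.num_bytes


# Figures taken verbatim from the card metadata.
CONFIGS = [
    ConfigStats("CC-MAIN-2013-20", 71_683_996_286, 10_800_000, 55_571_546_426),
    ConfigStats("CC-MAIN-2013-48", 38_878_994_623, 5_800_000, 30_087_644_388),
]

for cfg in CONFIGS:
    print(
        f"{cfg.name}: {cfg.bytes_per_example:,.0f} bytes/row, "
        f"download/dataset ratio {cfg.download_ratio:.2f}"
    )
```

Both configurations come out at a little under 7 kB per row, with a download size of about 0.77 times the reported dataset size.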
---
language:
- en
license: odc-by
task_categories:
- text-generation
dataset_info:
- config_name: CC-MAIN-2013-20
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 71683996286
    num_examples: 10800000
  download_size: 55571546426
  dataset_size: 71683996286
- config_name: CC-MAIN-2013-48
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 38878994623
    num_examples: 5800000
  download_size: 30087644388
  dataset_size: 38878994623
- config_name: CC-MAIN-2014-10
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count