Name: airtrain-ai/fineweb-edu-fortified
Creator: airtrain-ai
License: odc-by
Keywords: huggingface, task_categories:text-generation, language:en, license:odc-by, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, text-generation

---
language:
- en
license: odc-by
task_categories:
- text-generation
dataset_info:
- config_name: CC-MAIN-2013-20
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 71683996286
    num_examples: 10800000
  download_size: 55571546426
  dataset_size: 71683996286
- config_name: CC-MAIN-2013-48
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: score
    dtype: float64
  - name: int_score
    dtype: int64
  - name: embedding
    sequence: float32
  - name: count
    dtype: int64
  splits:
  - name: train
    num_bytes: 38878994623
    num_examples: 5800000
  download_size: 30087644388
  dataset_size: 38878994623
- config_name: CC-MAIN-2014-10
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
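As a rough aid to reading the split metadata above, the following sketch derives two per-config figures from the listed numbers: the average stored bytes per example and the ratio of download size to dataset size. The `SPLITS` dictionary and the `summarize` helper are illustrative names, not part of the dataset or of any library, and only the two configs whose metadata appears in full are included.

```python
# Split metadata copied from the card header. Only the two configs whose
# numbers are listed in full are included here.
SPLITS = {
    "CC-MAIN-2013-20": {
        "num_bytes": 71_683_996_286,
        "num_examples": 10_800_000,
        "download_size": 55_571_546_426,
    },
    "CC-MAIN-2013-48": {
        "num_bytes": 38_878_994_623,
        "num_examples": 5_800_000,
        "download_size": 30_087_644_388,
    },
}


def summarize(meta):
    """Hypothetical helper: derive simple ratios from one config's metadata."""
    return {
        # Average size of one stored row, all columns included.
        "bytes_per_example": meta["num_bytes"] / meta["num_examples"],
        # Fraction of the in-memory dataset size that must be downloaded.
        "download_ratio": meta["download_size"] / meta["num_bytes"],
    }


if __name__ == "__main__":
    for config, meta in SPLITS.items():
        stats = summarize(meta)
        print(
            f"{config}: {stats['bytes_per_example']:.0f} bytes/example, "
            f"download/dataset ratio {stats['download_ratio']:.2f}"
        )
```

These figures only restate the header numbers in a more comparable form; they say nothing about how the bytes are distributed across columns.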

Dataset details

About airtrain-ai/fineweb-edu-fortified