
occiglot/tokenizer-wiki-bench

General · occiglot · 40.2K downloads
mit · 215 GB · language:af · language:ar · language:bg · language:ca · language:cs

This dataset includes pre-processed Wikipedia data for tokenizer evaluation in 45 languages. We provide more information on the evaluation task in general in this blog post.
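
The card does not say which metrics the evaluation uses. As a rough illustration only, the sketch below computes fertility (average tokens per whitespace-separated word), a metric commonly reported when comparing tokenizers. The names `fertility` and `toy_tokenize` are hypothetical, and the toy tokenizer is a stand-in for a real one, not part of this dataset or its tooling.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words found in input texts")
    return n_tokens / n_words


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into fixed-size pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


# Assumed sample text; in practice the rows of one language split would go here.
sample = ["tokenizer evaluation", "wiki"]
print(f"fertility: {fertility(sample, toy_tokenize):.3f}")
```

To evaluate a real tokenizer, pass its tokenization function in place of `toy_tokenize`. Whitespace splitting is only a crude word count and may not be meaningful for languages written without spaces.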

# download instantly
mlforge datasets pull occiglot/tokenizer-wiki-bench

Dataset details

Task: General
Language: af
License: mit
Size: 215 GB
Rows / images: 84.4M
Creator: occiglot
Downloads: 40.2K
Source: huggingface_datasets
Updated: 2024-04-23
