Salesforce/wikitext

Name: Salesforce/wikitext
Creator: Salesforce
License: ["cc-by-sa-3.0","gfdl"]
Keywords: huggingface, task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation, language_creators:crowdsourced, multilinguality:monolingual, source_datasets:original, text-generation, fill-mask

Text Generation · Salesforce · 1.3M downloads


Dataset Card for "wikitext"

Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger…

See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
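The size comparison with PTB can be checked with a little arithmetic. The sketch below is illustrative only: the training-set token counts are not given on this page and are assumed from the figures commonly cited for the WikiText paper, and the helper name `size_ratio` is hypothetical.

```python
# Training-set token counts. ASSUMPTION: these figures are not stated on this
# page; they are the values commonly cited from the WikiText paper.
PTB_TRAIN_TOKENS = 887_521
WIKITEXT2_TRAIN_TOKENS = 2_088_628
WIKITEXT103_TRAIN_TOKENS = 103_227_021


def size_ratio(tokens: int, baseline: int = PTB_TRAIN_TOKENS) -> float:
    """Return how many times larger `tokens` is than `baseline`."""
    return tokens / baseline


wikitext2_ratio = size_ratio(WIKITEXT2_TRAIN_TOKENS)
wikitext103_ratio = size_ratio(WIKITEXT103_TRAIN_TOKENS)
```

Under these assumed counts, both ratios are consistent with the "over 2 times" and "over 110 times" claims in the summary.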


# download instantly
mlforge datasets pull Salesforce/wikitext
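The dataset can also be read directly from the Hugging Face Hub in Python. The following is a minimal sketch, assuming the `datasets` library is installed and that the Hub configuration names are the four listed in `WIKITEXT_CONFIGS`; the helper names `load_wikitext` and `count_whitespace_tokens` are hypothetical and exist only for illustration.

```python
# Configuration names assumed to be published for Salesforce/wikitext on the
# Hugging Face Hub. "raw" variants keep the original tokens; the others
# replace rare words with <unk>.
WIKITEXT_CONFIGS = (
    "wikitext-2-v1",
    "wikitext-2-raw-v1",
    "wikitext-103-v1",
    "wikitext-103-raw-v1",
)


def load_wikitext(config="wikitext-2-raw-v1", split="train"):
    """Download one split of one configuration (needs network access)."""
    if config not in WIKITEXT_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    # Imported lazily so the pure helpers below work without the library.
    from datasets import load_dataset

    return load_dataset("Salesforce/wikitext", config, split=split)


def count_whitespace_tokens(rows):
    """Count whitespace-separated tokens over rows that have a 'text' field."""
    return sum(len(row["text"].split()) for row in rows)
```

Calling `count_whitespace_tokens(load_wikitext())` would download the smaller raw configuration and count the tokens in its training split. Each row is assumed to hold a single `text` field, which may be an empty string for blank lines.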

Dataset details

Task

Text Generation

Language

English

License

cc-by-sa-3.0, gfdl

Size

4.7 GB

Creator

Salesforce

Downloads

1.3M

Source

huggingface_datasets

Updated

2024-01-04

About Salesforce/wikitext

Table of Contents

- Dataset Description
  - Dataset Summary
  - Supported Tasks and Leaderboards
  - Languages
- Dataset Structure
  - Data Instances
  - Data Fields
  - Data Splits
- Dataset Creation
  - Curation Rationale
  - Source Data
  - Annotations
  - Personal and Sensitive Information
- Considerations for Using the Data
  - Social Impact of Dataset
  - Discussion of Biases
  - Other Known Limitations
- Additional Information
  - Dataset Curators
  - Licensing Information
  - Citation Information
  - Contributions