The Pile (dataset)

Short description: Training dataset for large language models


The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year.[1][2] It is composed of 22 smaller datasets, including 14 new ones.[1]

Creation

Training LLMs requires such vast amounts of data that, before the introduction of the Pile, most data used for training LLMs was taken from the Common Crawl.[3] However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training.[4] The creation of the Pile was motivated by the need for a dataset that was both large enough and drawn from a wide variety of sources and styles of writing.[1][5] Compared to other datasets, the Pile has two main distinguishing features: it is a curated selection of data chosen by researchers at EleutherAI to contain information they thought language models should learn, and it is the only such dataset that is thoroughly documented by the researchers who developed it.[6]

Contents and filtering

Machine learning models do not learn all they can from data on the first pass, so it is common practice to train a model on the same data more than once; each pass through the entire dataset is referred to as an "epoch".[7] Each of the 22 sub-datasets that make up the Pile was assigned a number of epochs according to the perceived quality of its data.[1] The table below shows the size of each of the 22 sub-datasets before and after being multiplied by its number of epochs. Sizes have been converted to GB, and asterisks indicate the newly introduced datasets.

Sub-datasets of the Pile[1][5]
Component                    Original size   Epochs   Effective size
Pile-CC                      243.87 GB       1        243.87 GB
PubMed Central*              96.93 GB        2        193.86 GB
Books3                       108.40 GB       1.5      162.61 GB
OpenWebText2*                67.40 GB        2        134.80 GB
arXiv*                       60.36 GB        2        120.71 GB
GitHub*                      102.18 GB       1        102.18 GB
Free Law*                    54.92 GB        1.5      82.39 GB
Stack Exchange*              34.57 GB        2        69.14 GB
USPTO Backgrounds*           24.59 GB        2        49.19 GB
PubMed Abstracts*            20.68 GB        2        41.37 GB
Gutenberg (PG-19)            11.68 GB        2.5      29.20 GB
OpenSubtitles                13.94 GB        1.5      20.91 GB
Wikipedia                    6.85 GB         3        20.54 GB
DeepMind Mathematics         8.32 GB         2        16.63 GB
Ubuntu Freenode IRC logs*    5.93 GB         2        11.84 GB
BookCorpus2*                 6.76 GB         1.5      10.15 GB
EuroParl                     4.93 GB         2        9.85 GB
Hacker News*                 4.19 GB         2        8.38 GB
YouTube Subtitles*           4.01 GB         2        8.02 GB
PhilPapers*                  2.56 GB         2        5.11 GB
NIH ExPorter*                2.03 GB         2        4.07 GB
Enron Emails                 0.95 GB         2        1.89 GB
Total                        886.03 GB                1346.69 GB
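
The relationship between the columns of the table can be sketched in a few lines of Python. This is an illustrative calculation only, not EleutherAI's code: the function names are hypothetical, only four rows of the table are copied in, and the idea of turning effective sizes into sampling weights is an assumption about how such a table might be used. Because the published figures were rounded after unit conversion, a recomputed value can differ from the table by about 0.01 GB.

```python
# (original size in GB, epochs) for a few components, copied from the table.
COMPONENTS = {
    "Pile-CC": (243.87, 1.0),
    "PubMed Central": (96.93, 2.0),
    "Wikipedia": (6.85, 3.0),
    "Enron Emails": (0.95, 2.0),
}


def effective_size(original_gb, epochs):
    """Effective size = original size multiplied by the number of epochs."""
    return original_gb * epochs


def mixture_weights(components):
    """Share of each component in the epoch-weighted mix (hypothetical helper)."""
    sizes = {name: effective_size(gb, ep) for name, (gb, ep) in components.items()}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    for name, weight in mixture_weights(COMPONENTS).items():
        print(f"{name}: {weight:.3f}")
```

Raising the epoch count of a small component such as Wikipedia increases its share of what the model sees during training without changing how much unique text it contributes.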

EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with.[1]

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, the Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links.[1]
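
The text above does not say which deduplication method EleutherAI used. As a minimal sketch of the general idea, assuming the simplest possible approach (exact matching after whitespace and case normalization), duplicates can be dropped by hashing each document. The function names are hypothetical, and a real pipeline might instead use fuzzy matching to catch near-duplicates.

```python
import hashlib


def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes.

    This is exact-match deduplication only; it is an assumption made for
    illustration, not a description of the Pile's actual pipeline.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Storing only the digests keeps memory use proportional to the number of unique documents rather than to their total length.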

Some potential sub-datasets were excluded for various reasons; the US Congressional Record, for example, was excluded due to its racist content.[1]

Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text. The data was also not filtered on the basis of consent, meaning that, for example, the Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards.[1]
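
Selecting only some parts of the Pile could look roughly like the following. This is a sketch under stated assumptions: it assumes each record is one JSON object per line with a "text" field and a "meta" object whose "pile_set_name" field names the sub-dataset, and the function name and sample component labels are hypothetical. The record layout should be checked against the actual files before relying on it.

```python
import json


def keep_components(lines, allowed):
    """Yield only records whose sub-dataset name is in `allowed`.

    Assumes JSON Lines input with a "meta" object containing a
    "pile_set_name" key (an assumption about the record layout).
    """
    for line in lines:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") in allowed:
            yield record


if __name__ == "__main__":
    sample = [
        '{"text": "An encyclopedia entry.", "meta": {"pile_set_name": "Wikipedia (en)"}}',
        '{"text": "A scraped web page.", "meta": {"pile_set_name": "Pile-CC"}}',
    ]
    for record in keep_components(sample, {"Wikipedia (en)"}):
        print(record["text"])
```

Because the function is a generator, it can be applied to a file handle and will process a large file one line at a time.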

Use

The Pile was originally developed to train EleutherAI's GPT-Neo models[8][9][10] but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation,[11][12] Meta AI's Open Pre-trained Transformers,[13] LLaMA,[14] and Galactica,[15] Stanford University's BioMedLM 2.7B,[16] the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL,[17] and Yandex's YaLM 100B.[18]

In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles.[2][19][20]
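
The paragraph above does not name a scoring metric. One common choice for comparing models that use different tokenizers is bits per byte: the model's total negative log-likelihood on a text, converted from nats to bits and divided by the text's length in UTF-8 bytes. The sketch below assumes the total negative log-likelihood has already been obtained from some model; the function name is hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a total negative log-likelihood (in nats) to bits per byte.

    `total_nll_nats` is assumed to be the sum of per-token negative
    log-likelihoods that a model assigned to `text`.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (math.log(2) * n_bytes)
```

Lower values mean the model predicts the text better, and computing the score separately for each sub-dataset gives a per-style breakdown.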

DMCA takedown

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website.[21] In July 2023, the Rights Alliance had copies of the Pile taken down through DMCA notices.[22][23]

References

  1. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 [cs.CL].
  2. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI. 13 February 2020. https://pile.eleuther.ai/.
  3. "Language Models are Few-Shot Learners". 22 July 2020. arXiv:2005.14165 [cs.CL].
  4. Rosset, Corby (13 February 2020). "Turing-NLG: A 17-billion-parameter language model by Microsoft". Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/. 
  5. Gao, Leo; Biderman, Stella; Hoppe, Travis; Grankin, Mikhail; researcher2; trisongz; sdtblck (15 June 2021). "The Pile Replication Code". https://github.com/EleutherAI/the-pile.
  6. Khan, Mehtab; Hanna, Alex (13 September 2022). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". https://papers.ssrn.com/abstract=4217148. Retrieved 8 March 2023. 
  7. Brownlee, Jason (10 August 2022). "Difference Between a Batch and an Epoch in a Neural Network". https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/. Retrieved 2 June 2023. 
  8. "GPT-Neo 125M". 8 December 2022. https://huggingface.co/EleutherAI/gpt-neo-125m. 
  9. "GPT-Neo 1.3B". 8 December 2022. https://huggingface.co/EleutherAI/gpt-neo-1.3B. 
  10. "GPT-Neo 2.7B". 8 December 2022. https://huggingface.co/EleutherAI/gpt-neo-2.7B. 
  11. "Microsoft and Nvidia team up to train one of the world’s largest language models". 11 October 2021. https://venturebeat.com/ai/microsoft-and-nvidia-team-up-to-train-one-of-the-worlds-largest-language-models/. 
  12. "AI: Megatron the Transformer, and its related language models". 24 September 2021. https://lifearchitect.ai/megatron/. Retrieved 8 March 2023. 
  13. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv:2205.01068 [cs.CL].
  14. "LLaMA: Open and Efficient Foundation Language Models". 27 February 2023. arXiv:2302.13971 [cs.CL].
  15. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv:2211.09085 [cs.CL].
  16. "Model Card for BioMedLM 2.7B". https://huggingface.co/stanford-crfm/BioMedLM. Retrieved 5 June 2023. 
  17. Yuan, Sha; Zhao, Hanyu; Du, Zhengxiao; Ding, Ming; Liu, Xiao; Cen, Yukuo; Zou, Xu; Yang, Zhilin et al. (1 January 2021). "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models". AI Open 2: 65–68. doi:10.1016/j.aiopen.2021.06.001. https://www.sciencedirect.com/science/article/pii/S2666651021000152. Retrieved 8 March 2023. 
  18. Grabovskiy, Ilya (2022). "Yandex publishes YaLM 100B, the largest GPT-like neural network in open source" (Press release). Yandex. Retrieved 5 June 2023.
  19. "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". 21 January 2022. arXiv:2112.11446 [cs.CL].
  20. Lieber, Opher; Sharir, Or; Lenz, Barak; Shoham, Yoav (1 August 2021). "Jurassic-1: Technical Details and Evaluation". AI21 Labs. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf. 
  21. "The Battle Over Books3 Could Change AI Forever". https://www.wired.com/story/battle-over-books3/. 
  22. "Rights Alliance removes the illegal Books3 dataset used to train artificial intelligence". Rights Alliance. https://rettighedsalliancen.com/rights-alliance-removes-the-illegal-books3-dataset-used-to-train-artificial-intelligence/. 
  23. "The Pile An 800GB Dataset of Diverse Text for Language Modeling". https://academictorrents.com/details/0d366035664fdf51cfbe9f733953ba325776e667.