High-quality deduplicated dataset for LLM training
Cerebras has introduced SlimPajama, the largest extensively deduplicated, multi-corpora open-source dataset designed for training large language models (LLMs). SlimPajama was produced by cleaning and deduplicating the original 1.21-trillion-token RedPajama dataset from Together, a 49.6% reduction in size from 1,210 billion to 627 billion tokens. The result is expected to provide higher-quality training data, improving accuracy and efficiency when training models on up to 627 billion tokens. Cerebras is also releasing the tools used to create SlimPajama, including its MinHashLSH deduplication code, which has been optimized for distributed, multi-threaded environments. The release includes validation and test sets of 500 million tokens each, decontaminated to avoid overlap with the training data. SlimPajama is available under the Apache 2.0 license and can be downloaded from Hugging Face.
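To make the deduplication approach concrete, here is a minimal sketch of MinHashLSH-based near-duplicate filtering using the open-source datasketch library. This is not Cerebras's released tooling, which is optimized for distributed, multi-threaded processing; the 3-word shingle size, the 0.8 Jaccard threshold, and the sample documents are all illustrative assumptions.

```python
# Minimal sketch of MinHashLSH near-duplicate filtering (datasketch library).
# Illustrative only: not the SlimPajama pipeline, and the shingle size and
# threshold here are assumptions, not Cerebras's settings.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations per signature


def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from lowercased 3-word shingles."""
    m = MinHash(num_perm=NUM_PERM)
    words = text.lower().split()
    for i in range(len(words) - 2):
        shingle = " ".join(words[i : i + 3])
        m.update(shingle.encode("utf-8"))
    return m


def deduplicate(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return keys of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for key, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):  # a near-duplicate is already indexed
            continue
        lsh.insert(key, sig)
        kept.append(key)
    return kept


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog today",
    "c": "an entirely different document about language models",
}
print(deduplicate(docs))  # typically ['a', 'c'], since 'b' nearly duplicates 'a'
```

The appeal of LSH here is that the banded MinHash signatures surface candidate duplicate pairs directly from the index, so each new document is checked against the corpus without an all-pairs comparison, which is what makes the technique tractable at trillion-token scale.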