
RedPajama
About
The RedPajama family of large language models (LLMs) is an open-source initiative focused on developing high-performing, transparent models, led by Together AI in collaboration with partners across the open-source AI community [8][3]. The models are trained on the RedPajama dataset, which comprises over 100 trillion raw tokens and a refined subset of 30 trillion tokens spanning multiple languages and domains [8]. They are released in several sizes and configurations: base models, instruction-tuned versions for stronger few-shot performance, and chat models tailored for interactive dialogue [3][8]. One example, RedPajama-INCITE-Instruct-3B-v1, is fine-tuned for few-shot applications on GPT-JT data, deliberately excluding tasks that overlap with HELM core scenarios [3]. The initiative prioritizes not only model performance but also the transparency and accessibility of its data and training methodology [8].
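As a concrete illustration of the few-shot usage described above, the sketch below runs a small in-context classification prompt against the instruction-tuned 3B checkpoint using the Hugging Face transformers library. The model ID matches the checkpoint Together publishes on the Hugging Face Hub, but the prompt layout and generation settings are illustrative assumptions, not an official recipe.

# Minimal sketch: few-shot prompting with RedPajama-INCITE-Instruct-3B-v1
# via Hugging Face transformers. Prompt format and sampling settings are
# illustrative assumptions, not an official recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # requires the `accelerate` package
)

# A simple few-shot prompt: two worked examples, then the query to complete.
prompt = (
    "Review: The food was cold and the service slow.\nSentiment: negative\n\n"
    "Review: Friendly staff and a great atmosphere.\nSentiment: positive\n\n"
    "Review: The movie was a complete waste of time.\nSentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=5,    # only a single label word is needed
    do_sample=False,     # greedy decoding for a deterministic label
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, dropping the prompt itself.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.strip())  # likely output: "negative"

Greedy decoding is used here because a classification-style few-shot prompt needs a single deterministic label rather than varied sampled text; for open-ended generation one would typically enable sampling instead.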