Skip to content

Top-quality datasets, tools, and ideas for enhancing Large Language Models (LLMs).

Notifications You must be signed in to change notification settings

mattdepaolis/llm-datasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 

Repository files navigation

💾 LLM Datasets: Unlocking the Potential of Large Language Models

🤗 Hugging Face • 💻 Blog
Top-quality datasets, tools, and ideas for enhancing Large Language Models (LLMs).

📑 Table of Contents

Introduction

Welcome to your ultimate resource for enhancing Large Language Models (LLMs) through top-quality datasets, cutting-edge tools, and innovative ideas. Whether you’re building a model from scratch or fine-tuning an existing one, the data you use is crucial. This guide will walk you through what makes a great dataset, provide curated lists of open-source datasets for various training stages, and introduce tools to help you create and manage high-quality data effectively.


🌟 The Essence of a Great Dataset

A high-quality dataset is the backbone of any successful LLM. But what exactly makes a dataset exceptional? Here are the key attributes:

• Accuracy: Information should be correct, relevant, and clearly articulated. Responses must directly address the given questions or instructions.

• Diversity: A wide range of topics, styles, and contexts ensures the model can handle different tasks and follow diverse instructions effectively.

• Complexity: Including challenging tasks that require multi-step reasoning or problem-solving helps the model manage more intricate queries.

Evaluating these aspects can be tricky. For example, checking accuracy is straightforward for math problems but less so for open-ended questions. Diversity can be measured by the range of topics covered, and complexity can be assessed using other language models as evaluators.


📚 Open-Source Datasets

⚙️ Pre-Training Datasets

Pre-training datasets provide the foundational understanding of language, context, and general knowledge that LLMs need. They enable models to learn useful representations and patterns that can be fine-tuned for various downstream tasks.

Dataset Size Authors Date Description
fineweb 46B HuggingFace July 2024 The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library.
fineweb-edu 3B HuggingFace August 2024 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version.

🛠️ Supervised Fine-Tuning Datasets

After initial training, fine-tuning with specialized datasets transforms an LLM into a versatile assistant capable of answering questions and performing various tasks. These datasets consist of instruction-response pairs and are available under permissive licenses.

General-Purpose Datasets

Designed to make models versatile by exposing them to a broad spectrum of high-quality data, these datasets often combine real-world information with synthetic data generated by advanced models like GPT-4.

Dataset Size Authors Date Description
Buzz 31.2M Alignment Lab AI May 2024 Extensive collection using data augmentation and deduplication techniques.
WebInstructSub 2.39M Yue et al. May 2024 Derived from Common Crawl documents, extracting and refining QA pairs. MAmmoTH2 paper (subset).
The-Tome 1.75M Arcee AI Jul 2024 Filtered for instruction following. 100k subset.
Hercules v4.5 1.72M Sebastian Gabarain Apr 2024 Covers math, code, role-playing, etc. v4 for more details.
Dolphin-2.9 1.39M Cognitive Computations Apr 2023 Large-scale general-purpose dataset for Dolphin models.
WildChat-1M 1.04M Zhao et al. May 2023 Real conversations with GPT-3.5/4, including metadata. WildChat paper.
OpenHermes-2.5 1M Teknium Nov 2023 Large-scale dataset for OpenHermes models.
Infinity-Instruct 660k BAAI Jun 2024 Based on a curated collection of evolved instructions.
SlimOrca 518k Lian et al. Sep 2023 Curated subset of OpenOrca using GPT-4 to eliminate incorrect answers.
Tulu V2 Mix 326k Ivison et al. Nov 2023 Mix of high-quality datasets. Tulu 2 paper.
UltraInteract SFT 289k Yuan et al. Apr 2024 Focused on math, coding, and logic with step-by-step answers. Eurus paper.
NeurIPS-LLM-data 204k Jindal et al. Nov 2023 Winner of the NeurIPS LLM Efficiency Challenge.
UltraChat 200k 200k Tunstall et al., Ding et al. Oct 2023 Filtered version of UltraChat with 1.4M ChatGPT-generated dialogues.
WizardLM_evol_instruct_V2 143k Xu et al. Jun 2023 Latest Evol-Instruct version applied to Alpaca and ShareGPT data. WizardLM paper.
Synthia-v1.3 119k Migel Tissera Nov 2023 High-quality synthetic data generated with GPT-4.
oasst1 84.4k Köpf et al. Mar 2023 Human-generated assistant conversations in 35 languages. OASST1 paper and oasst2.
WizardLM_evol_instruct_70k 70k Xu et al. Apr 2023 Evol-Instruct applied to Alpaca and ShareGPT. WizardLM paper.
airoboros-3.2 58.7k Jon Durbin Dec 2023 High-quality uncensored dataset.
ShareGPT_Vicuna_unfiltered 53k anon8231489123 Mar 2023 Filtered ShareGPT dataset with real user-ChatGPT conversations.
lmsys-chat-1m-smortmodelsonly 45.8k Nebulous, Zheng et al. Sep 2023 Filtered lmsys-chat-1m with responses from multiple models.
Open-Platypus 24.9k Lee et al. Sep 2023 Deduplicated datasets using Sentence Transformers, includes an NC dataset. Platypus paper.
databricks-dolly-15k 15k Conover et al. May 2023 Created by Databricks employees with prompt-response pairs across eight instruction categories.

🧮 Math & Logic

LLMs often find mathematical reasoning and formal logic challenging. Specialized datasets help improve these areas by providing problems that require systematic thinking and multi-step reasoning.

Dataset Size Authors Date Description
OpenMathInstruct-1 5.75M Toshniwal et al. Feb 2024 Includes math problems from GSM8K and MATH with solutions from Mixtral-8x7B.
MetaMathQA 395k Yu et al. Dec 2023 Mathematical questions rewritten from multiple perspectives for deeper understanding. MetaMath paper.
MathInstruct 262k Yue et al. Sep 2023 Compiled from 13 math datasets, focusing on chain-of-thought and program-of-thought reasoning.
Orca-Math 200k Mitra et al. Feb 2024 Grade school math problems generated using GPT-4 Turbo. Orca-Math paper.

💻 Code

Enhancing coding capabilities in LLMs requires specialized datasets filled with diverse programming examples and challenges.

Dataset Size Authors Date Description
CodeFeedback-Filtered-Instruction 157k Zheng et al. Feb 2024 Filtered version combining Magicoder-OSS-Instruct and other datasets to ensure high code quality.
Tested-143k-Python-Alpaca 143k Vezora Mar 2024 Python code that has passed automated tests for accuracy.
glaive-code-assistant 136k Glaive.ai Sep 2023 Synthetic problems and solutions with about 60% Python content. v2 available.
Magicoder-Evol-Instruct-110K 110k Wei et al. Nov 2023 Cleaned version of evol-codealpaca-v1 following StarCoder's decontamination process. Magicoder paper.
dolphin-coder 109k Eric Hartford Nov 2023 Transformed from leetcode-rosetta.
synthetic_tex_to_sql 100k Gretel.ai Apr 2024 Synthetic text-to-SQL samples covering various domains.
sql-create-context 78.6k b-mc2 Apr 2023 Enhanced version of WikiSQL and Spider.
Magicoder-OSS-Instruct-75K 75k Wei et al. Nov 2023 Generated by gpt-3.5-turbo-1106. Magicoder paper.
Code-Feedback 66.4k Zheng et al. Feb 2024 Diverse Code Interpreter-like dataset with multi-turn dialogues and mixed text-code responses. OpenCodeInterpreter paper.
Open-Critic-GPT 55.1k Vezora Jul 2024 Uses a local model to create and identify bugs in code across various programming languages.
self-oss-instruct-sc2-exec-filter-50k 50.7k Lozhkov et al. Apr 2024 Created using seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. Blog post.

🗣️ Conversation & Role-Play

To excel in conversational settings, LLMs benefit from datasets that mimic real-life dialogues and role-playing scenarios.

Dataset Size Authors Date Description
Bluemoon 290k Squish42 Jun 2023 Cleaned posts from the Blue Moon roleplaying forum.
PIPPA 16.8k Gosling et al., kingbri Aug 2023 Deduplicated version of Pygmalion's PIPPA in ShareGPT format.
Capybara 16k LDJnr Dec 2023 Focuses on diverse information across multiple domains with multi-turn conversations.
RPGPT_PublicDomain-alpaca 4.26k Practical Dreamer May 2023 Synthetic dialogues of public domain characters in roleplay format using build-a-dataset.
Pure-Dove 3.86k LDJnr Sep 2023 Highly filtered multi-turn conversations between GPT-4 and real humans.
Opus Samantha 1.85k macadelicc Apr 2024 Multi-turn conversations with Claude 3 Opus.
LimaRP-augmented 804 lemonilia, grimulkan Jan 2024 Enhanced version of LimaRP with human roleplaying conversations.

🤖 Agent & Function Calling

Function calling allows LLMs to execute predefined functions based on user prompts, enabling integration with external systems and performing complex tasks.

Dataset Size Authors Date Description
glaive-function-calling-v2 113k Sahil Chaudhary Sep 2023 Instruction-answer pairs in multiple languages. Locutusque/function-calling-chatml variant available without conversation tags.
xlam-function-calling-60k 60k Salesforce Jun 2024 Created using a pipeline designed for verifiable function-calling data.
Agent-FLAN 34.4k internlm Mar 2024 Combines AgentInstruct, ToolBench, and ShareGPT datasets for training in tool use and function calling.
hermes-function-calling-v1 11.5k NousResearch August 2024 This dataset is the compilation of structured output and function calling data used in the Hermes 2 Pro series of models.

⚖️ Preference Alignment Datasets

Preference datasets for Direct Preference Optimization (DPO) are essential for aligning AI systems with human values and expectations. They improve performance, reduce biases, and enable personalization and effective evaluation.

Dataset Size Authors Date Description
ultrafeedback_binarized_cleaned 186k Allenai Nov 2023 One of the bits of magic behind the Zephyr model.
ultrafeedback-binarized-preferences-cleaned 61k Bartolome et al. March 2024 This dataset is the recommended and preferred dataset by Argilla to use when fine-tuning on UltraFeedback.
HelpSteer 37k Dong et al. Nov 2023 HelpSteer is an open-source Helpfulness Dataset (CC-BY-4.0) that supports aligning models to become more helpful, factually correct and coherent, while being adjustable in terms of the complexity and verbosity of its responses.
Capybara-Preferences 15k Argilla April 2024 This dataset builds on LDJnr/Capybara by creating a preference dataset from an instruction-following dataset, splitting the final assistant turn for alternative model responses, which are then critiqued by GPT-4 using UltraFeedback.
distilabel-intel-orca-dpo-pairs 12k Argilla Jan 2024 The dataset is a "distilabeled" version of the widely used dataset: Intel/orca_dpo_pairs.
Math-Step-DPO-10K 10k Xin et al. June 2024 Step-DPO is a method for improving the mathematical reasoning of large language models (LLMs).
py-dpo-v0.1 9k Jon Durbin Jan 2024 This DPO dataset is designed to improve Python coding skills by using tested responses from the Vezora/Tested-22k-Python-Alpaca dataset as "chosen" values, while "rejected" values, generated from airoboros-l2-13b-3.1 and bagel-7b-v0.1, are considered lower quality, with duplicates removed.
prm_dpo_pairs_cleaned 8k M4-ai April 2024 The dataset was created by cleaning and deduplicating M4-ai/prm_dpo_pairs, removing incorrect completions and about 3,000 duplicate examples, resulting in a high-quality dataset for training a robust math language model.
distilabel-capybara-dpo-7k-binarized 7k Argilla March 2024 DPO dataset built with distilabel atop the awesome LDJnr/Capybara.
distilabel-math-preference-dpo 2k Argilla Nov 2023 Math related DPO dataset by Argilla
contextual-dpo-v0.1 1,3k Jon Durbin Jan 2024 This is a dataset meant to enhance adherence to provided context (e.g., for RAG applications) and reduce hallucinations, specifically using the airoboros context-obedient question answer format
gutenberg-dpo-v0.1 1k Jon Durbin Jan 2024 This is a dataset meant to enhance novel writing capabilities of LLMs, by using public domain books from Project Gutenberg
truthy-dpo-v0.1 1k Jon Durbin June 2024 Truthy DPO is a dataset aimed at improving the truthfulness of LLMs while maintaining immersive roleplay by focusing on corporeal, spatial, temporal awareness, and correcting common misconceptions.
toxic-dpo-v0.2 541 Unalignment Jan 2024 The Toxic-DPO dataset contains harmful and toxic content intended to demonstrate how direct-preference-optimization (DPO) can de-censor a model, with usage restricted to lawful, non-malicious academic or research purposes, and users assuming full responsibility for its use.

🧠 Reasoning Datasets

These datasets focus on enhancing the reasoning capabilities of LLMs by providing distilled synthetic examples from advanced reasoning models such as DeepSeek AI R1, Qwen QwQ, or Google DeepMind Flash Thinking. Curated from Hugging Face’s Reasoning Datasets collection , they offer diverse challenges that improve chain-of-thought and problem-solving skills.

Dataset Size Authors Date Description
ServiceNow-AI/R1-Distill-SFT 1.7M ServiceNow-AI Jan 2025 1.7M samples distilled from DeepSeek-R1-Distill-Qwen-32B from 9 different source datasets (unfiltered yet).
open-thoughts/OpenThoughts-114k 114k open-thoughts Jan 2025 114k samples distilled from DeepSeek R1 on math, science, code, and puzzles.
bespokelabs/Bespoke-Stratos-17k 17k bespokelabs Jan 2025 17k samples distilled from DeepSeek R1; generated in 1.5 hours at a cost of $800.
EricLu/SCP-116K 116k EricLu Jan 2025 116k scientific problem-solution pairs, automatically extracted from web-crawled documents solved by QwQ and o1-mini.
cognitivecomputations/dolphin-r1 300k cognitivecomputations Jan 2025 300k samples distilled from DeepSeek R1 and Gemini 2.0 Flash Thinking with prompts from open-orca.
Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B 250k Magpie-Align Jan 2025 250k samples distilled from DeepSeek-R1-Distill-Llama-70B using the MagPie format (letting the model generate both the prompt and the reasoning).
AymanTarig/function-calling-v0.2-with-r1-cot 58k AymanTarig Jan 2025 58k distilled function call samples with reasoning (proposed distilled from DeepSeek-R1-Distill-Llama-70B based on a prompt format).

🛠️ Tools for Creating High-Quality Datasets

Building a valuable dataset is more about quality than quantity. Here are some tools and methods to help you curate effective datasets:

🧹 Data Deduplication and Cleaning

  • Exact Deduplication: Remove identical entries by normalizing data (e.g., converting text to lowercase), generating hashes (like MD5 or SHA-256), and eliminating duplicates.
  • Fuzzy Deduplication:
    • MinHash: Uses hashing, sorting, and Jaccard similarity for finding similar entries.
    • BLOOM Filters: Employs hashing and fixed-size vectors for approximate duplicate detection.
  • Decontamination: Filter out samples that are too similar to test sets using exact or fuzzy methods.

✅ Evaluating Data Quality

  • Rule-Based Filtering: Remove unwanted content using specific criteria, such as eliminating phrases like "As an AI assistant."
  • Argilla: An open-source platform for collaborative data filtering and annotation.
  • LLM-as-a-Judge: A Colab notebook to rate data quality using models like Mixtral-7x8B.
  • Data Prep Kit: A framework for preparing data for both code and language tasks, scalable from laptops to data centers.
  • DataTrove: A Hugging Face library for large-scale data processing, used in creating Fineweb.

🛠️ Generating Additional Data

Supervised Fine-Tuning (SFT) Datasets

  • Distilabel: Generates and augments data for SFT and DPO using techniques like UltraFeedback and DEITA.
  • Auto Data: Automatically creates fine-tuning datasets using API models.
  • Bonito: Generates synthetic instruction tuning datasets without GPT. Check out AutoBonito as well.
  • Augmentoolkit: Converts raw text into datasets using various models.
  • Magpie: Efficient pipeline for generating high-quality synthetic data by prompting aligned LLMs.
  • Genstruct: An instruction generation model that creates valid instructions from raw data.
  • DataDreamer: A Python library for prompting and generating synthetic data.

Pre-Training Datasets

  • llm-swarm: Generates synthetic datasets for pretraining or fine-tuning using local LLMs or Hugging Face Inference Endpoints.
  • Cosmopedia: Code for creating the Cosmopedia dataset.
  • textbook_quality: Generates textbook-quality data, inspired by Microsoft's Phi models.

🔍 Exploring and Visualizing Data

  • sentence-transformers: A Python library for working with language embedding models.
  • Lilac: Curates better data for LLMs, used by organizations like NousResearch, Databricks, Cohere, and Alignment Lab AI.
  • Nomic Atlas: Interact with and gain insights from instructed data while storing embeddings.
  • text-clustering: A Hugging Face framework for grouping similar textual data.
  • BunkaTopics: Tools for data cleaning and visualizing topic models.
  • Autolabel: Automatically labels data using popular language models.

🌐 Data Scraping

  • Trafilatura: A Python and command-line tool for extracting text and metadata from the web, used to create RefinedWeb.
  • Marker: Quickly converts PDFs into markdown text.

Conclusion

Building effective LLMs requires high-quality data at every stage, from pre-training to fine-tuning and preference alignment. By leveraging the datasets and tools mentioned in this guide, you can enhance your models’ capabilities and ensure they perform well across a variety of tasks.

Feel free to explore these resources and integrate them into your workflow to create more robust and capable language models. Happy modeling!

About

Top-quality datasets, tools, and ideas for enhancing Large Language Models (LLMs).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published