Mozilla and EleutherAI publish research on open datasets for LLM training

Tags
Data Extraction
Events
LLM

In June 2024, Mozilla and EleutherAI hosted the 'Dataset Convening' in Amsterdam, which brought together 30 experts from open source AI startups, nonprofit AI labs, and civil society organizations. The focus was on developing openly licensed and accessible datasets for training LLMs. The resulting research paper, 'Towards Best Practices for Open Datasets for LLM Training' (released in January 2025), outlines possible tiers of openness, normative principles, and technical best practices for sourcing, processing, governing, and releasing open datasets for LLM training. It also identifies opportunities for policy and technical investments to help the emerging community overcome its challenges.

Read more about the convening here.

Check out the 'Towards Best Practices for Open Datasets for LLM Training' paper here.