The training data behind language models like GPT-4 and Claude is often kept secret. In contrast, AI2 has released a new open dataset named Dolma. This dataset will serve as the foundation for its open language model (OLMo), which is intended to be freely available to the AI research community. According to AI2, if the model is meant to be free to use, the dataset used to build it should be too.
Dolma is AI2’s first “data artifact.” The institute produced the dataset for AI training, and it will feed into the development of OLMo. In a blog post, AI2’s Luca Soldaini explained the sources and methods used to build the dataset.
OpenAI and Meta share some information about their datasets, but much remains restricted. This secrecy deters scrutiny and hinders improvement, and it may stem from ethical or legal concerns about how the data was collected, such as the use of pirated copies of books.
Because competition is intense, most companies want to protect their methods for training AI models. This secrecy, however, makes it difficult for outside researchers to replicate their work. AI2’s Dolma dataset, by contrast, aims to be transparent by documenting all of its sources and methods, including how the data was processed into English-language text.
Dolma is the largest open dataset of its kind and, according to AI2, the most straightforward in terms of use and permissions. It is released under the “ImpACT license for medium-risk artifacts.” Essentially, prospective Dolma users need to:
- Provide contact details and intended use cases
- Disclose any creations derived from Dolma
- Distribute those derivatives under the same license
- Agree not to apply Dolma to various prohibited areas like surveillance or disinformation
Individuals who are concerned that their personal data is in the dataset can use the “removal request” form provided by AI2. The form is intended for specific cases, not for general “do not use my data” requests.
Access to Dolma is available via Hugging Face.