AI2 releases biggest open dataset for training language models

The training data behind language models like GPT-4 and Claude is often kept secret. By contrast, AI2 has released a new open dataset named Dolma. The dataset will serve as the foundation for its open language model (OLMo), which is intended to be freely available to the AI research community. According to AI2, if the model is meant to be free to use, the dataset used to build it should be too.

Dolma is the first “data artifact” AI2 has released. The institute built the dataset for training AI models, and it will feed into the development of OLMo. In a blog post, Luca Soldaini of AI2 explained the sources and methods used to build the dataset.

OpenAI and Meta share some information about their datasets, but much remains undisclosed. This secrecy can deter scrutiny and hinder improvement, and it may stem from ethical or legal concerns about how the data was collected, such as the use of pirated copies of books.

Because of intense competition, most companies want to protect their methods for training AI models. That secrecy, however, makes it difficult for outside researchers to replicate their work. AI2’s Dolma dataset aims to be transparent instead, documenting all of its sources and methods, including how the raw data was processed into English-language text.

AI2 presents Dolma as the largest open dataset of its kind and the simplest in terms of usage and permissions. It is released under the “ImpACT license for medium-risk artifacts”. Essentially, prospective Dolma users need to:

Provide contact details and intended use cases
Disclose any creations derived from Dolma
Distribute those derivatives under the same license
Agree not to apply Dolma to various prohibited areas like surveillance or disinformation

Individuals who believe their personal data is in the dataset can use the “removal request” form provided by AI2. The form is meant for specific cases and not for general “do not use my data” requests.

Access to Dolma is available via Hugging Face.

About The Author

Mahnoor Kiani
