Glue Labs
Oct 18, 2022

What are Pre-trained models?

While exploring the terms that make up GPT-3, we started by learning about generative models. So now let's understand what pre-trained models are.

When you want to get good at any sport, you need to train. You try as many times as you need, make many mistakes along the way, and draw conclusions from your results.

With time you gradually get better and better, and at some point you achieve the desired performance.

An AI model does something similar, but unlike a human athlete, it can run hundreds of thousands of practice sessions within minutes.

To create a model that performs well, you need to train it using a specific set of variables, called parameters.

A model parameter is a configuration variable that is internal to the model and whose value is estimated from the training data.

The process of determining the ideal parameters for your model is called training.

The model learns parameter values through successive training iterations.
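As an illustration (not from the original post), here is a minimal sketch in plain Python of what learning parameter values over successive training iterations can look like: a toy linear model whose two parameters, w and b, are estimated from a handful of made-up training examples by repeatedly nudging them to reduce the prediction error.

```python
# A toy linear model y = w * x + b. The parameters w and b start untrained
# and are estimated from (hypothetical) training data by iterative updates.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # roughly y = 2x + 1

w, b = 0.0, 0.0        # model parameters before training
learning_rate = 0.05

for iteration in range(1000):          # successive training iterations
    grad_w, grad_b = 0.0, 0.0
    for x, y in data:
        error = (w * x + b) - y        # how far the current prediction is off
        grad_w += 2 * error * x
        grad_b += 2 * error
    # nudge each parameter in the direction that reduces the average error
    w -= learning_rate * grad_w / len(data)
    b -= learning_rate * grad_b / len(data)

print(f"learned parameters: w={w:.2f}, b={b:.2f}")  # converges near w=2, b=1
```

Real models work with millions or billions of parameters rather than two, but the principle is the same: the values are not hand-written, they are estimated from the training data.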

A pre-trained model is a model that has already been trained, by someone else, to solve a particular problem.

Instead of building a model from scratch to solve your problem, you use the model trained on another problem as a starting point.

You can take the pre-trained model and give it more specific training in the area of your choice.

A pre-trained model may not be 100% accurate, but it saves you from reinventing the wheel, saving time and improving performance.
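To make the "starting point" idea concrete, here is a minimal fine-tuning sketch. It assumes the Hugging Face transformers library and PyTorch are installed; the model name, example texts, and labels are illustrative choices for this example, not details from the post.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"   # a publicly available pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical labelled examples for the more specific task (here, sentiment).
texts = ["I loved this film", "This was a waste of time"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):                        # a few fine-tuning steps, not full training
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()               # learn from errors on the new task
    optimizer.step()
    optimizer.zero_grad()
```

The pre-trained weights already capture general language knowledge, so only a little task-specific training is needed compared with starting from scratch.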

In machine learning, a model is trained on a dataset. The size and type of data samples vary depending on the task you want to solve.

GPT-3 is pre-trained on a corpus of text from five datasets: Common Crawl, WebText2, Books1, Books2, and Wikipedia.

Common Crawl

The Common Crawl corpus comprises petabytes of data, including raw web page data, metadata, and extracted text, collected over eight years of web crawling. OpenAI researchers use a curated, filtered version of this dataset.

WebText2

WebText2 is an expanded version of the WebText dataset, which is an internal OpenAI corpus created by scraping web pages of particularly high quality.

To vet for quality, the authors scraped all outbound links from Reddit that received at least 3 karma (an indicator of whether other users found the link interesting, educational, or just funny).

The resulting WebText dataset contains over 8 million documents, totaling 40 gigabytes of text drawn from those 45 million links.

Books1 and Books2

Books1 and Books2 are two corpora, or collections of text, that contain the text of tens of thousands of books on various subjects.

Wikipedia

A dataset including all English-language articles from the crowdsourced online encyclopedia Wikipedia. (As of this writing, there were 6,358,805 English articles.) Altogether, GPT-3's training corpus includes nearly a trillion words.

Since GPT-3 is pre-trained on an extensive and diverse corpus of text, it can successfully perform a surprising number of NLP tasks without users providing any additional example data.
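As a sketch of that behaviour, the snippet below sends a single instruction to GPT-3 through the 2022-era openai Python package, with no example data in the prompt. The model choice, prompt, and placeholder API key are assumptions for illustration, not details from the post.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

response = openai.Completion.create(
    engine="text-davinci-002",   # a GPT-3 model available through the API in 2022
    prompt="Translate this sentence into French: I like learning about language models.",
    max_tokens=60,
)

print(response.choices[0].text.strip())
```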