Update: I wrote a part 2 on user-owned foundation models
Today, GPT-3 is trained on publicly scraped data from the internet. What if it were also trained on private data contributed directly by users, opening up datasets like Facebook messages, Instagram posts, and Gmail emails that are usually siloed within their platforms?
I originally presented this memo at a Vana all-hands. I'm including this Slack message for context:
GPT-3 is trained on these text datasets:
- Common Crawl (publicly scraped webpages)
- WebText2 (text from publicly available URLs posted on Reddit with at least 3 upvotes)
- Books1 and Books2 (two book corpora; it's unclear exactly which)
- Wikipedia
Source: Wikipedia GPT-3 page
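For a sense of how these sources are combined: the GPT-3 paper (Brown et al., 2020) reports sampling training documents from these datasets with fixed mixture weights rather than in proportion to their raw size. Below is a minimal Python sketch of that mixture-weighted sampling. The weights are the paper's rounded percentages (they sum to 101% due to rounding, which `random.choices` normalizes away); the sampling loop is illustrative, not the actual training pipeline.

```python
import random

# Training-mixture weights reported for GPT-3 (Brown et al., 2020, Table 2.2).
# The paper's rounded percentages sum to 101%; random.choices() renormalizes.
MIXTURE_WEIGHTS = {
    "common_crawl": 0.60,  # filtered Common Crawl
    "webtext2":     0.22,  # text from links posted on Reddit with at least 3 upvotes
    "books1":       0.08,
    "books2":       0.08,
    "wikipedia":    0.03,
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Quick check: sampled counts should track the mixture weights.
rng = random.Random(0)
counts = {name: 0 for name in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)
```

A permissioned corpus could slot into a mixture like this as just another named source with its own weight.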
But by some estimates, less than 0.1% of the internet is publicly scrapable. The rest requires permissions or a sign-in to access, so it isn't used as training data today. I would like to explore including permissioned data in training models by directly asking users to contribute their platform data. Proposed data sources, drawn from a representative sample of the US population:
- Facebook messages
- Instagram posts
- Gmail emails
These private datasets contain rich information about how people really communicate with each other, but they are usually unavailable because the companies that capture the data don't want to share it. With Vana, we empower users to export their data from siloed platforms and rent or sell it as a contribution to machine learning models. This data would allow GPT-3 to better understand the human language used in non-public parts of the internet.
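To make the consent mechanics concrete, here is a hypothetical sketch of what a user-contributed record and its training-eligibility check might look like. The `ContributedRecord` type, field names, and consent scopes are my assumptions for illustration, not Vana's actual export format or API.

```python
from dataclasses import dataclass, field

@dataclass
class ContributedRecord:
    """One piece of user-exported platform data (hypothetical schema)."""
    text: str            # the message, post, or email body
    source: str          # e.g. "facebook_messages", "instagram_posts", "gmail"
    contributor_id: str  # pseudonymous identifier for the contributing user
    consent_scopes: list[str] = field(default_factory=list)  # e.g. ["train", "resell"]

def build_training_corpus(records: list[ContributedRecord]) -> list[str]:
    """Keep only text whose contributor explicitly opted in to model training."""
    return [r.text for r in records if "train" in r.consent_scopes]

records = [
    ContributedRecord("hey, running 10 min late", "facebook_messages", "u123", ["train"]),
    ContributedRecord("re: invoice attached", "gmail", "u456", []),  # no consent: excluded
]
print(build_training_corpus(records))  # ['hey, running 10 min late']
```

The point of the scope check is that consent travels with the data: a record can be excluded from training without being deleted from the user's export.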
Open questions