User-Owned Foundation Models

Update: I wrote a part 2 on user-owned foundation models

Today, GPT-3 is trained on publicly scraped data from the internet. What if it were also trained on private data contributed directly by users, unlocking datasets that are usually siloed within a platform, such as Facebook messages, Instagram posts, and Gmail emails?

I originally presented this memo at a Vana all-hands. I'm including this Slack message for context:

GPT-3 is trained on these text datasets: Common Crawl (publicly scraped webpages), WebText2 (publicly available URLs posted on Reddit with more than 3 upvotes), Books1 and Books2 (it is unclear exactly which book datasets were used), and Wikipedia.

Source: Wikipedia GPT-3 page

But estimates suggest that less than 0.1% of the internet is publicly scrapable. The rest requires permissions or a sign-in to access, so it isn't used as training data today. I would like to explore including permissioned data in model training by directly asking users to contribute their platform data. Proposed data sources:

  • Personal message data (Instagram, iMessage, Text, WhatsApp, Messenger, Telegram)
  • Work-related message data (Email, Slack)
  • Social media posts and comments (Facebook, Instagram, TikTok)

Ideally, this data would come from a representative sample of the US population.

These private datasets contain a wealth of information on how people really communicate with each other, but they are usually unavailable because the companies that capture the data don't want to share it. With Vana, we empower users to export their data from siloed platforms and rent or sell it to contribute to machine learning models. This data would help GPT-3 better understand the human language used in non-public parts of the internet.
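Before user-contributed messages could enter a training corpus, they would need at least a basic redaction pass. As a minimal sketch of what that might look like: the function and regex patterns below are illustrative assumptions, not Vana's actual pipeline, and real PII filtering would need NER models and far broader coverage (names, addresses, account numbers).

```python
import re

# Illustrative patterns only; a production filter would cover many more PII types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens
    before a contributed message is added to a training corpus."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

# Example:
# redact_pii("mail me at jane.doe@example.com") -> "mail me at [EMAIL]"
```

Replacing PII with consistent placeholder tokens (rather than deleting it) preserves sentence structure, so the model still sees natural message shapes.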

Open questions

  • How much data, from how many people, is required for a minimum viable sample? 
  • Which data sources would help GPT-3 most? 
  • How do we filter data to ensure safety? 
  • How do we ensure the data is representative of what we want to capture?