In 2022, I wrote a proposal for a user-owned foundation model, trained on private data rather than data publicly scraped from the internet. I believed that while public data (e.g. Wikipedia, 4chan) could be used to train foundation models, bringing them to the next level would require high-quality private data, which exists only on siloed platforms (e.g. Twitter, personal messages, company information) that require permission or sign-in to access.
This prediction is starting to play out. As companies like Reddit and Twitter have realized the value of their platform data, they have locked down their developer APIs (1, 2) to prevent other companies from freely training foundation models on their text data.
This is a dramatic shift from even two years ago. Sam Lessin, a VC, summarizes the change: "[The platforms] were just chucking all this trash out the back and no one was guarding it and then all of a sudden you're like oh shit, that trash is gold, right? We got a heck of a lot. We gotta lock down the trash bins." GPT-3, for example, was trained on WebText2, which aggregates the text of pages linked from Reddit submissions with at least 3 upvotes (3, 4). This is no longer possible with Reddit's new API.
The internet is becoming less and less open, with siloed platforms building bigger walls to keep their valuable training data in.
Although developers can no longer access this data at scale, individuals can still access and export their own data across platforms due to data privacy regulation (5, 6). This fact — that platforms are locking down developer APIs while individual users still have access to their data — offers an opportunity: could 100M users export their platform data to create the world’s largest data treasury? This data treasury would aggregate all user data collected by big tech and other companies, who are usually unwilling to part with it. It would be the largest and most comprehensive training dataset to date, 100x bigger than those used to train today’s leading foundation models.1
Users could then create a user-owned foundation model, trained on more data than any single company has been able to aggregate. Training foundation models requires a huge amount of GPU compute. But each user could help train a small piece of the model with their own hardware, then merge the pieces together to create a larger, more capable model (7, 8, 9).2 When the incentives are right, users can bring together a lot of compute. For example, Ethereum miners' combined compute is 50 times greater than that used to train leading foundation models.3
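To make the "train a small piece, then merge" idea concrete, here is a minimal sketch in the spirit of federated averaging: each user fine-tunes their own copy of a model on their private data, and the copies are merged by averaging parameters. The tiny linear model, random data shards, and the `local_update`/`merge` helpers are illustrative placeholders, not a real training recipe for a foundation model.

```python
# Sketch of "each user trains a small piece, then the pieces are merged."
# Assumes a FedAvg-style scheme; the model and data here are toy placeholders.
import copy
import torch
import torch.nn as nn

def local_update(model: nn.Module, data: torch.Tensor, targets: torch.Tensor,
                 steps: int = 10, lr: float = 1e-2) -> dict:
    """Train a private copy of the model on one user's data; return its weights."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(local(data), targets)
        loss.backward()
        opt.step()
    return local.state_dict()

def merge(state_dicts: list) -> dict:
    """Merge per-user weights by simple parameter averaging (one of many merge strategies)."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Toy example: three "users", each holding a private shard of data.
global_model = nn.Linear(8, 1)
shards = [(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(3)]
user_weights = [local_update(global_model, x, y) for x, y in shards]
global_model.load_state_dict(merge(user_weights))
```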
Users who contribute to the model would collectively own and govern it. They could get paid when the model is used, or even paid proportionally to how much their data improved the model. The collective could set usage rules, including who is allowed to access the model and what sort of controls should be in place. Maybe users from each country would create their own model, representing their ideologies and culture. Or maybe a country isn't the right dividing line, and we'll see a world where each network state has its own foundation model based on its members' data.
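As one hypothetical way to make "paid proportionally to how much their data improved the model" concrete, here is a sketch of a leave-one-out scheme: each user's share of a payment is based on how much a held-out quality score drops when their data is excluded. The `train_and_evaluate` callback and the toy example are placeholders; a real collective would likely use cheaper approximations, since retraining once per user is expensive.

```python
# Sketch of contribution-proportional payouts via leave-one-out scoring.
# `train_and_evaluate` is a placeholder for whatever training + held-out
# evaluation the collective actually runs; higher scores mean a better model.

def leave_one_out_shares(user_datasets: dict, train_and_evaluate) -> dict:
    """Return each user's payout share, based on their marginal contribution."""
    full_score = train_and_evaluate(list(user_datasets.values()))
    marginal = {}
    for user in user_datasets:
        others = [d for u, d in user_datasets.items() if u != user]
        # A user's value = how much the model gets worse without their data.
        marginal[user] = max(full_score - train_and_evaluate(others), 0.0)
    total = sum(marginal.values()) or 1.0
    return {user: value / total for user, value in marginal.items()}

def split_payment(payment: float, shares: dict) -> dict:
    """Split a usage payment according to contribution shares."""
    return {user: payment * share for user, share in shares.items()}

# Toy usage with a dummy evaluator that just rewards having more data points.
datasets = {"alice": [1, 2, 3], "bob": [4, 5], "carol": [6]}
shares = leave_one_out_shares(datasets, lambda ds: sum(len(d) for d in ds))
payouts = split_payment(10.0, shares)  # e.g. splitting a $10 usage fee
```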
I encourage you to spend time thinking about what foundation models you'd want to own part of, and what training data you could contribute from platforms you use. You may have more than you realize — your research papers, your unreleased artwork, your Google Docs, your dating profile, your medical records, your Slack messages. One way to bring this data together is through a personal server, which makes it easy to use your private data with local LLMs. In the future, your personal server could also train your piece of the user-owned foundation model.
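For illustration, here is a rough sketch of what a personal server could do with your exported archives: flatten them into one local corpus that a local LLM (or, later, your training shard) could read. The directory layout, file formats, and paths below are assumptions about what platform exports might look like, not any specific platform's actual format.

```python
# Sketch of a personal server combining exported platform data into one corpus.
# Assumes each export has already been unzipped into a folder under EXPORT_DIR
# (hypothetical paths and formats).
import json
from pathlib import Path

EXPORT_DIR = Path("~/exports").expanduser()        # hypothetical folder of unzipped exports
CORPUS_PATH = Path("~/corpus.jsonl").expanduser()  # the combined local corpus

def extract_text(path: Path) -> list:
    """Pull text snippets out of one exported file, whatever its shape."""
    if path.suffix == ".json":
        try:
            data = json.loads(path.read_text(errors="ignore"))
        except json.JSONDecodeError:
            return []
        items = data if isinstance(data, list) else [data]
        return [str(i.get("text", "")) if isinstance(i, dict) else str(i) for i in items if i]
    if path.suffix in {".txt", ".md"}:
        return [path.read_text(errors="ignore")]
    return []

def build_corpus() -> int:
    """Flatten every export into one JSONL corpus, tagged by source platform."""
    count = 0
    with CORPUS_PATH.open("w") as out:
        for path in EXPORT_DIR.rglob("*"):
            if not path.is_file():
                continue
            platform = path.relative_to(EXPORT_DIR).parts[0]  # e.g. "twitter", "google"
            for text in extract_text(path):
                out.write(json.dumps({"source": platform, "text": text}) + "\n")
                count += 1
    return count

if __name__ == "__main__":
    print(f"wrote {build_corpus()} records to {CORPUS_PATH}")
```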
Foundation models tend towards monopolies, as they require huge upfront investments in the form of data and compute. It’s tempting to resign ourselves to the easy option: do the best we can with hand-me-down, open source models that are a few generations behind, the leftovers from large AI companies. But we shouldn’t settle for being a few generations behind and only eating leftovers! We as users should create our own best model — we have the data and compute to make it possible.
Thank you to Casey Caruso, Packy McCormick, Jessy Lin, Owen, and my mom for early feedback on this writing.
1 Data quality is very important, but is not measured in this table. Having more data to train on will not create a better model unless it is good data. I also left out the reinforcement learning from human feedback (RLHF) component - the data I report is only that used in pre-training, which makes up the majority of the training.
2 A user-owned foundation model is still possible without distributed training - users could contribute data and capital to pay for compute, rather than contributing compute directly.
3 It’s hard to compare a data center of colocated GPUs to a set of consumer GPUs spread across the world. I think FLOPs are the best back-of-the-envelope way to measure this, but there are other limiting factors, like the memory of a consumer GPU.