Synthetic Data Will Train 80% of AI by 2028

Last month, a founder pitched me their “revolutionary” AI startup. Fifteen minutes in, I asked the obvious question: “Where’s your training data coming from?”

They froze.

Turns out, they’d been scraping Reddit and hoping nobody would notice. Classic move. Also, increasingly illegal.

Here’s the thing nobody wants to admit: we’ve basically used up the internet. Every book’s been scanned. Every article’s been crawled. Every Reddit argument about whether hot dogs are sandwiches has been fed into an LLM somewhere.

And now we’re hitting what I call the Data Wall.

But instead of panicking, some of the smartest teams I know are doing something counterintuitive. They’re not finding more data. They’re making it.

Why Real Data Is Starting to Suck

I’ve been building AI products for 17 years. I’ve seen plenty of trends come and go. But this one’s different.

The traditional playbook was simple: scrape everything, clean it, train on it, ship it. Worked great when the internet was mostly human-written content. Now? Not so much.

Look, I love real-world data as much as the next guy. But let’s be honest about its problems:

It’s messy as hell. Half the data you scrape is wrong, biased, or just weird. I once audited a training dataset that included 12,000 variations of the same spam email. Great for learning spam patterns, terrible for learning anything else.

It’s expensive. Want to train a healthcare AI? Cool, go pay data labelers $50/hour to annotate medical images. Need financial data? That’ll be $100K for a single dataset license. And you still don’t know if it’s any good.

It’s a legal nightmare. Gartner predicts that by 2027, 60% of data and analytics leaders will encounter failures in managing synthetic data, but honestly? The bigger risk right now is getting sued for using real data. GDPR in Europe, lawsuits from the New York Times, artists suing over image training—the legal bills are piling up faster than the models improve.

It doesn’t have what you actually need. This is the one that kills me. You need your AI to handle a specific edge case—like a car detecting a child running into the street at night in snow. How much real data do you have of that? Zero, because if it happened, someone would be dead. You can’t train on disasters that haven’t occurred yet.

So yeah, real data has been great. But it’s not scaling anymore.

Enter Synthetic Data: What It Actually Is

Every time I mention synthetic data, someone says: “So you’re training AI on fake information?”

Let me clear this up.

Synthetic data is artificially generated data that has the same statistical properties as real data, but doesn’t correspond to actual people or events. Think of it like this: if real data is a photograph, synthetic data is a high-quality painting that captures all the important details without being a replica.
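
To make the “same statistical properties” idea concrete, here’s a toy sketch (stdlib only; the patient ages are fabricated for illustration): learn the parameters of a real distribution, then sample fresh values from it.

```python
# Toy illustration, not a production generator: capture a real dataset's
# statistical shape, then sample brand-new values that share it.
import random
import statistics

random.seed(7)

# Stand-in for "real" data: 1,000 patient ages (fabricated for this example).
real_ages = [random.gauss(40, 12) for _ in range(1_000)]

# Learn the distribution's parameters from the real data...
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# ...then sample synthetic ages that match those statistics but
# correspond to no actual person in the original list.
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1_000)]

print(round(mu, 1), round(statistics.mean(synthetic_ages), 1))
```

Real generators model joint distributions across many columns, but the principle is the same: the painting keeps the statistics, not the people.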

Here’s why that matters.

Microsoft proved this with their Phi-1 model. They took a tiny 1.3 billion parameter model—nothing compared to GPT-4’s rumored trillion parameters—and trained it mostly on synthetic “textbook quality” data generated by GPT-3.5.

The result? It beat models 10x its size on coding benchmarks. Not because it had more data. Because it had better data.

That paper’s title says it all: “Textbooks Are All You Need.” They essentially had GPT-3.5 write perfect Python textbooks and exercises, then trained Phi-1 on that curated synthetic content instead of the messy garbage scraped from Stack Overflow.

The model learned faster, performed better, and didn’t inherit all the bad coding habits from random internet forums.

Why the Numbers Actually Matter

Let me show you where this is actually heading.

Gartner predicted back in 2023 that by 2024, 60% of data for AI would be synthetic. They’ve since said by 2030, synthetic data will completely overshadow real data in AI models.

That’s not a trend. That’s a takeover.

And it’s already happening. NVIDIA doesn’t train robots on real-world trials anymore—too slow, too dangerous, too expensive. They use Omniverse Replicator to generate synthetic training environments. They can simulate a humanoid robot falling down a thousand different ways in a physics engine, then have it learn from those synthetic failures without breaking a single real robot.

The result? Robots that learn in days what would take months in the real world.

Here’s what I’m seeing in the companies I work with: if you’re still web scraping in 2026, you’re burning cash. The teams winning right now are the ones generating proprietary synthetic datasets that their competitors can’t access.

The Model Collapse Problem (And How to Avoid It)

Okay, real talk. There’s a legitimate concern here, and I’m not going to pretend it doesn’t exist.

It’s called model collapse.

A paper from Oxford and Cambridge researchers showed that if you naively train AI on AI-generated output over and over, the models degrade. They literally published a paper called “The Curse of Recursion: Training on Generated Data Makes Models Forget.”

The problem is simple: AI makes subtle mistakes. If you train the next generation of AI on those mistakes, they compound. Do that enough times and your model forgets rare edge cases and converges on average, boring outputs.

Think of it like making a photocopy of a photocopy. Each generation gets a little worse.

But—and this is crucial—that only happens if you’re sloppy about it.

The fix? Curate your synthetic data.

A Nature paper in 2024 confirmed that indiscriminate use of synthetic data causes collapse. But Microsoft’s Phi-1 experiment proved the opposite: curated synthetic data makes models smarter.

The difference is quality control. You don’t just generate synthetic data and dump it in. You:

  1. Filter it for correctness
  2. Balance it to avoid bias amplification
  3. Mix it with real data as a foundation
  4. Verify outputs systematically
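
A minimal sketch of that four-step loop, with a hypothetical generator and validator standing in for whatever your pipeline actually uses:

```python
# Sketch of the curation loop: filter, balance, mix with real, verify.
# The generator, validity check, and labels are hypothetical stand-ins.
import random
from collections import Counter

random.seed(0)

def generate_sample():
    """Hypothetical generator: returns (text, label), occasionally malformed."""
    label = random.choice(["fraud", "legit", "legit", "legit"])
    text = f"txn-{random.randint(0, 9999)}" if random.random() > 0.1 else ""
    return text, label

def is_valid(sample):
    """Step 1: filter for correctness (here: non-empty text)."""
    text, _ = sample
    return bool(text)

def balance(samples, per_class):
    """Step 2: cap each class so no label dominates the mix."""
    counts, kept = Counter(), []
    for s in samples:
        if counts[s[1]] < per_class:
            kept.append(s)
            counts[s[1]] += 1
    return kept

real_data = [("real-txn-1", "legit"), ("real-txn-2", "fraud")]  # ground truth

raw = [generate_sample() for _ in range(1_000)]
filtered = [s for s in raw if is_valid(s)]       # step 1
balanced = balance(filtered, per_class=100)      # step 2
training_set = real_data + balanced              # step 3: real as foundation

# Step 4: verify the final mix systematically before training on it.
assert all(is_valid(s) for s in training_set)
print(Counter(label for _, label in training_set))
```

In a real pipeline the validity check might be a unit-test run for generated code or a secondary model scoring outputs, but the shape of the loop doesn’t change.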

Do it right, and synthetic data isn’t a bug—it’s a feature.

Three Ways Synthetic Data Creates Competitive Advantage

Let me get practical. Here’s where synthetic data creates actual competitive advantage:

1. The Privacy Hack

You want to build a healthcare AI, but you can’t touch patient records without drowning in GDPR/HIPAA compliance. Classic Catch-22.

Solution? Generate synthetic patient data that has the same statistical patterns as real data, but doesn’t correspond to any actual person.

Tools like Gretel can create synthetic medical records that pass privacy audits because there’s no real patient to protect. You get the training data you need without the legal risk.

I’ve seen fintech startups do the same thing with transaction data. Synthetic fraud patterns let you train detection models without exposing customer data.
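
As a hedged illustration of that fintech case, here’s a toy generator that fabricates labeled transactions with fraud-like patterns. The field names, value ranges, and 5% base rate are assumptions for the example, not real customer statistics:

```python
# Fabricate transaction records that mimic plausible fields without
# copying any customer's data. All parameters here are illustrative.
import random

random.seed(1)

MERCHANT_CATEGORIES = ["grocery", "travel", "electronics", "fuel"]

def synthetic_transaction(fraud: bool) -> dict:
    """Fraud-labeled rows skew toward odd hours and high amounts."""
    if fraud:
        amount = round(random.uniform(900, 5_000), 2)
        hour = random.choice([1, 2, 3, 4])            # late-night activity
    else:
        amount = round(random.lognormvariate(3.5, 0.8), 2)
        hour = random.randint(8, 22)
    return {
        "amount": amount,
        "hour": hour,
        "category": random.choice(MERCHANT_CATEGORIES),
        "label": "fraud" if fraud else "legit",
    }

# 5% fraud rate, mirroring a hypothetical real-world base rate.
dataset = [synthetic_transaction(random.random() < 0.05) for _ in range(10_000)]
print(sum(t["label"] == "fraud" for t in dataset), "fraud samples of", len(dataset))
```

A detection model trained on rows like these never touches a real account, which is the entire point.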

2. The Edge Case Engine

Remember that self-driving car problem? You need to train your AI on a kid running into traffic at night in the snow.

You can’t wait for that to happen. And you definitely can’t stage it.

But you can simulate it. A thousand times. With different weather. Different lighting. Different speeds.

This is what Waymo and Tesla do. They don’t just collect real dashcam footage. They generate synthetic scenarios that almost never occur naturally, then train their models to handle them.

It’s not about replacing real data. It’s about filling the gaps where real data doesn’t exist.
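
That scenario sweep can be sketched as a parameter grid. The parameters and values below are illustrative assumptions; each combination represents one variant a simulator would render:

```python
# Enumerate variants of a single edge case that would be impossibly
# rare (or unethical) to capture in the real world.
from itertools import product

weather = ["clear", "rain", "snow", "fog"]
lighting = ["day", "dusk", "night"]
pedestrian_speed_mps = [1.0, 2.5, 4.0]   # walking to sprinting
vehicle_speed_kph = [30, 50, 70]

scenarios = [
    {"weather": w, "lighting": l, "ped_speed": p, "veh_speed": v}
    for w, l, p, v in product(
        weather, lighting, pedestrian_speed_mps, vehicle_speed_kph
    )
]

# 4 * 3 * 3 * 3 = 108 variants of one edge case, from four knobs.
print(len(scenarios))
```

Four parameters already yield 108 variants; add road geometry, occlusion, and sensor noise and the grid explodes into exactly the coverage real dashcam footage can’t provide.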

3. The David vs. Goliath Play

This one’s my favorite.

Big companies have massive datasets collected over years. You’re a two-person startup. How do you compete?

You don’t out-scrape them. You out-generate them.

I advised a startup that used Gretel to create a proprietary synthetic dataset for industrial inspection. They generated millions of variations of manufacturing defects that their competitor—who only had real inspection photos—couldn’t match.

Six months later, their model was more accurate than the incumbent’s. Not because they had more data, but because they had more relevant data.

That’s the play. Generate the data you need, not the data you can find.

What You Need to Know Before You Start

Here’s where it gets messy.

Synthetic data isn’t magic. It’s only as good as your generation process. If you use GPT-4 to generate training data, you’re inheriting GPT-4’s biases and limitations. Garbage in, garbage out still applies—you’ve just moved the garbage upstream.

You still need real data as a foundation. Purely synthetic data from scratch doesn’t work. You need real-world ground truth to calibrate against. Think of synthetic data as scaffolding, not the building.

Model collapse is a real risk if you’re careless. I mentioned this earlier, but it’s worth repeating. The research from Oxford shows that recursive training on synthetic data causes models to forget rare edge cases. You need quality filters and verification loops.

The tools aren’t all mature yet. Some synthetic data platforms are great (NVIDIA’s Omniverse, Gretel, Mostly.ai). Others are basically random noise generators with good marketing. Do your homework.

The 2028 Prediction (And Why It Matters Now)

By 2028, Gartner estimates 80% of AI training will use synthetic data. That’s less than three years away.

Which means the window to build competitive advantage here is right now.

The companies that figure out synthetic data generation in 2026 will have datasets their competitors can’t replicate in 2028. That’s a moat.

But the founders who wait? They’ll be licensing the same commodity synthetic datasets as everyone else, wondering why their models perform the same as their competitors’.

Here’s my advice after working with 100+ SaaS companies on this:

Start small. Don’t try to generate your entire training set synthetically. Pick one specific use case where real data is expensive, scarce, or legally problematic.

Use it as augmentation, not replacement. Real data is still your baseline. Synthetic data fills the gaps and balances the distribution.
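
One way to operationalize “augmentation, not replacement” is a hard cap on the synthetic share of the training set. The 30% cap below is an assumed knob for illustration, not a recommendation from any vendor or paper:

```python
# Keep real data as the baseline; top up with synthetic samples
# without letting them exceed a chosen fraction of the final set.
import random

random.seed(3)

real = [("real", i) for i in range(700)]
synthetic_pool = [("synthetic", i) for i in range(10_000)]

MAX_SYNTHETIC_FRACTION = 0.30  # assumed cap, tune for your domain

# Solve for how many synthetic rows keep the final mix at the cap:
# n_synth / (n_real + n_synth) == MAX_SYNTHETIC_FRACTION
n_synth = round(len(real) * MAX_SYNTHETIC_FRACTION / (1 - MAX_SYNTHETIC_FRACTION))
mix = real + random.sample(synthetic_pool, n_synth)
random.shuffle(mix)

frac = sum(1 for src, _ in mix if src == "synthetic") / len(mix)
print(len(mix), round(frac, 2))
```

The right cap is an empirical question per domain; the point is that it should be an explicit, monitored number, not whatever the generator happens to emit.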

Invest in quality control. This is the difference between “it worked” and “it failed spectacularly.” Build verification pipelines. Test outputs systematically. Don’t trust the generator blindly.

Think about IP strategy. Proprietary synthetic datasets can be patentable or trade secret protected in ways that scraped data can’t be. Talk to your lawyers.

Where This Actually Leaves You

Most founders don’t want to hear this, but I’ll say it anyway: your competitive advantage probably isn’t in having more data. It’s in having better data.

And “better” increasingly means “synthetic.”

The internet ran out of words. But that’s fine. We’ll just make new ones—higher quality, perfectly tailored, legally clean.

The teams that figure this out first aren’t just building better AI. They’re building defensible businesses that can’t be copied by scraping the same public datasets.

That’s the actual opportunity here.

The Data Wall isn’t a problem. It’s a filter. And if you’re still trying to climb over it with web scrapers, you’re already behind.


Want to stay ahead of the AI infrastructure shifts that matter? Join 210,000+ founders and tech leaders getting my weekly insights on building defensible AI businesses. Subscribe here.


Key Takeaways

Let me break this down simply:

  • The internet’s training data is basically exhausted. We’ve scraped everything worth scraping. Collecting more real data is expensive, legally risky, and often doesn’t have what you actually need.
  • Synthetic data isn’t “fake”—it’s engineered. It has the same statistical properties as real data without privacy concerns or legal risk. Microsoft proved with Phi-1 that small models trained on quality synthetic data can beat large models trained on messy real data.
  • Gartner predicts 80% of AI training will use synthetic data by 2028. This isn’t a distant future—it’s happening now. Teams that build synthetic data capabilities in 2026 will have moats their competitors can’t replicate.
  • Model collapse is real but avoidable. Naive recursive training on synthetic data degrades models, but curated synthetic data with quality controls actually improves them.
  • Three killer use cases: Privacy-compliant datasets (healthcare, fintech), edge case engineering (autonomous systems), and proprietary data generation (startup competitive advantage).
  • It’s not about replacement—it’s about augmentation. Real data remains the foundation. Synthetic data fills gaps, handles edge cases, and scales where real data can’t.

The advantage goes to founders who stop scraping and start generating. The data you create is more defensible than the data you collect.

Swarnendu De