AI has a big problem – a data shortage that could quickly gobble up innovation, writes Satyen K. Bordoloi as he outlines the solutions being cooked up in the pressure cookers called AI companies


Data is the new oil, they said, so they scraped websites old and new. Then they came for Reddit threads, Facebook posts and Twitter feeds. When that wasn’t enough, they took YouTube videos, e-books and newspapers too. To do what? To create ‘big data’ to train bigger AI. But here’s the thing: despite burning fossil fuels for hundreds of years, we haven’t yet run out of oil. Data to train, run, and code AI, though? We’re running on fumes there. Yep, despite the quintillion cat videos and lunch photos you post relentlessly.

Big deal, you scoff; big dadas will figure out big solutions for big data. Well, it’s… complicated.

Houston, We Have a Big Data Problem: Here’s a mind-blowing stat: GPT-3.5 had 175 billion parameters, and GPT-4 has been rumoured – in unverified estimates – to have as many as 100 trillion, a jump of over 57,000%. More parameters mean more complexity. Guess what more complexity usually requires? Yep, an even bigger appetite for training data!

Picture this: Artificial Intelligence models are like hungry teenagers who’ve raided the fridge, eaten everything in sight, and are still asking, “What’s for dinner?” These data-hungry beasts have already chomped their way through everything online – legally and, wink wink, not so legally. They’re still hungry, and stale leftovers simply won’t do.

We are surrounded by an ocean of data

But here’s where it gets interesting (and a bit scary): Many researchers and observers have pointed out that the amount of high-quality, diverse data needed to train cutting-edge AI models has been increasing at a rapid clip. It’s like trying to fill an Olympic-sized swimming pool that keeps getting bigger while your garden hose stays the same size. Yikes!

Why Should We Care?: Think of it this way: if AI systems are trained on limited or biased data, they’re like someone who’s only watched romantic comedies trying to predict how real relationships work. Not great, right? This can lead to some serious facepalm moments, like facial recognition systems that work well on one group (read: Whites) but fail on others.

Or language models that sound like they learned English exclusively from Twitter arguments – remember Microsoft’s Twitter AI bot Tay, which ended up being their Blair Witch Project? Or AI assistants that know a lot about a lot of things but are low on common sense, like the time ChatGPT told me it is possible to walk across the English Channel.

It’s like the authors of the famous paper “On the Dangers of Stochastic Parrots” said of Large Language Models (LLMs), in effect: “Hey, these AI models are just fancy copycats, and we need to watch what they’re copying!”

Modern AI systems require exponentially more data to achieve incremental improvement

The Cool Solutions Squad: Solutions are at hand. Some border on the quaint: researchers are scouting libraries to scan books. This is a tad underwhelming because scanning is slow and labour-intensive; even with a vast workforce, how much ‘data’ can you really create that way? Meanwhile, clever folks in lab coats (and probably hoodies) have been cooking up workable solutions as nifty as the problem is knotty.

Data Gymnastics, aka Data Augmentation: Imagine you have one photo of a cat. Now, flip it, rotate it, zoom in, and add some filters. Boom! You’ve got multiple training examples from one image. It’s like meal prepping but for AI! This trick helps squeeze more juice out of existing data.

Several studies suggest that smart data augmentation can cut the amount of data required to train a machine learning model by as much as 60%. With the right augmentation tricks, models can perform almost as well as those trained on far larger datasets.
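For the technically curious, here’s a minimal sketch of data augmentation in Python using the torchvision library. The ‘cat photo’ is just a dummy stand-in image, and the particular transforms are illustrative choices, not a prescribed recipe:

```python
# A minimal data-augmentation sketch (assumes torchvision and Pillow are installed).
from PIL import Image
from torchvision import transforms

# Stand-in for the single cat photo we actually have.
original = Image.new("RGB", (256, 256), color=(200, 150, 100))

# One pipeline of random flips, rotations, crops and colour jitter.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.RandomResizedCrop(size=224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
])

# Each call produces a slightly different training example from the same photo.
augmented_batch = [augment(original) for _ in range(8)]
print(f"Generated {len(augmented_batch)} training examples from 1 image")
```

Each pass through the pipeline yields a slightly different image, so one photograph can stand in for dozens of training examples.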

The world is being reshaped by AI, and what is shaping AI is data

Fake It Till You Make It, aka Synthetic Data: The magic mantra of Silicon Valley, it turns out, can also be applied to AI training. Using a fancy tech called GANs (Generative Adversarial Networks), researchers are creating fake data that looks real. It’s like having a 3D printer for data! Need photos of rare medical conditions? Traffic accidents that haven’t happened? No problem – just generate them using what is already there!
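To make the idea concrete, here’s a toy GAN sketch in PyTorch – nothing like NVIDIA’s production systems, just a tiny generator and discriminator playing their adversarial game over a made-up one-dimensional dataset, after which the generator can churn out synthetic samples on demand:

```python
# A toy GAN sketch: generator learns to mimic a simple "real" distribution.
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Real" data we are short of: samples from a Gaussian centred at 4.
def real_batch(n=64):
    return torch.randn(n, 1) * 1.25 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    # 1) Train the discriminator to tell real samples from generated ones.
    real = real_batch()
    fake = generator(torch.randn(64, 8)).detach()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, the generator becomes a faucet of synthetic data.
synthetic = generator(torch.randn(1000, 8))
print("synthetic mean ≈", synthetic.mean().item())
```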

NVIDIA’s been crushing it in this space with its GauGAN2 system (yes, the name’s a pun on the post-Impressionist painter Paul Gauguin), which can turn a simple written phrase or sentence into a photorealistic masterpiece. Synthetic data from such systems has reportedly fooled even experts.

So, is synthetic data the cure for the data scarcity problem? Maybe not. Remember my Sify article titled Copy Of A Copy: Content Generated By AI, Threat To AI Itself? As I pointed out there, relying on synthetic content beyond a point could lead to model collapse. So, nope, no final solution yet. We march forward.

Data augmentation: One image becomes many through creative transformations

Team Players, aka Federated Learning: Think of this like you would a massive multiplayer game with each player keeping their cards close to their chest. Different organisations can train AI models together without sharing their secret sauce (aka sensitive data). For example, hospitals can work together to create better medical AI without sharing patient records. Pretty neat, isn’t it?

As with many cool things in AI, Google introduced the concept of federated learning and has been leading the charge. If you own an Android phone, you’ve been a beneficiary: Gboard makes next-word predictions without ever ‘seeing’ your embarrassing text messages. Thus, instead of on a central server, AI can be trained across tens, thousands, or even millions of devices – as Google’s research team claims to have done – while the data stays local on each device.
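The workhorse recipe here is federated averaging (FedAvg): every device trains a local copy of the model on its own data and sends back only the updated weights, which a server averages into a new global model. Below is a toy PyTorch sketch; the three ‘clients’ and their datasets are entirely made up:

```python
# A toy federated averaging (FedAvg) sketch: only weights travel, never data.
import copy
import torch
import torch.nn as nn

def local_update(global_model, local_x, local_y, epochs=5):
    """Train a copy of the global model on one client's private data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(local_x), local_y)
        loss.backward()
        opt.step()
    return model.state_dict()          # only the weights leave the "device"

def federated_average(weight_list):
    """Server-side step: average the clients' weights parameter by parameter."""
    avg = copy.deepcopy(weight_list[0])
    for key in avg:
        avg[key] = torch.stack([w[key] for w in weight_list]).mean(dim=0)
    return avg

global_model = nn.Linear(3, 1)
# Three clients with private datasets that never leave their devices.
clients = [(torch.randn(20, 3), torch.randn(20, 1)) for _ in range(3)]

for round_ in range(10):               # communication rounds
    local_weights = [local_update(global_model, x, y) for x, y in clients]
    global_model.load_state_dict(federated_average(local_weights))
```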

These are not the only solutions in the AI soup kitchen. The next big things range from the commonsensical to the wildly exciting.

Self-Learning Superstars: Imagine AI systems that can learn like humans do – by observing and figuring things out without being explicitly taught. That’s what self-supervised learning is all about. It’s like giving AI systems the ability to watch YouTube tutorials and actually learn from them!

Facebook AI Research (now Meta AI) showed this off with its SEER model, which learned from a billion random Instagram images without any labels. The cool part? Meta reported that it performed better than models trained on carefully labelled datasets. SEER learned by working out relationships between unlabelled images, an approach that Yann LeCun, Facebook AI’s chief scientist, sees as key to developing AI with “common sense”. Take that, traditional training methods!
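The core trick of self-supervised learning is simple enough to sketch: manufacture the labels from the data itself. The toy example below (nowhere near SEER’s billion-image scale, and using made-up token sequences rather than photos) hides the last token of each unlabelled sequence and trains a model to predict it, with no human annotator in sight:

```python
# A toy self-supervised pretext task: hide part of the input, predict it back.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Unlabelled "data": random sequences of 10 token ids from a vocabulary of 50.
data = torch.randint(0, 50, (500, 10))

embed = nn.Embedding(50, 32)
model = nn.Sequential(nn.Linear(9 * 32, 64), nn.ReLU(), nn.Linear(64, 50))
opt = torch.optim.Adam(list(embed.parameters()) + list(model.parameters()), lr=1e-3)

for epoch in range(20):
    for seq in data.split(64):
        target = seq[:, -1]                      # pretext "label": the hidden last token...
        context = embed(seq[:, :-1]).flatten(1)  # ...predicted from the rest of the sequence
        loss = nn.functional.cross_entropy(model(context), target)
        opt.zero_grad(); loss.backward(); opt.step()
```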

Google’s Quantum Computer: Quantum computing might offer new solutions to AI’s data hunger

Mix and Match, aka Transfer Learning: This is like teaching someone to ride a bike and then saying, “Hey, these skills will help you ride a motorcycle!” AI models can take what they’ve learned from one task and apply it to another, needing less new data to master new skills.

Sebastian Ruder, an NLP researcher based in Berlin, argued in his 2019 PhD thesis that transfer learning can reduce the need for task-specific data by a considerable margin. Anyone love to plod through a 329-page seminal thesis to understand how? Click this link, and read away.
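In practice, transfer learning often looks like this: take a network pretrained on a huge dataset, freeze what it already knows, and retrain only a small new ‘head’ on your modest task-specific data. A minimal torchvision sketch, assuming a hypothetical five-class task and dummy data:

```python
# A minimal transfer-learning sketch: reuse ImageNet features, retrain only a new head.
import torch
import torch.nn as nn
from torchvision import models

# Load a model that already "knows how to ride a bike" (ImageNet features).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze everything it already learned...
for param in backbone.parameters():
    param.requires_grad = False

# ...and bolt on a fresh head for the new "motorcycle" task (5 classes assumed).
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head's parameters get updated during fine-tuning.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

# One illustrative training step on a dummy batch.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
loss = nn.functional.cross_entropy(backbone(images), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```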

The Wild Card: Agentic AI: Some big brains in the field, like Ilya Sutskever and Yoshua Bengio, think the future might be AI systems that can learn more independently, almost like how animals adapt to new environments. Bengio’s work on “System 2” deep learning suggests we could create AI that reasons more like humans do, requiring less brute-force data and more actual understanding. It’s like teaching AI to fish instead of just feeding it fish!

Quantum Computing to the Rescue?: Plot twist – quantum computing, as I have written previously, might be the secret sauce AI needs! Google’s Quantum AI lab has been experimenting with quantum machine learning algorithms that could potentially learn from smaller datasets. Dr. John Martinis, the lab’s former chief scientist, has suggested that quantum approaches could reduce data requirements by orders of magnitude. Though, let’s be honest, quantum computing is still more “future tech” than “next week’s release.”

One team’s data scarcity is another team’s wellspring of creativity. The data shortage in AI is pushing many to get creative and rethink how we train these systems. From creating synthetic data to teaching AI to learn more efficiently, some awe-inspiring innovations are emerging. So, this data diet the AI world has been forced onto may not be so bad after all. It might just help us build systems that are not just bigger, but also smarter.

Satyen is an award-winning scriptwriter and journalist based in Mumbai. He loves to let his pen roam the intersection of artificial intelligence, consciousness, and quantum mechanics. His written words have appeared in many Indian and foreign publications.
