Sustaining the Future of LLMs: Overcoming the Data Challenge

Update (2026): a lot has changed since this was written — read our follow-up, Synthetic Data, Revisited.

Introduction

Only a few years after their emergence, there is little doubt about the potential of Large Language Models (LLMs) and their impact on our future.

They have disrupted the AI field and created a multibillion-dollar industry that is expected to grow at a compound annual growth rate (CAGR) of 33.7%, and many other industries across the globe increasingly depend on them. Despite the gloomy future that some skeptics paint, a positive yet realistic view is that LLMs will change how we work in many professions and help humanity solve some of its essential economic, environmental, and medical challenges more effectively and more quickly.

The Upcoming Data Crisis for LLMs

To maintain their development momentum and widespread reach, however, LLMs need training data, and analysts, experts, and business leaders are increasingly warning that the supply is diminishing.

Human-generated web data, accumulated over decades and initially used to train the models behind products like ChatGPT and Claude, is expected to run dry in the next few years. Epoch AI, a San Francisco-based research institute investigating key trends in AI, predicts in a 2024 study, "If the current trends continue, language models will fully utilize the existing data stock between 2026 and 2032". According to a 2024 article in The Economist, the stock of high-quality text data on the internet will have been used up by 2028.

On the other hand, using LLMs themselves to create training data has proven harder than its early promise suggested. Studies consistently show that naively training models on unfiltered, machine-generated data leads to poor results — degraded quality, inherited biases, and in the extreme, model collapse. See, for instance: AI models collapse when trained on recursively generated data (Nature, 2024), GPT is Not an Annotator: The Necessity of Human Annotation in Fairness (ACL 2024), and Large Language Models for Data Annotation and Synthesis: A Survey (EMNLP 2024). Synthetic data is not a free lunch: without careful curation, verification, and human oversight, it compounds the very quality problems it is meant to solve.
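To make "careful curation" a little more concrete, here is a minimal sketch in Python of one small filtering step: dropping very short samples and exact duplicates (after normalizing whitespace and case) from a batch of machine-generated text. The function names and the `min_words` threshold are hypothetical choices made for this illustration. They are not taken from the studies cited above, and real curation pipelines add verification and human review on top of a step like this.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def curate(samples: list[str], min_words: int = 5) -> list[str]:
    """Keep the first occurrence of each sufficiently long, non-duplicate sample.

    min_words is an arbitrary illustrative threshold, not a recommended value.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for text in samples:
        key = normalize(text)
        if len(key.split()) < min_words:
            continue  # too short to be a useful training sample
        if key in seen:
            continue  # duplicate of a sample we already kept
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    generated = [
        "The cat sat on the mat today.",
        "the cat  sat on the mat today.",  # same text, different case and spacing
        "Too short.",
        "A different and sufficiently long sentence here.",
    ]
    for sample in curate(generated):
        print(sample)
```

A filter like this only removes the most obvious redundancy. It says nothing about factual accuracy or bias, which is why the verification and human oversight mentioned above still matter.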

Domain-Specific LLMs: Also Battling the Data Bottleneck

Training data is not only the cornerstone of general-purpose LLMs; it is also the most critical element in fine-tuning an LLM for domain-specific use cases. It is a necessity for data-driven enterprises as well as generative AI startups that are building their competitive edge on domain-specific LLMs.

Without the right tools and technology, turning a company's domain-specific raw data into training data for an AI model is extremely hard, if not nearly impossible.

With the rise of generative AI and LLMs, the field of AI has undergone a significant change. The tools and technologies used to create training data for traditional AI models are no longer sufficient. LLMs differ from traditional AI models in three ways. First, they are multi-purpose. Second, they are trained with newer techniques such as instruction-tuning and reinforcement learning from human feedback (RLHF). Third, they are based on architectures such as mixture-of-experts and have an emerging agentic aspect that needs addressing. These concepts did not exist in traditional AI and hence are not inherently supported by existing data annotation tools, which were developed and optimized over almost a decade for traditional models.
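To illustrate why these techniques need different data than a traditional labeled dataset, here is a minimal sketch in Python of the two record shapes commonly used for them: an instruction/response example for instruction-tuning and a chosen/rejected preference pair for RLHF-style training. The class names, field names, and sample contents are assumptions made for this illustration. They do not describe any particular tool's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InstructionExample:
    """One instruction-tuning record: a task description and the desired answer."""
    instruction: str
    response: str
    input: str = ""  # optional extra context for the instruction


@dataclass
class PreferencePair:
    """One preference record: two candidate answers to the same prompt, ranked by a human."""
    prompt: str
    chosen: str    # the answer the annotator preferred
    rejected: str  # the answer the annotator ranked lower


def to_jsonl(records) -> str:
    """Serialize dataclass records to JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


if __name__ == "__main__":
    sft_data = [
        InstructionExample(
            instruction="Summarize the clause in one sentence.",
            input="The tenant shall give 60 days' written notice before vacating.",
            response="The tenant must notify the landlord in writing 60 days before leaving.",
        )
    ]
    preference_data = [
        PreferencePair(
            prompt="Explain what a deductible is.",
            chosen="A deductible is the amount you pay before your insurance starts to pay.",
            rejected="It is an insurance thing.",
        )
    ]
    print(to_jsonl(sft_data))
    print(to_jsonl(preference_data))
```

Neither shape is a single class label on a single input, which is the unit most traditional annotation tools were designed around. Free-text responses and pairwise comparisons both require a different kind of annotation workflow.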

A New Era for Training Data: Beyond Traditional Tools

Curating and maintaining training data for LLMs at the scale and quality that today's standards demand requires new technologies and tools. Without them, we may witness a future of overly general, stagnating LLMs with repetitive and irrelevant responses. These tools need to support LLM-specific areas such as instruction-tuning, RLHF fine-tuning, and agentic behaviors, as well as new aspects of data evaluation such as response quality, bias, and safety, which are specific to generative AI and were rarely an issue in traditional machine learning.

We built Calibrion to address this problem: a platform for creating training datasets for LLMs without the limitations of traditional machine learning tools. Unlike data annotation tools that were built for traditional AI, Calibrion natively supports instruction-tuning, RLHF, AI agents, evaluation of generative AI data, and everything in between that is necessary for creating, updating, and fine-tuning an LLM.

Conclusion

The continued development and fine-tuning of LLMs depend on addressing two challenges: the dwindling availability of high-quality training data and the inadequacy of traditional annotation tools. As the AI landscape evolves, so must the tools and methodologies that underpin it. By advancing the techniques, technologies, and tools used to create training sets, we can ensure that these transformative technologies continue to unlock new possibilities for society and industry.