June 25, 2025

How Today’s Developers Are Using Web Data to Train AI Models
by Erika Balla in Community

Even though we’re only two or so years into AI’s mainstream adoption, today we’re seeing something of an arms race in the enterprise world, with many companies rushing to develop the best AI model for the needs of their users. 

For developers, this means building, training, and fine-tuning AI models so that they meet their company’s business objectives. As well as requiring a lot of time, AI model development demands large amounts of training data, and developers prefer to acquire it from the open web. 

Data for AI 2025, a new report from Bright Data, found that 65% of organizations use public web content as their primary source for AI training data, and 38% of companies already consume over 1 petabyte of public web data each year. Apparently, developers are seeing the advantages of using dynamic, real-time data streams, which are continuously updated and customized. 

What’s more, demand for public web data is growing rapidly. According to the Bright Data survey, information needs are expected to grow by 33% and budgets for data acquisition to increase by 85% in the next year. The report maps the growing importance of web data in AI engineering workflows, and how developers are drawing on it to maximize model reliability. 

Improving Model Accuracy

As organizations increasingly rely on AI insights for both operational and strategic decision-making, accuracy is crucial. AI models play important roles in tasks such as assessing applicants for insurance or managing quality control in manufacturing, which don’t allow much margin for error. AI-driven market intelligence also requires accurate models fed the most recent information, and is one of the top use cases cited by participants in the survey. 

Training models to recognize patterns, apply rules to previously unseen examples, and avoid overfitting demands vast amounts of data, and that data needs to be fresh to be relevant to real-world use cases. Most traditional data sources are outdated, limited in size, or insufficiently diverse, but web datasets are enormous and constantly updated.
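One common way to check whether a model generalizes rather than memorizes is to compare its accuracy on training data with its accuracy on held-out data. The sketch below is a toy illustration, not something taken from the report: the memorizing "model", the even/odd dataset, and the function names are all made up for the example.

```python
from collections import Counter


def train_memorizer(examples):
    """Return a predict function that simply memorizes (x, label) pairs."""
    table = dict(examples)
    # For inputs never seen in training, fall back to the most common label.
    fallback = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


# Toy data (illustrative only): label each integer as even or odd.
data = [(x, "even" if x % 2 == 0 else "odd") for x in range(20)]
train, held_out = data[:10], data[10:]

predict = train_memorizer(train)
train_acc = accuracy(predict, train)
held_out_acc = accuracy(predict, held_out)
gap = train_acc - held_out_acc

print(f"train={train_acc:.2f} held_out={held_out_acc:.2f} gap={gap:.2f}")
```

A large gap between the two scores is the usual warning sign of overfitting: the model has learned its training examples but not the rule behind them.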

When asked about the main benefits of public web data, 57% said improving AI model accuracy and relevance. Over two-thirds of respondents use public web data as their primary source for real-time, connected data.

Optimizing Model Performance

Enterprises seeking the best AI model are looking not only for accuracy but also for model performance, which includes speed, efficiency, and lean use of resources. Developers are well aware that performance optimization relies at least as much on data as on model improvements, with 92% agreeing that real-time, dynamic data is critical to maximizing AI model performance.

When asked about the source of their competitive edge in AI, 53% said advances in AI model development and optimization, and the same share pointed to higher quality data. Reliable, fresh, dynamic data equips models to make better, faster predictions without increased compute resources.

Finding that data can be challenging, which is why 71% of respondents say data quality will be the top competitive differentiator in AI over the next two years. Live web data is the only way for developers to get hold of quality data in the quantities they need.

Enabling Real-Time Decision-Making

Developers are under rising pressure to produce models that deliver real-time outcomes, whether for decision-making such as medical diagnoses, predictions such as evaluating loan applications, or reasoning as part of an agentic AI system.

Producing real-time responses while preserving accuracy requires feeding AI models a constant diet of context-rich data that’s as close to real time as possible. 

Only public web data can deliver quality data at this kind of speed, which may explain why 96% of organizations indicated that they collect real-time web data for inference.
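In practice, keeping inference inputs close to real time often comes down to discarding records that are older than some freshness window. Here is a minimal sketch under stated assumptions: the `fetched_at` field, the six-hour window, and the `filter_fresh` helper are hypothetical names chosen for illustration and do not come from the report.

```python
from datetime import datetime, timedelta, timezone


def filter_fresh(records, max_age, now=None):
    """Keep records whose 'fetched_at' timestamp is within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["fetched_at"] <= max_age]


# Fixed reference time so the example is reproducible.
now = datetime(2025, 6, 25, 12, 0, tzinfo=timezone.utc)

# Hypothetical scraped records with collection timestamps.
records = [
    {"id": "a", "fetched_at": now - timedelta(minutes=5)},
    {"id": "b", "fetched_at": now - timedelta(hours=3)},
    {"id": "c", "fetched_at": now - timedelta(days=2)},
]

fresh = filter_fresh(records, max_age=timedelta(hours=6), now=now)
fresh_ids = [r["id"] for r in fresh]
print(fresh_ids)
```

The right window depends on the use case: market intelligence may tolerate hours of staleness, while an agentic system reacting to live events may not.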

Scaling Up AI Capabilities

As organizations grow, they have to scale up AI capabilities to efficiently handle growing numbers of users, tasks, and datasets. 

Scalability is vital for consistent performance, cost-effectiveness, and business growth, but scaling up models to handle more queries, more quickly, requires more diverse, relevant data. 

Without scalable data sources, AI models can’t adapt to the rising demands placed upon them. Only web data is an immediately scalable source of flexible, fresh, and instantly available information. The report found that 52% of participants see scaling AI capabilities as one of the main benefits of public web data. 

Acquiring Diverse Data

It’s not enough for training data to be plentiful and up-to-date; it also needs to be diverse. When AI models are fed diverse data, they make more accurate predictions and fewer mistakes, which leads to more trustworthy AI systems.

Web data encompasses many types of media, including text, video, and audio. Some 92% of organizations turn to vendor partnerships to improve data variety, and their desire for data is wide-ranging.

While 80% of all businesses collect textual training sets, 73.6% also gather images, 65% video, and 60% audio. Compared to enterprises and small businesses, startups consume the greatest range of data types, with more than 70% saying they collect image, video, audio, and text.
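A simple first check on dataset diversity is to measure how a collection is split across media types. The sketch below assumes each record carries a `modality` field; that field name, the sample data, and the `modality_shares` helper are hypothetical and used only for illustration.

```python
from collections import Counter


def modality_shares(records):
    """Return the fraction of records belonging to each modality."""
    counts = Counter(r["modality"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {modality: n / total for modality, n in counts.items()}


# Hypothetical sample: 6 text, 2 image, 1 video, 1 audio record.
sample = (
    [{"modality": "text"}] * 6
    + [{"modality": "image"}] * 2
    + [{"modality": "video"}]
    + [{"modality": "audio"}]
)

shares = modality_shares(sample)
dominant = max(shares, key=shares.get)
print(shares, dominant)
```

A heavily skewed split, such as a collection that is almost entirely text, suggests where additional collection effort might go.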

Advancing Personalization and Automation

Personalization tailors AI outputs to individual user needs, which is especially important for customer-facing digital products that incorporate AI. 

Bringing in automation makes the models more efficient, enabling them to adapt to diverse users and contexts without manual corrections. These twin goals were cited as a main benefit of public web data by 49% of survey participants.

Web data empowers developers to ramp up both personalization and automation by connecting them with the diverse real-world information that they need. Updated, relevant data about user behavior, trends, and preferences allows AI models to make smarter, self-improving responses that are relevant to each use case, with minimal manual input. 

Public Web Data Is AI Developers’ New Must-Have

As developers work hard to produce AI models that meet rapidly evolving business needs, public web data has become indispensable. Bright Data’s survey underlines that web data has become their best source of real-time, reliable, relevant, and diverse data, giving developers the training sets they need for fine-tuning, scaling, and generally preparing models for any requirement. 

SlashData © Copyright 2025 | All rights reserved