Developer Nation Community

Machine learning developers and their data

July 28, 2021

Machine learning developers and their data

The data science (DS), machine learning (ML), and artificial intelligence (AI) field is adapting and expanding. From the ubiquity of data science in driving business insights, to AI’s facial recognition and autonomous vehicles, data is fast becoming the currency of this century. This post will help you to learn more about this data and the profile of the developers who work with it.

The findings shared in this post are based on our Developer Economics 20th edition survey, which ran from December 2020 to February 2021 and reached 19,000 developers.

Before you dive into the data, our new global developer survey is live now. We have a set of questions for machine learning and AI developers. Check it out and take part for a chance to have your say about the most important development trends and win prizes.

It takes all types

The different types of ML/AI/DS data and their applications

We ask developers in ML, AI, and DS what types of data they work with. We distinguish between unstructured data — images, video, text, and audio — and structured tabular data. The latter group includes tabular data that they may simulate themselves.

With 68% of ML/AI/DS developers using unstructured text data, it is the most common type of data these developers work with; however, developers frequently work with multiple types of data. Audio is the most frequently combined data type: 75-76% of those that use audio data also use images, video, or text.

“Unstructured text is the most popular data type, even more popular than tabular data”

Given the most popular applications of audio data are text-to-speech generation (47%) and speech recognition (46%), the overlaps with video and text data are clear. Image data, like audio, overlaps heavily with video data: 78% of those using video data also use image data. The reverse dependence isn’t as strong: only 52% of those using image data are also video data users. The top two applications of both these data types are the same: image classification and facial recognition. These are two key application fields driving the next generation of intelligent devices: improving augmented reality in games and underpinning self-driving cars, in home robotics, home security surveillance, and medical imaging technology.

The types of data ML/AI/DS developers work with

69% of ML/AI/DS developers using tabular data also use unstructured text data

With 59% usage, tabular data is the second most popular type of data. 92% of the tabular data ML/DS/AI developers use is observed, while the other 8% is simulated. The two most common use-cases for this data is workforce planning — 39% of developers who use simulation do this — and resource allocation, also at 39%.

Structured tabular data is least likely to be combined with other types of data. Although uncommon to combine this type of data with audio or video data, 69% do combine tabular data with unstructured text data. The top application of both tabular data and unstructured text is the analysis and prediction of customer behaviour. This is the sort of analysis often done on the data nuggets we leave behind when searching on retail websites — these are key inputs to algorithms for natural language and recommender systems.

Keeping it strictly professional?

The professional status of ML/AI/DS developers

The professional / hobbyist / student mix in the ML/AI/DS ecosystem

ML/AI/DS developers engage in their fields at different levels. Some are professionals, others students or hobbyists, and some are a combination of the above. The majority (53%) of all ML/DS/AI developers are professionals — although they might not be so exclusively.

Of all the data types, audio data has the highest proportion of professional ML/DS/AI developers. 64% of ML/AI/DS developers who use this type of data classified themselves as a professional; and the majority (50%) of these professionals are applying audio data to text-to-speech generation. The high proportion of professionals in this field might be a byproduct of co-influencing variables: audio data is the data type most frequently combined with other types, and professionals are more likely to engage with many different types of data.

Data types popular with students include image, tabular, and text data. Between 18-19% of developers who work with these types of data are students. There are many well-known datasets of these types of data freely available. With this data in hand, students also favour certain research areas.

Image classification, for example, is popular with developers who are exclusively students: 72% of those students who use image data use it for this application, in contrast to just 68% of exclusive professionals that do. In applying unstructured text data, 38% of exclusive students are working in Natural Language Processing (NLP), while 32% of exclusive professionals are. As these students mature to professionals, they will enter industry with these specialised skills and we expect to see an increase in the practical applications of these fields, e.g. chatbots for NLP.

“65% of students and 54% of professionals rely on one or two types of data”

Besides differences in application areas, students, hobbyists, and professionals engage with varying types of data. 65% of those who are exclusively students use one or two types of data, while 61% of exclusively hobbyists and only 54% of exclusively professionals use one or two types. Developers who are exclusively professionals are the most likely to be relying on many different types of data: 23% use four or five types of data. In contrast, 19% of exclusively hobbyists and 15% of exclusively students use four to five types. Level of experience, application, and availability of datasets all play a role in which types of data an ML/AI/DS developer uses. The size of these datasets is the topic of the next section.

Is all data ‘big’?

The size of ML/AI/DS developers’ structured and unstructured training data

The hype around big data has left many with the impression that all developers in ML/AI/DS work with extremely large datasets. We asked ML/AI/DS developers how large their structured and unstructured training datasets are. The size of structured tabular data is measured in rows, while the size of unstructured data — video, audio, text — is measured in disc size. Our research shows that very large datasets aren’t perhaps as ubiquitous as one might expect.

“14% of ML/AI/DS developers use structured training datasets with less than 1,000 rows of data, while the same proportion of developers use data with more than 500,000 rows”

The most common size band is 1K – 20K rows of data, with 25% of ML/AI/DS developers using structured training datasets of this size. This differs by application type. For example, 22% of those working in simulation typically work with 20K – 50K rows of data; while 21% of those working with optimisation tools work with 50K – 100K rows of data.

Dataset size also varies by professional status. Only 11% of exclusively professional developers use structured training datasets with up to 20K rows, while 43% of exclusively hobbyists and 54% of exclusively students use these small datasets. This may have to do with access to these datasets — many companies generate large quantities of data as a byproduct or direct result of their business processes, while students and hobbyists have access to smaller, open-source datasets or those collected via their learning institutions.

A further consideration is, who has access to the infrastructure capable of processing large datasets? For example, those who are exclusively students might not be able to afford the hardware to process large volumes of data.

Non-tabular data is a useful measure for comparisons within categories: for example, 18% of image datasets are between 50MB-500MB, while only 8% are more than 1TB in size. The measure doesn’t, however, allow for cross-type comparisons, since different types of data take up different amounts of space. For example, 50MB of video data takes up a considerably shorter length of time than 50MB of audio data.

The size of unstructured training data ML/AI/DS developers work with

The categorisation of the different data sizes was designed to take into account the steps in required processing power. For most ML/AI/DS developers, we expect that a 1-25GB dataset could be handled with powerful, but not specialised, hardware. Depending on the language and modelling method used, 25GB on disc relates to the approximate upper bound in memory size that this type of hardware could support.

We see that 26% of ML/AI/DS developers using text data and 41% using video data will require specialised hardware to manage their training. The high level of specialized hardware manifests as a barrier-to-entry: data analysis on these large datasets is beyond an achievable scope without the backing of deep pockets supplying cloud-based technology support or infrastructure purchases.

Want to know more? This blog post explores where ML developers run their app or project’s code, and how it differs based on how they are involved in machine learning/AI, what they’re using it for, as well as which algorithms and frameworks they’re using.

ai data science machine learning