Kaggle Survey Of Data Scientists
Written by Janet Swift   
Monday, 08 February 2021

Kaggle's survey of data scientists reveals that the vast majority of them are under 35 years of age, two-thirds have a graduate degree, and most have less than 10 years of coding experience. It also finds that Scikit-learn is the most popular machine learning framework and JupyterLab the preferred IDE.

Kaggle bills itself as the world's largest data science and machine learning community. Now owned by Google, it was founded in 2010 as a platform for predictive modelling and analytics competitions and over time has morphed into a platform for learning about and engaging in machine learning and data analytics. Kaggle passed the milestone of more than 5 million members in July 2020, and in October over 20,000 members of the community participated in a survey. Kaggle's State of Data Science and Machine Learning 2020 focuses on the 13% of respondents, a total of 2,675, who identified their job title as "data scientist."

With regard to demographics, Kaggle's findings reflect the prevailing gender gap in computer science, with 82% of this group identifying as men and only 16% as women. In terms of age, the vast majority of data scientists are under 35, with the report commenting:

There are signs of the numbers skewing even younger, as generation Z gets more involved. Nearly 7% of data scientists are aged 18-21, an increase from last year’s 5%.

Fewer than 5% of data scientists have no degree beyond a high school diploma, while over 68% have either a Master’s or doctoral degree. Moreover, 93% of them continue learning: 30% do so in "traditional" university courses, but many more use online options, with Coursera leading the field at 63% of respondents using it as an ongoing resource. Many respondents chose multiple resources in the survey, with an average of 2.8 selected.


When it comes to programming experience, the report states:

Most Kaggle data scientists have at least a few years of experience under their belt. Just over 8% of data scientists have been programming since the 20th century! That’s not to say there aren’t newcomers, however. Over 9% have taken up programming in the last year. Just under 2% of data scientists claim to have never written code at all. Compared to the global audience, United States data scientists have significantly greater programming experience. In the US, 37% have been programming 10 or more years, versus 22% worldwide.

On the other hand, most Kaggle data scientists are newer to machine learning than to programming. Slightly more than 55% of data scientists have less than three years of experience, and fewer than 6% of professional data scientists have been using machine learning for a decade or more. As with programming, US data scientists have more machine learning experience than the global respondents.

The survey also looked into the methods and tools favored by respondents, finding that the most commonly used algorithms were linear and logistic regression (84%), followed closely by decision trees and random forests (78%). Of more complex methods, gradient boosting machines (61%) and convolutional neural networks (43%) were the most popular approaches. Generative Adversarial Networks (GANs) were used by only 7%.

Python-based tools dominate the machine learning frameworks. Scikit-learn, described in the report as a Swiss army knife applicable to most projects, was the most popular, with 83% of data scientists using it. TensorFlow and Keras, notably used in combination for deep learning, were each selected by 50%. The gradient boosting library XGBoost came fourth (48%) and PyTorch was in fifth place at 31%.


There was a clear leader among development environments: JupyterLab (74%). However, the report pointed out that this was a notable decrease from 83% in 2019. Visual Studio Code took second place with just over 33%, with the report noting:

This is the first year it has been separated out from Visual Studio. The two combined for over 43% this year, versus under 30% in 2019. 

Kaggle also reported that more data scientists are using the cloud overall. In 2019, about 25% had not adopted cloud computing, a figure that fell to 17% in 2020. Unsurprisingly, Amazon Web Services was the preferred platform (48%), followed by Google Cloud Platform (35%) and Microsoft Azure (29%). Regarding databases, there isn't a clear favorite among data scientists. MySQL was mentioned most often (35.6%), followed by PostgreSQL (28.86%) and SQL Server (24.93%).


More Information

State of Data Science and Machine Learning 2020

Related Articles

Kaggle Enveloped By Google Cloud

What is a Data Scientist and How Do I Become One?

What Skills Do Data Scientists Need

Data Scientist Best Paying Entry-Level Job Says Glassdoor





Last Updated ( Monday, 08 February 2021 )