Data Scientists survey results teaser

Earlier this year, several of us from the DC2 community (Harlan Harris, that’s me; Marck Vaisman; and Sean Murphy) conducted a web-based survey of Data Scientists, with the goal of better understanding the variety of people, skills, and experiences that fall under this rather broad buzzword. We have analyzed the results from over 250 respondents, and are excited to share some initial findings here!

The first task in the survey was to rank a set of 21 skill categories. We used non-negative matrix factorization (NMF) to find five underlying dimensions of variation among the rankings. We found that Data Scientists’ skills tend to cluster together, and grouping those skills gives people a useful shorthand. Here are the skill groups, with category names that we think clarify what we as Data Scientists bring to the table:

  • Programming: Back-end Programming, Front-end Programming, Systems Administration
  • Stats: Classical Statistics, Data Manipulation, Science, Spatial Statistics, Surveys and Marketing, Temporal Statistics, Visualization
  • Math: Algorithms, Bayesian/Monte Carlo Statistics, Graphical Models, Math, Optimization, Simulation
  • Business: Business, Product Development
  • Machine Learning/Big Data: Big and Distributed Data, Machine Learning, Structured Data, Unstructured Data
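For readers curious about the mechanics, here is a minimal, dependency-free sketch of NMF using the standard multiplicative update rules. It is not our analysis code: the tiny score matrix, the choice of two factors, and all function names are made up for illustration, and the real survey data and preprocessing are not shown here.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative v (n x m) as w (n x k) times h (k x m),
    using multiplicative updates that reduce squared error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def normalize(w, h):
    """Rescale so each row of h sums to 1, moving the scale into w.
    This makes loadings comparable across factors."""
    sums = [sum(row) for row in h]
    h = [[x / s for x in row] for row, s in zip(h, sums)]
    w = [[x * s for x, s in zip(row, sums)] for row in w]
    return w, h

def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((a - b) ** 2 for rv, ra in zip(v, approx) for a, b in zip(rv, ra))

# Hypothetical data: 6 respondents x 4 skills, higher = ranked as stronger.
v = [
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [5, 5, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [0, 0, 5, 5],
]

w, h = normalize(*nmf(v, k=2))

# Each respondent's primary factor is the one with the strongest loading.
primary = [max(range(2), key=lambda f: row[f]) for row in w]
print(primary)
```

Each row of `h` describes one skill group in terms of the original skills, and each row of `w` gives one respondent's loading on each group. In practice a library implementation such as scikit-learn's `sklearn.decomposition.NMF` would be the more usual choice.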

Clearly, not everyone who is strong in some aspects of these categories will be expert in every area. But, as a general rule, the skills within each group co-occur. Equally important, a Data Scientist with skills in Machine Learning and Big Data may have little expertise in Surveys or Front-End Programming.

We performed a similar NMF analysis on a series of self-evaluation questions near the end of the survey. Respondents gave “Completely Agree” to “Completely Disagree” responses to statements that started with “I think of myself as a(n)…” We view the Self-Identification groups that fell out of the NMF analysis as critical to clarifying the diverse backgrounds and interests of Data Scientists. Here is how the responses to these questions grouped, along with category names that we feel are useful:

  • Data Businessperson: Business person, Leader, Entrepreneur
  • Data Creative: Artist, Jack-of-All-Trades, Hacker
  • Data Researcher: Scientist, Researcher, Statistician
  • Data Engineer: Engineer, Developer

Many people responded positively to many of these self-ID questions, but the analysis shows underlying dimensions of variation that can inform people’s career paths and interests. Even more fascinating, the two groupings we identified, skills and self-ID, correlate in ways that we think are highly valuable to Data Scientists and to organizations that need our skills. The graph below shows how survey participants, labeled by their primary skill group and their primary self-ID group (the group with the strongest factor loading in each case), arrange themselves in a cross-tabulation.
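To make the labeling step concrete, here is a small sketch of how a cross-tabulation like this could be built from factor loadings. The loadings below are invented for four hypothetical respondents, and the function names are ours for illustration only; the group names are the ones introduced above.

```python
from collections import Counter

SKILL_GROUPS = ["Programming", "Stats", "Math", "Business", "ML/Big Data"]
SELF_ID_GROUPS = ["Data Businessperson", "Data Creative",
                  "Data Researcher", "Data Engineer"]

def primary_group(loadings, names):
    """Return the name of the factor with the strongest loading."""
    best = max(range(len(loadings)), key=lambda i: loadings[i])
    return names[best]

def cross_tabulate(skill_loadings, self_id_loadings):
    """Count respondents for each (skill group, self-ID group) pair."""
    counts = Counter()
    for skills, self_id in zip(skill_loadings, self_id_loadings):
        pair = (primary_group(skills, SKILL_GROUPS),
                primary_group(self_id, SELF_ID_GROUPS))
        counts[pair] += 1
    return counts

# Hypothetical loadings: one row per respondent, one column per group.
skill_loadings = [
    [0.9, 0.1, 0.0, 0.2, 0.3],
    [0.1, 0.8, 0.4, 0.0, 0.2],
    [0.7, 0.0, 0.1, 0.1, 0.5],
    [0.0, 0.2, 0.1, 0.9, 0.1],
]
self_id_loadings = [
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.9, 0.2],
    [0.2, 0.3, 0.1, 0.6],
    [0.7, 0.1, 0.0, 0.1],
]

table = cross_tabulate(skill_loadings, self_id_loadings)
for (skill, self_id), count in sorted(table.items()):
    print(f"{skill:12s} x {self_id:20s} {count}")
```

Taking only the strongest loading throws away information about people who load heavily on several groups, so this is a simplification for display rather than a full picture of each respondent.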

As we dive further into these results, we will be stressing the point that our data shows substantial variation in skills and interests among Data Scientists. The field is quite diverse. A Data Creative who can build an amazing JavaScript tool to visualize data from a set of disparate sources may be very different from a Data Businessperson who starts a data-related business, a Data Researcher who uses advanced mathematical tools to bring insight to organizations, or a Data Engineer who integrates enterprise databases with predictive or optimization systems.

We’d love to share more results with you! If you are in the Washington, DC area on August 27th, please come see us talk about the survey results at the Data Science DC Meetup! And if you’ll be attending DataGotham in New York City on September 14th, we’ll be presenting highlights there too! Otherwise, stay tuned for future presentations and publications. If you have any specific questions that we might be able to answer as we further explore the data, please email us!

Harlan (harlan at datacommunitydc.org)
Sean (seanm at datacommunitydc.org)
Marck (marck at datacommunitydc.org)

P.S. If you are one of those Data Creatives or Engineers with JavaScript skills to burn and a bit of free time, we’d love your help putting together a web-based tool related to this project. Please drop us a line.

Harlan Harris has a PhD in Computer Science (Machine Learning) from the University of Illinois at Urbana-Champaign, and did post-doctoral work in Cognitive Psychology at several universities. He is currently Senior Solutions Architect at Sentrana, Inc., and co-organizes Data Science DC.
