How do Data Science Workers Collaborate? Roles, Workflows, and Tools

Published 18 Jan 2020 in cs.HC, cs.AI, cs.LG, cs.SE, and stat.ML | (2001.06684v3)

Abstract: Today, the prominence of data science within organizations has given rise to teams of data science workers collaborating on extracting insights from data, as opposed to individual data scientists working alone. However, we still lack a deep understanding of how data science workers collaborate in practice. In this work, we conducted an online survey with 183 participants who work in various aspects of data science. We focused on their reported interactions with each other (e.g., managers with engineers) and with different tools (e.g., Jupyter Notebook). We found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model). We also found that the collaborative practices workers employ, such as documentation, vary according to the kinds of tools they use. Based on these findings, we discuss design implications for supporting data science team collaborations and future research directions.

Summary

  • The paper reveals that data science collaboration involves interdisciplinary roles beyond data scientists, including engineers, managers, and domain experts.
  • It employs an online survey of 183 participants to map six workflow phases and examine tool usage, notably Jupyter Notebook and GitHub.
  • Findings underscore critical documentation gaps during data cleaning and feature engineering, suggesting a need for innovative AI-driven tool designs.

An Expert Analysis of "How do Data Science Workers Collaborate? Roles, Workflows, and Tools"

The paper "How do Data Science Workers Collaborate? Roles, Workflows, and Tools" by Amy X. Zhang, Michael Muller, and Dakuo Wang addresses the intricacies of collaboration within data science teams. Through an online survey of 183 participants from varied data science backgrounds, the study explores the roles, workflows, and tools that facilitate collaboration in data science projects.

Overview of Findings

The survey responses indicate that data science work is inherently collaborative. Collaboration is not limited to data scientists; it also involves engineers, managers, and domain experts. Each of these roles takes part in different stages of the data science workflow, which suggests that carrying out a data science project well depends on a combined skill set and on interdisciplinary collaboration.

Roles and Collaborative Dynamics

The analysis describes the roles typically found in data science teams and how they collaborate during different project phases. One notable finding is an asymmetry in how roles perceive collaboration: communicators, for example, report high levels of collaboration with managers, but managers do not report the same level of collaboration with communicators. This points to a possible gap in communication flow or in how roles recognize each other's contributions, and resolving such perceptual disparities could make collaboration more efficient.
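To make the idea of a perception asymmetry concrete, the following is a minimal sketch of how such a gap could be detected from survey-style data. Everything in it is an assumption for illustration: the `reported` table, its numbers, the `perception_gaps` helper, and the 0.2 threshold are hypothetical and are not taken from the paper or its analysis method.

```python
# Hypothetical illustration only: these values are invented, not the paper's data.
# reported[a][b] = share of respondents in role a who say they collaborate with role b.
reported = {
    "communicator": {"manager": 0.80, "engineer": 0.50},
    "manager": {"communicator": 0.40, "engineer": 0.70},
    "engineer": {"communicator": 0.45, "manager": 0.75},
}


def perception_gaps(reported, threshold=0.2):
    """Return role pairs whose mutual collaboration reports differ by at least `threshold`.

    Each result is (role_a, role_b, diff), where diff is what role_a reports
    about role_b minus what role_b reports about role_a.
    """
    gaps = []
    roles = sorted(reported)
    for i, a in enumerate(roles):
        for b in roles[i + 1:]:
            diff = reported[a][b] - reported[b][a]
            if abs(diff) >= threshold:
                gaps.append((a, b, round(diff, 2)))
    return gaps


if __name__ == "__main__":
    for a, b, diff in perception_gaps(reported):
        print(f"{a} reports more collaboration with {b} than the reverse (gap {diff})")
```

With the invented numbers above, only the communicator and manager pair exceeds the threshold, mirroring the kind of one-sided perception the analysis describes. The threshold is an arbitrary choice; a real analysis would need a statistically grounded test rather than a fixed cutoff.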

Workflows and Tools

The paper identifies a common data science workflow comprising six phases: problem understanding, data access and cleaning, feature selection, model training, model evaluation, and results communication. It notes a lack of robust documentation practices, particularly during the data cleaning and feature engineering stages. Such gaps risk knowledge loss and mark an area where new tooling could help.

The tool analysis shows a range of preferences, with coding environments such as Jupyter Notebook and collaborative platforms such as GitHub dominating usage. The paper also notes that collaboration tools are underused during key project phases such as data cleaning, which suggests a need for integrated tools that support documentation and knowledge sharing.

Implications and Future Directions

This research offers insight into how data science work is organized collaboratively, with implications for tool design and workflow structuring. Given the interplay between roles and the persistent lack of documentation, future AI tooling could aim to support not only collaboration but also rich, context-aware documentation and the automation of repetitive documentation tasks.

The study also points toward future work in which interdisciplinary collaboration is central to addressing complex problems that single-discipline efforts cannot solve efficiently. Given the increasing importance of explainability and accountability in AI systems, tools that foster transparency and improve documentation across all workflow stages would be valuable.

Conclusion

In sum, the paper outlines the collaborative landscape of data science, covering roles, workflows, and tools. It calls on future research and development to address the identified gaps in documentation and tool integration, with the aim of more efficient and transparent data science workflows. The study deepens our understanding of data science practice and underlines the central role of cross-disciplinary collaboration in advancing it.
