Data science is considered one of the most lucrative jobs of the 21st century. The primary reason is the impact that data science tools, when used effectively, can have on the business. However, as many of you who are actively looking for data science talent will vouch, finding true data science talent is not an easy task. A true data scientist is a strategic analytics professional with cross-disciplinary insight, deep analytical skills, and good strategic acumen. This specialist's task is to determine the ideal formula for leveraging advanced analytics tools and methodologies to make a critical business impact. Considering the unique skill set that data scientists have and the challenges that organizations face in recruiting them, it is imperative that the data science talent you have stays focused on the tasks that need their expertise.
This is where collaboration between data science and other teams comes into play. Due to the nature of their role, data scientists need to collaborate with many different stakeholders and departments. Recently, there has been a lot of focus on the need for data scientists to collaborate with the business functions that will leverage their solutions. While that collaboration is certainly important, one that is often overlooked is the collaboration between data science teams and software engineering.
The fact is that to successfully industrialize data science solutions, data scientists need to work closely with software engineers. At every stage of the project, both data scientists and engineers must feel responsible for the problem and be able to contribute their skills towards solving it collaboratively. There needs to be continuous communication, which allows potential inconsistencies to be identified early on. The purpose of this article is to examine some areas where software developers and data scientists can collaborate.
Collaboration Opportunities
If you think about goals from the perspective of the enterprise strategy, both data science teams and software engineering teams are focused on two common goals: improving products for customers and improving business decisions. Hence, while organizational friction may make it seem like you need to align goals, the fact is that these teams are already working towards common goals. Even when you drill down, the granular tasks and responsibilities open collaboration opportunities as well. Let us explore some of them.
Data Source Identification and Integration
Fragmentation of data is a key initial challenge when data scientists start thinking about a project. They may end up realizing that their data sits across a plethora of point and legacy systems. While the “data science fix” for such a scenario is to create an ad-hoc data lake for the project, this is a great area to foster initial collaboration since the challenges of these fragmented sources are many. Lack of documentation, inconsistent schemas, and multiple interpretations of data labels can make the data difficult to understand as well. The benefits of this early involvement are many:
- Provides an initial, low-risk collaboration opportunity between data scientists and software engineers, where they can learn about each other's culture and work style.
- Exposes the software engineering team to the business problem that data scientists are trying to solve, so that they can keep it in perspective when developing a production version of the tool.
- Gives software engineers an understanding of the key data sources that they will have to be cognizant of when working on the production version.
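As a rough illustration of the inconsistent-schema problem described above, the sketch below maps records from two source systems onto one shared set of field names. The system names (`crm`, `billing`) and all field names are hypothetical and chosen only for the example; a real integration would also have to reconcile types, units, and differing interpretations of labels.

```python
# Hypothetical per-source mappings from local field names to shared names.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "cust_name": "customer_name"},
    "billing": {"CustomerNo": "customer_id", "Name": "customer_name"},
}


def to_canonical(source, record):
    """Rename the fields of one record using the mapping for its source.

    Fields without a mapping entry are passed through unchanged.
    """
    mapping = FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in record.items()}


crm_row = to_canonical("crm", {"cust_id": 7, "cust_name": "Acme"})
billing_row = to_canonical("billing", {"CustomerNo": 7, "Name": "Acme"})
```

Keeping the mappings in one place gives both teams a single artifact to review when a new source is added.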
Managing Data Quality
Data quality is always one of the biggest challenges that data scientists face. While data scientists are encouraged to develop skills to troubleshoot data quality issues, given the kind of skill set a data scientist has, it may not be worth their time to get too deep into that. If your organization has someone with true data science skills, you will want to present them with good-quality data so that they can use their invaluable time prudently and jump directly into creating value for the organization. This is an area where software engineering can start collaborating early on, helping manage data quality issues for data scientists. Data scientists are subject to the Garbage-In-Garbage-Out (GIGO) principle when working with data: if they deal with incorrect data, even the most sophisticated algorithms will produce inaccurate results. Software engineers help data scientists by building pipelines for processing, cleaning, and transforming data, so that data scientists can work with high-quality data. By collaborating closely with engineers, data scientists can create out-of-the-box machine learning algorithms. Engineering must also focus on scalability, data reuse, and ensuring that each project's pipelines are aligned with the global architecture. As in the previous section, collaboration has many benefits for both teams:
- Software engineers get to understand what the typical data quality pain points are. They can keep these in perspective when building a production version.
- Data science teams can focus on building the solution rather than spending the bulk of their time troubleshooting data quality issues.
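A pipeline for processing, cleaning, and transforming data, as described above, can be sketched as a chain of small steps. The example below is a minimal illustration using only the Python standard library; the record layout (`customer`, `amount`) and every function name are assumptions made for the example, not part of any particular tool.

```python
def strip_whitespace(record):
    """Trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def drop_missing_amount(record):
    """Reject records with no amount; returning None removes the record."""
    return record if record.get("amount") not in (None, "") else None


def parse_amount(record):
    """Convert the amount to a float, rejecting records that cannot be parsed."""
    try:
        return {**record, "amount": float(record["amount"])}
    except (TypeError, ValueError):
        return None


def run_pipeline(records, steps):
    """Apply each step in order and keep only records that survive all steps."""
    cleaned = []
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:
                break  # a step rejected this record
        else:
            cleaned.append(record)
    return cleaned


raw = [
    {"customer": " Acme ", "amount": "12.50"},
    {"customer": "Beta", "amount": ""},
    {"customer": "Gamma", "amount": "n/a"},
]
clean = run_pipeline(raw, [strip_whitespace, drop_missing_amount, parse_amount])
```

Because each step is a plain function, engineers can test and reuse steps across projects, while data scientists decide which steps a given project needs.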
Model Testing and Validation
Ultimately, a Machine Learning (ML) or Artificial Intelligence (AI) algorithm is written to improve a business process or address a business issue. Hence, it is critical to evaluate how accurately the model delivers results. In a production environment, there needs to be a standardized way to view and analyze model accuracy, so this is another opportunity for the software engineering team to get involved. The involvement will take the form of building a prototype dashboard in collaboration with data scientists to analyze accuracy metrics. Some of the benefits are:
- Software engineers get an understanding of what the model is about, the business problem it is trying to solve, and which metric is used to evaluate its accuracy.
- They also get an understanding of how the accuracy metric needs to be captured, what the lower and upper thresholds are, and which other specific parameters they need to work with when developing a production solution.
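To make the idea of an accuracy metric with lower and upper thresholds more concrete, here is a minimal sketch of the kind of check a prototype dashboard might run. The threshold values and the meaning given to the upper threshold (a score so high that it deserves a second look) are assumptions for illustration only; the appropriate metric and limits depend on the model and the business problem.

```python
def accuracy(predictions, actuals):
    """Return the share of predictions that match the actual labels."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("predictions and actuals must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def accuracy_status(value, lower=0.80, upper=0.99):
    """Classify an accuracy value against hypothetical lower and upper thresholds."""
    if value < lower:
        return "below_threshold"
    if value > upper:
        return "above_upper_threshold"  # assumed here to warrant a manual review
    return "ok"


score = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
status = accuracy_status(score)
```

Agreeing on a function like this early gives both teams one shared definition of how accuracy is computed and when it should raise a flag.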
What Does This Mean for SAPinsiders?
In addition to the aspects mentioned above, there are some additional ways data science teams and software engineering teams can collaborate to create synergies that will help develop world-class data science tools:
- Build a permanent data hub for data science initiatives. When production data is transferred to data scientists, they can end up with either insufficient or overly broad access to it. They generally create ad-hoc data lakes or repositories to handle this data. However, this is an area where software engineering can help build a live data hub type of setup. This data hub will have a live data connection to all the frequently accessed data points and hence act as a ready-made data lake for all data science initiatives. This will help accelerate the production of the solution in the later stages of the project.
- Standardize and automate data science tasks. Data scientists usually work with one-off scripts that contain, for example, SQL queries or Python code. Copying pieces of an earlier script into a new script for the next job is possible and frequently done. This is where the development team can help build solutions that make the work of data science teams easier and accelerate data science operations tasks. Creating a library of standard data science transformations is an example. A data scientist's work results in algorithms that extract information from raw data. This code, if built into standardized libraries, essentially creates mini AutoML modules that will make the task of any new data scientist easier.
- Create socialization and co-location opportunities. While work is one opportunity to collaborate, true synergies are created through relationships between individuals. Try to create as many opportunities as you can for the two teams to interact outside the normal project work. This can take the form of team outings or of co-locating both teams so that the members interact frequently and develop closer relationships with each other.
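To make the "library of standard data science transformations" idea from the list above more concrete, the sketch below shows one possible shape for such a library: a small registry of named, reusable transformations that replaces copy-and-paste between scripts. The registry, the decorator, and both transformation names are hypothetical and serve only as an illustration.

```python
# Hypothetical shared registry: transformation name -> function.
TRANSFORMS = {}


def register(name):
    """Decorator that adds a transformation to the shared registry."""
    def decorator(func):
        TRANSFORMS[name] = func
        return func
    return decorator


@register("fill_missing_with_mean")
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a mean when every value is missing")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


@register("min_max_scale")
def min_max_scale(values):
    """Rescale values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def apply_transforms(names, values):
    """Apply the named transformations in order."""
    for name in names:
        values = TRANSFORMS[name](values)
    return values


result = apply_transforms(["fill_missing_with_mean", "min_max_scale"], [2.0, None, 4.0])
```

A new data scientist can then list the available names and chain them, instead of searching earlier scripts for the right snippet.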