A couple of years ago, I was contracted to design a complete online data science program for a university. After some early discussions and planning meetings, it became quite clear that the administrators didn’t understand the difference between data scientists and data engineers. I knew that this distinction was important to the success of the program, so I set out to develop two parallel programs that catered to the needs of future students. The programs addressed the specific skillsets required for these two distinct career paths.
When developing the programs, I considered the strengths and weaknesses of each position to focus on the base skills that were required. Here are the core competencies of data scientists and data engineers, along with overlapping areas:
Data scientists – mathematics & statistics, computer science, machine learning plus AI/deep learning, advanced analytics, and data storytelling
Data engineers – production-level programming, distributed systems, data transformation, data analytics, and data pipelines
Overlapping – data analytics and programming
Let’s dive into these areas to better understand the differentiators:
Skills for Data Scientists
Data scientists generally come from an applied mathematics and/or statistics background coupled with computer science. Machine learning is based on the mathematical foundations of statistical learning. Trying to excel in data science without the mathematics will lead to an incomplete perspective.
Data scientists also need to interact with business domain experts in order to cultivate the desired insights, and to analyze data (exploratory data analysis) so the business can make use of its data assets.
Data scientists will also have the background to choose appropriate machine learning algorithms, train them, and devise methods for testing their accuracy.
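As a rough illustration of the train-and-test workflow described above, here is a minimal sketch in plain Python. The toy data, the nearest-centroid classifier, and all function names are assumptions chosen for illustration, not something prescribed by the article; in practice a data scientist would more likely reach for an established library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Train a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, test):
    """Fraction of held-out examples that are classified correctly."""
    correct = sum(predict(centroids, features) == label for features, label in test)
    return correct / len(test)


# Hypothetical toy data: two well-separated clusters of 2-D points.
data = [([x, y], "low") for x in range(0, 3) for y in range(0, 3)]
data += [([x, y], "high") for x in range(10, 13) for y in range(10, 13)]

train, test = train_test_split(data)
model = fit_centroids(train)
score = accuracy(model, test)
print(f"held-out accuracy: {score:.2f}")
```

The key design point is that accuracy is measured only on rows the model never saw during training, which is what makes the number a fair estimate of how the model would behave on new data.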
Additionally, data scientists must be well-versed in the art of data storytelling when the results of a data science project need to be conveyed to the business stakeholders in an understandable fashion. This effort requires the ability to verbally and visually communicate complex results and observations in a way that the stakeholder can understand and act on them.
Data scientists will also have developed coding skills out of necessity, most settling on either the R or Python language environments. The programming skills of a data scientist aren’t typically at the level that you’d see for a data engineer – nor should they be!
Skills for Data Engineers
Data engineers come from a programming background, possibly as a result of a computer science degree. Their background is generally in languages like Python, Java, or Scala. Their emphasis is on distributed systems and big data. Compared to data scientists, their programming skills are more advanced and specifically suited for building high-availability production systems.
Using these programming skills, data engineers create data pipelines at scale. This involves integrating a number of big data technologies. Data engineers are tasked with deciding which tools are right for the job. Data engineers also have an in-depth understanding of data technologies and frameworks and how to integrate them with data pipelines. Further, data engineers work closely with personnel responsible for clusters, DevOps and DataOps.
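To make the idea of a data pipeline concrete, here is a deliberately tiny extract-transform-load sketch in plain Python. Everything in it is hypothetical: the CSV content, the field names, and the in-memory dictionary standing in for a data store. A production pipeline would run on distributed tooling and add logging, retries, and monitoring, but the staged structure is similar.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read from files, queues, or databases.
RAW_CSV = """user_id,amount
1,10.50
2,not_a_number
1,4.50
3,7.00
"""


def extract(text):
    """Yield raw records one at a time from a CSV source."""
    yield from csv.DictReader(io.StringIO(text))


def transform(records):
    """Cast fields to proper types and skip records that fail validation."""
    for record in records:
        try:
            yield {
                "user_id": int(record["user_id"]),
                "amount": float(record["amount"]),
            }
        except ValueError:
            # A production pipeline would log or quarantine the bad row instead.
            continue


def load(records, sink):
    """Aggregate the amount per user into the sink (a dict standing in for a data store)."""
    for record in records:
        user_id = record["user_id"]
        sink[user_id] = sink.get(user_id, 0.0) + record["amount"]
    return sink


# Chain the stages; generators keep memory use flat because rows stream through.
totals = load(transform(extract(RAW_CSV)), {})
print(totals)
```

Keeping each stage as a separate function is what lets an engineer swap in a different source, validation rule, or destination without touching the rest of the pipeline.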
Data engineers also implement machine learning algorithms chosen by data scientists for a production environment. For example, this may involve deploying a classification algorithm used by the data scientist in R to a more robust production platform.
Overlapping Skills
Certainly, there are overlapping skills with respect to programming, although a data engineer’s programming skills often outweigh those of a data scientist. For example, having a data scientist program a production data pipeline may be an overreach, whereas this kind of task is directly in the wheelhouse of a data engineer. Here, the skills are complementary since the data scientist may design the data pipeline and the data engineer will program and maintain it. A data scientist should generally not be expected to program data pipelines.
Another area of overlap is data analytics. The data scientist’s analytics skills are usually much more developed than those of a data engineer. Data engineers may be able to do some basic analytics but would not be able to address the more advanced analytics needs that a data scientist would handle easily.
Misalignments in the Enterprise
Many enterprises make mistakes with respect to aligning the above skillsets with the actual job title. First and foremost, don’t go down the rabbit hole of trying to find one person, known as a unicorn, who can do the job of both data scientist and data engineer. Sure, there may be a few unicorns out there, but they’re in very high demand and command a very high salary. Plus, what happens if you hire a unicorn and he or she decides to leave?
Another mistake is having data scientists do the work of a data engineer. Creating a data pipeline is not easy and it requires advanced knowledge of production programming frameworks. A data scientist may be able to acquire these skills, but this is not the most efficient use of this resource. Data scientists are not engineers who build production systems, create data pipelines, and expose machine learning results.
On the flip side, it is a mistake to have data engineers do the work of a data scientist, although this is far less common. Some data engineers work to widen their skills by improving their mathematics and statistics knowledge, and correspondingly their machine learning skills. This career path sometimes results in yet another job category, the “machine learning engineer.”
Machine learning engineers typically come from data engineering backgrounds, but they’ve become proficient at certain aspects of data science and sit on the fence between data science and data engineering. This category really isn’t a unicorn, but rather a data engineer who understands how to operationalize and optimize machine learning. Machine learning engineers take what a data scientist creates and make it production ready.
Conclusion
In summary, it is important to realize how data scientists and data engineers complement one another. Talented data science teams consist of both skillsets. It is a waste of good resources to have a data scientist doing the job of a data engineer and vice versa. It is highly improbable that you will be able to find a unicorn – one person who is both a skilled data engineer and an expert data scientist. Therefore, you will need to build a team whose members complement one another’s skills and work well together.