Why would anyone want to be a data engineer? Is it because you'll earn an average of $117,000 a year, and sometimes as much as $160,000? Is it because you’re excited to build data systems that boost your company’s annual profits by tens (or hundreds) of millions of dollars?
Regardless of the reasons you’re interested in data engineering, one thing is clear: This job isn’t going anywhere. According to Dice, businesses are hungrier than ever to hire data engineers in 2019. In fact, between mid-2018 and mid-2019, demand for data engineers grew by 88.3% and this growth is just the beginning.
With the excellent pay scale and job demand, the decision to become a data engineer is a practical one (i.e., mom and dad approve!). Nevertheless, you might want to know the following before committing to the career:
- What Are the Responsibilities of Data Engineers?
- What's the Difference Between Data Engineers and Data Scientists?
- What Skills Should a Data Engineer Have?
- How Do I Learn to Be a Data Engineer?
What Are the Responsibilities of Data Engineers?
Data engineers set up and maintain the data infrastructures that support business information systems and applications. They might work with something small, like a relational database for a mom-and-pop business—or something big, like a petabyte-scale data lake for a Fortune 500 company.
As a part of their responsibilities, data engineers design, build, and implement the data systems that fuel machine learning and AI analytics. They also develop information processes for data acquisition, data transformation, data migration, data verification, data modeling, and data mining.
Whether it's a one-person show or a larger team, the field of data engineering includes the following positions:
- The data architect: Data architects create, build and design data management systems across an entire organization—or specific parts of an organization. Their work allows data systems to ingest, integrate, and manage all of the required sources of data for business insights and reporting. The work of a data architect may require in-depth knowledge of SQL, NoSQL, XML, Pig, Spark, Hive, and other systems and tools.
- The database administrator: Database administrators help design and maintain database systems so they function seamlessly for all users in an organization. They optimize databases to function faster, ensure that updates don't interfere with workflow, and secure the sensitive information in databases.
- The data engineer: Data engineers understand a wide variety of programming languages used in data science, such as Java, Python, and R. They know the ins and outs of SQL and NoSQL database systems. They also know how to use distributed systems such as Hadoop. This allows them to work with data architects, database administrators, and data scientists (and sometimes to perform all of these roles) to build a robust, integrated data infrastructure for an organization.
What's the Difference Between Data Engineers and Data Scientists?
To understand the difference between data engineers and data scientists, first, you need to know what data scientists do.
What Data Scientists Do:
Data scientists carry out many responsibilities in modern business—like helping Facebook show you targeted ads, teaching robotic vehicles to drive themselves, and helping Netflix recommend the perfect movies. Their work gives companies tremendous competitive advantages. For example, Netflix is saving $1 billion a year due to better customer retention through data analytics.
Data scientists specialize in statistical modeling and machine learning technology. They develop graphical displays, graphs, dashboards, and other methods to share vital business intelligence with decision-makers in their organizations. Nevertheless, as important as this work is, every data scientist knows that they couldn't do their jobs without access to quality data. This is where data engineers come into the picture.
How Data Engineers Support Data Scientists:
Data engineers create data pipelines that connect data from one system to another, and they transform data from one format to another. That way, different data systems can communicate so the data scientist can pull the analytical insights they need. Even though data engineers aren't as visible as data scientists, they're just as important (if not more so).
An Illustration of the Difference:
If data scientists are train conductors, data engineers are the architects/builders of the railways that get the trains from A to B. Let's say the train conductor wants to deliver a payload somewhere that doesn't have an established railway. The conductor needs the railway architects/builders to connect the train to the new destination. The railway architects will study the terrain. They'll decide if it's better to go around, over or tunnel through mountains. They'll build bridges over rivers. They'll use all the tools available to build a railway that connects the train to the new destination.
Ultimately, data scientists interact with data by writing queries, creating dashboards, and developing machine-learning strategies. They also work directly with decision-makers to understand their information needs, and develop strategies for meeting them. Data engineers build and maintain the data infrastructures that connect an organization’s data ecosystems to make the data scientist's work possible.
What Skills Should a Data Engineer Have?
Data engineers need to acquire a variety of skills related to programming languages, databases, operating systems, and data-related tasks. As a newbie, you'll find the learning process overwhelming, guaranteed. Keep the following in mind: as a data engineer, you'll never feel like you know everything, but you will know "enough." More importantly, you'll know how to find information and acquire new skills when needed.
Ultimately, the acquisition of skills and knowledge is a career-long process. Yes, you'll need to be an expert in certain topics and programming languages (as your job requires). But you also need to be an expert at looking up information. For example, you might need a SQL statement to perform a specific action, so you look it up on SQLZoo. Or, you might need to brush up on MapReduce before analyzing a large data set with a parallel, distributed algorithm on a cluster.
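If MapReduce is one of the topics you end up looking up, it helps to see the idea at toy scale first. The following is a minimal, single-process sketch of the map, shuffle, and reduce steps using a word count. It is not Hadoop code, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster, the framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run the three steps in order over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data big pipelines", "data lakes"])
print(counts)
```

The point of the pattern is that each map call only needs one document and each reduce call only needs one key, which is what makes the work easy to spread across a cluster.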
With that being said, here are 11 knowledge areas you'll develop during the course of your career as a data engineer:
(1) Programming Languages Used in Data Science:
Data engineers need expertise in the following programming languages (at a minimum):
- SQL: To set up, query, and manage database systems. SQL is not a "data engineering" language per se, but data engineers will need to work with SQL databases on a regular and continual basis.
- Python: To create data pipelines, write ETL scripts that extract, transform, and load data from one system to another, and to set up statistical models and perform analysis. Like R, this is an important language for data science and data engineering—and it's particularly important for ETL, data analysis, and machine learning applications.
- R: To analyze data, and set up statistical models, dashboards, and visual displays. Like Python, this is an important language for data science and data engineering—and it's especially useful for data analysis and machine learning applications.
Knowledge of these languages allows data engineers to troubleshoot, update, tweak, and improve the databases, business insights tools, and machine-learning systems they’re working with. Data engineers could also benefit from being familiar with Java, Julia, Scala, and MATLAB, along with NoSQL databases and the TensorFlow library.
(2) Relational and Non-Relational Database Systems:
Data engineers need to know how to work with a wide variety of data platforms, especially SQL-based relational database management systems (RDBMSs) like MySQL, PostgreSQL (a relational database that also supports NoSQL-style JSON documents), Microsoft SQL Server, and Oracle Database. For example, they should feel comfortable using SQL to build and set up database systems, and to store, maintain, and query data on these systems. Data engineers should also develop skills working with NoSQL databases such as MongoDB, Cassandra, Couchbase, Oracle NoSQL Database, and others.
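To make "using SQL to build, store, and query" a little more concrete, here is a small sketch that uses SQLite through Python's built-in `sqlite3` module. SQLite stands in for the larger systems named above purely because it needs no server. The `customers` table and its rows are invented for the example, and SQL details (data types, auto-increment syntax) differ between database products.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Build: define a table with a primary key and typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

# Store: insert rows with parameter placeholders instead of string formatting.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)
conn.commit()

# Query: count customers per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)
```

The same create, insert, and select pattern carries over to MySQL, PostgreSQL, and the other systems, although each one needs its own driver and connection settings.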
(3) ETL Solutions:
Data engineers should be comfortable using ETL (extract, transform, load) systems, like Xplenty, that assist with extracting, transforming, and loading of data into data warehouses. They should also understand how to use ETL solutions to assist with the transformation and migration of data from one storage system or application to another.
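For a feel of what an ETL tool automates, the three steps can be sketched by hand in a few lines of Python. This is a toy illustration and not how Xplenty or any other product works internally. The CSV content, the `orders` table, and the cleaning rules are all assumptions, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: inconsistent spacing, casing, and a missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,
"""


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, cast types, and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping extract, transform, and load as separate functions makes each step easy to test and swap out, which is the same separation that dedicated ETL platforms provide at a much larger scale.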
(4) Data Warehouses:
After extracting information from various business systems, data engineers may need to prepare the information for integrating it with a data warehouse system—if they want to query it for deep insights and other information. This could involve transforming the data with an ETL (extract, transform, load) tool like Xplenty.
Since cloud-based data warehouses form the backbone of most advanced business intelligence and corporate data systems, it’s important that data engineers understand how to set up a cloud-based data warehouse, connect a wide variety of data types to it, and optimize those connections for speed and efficiency.
(5) Data Lakes:
Data warehouses can only work with structured information—such as information in a relational database where the data is organized into clearly-identified columns and rows. Meanwhile, data lakes can work with any type of data—including unstructured information—and BI solutions can hook up to them to derive valuable insights. For this reason, many companies are incorporating data lakes into their information infrastructures.
If you’d like to apply machine learning algorithms to unstructured data—like emails, social media, or app data—you’ll need to understand how to integrate unstructured data in a data lake and connect this information to a business intelligence platform.
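One way to picture the difference is that a data lake keeps records in their raw form and applies structure only when the data is read. The sketch below assumes a handful of raw events stored as JSON strings (strictly speaking, semi-structured rather than fully unstructured) and a hypothetical `read_with_schema` helper that maps each record onto two common fields. The field names are invented for the example.

```python
import json

# Raw records as they might land in a lake: each source has its own shape.
RAW_EVENTS = [
    '{"type": "email", "subject": "Invoice overdue", "body": "Please pay"}',
    '{"type": "tweet", "text": "Loving the new app!"}',
    '{"type": "app_log", "message": "login ok", "user_id": 42}',
]


def read_with_schema(raw_lines):
    """Apply a minimal (type, text) schema at read time, not at write time."""
    for line in raw_lines:
        record = json.loads(line)
        # Each source keeps its main text under a different key.
        text = record.get("body") or record.get("text") or record.get("message", "")
        yield {"type": record["type"], "text": text}


structured = list(read_with_schema(RAW_EVENTS))
print(structured)
```

Because the raw records are never altered, a different reader could later pull out other fields (such as `user_id`) without reloading anything.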
(6) Data Pipelines:
Data engineers develop the essential data pathways that connect various information systems together. Therefore, it’s vital that data engineers understand data pipelines and how they help different parts of an information network communicate with each other. For example, they should be able to work with REST, SOAP, FTP, HTTP, and ODBC, and understand additional strategies for connecting one information system or application to another as efficiently as possible.
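As one illustration of connecting to another system, here is a sketch of a connector that pulls every record from a paginated REST-style endpoint. Everything specific in it is an assumption: the URL, the `page` and `page_size` query parameters, and the `results` key in the response are invented, and real APIs paginate in many different ways. The HTTP call is passed in as a `fetch` function so the sketch runs without a network. A stand-in called `fake_fetch` plays the part of the remote service.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_url(base_url, page, page_size):
    """Build the URL for one page of results (parameter names are assumptions)."""
    return f"{base_url}?{urlencode({'page': page, 'page_size': page_size})}"


def fetch_all(base_url, fetch, page_size=100):
    """Request page after page until the service returns an empty batch."""
    records = []
    page = 1
    while True:
        body = fetch(build_url(base_url, page, page_size))
        batch = json.loads(body)["results"]
        if not batch:
            return records
        records.extend(batch)
        page += 1


# Stand-in for the remote service: two pages of data, then nothing.
FAKE_PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}


def fake_fetch(url):
    """Return the JSON body a hypothetical service might send for this URL."""
    page = int(parse_qs(urlparse(url).query)["page"][0])
    return json.dumps({"results": FAKE_PAGES.get(page, [])})


records = fetch_all("https://api.example.com/orders", fake_fetch, page_size=2)
print(records)
```

Against a real service, `fetch` could wrap `urllib.request.urlopen` or a third-party HTTP client, and it would also need to handle authentication, retries, and rate limits.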
(7) Data Ingests:
A data ingest refers to the extraction of data from different sources. During the extraction process, the data engineer needs to pay close attention to the formats and protocols that apply to the situation—all while extracting the data swiftly and seamlessly.
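Paying attention to formats often comes down to giving every source its own parser while handing downstream code one common shape. The sketch below assumes two formats, CSV and JSON Lines, and a hypothetical `ingest` function that turns either into a list of dictionaries. The sample payloads are invented.

```python
import csv
import io
import json


def ingest(payload, fmt):
    """Parse a text payload in the given format into a list of dictionaries."""
    if fmt == "csv":
        # CSV carries no type information, so every value arrives as a string.
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "jsonl":
        # JSON Lines: one JSON object per non-empty line, with types preserved.
        return [json.loads(line) for line in payload.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


csv_rows = ingest("id,name\n1,Ada\n", "csv")
jsonl_rows = ingest('{"id": 2, "name": "Grace"}\n', "jsonl")
print(csv_rows)
print(jsonl_rows)
```

Note that the `id` value comes back as the string `"1"` from CSV but as the integer `2` from JSON Lines, which is exactly the kind of mismatch a later transformation step has to reconcile.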
(8) Configuring Business Intelligence Systems:
Once the data has been stored, data scientists establish the important connections between information sources and data warehouses, data marts, and data lakes. This could involve exposing the company’s data to advanced machine-learning algorithms that provide AI analytics, forecasting, general business intelligence, and more. It's important that data engineers understand how this process works to support data scientists in their jobs.
(9) Building Dashboards to Display Insights and Analytics:
Many business intelligence and machine learning platforms allow users to develop beautiful, interactive dashboards that showcase the results of queries, AI forecasting, and more. These tasks are usually the responsibility of data scientists; however, data engineers may assist the data scientists in this process. Although many BI platforms and RDBMS solutions allow users to create dashboards via a drag-and-drop interface, knowledge of SQL, R, and Python allows a data engineer to assist the data scientist to set up dashboards that fit their needs.
(10) Machine Learning:
Machine learning is primarily the domain of data scientists. However, because data engineers are the ones who build the data infrastructures that support machine learning systems, it’s important that they feel comfortable with statistics and data modeling. Also, not all organizations will have a data scientist, so it’s good to understand how to set up BI dashboards, deploy machine learning algorithms, and extract deep insights independently.
(11) UNIX, Solaris, and Linux Systems:
The machine learning systems of the future will likely run on UNIX-like operating systems such as Linux, due to requirements for hardware root access and the need for functionality that Windows doesn’t provide out of the box. Therefore, data engineers will want to get familiar with these operating systems now if they haven’t done so already.
How Do I Learn to Be a Data Engineer?
There's no clear path to becoming a data engineer, nor is the process easy. Although most data engineers learn by developing their skills on the job, you can acquire many of the skills you need through self-study, university education, and project-based learning.
Ethan Lyon, a data engineer writing for Seer, described the process of learning to be a data engineer like this:
"I’m not going to sugar coat it. The journey was hard, frustrating and had many times where I wanted to throw in the towel. I’m still challenged and don’t feel like I’m going to make it but guess what? I have and I will.
My boss recently asked me what I wish I would have known when I first started and I couldn’t get the idea out of my head. The main theme was the journey. A lot of people have the image of action-packed coding sessions … matrix-style 3D models flowing on 10 monitors. Yeah … more like getting an error in what looks like a 30-year-old DOS terminal then going to StackOverflow—a Q&A site for programmers. Copying the code in an answer, seeing if it works … rinse and repeat."
Whether you learn to be a data engineer at a university or on your own, there are many ways to reach your goal. Let's take a look at four ways people develop data engineering skills:
(1) University Degrees:
A university education isn't necessary to become a data engineer. Nevertheless, getting the right kind of degree will help. If you have the time and resources, getting a degree will make you more attractive to employers while giving you the data engineering skills and knowledge you need.
For a data engineer, the following degrees would serve your purposes: engineering, computer science, physics, applied mathematics, or a related technical field. A bachelor's degree in these topics is sufficient, but a master's degree in computer engineering or computer science would be ideal. One of the reasons to spring for a master's degree is that it will help you compete against other job applicants, even if you don't have prior work experience as a data engineer.
(2) Free and Inexpensive Online Coursework:
Some of the best data engineers are self-taught via free and inexpensive online-learning programs. Believe it or not, you could probably learn most of what you need to know by watching videos on YouTube. This article highlights a number of excellent YouTube videos that help lay the ground for becoming a data engineer.
Here are some free online courses to learn the basics of data engineering:
- A Beginner’s Guide to Data Engineering (Part 1), (Part 2), (Part 3): These articles on Medium cover the basics of data engineering and data science, including data modeling, data partitioning, and strategies for extracting, transforming, and loading (ETL) data. If you want to go deeper than we have time for in this article, this guide is the best place to start.
- Free Data Engineering E-Books: These e-books from O'Reilly are another great resource for developing the foundation you need to become a data engineer.
- Udacity's Data Engineering Nanodegree: Udacity is a company that offers high-quality, free, online education around mathematics and technology, and they have an entire track dedicated to teaching data engineering.
As you get deeper into your learning you'll need to master a variety of coding languages, operating systems, and information systems. This guide from Pranav Dar lists free resources for learning the following skills (Dar's guide might provide all you need to become an excellent data engineer):
- How to use Linux, CS401, and Raspberry Pi
- How to code in Python, SQL, and NoSQL
- How to use Hadoop, MapReduce, Apache Spark and Machine Learning
(3) Project-Based Learning (Our Favorite!):
Finding the motivation to complete online data engineering coursework can be difficult, and many would-be data engineers quit before getting their feet wet. If that happens to you, consider the project-based learning approach.
In this Medium article, data engineer Josh Temple outlines how he gave up on the boring coursework and self-study programs in favor of "project-based learning."
Here's what Temple says about the approach:
"In my experience as a self-taught data engineer, taking dozens of massive open online courses (MOOCs) is not the best approach. It didn’t work for me.
I didn’t have hours every night and weekend to spend studying. The lectures didn’t feel practical enough to launch me from a non-technical field to a job in data. The projects didn’t usually align with my interests, and course-after-course, I quickly ran out of motivation to continue."
Temple says that self-guided, project-based learning is the best way to learn data engineering. Considering that most data engineers learn on the job (i.e., they learn by doing different projects), Temple's approach makes a lot of sense. Here are a few more tips from Temple (make sure to read the whole article; it's excellent):
"Choose a project that’s interesting to you and requires skills you’d like to learn. As you build each unit of the project, learn the necessary skills to complete that unit. Project-based learning is more efficient, more practical, and more fun."
To add a lot more fuel to the project-based learning approach, consider writing about your work and research. Open a Medium account and devote some time to creating a few "how-to" articles on the topic of data engineering. Also, post your personal projects to GitHub, and contribute to open projects on GitHub. These actions will showcase your abilities to potential employers while boosting your data engineering street cred.
(4) Professional Certifications:
There are many professional certification courses for data science and data engineering. A quick search on the internet will find the following certifications:
- Vendor Specific Certifications: Oracle, Microsoft, IBM, Cloudera and many other data science technology companies provide training for valuable certifications in their products.
- Certified Data Management Professional (CDMP): Data Management Association International (DAMA) developed the CDMP program as a credential for being a general database professional.
- Cloudera Certified Professional (CCP) Data Engineer: The Cloudera CCP designation is a certification for professional data engineers, covering topics like data transformations, staging and storing information, data ingestions, data analysis from Parquet, JSON, Avro and more.
- Google Cloud Certified Professional Data Engineer: Applicants can receive the Google Cloud data engineer certification after successfully passing a two-hour exam.
However, these courses may not be as valuable as you think. Data engineering is something you learn by doing, and companies hiring data engineers know this. This Quora user offers the following critique of such certifications:
"Taking up such classes/certifications online may look good on your resume and may give you a little advantage over others but landing a full-time data science job is very unlikely [...] Experience in the data science field is essential."
If your employer is sponsoring you to get one of these certifications, excellent. If you're learning on your own, remember that learning by doing is infinitely more valuable than a certification—so focus on that.
Xplenty: The Perfect ETL Tool for Data Engineers
We'd like to thank you for reading this guide, and we sincerely hope it helps you achieve your data engineering career goals. As you move forward in this field, you'll probably discover how important data integration (ETL) tools are to doing your job. You'll also learn that not all ETL tools are the same. Compared to others, some are vastly easier to use and more powerful, like Xplenty.
The Xplenty ETL platform has hundreds of out-of-the-box integrations that allow you to extract, transform, and load data from your favorite database systems and cloud-based applications into your data warehouse solutions instantly. In fact, our visual, drag-and-drop interface is so easy to use, you might feel like you're cheating! If you'd like to learn more about our products and technology, contact the Xplenty team now.