The boom in Big Data has created an insatiable demand for data professionals at all levels. Analysts, DBAs, engineers, security consultants – employers are crying out for people with the right skills and experience. Perhaps the most sought-after of all these professionals is the Big Data architect. 

    Table of Contents

    1. What is a Big Data Architect?
    2. Why Does ETL Matter to a Big Data Architect?
    3. How Big Data Architects Use ETL
    4. Big Data Architects Need Xplenty

    What is a Big Data Architect? 

    In the world of construction, architects are a bridge between clients and engineers. The client might have a sketch of their dream house, but the engineers can only start working when they have detailed blueprints. Architects take the client's sketch and create a functional blueprint for the house. 

    Data architects work in exactly the same way. Enterprise stakeholders know what the business needs from its data, even if they don't know all the technical ins and outs of databases. The architect sits down with these stakeholders and asks questions like:

    • What data sources are available?
    • Who will use the data?
    • When will they use the data?
    • What kind of data processing will we perform?
    • Which repository stores the data?

    When the requirements are clear, the architect then creates a blueprint that covers things like: 

    • Data entities and their relationships
    • Data processing models, including pipelines between disparate systems
    • Components required for processing data according to business needs 

    Big Data architects work the same way as relational data architects, except that they face a more complex set of problems. It's not just that the data is bigger (although it is generally bigger by several orders of magnitude). Big Data architects also have to create data strategies that account for requirements like:

    • Handling unstructured data at scale
    • Getting fast results from distributed file systems
    • Working with innovative data repository structures
    • Maintaining data quality and eliminating data swamps

    It's an extraordinary challenge, although you'll find it easier if you have strong Big Data skills and the right tools. 

    Why Does ETL Matter to a Big Data Architect?

    ETL (Extract, Transform, Load) is a foundational tool in data architecture. The ETL process, which first emerged in the 70s, involves three key steps: 

    • Extract: The ETL process pulls data from disparate sources, such as production databases and cloud services.
    • Transform: Data passes through a transformation process. For example, the process might map the fields of a relational database table onto a different table structure required by the target system.
    • Load: Once data is in a standardized format, the ETL process loads it into a target repository, such as a data warehouse. 
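
    To make these three steps concrete, here is a minimal sketch of the pattern in Python. The source file, its column names, and the SQLite target are hypothetical stand-ins for whatever production systems your pipeline actually touches, not any particular vendor's API.

        import csv
        import sqlite3

        def extract(path):
            # Extract: pull rows from a source (a CSV file stands in for a production database or API)
            with open(path, newline="") as f:
                return list(csv.DictReader(f))

        def transform(rows):
            # Transform: standardize names, trim strings, and cast types before loading
            return [
                (row["order_id"], row["customer"].strip().upper(), float(row["amount"]))
                for row in rows
            ]

        def load(records, db_path="warehouse.db"):
            # Load: write the standardized records into a target table (SQLite stands in for a warehouse)
            con = sqlite3.connect(db_path)
            con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)")
            con.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
            con.commit()
            con.close()

        load(transform(extract("orders.csv")))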

    Data has evolved since the 70s, and so too has the ETL process. Data architects now have access to sophisticated, cloud-based ETL platforms that can move data in several ways. For a Big Data architect, ETL may just be a single tool in their toolkit. But it's an essential tool.

    How Big Data Architects Use ETL

    Mention Big Data, and most people think of ELT (Extract, Load, Transform), which populates data lakes with unstructured data. While a Big Data architect may use ELT, there are several use cases where ETL is the correct option. 

    Data Pipelines

    Data strategy often comes down to a simple problem. What's the most efficient way to get data from A to B? The answer is generally some variation on ETL. You extract data, put it through an integration process, and deliver it to its destination.

    Older versions of ETL involved a manual or semi-automated process, such as a batch import run by a scheduled cron job. Modern cloud-based ETL solutions allow architects to build fully automated pipelines that push data from source to destination via a staging database, where the transformations happen.
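
    As a rough sketch of that staging pattern (using SQLite files as placeholders for the staging database and the destination warehouse, with made-up table names), raw rows land in staging, the transformation runs there, and only the cleaned rows move on:

        import sqlite3

        staging = sqlite3.connect("staging.db")
        destination = sqlite3.connect("warehouse.db")

        # Stage: land the raw source rows as-is, so the source system is released quickly
        staging.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount TEXT)")
        staging.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                            [("A-100", "19.99"), ("A-101", "5.00")])  # sample rows

        # Transform: do the cleanup on the staging database, not on the source or the destination
        cleaned = staging.execute(
            "SELECT order_id, CAST(amount AS REAL) FROM raw_orders WHERE amount IS NOT NULL"
        ).fetchall()

        # Load: push only the transformed rows to the destination
        destination.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
        destination.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
        destination.commit()
        staging.commit()
        staging.close()
        destination.close()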

    Another advantage of cloud-based ETL solutions is that they often come with a library of integrations. This means that the Big Data architect doesn't need to allocate resources for developing and testing a hand-coded integration. Instead, they can trust that their ETL solution will connect automatically to any supported service. Even better, the host will update these integrations if there are any API changes.

    Data Lakehouse

    One drawback of ETL is that it only supports structured data. Most Big Data architects work with unstructured repositories such as data lakes, which is why many rely on ELT (Extract, Load, Transform) with on-demand, schema-on-read transformations.

    However, data lakes are not without drawbacks either. There is a processing overhead on queries, and some data lake platforms are effectively read-only. One compromise between these two structures is the data lakehouse: a data warehouse built on top of a data lake. 

    The advantage of this approach is that you can use a fast-paced ELT process to populate your lake, and then fill individual data warehouses with cleansed and integrated data. How do you do this? With an ETL process that extracts directly from the lake, applies your required schema, and then loads the result into the data warehouse.
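
    A simplified sketch of that lake-to-warehouse hop, assuming the lake is just a folder of raw JSON event files and a SQLite table stands in for the curated warehouse (the folder, table, and field names are illustrative only):

        import json
        import sqlite3
        from pathlib import Path

        SCHEMA = ("customer_id", "event_type", "amount")  # the schema applied on the way out of the lake

        def extract_from_lake(lake_dir="lake/events"):
            # Extract: read raw, schema-less JSON documents straight from the lake
            for path in Path(lake_dir).glob("*.json"):
                yield json.loads(path.read_text())

        def apply_schema(doc):
            # Transform: keep only the fields the warehouse table expects, with None for anything missing
            return tuple(doc.get(field) for field in SCHEMA)

        def load_to_warehouse(rows, db_path="warehouse.db"):
            # Load: write the schema-conforming rows into the curated warehouse table
            con = sqlite3.connect(db_path)
            con.execute("CREATE TABLE IF NOT EXISTS events (customer_id TEXT, event_type TEXT, amount REAL)")
            con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
            con.commit()
            con.close()

        load_to_warehouse(apply_schema(doc) for doc in extract_from_lake())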

    ETLT

    One of the main challenges in data architecture is decoupling storage from processing. For example, in a traditional ETL process, all transformations happen on the staging database. This may represent a waste of resources if, for example, you need to perform a JOIN against data held in the target repository.

    The ETLT approach spreads the transformation workload across the pipeline and the target repository. The first transformation, which takes place within the ETL stage, might involve tasks such as data validation, enrichment, and harmonization. The semi-transformed data then goes to a repository, such as a data warehouse. Once the load is complete, you can use the repository's processing resources to integrate the incoming data with the existing data.

    ETLT is a useful compromise between ETL and ELT. You can use this approach to pass all data through ETL and transform elements such as metadata. Further operations can then happen on the repository side. 
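
    Here is a minimal sketch of that split, again with SQLite standing in for the warehouse and invented table and column names: light validation and harmonization happen in the pipeline, while the heavier JOIN runs on the repository itself.

        import sqlite3

        def pre_transform(rows):
            # First T: lightweight validation and harmonization inside the pipeline
            for order_id, customer_id, amount in rows:
                if amount is not None and float(amount) >= 0:
                    yield order_id, customer_id.strip().lower(), float(amount)

        incoming = [("A-200", " C-1 ", "42.50"), ("A-201", "c-2", None)]  # sample source rows

        warehouse = sqlite3.connect("warehouse.db")
        warehouse.execute("CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, region TEXT)")
        warehouse.execute("CREATE TABLE IF NOT EXISTS staged_orders (order_id TEXT, customer_id TEXT, amount REAL)")
        warehouse.executemany("INSERT INTO staged_orders VALUES (?, ?, ?)", pre_transform(incoming))

        # Second T: the heavier integration step runs on the repository, where the existing data already lives
        warehouse.execute("""
            CREATE TABLE IF NOT EXISTS orders_enriched AS
            SELECT s.order_id, s.amount, c.region
            FROM staged_orders s
            JOIN customers c ON c.customer_id = s.customer_id
        """)
        warehouse.commit()
        warehouse.close()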

    Streaming Analytics

    Businesses are increasingly dependent on analytics. Leaders and managers need access to real-time dashboards that offer a 360-degree view of everything, from customers to logistics. From a data architecture perspective, you have to centralize all business-critical data, and you have to do it as quickly and efficiently as possible.

    Cloud ETL services can help. A cloud ETL platform can act as a messaging service between the source database and the target repository, effectively allowing a push publication of data. For example, if an admin creates an order in the ERP, the order data immediately enters the data pipeline and ends up in a data repository. From there, it's a matter of giving business users access to the right business intelligence tools.

    There's a big difference between real-time and near-real-time, so you'll need to build data pipelines that are fast, responsive, and never miss trigger events. There's also a trade-off in how much pre-processing to carry out before ingestion: complex transformations may slow down your streaming analytics, but they can lead to better results.
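
    To illustrate the push pattern, here is a small sketch that uses an in-process queue as a stand-in for the managed messaging layer; the order fields and handler names are invented for the example, not part of any real ERP or ETL API.

        import queue
        import threading

        pipeline = queue.Queue()  # stand-in for the managed messaging layer between source and repository

        def on_order_created(order):
            # Trigger event: the ERP-side hook pushes the new order into the pipeline immediately
            pipeline.put(order)

        def deliver_to_repository():
            # Consumer: applies only light pre-processing so the dashboard stays close to real time
            while True:
                order = pipeline.get()
                if order is None:
                    break
                order["amount"] = round(float(order["amount"]), 2)
                print("loaded into repository:", order)  # stand-in for the actual load step

        consumer = threading.Thread(target=deliver_to_repository)
        consumer.start()
        on_order_created({"order_id": "A-300", "amount": "19.999"})
        pipeline.put(None)  # stop signal, only needed for this sketch
        consumer.join()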

    Cloud Security

    Data is most vulnerable when it's in transit, which is why Big Data architects always have to take a security-first approach. The task is even more difficult in an age when there are few truly on-premise networks left in the world. Most organizations are either cloud-based or, more commonly, have a hybrid stack with cloud and on-premise components. 

    Cloud ETL adds an extra layer of security when transferring data, no matter where it's from. The originating data source has a one-to-one connection with the ETL platform. This connection is modular, so a problem with one source won't affect any others. The ETL platform itself also has a one-to-one relationship with the data repository. You can configure this relationship independently of the sources. 

    During the transformation stage, the ETL host is responsible for security. Security levels vary between providers, which means you can shop around for whoever offers the best package.

    Metadata and Master Data Management

    Perhaps the biggest challenge for a Big Data architect is applying structure to unstructured data. How do you impose any order on a repository that's filling up with data from across the entire enterprise?

    The answer is metadata and master data. Good architects will design a robust metadata policy. This creates consistency across the entire enterprise, making things easy to catalog and search. Master data management is another important strategy. This allows you to create a Single Version of Truth (SVOT) for data entities such as customers or products. You can then use the SVOT to validate the contents of your lake. 

    Metadata and master data are often suitable for storage in relational database tables. This calls for a data pipeline that extracts metadata or master data from the central repository. ETL can then cleanse and harmonize this data before moving it to a central data warehouse.
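
    As a rough illustration of the validation step (with SQLite standing in for the master data store and made-up customer records), incoming lake records are checked against the SVOT before they move into the warehouse:

        import sqlite3

        mdm = sqlite3.connect("master_data.db")
        mdm.execute("CREATE TABLE IF NOT EXISTS customers_svot (customer_id TEXT PRIMARY KEY, name TEXT)")
        mdm.executemany("INSERT OR IGNORE INTO customers_svot VALUES (?, ?)",
                        [("c-1", "Acme Ltd"), ("c-2", "Globex")])
        mdm.commit()

        # The SVOT is the reference list that incoming lake records are validated against
        known_ids = {row[0] for row in mdm.execute("SELECT customer_id FROM customers_svot")}

        lake_records = [
            {"customer_id": "c-1", "event": "order"},
            {"customer_id": "c-99", "event": "order"},  # unknown customer: flagged, not loaded
        ]
        valid = [r for r in lake_records if r["customer_id"] in known_ids]
        rejected = [r for r in lake_records if r["customer_id"] not in known_ids]
        print("load to warehouse:", valid)
        print("send back for review:", rejected)
        mdm.close()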

    Big Data Architects Need Xplenty

    Xplenty is the premier cloud-based enterprise ETL solution. With Xplenty, you'll find an extensive library of integrations and rock-solid security. It's an essential addition to any data stack.

    Schedule an intro call with our customer support staff to discuss your needs and obtain a 7-day risk-free pilot to see how Xplenty can complement your Big Data skills.