Xplenty interviews Igor Elbert, Principal Data Scientist at Gilt.com, on recent data projects, challenges and his approach to ETL.
On data architecture and challenges
Xplenty: Let’s talk data architecture. What type does Gilt use?
Igor Elbert: We use Aster from Teradata for our data warehouse. While most of Gilt's services and production databases are on the cloud, the data warehouse runs on hosted hardware.
Xplenty: Speaking of data, what are your biggest challenges?
Igor Elbert: The main challenges are data quality and availability. For example, I would love to have richer metadata on the products we sell – it would help with demand prediction, personalization, etc.
Xplenty: What are the most exciting projects you’ve worked on recently?
Igor Elbert: One of our recent projects involved finding optimum prices for products. Now, the algorithm has become a trusted partner for our merchandisers.
They specify a combination of goals, like revenue, margin, sell-through, etc., as well as other constraints. Then, the algorithm suggests the best combination of prices. It considers hundreds of product and sale attributes to predict expected demand for all possible prices. Most of the products it deals with are new, but we have a rich history of similar products so we are able to fit an accurate model.
The data preparation is done in Aster using SQL and SQL-MapReduce. The modeling, scoring and optimization are done in R. Most of the tasks are run in parallel on workers from our database cluster.
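The approach described above (predict demand at each candidate price, then pick the price that best serves the merchandiser's goals) can be illustrated with a minimal Python sketch. This is not Gilt's implementation, which runs in R against Aster. The constant-elasticity demand curve, the field names and the `best_prices` helper are all assumptions made for illustration, and each product is optimized independently rather than as a joint combination.

```python
def expected_demand(base_demand, elasticity, price, ref_price):
    """Toy constant-elasticity demand curve (assumed; stands in for a fitted model)."""
    return base_demand * (price / ref_price) ** (-elasticity)


def best_prices(products, candidate_prices, revenue_weight=0.5, margin_weight=0.5):
    """For each product, pick the candidate price with the highest weighted score.

    The score blends revenue and margin; units sold are capped by inventory.
    All dictionary keys used here are hypothetical.
    """
    chosen = {}
    for p in products:
        best_price, best_score = None, None
        for price in candidate_prices[p["name"]]:
            demand = expected_demand(
                p["base_demand"], p["elasticity"], price, p["ref_price"]
            )
            units = min(demand, p["inventory"])  # cannot sell more than we stock
            revenue = price * units
            margin = (price - p["cost"]) * units
            score = revenue_weight * revenue + margin_weight * margin
            if best_score is None or score > best_score:
                best_price, best_score = price, score
        chosen[p["name"]] = best_price
    return chosen


if __name__ == "__main__":
    catalog = [{
        "name": "scarf", "base_demand": 100, "elasticity": 2.0,
        "ref_price": 50, "cost": 20, "inventory": 1000,
    }]
    grid = {"scarf": [30, 40, 50, 60]}
    # A merchandiser who cares only about margin sets the weights accordingly.
    print(best_prices(catalog, grid, revenue_weight=0.0, margin_weight=1.0))
```

Exposing the goal weights as plain parameters is one way to let a business user change the objective without touching the demand model.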
Xplenty: What were your most important considerations when designing the solution and architecture to implement it on?
Igor Elbert: Our Aster database was a good match for the project. It allows us to quickly process large amounts of data, run hundreds of ad-hoc queries during the discovery stage, and hide the complexity of the solution by presenting the result of an SQL-MapReduce project as a table. This way, we use SQL for what SQL does best – data manipulation – and we use R for modeling, scoring and optimization.
Xplenty: What were the criteria you applied when choosing the solution?
Igor Elbert: Aster is our MPP database of choice at the moment. It is our default choice for large-scale data manipulations. R was selected because of the multitude of available packages including interfaces to popular modeling and optimization libraries. R's dplyr package, in particular, is a pleasure to work with.
On data preparation
Xplenty: How much of your team’s time is spent on “data preparation” versus analyzing data and drawing insights?
Igor Elbert: Fortunately, our data team includes dedicated engineers and a homegrown framework that allows very quick integration of data sources. Still, my team spends about 50-60% of our time understanding, transforming and cleaning the data. A couple of years ago, this percentage was higher – around 80% – but now we can often reuse data preparation queries from previous projects. Some aspects of the data preparation have become part of our ELT process, which also reduces the effort.
Xplenty: Why is the data preparation such a significant pain point for data scientists?
Igor Elbert: The main factor is that data preparation is an inherently complicated problem. We need to understand all the concepts that data represents and the interrelationships between them. Somebody needs to make a judgement call about the quality of the data, detect and fix multiple issues, think about meaningful ways to deal with missing values and data points, outliers, and more. That takes time and some aspects are hard to automate.
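Two of the routine steps mentioned above, handling missing values and spotting outliers, can be sketched in a few lines of Python. This is an illustrative example only, not Gilt's framework. Median imputation and the 1.5 × IQR rule are assumed choices, and whether they suit a given column is exactly the kind of judgement call the answer refers to.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


if __name__ == "__main__":
    raw_prices = [10, 12, None, 11, 13, 500]  # hypothetical column
    cleaned = impute_median(raw_prices)
    for value, is_outlier in zip(cleaned, flag_outliers(cleaned)):
        print(value, "outlier" if is_outlier else "ok")
```

The code only flags suspicious values. Deciding whether a flagged point is an error or a genuine extreme still needs someone who understands what the data represents.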
Xplenty: Are you using ETL tools for your data preparation?
Igor Elbert: We have a homegrown ELT framework, which allows us to quickly bring data into our data warehouse, clean it and make it available for analysis.
On the future of the field
Xplenty: What’s the best advice you can give to someone facing a similar use case?
Igor Elbert: Make it possible for business users to be as self-sufficient as possible. They should be able to tweak the parameters and get results.
Xplenty: What changes have you seen take place in the job requirements of a data / BI pro since you first started in this field?
Igor Elbert: We’re expected to handle both technical and business aspects like never before. At the same time, a new breed of technically savvy business users has emerged. It raises the bar for collaboration, as well as the requirements for data quality.
Xplenty: What are the most important tools for companies to implement if they want to succeed in fulfilling the data missions of tomorrow?
Igor Elbert: Every successful tool needs to advance the idea of “data democratization” – making data easily available to everyone who makes any kind of decisions or answers any questions where data access would support them.
Next, tools need to make data-driven decision-making possible for companies that are currently only “data-aware” at best.
Then, as a next step after BI, practical machine learning should become more readily available to a wider audience.
Xplenty: What advice would you give our readers?
Igor Elbert: Design for change. Business rules, constraints and goals that you started with will change before you’ve even finished coding. Business is very dynamic – a good solution should accommodate that.
To read the full interview, go to The Data Janitors Blog on Medium.