What are the main challenges of building an end-to-end data integration platform at petabyte scale?
Max Schultze [MS]: At Zalando, building a Data Lake was not the very first thing the company had in mind. As the company grew from a startup to its current size of thousands of employees, the technical landscape grew organically, and so did the data landscape. Realizing that classical analytical processing was no longer sufficient to sustain the company’s growth – not even speaking of future-oriented use cases like machine learning – the first team of data engineers was founded. Soon after, the very first challenge towards a data integration platform became apparent, and it was not even a technical one: bring order to the chaos.
Understanding your data landscape and establishing a proper, company-wide accepted set of rules for data responsibility and data hygiene is a gigantic challenge, but mastering it will propel you forward drastically. Setting priorities properly will make the difference between the success and failure of your project.
Three years into the project, many things have been built. As obvious as it sounds, one of the biggest challenges is to scale what you have. In a growing company, the amount of data produced is constantly growing, too. Over time, that will put stress on every part of your infrastructure: your management database will fill up, your self-built data ingestion pipeline will start failing more often, and you will even discover limitations in the open source systems you are using. Be ready to throw away what you know and stay open to new ideas and new technologies developed by others.
What are the best practices?
MS: Be aware that you will always have fewer people than you have work to do. To keep moving forward, you have to be very efficient in how you build and maintain your infrastructure. One of the best practices we follow very closely is to automate operations as much as possible. Build self-healing systems to minimize manual interventions. Whenever you observe manual tasks being executed too often – which might even be the manual execution of already automated steps – keep automating.
To build a successful Data Lake, keep your users close. Understand what matters and focus on the biggest needs and pain points. There will always be plenty of feature requests and feedback. Directly engage with your users on a regular basis. You can identify needs such as centralization of processing capacities, or pain points, like your ad-hoc analytical system not performing well enough, or your integration pipelines being late too often. Sometimes this can lead to easy fixes, sometimes it will result in fundamental shifts in your thinking and the company-wide analytical setup.
What technology tools would you recommend for this task and why?
MS: Leverage the offerings of cloud providers and do less yourself. Serverless is the next paradigm shift, one where you focus much more on the “what” than the “how”. You will no longer be able to understand every execution step of your backend, but you also no longer have to worry about system maintenance yourself. Buying into these services gives you strong guarantees from the cloud providers, which will lead to many more quiet nights. Our first entirely serverless pipeline went to production in summer 2018, and we have yet to see an infrastructure-related incident.
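To make the serverless idea concrete: a pipeline step often reduces to a single event handler, with the provider handling scheduling, scaling, and retries. The following is a hypothetical AWS Lambda-style handler in Python – the event shape and key names are illustrative assumptions, not the actual Zalando pipeline:

```python
import json

def handler(event, context):
    """Hypothetical Lambda entry point for one ingestion step:
    extracts the S3 object keys from the triggering event and
    returns a status record. All infrastructure concerns live
    outside this function."""
    records = event.get("Records", [])
    keys = [r["s3"]["object"]["key"] for r in records]
    return {"statusCode": 200, "body": json.dumps({"ingested": keys})}

# Local invocation with a fabricated S3 event shape.
event = {"Records": [{"s3": {"object": {"key": "raw/2018/orders.json"}}}]}
print(handler(event, None))
```

The function itself contains only business logic, which is exactly the “what over how” trade-off described above.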
Embrace infrastructure as code. Infrastructure setup usually involves many resources tied together, dependencies across them, and parameters that must be provided and tuned. The more of that you take care of by hand, the higher the risk of human error. Recently we started adopting the AWS Cloud Development Kit, an open source framework for defining cloud infrastructure as code. It is a tremendous help in defining resources with only the parameters that really matter to you, relying on smart defaults for the rest. Additionally, coding your infrastructure in an IDE surfaces errors as you type, not after 10 minutes of compilation and deployment.
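The smart-defaults idea can be sketched without CDK itself: a small constructor takes only the parameters the caller cares about and fills in the rest of a CloudFormation-style resource. This stdlib-only Python sketch illustrates the pattern; it is not the CDK API, and the property names are simplified for illustration:

```python
import json

def bucket_resource(name, versioned=True, encryption="AES256"):
    """Hypothetical helper in the spirit of a CDK construct: the caller
    supplies only the parameters that matter; everything else is a
    smart default baked into the generated template."""
    return {
        name: {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {
                    "Status": "Enabled" if versioned else "Suspended"
                },
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {"ServerSideEncryptionByDefault": {"SSEAlgorithm": encryption}}
                    ]
                },
            },
        }
    }

# One short, reviewable line of code expands into a full resource definition.
template = {"Resources": bucket_resource("RawData")}
print(json.dumps(template, indent=2))
```

Because the defaults live in code, they can be type-checked and unit-tested like any other code, which is where the in-IDE error feedback comes from.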