Engineering Lead who led the development of the Stream Processing platform for business insights at Netflix shares a recipe for building infrastructure for a data-driven culture. An interview with Monal Daxini.
What is the Keystone Stream Processing Platform, and what services does it offer?
Monal Daxini [MD]: The Keystone Stream Processing Platform is an essential part of Netflix’s data infrastructure for ingesting and processing streaming events for business insight. The platform comprises two key offerings: first, self-serve, declarative Data Ingest Pipelines to collect, process, and route events in near real-time to different data stores; second, a Stream Processing as a Service (SPaaS) platform that enables engineers to build and operate custom managed stream processing applications, allowing them to focus on business application logic while the platform provides the scale, operations, and domain expertise.
You decided to build Keystone with Apache Flink, a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Why did you choose Flink?
MD: Compared to some other stream processing engines, Flink seemed to provide the core functionality to support the majority of our existing and upcoming use cases. Some of those core functionalities are: a data flow programming model to work effectively with streaming data; support for messaging semantics – at-most-once, at-least-once, exactly-once; fault-tolerant processing and state management; scalable processing and flexible deployment models; and an open source project with an active community.
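To give a flavour of what that programming model and fault-tolerance support look like in practice, here is a minimal sketch (not code from the interview) of a Flink DataStream job in Java: a word count over a placeholder socket source with exactly-once checkpointing enabled. The host, port, and checkpoint interval are purely illustrative.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fault tolerance: snapshot all operator state every 30 seconds so the job
        // can recover with exactly-once state semantics after a failure.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder source; a production pipeline would typically read from Kafka.
        env.socketTextStream("localhost", 9999)
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.toLowerCase().split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(0)   // partition by word; running counts live in managed, fault-tolerant state
           .sum(1)
           .print();

        env.execute("streaming-word-count");
    }
}
```

If the job fails, Flink restores the per-word counts from the last checkpoint and resumes processing, which is the fault-tolerant state management mentioned above.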
Today, a trillion events and a petabyte of data flow through the Keystone infrastructure. What are the lessons learned from your stream processing journey?
MD: First of all, working with streaming data and developing stream processing applications requires a change in mindset. In addition to a capable stream processing engine, it’s very important to offer a usable platform, so that users can focus on business application logic while the platform provides the scale, operations, and domain expertise. I will say more about the lessons learned in my talk at the Big Data Tech Warsaw Summit.
Data Scientist at Twitter on collaboration with data engineers. An interview with Mateusz Fedoryszak.
What is the secret of successful collaboration between data scientists and data engineers?
Mateusz Fedoryszak [MF]: Respect and humility. When you’re working with people with complementary skill sets, it’s easy to think: ‘We’re solving the real problems; a high school student could complete their tasks.’ You often don’t realise why deploying a small service or plotting a simple chart might be challenging. On the other hand, even people who don’t fully understand your domain might provide valuable suggestions and feedback.
Is a strict separation of responsibilities the solution? Sometimes part of the problem is forcing data scientists to do engineering tasks – the reverse probably happens less often?
MF: We had the opposite problem: it seemed that both teams wanted to do everything. Highly talented people often have trouble handing tasks over: they don’t believe anyone can do them as well as they can. Also, as a data scientist, I can probably perform engineering tasks, but it will not be effective: a specialist will do it better.
How do you divide tasks optimally?
MF: One of the important aspects is making sure everyone’s ratio of interesting to boring tasks is similar. Everyone contributes to the creative process, but everyone also does their chores.
Could you give two examples of such interesting creative tasks and boring duties?
MF: Designing a new architecture or developing a new algorithm or model is creative and fun. Looking for memory leaks and data cleaning is rather boring.
Are the right tools part of the solution?
MF: Tools are important, but even more important are the processes and mutual understanding. It always surprises me how crucial the human aspect of my work is.
How were these processes designed in your current organisation?
MF: When we started working with a data engineering team from another department, we wrote down exactly how much time each team was going to devote to this project and what each team would work on. This looks simple, but it allowed us to agree on expectations. It is extremely important to remember that we are all humans: as one team works in London and the other in New York, it is important to meet face to face once in a while; video conferences alone are often not enough.
Data Engineer at Zalando on data integration at petabyte scale, best practices, and technology tools. An interview with Max Schultze.
What are the main challenges of building an end-to-end data integration platform at petabyte scale?
Max Schultze [MS]: At Zalando, building a Data Lake was not the very first thing the company had in mind. Throughout the company’s growth from a startup to the thousands of employees it has now, the technical landscape grew organically, and so did the data landscape. Realizing that classical analytical processing was no longer sufficient to sustain the company’s growth – not even speaking of future-oriented use cases like machine learning – the first team of data engineers was founded. Soon after, the very first challenge on the way to a data integration platform became apparent, and it was not even a technical one: bring order to the chaos.
Understanding your data landscape and coming up with a proper, company-wide accepted set of rules for data responsibility and data hygiene is a gigantic challenge, but mastering it will propel you forward drastically. Setting priorities properly will decide between the success and failure of your project.
Three years into the project, many things have been built. As obvious as it sounds, one of the biggest challenges is to scale what you have. In a growing company, the amount of data produced is constantly growing, too. Over time that will put stress on every part of your infrastructure. Your management database will fill up, your self-built data ingestion pipeline will start failing more often, and you will even discover limitations in the open source systems you are using. Be ready to throw away what you know and stay open to new ideas and new technologies developed by others.
What are the best practices?
MS: Be aware that you will always have fewer people than you have work to do. To keep moving forward, you have to be very efficient in how you build and maintain your infrastructure. One of the best practices we follow very closely is to automate operations as much as possible. Build self-healing systems to minimize manual interventions. Whenever you observe manual tasks being executed too often – which might even be the manual execution of already automated steps – keep automating.
To build a successful Data Lake, keep your users close. Understand what matters and focus on the biggest needs and pain points. There will always be plenty of feature requests and feedback. Directly engage with your users on a regular basis. You can identify needs such as centralization of processing capacities, or pain points, like your ad-hoc analytical system not performing well enough, or your integration pipelines being late too often. Sometimes this can lead to easy fixes, sometimes it will result in fundamental shifts in your thinking and the company-wide analytical setup.
What technology tools would you recommend for this task and why?
MS: Leverage the offerings of cloud providers and do less yourself. Serverless is the next paradigm shift, where you focus much more on the “what” than the “how”. You will no longer be able to understand every execution step of your backend, but you no longer have to worry about system maintenance yourself. Buying into these services gives you strong guarantees from the cloud providers, which will lead to many more quiet nights. Our first entirely serverless pipeline went to production in the summer of 2018, and we are yet to have an infrastructure-related incident.
Embrace infrastructure as code. Infrastructure setup usually involves many resources put together, dependencies across them, and parameters being provided and tuned. The more of that you take care of by hand, the higher the risk of human error. Recently we started adopting the AWS Cloud Development Kit (CDK), an open source framework for defining cloud infrastructure as code. It is a tremendous help in defining resources with only the parameters that really matter to you, relying on smart defaults for the rest. Additionally, coding your infrastructure in an IDE surfaces errors as you type them, not after 10 minutes of compilation and deployment.
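To illustrate the idea (this is not code from the interview), here is a minimal sketch of a single-stack CDK app using the CDK’s Java bindings; the stack and bucket names are invented, and only a couple of parameters are spelled out while the CDK supplies defaults for everything else.

```java
import software.amazon.awscdk.core.App;
import software.amazon.awscdk.core.Construct;
import software.amazon.awscdk.core.Stack;
import software.amazon.awscdk.services.s3.Bucket;
import software.amazon.awscdk.services.s3.BucketEncryption;

// Hypothetical stack for a data lake landing zone.
public class DataLakeStack extends Stack {
    public DataLakeStack(final Construct scope, final String id) {
        super(scope, id);

        // Only the parameters that matter here are spelled out; the CDK
        // fills in sensible defaults for everything else (policies,
        // removal behaviour, physical names, ...).
        Bucket.Builder.create(this, "RawEventsBucket")
              .versioned(true)
              .encryption(BucketEncryption.S3_MANAGED)
              .build();
    }

    public static void main(final String[] args) {
        App app = new App();
        new DataLakeStack(app, "data-lake");
        app.synth();   // emit the CloudFormation template
    }
}
```

Deploying such an app (for example with the `cdk deploy` CLI command) turns the definition into a CloudFormation stack, so infrastructure changes go through code review like any other change.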
A Flink committer on the new-generation big data framework and processing engine, the project’s development plans, and why it is great to contribute to open source projects and the community. An interview with Dawid Wysakowicz, Software Engineer at data Artisans.
Why can Apache Flink be considered the best choice for processing data streams?
Dawid Wysakowicz [DW]: One reason is that it addresses all streaming use cases: bounded streams – a.k.a. batch – streaming analytics, event-driven applications, etc. It also has best-in-class support for state management and event time. It is also industry-proven, as it runs at Netflix scale.
In contrast to other open-source stream processors, Flink provides not only true stream processing at event granularity, i.e., no micro-batching, but also handles many batch use cases very well.
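As an illustration of the event-time support mentioned above (a sketch against the Flink 1.7-era DataStream API, not code from the interview), the following job counts events per user in 10-second event-time windows, tolerating events that arrive up to five seconds out of order; the inline sample events and user names are placeholders.

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Window by the timestamp carried in each event, not by the machine clock.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Placeholder events: (userId, eventTimeMillis, count).
        env.fromElements(
                Tuple3.of("alice", 1_000L, 1),
                Tuple3.of("bob",   2_000L, 1),
                Tuple3.of("alice", 9_500L, 1))
           // Tolerate events arriving up to 5 seconds out of order.
           .assignTimestampsAndWatermarks(
               new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Integer>>(Time.seconds(5)) {
                   @Override
                   public long extractTimestamp(Tuple3<String, Long, Integer> event) {
                       return event.f1;
                   }
               })
           .keyBy(0)                        // one logical stream per user
           .timeWindow(Time.seconds(10))    // 10-second tumbling windows in event time
           .sum(2)                          // events per user per window, one event at a time
           .print();

        env.execute("event-time-window-counts");
    }
}
```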
What are the main directions of Flink development?
DW: Right now the Flink community is focused on the true unification of batch and streaming, both in terms of query semantics and at the runtime level. That means it will be possible to perform operations such as bounded-to-unbounded stream joins or enrichment with static data in an optimized way. Later on, this will expand to more use cases such as machine learning on data streams.
Another area of focus is expanding the scope and features of SQL support. We plan to fully support TPC-DS in the near future, but also to further define industry-wide standards for streaming SQL. Moreover, you can expect significant performance improvements in the future.
What are you working on right now as a Flink committer?
DW: Recently I was driving the effort to support the MATCH_RECOGNIZE clause, which allows querying for row sequence patterns with SQL and which I am giving a talk about. Right now I am involved in efforts to further improve SQL support.
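For readers unfamiliar with the clause, here is a hedged sketch of what a MATCH_RECOGNIZE query can look like when submitted through Flink’s Table API (assuming a Flink release that has both MATCH_RECOGNIZE support and the StreamTableEnvironment.create factory, roughly the 1.9 line); the Ticker table, its columns, and the query itself are illustrative and assumed to be registered elsewhere.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class FallingPriceQuery {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // "Ticker" (symbol, price, rowtime) is assumed to be registered elsewhere,
        // e.g. from a Kafka-backed stream with an event-time attribute.
        Table drops = tableEnv.sqlQuery(
            "SELECT * FROM Ticker " +
            "MATCH_RECOGNIZE ( " +
            "  PARTITION BY symbol " +                 // look for patterns per stock symbol
            "  ORDER BY rowtime " +                    // in event-time order
            "  MEASURES A.rowtime AS drop_start, C.price AS final_price " +
            "  ONE ROW PER MATCH " +
            "  AFTER MATCH SKIP PAST LAST ROW " +
            "  PATTERN (A B C) " +                     // three consecutive rows ...
            "  DEFINE B AS B.price < A.price, " +
            "         C AS C.price < B.price " +       // ... with strictly falling prices
            ")");
        // 'drops' can then be converted back to a DataStream or written to a sink.
    }
}
```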
How did you join the community and start contributing to Apache Flink?
DW: I started contributing because I wanted to improve my coding skills and found Apache Flink’s community the most welcoming one among the open-source projects I looked into. I was also lucky enough to meet people who allowed me to work on this project as part of my job, first at GetInData and now at data Artisans.
Contributing to open-source projects is something I would recommend to everybody, as it is a great way to meet and work with super intelligent people on extremely interesting problems.