Big Data 2019 - report

Big Data Technology Warsaw 2019 Recap: from technology to people

The rise of Kubernetes, open source in the cloud, market consolidation, and a shortage of data science and data engineering skills top the takeaways from Big Data Technology Warsaw Summit 2019.

Big Data has always evolved fast. Not so long ago, the Hadoop and open source revolution reshaped the data analytics landscape. But the big data and AI technology landscape is still changing rapidly. Today we see new megatrends that might change it completely: containerisation, hybrid and public cloud, and ML/AI adoption.

The world today looks entirely different from 2013. Services offered by global giants are built from hundreds of microservices. Enterprises face challenging new requirements, such as cloud-ready deployments and frictionless promotion of machine learning models into production, while ensuring proper data governance and security.

Every company wants to be data-driven, powered by AI, and able to monetise its data. That is why this year’s edition of Big Data Technology Warsaw Summit, the fifth, was the biggest and most successful so far.

The event brought together some 500 participants from many European countries. They gathered to hear about the latest tools and tech and to share new ideas on Big Data. Over 60 speakers who work with Big Data at top data-driven companies such as Cloudera, Zalando, Slack, Amazon Web Services, Booking.com, and Twitter shared their experience from the fields of Big Data analysis, scalability, storage, and search. As practitioners, they passed on their recommendations, tools, models, successes, and failures.

After the plenary session, the conference forked into four technical tracks: Architecture, Operations and Cloud; Data Engineering; Artificial Intelligence and Data Science; and Streaming and Real-Time Analytics. Together they covered the most essential and up-to-date aspects of Big Data, including deep learning, real-time stream processing, and the cloud. The roundtable sessions were among the event’s highlights. Twenty-seven discussions, each hosted and moderated by an industry expert, engaged participants in exchanging opinions and experiences on a specific issue. The event also included technical, hands-on workshops attended by 140 people.

Living in a post-Hadoop world

The most significant megatrends in the big data landscape today are open source technologies and the cloud in its different flavours. Another important theme is how to better implement machine learning and AI projects in the cloud.

“I analysed the biggest trends in open source over the last years. 2013 was the best year for Hadoop, the next year was a big year for Spark. In 2015 and 2016 Kafka became a mainstream technology. A year after that, stream processing with Flink, Beam, and Cloud Dataflow became extremely popular. Last year, 2018, was the year when Kubernetes took off,” said Adam Kawa, CEO and Co-founder of GetInData.

For many years Hadoop was the big data platform. It was an essential technology, but it started declining as the cloud grew. At Big Data Technology Warsaw Summit, many speakers and attendees said that the original Hadoop is dying out. Some saw the merger of Cloudera and Hortonworks as proof of that: if the two most prominent vendors in the Hadoop space are merging, business must be tough.

Is Hadoop dead? The original Hadoop technology seems obsolete, but the jury is still out. Hadoop as an ecosystem seems healthy, and related projects like YARN, MapReduce, Hive, and Spark will be used for years to come.

The new Cloudera is trying to reshape the enterprise data landscape by providing customers with a unified, open-source, hybrid-cloud offering. “We are still 100% committed to open source. Unity release will combine the best of both worlds of CDH and HDP,” said Gergely Devenyi, Director of Engineering at Cloudera. The company delivers cloud-native machine learning and analytics “from the Edge to AI”. Its modern data architecture lets on-premises, multi-cloud, and private cloud deployments work alike within a single distribution.

Cloudera was on stage at the Big Data Technology Warsaw Summit with its partner, 3Soft S.A. Together they demonstrated highlights of the ongoing technological innovation and provided real-life examples from the field to showcase the relevance of, and urgent need for, such a unified platform.

Kubernetes on the rise

The new post-Hadoop world is container-based. It needs more agile adoption of new technologies, and cloud-readiness is an integral part of analytics frameworks. Enterprises are leveraging cloud-native technologies to transform their businesses. The cloud-native promise is the ability to go from idea to production quickly.

Perhaps the strongest message at the Big Data Technology Warsaw 2019 event was that Kubernetes is becoming the framework for enterprises to deploy software infrastructure. It is an approach that more and more enterprises have begun to embrace.

Kubernetes has gained momentum and matured enough to meet the needs of enterprises. It is growing incredibly fast. The benefits of Kubernetes are undeniable. The technology is stable. Users are running Kubernetes on the public cloud, on premises, and across different infrastructures. We will see stronger demand in 2019. In a few years, Kubernetes actually might become more prominent than virtualisation.

“Kubernetes allows companies to choose where they want to run their computation analysis. Whether it could be on-premise, or in Google cloud or maybe a different cloud. With Kubernetes it’s effortless to go to the cloud like Google, but it’s also effortless to go out of the cloud,” said Adam Kawa.

Kubernetes (k8s) is a portable, extensible open-source platform for managing containerised workloads and services. It orchestrates computing, networking, and storage and enables portability across infrastructure providers. Google open-sourced the project in 2014. Now it has a growing ecosystem that provides services, support, and tools. The Kubernetes community is vibrant. People and organisations share their experiences and knowledge, and everybody tries to contribute.

“Kubernetes is at the core of pretty much everything we do. We use Kubernetes as our container orchestration framework. And by that I mean we use it in all typical use cases. We take stateless microservices and redeploy them onto our cluster. We monitor their load, and we also scale them to keep our services the way we like. We want to run absolutely everything we do as a container. This allows us to be both cloud-ready and also cloud agnostic,” said Rob Keevil, Data Analytics Platform Lead at ING.

Kubernetes gets substantial attention from Google. Last year the company announced a project called Kubeflow, which could well become a separate topic at conferences like the Big Data Technology Warsaw Summit. “We want to make containers more approachable from the data science or data engineer perspective. Kubeflow is the project that we are investing a lot in. We’ve got a lot of attention also from the rest of the industry. So I believe that from Google’s perspective, this is one of those ideas I believe is going to grow significantly this year,” said Michał Żyliński, Cloud Customer Engineer at Google.

Andrzej Michałowski, Head of AI Research & Development at Synerise, agreed that a lot of Kubeflow projects will probably emerge this year, but he mentioned another interesting project also based on Kubernetes: Kubeless. “I believe companies will invest more and more in the serverless approach. Companies that were afraid of vendor lock-in could not use AWS Lambda or Azure Functions. Now they can take this approach and go with it. Kubernetes and cloud solutions give you one very important thing: freedom. Freedom to develop systems. Freedom to scale. Freedom to experiment. My team is using Azure Machine Learning Compute to be able to experiment with our approaches, with new models, with new ideas. Instead of days, we can do it in hours or minutes,” he said.

Future in the clouds

The cloud is a no-brainer for enterprises. The advantages are obvious. You can quickly spin up your development environment. You can elastically scale the workloads you need. The technical benefits are clear, and they are only getting better over time. In the cloud, you can develop more quickly than on infrastructure inside an organisation.

“Leverage the offerings of cloud providers and do less yourself. Serverless is the next shift in paradigm where you are much more focusing on the ‘what’ than the ‘how’. No longer will you be able to understand every execution step of your backend, but no longer will you have to worry about system maintenance yourselves. Buying into these services gives you strong guarantees from the cloud providers which will lead to many more quiet nights. Our first entirely serverless pipeline went to production in summer 2018, and we are yet to have an infrastructure-related incident,” said Max Schultze, Data Engineer at Zalando SE.

It’s pretty clear that the cloud is the future. According to Rob Keevil, the only thing holding it back is that the legal situation is against it at the moment. New laws in Europe and in the US make moving to the cloud quite difficult. That is going to be a really interesting challenge for 2019.

There is no competition between open source projects and the cloud, although there are some challenges. “We do not compete with the cloud. Actually, Flink is offered as a service by some companies. Although in the last months there was also a bit of discussion in the open source communities to what extent cloud providers should be allowed to use open source without contributing back. A few companies added or extended licenses for their projects,” said Fabian Hueske, Software Engineer at Ververica (formerly data Artisans).

The need for data professionals

Companies face significant challenges in adopting cutting-edge big data tools. But first of all, to leverage their data successfully, they need a world-class team. That is not a simple task. Recruiting talent is one of the biggest challenges for Big Data vendors and software houses in every part of the world.

“I’ve marked where the members of our team come from on the map, and a world-class team is by definition an international team,” said Rob Keevil. “The world-class team is also very much embedded into the world of open source. We’re not just taking these big data tools and deploying them. We always have to write integrations. We always have to modify the original projects. And we never want to branch these tools because we will swap one maintenance challenge with another. We’ll build a flexible feature. So all our developers have to be in the open source world.”

Team members must also be multidisciplinary, because delivering a comprehensive platform takes development, operations, front-end, and back-end skills. Sometimes more unusual professionals are required, such as non-financial risk specialists or ethical hackers.

Building such a world-class team is especially tough for small and medium companies. Competing for talent with Google is very hard. People choose Google to find out what it is like to work for the tech giant that is pioneering a lot of groundbreaking technologies. It is hard, but it is not impossible. One thing smaller companies can do is invest in new cloud-native technologies, like Kubernetes. Another proven strategy is to engage people in many different activities and projects so that they can develop their careers. This is especially important for younger generations such as Millennials.