Big data, even bigger challenges - Big Data Technology Warsaw Summit

We all know it: data is the new oil, and we live in a data age. We all understand the value of data in optimizing the business and pursuing new opportunities. The question is how to use it efficiently to become more successful and more agile, in short, how to become a data-driven organization. The best place to look for answers is Big Data Tech Warsaw Summit, an independent annual conference focusing on data science and engineering, AI and machine learning, data strategies and monetization, DevOps and the cloud.

Data truly is today's most valuable resource. Every company in the world today wants to analyze data to improve its internal processes and enhance the way it works with customers or collaborates with suppliers and partners. But this is not trivial, and companies need to face a few challenges to be successful. It requires the right skill set, which is hard to find, and appropriate technology in an increasingly sophisticated technology space.

Thomas Scholz, Sales Engineering
Manager for EMEA, Snowflake

"There are a few challenges that companies have to deal with to be really successful today. Issue number one is definitely the growth of the data and staying on top of this growth. Data expanded exponentially over the last few years. It looks like it will continue to grow exponentially over the next years. It's a real challenge for everyone to deal with the sheer amount of data before trying to understand it," said Thomas Scholz, Sales Engineering Manager for EMEA at Snowflake, talking about the challenges of modern analytics during the opening keynote of Big Data Tech Warsaw Summit (BDTWS) 2020.

He argued that companies also need to break down rigid and centralized legacy architectures. Apart from the size and the large number of data sources, today's distributed nature of data requires agile, data-centric architectures that evolve with the business environment and allow real-time access, real-time ingestion, and real-time results. This is the only way to achieve the needed insights and required acceleration.

Issue number two is complexity. It is tough for companies to find the right technology. They need expert guidance because today they spend too much time managing different bits and pieces, tuning, updating, and rebooting. The third challenge is data diversity - different types of data, structured and semi-structured, and also different data silos. The business wants to understand the customer deeply, so it needs to harness and analyze all available sources, internal and external, open and commercially available. It is also tough to secure the data and protect it from malware, so issue number four is security and access. And then there is challenge number five - costs and the cost of failure.

Facing these challenges and achieving success is easier with new technologies such as the cloud. By taking advantage of new technologies and the explosion of data, companies can significantly change how they operate and compete in the market.

"First, when you are able to get your arms around all of your data, it's far easier to make better, quicker business decisions. Second, the ease of use and flexibility that comes with working with data in a new way allows us to reduce the overall cost of how we scale data management and analytics. Third, we can eliminate excessive costs and focus on delivering a great customer experience through data. Whether it's improving how we interact with customers, building better quality products, or making data accessible to both internal and external customers that need it, by leveraging data to better understand how they intersect with your business, we can dramatically improve the overall customer experience" – argued Thomas Scholz.

Challenges of AI and ML

Marek Wiewiórka, Big Data
Architect, GetInData

Another challenge is how to run large-scale Big Data analytics projects efficiently. During BDTWS 2020, a panel of experts explored the challenges that companies face when running artificial intelligence (AI) and machine learning (ML) solutions. This is still an issue despite the spread of dedicated AI platforms, ready-to-use ML libraries, and tons of available data.

"I think that it's always about data. Because no matter how good tools you have, how good algorithms you can build upon it, if you cannot access all the data you need, if you cannot find the right data within your organization, you actually cannot do anything relevant" – said Marek Wiewiórka, Big Data Architect at GetInData.

"Data scientists spend like 50% of their time looking for the data. Not looking at the data, looking for the data" - added the panel host Marcin Choiński, Head of Big Data & Analytics Ecosystem at TVN.

Marcin Choiński, Head of Big Data
& Analytics Ecosystem, TVN

In many cases, a machine learning project starts with a large number of potential data sources. The first step is always to unify them by providing unified layers so that users can access the data in the same way. Only then is it possible to integrate and crunch the data, start prototyping, and build a machine learning model.
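The idea of a unified access layer can be illustrated with a minimal Python sketch. It is not taken from the panel: the `UnifiedCatalog` class, the source names, and the sample data are all hypothetical, and a real platform would sit on top of databases, object storage, or streaming systems rather than in-memory strings. The point is only that every source, whatever its format, is read through the same call and yields the same kind of record.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator

# All names below are hypothetical and exist only for illustration.
Record = Dict[str, str]
Reader = Callable[[], Iterator[Record]]


class UnifiedCatalog:
    """Registers heterogeneous sources behind a single read() interface."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def register(self, name: str, reader: Reader) -> None:
        self._readers[name] = reader

    def read(self, name: str) -> Iterator[Record]:
        # Whatever the underlying format, callers get plain dict records.
        return self._readers[name]()


def csv_source(text: str) -> Reader:
    """Wraps CSV text as a reader that yields one dict per row."""
    return lambda: iter(csv.DictReader(io.StringIO(text)))


def json_lines_source(text: str) -> Reader:
    """Wraps JSON Lines text as a reader that yields one dict per line."""
    def reader() -> Iterator[Record]:
        for line in text.splitlines():
            if line.strip():
                # Values are converted to strings to match the CSV records.
                yield {key: str(value) for key, value in json.loads(line).items()}
    return reader


catalog = UnifiedCatalog()
catalog.register("crm", csv_source("user_id,country\n1,PL\n2,SE\n"))
catalog.register("events", json_lines_source('{"user_id": 1, "event": "play"}\n'))

# Both sources are now accessed in exactly the same way.
crm_rows = list(catalog.read("crm"))
event_rows = list(catalog.read("events"))
```

Once sources share one interface, prototyping code does not need to know where a given data set lives or how it is stored.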

"The tools available on the market today provide easy access to data modeling. You don't have to be a Data Scientist to run sophisticated analysis. The greatest challenge is to find the right data and assess the quality of the data, not modeling itself because it can also, to some extent, be automatic" – summed up Marek Wiewiórka.

Paweł Zawistowski, Lead Data Scientist,
Adform, Assistant Professor,
Warsaw University of Technology

Once the model is built, machine learning experts have to be able to explain it. Today, many models are fairly easy to build but very difficult to explain. This is not sustainable, which is why explainability is becoming more and more critical. "We try to stay away from those models that we cannot completely explain, because from time to time we get support tickets. People ask us what happened there and we have to explain it. It is a real problem when we do not know how to debug the model," – said Paweł Zawistowski, Lead Data Scientist at Adform and Assistant Professor at Warsaw University of Technology.

"I guess one of the biggest challenges that we deal with, really the reason that our teams exist working on infrastructure, is because we have this huge problem with building out production scale models. In our original machine learning experience it was really the result of three or four engineers pushing forward and getting something out for production. We wanted to employ many systems at scale and to improve them, and iterate on them. But it is hard to find so many top engineers to keep that up" – said Josh Baer, Product Lead, Machine Learning Platform at Spotify.

Spotify has been using machine learning in its product for nearly a decade, but only in the last few years has it invested in building tools aimed at making the lives of internal ML practitioners easier and more productive.

Migrating to cloud

Spotify was just one of the brightest stars of today's global digital economy present at BDTWS 2020. The speakers shared their use cases, recommendations, tips, successes, and failures. One trend that was easy to see was the use of cloud platforms.

Spotify is an international media services provider headquartered in Stockholm. The company has around 3,500 employees and 271 mln users, close to half of whom pay for the service. As a consumer-facing organization, it mostly uses ML to optimize the user experience. But it has circa 70 unique machine learning use cases, ranging from improving ad targeting to maximizing the experience of some internal service.

Josh Baer, Product Lead,
Machine Learning Platform,
Spotify

At this scale, Spotify is almost 100% on the cloud. The company did a big migration in 2016 and 2017 to the Google Cloud Platform. "In our case, the cloud handles some of the work that we used to see as pretty interesting, but now we might see as boring. We would always have to think about where are we going to open up our next data center? How do we make sure that we're provisioning enough machines? Now we don't have to worry about that as much, we don't have to worry as much around managing our own databases. For example, we can use these services that cloud providers use and work with that," – said Josh Baer.

Fouad Alsayadi
Senior Data Engineer,
Truecaller

Another company that shared its journey to the cloud was Truecaller, a smartphone app and service for caller identification, call blocking, flash messaging, and call recording. Truecaller has 150 mln active users who generate 30 bln events a day. The company also migrated its on-premise data centers to Google Cloud Platform. "We needed to rethink our original on-premise data architecture. The cloud quickly became an option. We considered storage-computing decoupling, maintenance, and also cloud cost and offerings," – said Fouad Alsayadi, Senior Data Engineer at Truecaller. Mixing on-premise, hybrid, and native cloud technologies, Truecaller built a robust, self-service architecture that lowers costs and makes data scientists happier.

Our business is data

Konrad Wypchło
Senior Chapter Lead,
ING Tech Poland

Members of the ING Tech Poland team shared their story about new technology platforms and methodologies used in practice on a global scale, and about how open source technologies and modern machine learning methods are changing regulatory credit risk. The bank invested in a new platform to build models faster for a global organization under strict regulations.

"Models and data are a key strategic asset sustaining a competitive advantage. Better modelling drives a differentiating customer experience, more business and lower capital requirements, improved risk profile and higher efficiency. Credit risk is one of the biggest real business use cases for machine learning and especially at ING we embrace that" – said Konrad Wypchło, Senior Chapter Lead at ING Tech Poland.

It was also interesting to hear about the journey of Orange Polska from single use cases to an advanced ecosystem of data. It was a great example of implementing an environment for personalizing customer relationships in real time. "Omnichannel personalization in Orange Polska means 21 synced contact channels, one trusted data, one ecosystem, and a short time to market. We have 200 campaigns, more than 30 event triggers and 35 machine learning models. ROI is estimated 3 to 6 times better" – said Tomasz Burzyński, Business Insights Director at Orange.

Tomasz Burzyński
Business Insights
Director, Orange

Ketan Gupta, Product Leader at Booking.com, showed BDTWS 2020 attendees how to build products using data and machine learning. Having grown from a small startup into one of the world's leading travel companies, Booking.com has more than 28 mln listings and manages 1.5 mln bookings per day. "Technology advancements and huge amounts of data open up new possibilities. Data and machine learning help us find what to build next, from scratch, to solve users' problems with efficiency and scalability," – said Ketan Gupta.

Looking at the data, the company saw a spike in customer service tickets and a trend of users checking more properties than usual before booking. The analysis showed two sets of travelers facing different challenges: one group was unable to book the room it wanted because it was sold out, while the second group had booked that room but was thinking about changing it. Solving the supply problem is not just about having more supply to offer but also about creating the right match between what partners offer and what travelers need. Creating a perfect match gets tricky with a base of more than 100 mln travelers, which widens the variance in demand.

Ketan Gupta
Product Leader,
Booking.com

"To bridge the gap, you need a guiding light, and that's where data comes to your rescue. Based on historic reservations data, you can understand the traveller behaviour pattern based on from where they are travelling, to where, when do they book, how often they engage with their reservations and more. This helps form a proxy. But, you are missing out on the most relevant data set on which users like to upgrade their room and why, and that's the biggest challenge while building new products, we don't always have the right set of data. Key here is to launch a proxy and closest possible product and collect more feedback and data," – explained Ketan Gupta.

Having gathered more data, the company built a simple model that predicted which facilities would be most valuable for a specific traveler type. This helped it serve better upgrade options and set a baseline. Adding more data about price and traveler groups made it possible to provide higher-value recommendations to travelers. The result was higher conversion rates and more otherwise unsold rooms sold for hotel partners.
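A baseline of this kind can be very small. The Python sketch below is not Booking.com's model, whose details were not described. It assumes a hypothetical feedback log of (traveler type, facility chosen in an upgrade) pairs and simply ranks facilities by how often each traveler type picked them. The function names and sample data are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# Hypothetical feedback log: (traveler_type, facility chosen in an upgrade).
feedback: List[Tuple[str, str]] = [
    ("family", "extra_bed"),
    ("family", "breakfast"),
    ("family", "extra_bed"),
    ("business", "desk"),
    ("business", "breakfast"),
    ("business", "desk"),
    ("business", "desk"),
]


def fit_baseline(records: Iterable[Tuple[str, str]]) -> DefaultDict[str, Counter]:
    """Counts how often each traveler type chose each facility."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for traveler_type, facility in records:
        counts[traveler_type][facility] += 1
    return counts


def top_facilities(model: DefaultDict[str, Counter], traveler_type: str, k: int = 1) -> List[str]:
    """Returns the k facilities most often chosen by this traveler type."""
    return [facility for facility, _ in model[traveler_type].most_common(k)]


model = fit_baseline(feedback)
family_pick = top_facilities(model, "family")
business_picks = top_facilities(model, "business", k=2)
```

A frequency count like this gives a reference point: any richer model that adds price or traveler-group features has to beat it to justify the extra complexity.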

Getting bigger and bigger

Przemysław Gamdzyk
CEO & Meeting Designer,
Evention

"The conference is growing each year. We started six years ago with 200 people and we are now about 650 attendees, maybe more, because of the last-minute registrations. We have participants from all over the world, from East, West, North and South. Today it's a truly international event" – said Przemysław Gamdzyk, CEO & Meeting Designer at Evention, the organizer of the conference.

The presentations came from three distinct sources: eight came from directly invited speakers, experts carefully identified in the community, eight came from the sponsors and partners, and over half of all the talks came from a call for presentations.

Adam Kawa
CEO and Co-founder,
GetInData

"This is intentional because we want this conference to be open for the community, and we want to make sure that everyone who has a good story to share can speak at the conference. We received 73 submissions and we were very impressed by the number and especially the quality of the presentations. It was challenging to select the best presentations and, at the same time, reject many outstanding ones" - said Adam Kawa, CEO and Co-founder at GetInData, the co-organizer of the conference.

In the agenda, attendees could find many modern but also battle-proven Big Data technologies, like Kafka, Flink, Airflow, Elasticsearch, Google Cloud Platform, and Kubernetes. This is because the conference is focused on solutions that work in production. As every year, at BDTWS 2020 there were also a few presentations that introduced cool new technologies like Hudi and Amundsen, or rising stars like Snowflake, to show how they are challenging the status quo and even solving problems that no one has solved before.

Those proven and new technologies are used by speakers and their companies to build powerful Big Data platforms for batch and real-time processing and for deploying large-scale machine learning projects, both on-premise and in the cloud. That is why the presentations can be treated as lessons learned from real-world use cases. It is always better to learn from the experiences of others.

One year after: new era of the cloud

One of the main themes of last year's BDTWS edition was the Cloudera and Hortonworks merger and what would happen next. "The new Cloudera, a year after the merger, is a truly global company. We are present in almost a hundred countries. We have thousands of employees and customers with our big data platform based on open source," - said Marton Balassi, Manager, Streaming Analytics at Cloudera.

Marton Balassi
Manager, Streaming Analytics,
Cloudera

The new company was expected to reshape the enterprise data landscape by providing a unified, open-source, hybrid-cloud service offering to customers, and it delivered. The new Cloudera Data Platform combines the best of both worlds, Cloudera CDH and Hortonworks HDP, with new features and functionalities that leverage the cloud environment. It is the industry's first enterprise data cloud, delivering a comprehensive set of multi-function analytic experiences for any data, anywhere, with collective security, governance, and control.

"The Cloudera secret sauce is in the governance, the security, a bit boring bits around the really cool open-source part. But that's really the unique opportunity of bridging that gap between open source and the enterprise," - said Marton Balassi.

Cloudera was on stage with its Polish partner, 3Soft S.A. Together they demonstrated the new Cloudera solutions for a hybrid cloud environment, talked about the integration of Apache Flink into the Cloudera Data Platform, and showed how some real-life challenges were solved, based on use cases from the Polish market.