A decade ago, only a few companies ran their Big Data infrastructure and pipelines in the public cloud (Netflix was one of them). At that time, the most popular way to build Big Data solutions was to use on-premise infrastructure and an ecosystem of open-source components. In 2012-2013, we even saw companies that tried public cloud solutions but quickly returned to building Big Data infrastructure in their own data centers. The reasons were primarily high costs, issues with elasticity, and service unavailability.
A year is definitely long enough to see new trends or technologies gain traction. The Big Data landscape changes increasingly fast thanks to a lot of innovation, competition, and the use of technologies that have now become critical to almost all companies on this planet. Let’s look at the 5 current trends that selected presentations will describe in detail at the upcoming edition of Big Data Tech Warsaw 2021 (February 25-26th).
How do you create an extensible Big Data platform for advanced analytics? How do you scale a data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes? Reza Shiftehfar, leader of the Hadoop Platform teams at Uber, told the story of building and growing Uber’s reliable and scalable Big Data platform at Big Data Tech Warsaw Summit 2020. The platform serves petabytes of data in real time using technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto.
What is your role at Uber?
I manage the new platform team at Uber. That’s the team that provides Big Data as a service to the rest of the company. I’ve been with Uber since 2014. Back then we were a small start-up, and our data was small enough to fit in a traditional Postgres database on a single box. Over the past seven years, our data has grown from a few terabytes to hundreds of petabytes, so we had to build a horizontally scalable infrastructure.
What was the growth of the company at that time, and how did it affect the data infrastructure?
It took us almost six years to complete 1 billion trips on the platform. But when we did that, we did the second 1 billion in six months. We did another one in five months, and another one in four months. If you look at this from an infrastructure perspective, you need to basically start planning for your next architecture because this growth is exponential, and scaling the infrastructure is not that easy.
Let’s look at the range of the products that we provide. We started in 2010 by providing just trips. It’s a pretty straightforward product. You have one driver, one rider, and a trip is happening at the same time. Then we launched UberPool. It’s a much better experience for the user. It’s cheaper, but it’s much more complicated: you have one driver, two riders, and two trips that are overlapping with each other.
Then we had Uber Eats. It’s even more complicated. We have three parties involved: a person that placed the order, a restaurant partner, and you have someone to make a pick-up and deliver. It requires two trips: one to get to the restaurant, pick up the food, and the other to deliver the order. Then we had Uber Freight – even more parties involved. So the maturity and complexity of the products are increasing. The amount of data generated is also increasing. You have to have a pretty reliable data infrastructure to be able to support an even better experience for the users.
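To put the trip milestones quoted in this answer into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that "almost six years" can be rounded to 72 months; the constant and function names are illustrative only and do not come from Uber.

```python
# Months taken to complete each successive 1 billion trips, as quoted in the
# interview. Assumption: "almost six years" is rounded up to 72 months.
MONTHS_PER_BILLION = [72, 6, 5, 4]
TRIPS_PER_MILESTONE = 1_000_000_000


def average_monthly_trips(months: int) -> float:
    """Average number of trips per month needed to finish one milestone."""
    return TRIPS_PER_MILESTONE / months


rates = [average_monthly_trips(months) for months in MONTHS_PER_BILLION]

for index, (months, rate) in enumerate(zip(MONTHS_PER_BILLION, rates), start=1):
    print(f"billion #{index}: {months:>2} months, ~{rate / 1e6:.0f}M trips/month")

# How much faster the latest milestone was reached compared with the first one.
speedup = rates[-1] / rates[0]
print(f"pace of the fourth billion vs. the first: ~{speedup:.0f}x")
```

Under that assumption, the average pace rises from roughly 14 million to 250 million trips per month, about an 18-fold increase, which is why the infrastructure had to be planned one architecture ahead.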
How is data used at Uber?
Data is at the heart of our company. We use the previous data, the experience of the previous users to create a much better and much smoother experience for the next people taking the ride. Internally looking at the data users in the company, from an infrastructure perspective, we have three main categories of users.
We have thousands of city operators. These are on-the-ground crews in every city that manage and scale our transportation network in that market. This is like a small start-up in Uber. Their goal is to increase the number of trips in that city. They are operational teams. They do not have a technical background, and they do not write code, only some simple SQL statements. Technical skills are low, but dependency on the data is high.
We also have hundreds of data scientists and analysts. They look back at the previous data, and they try out different ideas, different products. They do have technical skills. They can write queries, they can write code, and they understand the data content. They use a large amount of historical data.
We also have hundreds of engineers across the company who rely on data. They build services that give, for example, the estimated delivery time for food. They have high coding skills, and their dependence on data is pretty high.
What is Uber’s scale in terms of numbers?
In terms of some numbers, we have over 2 trillion unique events generated every day, over 20,000 physical hosts in each of our data centers, hundreds of petabytes of data in HDFS clusters.
We generate over 200 terabytes of new data per day, and several hundred thousand queries run on this data every day. At any given time, we have over 2,000 unique dashboards that different parts of the company use to see how the business is operating.
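As a rough illustration of what these daily figures mean per second, the following Python sketch converts them into sustained rates. The figure of 300,000 queries is an assumption standing in for "several hundred thousand", decimal units are used (1 TB = 1,000 GB), and real traffic is unlikely to be spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

EVENTS_PER_DAY = 2 * 10**12     # "over 2 trillion unique events"
NEW_DATA_TB_PER_DAY = 200       # "over 200 terabytes of new data"
QUERIES_PER_DAY = 300_000       # assumption for "several hundred thousand"

# Average sustained rates, assuming a perfectly even load across the day.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
gb_per_second = NEW_DATA_TB_PER_DAY * 1000 / SECONDS_PER_DAY
queries_per_second = QUERIES_PER_DAY / SECONDS_PER_DAY

print(f"events:   ~{events_per_second / 1e6:.1f} million per second")
print(f"new data: ~{gb_per_second:.2f} GB per second")
print(f"queries:  ~{queries_per_second:.1f} per second")
```

Even as averages, these numbers come to more than 23 million events and over 2 GB of new data every second, which helps explain why horizontal scalability is a hard requirement at this scale.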
What was your philosophy when building an infrastructure platform for petabyte-scale?
We knew that no single company could build everything from scratch. There are very few companies in the world that are operating at this scale with this amount of data. The best way to build a reliable platform is to cooperate with them and rely on the open source community to build such a platform. We are heavy users of very popular Big Data platform open source products. We are heavy users of Spark, Hive, Presto, HDFS, Parquet.
At the same time, many of these products do not work out of the box at such a scale. When we start using them, we start to see all these limitations. So we try to re-architect and redesign those components so that they can scale. And again, we try our best to contribute back to the open-source community. Good examples are the observer nodes in HDFS and geospatial support in Presto.
Finally, from time to time, we see that there’s a gap in the open-source community: there’s no single product that addresses our need, or there are way too many products but none for our scale. That is when we start our own project from scratch. We try our best to open-source those as well. A few examples from my team are Apache Hudi and Marmaray.
We all know it: data is the new oil, and we live in a data age. We all understand the value of data in optimizing the business and pursuing new opportunities. The question is how to use it efficiently to become more successful and more agile, how to become a data-driven organization? The best way to look for the answers is Big Data Tech Warsaw Summit, an independent annual conference focusing on data science and engineering, AI and machine learning, data strategies and monetization, DevOps and the cloud.
Data truly is today's most valuable resource. Every company in the world wants to analyze data to improve its internal processes and enhance the way it works with customers or collaborates with suppliers and partners. But this is not trivial, and companies need to face a few challenges to be successful. It requires the right skill set, which is hard to find, and appropriate technology in an increasingly sophisticated technology space.
Thomas Scholz, Sales Engineering Manager for EMEA, Snowflake
"There are a few challenges that companies have to deal with to be really successful today. Issue number one is definitely the growth of the data and staying on top of this growth. Data expanded exponentially over the last few years. It looks like it will continue to grow exponentially over the next few years. It's a real challenge for everyone to deal with the sheer amount of data before trying to understand it," said Thomas Scholz, Sales Engineering Manager for EMEA at Snowflake, talking about the challenges of modern analytics during the opening keynote of Big Data Tech Warsaw Summit (BDTWS) 2020.
He argued that companies also need to break down rigid and centralized legacy architectures. Apart from the size and the large number of data sources, today's distributed nature of data requires agile and data-centric architectures that evolve with the business environment and allow real-time access, real-time ingestion, and real-time results. This is the only way to achieve the needed insights and the required acceleration.
Issue number two is complexity. It is tough for companies to find the right technology. They need expert guidance because today they spend too much time managing different bits and pieces, tuning, updating, and rebooting. The third challenge is data diversity: different types of data, structured and semi-structured, and also different data silos. The business wants to understand the customer deeply, so it needs to harness and analyze all available sources, internal and external, open and commercially available. It is also tough to secure the data and protect it from malware, so issue number four is security and access. And then there is challenge number five: costs and the cost of failure.
"Second, the ease of use and flexibility that come with working with data in a new way allow us to reduce the overall cost of how we scale data management and analytics. Third, we can eliminate excessive costs and focus on delivering a great customer experience through data. Whether it's improving how we interact with customers, building better quality products, or making data accessible to both internal and external customers that need it, by leveraging data to better understand how they intersect with your business, we can dramatically improve the overall customer experience" – argued Thomas Scholz.
Challenges of AI and ML
Marek Wiewiórka, Big Data Architect, GetInData
Another challenge is how to run large-scale Big Data analytics projects efficiently. During BDTWS 2020, a panel of experts explored the challenges that companies face when running artificial intelligence (AI) and machine learning (ML) solutions. It is still an issue despite the spread of dedicated AI platforms, ready-to-use ML libraries, and tons of available data.
"I think that it's always about data. Because no matter how good tools you have, how good algorithms you can build upon it, if you cannot access all the data you need, if you cannot find the right data within your organization, you actually cannot do anything relevant" – said Marek Wiewiórka, Big Data Architect at GetInData.
"Data scientists spend like 50% of their time looking for the data, not looking at the data" – added the panel host Marcin Choiński, Head of Big Data & Analytics Ecosystem at TVN.
Marcin Choiński, Head of Big Data & Analytics Ecosystem, TVN
In many cases, a machine learning project starts with a large number of potential data sources. The first step is always to unify them and provide a unified layer so that users can access the data in the same way. Only then is it possible to integrate, start prototyping, crunch the data, and build a machine learning model.
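The idea of a unified access layer can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a description of any panelist's system: the `DataSource` protocol, the two adapters, the `Catalog` class, and the sample data are all hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterator, List, Protocol

Record = Dict[str, str]


class DataSource(Protocol):
    """Hypothetical common interface that every source adapter implements."""

    def read(self) -> Iterator[Record]: ...


class CsvSource:
    """Adapter exposing CSV text as a stream of records."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self._text)):
            yield dict(row)


class JsonLinesSource:
    """Adapter exposing JSON Lines text as a stream of records."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self) -> Iterator[Record]:
        for line in self._text.splitlines():
            if line.strip():
                parsed = json.loads(line)
                # Normalize values to strings so both adapters return the same shape.
                yield {key: str(value) for key, value in parsed.items()}


class Catalog:
    """Single entry point: users ask for a dataset by name, not by format."""

    def __init__(self) -> None:
        self._sources: Dict[str, DataSource] = {}

    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source

    def read(self, name: str) -> List[Record]:
        return list(self._sources[name].read())


catalog = Catalog()
catalog.register("crm", CsvSource("user_id,country\n1,PL\n2,DE\n"))
catalog.register(
    "clicks",
    JsonLinesSource('{"user_id": 1, "page": "home"}\n{"user_id": 2, "page": "cart"}\n'),
)

for dataset in ("crm", "clicks"):
    print(dataset, catalog.read(dataset))
```

Because both adapters return the same record shape, prototyping code only ever talks to `Catalog.read`, which is the property the unified layer is meant to provide.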
"The tools available on the market today provide easy access to data modeling. You don't have to be a Data Scientist to run sophisticated analysis. The greatest challenge is to find the right data and assess the quality of the data, not modeling itself because it can also, to some extent, be automatic" – summed up Marek Wiewiórka.
Paweł Zawistowski, Lead Data Scientist, Adform, Assistant Professor, Warsaw University of Technology
Once the model is built, machine learning experts have to be able to explain it. Today a lot of models are fairly easy to build but very difficult to explain. This is not sustainable, which is why explainability is becoming more and more critical. "We try to stay away from models that we cannot completely explain, because from time to time we get support tickets. People ask us what happened there and we have to explain it. It is a real problem when we do not know how to debug the model," – said Paweł Zawistowski, Lead Data Scientist at Adform and Assistant Professor at Warsaw University of Technology.
"I guess one of the biggest challenges that we deal with, really the reason that our teams exist working on infrastructure, is because we have this huge problem with building out production scale models. In our original machine learning experience it was really the result of three or four engineers pushing forward and getting something out for production. We wanted to employ many systems at scale and to improve them, and iterate on them. But it is hard to find so many top engineers to keep that up" – said Josh Baer, Product Lead, Machine Learning Platform at Spotify.
Spotify has been using machine learning in its product for nearly a decade, but only in the last few years has it invested in building tools aimed at making the lives of internal ML practitioners easier and more productive.
Migrating to cloud
Spotify was just one of the brightest stars of today's global digital economy present at BDTWS 2020. The speakers shared their use cases, recommendations, tips, successes, and failures. One trend that was easy to see was the use of cloud platforms.
Spotify is an international media services provider headquartered in Stockholm. The company has around 3500 employees and 271 mln users, close to half of whom pay for the service. As a consumer-facing organization, it mostly uses ML to optimize the user experience. It has around 70 distinct machine learning use cases, ranging from improving ad targeting to maximizing the experience of some internal service.
Josh Baer, Product Lead, Machine Learning Platform, Spotify
At this scale, Spotify is almost 100% on the cloud. The company did a big migration in 2016 and 2017 to the Google Cloud Platform. "In our case, the cloud handles some of the work that we used to see as pretty interesting, but now we might see as boring. We would always have to think about where are we going to open up our next data center? How do we make sure that we're provisioning enough machines? Now we don't have to worry about that as much, we don't have to worry as much around managing our own databases. For example, we can use these services that cloud providers use and work with that," – said Josh Baer.
Fouad Alsayadi, Senior Data Engineer, Truecaller
Another company that shared its journey to the cloud was Truecaller, a smartphone app and service for caller identification, call blocking, flash messaging, and call recording. Truecaller has 150 mln active users who generate 30 bln events a day. The company also migrated its on-premise data centers to Google Cloud Platform. "We needed to rethink our original on-premise data architecture. The cloud quickly became an option. We considered storage-computing decoupling, maintenance, and also cloud cost and offerings," – said Fouad Alsayadi, Senior Data Engineer at Truecaller. Mixing on-premise, hybrid, and native cloud technologies, Truecaller built a robust, self-service architecture that lowers costs and makes data scientists happier.
Our business is data
Konrad Wypchło, Senior Chapter Lead, ING Tech Poland
Members of the ING Tech Poland team shared their story about new technology platforms and methodologies used in practice on a global scale, and about how open-source technologies and modern machine learning methods are changing regulatory credit risk. The bank invested in a new platform to build models faster for a global organization under strict regulations.
"Models and data are a key strategic asset sustaining a competitive advantage. Better modelling drives a differentiating customer experience, more business and lower capital requirements, improved risk profile and higher efficiency. Credit risk is one of the biggest real business use cases for machine learning, and especially at ING we embrace that" – said Konrad Wypchło, Senior Chapter Lead at ING Tech Poland.
It was also interesting to hear about the journey of Orange Polska from single use cases to an advanced data ecosystem. It was a great example of implementing an environment for personalizing customer relationships in real time. "Omnichannel personalization in Orange Polska means 21 synced contact channels, one trusted data, one ecosystem, and a short time to market. We have 200 campaigns, more than 30 event triggers and 35 machine learning models. ROI is estimated 3 to 6 times better" – said Tomasz Burzyński, Business Insights Director at Orange.
Ketan Gupta, Product Leader at Booking.com, showed BDTWS 2020 attendees how to build products using data and machine learning. Having grown from a small startup into one of the world's leading travel companies, Booking.com has more than 28 mln listings and manages 1.5 mln bookings per day. "Technology advancements and huge amounts of data open up new possibilities. Data and machine learning help us find what to build next, from scratch, to solve users' problems with efficiency and scalability," – said Ketan Gupta.
Looking at the data, the company saw a spike in customer service tickets and a trend of users checking more properties than usual before booking. The analysis showed two sets of travelers facing different challenges: one group was not able to book a room because it was sold out, and a second group had booked the room the first group wanted but was thinking about changing it. Solving the supply problem is not just about having more supply to offer but also about creating the right match between what partners offer and what travelers need. Creating a perfect match gets tricky with a base of more than 100 mln travelers, which widens the variance in demand.
"To bridge the gap, you need a guiding light, and that's where data comes to your rescue. Based on historic reservations data, you can understand the traveller behaviour pattern based on where they are travelling from, where to, when they book, how often they engage with their reservations and more. This helps form a proxy. But you are missing out on the most relevant data set, on which users like to upgrade their room and why, and that's the biggest challenge while building new products: we don't always have the right set of data. The key here is to launch a proxy, the closest possible product, and collect more feedback and data," – explained Ketan Gupta.
Having gathered more data, the company built a simple model that predicted which facilities would be the most valuable for a specific traveler type. This helped it serve better upgrade options. This simple model set a baseline. Adding more data about price and traveler groups made it possible to provide higher-value recommendations to travelers. The result was higher conversion rates and more previously unsold rooms sold for hotel partners.
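A "simple model that sets a baseline" of this kind could be as small as a frequency count. The Python sketch below is an assumption about what such a baseline might look like, not Booking.com's actual model; the traveler types, facilities, and history records are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical history of (traveler_type, facility chosen when upgrading).
HISTORY = [
    ("family", "breakfast"),
    ("family", "extra bed"),
    ("family", "breakfast"),
    ("business", "late checkout"),
    ("business", "fast wifi"),
    ("business", "late checkout"),
    ("family", "pool"),
]


def fit_baseline(history: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each facility was chosen by each traveler type."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for traveler_type, facility in history:
        counts[traveler_type][facility] += 1
    return dict(counts)


def top_facilities(model: Dict[str, Counter], traveler_type: str, k: int = 1) -> List[str]:
    """Return the k most frequently chosen facilities for a traveler type."""
    ranked = model.get(traveler_type, Counter()).most_common(k)
    return [facility for facility, _ in ranked]


model = fit_baseline(HISTORY)
print("family:", top_facilities(model, "family"))
print("business:", top_facilities(model, "business"))
```

A count-based baseline like this gives later models, for example ones that also use price or traveler group features, something concrete to beat.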
Getting bigger and bigger
Przemysław Gamdzyk, CEO & Meeting Designer, Evention
"The conference is growing each year. We started six years ago with 200 people, and we now have about 650 attendees, maybe more because of last-minute registrations. We have participants from all over the world, from East, West, North and South. Today it's a truly international event" – said Przemysław Gamdzyk, CEO & Meeting Designer at Evention, the organizer of the conference.
The presentations came from three distinct sources: eight came from directly invited speakers, experts carefully identified in the community, eight came from the sponsors and partners, and over half of all the talks came from a call for presentation process.
Adam Kawa, CEO and Co-founder, GetInData
"This is intentional because we want this conference to be open for the community, and we want to make sure that everyone who has a good story to share can speak at the conference. We received 73 submissions and we were very impressed by the number and especially the quality of the presentations. It was challenging to select the best presentations and, at the same time, reject many outstanding ones" – said Adam Kawa, CEO and Co-founder at GetInData, the co-organizer of the conference.
On the agenda, attendees could find many modern but also battle-proven Big Data technologies, like Kafka, Flink, Airflow, Elasticsearch, Google Cloud Platform, and Kubernetes. That is because the conference focuses on solutions that work in production. As every year, BDTWS 2020 also featured a few presentations introducing cool new technologies like Hudi and Amundsen, or rising stars like Snowflake, to show how they are challenging the status quo and even solving problems that no one has solved before.
Those proven and new technologies are used by speakers and their companies to build powerful Big Data platforms for batch, for real-time processing, for deploying large scale machine learning projects, on-premise, and in the cloud. That is why the presentations can be treated as lessons learned from real-world use cases. It is always better to learn from the experiences of others.
One year after: new era of the cloud
One of the main themes of last year's BDTWS edition was the Cloudera and Hortonworks merger and what would happen next. "The new Cloudera, a year after the merger, is a truly global company. We are present in almost a hundred countries. We have thousands of employees and customers with our big data platform based on open source," – said Marton Balassi, Manager, Streaming Analytics at Cloudera.
Marton Balassi, Manager, Streaming Analytics, Cloudera
The new company was expected to reshape the enterprise data landscape by providing a unified, open-source, hybrid-cloud service offering to customers, and it delivered. The new Cloudera Data Platform combines the best of both worlds, Cloudera CDH and Hortonworks HDP, with new features and functionalities that leverage the cloud environment. It is billed as the industry's first enterprise data cloud, delivering a comprehensive set of multi-function analytic experiences on any data, anywhere, with collective security, governance, and control.
"The Cloudera secret sauce is in the governance, the security, a bit boring bits around the really cool open-source part. But that's really the unique opportunity of bridging that gap between open source and the enterprise," - said Marton Balassi.
Cloudera was on stage with its Polish partner, 3Soft S.A. Together they demonstrated the new Cloudera solutions for a hybrid cloud environment, talked about the integration of Apache Flink into the Cloudera Data Platform, and showed how to solve some real-life challenges based on use cases from the Polish market.
Yuan Jiang is a senior staff engineer leading the storage engine team for the Interactive Analytics product at Alibaba. At Big Data Tech Warsaw Summit 2020, he told the story of a large-scale real-time data warehouse product developed in-house. He talked about solutions for a real-time data warehouse, its architecture, and typical scenarios.
What is the Interactive Analytics service, and is it available to the general public?
It is a sub-second real-time data warehouse. It offers the ability to analyze a massive amount of data interactively, and it is fully compatible with PostgreSQL. We combined large-scale computational storage (low cost, high performance, and high availability) with high-performance ingestion and querying that offer low latency, high throughput, and high concurrency.
Interactive Analytics is widely used inside Alibaba. It is adopted internally by Search, Recommendation, and Ads products and also is available in AliCloud. Lots of customers use it as a private cloud. The public cloud is in the beta stage. It is available in Chinese. We are actually in the process of translating it to English, and we plan general availability in a couple of months.
What are the primary technology highlights of Interactive Analytics?
The goal was to build large-scale computational storage that is low-cost for the user. It is serverless, so the user doesn’t have to worry about buying servers and the standard set of things. We take care of that, and it’s low-cost. It’s cloud-native, so we can separate compute from storage and also unify the storage for both stream processing and batch processing. On the engineering side, we built a native C++ execution engine and query optimizer, and also a storage engine.
Why did the company decide to build this service from the ground up?
We were looking for a product for simple queries, but there was nothing on the market. First, at Alibaba, there were internal talks about some existing computational storage systems, mainly open-source products. We took a close look at the Apache databases HBase and Cassandra. They are suitable for highly concurrent, simple queries. They are also called NoSQL databases, but today they often have SQL layers: HBase has Apache Phoenix, and Cassandra has CQL. Internally you have a straightforward interface, so they are very suitable for simple queries. We also looked at products like Druid and Apache Kudu. Actually, Druid is the system that we used inside Alibaba before we built Interactive Analytics. Druid and Kudu are not suitable for simple queries; they are best for complex big scans. And we wanted to create a system that supports both scenarios.
But you also decided to use some existing systems for this product?
Before we built the system, we had to make a decision: should we build our own client system, so that the user has to write new code to deal with it? We decided that adopting a new system is not good for the user. It’s actually really hard, so we looked at the existing systems. We found that we can just leverage PostgreSQL to do that. It’s already widely used, and there are lots of tools: a command line, clients, and Tableau. That speeds up adoption.
What was the main challenge for the Interactive Analytics team?
Alibaba is the biggest e-commerce company in China. It has close to 1 bln customers. So every day we have a lot of data coming. We just needed extreme performance. So in every component in our system, we want to get the best performance. That is why we support row-oriented and column-oriented storage.
We have vectorization in query execution and high concurrency, and we use compute resources efficiently.
We use C++ to ensure stable low latency, we use cost-based query optimization, and we leverage the characteristics of the storage. We also have a highly efficient resource management and scheduling service.
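The reason for supporting both row-oriented and column-oriented storage can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the Interactive Analytics storage engine: the table, the field names, and the helper functions are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Row-oriented layout: each record is stored together.
ROWS: List[Row] = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.5},
    {"order_id": 3, "customer": "a", "amount": 4.5},
]


def to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Column-oriented layout: each field is stored as one contiguous list."""
    columns: Dict[str, List[Any]] = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def point_lookup(index: Dict[int, Row], order_id: int) -> Row:
    """Simple query: fetch one whole record by key (favors the row layout)."""
    return index[order_id]


def total_amount(columns: Dict[str, List[Any]]) -> float:
    """Scan query: aggregate one field over all records (favors the column layout)."""
    return sum(columns["amount"])


row_index = {row["order_id"]: row for row in ROWS}
columns = to_columns(ROWS)

print(point_lookup(row_index, 2))
print(total_amount(columns))
```

The point lookup touches one complete record, while the aggregation reads a single column and ignores the rest, which is the usual argument for keeping both layouts when a system has to serve simple queries and big scans alike.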
What are the typical business scenarios for the service?
The first typical business scenario is online processing, such as real-time A/B testing. We start with the activity logs of user clicks, which go through the internal DataHub. Then the data goes through real-time compute; inside Alibaba, we use an enhanced version of Flink. It offers performance ten times higher than open-source Apache Flink, and five to ten times higher than Apache Spark on some performance metrics. Then the data goes to Interactive Analytics, where it can immediately be used for real-time reporting and also for customer computer models to analyze user behavior.
The second business scenario is offline processing acceleration. We have MaxCompute, Alibaba’s version of the MapReduce system. It offers more performance and more scale than the existing open-source version of Apache Hive. Users store their data in MaxCompute and run MapReduce jobs to get results. However, those jobs have high latency. So if users want a quick response, we can ingest part of the MaxCompute data, and then they can use other tools like Tableau, SmartBI, Fine Report, Quick BI, or Data Service to work with it immediately.
The third scenario is when we want to combine data from offline processing and stream processing in one place so that it can immediately be used in dashboards, real-time reporting, and online applications.
So, in summary, the scenarios we try to support are dashboards, real-time BI reporting, user profiling, and monitoring and alerting. The system we built is already used inside Alibaba for e-commerce, IoT, and the financial department. It is also used outside our organization. We sold it to some customers as a private cloud for use in the area of Public Safety.
What are the main challenges of building an end-to-end data integration platform at petabyte scale?
Max Schultze [MS]: At Zalando, building a Data Lake was not the very first thing the company had in mind. As the company grew from a startup to the thousands of employees it has now, the technical landscape grew organically, and so did the data landscape. When it became clear that classical analytical processing was no longer sufficient to sustain the company’s growth, not to mention future-oriented use cases like machine learning, the first team of data engineers was founded. Soon after, the very first challenge on the way to a data integration platform became apparent, and it was not even a technical one: bring order to the chaos.
Understanding your data landscape and coming up with a proper and company-wide accepted set of rules for data responsibility and data hygiene is a gigantic challenge but will propel you forward drastically when mastered. Setting priorities properly will decide between success and failure of your project.
Moving forward three years into the project, many things have been built. As obvious as that sounds, one of the biggest challenges is to scale what you have. In a growing company, the amount of data produced is constantly growing, too. Over time that will put stress on every part of your infrastructure. Your management database will run full, your self-built data ingestion pipeline will start failing more often, and you will even discover limitations in the open-source systems you are using. Be ready to throw away what you know, and stay open to new ideas and new technologies developed by others.
What are the best practices?
MS: Be aware that you will always have fewer people than you have work to do. To keep moving forward, you have to be very efficient in how you build and maintain your infrastructure. One of the best practices we follow very closely is to automate operations as much as possible. Build self-healing systems to minimize manual interventions. Whenever you observe manual tasks being executed too often, which might even be the manual execution of already automated steps, keep automating.
To build a successful Data Lake, keep your users close. Understand what matters and focus on the biggest needs and pain points. There will always be plenty of feature requests and feedback. Directly engage with your users on a regular basis. You can identify needs such as centralization of processing capacities, or pain points, like your ad-hoc analytical system not performing well enough, or your integration pipelines being late too often. Sometimes this can lead to easy fixes, sometimes it will result in fundamental shifts in your thinking and the company-wide analytical setup.
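The advice above to build self-healing systems can be sketched as a small retry-and-repair wrapper in Python. This is a minimal illustration under stated assumptions, not Zalando's tooling: `run_with_self_healing`, `FlakyIngestion`, and the repair hook are hypothetical names, and a real system would add logging, alerting, and more specific error handling.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_self_healing(
    task: Callable[[], T],
    repair: Callable[[], None],
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
) -> T:
    """Run a task; on failure, run an automated repair step and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                # Automation is exhausted: escalate to a human.
                raise
            repair()
            # Exponential backoff between attempts: 1x, 2x, 4x, ...
            time.sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FlakyIngestion:
    """Simulated ingestion step that fails a fixed number of times."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.repairs = 0

    def run(self) -> str:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated transient failure")
        return "ingested"

    def repair(self) -> None:
        # Placeholder for an automated fix, such as restarting a worker.
        self.repairs += 1


job = FlakyIngestion(failures_before_success=2)
result = run_with_self_healing(job.run, job.repair, base_delay_seconds=0.0)
print(result, "after", job.repairs, "automated repairs")
```

The design choice worth noting is that a person is only involved once the automated attempts are exhausted, which matches the goal of minimizing manual interventions.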
What technology tools would you recommend for this task and why?
MS: Leverage the offerings of cloud providers and do less yourself. Serverless is the next paradigm shift, where you focus much more on the “what” than the “how”. No longer will you be able to understand every execution step of your backend, but no longer do you have to worry about system maintenance yourself. Buying into these services gives you strong guarantees from the cloud providers, which will lead to many more quiet nights. Our first entirely serverless pipeline went to production in summer 2018, and we are yet to have an infrastructure-related incident.
Embrace infrastructure as code. Infrastructural setup usually involves a lot of resources put together, dependencies across them, and parameters being provided and tuned. The more of that you take care of by hand, the higher the risk of human error. Recently we started adopting the AWS Cloud Development Kit, which is an open-source framework for defining cloud infrastructure as code. It is a tremendous help in defining resources with only the parameters that really matter to you, thanks to smart defaults. Additionally, coding your infrastructure in an IDE informs you about errors while you type them and not after 10 minutes of compilation and deployment.
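The two benefits mentioned here, smart defaults and early error detection, can be illustrated without any cloud account. The Python sketch below deliberately does not use the AWS Cloud Development Kit API; `BucketSpec` and `render` are hypothetical names chosen only to show the idea of declaring resources as typed code.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class BucketSpec:
    """Hypothetical storage resource; only `name` has to be provided."""

    name: str
    versioned: bool = True     # smart default
    encrypted: bool = True     # smart default
    retention_days: int = 30   # smart default

    def __post_init__(self) -> None:
        # Errors surface when the definition is created, not at deployment.
        if self.retention_days <= 0:
            raise ValueError("retention_days must be positive")


def render(specs: Iterable[BucketSpec]) -> str:
    """Turn the declared resources into a deployable JSON document."""
    resources = {spec.name: asdict(spec) for spec in specs}
    return json.dumps({"resources": resources}, indent=2, sort_keys=True)


TEMPLATE = render(
    [
        BucketSpec("raw-events"),              # relies entirely on defaults
        BucketSpec("tmp", retention_days=7),   # overrides only what matters
    ]
)
print(TEMPLATE)
```

Each resource states only the parameters that differ from the defaults, and an invalid value such as `retention_days=0` fails immediately in the editor or test run rather than during a deployment.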
People, processes, and tools
A data scientist talks about working with data engineers. An interview with Mateusz Fedoryszak of Twitter.
What is the secret of successful collaboration between a data scientist and a data engineer?
Respect and humility. When you work with people whose skills complement yours, it is easy to think: we solve the real problems, and a high-school student could do their tasks. You often do not realize why deploying a small service or drawing a simple chart can be a challenge. On the other hand, even people who do not fully understand your field can offer valuable suggestions and feedback.
Is a strict division of responsibilities the solution? Sometimes part of the problem is forcing data scientists to do engineering tasks. The reverse probably happens less often?
We had the opposite problem: it seemed that both teams wanted to do everything. Talented people often struggle to hand tasks over to others. As a data scientist, I can probably do engineering tasks, but it will not be efficient. A specialist will do it better.
How do you divide tasks optimally?
One important aspect is making sure that the ratio of interesting to boring tasks is similar for everyone. Everyone contributes to the creative process, but everyone also does their share of the tedious work.
Can you give examples of such interesting and boring tasks?
Designing an architecture or developing a new algorithm or model is creative. Hunting for memory leaks and cleaning data is boring.
Are technical tools part of the solution?
Tools are important, but processes and mutual understanding are even more important. It always surprises me how important the human aspect of my work is.
How were these processes designed at the company where you currently work?
When we started working with an engineering team from another department, we wrote down exactly how much time each team intended to devote to the project and what it would work on. It is a simple thing, but it allowed us to align expectations. The purely human things matter too. Since one team works in London and the other in New York, it is important to meet in person once in a while. Video conferences are often not enough.
Interview by Rafał Jakubowski.
The interview comes from the publication “Raport Rynek Pracy – od BigData i AI do BI”.
Can data science significantly generate medical and business value at a non-IT company like Roche?
Mohammadjavad Faraji [MF]: Definitely yes! The combined strengths of our pharmaceutical and diagnostic businesses under one roof have already made Roche the leader in personalised healthcare (PHC), offering comprehensive diagnostics and targeted therapies for people with cancer and other severe diseases. Digitalisation in healthcare now also brings the ability to understand and interpret unprecedented volumes of data, which allows a higher-resolution view of each individual patient than ever before. We are committed to delivering on this opportunity and are drawing on our unique combination of strengths to drive this transformation. Our expertise in medicine, biology, diagnostics, and data science, our world-leading companies such as Flatiron and Foundation Medicine, our partnerships, and our global reach will all contribute to that journey. Our aim is to transform our drug development, diagnostics, and care delivery so that we can deliver value to patients and the entire healthcare system.
In which area of the product value chain do you see the biggest potential to apply data science, and why?
[MF]: The transformative effects of using insights from data will have a positive impact along the entire value chain, translating into benefits for the whole healthcare ecosystem. Our focus spans from early science to product approval, to manufacturing, to the very end of the chain, where we provide support for our products. It is difficult to say in which area data science can have the bigger impact, because I see the entire value chain as one big area where data science is spread across complementary components. Its objectives include validation of scientific hypotheses and deeper scientific insights; better, earlier go/no-go decisions in R&D; faster, more efficient trials; enhanced matching of patients and therapies; increased access to therapies; and effective maintenance of analyser instruments.
What are the main challenges and opportunities of being a data scientist in a corporate world?
[MF]: I see great opportunities for data scientists to work in many different data science related areas within Roche, so there is plenty of choice. From a technical standpoint, almost every component of the data science skill set is currently being used at Roche to address real challenges, including deep learning, natural language processing, predictive maintenance, etc. Delivering valuable insights by combining those data science skills with domain knowledge, which is acquired while working on different projects at Roche, is super exciting for any data scientist who is passionate about “doing now what patients need next”. Moreover, there is a very good sense of collaboration at Roche. As an example, we have an annual data science challenge in which all data scientists at Roche across the globe work on one problem, exchange ideas, and develop their skills. Every challenge that a data scientist may face while working on an initiative is also an opportunity to learn something new. Without those challenges,
Why can Apache Flink be considered the best choice for stream processing?
Dawid Wysakowicz [DW]: One reason is that it addresses all streaming use cases: bounded streams (aka batch), streaming analytics, event-driven applications, etc. It also has best-in-class support for state management and event time. It is also industry-proven, as it runs at Netflix scale.
In contrast to other open-source stream processors, Flink provides not only true stream processing at the event granularity, i.e., no micro batching, but also handles many batch use cases very well.
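The difference between event-granularity processing and micro-batching can be sketched in a few lines of plain Python. This is not Flink code and uses no Flink API; `per_event_totals` and `micro_batch_totals` are hypothetical helpers that only contrast when results become visible.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, value)


def per_event_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Update keyed state and emit a result for every single event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]


def micro_batch_totals(
    events: Iterable[Event], batch_size: int
) -> Iterator[Dict[str, int]]:
    """Buffer events and emit a snapshot of the totals once per batch."""
    totals: Dict[str, int] = defaultdict(int)
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            for key, value in batch:
                totals[key] += value
            batch.clear()
            yield dict(totals)
    if batch:  # flush the final, partially filled batch
        for key, value in batch:
            totals[key] += value
        yield dict(totals)


events = [("a", 1), ("b", 2), ("a", 3)]
print(list(per_event_totals(events)))       # one result per event
print(list(micro_batch_totals(events, 2)))  # one result per batch
```

Both variants arrive at the same final totals; what differs is latency. In the per-event version each result is available as soon as its event arrives, while in the micro-batch version a result waits until the batch fills.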
What are the main directions of Flink development?
DW: Right now the Flink community is focused on true unification of batch and streaming, both in query semantics and at the runtime level. That means it will be possible to perform operations such as bounded-to-unbounded stream joins or enrichment with static data in an optimized way. Later on, this will expand to more use cases, such as machine learning on data streams.
Another area of focus is expanding the scope and features of SQL support. We plan to fully support TPC-DS in the near future and also to further define industry-wide standards for Streaming SQL. Moreover, you can expect significant performance improvements in the future.
What are you working on right now as a Flink committer?
DW: Recently I was driving the effort of supporting the MATCH_RECOGNIZE clause that allows querying for row sequence patterns with SQL, which I’m giving a talk about. Right now I am involved in the efforts of further improving SQL support.
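To give a feel for what "querying for row sequence patterns" means, the sketch below finds V-shaped runs in an ordered list of prices: one starting row, then one or more falling rows, then one or more rising rows. It is plain Python rather than SQL or Flink code, the pattern itself is an illustrative choice not taken from the talk, and `find_v_shapes` is a hypothetical helper.

```python
from typing import List, Tuple


def find_v_shapes(prices: List[float]) -> List[Tuple[int, int, int]]:
    """Return (start, bottom, end) index triples for each V-shaped run.

    A match is a starting row, followed by one or more strictly falling
    rows, followed by one or more strictly rising rows. Matching is greedy
    and resumes after the last row of the previous match.
    """
    matches: List[Tuple[int, int, int]] = []
    n = len(prices)
    i = 0
    while i < n - 1:
        j = i
        # Consume the falling rows.
        while j + 1 < n and prices[j + 1] < prices[j]:
            j += 1
        bottom = j
        # Consume the rising rows.
        while j + 1 < n and prices[j + 1] > prices[j]:
            j += 1
        if bottom > i and j > bottom:
            matches.append((i, bottom, j))
            i = j + 1  # skip past the last row of the match
        else:
            i += 1  # no match starting here; try the next row
    return matches


prices = [10, 8, 7, 9, 12, 11, 10, 13]
print(find_v_shapes(prices))  # prints: [(0, 2, 4), (5, 6, 7)]
```

A declarative pattern clause expresses the same intent without the hand-written loops, which is the appeal of doing this kind of matching in SQL.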
How did you join the community and start contributing to Apache Flink?
DW: I started contributing because I wanted to improve my coding skills, and I found Apache Flink’s community the most welcoming among the open-source projects I looked into. I was also lucky to meet people who allowed me to work on this project as part of my job, first at GetInData and now at Ververica.
Contributing to open-source projects is something I would recommend to everybody, as it is a great way to meet and work with super intelligent people on extremely interesting problems.