Agenda 2020

February 27, 2020

8.00 - 9.00

Registration and welcome coffee

9.00 - 9.10

Conference opening

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

CEO and Co-founder, GetInData

9.10 - 10.40

Plenary Session

9.10 - 9.40

Challenges of modern analytics

Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and supports multi-cloud setups.

Keywords: #Snowflake #cloud #clouddataplatform #multicloud #cloudanalytics #datawarehousecloud #data #SQL #database #datawarehouse

Thomas Scholz

Sales Engineering Manager for EMEA, Snowflake

9.40 - 10.10

Credit Risk in practice on a global scale. New technology platform and methodologies in practice

Open-source technologies and new machine learning methods have been changing regulatory credit risk in recent years. We will talk about how we handle data globally from across the whole of ING, which technologies we use to improve the modellers' experience, and which machine learning tools we incorporate, all while staying compliant with stricter regulatory frameworks than ever. Handling risk on a balance sheet 1.5 times the size of Poland's GDP is a genuinely interesting task for modellers, data scientists and data engineers.

Keywords: #CreditRisk, #Python, #BigData, #UseCase #MachineLearning, #StatisticalModeling, #DataScience

Marcin Brożek

Credit Risk Modelling Expert, ING Tech Poland

Konrad Wypchło

Senior Chapter Lead, ING Tech Poland

10.10 - 10.40

Leveraging hybrid cloud for real-time insights with the new Cloudera Data Platform

The new Cloudera solutions for hybrid cloud environments. Adding Apache Flink integration to the CDP. Solving real-life challenges based on use cases from the Polish market. The Apache Flink and CSP roadmap.

Keywords: #hybrid_cloud, #data_in_motion, #Cloudera_Data_Platform, #stream_processing

Marton Balassi

Manager, Streaming Analytics, Cloudera

Kamil Folkert

CTO, Member of the Board, 3Soft

10.40 - 11.10

Coffee break

11.10 – 15.30 Simultaneous sessions

Architecture, Operations and Cloud

Data Engineering

 

Streaming and Real-Time Analytics

 

Artificial Intelligence and Data Science

Data Strategy and ROI

 

Host:

Arkadiusz Gąsior

Data Engineer, GetInData

Host:

Łukasz Suchenek

Conferences Editor, Evention

Host:

Paweł Jurkiewicz

Data Engineer, GetInData

Host:

Adrian Bednarz

Big Data Engineer, GetInData

Host:

Stefan Rautszko

Team Manager, Data Design, Roche

11.15 - 11.45

From Containers to Kubernetes Operators for a Datastore

Keywords: #docker #container #kubernetes #operator #orchestration

Philipp Krenn

Developer, Elastic

11.15 - 11.45

Will we see driverless cars in the '20s?

Keywords: #autonomousdriving #dataingestion #petabytescale #hardwareintheloop #mapr #spark #openshift

Sławomir Folwarski

Senior Architect, DXC Analytics Platform, DXC Technology

Piotr Frejowski

System Architect, DXC Robotic Drive Program, DXC Technology

11.15 - 11.45

Creating an extensible Big Data Platform for advanced analytics - 100s of PetaBytes with Realtime access

Keywords: #bigdata #scalability #hadoop #spark #analytics #datascience #dataplatform

Reza Shiftehfar

Engineering Management & Leadership, Uber

11.15 - 11.45

Building Recommendation Platform for ESPN+ and Disney+. Lessons Learned

Keywords: #recommendersystems  #ML #cloud #experimentation

Grzegorz Puchawski

Data Science and Recommendation, Disney Streaming Services

11.15 - 11.45

From bioreactors to kibana dashboards

Keywords: #googleCloud #Streaming #DataFlow #DataOps

Fabian Wiktorowski

IT Expert, Roche

11.45 - 11.50

Technical break

11.50 - 12.20

Replication Is Not Enough for 450 PB: Try an Extra DC and a Cold Store

Keywords: #Hadoop #datasecurity #resilience #in-house #storage

Stuart Pook

Senior Site Reliability Engineer, Criteo

11.50 - 12.20

Data Platform at Bolt: challenges of scaling data infrastructure in a hyper growth startup

Keywords:  #aws #datalake #datawarehouse #preprocessing #machinelearning

Łukasz Grądzki

Engineering Manager, Bolt

11.50 - 12.20

Interactive Analytics at Alibaba

Keywords:

Yuan Jiang

Senior Staff Engineer, Alibaba

11.50 - 12.20

Building a Factory for Machine Learning at Spotify

Keywords: #ml #kubeflow #tensorflow #ml-infra

Josh Baer

Product Lead, Machine Learning Platform, Spotify

11.50 - 12.20

Abstraction matters

Keywords: #lowcode, #executionabstraction, #datavirtualization

Anthony Ibrahim

Head of Ab Initio DACH/CEE, Ab Initio

12.20 - 12.25

Technical break

12.25 - 12.55

How to make your Data Scientists like you and save a few bucks while migrating to cloud - Truecaller case study

Keywords: #cloudmigration #bigquery #airflow #kafka

Fouad Alsayadi

Senior Data Engineer, Truecaller

Juliana Araujo

Data Product Manager, Truecaller

Tomasz Żukowski

Data Analyst, GetInData

12.25 - 12.55

Kafka-workers, Parallelism First

Keywords: #kafka, #data processing, #high-performance

Tomasz Uliński

Software Developer, RTB House

12.25 - 12.55

Adventure in Complex Event Processing at telco

Keywords:

Jakub Błachucki

Big Data Engineer, Orange

Maciej Czyżowicz

Technical Leader for Analytics Stream, Orange

Paweł Pinkos

Big Data Engineer, Orange

12.25 - 12.55

Neural Machine Translation: achievements, challenges and the way forward

Keywords:  #machinetranslation #deeplearning #adversarialexamples #datascience

Katarzyna Pakulska

Data Science Technology Leader, Findwise

Barbara Rychalska

Senior Data Scientist and Data Science Section Leader, Findwise

12.25 - 12.55

It's 2020. Why are we still using 1980s tech?

Keywords: #Analytics #SQL #DWH #CaseStudy #BigData

Arnon Shimoni

Product Manager and Solutions Architect, SQream

12.55 - 13.50

Lunch

13.50 - 14.20

DevOps best practices in AWS cloud

Keywords:  #aws_cloud #devops #best_practices #infrastructure_as_a_code

Adam Kurowski

Senior DevOps, StepStone Services

Kamil Szkoda

DevOps Team Leader and Product Owner, StepStone Services

13.50 - 14.20

Presto @ Zalando: A cloud journey for Europe’s leading online retailer

Keywords:  #CloudAnalytics #Presto #DataVirtualization #SQL-on-Hadoop #DWH

Wojciech Biela

Co-founder & Senior Director of Engineering, Starburst

Piotr Findeisen

Software Engineer, Starburst

Max Schultze

Data Engineer, Zalando SE

13.50 - 14.20

Network monitoring, data processing, forecasting, fraud and anomaly detection – using Spark, Elasticsearch, Machine Learning and Hadoop

Keywords: #spark #elasticsearch #machinelearning #hadoop #dataprocessing

Kamil Szpakowski

Big Data Main Specialist, T-Mobile

13.50 - 14.20

Feature store: Solving anti-patterns in ML-systems

Keywords: #ml #recommendersystem #mlops #automl

Andrzej Michałowski

Head of AI Research & Development, Synerise 

13.50 - 14.20

Omnichannel Personalization as example of creating data ROI - from separate use cases to operational complete data ecosystem

Keywords:  #ROI #real-timeomnichannelpersonalization #scalingdataecosystem #businessengagement #harvesting

Tomasz Burzyński

Business Insights Director, Orange

Mateusz Krawczyk

Personalization Solutions Product Owner, Orange

14.20 - 14.25

Technical break

14.25 - 14.55

The Big Data Bento: Diversified yet Unified

Keywords: #bigdatabento #cloud #unifiedanalyticsplatform #unifieddataanalyticsplatform #spark

Michael Shtelma

Solutions Architect, Databricks

14.25 - 14.55

Towards enterprise-grade data discovery and data lineage at ING with Apache Atlas and Amundsen

Keywords: #BigData, #DataDiscovery, #DataIngestion, #Lineage, #MetadataGovernance, #Data-Driven

Verdan Mahmood

Software Engineer, ING

Marek Wiewiórka

Big Data Architect, GetInData

14.25 - 14.55

Monitoring & Analysing Communication and Trade Events as Graphs

Keywords:  #graphAnalytics #transactionProcessing #FlinkGelly #Elasticsearch #Kibana

Christos Hadjinikolis

Senior Consultant, Lead ML Engineer, Data Reply UK

14.25 - 14.55

Utilizing Machine Learning To Optimize Marketing Spend Through Attribution Modelling

Keywords: #attribution #datascience #statisticalmodeling #marketingmix #interdisciplinary

Arunabh Singh

Lead Data Scientist, HiQ International AB

14.25 - 14.55

Data Science @ PMI – Journey from business problem to the data product industrialization

Keywords: #UseCase #CI/CD #BestPracticesForDataScience #DataProduct #ReproducibleResearch

Michał Dyrda

Senior Enterprise Data Scientist, Philip Morris International

Maciej Marek

Enterprise Data Scientist, Philip Morris International

14.55 - 15.00

Technical break

15.00 - 15.30

How to send 16,000 servers to the cloud in 8 months?

Keywords: #OpenX #gcp #scale #adtech #migration

Marcin Mierzejewski

Engineering Director, OpenX

Radek Stankiewicz

Strategic Cloud Engineer, Google Cloud

15.00 - 15.30

Optimize your Data Pipeline without Rewriting it

Keywords:  #data-driven #optimize #data-pipeline #operation #improvement

Magnus Runesson

Senior Data Engineer, Tink

15.00 - 15.30

Flink on a trip - a real-time car insurance system in a nut(shell)

Wojciech Indyk

Streaming Analytics and All Things Data Black Belt Ninja, Humn.ai

15.00 - 15.30

Reliability in ML - how to manage changes in data science projects?

Keywords: #datascience #datamanagement #revisioncontrol #datapipeline

Kornel Skałkowski

Senior AI Engineer, Consonance Solutions

15.00 - 15.30

Using data to build Products

Keywords:  #NewProducts #MachineLearning #DataFueledGrowth #DataGuidedProductDevelopment #ScalingNewProduct

Ketan Gupta

Product Leader, Booking.com

15.30 - 16.10

Coffee break

16.10 – 17.40 Roundtable sessions

16.10 - 16.15

Intro

Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to their group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion – selected professionals with vast knowledge and experience.

There will be 2 rounds of discussion, so every conference participant can take part in 2 discussions.

 

16.15 – 16.55    1st round

17.00 – 17.40   2nd round

16.15 - 16.55

1st round

1. Managing a Big Data project – how to make it all work well?

Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.

Michał Rudko

Big Data Analyst / Architect, GetInData

2. Analytics and Customer Experience Management on top of Big Data

How do you ensure the successful adoption of Big Data and Analytics systems? It is a challenge for most organizations. Let's discuss how to promote a user-centric approach, leverage experience design and manage user expectations on Big Data projects. I would be happy to hear your opinions and answer your questions based on my practical experience applying Design Thinking and architecture design methodologies. I believe this conversation will be interesting for Architects, Tech Leaders, Product Managers and C-level folks.

Taras Bachynskyy

Director, Big Data & Analytics, SoftServe

3. Data visualization, how to visualize large, complex and dirty data and what tools to use

Data visualisation is a great tool for explaining data – the best we know right now. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore the data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!

Adrian Mróź

Frontend Developer, Allegro

4. Practical application of AI

Industry 4.0 and AI – are we ready for the 4th industrial transformation? Who should be the beneficiary of Industry 4.0? What are the key barriers to implementing AI projects in organizations? Real cases of AI in industry.

Natalia Szóstak

Head of R&D, TIDK

5. The need for explainable AI

With the spread of AI-based solutions, more and more organizations would like to understand the reasons for system decisions. This is of special interest in regulated industries. The session will cover so-called white-box methods, as well as modern approaches to AI explainability which allow understanding of more complex models.

Kacper Łukawski

Data Science Lead, Codete

6. Real-life machine learning at scale using Kubernetes and Kubeflow

How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?

Michał Bryś

Data scientist, OpenX

Michał Żyliński

Customer Engineer, Google

7. Big Data on Kubernetes

Kubernetes has found its place in the microservices world. More and more teams are betting on Kubernetes as their go-to platform for deploying business applications. What about Big Data? Can our ETLs also make the move? During the roundtable we'll discuss how Kubernetes can be utilised as a runtime for Big Data jobs. Is the current tooling ready to be deployed in Kubernetes containers? What does the potential shift mean for the storage technologies in use? Finally, will Kubernetes democratize Big Data work, moving us from central data lakes to distributed data meshes?

Tomasz Kogut

Lead Software Engineer, Adform

8. Best tools for alerting and monitoring of the data platforms

Let's discuss what should be monitored in data platforms. What are the best tools for particular use cases? What is not recommended?

Piotr Kalański

Development Manager, StepStone Services

9. What to do with my HDP/CDH cluster under the new Cloudera licensing model

After the merger with Hortonworks, Cloudera became a single vendor that builds a distribution consisting of major components from the so-called Hadoop Ecosystem (e.g. Hadoop, Spark, Hive, Ranger). While these components themselves are open source, access to the binaries that are critical to installing/upgrading clusters will be limited to customers who purchase a paid subscription. This means that thousands of companies that currently use Hadoop for free will need to decide what to do next. Should I pay for a subscription, or compile my own binaries to build my own distribution? Should I stop using on-premise Hadoop and go to the public cloud instead? During this panel we will explore this topic and try to answer these questions based on our vendor-neutral experience of working with customers who have large production installations of HDP/CDH clusters.

Krzysztof Zarzycki

Big Data Architect, CTO and Co-founder, GetInData

10. Addressing challenges of modern analytics with Snowflake

Data is the new oil. More and more companies understand the value of data for optimising their core business or entering new business fields. They want to analyse data to enhance their internal processes, the way they work with customers, and how they collaborate with external parties such as suppliers and partners. None of this is trivial: it requires the right skill set and appropriate technology. The cloud promises scalability, elasticity and resources on demand, but a cloud-native architecture is mandatory to leverage these features. Snowflake is built for the cloud, separates storage from compute and offers everything people expect from the cloud, including a data marketplace for collaboration and monetisation that makes sharing and exchanging data globally easy. All of this is independent of any single cloud vendor and supports multi-cloud setups.

Tomasz Mazurek

Sales Director for Eastern Europe, Snowflake

11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning

What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into 4 areas: debugging, implementation, communication and learning.

Rafał Wojdyła

Data Engineer

12. Data discovery – building trust around your data

The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, and increasing compliance demands, force us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behavior and how it changes over time.
Let's discuss data discovery in the context of use cases, technologies and possible challenges.

Damian Warszawski

Software Engineer, ING Tech Poland

13. SQL on Big Data for batch, ad-hoc & streaming processing

Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today, even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we only have more complex and more heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?

Piotr Findeisen

Software Engineer, Starburst

14. The Latest and Greatest of Apache Spark

Apache Spark is a fast and general engine for distributed in-memory computations on a massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for and what problems do you hope to solve with Spark 3? The roundtable is to share and discuss problems we want to solve with the new features coming in Spark 3.

Magnus Runesson

Senior Data Engineer, Tink

15. Serverless data warehousing – big data, the cloud way

What is a serverless warehouse? Which solutions are considered serverless? Data ingestion, data storage and data processing; pricing and cost efficiency. Advantages and disadvantages of both serverless and on-premise approaches.

Arkadiusz Gąsior

Data Engineer, GetInData

16. Stream processing engines – features, performance, comparison

Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about their similarities and differences in various areas. What are the strengths, weaknesses and constraints of each of them? Is there a niche for each, or will a single winner eventually emerge?

Marek Maj

Big Data Engineer, GetInData

17. Snorkel Beambell – Real-time Weak Supervision on Apache Beam

Deep Learning models have led to massive growth in real-world machine learning, allowing practitioners to achieve state-of-the-art scores on benchmarks without any hand-engineered features. The challenge with continuous retraining is that one needs to maintain prior state (e.g., the labeling functions in the case of Weak Supervision, or a pretrained model like BERT or Word2Vec for Transfer Learning) that is shared across multiple streams. Apache Beam's stateful stream processing capabilities are a perfect match for supporting scalable Weak Supervision.

Suneel Marthi

Principal Technologist - AI/ML, Amazon Web Services

17.00 - 17.40

2nd round

1. Managing a Big Data project – how to make it all work well?

Data Scientists, Data Engineers, DevOps Specialists, Business Stakeholders – all these people come from different worlds, but they need to work closely together to make a Big Data project a success.
Let's discuss our achievements as well as… spectacular failures when it comes to communication, cooperation and meeting one another's expectations.
We'll be talking about methodologies, tools, best practices and the so-called human element.

Michał Rudko

Big Data Analyst / Architect, GetInData

2. Bring Data as Products to consumers

How do you define data products? What mindset and approach should we have to make a product approach possible in data? Let's brainstorm about implementing this approach and the opportunities and value it brings.

Łukasz Pakuła

RGITSC Team Manager - DataOps, Roche

3. Data visualization, how to visualize large, complex and dirty data and what tools to use

Data visualisation is a great tool for explaining data – the best we know right now. But our data volumes grow every day, and we often hit the hard limits of current visualisation systems. How can we approach this problem so that we can still analyse and explore the data? From interactive interfaces that link many visualisations to machine learning algorithms that pick the best chart type and parameters – it is a truly interdisciplinary issue, so let's share our knowledge!

Adrian Mróź

Frontend Developer, Allegro

4. BI platform: a choice or a circumstance?

We put a lot of effort into designing data lakes, pipelines and data warehouses. But let's not forget the original aim at the very beginning of this road: to gain insight. In order to utilize petabytes in an organized and governed manner, we need a solid BI platform that suits our needs. A well-chosen one is not just the icing on the cake – it makes all the data efforts worthwhile and keeps our business head-to-head with the competition.
During this roundtable session we will discuss the importance of choosing the right BI platform, exchange best practices (and some bad ones too) and focus on current trends in the area of BI.

Emil Ruchała

Data Analyst & BI Developer, GetInData

5. Real-life machine learning at scale using Kubernetes and Kubeflow

How do you build a machine learning pipeline that processes 1500 TB of data daily in a fast and cost-effective way on Google Cloud Platform using Kubeflow? How do you serve a TensorFlow model with almost 1M requests per second and latency < 10 ms on Kubernetes? Are Kubernetes and Kubeflow ready to serve data scientists?

Michał Bryś

Data scientist, OpenX

Michał Żyliński

Customer Engineer, Google

6. Databases in Kubernetes: from bare metal to cloud native

Initial cloud-native conquests started with stateless services but gradually turned towards data management systems. DBMS have relied on centralized, bare metal servers for decades. Cloud-native architectures are a big new technology shift for such systems. Many commercial and open source databases already provide cloud-native adoptions of their products. Among the most interesting cloud-native converts are analytic databases, which imply additional requirements to storage and clustering techniques in order to run fast analytic queries over billions, or even trillions, of rows. Examples include the MySQL clustering project Vitess, which has recently reached CNCF graduation level, and ClickHouse, an extremely fast and scalable analytical database that is being converted to cloud-native operation by Altinity.

Join me to discuss various aspects of running databases in Kubernetes. This is a new technology that has a lot of caveats, such as storage. At the same time databases in Kubernetes promise substantial benefits to the users of such applications as well as companies that operate them. We will explore these issues as well as the path to maturity.

Alexander Zaitsev

Co-founder & CTO, Altinity

7. Scale Your Logs, Metrics, and Traces with the Elastic Stack: from traditional applications to microservices and Kubernetes

How do you tackle your monitoring and observability problems? There is a high chance that you are using the Elastic or ELK Stack, and this session is all about making it scale: from easier collection of data to scalable multi-tier architectures and the lifecycle of your data, including deletion.

Philipp Krenn

Developer, Elastic

8. From on-premise to the cloud: an end to end cloud migration journey

The goal of the discussion is to share experiences and ideas around tackling the migration challenge. Together we will try to nail down the benefits, obstacles and solutions that can help in the journey from on-prem to the cloud.
The plan is to identify the important areas of the migration and to share the lessons we've learned from them.

Mateusz Pytel

Google Certified Professional - Cloud Architect, GetInData

9. Challenges of building a modern & future-proof data processing platform

The speed of change in IT and in our companies seems to never stop increasing, especially in the field of Big Data. To keep up with it we need to move fast and be smart about it, but how do we actually achieve that? How do we predict future needs for processing and tools? How do we prepare for them? What kinds of trade-offs can we make? We'll try to answer these questions together and share good practices and experience during this session.

Monika Puchalska

Engineering Manager, Zendesk

10. Hadoop is dying, long live HDFS: what are your options and plans for sustainable advanced analytics and machine learning?

Many enterprise organizations, particularly in financial services and telco, have built massive data lakes on Hadoop, specifically HDFS. However, Hadoop never lived up to its hype as a data warehouse replacement and languished as a storage option in on-premises data centers. On top of that, the top two HDFS vendors – Hortonworks and Cloudera – have merged, and the third, MapR, has been sold off to HPE in a fire sale.
What is your escape plan, and what are the alternatives for applying advanced analytics and machine learning to your growing data volumes? Will you adopt object storage as a cost-effective data storage repository? Is the separation of compute and storage the future database architecture for managing variable workloads?
Join this roundtable discussion to get to the bottom of these top-of-mind questions and learn about the emerging trends and options for building modern data pipelines for advanced analytics and machine learning over vast amounts of data.

Maciej Paliwoda

Solution Engineer, VERTICA

11. Being an efficient data engineer. Tools, ecosystem, skills and ways of learning

What does it mean to be a productive (data) engineer? Is it about the tools we use? Is it the mindset we have? Is it the environment we are surrounded by? Let's share and discuss war stories, learning resources, methodologies and libraries that help us escape the gumption traps in the daily life of an engineer. The discussion will be divided into 4 areas: debugging, implementation, communication and learning.

Rafał Wojdyła

Data Engineer

12. Data discovery – building trust around your data

The worldwide growth of data has changed the business landscape forever. Many organizations are undergoing transformations triggered by the data revolution. While the benefits of collecting bigger data volumes are easy to understand, doing so has revealed additional challenges when trying to use the data effectively. The ability to explore data, and increasing compliance demands, force us to think about solutions that leverage the power of metadata. Data descriptions are evolving from simple schema definitions to capturing application context, behavior and how it changes over time.
Let's discuss data discovery in the context of use cases, technologies and possible challenges.

Damian Warszawski

Software Engineer, ING Tech Poland

13. SQL on Big Data for batch, ad-hoc & streaming processing

Data analysis is the key factor in a data-driven decision culture, and SQL is the omnipresent language for deriving information from data. Today, even small companies have huge data sets, while huge organisations have enormous ones. With the advent of technologies aiming to replace and unify the ones we used previously, we only have more complex and more heterogeneous data landscapes. How do we query the data to fuel key business decisions? How do we handle data ingestion? When do we need it?

Piotr Findeisen

Software Engineer, Starburst

14. The Latest and Greatest of Apache Spark

Apache Spark is a fast and general engine for distributed in-memory computations on a massive scale. Spark 3 is in preview and expected to be released in the first quarter of 2020. What features are you waiting for and what problems do you hope to solve with Spark 3? The roundtable is to share and discuss problems we want to solve with the new features coming in Spark 3.

Magnus Runesson

Senior Data Engineer, Tink

15. Managing workflows at scale

How to build and maintain thousands of pipelines in the organisation? What are the biggest pain points in orchestrating hundreds of ETLs? What open source and managed solutions are available?

Paweł Kupidura

Data Engineer, Bolt

16. Stream processing engines – features, performance, comparison

Streaming systems are gaining more and more attention, and we don't expect this trend to slow down. Currently there are a few engines on the market. At this roundtable we will share our knowledge about their similarities and differences in various areas. What are the strengths, weaknesses and constraints of each of them? Is there a niche for each, or will a single winner eventually emerge?
Let's share our experiences in using streaming engines, as well as our predictions about their future.

Marek Maj

Big Data Engineer, GetInData

17. Data Auditing

Have you ever encountered a situation when your pipeline (or system) produced less (or more) data than expected? Has it lost your data? Or maybe the data never reached the source you read from (or reached it, but far too late)? Have you ever seen too many empty (or NULL) fields? Or a field called `age` with negative values? Or maybe you encrypted some data so thoroughly that no one could read (decrypt) it anymore? Last but not least, what do you do if you detect a problem (and do you even monitor your data? Do you have any alerts set up?)? Can you backfill, mutate or leave a wrong dataset as it is? Let's talk about different ways to ensure that the data you produce makes sense.

Bartosz Janota

Senior Data Engineer, Bolt

17.40 - 17.55

Coffee break

17.55 - 18.25

Panel discussion: Ways to make large-scale ML actually work

Despite the spread of dedicated AI platforms, ready-to-use ML libraries and the tons of data available, running successful large-scale AI/ML projects still faces technical and organizational challenges. According to some studies, 8 out of 10 such projects fail. This panel will explore the technical prerequisites that a company should put in place to build ML-based solutions efficiently. This includes, for example, organizing the data (e.g. data discovery, data lineage, data quality), experimenting with models (e.g. notebooks, libraries, collaboration), one-click deployment of a model (e.g. AI/ML platforms, infrastructure) and more. While many of these challenges are not that hard when working with small data, everything gets more complex and time-consuming as the data and scale grow.

Host:

Marcin Choiński

Head of Big Data & Analytics Ecosystem, TVN

Panelists:

Josh Baer

Product Lead, Machine Learning Platform, Spotify

Marek Wiewiórka

Big Data Architect, GetInData

Paweł Zawistowski

Lead Data Scientist, Adform, Assistant Professor, Warsaw University of Technology

18.25 - 18.40

Closing & Summary

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

19.00 - 22.00

Networking party for all participants and speakers

At the end of the conference we would like to invite all attendees to an informal evening meeting at the “Dekada” Club, located at Grójecka 19/25, 02-021 Warszawa.