AGENDA 2018

The order of presentations may change

8.00 - 9.00

Registration and coffee

9.00 - 9.15

Conference opening

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

CEO and Co-founder, GetInData

9.15 – 10.45 Plenary session

-

Never Underestimate the Power of a Single Node

Recent developments in GPU hardware and storage technology have changed how we do data analysis and machine learning. On a single node, these technologies have grown manyfold in the last five years, while the growth in network speed has lagged behind. I will talk about the overall ML lifecycle and the challenges we face in doing ML at scale, from protecting your Uber accounts to making self-driving cars a reality. Then I want to focus on an important part of the ML lifecycle: data/ML exploration and experimentation. In large companies like Uber, data scientists are inclined to use shared Hadoop infrastructure for all their needs. For data exploration, this is inefficient for the user and also slows the cluster down. I will talk about our new solution to this problem: a high-powered node that lets us work with hundreds of GB to a few TB of data interactively without paying the overhead of a distributed system. I will also talk about some of the interesting machine learning and infrastructure problems that I face in my new role in Uber’s self-driving team.

Karthik Ramasamy

Data Science Manager, Uber

10.45 - 11.15

Coffee break

11.15 – 15.30 Simultaneous sessions

Architecture, Operations & Deployment

This track is dedicated to system architects, administrators and people with DevOps skills who are interested in technologies and best practices for planning, building, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premise and the cloud.

Data Engineering

This track is the place for developers to learn about tools, techniques and innovative solutions to collect, store and process large volumes of data. It covers topics like data ingestion, ETL, distributed engines, process scheduling, metadata and schema management, distributed datastores and more.

Analytics & Data Science

This track includes real case studies demonstrating how Big Data is used to address a wide range of business problems. Here you can find talks about large-scale Machine Learning, A/B tests and data visualization, as well as various analyses that enable data-driven decisions and feed personalized features of data-driven products.

Real-Time Analytics

This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

-

Big Data Journey at a Big Corp

We will present the journey of Orange Polska as it evolved from a proprietary ecosystem towards a largely open-source ecosystem based on Hadoop and friends, a journey particularly challenging at a large corporation. We’ll present the key drivers for starting with Big Data, the evolution of BI, and the emergence of Data Scientists and advanced analytics, along with operational reporting and stream processing to detect issues. This presentation will cover both the technical aspects and the business environment, as both are inherently linked in the process of enterprise big data adoption.

Tomasz Burzyński

BI Director, Orange Polska

Maciej Czyżowicz

Enterprise Architect, Orange Polska

-

Building a Modern Data Pipeline: Lessons Learned

Adform is one of the biggest European ad-tech companies – for example, our RTB engine at peak handles ~1m requests per second, each in under 100 ms, producing ~20TB of data daily.

In this talk I will present the data pipeline and the infrastructure behind it, emphasizing our core principles (such as event sourcing, immutability, correctness) as well as the lessons learned along the way while building it and the state it is converging to.
Keywords: stream processing, kafka, event sourcing, big data

Saulius Valatka

Technical Lead, Adform

-

Machine learning security

Despite rapid progress in tools and methods, security has been almost entirely overlooked in mainstream machine learning. Unfortunately, even the most sophisticated and carefully crafted models can fall victim to so-called adversarial examples.

This talk will cover the concepts of adversarial data and machine learning security, go through examples of possible attack vectors and discuss the currently known defence mechanisms.
Keywords: machine learning, security, adversarial examples

Paweł Zawistowski

Assistant professor/Senior Data Scientist, Warsaw University of Technology/Adform

-

Apache Flink: Better, Faster & Uncut

This talk will start with a brief introduction to stream processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing, the end-to-end exactly-once processing guarantee and network latency optimizations. We’ll discuss real problems that Flink’s users were facing and how they were addressed by the community and data Artisans.
Keywords: Apache Flink, streaming, data processing engine

Piotr Nowojski

Software Engineer, data Artisans

-

Big data serving with Vespa

Offline processing with big data sets can be done with tools such as Hadoop or Spark and streams of data processed with Storm. But what do you do when you need to process data at the time a user is making a request?

This talk will introduce Vespa, an engine solving the problem of big data serving. Vespa is behind the recommendation, ad targeting and search at Yahoo, where it handles billions of daily queries over billions of documents. Some iteration of the engine has been in production for over 15 years. Vespa was recently open sourced at http://vespa.ai
Keywords: Vespa, recommendations, targeting, search

Jon Bratseth

Distinguished architect, Yahoo!

-

Privacy by Design

Privacy and personal integrity have become a focus topic due to the upcoming GDPR deadline in May 2018 and its requirements for data storage, retention, and access. This talk provides an engineering perspective on privacy and highlights pitfalls and topics that require early attention.

The content of the talk is based on real-world experience from handling privacy protection in large-scale data processing environments.
Keywords: Privacy, GDPR, data pipelines, data engineering

Lars Albertsson

Founder & data engineering consultant, Mapflat

-

7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with Analytics

The next time you find yourself thinking there isn’t enough time in a week, consider what Drinker Biddle did for their client in 7 days.

When a senior executive for a publicly traded company was fired for underperformance, he made a serious allegation on his way out the door. He claimed he was laid off because of his repeated attempts to inform officials that the company was falsifying quarterly financial reports to the public. Instead of waiting for the typical pace of discovery, which could potentially cost their client at least a quarter of a million dollars, Drinker Biddle used powerful analytics technology to conduct an intelligent investigation, fast. In this session, you will learn about machine learning that makes digging through large multi-source data sets possible. You will have a chance to see behind the scenes how engineers empower legal teams to organize data, discover the truth and act on it.
Keywords: machine learning, analytics, workflow

Perry Marchant

Vice President of Engineering, Relativity

-

Thinking in Data Flows

In this presentation we’ll look at how far one can push the notion of batch = streaming, how processor-oriented architectures like Apache NiFi and Apache Streams work, and why they work better than a forced choice between batch and streaming. We’ll close with some real-world examples of using NiFi and Streams for loading batch and streaming data to HDFS, Elasticsearch, and other data destinations used in modern data pipelines.

Keywords: streaming, data flow, NiFi, Streams

Joey Frazee

Solutions Engineer, Hortonworks

Steve Blackmon

VP Technology, People Pattern

-

Cloud operations with streaming analytics using Apache NiFi and Apache Flink

A Cloud deployment produces a huge amount of information that can be used to gain better situational awareness and operate it efficiently.

This session will explain how Red Hat uses tools like NiFi, Kafka and Flink to process the constant stream of syslog messages (RFC5424) produced by the Infrastructure as a Service provided by OpenStack, detect common failure patterns that can arise, and generate alerts as needed.
Keywords: Apache Flink, Apache NiFi, Cloud monitoring, Apache Kafka

Suneel Marthi

Principal Technologist - AI/ML, Amazon Web Services

-

Time Series Jobs Scheduling at Criteo With Cuttle

At Criteo we run something like 300k jobs, processing around 4PB of logs to produce trillions of new records each day. We do that using several frameworks such as Hive, raw Map/Reduce, Scalding or Spark.

In this presentation I will introduce you to “Cuttle”, our open-source Scala-based job scheduler. You will learn what it is good for and how you can use it to produce data at scale.
Keywords: workflow, scheduling, hadoop, scala

Guillaume Bort

Technical lead, Data Reliability Engineering, Criteo

-

The Factorization Machines algorithm for building recommendation systems

Recommendation systems are among the successful examples of data science applications in the Big Data domain. The goal of my talk is to present the Factorization Machines algorithm, available in the SAS Viya platform.

Factorization Machines are a good choice for making predictions and recommendations based on large sparse data, which is typical of Big Data. In the practical part of the presentation, fine-grained data from the NBA league will be used to build an application recommending optimal game strategies as well as predicting the results of league games.
Keywords: SAS Viya, Factorization Machines, recommendation system, sparse data

Paweł Łagodziński

Sr Business Solutions Manager, SAS

-

Assisting millions of active users in real-time

Nowadays many companies are becoming data-rich and data-intensive. They have millions of users generating billions of interactions and events per day.

These massive streams of complex events can be processed and reacted upon to, e.g., offer new products and next best actions, communicate with users or detect fraud, and the quicker we can do it, the higher the value we can generate.

In this talk we will present how, in joint development with our client and in just a few months of effort, we built from the ground up a complex event processing platform for their intensive data streams. We will share how the system runs marketing campaigns and detects fraud by following the behavior of millions of users in real time and reacting to it instantly. The platform, designed and built with Big Data technologies to scale infinitely and cost-effectively, already ingests and processes billions of messages, or terabytes of data, per day on a still small cluster. We will share how we leveraged the current best-of-breed open-source projects, including Apache Flink, Apache NiFi and Apache Kafka, and what interesting problems we needed to solve. Finally, we will share where we’re heading next: which use cases we’re going to implement and how.

Krzysztof Zarzycki

Big Data Architect and Co-founder, GetInData

Dawid Wysakowicz

Data Engineer, GetInData

12.55 - 13.50

Lunch

-

Bringing Druid to production; the possibilities and pitfalls

Druid is a high-performance, column-oriented, distributed data store. This database allows you to query petabytes of columnar data in a realtime fashion.

First, the talk introduces Druid’s architecture and the many components within the database system and their roles. Second, it covers the two ways (batch/realtime) of ingesting data into Druid and their pros and cons. Finally, a case of using Druid in production will be presented. The focus is a cost-effective implementation that allows Druid to scale using an OpenStack private cloud. The takeaways of the session are insights into when to use Druid and help in identifying common pitfalls when running Druid in production.
Keywords: Druid, Databases, Scale

Fokko Driesprong

Data Engineer, GoDataDriven

-

Booking.com's way to scale ML across the business

It is in the culture of Booking.com that we make all changes in the product based on data from A/B experiments. We do thousands of them every year in all areas of the business.

In the last few years we’ve faced the challenge of scaling ML usage across all the teams in the company so that it can be used in at least hundreds of experiments per year. Our initial set of goals was to speed up experimentation with real-time features, make features reusable by Data Scientists (DS) within the company and reduce the training/serving data skew problem.

During the presentation I will show (1) how it is possible to design production pipelines in a way that allows DS to build and deploy them without the help of a developer, (2) why constructing online features is a much more complex job than offline construction, and why, business-wise, it is not always a priority to invest in their construction even if they are proven to be beneficial to model performance, and (3) whether the cost of using advanced ML algorithms decreases much faster than the cost of preparing the data.
Keywords: A/B testing, Spark, Kafka, Cassandra, Hive, ML

Roman Studenikin

Software Developer, Booking.com

-

We are waiting for the speaker's confirmation

-

Near Real-Time Fraud Detection in Telecommunication Industry

Fraud is a common pain point in the telecom sector, and detecting it is like finding a needle in a haystack due to the volume and velocity of data. There are 2 key factors in detecting fraud:

(1) Speed: If you can’t detect fraud in time, you’re doomed to lose, because the fraudsters have already got what they need. SIM box detection is one use case for this. Fraudsters use SIM boxes to bypass interconnection fees. Here we will discuss our real-time architecture, which uses Spark SQL to detect SIM boxes within 5 minutes.

(2) Accuracy: Fraudsters change their methods all the time, so our job is to identify their behaviour accurately using machine learning algorithms. Anomaly detection is one use case for this. Here we will discuss our data mining architecture, which builds fraud models using Spark ML within 1 hour. We will also discuss the performance of some ML algorithms on Spark, such as K-means, the three-sigma rule and t-digest, among others.

To achieve both, we process 8-10 billion records, amounting to 4-5 TB, every day. Our solution combines end-to-end data ingestion, processing and mining of high-volume data to detect several fraud use cases in near real time using CDR and IPTDR, saving millions and improving the user experience.
Keywords: fraud detection, realtime processing, Spark SQL, Spark ML, Machine Learning Algorithms

Burak Işıklı

Software Engineer, Turkcell

-

Elephants in the cloud or how to become cloud ready

The way you operate your Big Data environment is not going to be the same anymore. This session is based on our experience managing on-premise environments and on lessons from innovative data-driven companies that successfully migrated their multi-PB Hadoop clusters. We will cover where to start and what decisions you have to make to gradually become cloud ready. The examples refer to Google Cloud Platform, yet the challenges are common.
Keywords: hadoop, private cloud, google compute platform, migration, hybrid platforms

Krzysztof Adamski

Big Data Architect, GetInData

-

Software Engineer in the world of Machine Learning

Using the example of one of Ocado’s ML projects, called Order Forecasting, I will explain how good old software engineering enables the success of ML projects.

Although large-scale ML requires new tricks and a new way of thinking, things like testing, continuous integration, reproducibility, monitoring and ease of maintenance are now more important than ever. It’s something we had to learn at Ocado the hard way, and hopefully you will avoid all the traps along the way by leveraging our experience.
Keywords: machine learning, software engineering, google cloud platform, user story

Przemysław Pastuszka

Machine Learning Engineer, Ocado Technology

-

Deriving Actionable Insights from High Volume Media Streams

In this talk we describe how to analyze high volumes of real-time streams of news feeds, social media and blogs in a scalable and distributed way using Apache Flink and Natural Language Processing tools like Apache OpenNLP to perform common NLP tasks like Named Entity Recognition (NER), chunking, and text classification.
Keywords: nlp, streaming, news, machine learning

Jörn Kottmann

Senior Software Developer, Sandstone SA

Peter Thygesen

Partner & Senior Software Engineer, Paqle A/S

-

We are waiting for the speaker's confirmation

-

Airflow as a Service

Oozie is still a popular workflow scheduler for Hadoop. It is a good choice if you like programming in XML files. Engineers at Allegro don’t.

Apache Airflow allows configuration as code, which is useful for workflow versioning and the dev/test/prod release cycle. In this talk we present our approach to Airflow as a Service. This includes automatically setting up an Airflow cluster on demand; running Airflow on Docker and Mesos; implementing common operators; collaborative work; automatic tests and deployment; and lots of other real-life issues we have solved in order to make it work out of the box for dozens of our analysts, data scientists and developers. This concept can be easily generalized to other data services, such as Jupyter Notebooks.
Keywords: Workflow, Automation, Orchestration, Docker

Robert Mroczkowski

Data Platform Engineer and Technical Owner of Hadoop Cluster, Grupa Allegro

-

We are waiting for the speaker's confirmation

-

Data Science Lessons I have learned in 5 years

Since 2013 I have been working as a Data Scientist, one of today’s hottest jobs in the IT industry. During this time, I have had the opportunity to experience the evolution of the data science landscape and to see what worked and what didn’t.

In this presentation, I will share some of the most valuable lessons from the past 5 years, such as the foundations for building a data science team, efficient ways for data scientists to work with other teams, skills that data scientists should have, and common fallacies in data science work.
Keywords: Data Science, Data Scientist, teamwork, work skills

Boxun Zhang

Sr. Data Scientist, GoEuro

-

Design Patterns for Calculating User Profiles in Real Time

At mobile.de, Germany’s largest online vehicle marketplace, we calculate user profiles in real time to optimize the user journey on the e-marketplace platform by presenting relevant products to the user and by improving the relevance of search results. This presentation will discuss possible architecture designs and choices for addressing this challenge using popular open-source stream processing solutions.
Keywords: Big Data, Stateful Stream Processing

Igor Mazor

Senior Data Engineer, mobile.de

15.55 – 17.30 Roundtable sessions

15.55 - 16.00

Intro

Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to their group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion; hosts are selected professionals with vast knowledge and experience.

17.30 - 17.45

Coffee break

17.45 - 18.15

Panel discussion - Getting more out of your data in 2018

Building an efficient Big Data platform and mining large volumes of data seems to be a never-ending story for data-driven companies. It’s an ongoing journey with many pitfalls, twists and an unclear future. Each year, there is something that changes the game, brings new value, promises rewards or wastes our time. During this panel, our experts will talk about their plans and hopes for 2018: small and big improvements to their big data strategy that will help them get more out of data in 2018. This includes new technologies that are gaining significant adoption, new use-cases that are becoming mainstream, and new challenges that more and more companies face. The discussion won’t be about the distant future, but about actions that you can take in 2018. The topics might cover migration to cloud, Hadoop 3.0, streaming ETL & ML, ML at scale, data privacy, fast data and more.

-

Host:

Adam Kawa

CEO and Co-founder, GetInData

18.15 - 18.30

Closing & Summary

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

CEO and Co-founder, GetInData

19.00 - 22.00

Networking party for all participants and speakers

Estimated rank of the presentation, where 1 = very technical and 5 = mostly business related