The conference (February 9) and workshops* (February 8) will take place at the Sheraton Warsaw Hotel.

*Because of the high interest in the “Introduction to Big Data Technology” and “Data Science” workshops, we are launching an additional date – February 7, 2017.

At the end of the workshops (February 8), we would like to invite all attendees to an informal evening meeting at Browarmia. The party starts at 6:00 PM.

AGENDA

8.15 - 9.00

Registration and coffee

9.00 - 9.15

Conference opening

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

Data Engineer and Founder, GetInData

9.15 - 9.45

TBD

Presentation subject will be published soon
Google representative (TBD)

9.45 - 10.15

Machine Learning the Product

A/B testing is a popular method for learning about your product. However, with traditional A/B testing techniques we can only learn from an A/B test in a rather superficial way – we can measure the size of an effect, but often don’t know the cause of that effect. In this presentation, I will introduce a different, machine-learning approach used at Spotify for analyzing A/B tests, aiming to reveal the cause of an effect and maximize learning.
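
To make the limitation concrete, here is a hedged sketch of a traditional A/B analysis – it quantifies the effect but says nothing about its cause (all numbers hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical per-user engagement metrics from two A/B test groups.
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=3.0, size=5000)
treatment = rng.normal(loc=10.4, scale=3.0, size=5000)

# Traditional analysis: effect size and significance -- the "what".
lift = treatment.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"lift={lift:.3f}, p={p_value:.4f}")

# What this does NOT tell us is the "why": which user segments or
# behaviours drive the lift -- the question a machine-learning analysis
# (e.g. modelling the treatment effect against user features) addresses.
```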

Boxun Zhang

Data Scientist, Spotify

10.15 - 10.45

Managing the Margins: Big Data case study - Prescriptive Analysis for Semiconductor Manufacturing

The semiconductor industry is the backbone of the digital age. Sector innovations drive the ability to do more on ever smaller machines, but perhaps equally important is the ability to optimize the manufacturing processes. For example, in the digital printing of semiconductor components, a one-in-a-billion droplet failure rate may sound acceptable. It is less so when you consider that up to 50 million droplets can be pushed per second, which translates into an unacceptable defect rate of one every 20 seconds. Pre-emptive analytics on streaming sensor and image data play a key role in finding indications of where and when defects are looming. This presentation will focus on an industry use case for combining SAS and open source analytics to tackle these essential big data challenges, and will also provide some insights on applications in other sectors.
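
The quoted defect rate follows from simple arithmetic, sketched below as a sanity check:

```python
droplets_per_second = 50_000_000   # up to 50 million droplets per second
failure_rate = 1e-9                # one-in-a-billion droplet failures

defects_per_second = droplets_per_second * failure_rate   # 0.05
seconds_per_defect = 1 / defects_per_second               # 20.0
print(f"one defect every {seconds_per_defect:.0f} seconds")
```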

Sascha Schubert

SAS Institute

10.45 - 11.15

Coffee break

Simultaneous sessions

Operations & Deployment

This track is dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premises and in the cloud.

Data Application Development

This track is the place for developers to learn about tools, techniques and innovative solutions to collect and process large volumes of data. It covers topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed datastores and more.

Analytics & Data Science

This track includes real case studies demonstrating how Big Data is used to address a wide range of business problems. You will find here talks about large-scale Machine Learning, A/B tests, data visualization, as well as various analyses that enable making data-driven decisions and feed the personalized features of data-driven products.

Real-Time Processing

This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

11.15 - 11.45

Creating Redundancy for Big Hadoop Clusters is Hard

Criteo had a Hadoop cluster with 39 PB of raw storage, 13,404 CPUs, 105 TB of RAM, 40 TB of data imported per day and over 100,000 jobs per day. This cluster was critical in both storage and compute, but had no backups. After much effort to increase our redundancy, we now have two clusters that, combined, have more than 2,000 nodes, 130 PB, two different versions of Hadoop and 200,000 jobs per day – but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices we made and the issues we solved in creating a 1,200-node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy, and our plans to improve our capacity to handle the loss of a data centre.
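
For readers unfamiliar with the tooling involved: bulk replication between two Hadoop clusters is commonly done with DistCp, which runs the copy as a distributed MapReduce job. A hedged sketch (cluster names and paths hypothetical; not necessarily how Criteo does it):

```python
import subprocess

# Hypothetical cross-cluster copy with DistCp; -update copies only
# changed files, -p preserves file attributes. Paths are made up.
subprocess.run([
    "hadoop", "distcp", "-update", "-p",
    "hdfs://cluster-old/data/events/2017-02-08",
    "hdfs://cluster-new/data/events/2017-02-08",
], check=True)
```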
Stuart Pook

Senior DevOps Engineer, Criteo

11.15 - 11.45

DataOps or how I learned to love production

A plethora of data processing tools, most of them open source, is available to us. But who actually runs data pipelines? What about dynamically allocating resources to data pipeline components? In this talk we will discuss options to operate elastic data pipelines with modern, cloud-native platforms such as DC/OS with Apache Mesos, Kubernetes and Docker Swarm. We will review good practices, from containerizing workloads to making things resilient, and show elastic data pipelines in action.
Michael Hausenblas

Developer Advocate, Mesosphere

11.15 - 11.45

Alchemists 2.0: Turning data into gold

How to bring money to the table with Data Science. Practical examples of Data Science “in action” from recent projects. When to use Linear Regression vs XGBoost in business applications. What is the monetary impact of using Data Science?
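
A hedged sketch of the kind of comparison the talk refers to, using scikit-learn and XGBoost on synthetic data (all parameters hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Hypothetical dataset standing in for a business problem.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# Interpretable baseline: linear regression.
linear_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

# Gradient-boosted trees: usually stronger on non-linear interactions,
# at the cost of interpretability.
xgb_r2 = cross_val_score(XGBRegressor(n_estimators=200, max_depth=4),
                         X, y, cv=5, scoring="r2").mean()

print(f"linear R^2: {linear_r2:.3f}, XGBoost R^2: {xgb_r2:.3f}")
```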
Paweł Godula

Expert, BCG Gamma

11.15 - 11.45

Real-Time Data Processing at RTB House – Architecture & Lessons Learned

Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid requests and generates 20K events every second, which gives 3 TB of data every day. For machine learning, system monitoring and financial settlements, we need to filter, store, aggregate and join these events together. As a result, processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding are the business requirements: events that should be joined together can appear up to 30 days apart, we are not allowed to create any duplicates, we have to minimize possible data losses, and there cannot be any differences between generated data outputs. We have designed and implemented a solution which has reduced the delay of availability of this data from 1 day to 15 seconds.

We will present: 1. our first approach to the problem (end-of-day batch jobs) and the final solution (real-time stream processing), 2. a detailed description of the current architecture, 3. how we tested the new data flow before it was deployed and how it is being monitored now, 4. our one-click deployment process, 5. the decisions we made, with their advantages and disadvantages, and our future plans to improve the current solution.

We would like to share our experience with scaling the solution over clusters of computers in several data centers. We will focus on the current architecture, but also on testing and monitoring issues and our deployment process. Finally, we would like to provide an overview of the projects involved, such as Kafka, MirrorMaker, Storm, Aerospike, Flume, Docker etc. We will describe what we have gained from these open-source projects and some problems we have come across.
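
To make the join-and-deduplicate requirement concrete, here is a deliberately simplified, single-machine sketch of the logic (event fields hypothetical; the production system implements this at scale with Kafka and Storm):

```python
from datetime import timedelta

JOIN_WINDOW = timedelta(days=30)   # matching events may arrive up to 30 days apart

seen_ids = set()     # deduplication by unique event id
pending_bids = {}    # bid_id -> (timestamp, bid event)

def process(event):
    """Join click events to their bid events, without duplicates."""
    if event["id"] in seen_ids:          # drop duplicate deliveries
        return None
    seen_ids.add(event["id"])

    if event["type"] == "bid":
        pending_bids[event["bid_id"]] = (event["ts"], event)
        return None

    if event["type"] == "click":
        match = pending_bids.get(event["bid_id"])
        if match and event["ts"] - match[0] <= JOIN_WINDOW:
            return {"bid": match[1], "click": event}   # joined record
    return None
```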
Bartosz Łoś

Software Developer, RTB House

11.45 - 11.50

Technical break

11.50 - 12.20

Scalable Analytics for Microservices Architecture

Avito is the third biggest classifieds site in the world, after Craigslist and China’s 58.com. Avito nowadays is not a single website, but dozens of specialized vertical sites and applications.

All sites and applications are being transformed into a microservices architecture, spawning hundreds of microservices. It is critical to implement a common BI infrastructure, able to collect, process, combine and analyse data from all microservices, regardless of new ones being launched and old ones being switched off or changed. The analytics of Avito is based on the HP Vertica MPP database, a highly normalized data lake and an asynchronous event bus. All those tools give Avito the ability to use all types of Machine Learning and Reporting tools to manage its sites, applications and microservices.
Nikolay Golov

Chief Data Warehousing Architect, Avito

11.50 - 12.20

Ab initio magnus data

Presentation description will be published soon.

Firat Tekiner

Data Scientist and Big Data Architect, AB Initio

11.50 - 12.20

SAS Viya – the fundamentals of analytics architecture of the future

Since the inception of modern analytical platforms, companies have been trying to out-smart each other to perform analytics faster than ever.

SAS Institute has been leading the analytics industry for over 40 years in the area of advanced analytics, with innovations including MVA, In-Database and In-Memory computing. SAS has recently released its 3rd-generation In-Memory platform, SAS Viya, designed from the ground up for scalable analytics to solve the problems of the future, powered by the CAS (Cloud Analytic Services) server. This session will give you an overview of the new and exciting features of SAS Viya and CAS, and how it differs from some of the other in-memory platforms on the market. We will discuss scalability, memory management, Hadoop infrastructure integration, and integration with open source tools like Python, R and others.
Muhammad Asif Abbasi

Principal Business Solutions Manager, SAS Institute

11.50 - 12.20

Use-cases where Flink is better than technologies like Hive, Spark, Spark Streaming and why

While there are many popular open-source technologies for processing large datasets, Apache Flink is the one that excites me the most. Not because it provides sub-second latency at scale, exactly-once semantics or a single solution for batch and stream processing, but because… it lets you accurately process your data with little effort – something that is hard, or usually ignored, with Spark, Storm, Hive or Scalding. In this talk I will explain the unique capabilities, ideas and design patterns of Flink & Kafka for accurate and simplified stream processing in batch and real-time.
Adam Kawa

Data Engineer and Founder, GetInData

12.20 - 12.25

Technical break

12.25 - 12.55

Spotify’s Event Delivery

Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. Over the last few years we have had phenomenal growth, which has now pushed our backend infrastructure out of our data centers and into the cloud. Earlier this year we announced that we are transitioning all of our backend into the Google Cloud Platform (GCP).

Our event delivery system is a key component in our data infrastructure that delivers billions of events per day, with predictable latency and a well-defined interface for our developers. This data is used to produce Discover Weekly, Spotify Party, Year in Music and many more Spotify features. This talk will focus on the evolution of the event delivery service and the lessons learned, and present the design of our new system based on Google Cloud Platform technologies.
Nelson Arapé

Backend Developer, Spotify

12.25 - 12.55

Enabling 'Log Everything' at Skyscanner

Skyscanner is a leading global travel search site offering a comprehensive and free flight search service as well as online comparisons for hotels and car hire. We believe that data should be at the heart of every decision at Skyscanner, so it’s important that our engineers have the tools to seamlessly log the data that will help them with those decisions. In this talk, we discuss the approach we’ve taken to enable this and reflect on some of the challenges and lessons learnt. Technologies used include Kafka, Logstash, Elasticsearch, Secor, AWS (S3, Lambda), Samza, Protocol Buffers and others.
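
A minimal sketch of what a “log everything” entry point can look like, assuming the kafka-python client (topic name and event schema hypothetical; the real pipeline also involves Protocol Buffers, Secor and S3):

```python
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

# Hypothetical producer: engineers emit structured events to a Kafka
# topic, from where the pipeline fans out to S3, Elasticsearch, etc.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

producer.send("app-events", {
    "event": "search_performed",
    "ts": datetime.now(timezone.utc).isoformat(),
    "origin": "EDI", "destination": "WAW",
})
producer.flush()
```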
Arthur Vivian

Software Engineer, Skyscanner

Robin Tweedie

Senior Software Engineer, Skyscanner

12.25 - 12.55

Meta-Experimentation at Etsy

Experimentation abounds, but how do we test our tests? I’ll share some ways we at Etsy proved our experimentation methods broken, and the approach we took to fixing them. I’ll discuss multiple ways of running A/A tests (as opposed to A/B tests), and a statistical method called bootstrapping, which we used to remedy our experiment analysis.
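
A hedged sketch of the bootstrapping idea the abstract mentions – resampling an observed metric to estimate its sampling distribution empirically (data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-visitor conversions from one experiment bucket.
observed = rng.binomial(1, 0.031, size=20_000)

# Bootstrap: resample with replacement many times and look at the
# spread of the recomputed metric.
boot_means = np.array([
    rng.choice(observed, size=observed.size, replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={observed.mean():.4f}, 95% bootstrap CI=({lo:.4f}, {hi:.4f})")

# In an A/A test, a well-calibrated method should flag a "significant"
# difference about 5% of the time at alpha = 0.05 -- much more than
# that signals a broken analysis.
```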
Emily Sommer

Software Engineer, Etsy

12.25 - 12.55

Stream Analytics with SQL on Apache Flink

SQL is undoubtedly the most widely used language for data analytics, for many good reasons. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and, last but not least, it is the standard that everybody knows and uses. With stream processing technology becoming mainstream, a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax were not designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL, which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages, or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools.

Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path, based on Apache Calcite.

In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs is the dynamic table. We explain how streams are converted into dynamic tables and vice versa without losing information, due to the stream-table duality. Relational queries on dynamic tables behave similarly to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams, or are written as materialized views to external systems such as Apache Kafka or Apache Cassandra and updated in place with low latency. We conclude our talk by demonstrating the power and expressiveness of Flink’s relational APIs, presenting how common stream analytics use cases can be realized.
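
As a conceptual illustration (not Flink’s API), the stream-table duality can be sketched in a few lines: fold a changelog stream into a materialized table, and let a continuous query emit its own changelog as its result changes.

```python
# Toy illustration of the stream <-> table duality: a changelog stream
# of (op, key, value) records is folded into a table, and a continuous
# "SELECT value, COUNT(*) GROUP BY value" query emits its own changelog.
table = {}    # materialized state of the input stream: key -> value
counts = {}   # materialized query result: value -> count

def apply_change(op, key, value=None):
    """Apply one changelog record; return updates to the query result."""
    updates = []
    if op == "upsert":
        old = table.get(key)
        if old == value:
            return updates                      # no-op
        if old is not None:                     # retract the old value
            counts[old] -= 1
            updates.append(("upsert", old, counts[old]))
        table[key] = value
        counts[value] = counts.get(value, 0) + 1
        updates.append(("upsert", value, counts[value]))
    elif op == "delete" and key in table:
        old = table.pop(key)
        counts[old] -= 1
        updates.append(("upsert", old, counts[old]))
    return updates                              # result changelog records

print(apply_change("upsert", "user1", "rock"))  # [('upsert', 'rock', 1)]
print(apply_change("upsert", "user1", "jazz"))  # retraction + new count
```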
Fabian Hueske

Software Engineer, data Artisans

12.55 - 13.50

Lunch

Operations & Deployment

Data Application Development

Analytics & Data Science

Real-Time Processing

13.50 - 14.20

Key challenges in building large distributed full-text search systems based on Apache Solr and Elasticsearch

Large distributed search platforms are built on the two most popular search engines: Apache Solr and Elasticsearch. For a long time now, these two technologies have been able to do much more than full-text search. They are scalable and highly productive NoSQL (document-oriented) databases, able to store massive amounts of data and serve vast numbers of requests. This is why we can discuss Solr and Elasticsearch in terms of big data projects. Let’s discuss the challenges connected with indexing and searching data, configuring clusters, scaling them and distributing them between data centers. The presentation will give an overview of available features and issues, but it won’t be another comparison of Solr and Elasticsearch. Both technologies are well-proven software, and instead of favoring one of them I would like to present all their possibilities.
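
For orientation, a minimal full-text query against Elasticsearch using its official Python client (host, index and field names hypothetical; an equivalent Solr request differs mainly in syntax):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # hypothetical single node

# Full-text search with a filter -- the kind of query both engines
# serve at scale when used as document-oriented NoSQL stores.
result = es.search(index="products", body={
    "query": {
        "bool": {
            "must": {"match": {"title": "wireless headphones"}},
            "filter": {"range": {"price": {"lte": 100}}},
        }
    }
})
print(result["hits"]["total"])
```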
Tomasz Sobczak

Senior Consultant, Findwise

13.50 - 14.20

One Jupyter to rule them all

If you tell your colleagues you develop Hadoop applications, they probably consider you a geek who knows Java, MapReduce, Scala and a lot of APIs for submitting, scheduling and monitoring jobs – and who, of course, is a Kerberos expert. This might have been quite true a few years ago, but nowadays the Big Data ecosystem contains many tools that enable Big Data for everyone, including non-technical people. At Allegro we have simplified the way of creating applications that gain value from datasets. See how we maintain the full development process, from the very first line of code to production deployment; in particular how we: develop and maintain code inside Jupyter, using pySpark as the Big Data framework; store the codebase in git repositories and perform the code-review process; create and maintain unit tests and integration tests for pySpark applications; and schedule and monitor these processes on a Hadoop cluster. We will also explain why using the CLI for Big Data is becoming obsolete.
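
As a hedged illustration of one practice mentioned above – unit tests for pySpark code – a minimal pytest sketch (transformation and data hypothetical, not Allegro’s actual code):

```python
import pytest
from pyspark.sql import SparkSession

def add_revenue(df):
    """Transformation under test: revenue = price * quantity."""
    return df.withColumn("revenue", df.price * df.quantity)

@pytest.fixture(scope="session")
def spark():
    # Local Spark session -- no cluster needed to test the logic.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_add_revenue(spark):
    df = spark.createDataFrame([(2.0, 3), (5.0, 1)], ["price", "quantity"])
    rows = add_revenue(df).collect()
    assert [r.revenue for r in rows] == [6.0, 5.0]
```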
Mariusz Strzelecki

Senior Data Engineer, Allegro Group

13.50 - 14.20

H2O Deep Water - Making Deep Learning Accessible to Everyone

Deep Water is H2O’s integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience, using H2O’s R/Python/Flow (Web) interfaces.
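
A minimal sketch of the H2O Python workflow such a demo builds on; shown here with H2O’s standard deep learning estimator, since Deep Water exposes a similar estimator-style API with TensorFlow/MXNet/Caffe backends (file and column names hypothetical):

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # starts or connects to a local H2O cluster

# Hypothetical CSV with a binary "label" column.
frame = h2o.import_file("train.csv")
frame["label"] = frame["label"].asfactor()

model = H2ODeepLearningEstimator(hidden=[64, 64], epochs=10)
model.train(y="label", training_frame=frame)
print(model.auc())
```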
Jo-fai Chow

Data Scientist, H2O.ai

13.50 - 14.20

Blink - Alibaba's Improvements to Flink

A large portion of transactions on Alibaba’s e-commerce Taobao platform is initiated through its Alibaba Search engine. Real-time stream processing is one of the cornerstones of Alibaba’s search infrastructure. Among all the streaming solutions, Flink is the closest to meeting our requirements. In this talk, we present the design and implementation of Blink, an improved runtime engine for Flink that is better integrated with YARN. It also addresses various scale and reliability issues we encountered in production. Since the changes are at the runtime layer, Blink is fully compatible with the Flink API and its machine learning libraries. We will also share our experience of running Blink in production in a Hadoop cluster of more than one thousand servers in Alibaba Search. We are actively working with the community to contribute the changes back to Apache Flink.
Xiaowei Jiang

Senior Director, Alibaba Search Division

14.20 - 14.25

Technical break

14.25 - 14.55

ING CoreIntel - collect and process network logs across data centers in near realtime

Krzysztof Adamski

Solutions Architect (Big Data), ING Services Poland

Krzysztof Żmij

Expert IT / Hadoop, ING Services Poland

14.25 - 14.55

Orchestrating Big Data pipelines @ Fandom

Fandom is the largest entertainment fan site in the world. With more than 360,000 fan communities and a global audience of over 190 million monthly uniques, we are the fan’s voice in entertainment. Being the largest entertainment site, Wikia generates massive volumes of data, ranging from clickstream, user activities, API requests, ad delivery, A/B testing and much more. The big challenge is not just the volume but the orchestration involved in combining various sources of data with varying periodicity and volumes, and making sure the processed data is available to consumers within the expected time – thus helping to gain the right insights at the right time. A conscious decision was made to choose the right open source tool to solve the problem of orchestration; after evaluating various tools, we decided to use Apache Airflow. This presentation will give an overview and comparison of existing tools and explain why we chose Airflow, and how Airflow is being used to create a stable, reliable orchestration platform that enables non-data-engineers to seamlessly access data by democratizing it. We will focus on some tricks and best practices for developing workflows with Airflow and show how we are using some of its features, as sketched below.
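
A hedged sketch of the kind of Airflow DAG such a platform schedules, using the Airflow 1.x API that was current at the time (task logic and names hypothetical):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Daily pipeline: ingest a clickstream partition, then aggregate it.
dag = DAG("clickstream_daily", default_args=default_args,
          start_date=datetime(2017, 1, 1), schedule_interval="@daily")

ingest = BashOperator(task_id="ingest",
                      bash_command="echo ingest {{ ds }}", dag=dag)
aggregate = BashOperator(task_id="aggregate",
                         bash_command="echo aggregate {{ ds }}", dag=dag)

ingest >> aggregate   # aggregate runs only after ingest succeeds
```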
Krystian Mistrzak

Data Engineer, Fandom powered by Wikia

Thejas Murthy

Data Engineer, Fandom powered by Wikia

14.25 - 14.55

Big data in genomics

Marek Wiewiórka

Solution Architect, GetInData

Genomic population studies have the storing, analyzing and interpretation of various kinds of genomic variants as their central issue. When the exomes and genomes of thousands of patients are being sequenced, there is a growing need for efficient database storage systems, querying engines and powerful tools for statistical analyses. Scalable big data solutions such as Apache Impala, Apache Kudu, Apache Phoenix or Apache Kylin can address many of the challenges in large-scale genomic analyses. The presentation will cover some of the lessons learned from a project aiming at creating a data warehousing solution for storing and analyzing genomic variant information at the Department of Medical Genetics of the Medical University of Warsaw. An overview of the existing big data projects for analyzing data from next-generation sequencing will be given as well. The presentation will conclude with a brief summary and a discussion of future directions.

14.25 - 14.55

Real-Time AdTech reporting & targeting with Apache Apex

AdTech companies need to address data growing at breakneck speed, along with customer demands for insights and analytical reports. At PubMatic we receive billions of events and several TBs of data per day from various geographic regions. This high-volume data needs to be processed in real time to derive actionable insights such as campaign decisions and audience targeting, and to provide a feedback loop to the AdServer for making efficient ad serving decisions. In this talk we will share how we designed and implemented these scalable, low-latency, real-time data processing solutions for our use cases using Apache Apex.
Ashish Tadose

Senior Data Architect, PubMatic

14.55 - 15.00

Technical break

15.00 - 15.30

TBD

15.00 - 15.30

TBD

Sky Bet is one of the largest UK online bookmakers and introduced a Hadoop platform four years ago. This session explains how the platform addresses two common problems in the gambling industry – knowing your current liability position and helping potentially irresponsible gamblers before they identify themselves. These use cases are linked by a common need for data from the same source systems, and highlight the different uses of the data that can co-exist on a shared Hadoop cluster. The journey of replacing a traditional data warehouse with the promised land of Hadoop will be explained, without forgetting the wrong turns and slips made along the way – this is no idealistic proof-of-concept talk; real-world implementations are difficult. The journey starts with the first use case, meeting the needs of sportsbook traders to manage liabilities in a competitive and high-frequency environment, and how that led, years later, to completely decommissioning the legacy data warehouse. The platform has evolved to support a Data Science team and the ability to create predictive models that warn of potentially irresponsible gamblers. This more recent use case illustrates a completely different way of using the same data, and how the engineering approach accommodates it. There is no code in this talk; the aim is to explain how a real-world system delivered real-world use cases, and the teams needed to deliver them.
Mark Pybus

Head of Data Engineering, Sky Betting & Gaming

15.00 - 15.30

TBD

15.00 - 15.30

Hopsworks: Secure Streaming-as-a-Service with Kafka/Flink/Spark

Since June 2016, Kafka, Spark and Flink-as-a-service have been available to researchers and companies in Sweden from the Swedish ICT SICS Data Center at www.hops.site, using the HopsWorks platform (www.hops.io). Flink and Spark applications are run within a project on a YARN cluster, with the novel property that applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running streaming applications, how we use Grafana and Graphite for monitoring streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of October 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. Hopsworks is entirely UI-driven, with an Apache v2 open source license.

15.30 - 15.55

Coffee break

ROUNDTABLE SESSIONS

15.55 - 16.00

Intro

Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to their group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion – selected professionals with vast knowledge and experience.

16.00 - 16.45

Round I

16.45 - 17.30

Round II

17.30 - 17.45

Coffee break

17.45 - 18.15

Panel discussion – Big Data implementations: how to achieve a justified ROI

Big Data brings a lot of promises about potential benefits, but experience proves it’s not always so easy. How to make Big Data projects great? How to get quick wins? How to avoid expensive mistakes? How to communicate with others – the business side or a client – to make a project viable? What are the major success factors, and where are the easily missed obstacles that can derail Big Data projects?

18.15 - 18.30

Closing & Summary

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

Data Engineer and Founder, GetInData