The conference (February 9) and workshops (February 7, 8, 10) will take place at the Sheraton Warsaw Hotel.

 

EVENING PARTY

At the end of the workshops (February 8) we would like to invite all attendees
to an informal evening meeting at Browarmia. The party starts at 6:00 PM.

AGENDA

8.00 - 9.00

Registration and coffee

9.00 - 9.15

Conference opening

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

CEO and Co-founder, GetInData

9.15 - 9.45

The data horizon 2017: Vision of Google Team

How the cloud can address big data tasks: practical applications of big data analytics and machine learning inside and outside Google. Our vision of the cloud.

Magdalena Dziewguć

Google

Michał Sapiński

Software engineer, Google

9.45 - 10.15

Meta-Experimentation at Etsy

Experimentation abounds, but how do we test our tests? I’ll share some ways we at Etsy proved our experimentation methods broken, and the approach we took to fixing them. I’ll discuss multiple ways of running A/A tests (as opposed to A/B tests), and a statistical method called bootstrapping, which we used to remedy our experiment analysis.

Emily Sommer

Software Engineer, Etsy

10.15 - 10.45

Managing the Margins: Big Data case study - Prescriptive Analysis for Semiconductor Manufacturing

The semiconductor industry is the backbone of the digital age. Sector innovations drive the ability to do more on ever smaller machines, but perhaps equally important is the ability to optimize the manufacturing processes. For example, in the digital printing of semiconductor components, a 1-in-a-billion failure rate for droplets may sound acceptable. This is less so when you consider that up to 50 million droplets can be pushed per second, leading to an unacceptable defect rate of one every 20 seconds. Pre-emptive analytics on streaming sensor and image data play a key role in finding indications of where and when defects are looming. This presentation will focus on an industry use case for combining SAS and open source analytics to tackle these essential big data challenges, and will also provide some insights on applications in other sectors.

Sascha Schubert

Advisory Business Solutions Manager, Global Technology Practice, SAS Institute

10.45 - 11.15

Coffee break

Simultaneous sessions

Operations & Deployment

This track is dedicated to system administrators and people with DevOps skills who are interested in technologies and best practices for planning, installing, managing and securing their Big Data infrastructure in enterprise environments – both on-premises and in the cloud.

Data Application Development

This track is the place for developers to learn about tools, techniques and innovative solutions to collect and process large volumes of data. It covers topics like data ingestion, ETL, process scheduling, metadata and schema management, distributed datastores and more.

Analytics & Data Science

This track includes real case studies demonstrating how Big Data is used to address a wide range of business problems. Here you can find talks about large-scale Machine Learning, A/B tests and data visualization, as well as various analyses that enable data-driven decisions and feed personalized features of data-driven products.

Real-Time Processing

This track covers technologies, strategies and use-cases for real-time data ingestion and deriving real-time actionable insights from the flow of events coming from sensors, devices, users, and front-end systems.

Session chairs

Piotr Bednarek

Hadoop Administrator, GetInData

Piotr Krewski

Big Data Consultant and Co-founder, GetInData

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Klaudia Zduńczyk

Business Development Specialist, GetInData

11.15 - 11.45

That won’t fit into RAM

SentiOne is one of the leading solutions in Europe for social media listening and analysis. We monitor over 26 European markets including CEE, Scandinavia, DACH, and the Balkans. The amount of data that is processed every day and is ready to be queried by our users is enormous. Over the years we have tested many technologies and approaches in big data, many of which have failed. The presentation covers our experiences and lessons learned from setting up a big data company from scratch. I will give details on configuring a robust Elasticsearch cluster with over 26 TB of data and describe key challenges in efficient web crawling and data extraction.

Michał Brzezicki

Vice President of the Management Board, SentiOne

11.15 - 11.45

Enabling 'Log Everything' at Skyscanner

Skyscanner is a leading global travel search site offering a comprehensive and free flight search service as well as online comparisons for hotels and car hire. We believe that data should be at the heart of every decision at Skyscanner, so it’s important that our engineers have the tools to seamlessly log the data that will help them with those decisions. In this talk, we discuss the approach we’ve taken to enable this and reflect on some of the challenges and lessons learnt. Technologies used include Kafka, Logstash, Elasticsearch, Secor, AWS (S3, Lambda), Samza, Protocol Buffers and others.

Robin Tweedie

Senior Software Engineer, Skyscanner

Arthur Vivian

Software Engineer, Skyscanner

11.15 - 11.45

Alchemists 2.0: Turning data into gold

How to bring money to the table with Data Science. Practical examples of Data Science “in action” from recent projects. When to use Linear Regression vs XGBoost in business applications. What is the monetary impact of using Data Science?

Paweł Godula

Senior Data Scientist, BCG Gamma

11.15 - 11.45

Real-Time Data Processing at RTB House – Architecture & Lessons Learned

Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid requests and generates 20K events every second, which amounts to 3 TB of data every day. For machine learning, system monitoring and financial settlements we need to filter, store, aggregate and join these events together. As a result, processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding are the business requirements: events that should be joined together can appear 30 days apart, we are not allowed to create any duplicates, we have to minimize possible data losses, and there must not be any differences between the generated data outputs. We have designed and implemented a solution which has reduced the delay in the availability of this data from 1 day to 15 seconds.

We will present:
1. our first approach to the problem (end-of-day batch jobs) and the final solution (real-time stream processing),
2. a detailed description of the current architecture,
3. how we tested the new data flow before it was deployed and how it is monitored now,
4. our one-click deployment process,
5. the decisions we made, with their advantages and disadvantages, and our future plans to improve the current solution.

We would like to share our experience of scaling the solution across clusters of computers in several data centers. We will focus on the current architecture, but also on testing and monitoring issues in our deployment process. Finally, we would like to provide an overview of the projects involved, such as Kafka, MirrorMaker, Storm, Aerospike, Flume, Docker etc. We will describe what we have achieved with these open source projects and some problems we have come across.

Bartosz Łoś

Software Developer, RTB House

11.45 - 11.50

Technical break

11.50 - 12.20

Scalable Analytics for Microservices Architecture

Avito is the third biggest classified site in the world after Craigslist and 58.com from China. Nowadays Avito is not a monolithic project, but comprises dozens of specialized vertical sites and applications.

The introduction of a microservice architecture at Avito spawned hundreds of new services. In this situation it is critical to implement a common BI infrastructure that is able to collect, process, combine and analyse data from all those microservices and is resilient to constant changes.

Avito Analytics is based on the HP Vertica MPP database, a highly normalized data lake and an asynchronous event bus. Those tools give Avito the ability to use all types of Machine Learning and Reporting tools, and to manage sites, applications and microservices.

Avito is the Russian OLX. Moreover, Avito and OLX are nowadays both part of the Naspers group; we do the same business in different countries and share experience.

Nikolay Golov

Chief Data Warehousing Architect, Avito

11.50 - 12.20

DataOps or how I learned to love production

A plethora of data processing tools, most of them open source, is available to us. But who actually runs data pipelines? What about dynamically allocating resources to data pipeline components? In this talk we will discuss options to operate elastic data pipelines with modern, cloud native platforms such as DC/OS with Apache Mesos, Kubernetes and Docker Swarm. We will review good practices, from containerizing workloads to making things resilient, and show elastic data pipelines in action.

Michael Hausenblas

Developer Advocate, Mesosphere

11.50 - 12.20

SAS Viya – the fundamentals of analytics architecture of the future

Since the inception of modern analytical platforms, companies have been trying to out-smart each other to perform analytics faster than ever. SAS Institute has been leading the analytics industry for over 40 years in the area of advanced analytics, with innovations including MVA, In-Database and In-Memory computing. SAS has recently released its 3rd Generation In-Memory platform, SAS Viya, powered by the CAS (Cloud Analytics Services) server and designed from the ground up for scalable analytics to solve the problems of the future. This session will give you an overview of the new and exciting features of SAS Viya and CAS, and how it differs from some of the other in-memory platforms on the market. We will discuss scalability, memory management, Hadoop infrastructure integration, integration with open source tools like Python, R and others, and (web) interfaces.

Muhammad Asif Abbasi

Principal Business Solutions Manager, SAS Institute

11.50 - 12.20

Streaming analytics better than batch - when and why

While a lot of problems can be solved in batch, the stream processing approach currently gives you more benefits. And it’s not only sub-second latency at scale, but mainly the possibility to express accurate analytics with little effort – something that is hard or usually ignored with older batch technologies like Pig, Scalding, Spark or even established stream processors like Storm or Spark Streaming. In this talk we’ll use a real-world example of user session analytics to give you a use-case driven overview of business and technical problems that modern stream processing technologies like Flink help you solve, and benefits you can get by using them today for processing your data as a stream.

Adam Kawa

CEO and Co-founder, GetInData

Krzysztof Zarzycki

Big Data Architect and Co-founder, GetInData

Dawid Wysakowicz

Data Engineer, GetInData

12.20 - 12.25

Technical break

12.25 - 12.55

Creating effective, scalable and easily manageable environment for real-time big data processing and analytics

Creating an effective, scalable and easily manageable environment for big data processing is a challenge which touches multiple domains: business ideas, data science, analytic algorithms and analytic software tools, as well as scalable infrastructure which has to fit a specific use case and be open to dynamic changes. Cisco and Alterdata together understand all stages of this process and are able to guide companies through this journey.
During the session we will describe a use case of real-time big data analytics related to location tracking and how it leverages the automated and scalable Cisco platform.
We will also show how to effectively use a C-store DBMS analytics platform together with the Cisco Validated Design for Big Data architecture, which combines tools such as Cisco UCS (Unified Computing System), Cisco ACI (Application Centric Infrastructure) and UCS Director for Big Data, which provides a single-touch solution that automates Hadoop deployment and provides a single management pane across both physical infrastructure and Hadoop software.

Krzysztof Baczyński

Cisco Big Data Lead for Poland, Cisco

Kamil Ciukszo

Founder and CEO, Alterdata

12.25 - 12.55

One System One Architecture Many Applications

AB Initio software is a general-purpose data processing and metadata management platform. It has a single architecture for processing Hadoop, files, database tables, message queues (Kafka, JMS, etc.), web services, and metadata. This architecture enables virtually any technical or business rule to be graphically defined, shared, and executed in a timely manner. It is a true Big Data architecture in that it processes data in parallel across multiple processors, even processors on different servers such as Hadoop. It can run the same rules in batch and real-time, and within a service-oriented architecture. It is fully production ready and supports distributed checkpoint restart with application monitoring and alerting. And it enables end-to-end metadata to be collected, versioned, and analysed by non-technical users.
AB Initio delivers a rich set of software products that work together in a way that makes it easy to rapidly develop big data systems. The building block of these systems is the AB Initio graph, which combines AB Initio processing components, third-party programs, and any necessary custom code into a high-performance parallel and distributed application.

Firat Tekiner

Data Scientist and Big Data Architect, AB Initio

12.25 - 12.55

Anomaly detection made easy

Imagine this situation: you have deployed a service to production and everything seems to work. After some time your phone rings and an analyst says ‘Could you help me with searching the latest clickstream produced by your application?’. Well, now it got serious. To make matters worse, you have been notified about the error by your client. It shouldn’t have happened. It should be the other way round.
At Allegro we found a solution for this use case. I am going to tell you how we managed to detect anomalies (heavy web traffic after a successful commercial, a drop in search events, or no clicks on an ad).
We tested all available solutions (Twitter detector, HTM algorithms) and came to the conclusion that all machine learning models are too complicated. We didn’t understand them. We created our own simple model. I will show you how we moved from a promising idea in the R language to a final working solution in Scala.
If you like buzzwords these might be for you: #Machine Learning, #Scala, #R, #Statistics, #Simplicity, #Real-time processing

Piotr Guzik

Software Engineer, Grupa Allegro

12.25 - 12.55

Stream Analytics with SQL on Apache Flink

SQL is undoubtedly the most widely used language for data analytics for many good reasons. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and last but not least it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools.

Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite.

In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs is the dynamic table. We explain how streams are converted into dynamic tables and vice versa without losing information due to the stream-table duality. Relational queries on dynamic tables behave similarly to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency. We conclude our talk by demonstrating the power and expressiveness of Flink’s relational APIs, presenting how common stream analytics use cases can be realized.

Fabian Hueske

Software Engineer, data Artisans

12.55 - 13.50

Lunch

Operations & Deployment

Data Application Development

Analytics & Data Science

Real-Time Processing

Session chairs

Piotr Bednarek

Hadoop Administrator, GetInData

Piotr Krewski

Big Data Consultant and Co-founder, GetInData

Klaudia Zduńczyk

Business Development Specialist, GetInData

Dawid Wysakowicz

Data Engineer, GetInData

13.50 - 14.20

Creating Redundancy for Big Hadoop Clusters is Hard

Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day and over 100000 jobs per day. This cluster was critical for both storage and compute but had no backups. After many efforts to increase our redundancy, we now have two clusters that, combined, have more than 2000 nodes, 130 PB, two different versions of Hadoop and 200000 jobs per day, but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices we made and the issues we solved in creating a 1200 node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy and our plans to improve our capacity to handle the loss of a data centre.

Stuart Pook

Senior DevOps Engineer, Criteo

13.50 - 14.20

2 Use Cases from Sky Bet

Sky Bet is one of the largest UK online bookmakers and introduced a Hadoop platform 4 years ago. This session explains how the platform addresses 2 common problems in the gambling industry – knowing your current liability position and helping potentially irresponsible gamblers before they identify themselves. These use cases are linked by a common need for data from the same source systems and highlight the different uses of the data that can co-exist on a shared Hadoop cluster. The journey of replacing a traditional data warehouse with the promised land of Hadoop will be explained. It won’t forget the mis-turns and slips made along the way – this is no idealistic Proof-of-Concept talk; real world implementations are difficult. The journey will start with the first use case, meeting the needs of sportsbook traders to be able to manage liabilities in a competitive and high frequency environment, and how that led, years later, to completely decommissioning the legacy data warehouse. The platform has evolved into supporting a Data Science team and the ability to create predictive models that warn of potentially irresponsible gamblers. This more recent use case illustrates a completely different way of using the same data and how the engineering approach accommodates it. There’s no code in the talk; the aim is to explain how a real world system delivered real world use cases and the teams that need to deliver them.

Mark Pybus

Head of Data Engineering, Sky Betting & Gaming

13.50 - 14.20

H2O Deep Water - Making Deep Learning Accessible to Everyone

Deep Water is H2O’s integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience using H2O’s R/Python/Flow (Web) interfaces.

Jo-fai Chow

Data Scientist, H2O.ai

13.50 - 14.20

RealTime AdTech reporting & targeting with Apache Apex

AdTech companies need to cope with data growing at breakneck speed along with customer demands for insights & analytical reports. At PubMatic we receive billions of events and several TBs of data per day from various geographic regions. This high volume data needs to be processed in realtime to derive actionable insights such as campaign decisions and audience targeting, and also to provide a feedback loop to the AdServer for making efficient ad serving decisions. In this talk we will share how we designed and implemented these scalable low latency realtime data processing solutions for our use cases using Apache Apex.

Ashish Tadose

Senior Data Architect, PubMatic

14.20 - 14.25

Technical break

14.25 - 14.55

Spotify’s Event Delivery

Spotify is currently one of the most popular music streaming services in the world with over 100 million monthly active users. Over the last few years we have had phenomenal growth that has now pushed our backend infrastructure out of our data centers and into the cloud. Earlier this year we announced that we are transitioning all of our backend to Google Cloud Platform (GCP).

Our event delivery system is a key component in our data infrastructure that delivers billions of events per day with predictable latency and a well defined interface for our developers. This data is used to produce Discover Weekly, Spotify Party, Year in Music and many more Spotify features. This talk will focus on the evolution of the event delivery service and the lessons learned, and present the design of our new system based on Google Cloud Platform technologies.

Nelson Arapé

Backend Developer, Spotify

14.25 - 14.55

Data Engineering in Facebook Ads teams

Facebook serves ads from over 4 million advertisers to more than a billion people each day. Every day we face the challenge of building the best products for such a large user base. In order to focus on the right ones, we have to make well informed decisions, which we can prove with data. This is why making information easily accessible and understandable is crucial for the success of the whole team. This talk provides an overview of how Facebook uses data to run the Ads product teams. We will discuss embedding Data Engineers within engineering teams and their impact on the product, and look at techniques which help with standardization and organization of metrics to manage the complexity of data in a scalable way.

Paweł Koperek

Data Engineer, Facebook

14.25 - 14.55

One Jupyter to rule them all

If you tell your colleagues you develop Hadoop applications, they probably consider you a geek who knows Java, MapReduce, Scala and a lot of APIs for submitting, scheduling and monitoring jobs, and who is of course a Kerberos expert. That might have been true a few years ago, but nowadays the Big Data ecosystem contains many tools that enable Big Data for everyone, including non-technical people. At Allegro we simplified the way of creating applications that gain value from datasets. See how we maintain the full development process from the very first line of code to production deployment, in particular:
* develop and maintain code inside Jupyter using pySpark as the Big Data framework,
* store the codebase in git repositories and perform code review,
* create and maintain unit tests and integration tests for pySpark applications,
* schedule and monitor these processes on a Hadoop cluster.
We will also explain why using the CLI for Big Data is pretty obsolete.

Mariusz Strzelecki

Senior Data Engineer, Allegro Group

14.25 - 14.55

ING CoreIntel - collect and process network logs across data centers in near realtime

Security is at the core of every bank’s activity. ING set an ambitious goal to gain insight into the overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, and to prevent data leakage or track down misconfigured software components.
Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is dealing with Elasticsearch a love-hate relationship for us, and how did we manage to put all these pieces together?

Krzysztof Żmij

Expert IT / Hadoop, ING Services Poland

Krzysztof Adamski

Solutions Architect (Big Data), ING Services Poland

14.55 - 15.00

Technical break

15.00 - 15.30

Key challenges in building large distributed full-text search systems based on Apache Solr and Elasticsearch

There are large distributed search platforms based on the two most popular search engines: Apache Solr and Elasticsearch. For a long time now, these two technologies have been able to do much more than full-text search. They are scalable and highly productive NoSQL (document-oriented) databases, which are able to store massive amounts of data and serve a vast number of requests. This is why we can discuss Solr and Elasticsearch in terms of big data projects. Let’s discuss the challenges connected with data indexing and searching, configuring clusters, scaling and distributing them between data centers. During the presentation there will be an overview of available features and issues, but it won’t be another comparison of Solr and Elasticsearch. Both technologies are well-proven software and instead of favoring one of them I would like to present all their possibilities.

Tomasz Sobczak

Senior Consultant, Findwise

15.00 - 15.30

Orchestrating Big Data pipelines @ Fandom

Fandom is the largest entertainment fan site in the world. With more than 360,000 fan communities and a global audience of over 190 million monthly uniques, we are the fan’s voice in entertainment. Being the largest entertainment site, Wikia generates massive volumes of data, ranging from clickstream, user activities, API requests and ad delivery to A/B testing and much more. The big challenge is not just the volume but the orchestration involved in combining various sources of data with various periodicities and volumes, and making sure the processed data is available to consumers within the expected time, thus helping them gain the right insights at the right time. We made a conscious decision to choose the right open source tool to solve the problem of orchestration; after evaluating various tools we decided to use Apache Airflow. This presentation will give an overview of comparisons of existing tools and emphasize why we chose Airflow, and how Airflow is being used to create a stable, reliable orchestration platform that democratizes data by enabling non data engineers to access it seamlessly. We will focus on some tricks and best practices for developing workflows with Airflow and show how we are using some of the features of Airflow.

Krystian Mistrzak

Data Engineer, Fandom powered by Wikia

Thejas Murthy

Data Engineer, Fandom powered by Wikia

15.00 - 15.30

Big data in genomics

Genomic population studies have the storage, analysis and interpretation of various kinds of genomic variants as their central issue. When the exomes and genomes of thousands of patients are being sequenced, there is a growing need for efficient database storage systems, querying engines and powerful tools for statistical analyses. Scalable big data solutions such as Apache Impala, Apache Kudu, Apache Phoenix or Apache Kylin can address many of the challenges in large scale genomic analyses. The presentation will cover some of the lessons learned from a project aiming at creating a data warehousing solution for storing and analyzing genomic variant information at the Department of Medical Genetics, Warsaw Medical University. An overview of the existing big data projects for analyzing data from next generation sequencing will be given as well. The presentation will conclude with a brief summary and a discussion of future directions.

Marek Wiewiórka

Solution Architect, GetInData

15.00 - 15.30

Hopsworks: Secure Streaming-as-a-Service with Kafka/Flink/Spark

Since June 2016, Kafka, Spark and Flink-as-a-service have been available to researchers and companies in Sweden from the Swedish ICT SICS Data Center at www.hops.site using the Hopsworks platform (www.hops.io). Flink and Spark applications are run within a project on a YARN cluster with the novel property that applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running streaming applications, how we use Grafana and Graphite for monitoring streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Oct 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. Hopsworks is entirely UI-driven with an Apache v2 open source license.

Theofilos Kakantousis

Co-founder, Logical Clocks AB

15.30 - 15.55

Coffee break

ROUNDTABLE SESSIONS

15.55 - 16.00

Intro

Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to that group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion – they are selected professionals with vast knowledge and experience.

16.00 - 16.45

Round I

1. Best tools for cluster alerting and monitoring

Tomasz Sujkowski

Big Data Administrator, Agora SA

2. Machine Learning and Big Data: perfect solution for all problems?

Andrzej Dydyński

Data Scientist, Samsung

3. Latest advances in machine learning and their impact on our industries

During the discussion we will focus on the latest advancements in machine learning, mostly in the area of artificial neural networks, and their impact on the landscape of industries, tools and IT professions. Should we expect another AI ice age, or is this time different and are we well on the way to solving intelligence? https://www.youtube.com/watch?v=aygSMgK3BEM

Michał Sapiński

Software engineer, Google

4. Effective tools and environment for data scientists

Artur Maliszewski

Head of Business Intelligence, Currency One

5. How to hire data scientists?

Przemysław Biecek

Co-founder, SmarterPoland.pl

6. Major challenges in projects based on the Hadoop environment (lack of measurable results, staffing problems, the high cost of keeping source code up to date, the necessity of dealing with many different and fast-changing technologies). Data governance in Big Data.

Konrad Hoszowski

Technical Account Manager, AB Initio

Firat Tekiner

Data Scientist and Big Data Architect, AB Initio

7. Building an EDW using Big Data technologies: challenges and opportunities

How to successfully build an EDW using the Big Data technology stack. Adapting EDW methodologies, techniques and best practices (Kimball, Inmon, Data Vault, Anchor, Hub and Spoke) to Big Data realities. How to plan the program, build the team, choose technologies and infrastructure (cloud vs on-prem), model and process the data, etc.

Marcin Choiński

Head of Big Data & Analytics Ecosystem, TVN Digital S.A.

8. Being an efficient data engineer: tools, ecosystem, skills, ways of learning

Big Data Engineer is quite a new profession, yet the Big Data ecosystem is large, growing rapidly and changing fast. There are a lot of frameworks and tools that are supposed to make us efficient. Some of them can help, while others are obsolete. Specific use cases call for different tools and approaches. I would like to talk about using common frameworks like Spark, Kafka, Hadoop, Camus, Oozie, AirBnBWorkflow and others to make our lives easier. We can discuss typical issues that occur in daily work and the way we handled them at Allegro. We might also talk about different ways of learning Big Data technologies. To sum up, the questions to be asked at this table are: Where can we find good learning materials? How can we improve? What skills do we need to succeed? How do we write custom tools, and is it worth the effort?

Piotr Guzik

Software Engineer, Grupa Allegro

9. How to overcome the challenges you can expect when designing and managing an environment, both software and hardware, for big data analytics.

Krzysztof Baczyński

Cisco Big Data Lead for Poland, Cisco

Kamil Ciukszo

Founder and CEO, Alterdata

10. Real-time stream processing frameworks – available technologies, their pros & cons, deployment techniques, interesting features.

Fabian Hueske

Software Engineer, data Artisans

11. Beyond pre-computed answers – interactive, sub-second OLAP queries with Druid / Kylin

Big Data stands for volume, velocity and, last but not least, variety. Variety in data translates to a variety of business use cases and questions we may want to ask about it. One of the major challenges in modern data engineering is how to build systems that not only satisfy the needs of our businesses today but are also capable of keeping up with the ever-increasing pace of evolving business requirements, at a palatable cost. One emerging segment of Big Data technologies that enables us to build such systems is distributed OLAP engines such as Druid and Kylin. Let’s chat about ideal and not-so-ideal use cases, success and failure stories, operational trade-offs and issues, scaling and optimising, getting data in (also in real time) and out (quickly), and our experiences and ideas. Let’s share and learn from each other.

Piotr Turek

Big Data Software Architect, DreamLab

16.45 - 17.30

Round II

1. Enterprise requirements for clusters: security, audit, encryption, backups *

Artur Szymański

Hadoop Administrator, Vodafone

2. Machine Learning and Big Data: perfect solution for all problems?

Andrzej Dydyński

Data Scientist, Samsung

3. Latest advances in machine learning and their impact on our industries

During the discussion we will focus on the latest advancements in machine learning, mostly in the area of artificial neural networks, and their impact on the landscape of industries, tools and IT professions. Should we expect another AI ice age, or is this time different and are we well on the way to solving intelligence? https://www.youtube.com/watch?v=aygSMgK3BEM

Michał Sapiński

Software engineer, Google

4. Effective tools and environment for data scientists

Artur Maliszewski

Head of Business Intelligence, Currency One

5. Expensive mistakes to avoid when building a data platform

Piotr Kalański

Data Engineering Team Leader, StepStone

6. Release process when deploying production data applications

Paweł Cejrowski

Big Data Engineer, Grupa Wirtualna Polska

7. One-click deployment – how to automate the platform properly and efficiently

Piotr Bednarek

Administrator Hadoop, GetInData

8. Being an efficient data engineer: tools, ecosystem, skills, ways of learning

Big Data Engineer is quite a new profession, yet the Big Data ecosystem is large, growing rapidly and changing fast. There are a lot of frameworks and tools that are supposed to make us efficient. Some of them can help, while others are obsolete. Specific use cases call for different tools and approaches. I would like to talk about using common frameworks like Spark, Kafka, Hadoop, Camus, Oozie, AirBnBWorkflow and others to make our lives easier. We can discuss typical issues that occur in daily work and the way we handled them at Allegro. We might also talk about different ways of learning Big Data technologies. To sum up, the questions to be asked at this table are: Where can we find good learning materials? How can we improve? What skills do we need to succeed? How do we write custom tools, and is it worth the effort?

Piotr Guzik

Software Engineer, Grupa Allegro

9. Fast SQL solutions for Hadoop

Hadoop was developed as a batch processing solution, but it quickly became important for data scientists and analysts as well. There are plenty of products that give you the opportunity to do fast ad-hoc analysis on big data, such as Spark, Impala, Presto or Drill, to mention just a few. In this session we will share our experience with various “SQL on Hadoop” solutions, hear some success stories and also discuss common pitfalls.

Jakub Pieprzyk

Data Science Developer, RyanAir

10. Data visualisation – why, how and when?

Przemysław Biecek

Co-founder, SmarterPoland.pl

11. Large-scale data collection and ingestion – Kylo and other projects (Gobblin, Nifi, Kafka Connect, Camus)

Tomasz Domański

Senior Data Engineer, ThinkBig (a Teradata company)

* This table will be hosted in Polish only.

17.30 - 17.45

Coffee break

17.45 - 18.15

Panel discussion - Big Data implementations: how to achieve a justified ROI

Big Data brings a lot of promises about potential benefits, but life proves it’s not always so easy. How to make Big Data projects great? How to get quick wins? How to avoid expensive mistakes? How to communicate with others (the business side or a client) to make a project viable? What are the major success factors, and where are the easily missed obstacles that can derail Big Data projects?

Hosts:

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Krzysztof Zarzycki

Big Data Architect and Co-founder, GetInData

Grzegorz Bartler

Head of Business Intelligence Department, Polkomtel, Cyfrowy Polsat

dr hab. Piotr Gawrysiak

Chief Data Scientist, mBank S.A.

Derek Yeung

Head of Platform Engineering, Nordea

Olaf Piotrowski

Chief Data Officer, Allegro

18.15 - 18.30

Closing & Summary

Przemysław Gamdzyk

CEO & Meeting Designer, Evention

Adam Kawa

CEO and Co-founder, GetInData