The conference (February 22) and workshops (February 21) will take place at the Airport Hotel Okęcie (17 Stycznia 24 Street, Warsaw).

DESCRIPTION

Big Data Workshop is a one-day event dedicated to everyone who wants to understand and get a hands-on taste of working with Big Data and the Hadoop ecosystem. We will cover technologies such as Hadoop, Hive, Spark and Kafka.

During the workshop you’ll act as a Big Data specialist working for a fictional company called StreamRock that builds a music-streaming application (similar to Spotify). The main goal of your work is to use Big Data technologies such as Hadoop, Spark and Hive to analyze data about the users and the songs they played. You will process the data to answer many business questions and to power product features that StreamRock is building. Every exercise will be executed on a remote multi-node Hadoop cluster.

The workshop is highly focused on practical experience. The instructors will share interesting and practical insights gained over several years of working with Big Data technologies.

TARGET AUDIENCE

The workshop is open to everyone interested in Big Data and analytics: engineers, managers and others.

REQUIREMENTS

All you need to fully participate in the workshop is a laptop with a web browser, a terminal client (e.g. PuTTY) and a Wi-Fi connection. No prior knowledge of Big Data technologies is assumed.

PARTICIPANT'S ROI

  • Carefully curated knowledge of the most popular Big Data technologies
  • Intuition about when and why to use different Big Data tools
  • Hands-on experience with simulated real-life use cases
  • Tips on real-world applications and best practices from experienced professionals

TRAINING MATERIALS

All participants will receive training materials as PDF files: slides covering the theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).

TIME BOX

This is a one-day event; there will be coffee breaks and a one-hour lunch break (included in the price).

AGENDA

8.45 - 9.15

Coffee and socializing

9.15 - 10.45

Session #1 - Introduction to Big Data and Apache Hadoop

  • Description of the StreamRock company and the opportunities and challenges that come with Big Data technologies
  • Introduction to core Hadoop technologies such as HDFS or YARN
  • Hands-on exercise: Accessing a remote multi-node Hadoop cluster

10.45 - 11.00

Coffee break

11.00 - 12.30

Session #2 - Providing data-driven answers to business questions with a SQL-like solution

  • Introduction to Apache Hive
  • Hands-on exercise: Importing structured data into the cluster using HUE
  • Hands-on exercise: Ad-hoc analysis of the structured data with Hive
  • Hands-on exercise: The visualisation of results using HUE

12.30 - 13.30

Lunch

13.30 - 15.30

Session #3 - Implementing scalable ETL processes on the Hadoop cluster

  • Introduction to Apache Spark, Spark SQL and Spark DataFrames
  • Hands-on exercise: Implementing an ETL job to clean and massage input data using Spark
  • Quick explanation of the Avro and Parquet binary data formats
  • Practical tips for implementing ETL processes: scheduling, schema management and integration with existing systems
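The clean-and-massage step at the heart of Session #3 can be sketched without any framework; the workshop implements the same idea with Spark DataFrames on the cluster. The field names and validation rules below are illustrative assumptions:

```python
# Framework-free sketch of an ETL cleaning step: drop rows with a
# missing user and coerce string durations to integers, discarding
# malformed records. (Field names are hypothetical.)
raw_rows = [
    {"user_id": "u1", "duration_sec": "240"},
    {"user_id": "",   "duration_sec": "180"},   # missing user -> dropped
    {"user_id": "u2", "duration_sec": "oops"},  # malformed -> dropped
]

def clean(row):
    """Return a typed, validated row, or None if the row is unusable."""
    if not row["user_id"]:
        return None
    try:
        duration = int(row["duration_sec"])
    except ValueError:
        return None
    return {"user_id": row["user_id"], "duration_sec": duration}

cleaned = [r for r in (clean(row) for row in raw_rows) if r is not None]
print(cleaned)  # only the first row survives
```

With Spark the same logic becomes a chain of DataFrame transformations that runs in parallel across the cluster's nodes.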

15.30 - 15.45

Coffee break

15.45 - 16.45

Session #4 - Other essential tools from the Hadoop ecosystem

  • Scheduling and orchestration of tasks with Oozie
  • Data collection with Apache Kafka
  • Real-time random read-write with Apache HBase

16.45 - 17.00

Coffee break

17.00 - 17.30

Session #5 - Summary and Q&A

  • Big Data Jeopardy game

Keywords: Hadoop Ecosystem, Hive, Spark, Big Data Analytics, Big Data ETL

Workshop speakers, GetInData instructors:

Maciej Arciuch

Senior Data Engineer, Grupa Allegro, GetInData

Piotr Krewski

Big Data Consultant and Co-founder, GetInData

DESCRIPTION

This one-day workshop teaches data engineers how to process unbounded streams of data in real time using popular open-source frameworks. We focus mostly on Apache Flink – the most promising open-source stream processing framework, increasingly used in production.

During the course we simulate a real-world end-to-end scenario: processing, in real time, logs generated by users interacting with a mobile application. The technologies we use include Kafka, Flink, HDFS and YARN. All exercises will be done on remote multi-node clusters.

TARGET AUDIENCE

Data engineers who are interested in leveraging large-scale and distributed tools to process streams of data in real-time.

REQUIREMENTS

Some experience coding in Java or Scala and basic familiarity with Big Data tools (HDFS, YARN).

PARTICIPANT'S ROI

  • Concise and practical knowledge of applying stream processing to solve business problems
  • Hands-on coding experience under the supervision of experienced Flink engineers
  • Tips on real-world applications and best practices

TRAINING MATERIALS

All participants will receive training materials as PDF files: slides covering the theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).

TIME BOX

The workshop lasts eight full hours, so you should set aside a full day. Of course, there will be coffee and lunch breaks during the training.

AGENDA

8.45 - 9.15

Coffee and socializing

9.15 - 10.15

Session #1 - Introduction to Apache Kafka + hands-on exercises
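Kafka's core abstraction, covered in Session #1, is an append-only log that consumers read by offset, so each consumer can resume exactly where it left off. A minimal plain-Python sketch of that offset semantics (this is an illustration only, not the real Kafka API):

```python
class Topic:
    """Toy append-only log illustrating Kafka-style offsets.

    Illustrative only -- real Kafka adds partitions, replication,
    consumer groups and persistence on top of this idea.
    """
    def __init__(self):
        self.log = []

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message

    def consume(self, offset):
        return self.log[offset:]  # every message at or after the offset

clicks = Topic()
clicks.produce("user1:play")
clicks.produce("user2:pause")

# A consumer that has already processed offset 0 resumes from offset 1.
print(clicks.consume(1))
```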

10.15 - 10.30

Coffee break

10.30 - 12.30

Session #2 - Apache Flink

  • Introduction and key concepts
  • Basic Flink API
  • Hands-on exercises

12.30 - 13.30

Lunch

13.30 - 15.00

Session #3 - Flink cont.

  • Time & Windows
  • Integration with Kafka and Elasticsearch
  • Hands-on exercises
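The "Time & Windows" topic above is about grouping a stream's events into time buckets. Here is a framework-free sketch of a tumbling event-time window; Flink's actual API (keyed streams, watermarks, window operators) is far richer, and the event data below is a made-up example:

```python
from collections import defaultdict

# Toy stream of (event_time_sec, count) pairs -- purely illustrative.
events = [(3, 1), (7, 1), (12, 1), (14, 1)]
WINDOW_SIZE = 10  # tumbling windows of 10 seconds

# Assign each event to the window containing its timestamp and
# aggregate the counts per window.
windows = defaultdict(int)
for timestamp, count in events:
    window_start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
    windows[window_start] += count

print(dict(windows))  # counts per 10-second window
```

Unlike this batch sketch, Flink computes such windows incrementally over an unbounded stream and uses watermarks to decide when a window can be emitted.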

15.00 - 15.15

Coffee break

15.15 - 16.45

Session #4 - Flink cont.

  • Stateful operations
  • Best practices
  • Daemons and cluster infrastructure
  • Hands-on exercises

16.45 - 17.00

Coffee break

17.00 - 17.30

Session #5 - Summary and comparison with other stream processing engines (Spark Streaming and Storm)

Keywords: Kafka, Flink, Real Time Processing, Low Latency Stream Processing

Workshop speakers, GetInData instructors:

Krzysztof Zarzycki

Big Data Architect and Co-founder, GetInData

Dawid Wysakowicz

Data Engineer, GetInData

DESCRIPTION

This one-day workshop teaches participants how to apply data science methods to large amounts of data. We focus mostly on Apache Spark, Spark ML (Spark’s library for machine learning) and machine-learning model tuning with Apache Spark.

During the course we simulate a real-world end-to-end scenario: creating a working, production-ready model for text categorisation. The technologies we use include Python, Apache Spark, Spark ML and Zeppelin. Exercises will be done on remote multi-node clusters.

TARGET AUDIENCE

The workshop is open to everyone interested in Big Data, analytics, text mining and data science.

REQUIREMENTS

All you need to fully participate in the workshop is a laptop with a web browser, a terminal client (e.g. PuTTY) and a Wi-Fi connection. No prior knowledge of Big Data technologies or data science techniques is assumed.

PARTICIPANT'S ROI

  • Knowledge of how to approach data analysis and data science with Apache Spark
  • Knowledge of how to work with text data
  • Knowledge of two approaches to machine learning with Spark – Spark ML and sklearn
  • Hands-on experience with a simulated real-life use case
  • Ability to tackle business problems requiring text mining techniques

TRAINING MATERIALS

All participants will receive training materials as PDF files: slides covering the theory and an exercise manual with detailed descriptions of all exercises. During the workshop, exercises will be done on a remote Hadoop cluster. If you want to redo the exercises later on your own, you can use a virtual machine (e.g. Hortonworks Sandbox or Cloudera QuickStart, which can be downloaded from each vendor’s site).

TIME BOX

This is a one-day event; there will be coffee breaks and a one-hour lunch break (included in the price).

AGENDA

PART 1

Working with text data

  • What text data is, how we can use it and how to store it
  • Popular methods for text embeddings
  • Popular classification methods for text
  • Hands-on exercise: A first small model for text categorisation
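The simplest of the text-embedding methods discussed in Part 1 is a bag-of-words vector: each document becomes a vector of word counts over a shared vocabulary. A tiny sketch in plain Python (the documents and vocabulary are toy examples, not the workshop's ads dataset):

```python
# Toy documents standing in for classified-ad texts.
docs = ["cheap used car", "new car for sale", "piano lessons"]

# Shared vocabulary: every distinct word across the corpus, sorted
# so each word has a stable vector position.
vocab = sorted({word for doc in docs for word in doc.split()})

def embed(text):
    """Map a text to its word-count vector over the shared vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = [embed(doc) for doc in docs]
print(vocab)
print(vectors[0])  # bag-of-words vector for the first document
```

Real pipelines add lowercasing, tokenisation, stop-word removal and weighting schemes such as TF-IDF on top of this idea.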

PART 2

General Apache Spark

  • Creating and transforming DataFrames
  • ETL processes with Spark
  • Performing exploratory data analysis (EDA)
  • Hands-on exercise: Loading and exploring data for ads categorisation

PART 3

Basic Machine Learning on Spark

  • Introduction to the sklearn library
  • Hands-on exercise: Using Spark to search for the best model parameters for sklearn
  • Overview of Spark ML and the differences between sklearn and Spark ML
  • Hands-on exercise: Building a first model with Spark ML
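The parameter-search exercise in Part 3 rests on a simple idea: enumerate a grid of candidate parameters, score each, and keep the best. Spark's role is to evaluate the candidates in parallel across the cluster. Sketched here without Spark or sklearn; the quadratic "validation error" and the parameter names are purely illustrative:

```python
from itertools import product

# Toy objective standing in for a model's validation error; in the
# workshop each candidate would be an sklearn model scored on held-out
# data, with Spark distributing the evaluations.
def validation_error(alpha, depth):
    return (alpha - 0.1) ** 2 + (depth - 3) ** 2

param_grid = {"alpha": [0.01, 0.1, 1.0], "depth": [2, 3, 5]}

# Expand the grid into a list of parameter dictionaries.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

# Spark would map validation_error over `candidates` in parallel;
# here we evaluate sequentially and keep the best.
best = min(candidates, key=lambda p: validation_error(**p))
print(best)
```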

PART 4

Spark ML pipeline

  • A Spark ML pipeline for text analysis
  • Hands-on exercise: Building a Spark ML pipeline for ads categorisation
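The pipeline idea from Part 4 is a fixed sequence of stages, each transforming the data produced by the previous one. A framework-free sketch of that concept; the stage names echo common Spark ML stages (tokenizer, stop-word remover) but are plain functions here, and the stop-word list is a made-up example:

```python
# Two toy stages mirroring typical Spark ML text-processing stages.
def tokenize(texts):
    """Lowercase and split each text into words."""
    return [t.lower().split() for t in texts]

def remove_stop_words(token_lists, stop_words=frozenset({"for", "the"})):
    """Drop uninformative words from each token list."""
    return [[w for w in tokens if w not in stop_words] for tokens in token_lists]

def pipeline(data, stages):
    """Run the data through each stage in order -- the core pipeline idea."""
    for stage in stages:
        data = stage(data)
    return data

result = pipeline(["Used car FOR sale"], [tokenize, remove_stop_words])
print(result)
```

A real Spark ML `Pipeline` packages such stages (plus an estimator, e.g. a classifier) into a single object that can be fit, evaluated and reused on new data.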

Keywords: Spark, Machine Learning, Text Mining, MLlib, Data Science

Workshop speakers:

Rafał Prońko

Machine Learning Developer, YND

Tomasz Żukowski

Data Analyst, GetInData