Agenda - Big Data Technology Warsaw Summit
Big Data Technology Warsaw Summit 2024
Our agenda is packed with presentations, arranged into 6 categories – find the topics that interest you most!

You can choose whether you prefer to WATCH THE CONFERENCE ONLINE or JOIN US IN PERSON IN WARSAW. More presentations, more experts and more topics!
28.03.2023 - WORKSHOP DAY
9.00 - 16.00
PARALLEL WORKSHOPS (independent workshops, paid entry) | on-site, WARSAW
DESCRIPTION:
In this one-day workshop, you will learn how to create modern data transformation pipelines managed by dbt and orchestrated with Apache Airflow. You will discover how to improve your pipelines’ quality and the workflow of your data team by introducing a set of tools aimed at standardizing the way good practices are incorporated within the data team: version control, testing, monitoring, change data capture, and easy scheduling. We will work through typical data transformation problems you may encounter on the journey to delivering fresh and reliable data, and see how modern tooling can help solve them. All hands-on exercises will be carried out in a public cloud environment (e.g. GCP or AWS).
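The workshop materials are not reproduced here, but the core idea of orchestrating dbt models can be sketched with the standard library alone. The model names below are hypothetical; dbt infers such a dependency graph from `ref()` calls, and an orchestrator like Airflow runs the tasks in a dependency-respecting order:

```python
from graphlib import TopologicalSorter

# Hypothetical dbt-style model graph: each model maps to the set of
# models it depends on (what dbt infers from ref() calls).
models = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

# An orchestrator such as Airflow executes tasks so that every model's
# dependencies run first; static_order() yields one valid linearization.
run_order = list(TopologicalSorter(models).static_order())
print(run_order)
```

Staging models come out first and `daily_revenue` last, which is exactly the ordering guarantee the scheduler provides.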
Participants limit: 20
SESSION LEADER:
GetInData | Part of Xebia
GetInData | Part of Xebia
DESCRIPTION:
In this one-day workshop, you will learn how to operationalize Machine Learning models using popular open-source tools like Kedro and Kubeflow, and deploy them using cloud computing.
During the course, we simulate real-world end-to-end scenarios – building a Machine Learning pipeline to train a model and deploy it in a Kubeflow environment. We’ll walk through practical use cases of MLOps for creating reproducible, scalable, and modular data science code. Next, we’ll propose a solution for running pipelines on Google Cloud Platform, leveraging managed and serverless services. All exercises will be done using either a local Docker environment or a GCP account.
Participants limit: 18
SESSION LEADERS:
GetInData | Part of Xebia
GetInData | Part of Xebia
DESCRIPTION:
In this one-day workshop you will learn how to build streaming analytics apps that deliver instant results in a continuous manner on data-intensive streams. You will discover how to configure streaming pipelines, transformations, aggregations, or triggers using SQL and Python in a user-friendly development environment built on open-source tools: Apache Flink, Apache Kafka, and GetInData OSS projects.
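To give a flavour of what such aggregations compute (the events and window size below are made up, and in Flink SQL this would be a `TUMBLE` window `GROUP BY` rather than plain Python), here is a minimal sketch of tumbling-window counting:

```python
from collections import defaultdict

# Hypothetical click events: (timestamp_seconds, user).
events = [(1, "a"), (4, "b"), (7, "a"), (12, "b"), (14, "b")]
WINDOW = 10  # fixed window size in seconds

# Tumbling-window semantics: every event belongs to exactly one
# fixed-size, non-overlapping window, identified by its start time.
counts = defaultdict(int)
for ts, user in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(window_start, user)] += 1

print(dict(counts))  # {(0, 'a'): 2, (0, 'b'): 1, (10, 'b'): 2}
```

A streaming engine does the same grouping continuously and emits each window's result when the window closes.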
Participants limit: 18
SESSION LEADER:
GetInData | Part of Xebia
19.00 - 22.00
Evening Meeting for Speakers. Let's meet! To talk, to meet new people, to exchange experiences. We invite you for face-to-face interaction onsite. The integration meeting will take place at the Floor No2 restaurant at the Marriott Hotel in the center of Warsaw. The event starts at 19:00.


29.03.2023 - 1ST CONFERENCE DAY | HYBRID: ONLINE + ONSITE
8.30 - 9.00
Morning coffee and networking time
9.00 - 9.15
Plenary session
9.15 - 9.40
Plenary session
9.45 - 10.15
Plenary session
This panel will bring together leading experts from selected vendors of big data solutions. They will share a deeper understanding of the latest trends, technologies, and methodologies driving the big data industry, and attendees will leave with practical insights they can apply in their organizations. We will ask a series of deep tech questions to all panelists.
10.15 - 10.40
Plenary session
#databricks #terraform #devops #security #azure
Creating an Azure Databricks environment is as simple as the “click of a button”, but how do you ensure the platform is secured and protected from data exfiltration? How can Infrastructure as Code and Terraform support platform hardening? How can DevOps accelerate your big data projects? What are the key configuration options to consider? What are the pitfalls and limitations? What could be improved?

10.40 - 11.05
Plenary session
11.05 - 11.35
BREAK
11.35 - 13.15
PARALLEL SESSIONS
11.35 - 12.05
Parallel session 1
#streaming #nifi #kafka #flink #cybersecurity #flow #cloud #datamovement
In the context of a real-life cybersecurity use case at scale, we'll describe a modern streaming architecture providing the ability to enable real-time actionable insights and actions at scale in the event of a cyber attack. We'll especially focus on the role of NiFi in this architecture and how it is a key element for your data distribution & acquisition problems in modern environments where we have to deal with a wide range of systems, interfaces, formats, etc.

Parallel session 2
#Azure, #Analytical Platform, #Data Platform, #BI, #AA, #Cloud Architecture
The Volvo Group has a very ambitious strategy that assumes that by 2030, 50% of its revenue will come from services and solutions. Most of those services will be digital and will have data needs. We want to deep-dive into this large-scale data ecosystem and share the challenges we faced while developing the cloud analytical platform. We will share the journey from the high-level architecture phase to practical implementation and operational cases in the context of the MS Azure stack.

Parallel session 3
Parallel session 4
12.10 - 12.40
Parallel session 2
#DataPlatform #DataAnalysis #AWS #DataScience #DataLake #Jupyter #DataDrivenBusiness
Strictly speaking, they are not dead... yet, but they will be if we don’t change our approach.
First, it currently takes around six months from the start of building a data platform until analytical work becomes possible. Second, companies often don’t know how to monetize data, or they don’t have the resources to use it.
But what if we changed our approach and built the data platform in an analysis-oriented way?
Start by thinking about how data can help your organization,
then do the analysis,
and the platform will be created "by the way" as a response to analytical needs.
Join our presentation and see how we do it at BlueSoft.

Parallel session 1
#bigdata #datamesh #snowflake #dataops #selfservedataplatform
Data Mesh! Just a catchy concept, or more than that? In this session we would like to guide you through our Data Mesh journey at Roche, starting from the very first idea and motivation up to the actual evolution into an enterprise-level self-serve data platform, including insights into multiple capabilities (e.g. Snowflake, DataOps.live). We will explain the Data Mesh concept and how the combination of cloud services enables teams to fulfill all characteristics of a Data Product.

Parallel session 3
Parallel session 4
12.45 - 13.15
Parallel session 1
Parallel session 2
Parallel session 3
#machine_learning #NLP #DSP #music_industry
The revolution of machine learning is reaching every aspect of our lives - including art and music.
In this talk, we will dive into the world of song analysis and the extraction of lyrical and musical features. We will discuss existing approaches, both in machine learning - Natural Language Processing, Digital Signal Processing - and in music theory & linguistics.
We will see how we can use these features in different kinds of machine learning models, and how these models can be used to solve problems in the music industry, such as song tagging and song similarity.
Attend this talk to learn how your technical skills can be useful also to your hobbies.

Parallel session 4
#stream processing, #streaming sql, #real time, #apache flink, #open source
In this presentation, we will explain the fundamentals of Apache Flink: What are the common use-cases, how do you build applications on top of it, how does it integrate with other systems and how does it help solve operational challenges.
Also, we will discuss new features added in the last Flink releases, such as the Kubernetes operator, batch & streaming unification and more.

13.15 - 14.00
LUNCH BREAK
14.00 - 15.40
PARALLEL SESSIONS
14.00 - 14.30
Parallel session 1
#data lake, #data architecture, #data swamp, #BI
We all start with the same dream of an amazing, well-curated data lake: a place where we could store all the company's valuable data indefinitely, a lake of insights just waiting to be found.
But reality has proven to us again and again that these lakes of insights we seek more often than not turn out to be chaotic, unstructured data swamps.
But how about a different approach? What if we could solve for schema, cataloging, ownership, governance, and so many other issues - before the data is even created?
In this talk we will discuss how we've done exactly that in Wix's data platform. We will review how we've built the tools, culture, and development ecosystem to make BI data a first-class citizen company-wide, and how that enabled us to build a fully structured and curated data lake at petabyte scale.

Parallel session 2
#dataengineering #bigdataanalytics #bigdatatechnologies
Data engineering has become a more complex environment and means different things to different people and organisations.
Understanding this is useful when hiring, looking for a job, or assessing solutions and vendors.
In this presentation, we will go through different approaches, taking into account engineering culture, data landscape, expected outcome, and hiring/contracting history.
For each, we will look at examples, pros, and cons.

Parallel session 3
#recommenderSystems #lean #machineLearning #development
Exploring Lean Data Science for Recommendation Systems
Showcasing the Solution Evolution - from heuristics to complex ML models
How different industries and organizations shape the approach to designing recommendation systems

Parallel session 4
#data mesh, #data product, #contract testing, #MLOPs, #CD4ML
What is Data Mesh?
What is a Data Product?
What are Consumer-Driven Contracts (CDCs)?
How to apply CDCs to safely compose data products?
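The questions above can be grounded with a small sketch. The schemas and consumer names below are hypothetical; the point is the consumer-driven contract idea: each consumer declares the fields it relies on, and the producer validates its published schema against every contract before shipping a change:

```python
# Hypothetical producer schema for a data product, and the contracts
# its consumers have registered (field -> expected type).
producer_schema = {"order_id": "int", "amount": "float", "currency": "str"}

consumer_contracts = {
    "finance_dashboard": {"order_id": "int", "amount": "float"},
    "fx_reporting": {"amount": "float", "currency": "str"},
}

def violations(schema, contract):
    """Return the contract fields the schema fails to satisfy."""
    return {f: t for f, t in contract.items() if schema.get(f) != t}

# A non-empty result means a breaking change for that consumer.
broken = {name: v for name, c in consumer_contracts.items()
          if (v := violations(producer_schema, c))}
print(broken)  # {} -> safe to publish
```

Running such checks in CI lets data products evolve without silently breaking downstream consumers.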

14.35 - 15.05
Parallel session 1
Parallel session 2
#analyticsengineering #datavault #dbtlabs #snowflake #mlops
Ingesting sources by generating Airbyte Connections from DBT sources
Business Model Based Data Vault Integration Layer
DBT Hardrules to prepare Source Data for Datavault or Raw Consumption
Mapping Source Data to the Business Model
DBT Softrules to implement Data Products and re-usable components based on the Business Model
Combining Raw and Business Model based data to reduce and operationalise Data for ML Use Case

Parallel session 3
#activelearning #machinelearning #deeplearning #wisesampling
Active Learning - a way to collect labeled data wisely, achieving equivalent or better performance levels with fewer data samples and saving time and money.
Background - what types of methods are there? What is the motivation behind using active learning?
Methods - two different methods will be described, demonstrating them on a specific real-world use case
Results - what are the results of these methods compared to regular random sampling?
Validations - how can we validate the results? How can we be sure that all the components behave as we want?
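The abstract does not name its two methods, but the simplest active-learning strategy, uncertainty sampling, can be sketched in a few lines (the sample IDs and probabilities below are invented for illustration):

```python
# Hypothetical model confidence scores on unlabeled items:
# predicted probability of the positive class for each sample.
unlabeled = {"s1": 0.97, "s2": 0.52, "s3": 0.10, "s4": 0.45, "s5": 0.88}

def pick_for_labeling(probs, budget):
    """Uncertainty sampling: choose the samples the model is least
    sure about (probability closest to 0.5) for human annotation."""
    return sorted(probs, key=lambda k: abs(probs[k] - 0.5))[:budget]

print(pick_for_labeling(unlabeled, budget=2))  # ['s2', 's4']
```

Confident predictions (s1, s3, s5) are skipped, so the labeling budget is spent only where the model actually needs help.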

Parallel session 4
#Data engineer, #data architect, #head of data, #CTO, #CDO
Handling billions of messages/day is not an easy challenge for streaming applications: the streaming world never sleeps, while humans … well, they should! In this talk, a real-world practical solution is presented for Near-Real-Time streaming applications monitoring and automated maintenance, including:
a flexible and scalable architecture to be resilient against extreme volume and velocity dynamics
an actionable monitoring system to enable automated recovery processes
a customizable triggers system
an interoperable data model for health metrics
a decoupled and application-agnostic design

15.10 - 15.40
Parallel session 1
#OnPremExit #CloudMigration #LessonsLearned #IfIWouldHaveKnown
How GE Healthcare has transitioned effectively from an on-prem analytical ecosystem to the cloud.
Migration options and approaches
Best Practices for re-platform & redesign of legacy solutions to cloud
Pitfalls: What people do not say...
Retrospective and lessons learned

Parallel session 2
#TechnologicalChoices #DataProcessingEngines #BigDataArchitecture #ProsAndCons #DataDrivenBusiness
During the presentation we will try to help you not get lost among the most popular data processing engines. Our presentation is an attempt to answer the following questions: Which data processing engine should I use? What are the key points to consider while making such a decision?
We will look at these technologies from the perspective of your ETL processes - starting from the complexity of data pipelines, through the amount of data, to maintainability and required expertise. We will focus on the leading open-source and cloud technologies like Apache Spark, Apache Beam, BigQuery/Snowflake/Hive, and many more.

Parallel session 3
#timeseries, #datascience, #forecasting, #preprocessing, #python
Time series data is fundamental for most modern businesses; however, its handling involves a variety of common issues that can lead to serious consequences when not treated with care.
In this talk, I will introduce the audience to several such common issues, their causes and effective solutions.
This will include: (1) stabilising complex time series dynamics; (2) imputation of time-clustered missing values; (3) reducing impact of noise on forecasting models.
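Issue (2) can be illustrated with a minimal sketch (the series is invented, and real pipelines would use a library such as pandas): filling missing values by linear interpolation between the nearest observed neighbours.

```python
def interpolate(series):
    """Fill None gaps by linear interpolation between the nearest
    observed values. Assumes the first and last values are observed."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            lo = max(j for j in range(i) if out[j] is not None)
            hi = min(j for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = out[lo] + (out[hi] - out[lo]) * (i - lo) / (hi - lo)
    return out

print(interpolate([1.0, None, None, 4.0, 5.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

For time-clustered gaps, as the talk notes, naive interpolation like this can badly distort the dynamics, which is exactly why more careful imputation strategies matter.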

Parallel session 4
#data-meetups #meetups #data-community
Meetups are a great way to upskill your team, create internal and external data communities, improve your leadership skills, and hire new talent. In fact, the tech world is famous for its vibrant meetups and conferences. In this talk, we demystify the organization of meetups - from finding sponsors and getting world-class speakers to attracting a relevant audience.

15.40 - 16.00
BREAK
16.00 - 16.25
Plenary session
16.25 - 16.50
Plenary session
How to build a good Netflix signup experience? In a consumer-facing product, user journeys - like account creation - have a direct impact on the business metrics. Come and learn how Netflix uses (big) data to continuously improve their user journeys.
#DataAnalysis #UserJourneys #A/Bexperimentation

16.50 - 17.45
ROUNDTABLES (ONSITE only)
Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First of all, participants have the opportunity to exchange their opinions and experiences about a specific issue that is important to that group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion - selected professionals with vast knowledge and experience.
There will be one roundtable session, so every conference participant can take part in one discussion.
Roundtable discussion
Digital experimentation: the ultimate weapon for businesses seeking a competitive edge in the world of tech. By using real-time data and cutting-edge tech, it optimizes digital offerings, enhances user experiences, and drives growth. Unlock the power of digital experimentation to take your digital game to the next level.
Roundtable discussion
What we can observe today is the rapid growth of new AI models that achieve astonishing results in generating text, audio, images, and even source code. It becomes clear that the generative AI revolution has begun. The question is: are companies ready for it? Join me for a discussion – everyone will have a chance to share their thoughts, discuss experiences, exchange ideas, ask and answer questions.
Discussion points include e.g.:
● Which AI models have the potential for being applied by companies?
● How can they be used to gain profit?
● What are the risks of using them?
● How can we mitigate these risks?
Roundtable discussion
What are the key characteristics of a self-service data streaming platform? Is ANSI SQL good enough? If not, what's missing? What about data discoverability? How should data be shared inside an organization, and how should it be secured? What else does your organization care deeply about?
17.45 - 18.00
SUMMARY & PRIZE GIVEAWAY
19.00 - 22.00
Evening Meeting for all (*advance registration for the event is required)
Let's get together! To talk, to meet new people, to see old colleagues. We invite you for face-to-face interaction onsite. The integration meeting will take place at the Level27 club in the center of Warsaw. The event starts at 19:00.
More information HERE.


30.03.2023 - 2ND CONFERENCE DAY | ONLINE only
9.30 - 12.00
PARALLEL TECHNICAL WORKSHOPS (all participants can join) | online
DESCRIPTION:
Newer platform technologies enable analysts to build pipelines after two to three weeks of training, as opposed to existing technology stacks that require extensive experience in dedicated engineering tools and languages like Python. This change is revolutionary for larger organisations, as it unlocks a large pool of resources and provides strategic flexibility. This workshop will cover modern data platform architecture principles. The session is intended to be an open discussion around the design choices that we will be presenting.
SESSION LEADERS:
Reckitt
Reckitt
DESCRIPTION:
Designing a scalable data platform on Azure, using the most popular technologies for data ingestion, storage, and processing. Technologies used: Azure Data Lake, Azure Synapse, Azure Data Factory, Azure Event Hub, data streaming, etc. A hands-on workshop to set up a data platform on Azure from scratch: creating sample data pipelines for batch & streaming data and walking through different Azure components.
SESSION LEADER:
SoftServe
12.00 - 13.00
BREAK
13.00 - 13.10
Plenary session
13.10 - 13.35
Plenary session
13.40 - 17.10
PARALLEL SESSIONS
13.40 - 14.10
Parallel session 1
Parallel session 2
Parallel session 3
14.15 - 14.45
Parallel session 1
#OpenLineage #Column-Level Lineage #Observability #Spark
OpenLineage is a standard for metadata and lineage collection that is growing rapidly in adoption. Column-level lineage is one of its most highly anticipated new features. In this talk, we:
show the foundations of column lineage within the OpenLineage standard,
provide a real-life demo of how it is automatically extracted from Spark jobs,
describe and demo column lineage extraction from SQL queries,
show how the lineage can be consumed on the Marquez backend.
We aim to focus on practical aspects of column-level lineage that will be interesting to data practitioners all over the world.
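As a rough illustration of what the talk's lineage consumers work with, here is a hand-written, approximate example of the shape of OpenLineage's column lineage dataset facet (the namespaces and column names are invented; consult the OpenLineage specification for the authoritative schema):

```python
# Approximate shape of an OpenLineage columnLineage facet: each output
# column maps to the input dataset columns it was derived from.
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "total_amount": {
                "inputFields": [
                    {"namespace": "warehouse", "name": "orders", "field": "price"},
                    {"namespace": "warehouse", "name": "orders", "field": "quantity"},
                ]
            }
        }
    }
}

# A consumer (e.g. a lineage backend like Marquez) can walk this mapping
# to answer: which source columns feed a given output column?
sources = {f["field"] for f in
           column_lineage_facet["columnLineage"]["fields"]["total_amount"]["inputFields"]}
print(sources)
```

This is the kind of metadata the Spark and SQL integrations emit automatically, so downstream tools never have to parse the jobs themselves.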

Parallel session 2
#NLP #MLOps #Socialmedia #Pharmadata #AWS
Explain the business need
Overall data platform architecture in AWS
Data linking process using NLP techniques
Role of MLOPS framework in linking
Importance of social media data listening using Brandwatch

Parallel session 3
#streaming #bigdata #dataengineering #aws
Delivering data at scale is not a problem anymore. Cloud services and distributed data processing frameworks make this task relatively easy, unless... unless you face an ordered delivery requirement. The talk sheds some light on the problems you may face and the possible solutions. Although the presentation covers AWS data services, the problems and solutions can be generally applied.
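One classic building block for meeting an ordered-delivery requirement (not necessarily the talk's solution; the event data here is invented) is a reordering buffer: hold out-of-order events in a min-heap keyed by sequence number and release them only once the next expected sequence number has arrived.

```python
import heapq

def reorder(events):
    """Release (seq, payload) events in sequence order, buffering any
    that arrive early until the gap before them is filled."""
    buffer, expected, out = [], 0, []
    for seq, payload in events:
        heapq.heappush(buffer, (seq, payload))
        while buffer and buffer[0][0] == expected:
            out.append(heapq.heappop(buffer)[1])
            expected += 1
    return out

print(reorder([(1, "b"), (0, "a"), (3, "d"), (2, "c")]))  # ['a', 'b', 'c', 'd']
```

The trade-off is the usual one in streaming: buffering restores order at the cost of latency and memory while waiting for the missing sequence numbers.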

14.50 - 15.20
Parallel session 1
#mlops, #devops, #engineering culture
Monitoring, observability, scalability, reproducibility, CI/CD, logging, tracing, blah blah blah - all of the major SRE-ish words are present in ML powered solutions too. Can we just blindly apply the same tooling we know from the usual systems there? Well, if we apply the Betteridge's law of headlines as well as take into account that a separate MLOps term was coined, we know that the answer is probably "no".
In this talk, I will present an arbitrary production ML system and focus on the tooling around it. You will immediately be struck by both the similarities and the differences between the DevOps world and the MLOps world. Suddenly, the logging, monitoring, or reproducibility tools and concepts you know will gain a lot of new features and responsibilities.

Parallel session 2
#feedback #loops #recommender #systems
Degenerative feedback loops are no longer present only in recommender systems
When data distribution shifts, model retraining will usually make the situation worse
The talk summarizes how to detect and prevent degenerative feedback loops in a production environment, applied to a mobile game

Parallel session 3
#monetization #strategy #semantics #blockchain
Money does grow on trees. We are surrounded by an extreme amount of public, free, and open data that everybody can get. But in reality, sourcing, integrating, and making sense of it is a very complex and technically challenging process. Please join us to see how to build a tech stack and a sustainable business model for monetizing theoretically free big data sources.

15.20 - 15.30
BREAK
15.30 - 16.00
Parallel session 1
#iot #azure #event-hub #devices #scalability
The aim of this presentation is to provide insights about key factors and decisions that drive the technical and business aspects of the IoT data-driven project. We will present the initial setup, the challenges, the state of the project after changes, the future vision and the plan on how to achieve it.
Points to consider when moving the data from the factory to the Azure Cloud
Battle-testing the reference architecture vs real-life use cases
Scaling the solution by onboarding new factories - how to design the best architecture
Device Telemetry data analysis use-case - deep dive into the data

Parallel session 2
#AI #AnomalyDetection #RCA #Network #PredictiveNetworkMaintenance #AI-EmpoweredNetwork
Have you ever wondered how AI is used in the telco network area?
Orange, as a data-driven and AI-powered telco company, places Data and AI at the heart of its innovations for smarter networks. Network management is supported by Machine Learning models that detect anomalies based on different types of network data. During this presentation we will zoom in on Predictive Network Maintenance based on system logs from network nodes. Our aim is to shorten root cause analysis with Explainable AI to improve network experts' daily work.

Parallel session 3
#ApacheFlink
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
In this presentation we will cover:
Understanding Flink Job basics.
Where to start performance analysis of records processing?
What about analysis of the checkpointing or recovery process?
Various Tips & Tricks

16.05 - 16.35
Parallel session 1
#mlops #featurestore #rust
Rust is known for its speed and safety; that’s why it is widely used in operating systems, databases, and virtual reality.
But can we use it in the data science and machine learning areas to boost processes and lower infrastructure cost?
During the presentation I would like to show how we can use Rust libraries and components (integrated with Python) to speed up MLOps in several areas (data processing, feature stores, feature serving, and model serving).

Parallel session 2
#realtime #analytics #Pinot
The history of data analysis in business
The need for real-time, user-facing analytics
Intro to Apache Pinot

Parallel session 3
#MedTech #DataGeneration
Generating data from never-before-seen distributions based on expert knowledge
Strategies on how to effectively cooperate with the teams of experts from other fields
Technical solutions on how to evaluate models and retrain them using new data from other distribution

16.40 - 17.10
Parallel session 1
#dataops #dataquality #dataeng
If data is the new oil, you must be sure your pipe(line)s do not leak!
We can take DataOps practices a step further through the use of modern tools like git-like data lake technologies and open table formats.
Outline
state of the art
a common use case
towards a solution
conclusion

Parallel session 2
Chronon is Airbnb's data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting and writing transformations for machine learning tasks. Chronon reduces this work from months to days by making the process declarative. It allows data scientists to easily define features in a simple configuration language. The framework then provides access to point-in-time correct features for both offline model training and online inference. In this talk we will explore the problems that arise in an industrial feature engineering context and explain how you can use Chronon to solve them.
Parallel session 3
#Real-time analytics #Fleet monitoring #Streaming processing #Streaming SQL
Introducing streaming SQL
Benefits of using streaming SQL
How different streaming SQL features can help to monitor fleet in real-time
Tumble window
Session window
Gaps analysis
Sub-stream analysis
Real-time alerts and automations
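Session windows, one of the features listed above, can be sketched in plain Python (the timestamps and gap are invented; in streaming SQL this would be a `SESSION` window): events closer together than a gap threshold belong to one session, and a larger gap starts a new one.

```python
def sessionize(timestamps, gap):
    """Group sorted event timestamps into sessions: a new session
    starts whenever the gap to the previous event exceeds `gap`."""
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

print(sessionize([1, 2, 3, 30, 31, 90], gap=10))  # [[1, 2, 3], [30, 31], [90]]
```

For fleet monitoring, the same gap logic is what flags a vehicle whose telemetry stream has gone quiet for too long.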

17.15 - 17.45
Plenary session
17.45 - 18.00