Agenda - Big Data Technology Warsaw Summit
Big Data Technology Warsaw Summit 2023
Our agenda is packed with presentations, arranged into 6 categories – find the topics you care about most!

You can choose whether you prefer to WATCH THE CONFERENCE ONLINE or JOIN US IN PERSON IN WARSAW. More presentations, more experts and more topics!
28.03.2023 - WORKSHOP DAY
8.30 - 9.00
9.00 - 16.00
PARALLEL WORKSHOPS (independent workshops, paid entry)
All three independent workshops (paid entry) will take place the day before the conference, on March 28, onsite on the 2nd floor of the Warsaw Marriott Hotel in the rooms Wawel & Syrena, Ballroom E and Ballroom F. The conference rooms will be clearly marked so that each workshop participant can easily find their selected session. More about the location of the workshops HERE.
DESCRIPTION:
In this one-day workshop, you will learn how to create modern data transformation pipelines managed by dbt and orchestrated with Apache Airflow. You will discover how you can improve your pipelines’ quality and the workflow of your data team by introducing a set of tools aimed at standardizing the way good practices are incorporated within the data team: version control, testing, monitoring, change data capture, and easy scheduling. We will work through typical data transformation problems you may encounter on the journey to delivering fresh & reliable data, and show how modern tooling can help solve them. All hands-on exercises will be carried out in a public cloud environment (e.g. GCP or AWS).
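As a flavor of what orchestrating dbt looks like in practice, an Airflow task often just shells out to the dbt CLI; the helper below is a hypothetical sketch (our own function name, not workshop material) of assembling such a command:

```python
# Minimal, hypothetical sketch of how an Airflow task might assemble a dbt
# CLI invocation (in practice a BashOperator or a dedicated dbt operator
# would execute the resulting command).
from typing import List, Optional

def dbt_run_command(target: str, select: Optional[str] = None,
                    full_refresh: bool = False) -> List[str]:
    """Build the argument list for `dbt run` against a given target."""
    cmd = ["dbt", "run", "--target", target]
    if select:
        cmd += ["--select", select]   # run only the selected models
    if full_refresh:
        cmd.append("--full-refresh")  # rebuild incremental models from scratch
    return cmd

# Example: a daily job that refreshes only the staging models on prod.
print(dbt_run_command("prod", select="staging"))
```

In Airflow, a list like this would typically become a BashOperator command; scheduling, retries and monitoring then come from the orchestrator.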
Participants limit: 20
SESSION LEADER:
GetInData | Part of Xebia
GetInData | Part of Xebia
DESCRIPTION:
In this one-day workshop, you will learn how to operationalize Machine Learning models using popular open-source tools, like Kedro and Kubeflow, and deploy them using cloud computing.
During the course, we simulate real-world end-to-end scenarios – building a Machine Learning pipeline to train a model and deploy it in a Kubeflow environment. We’ll walk through practical use cases of MLOps for creating reproducible, scalable, and modular data science code. Next, we’ll propose a solution for running pipelines on Google Cloud Platform, leveraging managed and serverless services. All exercises will be done using either a local Docker environment or a GCP account.
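To give a feel for the pipeline style used by tools like Kedro, here is a deliberately simplified, self-contained sketch of the node-and-catalog idea (our own toy code, not Kedro's actual API):

```python
# Toy illustration of the node/pipeline idea behind Kedro-style ML pipelines:
# nodes are plain functions, and a "catalog" of named datasets flows through
# them. This is our own simplification, not Kedro's real interface.

def train(features):
    """Node 1: 'train' a trivial model (a mean predictor)."""
    return sum(features) / len(features)

def evaluate(model, features):
    """Node 2: score the model (mean absolute error against the data)."""
    return sum(abs(x - model) for x in features) / len(features)

def run_pipeline(catalog):
    """Run the nodes in order, reading and writing named datasets."""
    catalog["model"] = train(catalog["features"])
    catalog["score"] = evaluate(catalog["model"], catalog["features"])
    return catalog

result = run_pipeline({"features": [1.0, 2.0, 3.0]})
print(result["model"], result["score"])
```

The value of the real frameworks is that each node becomes independently testable and the whole graph can be shipped to a runtime such as Kubeflow unchanged.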
Participants limit: 18
SESSION LEADERS:
GetInData | Part of Xebia
GetInData | Part of Xebia
DESCRIPTION:
In this one-day workshop you will learn how to build streaming analytics apps that deliver instant results in a continuous manner on data-intensive streams. You will discover how to configure streaming pipelines, transformations, aggregations and triggers using SQL and Python in a user-friendly development environment built on the open-source tools Apache Flink, Apache Kafka and GetInData OSS projects.
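For intuition, the kind of aggregation such a streaming pipeline computes – a tumbling-window count per key – can be sketched in plain Python (a toy batch simulation, not Flink code):

```python
# Toy batch simulation of a tumbling-window count per key, the kind of
# aggregation a Flink SQL pipeline would compute continuously on a stream.
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """events: (timestamp, key) pairs; returns {(window_start, key): count}."""
    counts = defaultdict(int)
    for ts, key in events:
        # Each event falls into exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "a"), (4, "a"), (6, "b"), (11, "a")]
print(tumbling_window_counts(events, window_size=5))
```

In Flink SQL the equivalent is a GROUP BY over a TUMBLE window; the point here is only the windowing semantics, which a streaming engine evaluates incrementally as events arrive.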
Participants limit: 18
SESSION LEADER:
GetInData | Part of Xebia
19.00 - 22.00
Evening Meeting for Speakers. Let's meet! To talk, to meet new people, to exchange experiences. We invite you to face-to-face interaction onsite. The meeting will take place at the Floor No2 restaurant at the Marriott Hotel in the center of Warsaw. The event starts at 19:00.


29.03.2023 - 1ST CONFERENCE DAY | HYBRID: ONLINE + ONSITE
8.00 - 9.00
Registration of participants at the hotel; morning coffee, breakfast and networking time
9.00 - 9.15
Plenary session
Evention
GetInData | Part of Xebia
9.15 - 9.40
Plenary session
The banking sector continuously adopts new and more advanced analytical models. As the number of models grows, it becomes more and more difficult to maintain the required time-to-market for putting the models into production. We will show how we are navigating our cloud-based MLOps journey. It is sometimes difficult, but in the end inevitable.
PKO Bank Polski
9.45 - 10.15
Plenary session
This panel will bring together leading experts from selected big data vendors. They will share the latest trends, technologies and methodologies driving the big data industry, leaving you with practical insights you can apply in your organization. We will ask all panelists a series of deep tech questions.
Panel participants:
Snowflake
Google Cloud
Cloudera
10.15 - 10.40
Plenary session
#databricks #terraform #devops #security #azure
Creating an Azure Databricks environment is as simple as the click of a button, but how do you ensure the platform is secured and protected from data exfiltration? How can Infrastructure as Code and Terraform support platform hardening? How can DevOps accelerate your big data projects? How can big data engineers benefit from an automated and secured platform? What are the key configuration options to consider? What are the pitfalls and limitations? What could be improved in 2023?

Iqvia
10.40 - 11.05
Plenary session
#data, #analytics, #dataarchitecture, #Vertica
Intrum is Europe’s undisputed market-leading credit management company. We have been utilizing big data for years; it is a constant work in progress with challenges, improvements and adjustments. We are bringing together data from all countries to enable data products at all levels. The presentation will walk through our journey so far, as well as what we will be looking at going forward to handle the ever-growing thirst for data. We will look at why we added Vertica to our data landscape on top of Hadoop, which Vertica capabilities we leverage today for optimal performance and management of data for 24 countries, and what will change as we look towards the cloud: for example, the use of the Delta format for raw data, Databricks for some of the processing, and a switch to Vertica EON on Kubernetes.

Intrum Global Technologies
11.05 - 11.35
BREAK
11.35 - 13.15
Host
GetInData | Part of Xebia
Host
GetInData | Part of Xebia
Host
GetInData | Part of Xebia
Host
Evention
Host of parallel session 1
GetInData | Part of Xebia
Host of parallel session 2
GetInData | Part of Xebia
Host of parallel session 3
GetInData | Part of Xebia
Host of parallel session 4
Evention
11.35 - 12.05
Parallel session 1
During this short presentation I will show how to effectively prepare every step that leads from raw data, along the bumpy road of data engineering, feature extraction and model building, to the final goal: promoting the model and moving it to production.
As important spin-offs, I will show how metadata lineage can improve understanding of the whole process and how Grafana panels can provide broad analytic visualization.

Ab Initio
Parallel session 2
The Volvo Group has a very ambitious strategy that assumes 50% of its revenue by 2030 will come from services and solutions. Most of those services will be digital and will have data needs. We want to deep-dive into the large-scale data ecosystem and share the challenges we faced while developing the cloud analytical platform. We will share the journey from the high-level architecture phase to the practical implementation and operational cases in the context of the MS Azure stack.

Volvo Group Digital & IT
Volvo Group Digital & IT
Parallel session 3
In the context of a real-life cybersecurity use case at scale, we'll describe a modern streaming architecture providing the ability to enable real-time actionable insights and actions at scale in the event of a cyber attack. We'll especially focus on the role of NiFi in this architecture and how it is a key element for your data distribution & acquisition problems in modern environments where we have to deal with a wide range of systems, interfaces, formats, etc.

Cloudera
Parallel session 4
The GPT family of models is taking the world by storm. And using them has never been easier – Azure is offering OpenAI as a generally available managed service. But how can we start using it and infuse our solutions with generative intelligence? This session will answer this question along with addressing: what exactly is available on Azure from the OpenAI suite? What models should we use and when? And how can we work with, and integrate, this service in our development process?

Microsoft
12.10 - 12.40
Parallel session 1
A real-life example of implementing machine learning techniques to automatically analyze large amounts of audit data in order to arrive at high-value IT audit observations and make the entire audit process more efficient. What are the results of audit observations based on machine learning? The presentation will help you understand how new audit analytics might improve the auditing process and make the entire audit function modern and effective.

ING Hubs Poland
Parallel session 2
Data Mesh! Just a catchy concept or more than that? In this session we would like to guide you through our Data Mesh Journey at Roche. Starting from the very first idea and motivation up to the actual evolution to an enterprise-level self-serve data platform including insights into multiple capabilities (e.g. Snowflake, DataOps.live etc.). We will explain the Data Mesh Concept and how the combination of cloud services enables teams to fulfill all characteristics of a Data Product.

Roche Informatics
Roche Informatics
Parallel session 3
Strictly speaking, they are not dead... yet, but they will be if we don’t change our approach.
Firstly, it now takes around 6 months from the start of building a data platform until analytical work becomes possible. Secondly, companies very often don’t know how to monetize data, or they don’t have the resources to use it.
But what if we change our approach and build the data platform in an analysis-oriented way?
Start by thinking about how data can help your organization,
then do the analysis,
and the platform will be created "by the way" as a response to analytical needs.
Join our presentation and see how we do it at BlueSoft.

BlueSoft
BlueSoft
Parallel session 4
Meta Supercomputer, Spark & Kafka processing, rapid restore of data in the 100 GBps regime and S3 cloud integration - all in one solution. Seems impossible? Pure Storage FlashBlade has been solving big data challenges like no storage solution before. The session will focus on modern use cases that work both on-prem and in the cloud for AI, ML and container-based workloads, among many others. Data platforms are NOT dead - they are re-born.

Pure Storage
12.45 - 13.15
Parallel session 1
Learn best practices for how to create tables that allow you to answer any questions. Improve the productivity of your analysts and data scientists by creating easy-to-use and understand tables.
#dataengineering #analytics #productivity #sql

ex-Kry, ex-Spotify
Parallel session 2
The aim of this presentation is to provide insights about key factors and decisions that drive the technical and business aspects of the IoT data-driven project. We will present the initial setup, the challenges, the state of the project after changes, the future vision and the plan on how to achieve it.
Points to consider when moving the data from the factory to the Azure Cloud
Battle testing referential architecture vs real life use-cases
Scaling the solution by onboarding new factories - how to design the best architecture
Device Telemetry data analysis use-case - deep dive into the data

Datumo
C&F
Parallel session 3
The revolution of machine learning is reaching every aspect of our lives - including art and music.
In this talk, we will dive into the world of song analysis and the extraction of lyrical and musical features. We will discuss existing approaches, both in machine learning - Natural Language Processing, Digital Signal Processing - and in music theory & linguistics.
We will see how we can use these features in different kinds of machine learning models, and how these models can be used to solve problems in the music industry, such as song tags and song similarity.
Attend this talk to learn how your technical skills can also be useful for your hobbies.

Meta
Parallel session 4
In this presentation, we will explain the fundamentals of Apache Flink: What are the common use-cases, how do you build applications on top of it, how does it integrate with other systems and how does it help solve operational challenges.
Also, we will discuss new features added in the last Flink releases, such as the Kubernetes operator, batch & streaming unification and more.

Decodable
13.15 - 14.00
LUNCH BREAK
14.00 - 15.40
14.00 - 14.30
Parallel session 1
We all start with the same dream of an amazing, well-curated data lake. A place where we could store all the company's valuable data indefinitely, a lake of insights just waiting to be found.
But reality has proven to us again and again that these lakes of insight we seek more often than not turn out to be chaotic, unstructured data swamps.
But how about a different approach? What if we could solve schema, cataloging, ownership, governance and many other issues before the data is even created?
In this talk we will discuss how we've done exactly that in Wix's data platform. We will review how we've built the tools, culture and development ecosystem to make BI data a first-class citizen company-wide, and how that enabled us to build a fully structured and curated data lake at petabyte scale.

Wix.com
Parallel session 2
Data engineering has become a more complex environment and means different things to different people and organisations.
Understanding this is useful when hiring, looking for a job, or assessing solutions and vendors.
In this presentation, we will go through different approaches, taking into account engineering culture, data landscape, expected outcomes and hiring/contracting history.
For each, we will look at examples, pros and cons.

PepsiCo
Parallel session 3
Exploring Lean Data Science for Recommendation Systems
Showcasing the Solution Evolution - from heuristics to complex ML models
How different industries and organizations shape the approach to designing recommendation systems

FREE NOW
Parallel session 4
What is Data Mesh?
What is a Data Product?
What are Consumer-Driven Contracts (CDCs)?
How to apply CDCs to safely compose data products?

HTW Berlin; Thoughtworks
14.35 - 15.05
Parallel session 1
The selection of managed and cloud-native machine learning services on which you can run your data science pipelines and deploy your trained models is broad. Unfortunately, there is no single way of interacting with platforms like Amazon SageMaker Pipelines, Google Vertex AI Pipelines, Microsoft Azure ML Pipelines and Kubeflow Pipelines. In this presentation you will learn how the GetInData MLOps Platform, powered by battle-tested technologies such as Kedro, MLflow and Terraform, can make your data scientists’ lives easier and more productive - regardless of which cloud provider you use.
#datascience #ML #kubeflow #vertexai #MLOps #Sagemaker #AzureML

GetInData | Part of Xebia
GetInData | Part of Xebia
Parallel session 2
Ingesting sources by generating Airbyte Connections from DBT sources
Business Model Based Data Vault Integration Layer
DBT Hardrules to prepare Source Data for Datavault or Raw Consumption
Mapping Source Data to the Business Model
DBT Softrules to implement Data Products and re-usable components based on the Business Model
Combining Raw and Business Model based data to reduce and operationalise Data for ML Use Case

Alligator Company
Parallel session 3
Active Learning - a way to collect labeled data wisely, achieving equivalent or better performance with fewer data samples and saving time and money.
Background - what types of methods are there? What is the motivation behind using active learning?
Methods - two different methods will be described, demonstrated on a specific real-world use case
Results - what are the results of these methods compared to regular random sampling?
Validation - how can we validate the results? How can we be sure that all the components behave as we want?
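As one concrete example of such a method, uncertainty sampling selects the unlabeled examples the current model is least confident about; a minimal sketch with made-up probabilities (our own illustration, not the session's actual use case):

```python
# Uncertainty sampling, the simplest active-learning strategy: ask for labels
# on the samples whose predicted class probabilities are least confident.
# The probabilities below are made up and stand in for a real model's output.

def least_confident(probabilities, k):
    """Return indices of the k samples whose top-class probability is lowest."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: max(probabilities[i]))
    return ranked[:k]

# Predicted class probabilities for four unlabeled samples (binary task).
probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
print(least_confident(probs, k=2))
```

Labeling only these borderline samples is what lets active learning match random sampling's accuracy with far fewer labels.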

Healthy.io
Parallel session 4
Handling billions of messages/day is not an easy challenge for streaming applications: the streaming world never sleeps, while humans … well, they should! In this talk, a real-world practical solution is presented for Near-Real-Time streaming applications monitoring and automated maintenance, including:
a flexible and scalable architecture to be resilient against extreme volume and velocity dynamics
an actionable monitoring system to enable automated recovery processes
a customizable triggers system
an interoperable data model for health metrics
a decoupled and application-agnostic design

Agile Lab
15.10 - 15.40
Parallel session 1
How GE Healthcare has transitioned effectively from an on-prem analytical ecosystem to the cloud.
Migration options and approaches
Best Practices for re-platform & redesign of legacy solutions to cloud
Pitfalls: What people do not say...
Retrospective and lessons learned

GE Healthcare
GE Healthcare
Parallel session 2
During the presentation we will try to help you not get lost among the most popular data processing engines. Our presentation is an attempt to answer the following questions: Which data processing engine should I use? What are the key points to consider when making such a decision?
We will examine these technologies from the perspective of your ETL processes - starting from the complexity of data pipelines, through the amount of data, to maintainability and required expertise. We will focus on the leading open-source and cloud technologies like Apache Spark, Apache Beam, BigQuery/Snowflake/Hive, and many more.

Allegro.pl
Allegro.pl
Parallel session 3
Time series data is fundamental for most modern businesses; however, handling it involves a variety of common issues that can lead to serious consequences when not treated with care.
In this talk, I will introduce the audience to several such common issues, their causes and effective solutions.
This will include: (1) stabilising complex time series dynamics; (2) imputation of time-clustered missing values; (3) reducing impact of noise on forecasting models.
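As a tiny illustration of issue (2), forward fill is a common baseline for imputing missing observations; a toy sketch (our own example, not the talk's solution):

```python
# Forward fill: a common baseline for imputing missing values in a time
# series. None marks a missing observation; each gap is filled with the
# most recent observed value.

def forward_fill(series):
    """Replace each None with the last observed value (leading gaps stay None)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

print(forward_fill([1.0, None, None, 3.0, None]))
```

For time-clustered gaps, as the talk notes, naive forward fill can mask long outages, which is exactly why more careful imputation strategies matter.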

Xebia Data (Netherlands)
Parallel session 4
Meetups are a great way to upskill your team, create internal and external data communities, improve your leadership skills, and hire new talent. In fact, the tech world is famous for its vibrant meetups and conferences. In this talk, we demystify the organization of meetups: from finding sponsors and getting world-class speakers to attracting a relevant audience.

Xebia
15.40 - 16.00
BREAK
16.00 - 16.25
Plenary session
This panel will bring together practitioners from selected enterprises running big data projects. They will share experiences and lessons learned from implementing interesting big data use cases, showing where we are headed in the use of big data technologies and tools and the fulfillment of business needs. We will ask all panelists a series of practical questions that might inspire you in the context of your organization.
ING Hubs Poland
Roche Informatics
Reckitt
16.25 - 16.50
Plenary session
How do you build a good Netflix signup experience? In a consumer-facing product, user journeys - like account creation - have a direct impact on business metrics. Come and learn how Netflix uses (big) data to continuously improve its user journeys.
#DataAnalysis #UserJourneys #A/Bexperimentation

Netflix
16.50 - 17.45
ROUNDTABLES (ONSITE only)
Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to that group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion - selected professionals with vast knowledge and experience.
There will be one roundtable session, so every conference participant can take part in 1 discussion.
Roundtable discussion
A serverless data transformation approach seems to be a good choice for many, if not all, use cases. Do you agree? Probably the answer is not so straightforward, so I would like to invite you to a discussion about it. We will share our real project experience, talking about chances and risks. We will analyse several perspectives, including: available technology stacks, solution delivery time, overall cost, scale of the solution, availability of competences, and vendor lock-in. See you soon!
Roundtable discussion
The data lakehouse is a concept everybody knows. But how, together with AI, can it support business decisions? Should the data lakehouse keep feeding AI so it can make its own correct decisions in the finance world? Let's discuss the core concepts together: is a data lakehouse really needed, and can AI, at some point in time, make proper investment decisions?
Roundtable discussion
Digital experimentation: the ultimate weapon for businesses seeking a competitive edge in the world of tech. By using real-time data and cutting-edge tech, it optimizes digital offerings, enhances user experiences, and drives growth. Unlock the power of digital experimentation to take your digital game to the next level.
Roundtable discussion
What we can observe today is the rapid growth of new AI models that achieve astonishing results in generating text, audio, images, and even source code. It becomes clear that the generative AI revolution has begun. The question is: are companies ready for it? Join me for a discussion – everyone will have a chance to share their thoughts, discuss experiences, exchange ideas, ask and answer questions.
Discussion points include e.g.:
● Which AI models have the potential for being applied by companies?
● How can they be used to gain profit?
● What are the risks of using them?
● How can we mitigate these risks?
Roundtable discussion
What are the key characteristics of a self-service data streaming platform? Is ANSI SQL good enough? If not, what's missing? What about data discoverability? How data should be shared inside of an organization and how it should be secured? What else does your organization care deeply about?
Roundtable discussion
Data Mesh has been one of the most hyped buzzwords in the data space for the past 2-3 years. Everybody talks about it, but only a few have fully grasped the concepts behind it. Let's get together at this roundtable, to discuss our impressions and understandings of Data Mesh, what is at the core of its concepts, and what experiences we can share and learn from each other.
Roundtable discussion
There’s a lot of fuss around Data Quality, Data Observability and DataOps, but what are they all about?
How do they fit into a data platform or a data practice? What are the most relevant tools in the space?
Join this roundtable to discuss your experience and learn from other people's experiences with DataOps, Data Observability and Data Quality.
Roundtable discussion
From on-prem Hadoop to GCS, Dataflow, and Bigquery, from TSV to protobuf, from EMR to Databricks, from EC2 to k8s, from Redshift to Delta and Presto. Sounds familiar? Let's talk about migrations in Data Platforms, how to do that, and why it's hard, costly, and usually takes longer than expected.
- Why does it take longer than initially estimated?
- Why do users often not want to / can't migrate now?
- Do you always need to run two data infrastructures (the old and the new one) in parallel? Will it cost twice as much as expected?
- Should we offload migration to the users or migrate on their behalf by task force teams?
- Why do we need to migrate again, haven't we finished the last migration like 3 months ago?
Roundtable discussion
At a time when various industries are increasingly embracing AI, the excitement around the opportunity may overshadow the wide-ranging ethical & legal implications. For example, an ed-tech startup may fast-track its way to operationalizing a language tutoring chatbot by using pre-trained language models containing built-in bias, which may lead to reputational damage if not managed properly. Similarly, a healthcare company may choose to implement a highly accurate black box model without realizing that the customers (or regulators) may require a solution that is transparent and interpretable––even if it is less accurate. In this session, we’ll share our approaches to these and other considerations around developing responsible AI.
Roundtable discussion
MLOps landscape is rapidly evolving with new tools being released almost on a daily basis. How not to get lost in such a deluge of solutions? What are the main selection criteria and core features that such frameworks need to have? Is the Kubernetes platform always the preferred runtime environment? When would you go for fully managed services like Vertex AI pipelines or Azure ML pipelines? What approach would you choose in the case of a hybrid setup? To abstract, or not to abstract - does it make sense to have a unified pipeline API? What are your MLOps plans for 2023? Let’s brainstorm together!
17.45 - 18.00
SUMMARY & PRIZE GIVEAWAY
19.00 - 22.00
Evening Meeting for all (*advance registration for the event is required)
Let's get together! To talk, to meet new people, to see old colleagues. We invite you to face-to-face interaction onsite. The meeting will take place at the Level27 club in the center of Warsaw. The event starts at 19:00.
More information HERE.


30.03.2023 - 2ND CONFERENCE DAY | ONLINE only
9.30 - 12.00
PARALLEL TECHNICAL WORKSHOPS (open to all participants)
DESCRIPTION:
Newer platform technologies enable analysts to build pipelines after 2 to 3 weeks of training, as opposed to existing technology stacks that require extensive experience with dedicated engineering tools and languages like Python. This change is revolutionary for larger organisations, as it unlocks a large pool of resources and provides strategic flexibility. This workshop will cover the principles of modern data platform architecture. The session is intended to be an open discussion around the design choices we will be presenting.
SESSION LEADERS:
Reckitt
Reckitt
DESCRIPTION:
Designing a scalable data platform on Azure using the most popular technologies for data ingestion, storage and processing. Technologies used: Azure Data Lake, Azure Synapse, Azure Data Factory, Azure Data Hub, data streaming, etc. A hands-on workshop on setting up a data platform on Azure from scratch: creating sample data pipelines for batch & streaming data and walking through different Azure components.
SESSION LEADER:
SoftServe
DESCRIPTION:
The GPT family of models is taking the world by storm. And using them has never been easier – Azure is offering OpenAI as a generally available managed service. But how can we start using it and infuse our solutions with generative intelligence? This session will answer this question along with addressing: what exactly is available on Azure from the OpenAI suite? What models should we use and when? And how can we work with, and integrate, this service in our development process?
SESSION LEADER:
Microsoft
12.00 - 13.00
BREAK
13.00 - 13.10
Plenary session
Evention
GetInData | Part of Xebia
13.10 - 13.35
Plenary session
In this presentation, we will look into Revolut's architecture, which takes a counterintuitive, very uncomplicated and growth-friendly approach, using surprisingly simple architecture and patterns. You will never look at systems of this scale the same way again.
- Over-engineering - how do we make complex systems complicated
- Systems thinking - complicated vs. complex
- How to make our systems (really) less complicated with examples of battle-proven Revolut's approach
- Data analytics on the scale of Revolut's operations, made simple
- Dealing with complexity - combining DDD and Systems Thinking patterns - touch on Data Meshes
- Deferring decisions as a skill
- Hypergrowth scaling and legacy

Revolut Business
13.40 - 15.20
13.40 - 14.10
Parallel session 1
If data is the new oil, you must be sure your pipe(line)s do not leak!
We can take DataOps practices a step further through the use of modern tools like git-like data lake technologies and open table formats.
Outline
state of the art
a common use case
towards a solution
conclusion

Agile Lab
Parallel session 2
DataOps has become a trendy topic right now… easy and efficient ingestion, transformation and complex analytics of data in various formats is becoming a must today - simply to understand data insights and gain a competitive advantage.
Today's data pipelines involve combining efficient solutions for ingestion (e.g. Kafka), transformation (e.g. dbt) and a database layer for ultrafast analytics workloads. All of them must meet the following expectations:
- Simplified yet highly scalable architecture
- Fault-tolerant architecture
- Processing of high volumes of data (in various formats - structured or semi-structured)
- Streamlined data pipeline development

Vertica by OpenText
Parallel session 3
Do you struggle with long time-to-value, unreliable data, vendor lock-in, and expensive and inflexible solutions that don't adapt to changing market conditions? You're not alone. Many organizations have turned to the Modern Data Stack as a solution, but simply adopting modern cloud tools is not enough to ensure engineering quality and self-serviceability. That's why we've integrated cloud (e.g. Snowflake, Databricks, or BigQuery) and state-of-the-art open-source components (e.g. dbt, Airflow) and our own tools (also open-source!) into a powerful, cost-efficient, and quick-to-deploy platform. In this session, we'll share the architecture of the Modern Data Platform we've implemented with our customers and their real-world case studies, where the goal was to provide a scalable and self-serviced environment for analytics engineers.

GetInData
14.15 - 14.45
Parallel session 1
OpenLineage is a standard for metadata and lineage collection that is growing rapidly in adoption. Column-level lineage is one of its most highly anticipated new features. In this talk, we:
- show the foundations of column lineage within the OpenLineage standard,
- provide a real-life demo of how it is automatically extracted from Spark jobs,
- describe and demo column lineage extraction from SQL queries,
- show how the lineage can be consumed on the Marquez backend.
We aim to focus on practical aspects of column-level lineage that will be interesting to data practitioners all over the world.
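To give a flavour of what column-level lineage means in practice, here is a deliberately tiny, self-contained sketch (not the actual OpenLineage or Marquez implementation, and the query shape it handles is minimal) that maps each output column of a simple single-table SELECT back to its source column:

```python
import re

def toy_column_lineage(sql: str) -> dict:
    """Map each output column of a simple 'SELECT a, b AS c FROM t' query
    to the source table column it derives from. Toy sketch only: handles
    a single table and plain column references, no expressions or joins."""
    m = re.match(r"SELECT\s+(.+?)\s+FROM\s+(\w+)", sql, re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported query shape")
    select_list, table = m.group(1), m.group(2)
    lineage = {}
    for item in select_list.split(","):
        # "amount AS total" -> source "amount", output "total"
        parts = re.split(r"\s+AS\s+", item.strip(), flags=re.IGNORECASE)
        source, output = parts[0].strip(), parts[-1].strip()
        lineage[output] = f"{table}.{source}"
    return lineage

print(toy_column_lineage("SELECT id, amount AS total FROM orders"))
# → {'id': 'orders.id', 'total': 'orders.amount'}
```

Real SQL lineage extraction, as discussed in the talk, relies on a proper SQL parser rather than regexes, but the output shape (output column → contributing input columns) is the same idea.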

GetInData | Part of Xebia
GetInData | Part of Xebia
Parallel session 2
The crux of this session lies in showcasing our exceptional prowess in executing a large-scale data linking project through the adept utilization of cutting-edge technologies and frameworks such as NLP, AWS, Kubeflow, dbt, and more. Moreover, we aim to emphasize the significance of social media data listening by leveraging the power of Brandwatch, a powerful tool in this domain.

Roche Informatics
Parallel session 3
Delivering data at scale is not a problem anymore. Cloud services and distributed data processing frameworks make this task relatively easy, unless... unless you face an ordered delivery requirement. The talk sheds some light on the problems you may face and the possible solutions. Although the presentation covers AWS data services, the problems and solutions can be generally applied.

Free2Move
14.50 - 15.20
Parallel session 1
Monitoring, observability, scalability, reproducibility, CI/CD, logging, tracing, blah blah blah - all of the major SRE-ish words are present in ML-powered solutions too. Can we just blindly apply the same tooling we know from the usual systems there? Well, if we apply Betteridge's law of headlines, and take into account that a separate MLOps term was coined, we know that the answer is probably "no".
In this talk, I will present an arbitrary production ML system and focus on the tooling around it. You will immediately be struck by both the similarities and the differences between the DevOps world and the MLOps world. Suddenly the logging, monitoring or reproducibility tools and concepts you know will gain a lot of new features and responsibilities.

Chaos Gears
Parallel session 2
- Degenerative feedback loops are not present only in recommender systems anymore
- When data distribution shifts, model retraining will usually make the situation worse
- The talk summarizes how to detect and prevent degenerative feedback loops in a production environment, applied to a mobile game
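One common way to detect the distribution shift mentioned above (not necessarily the method used in this talk) is the Population Stability Index, which compares a live feature sample against the training baseline. A minimal pure-Python sketch:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a live
    sample; a common rule of thumb is PSI > 0.2 => significant shift.
    Toy sketch: equal-width bins over the combined range."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(data):
        counts = [0] * bins
        for x in data:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(data) + bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]         # training-time distribution
shifted = [5.0 + 0.1 * i for i in range(100)]    # live data drifted upward
print(psi(baseline, baseline) < 0.1)  # identical samples: PSI near 0
print(psi(baseline, shifted) > 0.2)   # shifted sample gets flagged
```

Monitoring a metric like this before each retraining run is one guard against a degenerate loop where the model retrains on its own distorted outputs.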

King (Candy Crush Saga)
Parallel session 3
Money does grow on trees. We are surrounded by extreme amounts of public, free and open data that everybody can access. But in reality, sourcing, integrating and making sense of it is a very complex and technically challenging process. Please join us to see how to build a tech stack and a sustainable business model for the monetization of theoretically free big data sources.

Token Flow Insights SA
15.20 - 15.30
BREAK
15.30 - 17.10
15.30 - 16.00
Parallel session 1
This talk sits at the intersection of the business and tech domains. In this talk, we will explore the specifics of data pipelines in advertising and what problems can be solved with them. The pipelines' architecture will be explored along with concrete examples of the implementation.

Captify
Parallel session 2
Have you ever wondered how AI is used in the telco network area?
Orange, as a data-driven and AI-powered telco company, places Data and AI at the heart of its innovations for Smarter Networks. Network management is supported by Machine Learning models that detect anomalies based on different types of network data. During this presentation we will zoom in on Predictive Network Maintenance based on system logs from network nodes. Our aim is to shorten root cause analysis with Explainable AI to improve network experts' daily work.

Orange Polska
Parallel session 3
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We'll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
In this presentation we will cover:
- Understanding Flink job basics
- Where to start performance analysis of record processing
- How to analyze the checkpointing and recovery process
- Various tips & tricks
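As a taste of the kind of tuning the talk covers, a few Flink configuration options commonly used to speed up checkpointing (illustrative values only; the right settings depend on your job and Flink version):

```yaml
# flink-conf.yaml - illustrative checkpointing-related settings
execution.checkpointing.interval: 60s
execution.checkpointing.unaligned: true   # helps checkpoints progress under backpressure
state.backend: rocksdb
state.backend.incremental: true           # checkpoint only changed state, not full snapshots
```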

Confluent
16.05 - 16.35
Parallel session 1
Rust is known for its speed and safety, which is why it is widely used in operating systems, databases and virtual reality. But can we use it in the data science and machine learning areas to boost processes and lower infrastructure costs?
During the presentation I would like to show how we can use Rust libraries and components (integrated with Python) to speed up MLOps in several areas: data processing, feature stores, feature serving and model serving.

Bank Millennium
Parallel session 2
- The history of data analysis in business
- The need for real-time, user-facing analytics
- Intro to Apache Pinot

StarTree
Parallel session 3
Not too long ago, digital marketers were enjoying the golden age of user-level tracking, which allowed for quick and efficient decision-making and testing. Today, privacy changes and regulations are fundamentally changing the industry landscape. In this talk, we will follow Twigeo, a growth-focused marketing agency, on its journey toward building a modern marketing measurement stack. We will go over the problem, business and data constraints, and solution highlights, such as combining Bayesian modeling with location-based experiments in a self-improving feedback loop. Finally, we will discuss how it's all brought together using a mix of cloud and open-source technologies such as Snowflake, Airflow, MLflow, and Streamlit.

Twigeo
16.40 - 17.10
Parallel session 1
If a job fails, how can you learn about downstream datasets that have become out-of-date? Can you be confident that jobs are consuming fresh, high-quality data from their upstream sources? How might you predict the impact of a planned change on distant corners of the pipeline? These questions become easier to answer once you have a complete understanding of data lineage: the complex set of relationships between all of your jobs and datasets. In this talk, Ross Turk from Astronomer will provide an introduction to the core concepts behind OpenLineage, an open standard for data lineage, and discuss various tactics for using it.
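The "what becomes stale downstream?" question above boils down to a graph traversal over the lineage relationships. A minimal sketch (the graph below is invented for illustration; real lineage would come from an OpenLineage backend such as Marquez):

```python
from collections import deque

# Hypothetical lineage graph: node -> things that consume it
lineage = {
    "raw.orders": ["job.clean_orders"],
    "job.clean_orders": ["dw.orders"],
    "dw.orders": ["job.daily_revenue", "job.churn_features"],
    "job.daily_revenue": ["report.revenue"],
    "job.churn_features": ["ml.churn_model"],
}

def downstream(node: str) -> set:
    """Everything transitively affected if `node` fails or goes stale (BFS)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in lineage.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream("dw.orders")))
# → ['job.churn_features', 'job.daily_revenue', 'ml.churn_model', 'report.revenue']
```

Reversing the edge direction answers the upstream question instead: which sources must be fresh for a given dataset to be trustworthy.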

Astronomer
Parallel session 2
Chronon is Airbnb's data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks. Chronon reduces this task from months to days by making the process declarative. It allows data scientists to easily define features in a simple configuration language. The framework then provides access to point-in-time correct features for both offline model training and online inference. In this talk we will explore the problems that arise in an industrial feature engineering context and explain how you can use Chronon to solve them.
Airbnb
Parallel session 3
- Introducing streaming SQL
- Benefits of using streaming SQL
- How different streaming SQL features can help to monitor a fleet in real time:
  - Tumble window
  - Session window
  - Gaps analysis
  - Sub-stream analysis
- Real-time alerts and automations
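Of the window types listed above, the tumble window is the simplest: fixed-size, non-overlapping buckets keyed by event time. A toy batch sketch of the same semantics (the ping data is invented for illustration; a streaming engine would do this incrementally):

```python
from collections import defaultdict

def tumble(events, size_s):
    """Group (timestamp_s, value) events into fixed, non-overlapping
    windows of `size_s` seconds and count events per window - the same
    grouping a streaming-SQL TUMBLE window produces."""
    windows = defaultdict(int)
    for ts, _value in events:
        windows[(ts // size_s) * size_s] += 1  # key by window start time
    return dict(windows)

# Vehicle pings at t=1,2,11,12,13 seconds, with a 10 s tumble window
pings = [(1, "A"), (2, "B"), (11, "A"), (12, "C"), (13, "A")]
print(tumble(pings, 10))
# → {0: 2, 10: 3}
```

A session window differs in that window boundaries are data-driven: a window closes only after a configured gap of inactivity, which is what makes it useful for trip or gap analysis.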

Timeplus
17.15 - 17.30
Plenary session
Evention
GetInData | Part of Xebia
ONLINE EXPO + KNOWLEDGE ZONE
Free participation
We have a great set of presentations in the CONTENT ZONE, available pre-recorded as Video on Demand for conference participants in advance
During the presentation, using the example of Allegro, we will discuss such issues as:
- massive data (onboarding)
- Public Cloud / Multi Cloud
- building tools for developers
- data mesh, governance
We will also try to tell you how to abstract things between different clouds/ecosystems, and consider whether it is possible at all.

Allegro
Watch this talk to learn the architectural details of why the Hive table format falls short and why the Iceberg table format resolves them, as well as the benefits that stem from Iceberg’s approach.
You will learn:
- The issues that arise when using the Hive table format at scale, and why we need a new table format
- How a straightforward, elegant change in table format structure has enormous positive effects
- The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
- The resulting benefits of this architectural design
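To preview the core idea the talk explains, here is a deliberately simplified sketch (invented structures, not the real Iceberg spec) of how an append-only snapshot history gives readers an atomic, consistent view of a table and enables time travel:

```python
# Toy model of Iceberg-style metadata: every commit writes a new immutable
# snapshot listing the table's data files; readers pin one snapshot.
class ToyTable:
    def __init__(self):
        self.snapshots = []  # append-only snapshot history

    def commit(self, files):
        """Append a snapshot containing all previous files plus new ones."""
        prev = self.snapshots[-1]["files"] if self.snapshots else []
        self.snapshots.append({"id": len(self.snapshots),
                               "files": prev + list(files)})

    def scan(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for time travel."""
        if not self.snapshots:
            return []
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]["files"]

t = ToyTable()
t.commit(["data-0.parquet"])
t.commit(["data-1.parquet"])
print(t.scan())                # latest snapshot sees both files
print(t.scan(snapshot_id=0))   # time travel: only the first file
```

The contrast with the Hive table format, which the talk develops, is that Hive tracks table state as directory listings, so concurrent writers and readers cannot rely on this kind of atomic snapshot swap.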

Dremio
Shiny Object Syndrome – that’s the thing that we, people working in technology, very often fall victim to. Every time there’s a new “kid” on the block – be it in the architectural, infrastructural, organizational or framework space – we all get excited and want to give it a go, since it is supposed to solve the previous issues. It tends to shine so brightly that it overshadows anything else on our dockets – who would care about the old and well-established concepts, right? Well, it just so happens that those concepts have become so established for a reason, and that’s why they usually should be given a higher priority than trying out “the only tool that revolutionizes data modelling”. In this talk I’d like to share the lessons that I’ve learnt over my career and what I prioritize when working on data platforms. I’ll also elaborate on how I’ve fallen victim to shiny new object syndrome, which for me was the delivery of an event-driven solution.

GetInData | Part of Xebia
- Big data applied to reflect work habits with over 100 metrics grouped around: time spent on meetings, deep work, context switching, intra-team bonding, cross-team learning and work-life harmony,
- Meta-data fetching & hashing (.NET), and analytics (Python),
- Meta-data processing challenges, e.g. what is the meaning of interactions, how to deal with the outliers or lack of data,
- ML/AI models for prediction of retention: lessons learned from feature engineering and classical algorithms (e.g. logistic regression, Random Forest, XGBoost),
- ML/AI ways forward: raw data & more contextual ML/AI algorithms (e.g. image processing for calendar view, and GCN graph convolutional networks for interactions).

Network Perspective
- Today's pain points in mental health documentation.
- Developing Topic Modeling: from low data resources to using the advantages of Big Data.
- Topic Modeling techniques: the journey to Deep Learning (DL) based solutions.

Eleos Health
Eleos Health
* Issues faced during the implementation
* How the folks, as a single team, went through it
* Technical implementation (Feature Store, Data Mesh, MLOps)
* Generating business impact

Next Reason