Agenda 2022 - Big Data Technology Warsaw Summit
Learning From Experiments Without A/B Testing - Case Study From Willa (Swedish FinTech)
One of the more vexing challenges product businesses face is establishing causal relationships when measuring the impact of product changes. A/B testing is frequently used to this end, but it is expensive and time-consuming for engineering and design teams, and often requires dedicated tooling as well as randomization. At Willa (a payments and invoicing app for US-based freelancers), we used an econometric technique called difference-in-differences (DiD) regression to tease out the causal impact of simplifying the in-app invoicing process on the invoice creation rate per user. We took advantage of natural variation in product usage between different kinds of users, and reached statistically significant results more cheaply and quickly than we could have via A/B testing.
The technologies used were BigQuery and Jupyter/Databricks, and GCP more broadly, although the approach is platform agnostic. Overall, the presentation will be interdisciplinary in nature, so anyone interested in economics, data science and engineering, cloud technologies and general dynamics of product businesses is welcome to attend.
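The core of the DiD approach described above can be sketched in a few lines. This is a minimal illustration with made-up numbers, not Willa's data or code; in practice you would fit a regression with an interaction term and check the parallel-trends assumption.

```python
# Minimal difference-in-differences sketch (hypothetical numbers, not Willa's data).
# The DiD estimate is (treated post - treated pre) - (control post - control pre):
# differencing twice removes both group-level baselines and the common time trend.

def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Each argument is a list of per-user outcomes (e.g. invoices created)."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(treat_post) - mean(treat_pre)) - (mean(ctrl_post) - mean(ctrl_pre))

# Toy example: both groups drift up by 1.0 over time; the treated group
# additionally gains 0.5 after the product change.
treat_pre, treat_post = [2.0, 3.0, 4.0], [3.5, 4.5, 5.5]
ctrl_pre, ctrl_post = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
print(did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post))  # 0.5
```

The same estimate falls out of an OLS regression of the outcome on group, period, and their interaction, which is what gives you standard errors and significance tests.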
TerrariumDB as a streaming database for real-time analytics
TerrariumDB is a column- and row-store engine designed specifically for behavioral intelligence and real-time data processing, and is the core of the Synerise platform. It simultaneously processes data-heavy analytics while executing various business scenarios in real time. TerrariumDB was designed to analyse behavioural data, where data order and time are important for making business decisions. During the talk, we will describe why we are developing our own distributed database engine, what the challenges and pitfalls were, which use cases TerrariumDB fits best, and how it handles billions of queries per day where the 99th percentile matters.
Eliminating Bias in the Deployment of AI and Machine Learning
The primary source of bias in machine learning is not in the algorithms deployed, but rather the data used as input to build the predictive models. In this talk we will discuss why this is a huge problem and what to do about it. Different sources of bias will be identified along with possible solutions for remedying the situation when deploying machine learning. We will also speak about the importance of transparency when using machine learning to predict outcomes that impact critical decisions.
• Learn why most predictive models are biased.
• Learn about the sources of bias in predictive models.
• Learn how to reduce the negative impact of potential bias in predictive models.
NetWorkS! project - real-time analytics that controls 50% of mobile network in Poland
The ability to analyze data from the mobile network in real time is crucial for diagnostics and for ensuring the quality of service for end customers. To achieve this we built a real-time ingestion and analytics platform that processes 2.2 billion messages a day from mobile network hardware. During the talk we will show how we used Flink and Flink SQL to build this platform. The solution includes the calculation of more than 5,000 KPIs and 1,500 aggregations defined in SQL, across 750 Kafka topics. We will describe how we manage Flink jobs at scale using Ververica and Kubernetes, how we monitor the platform using ClickHouse, and what problems we had to overcome in the project.
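The kind of KPI the abstract mentions is typically a windowed aggregation. As a rough, platform-agnostic illustration (the real pipeline uses Flink SQL over Kafka topics; the event shape and KPI below are invented), here is what a tumbling-window success-rate KPI computes:

```python
# A stdlib sketch of the kind of windowed KPI a Flink SQL job would compute.
# Illustrative only: event fields (cell id, success flag) are hypothetical.
from collections import defaultdict

def success_rate_kpi(events, window_s=60):
    """events: (epoch_seconds, cell_id, success) tuples ->
    {(window_start, cell_id): success_rate} per tumbling window."""
    totals = defaultdict(lambda: [0, 0])  # (window, cell) -> [ok_count, total]
    for ts, cell, ok in events:
        key = (ts - ts % window_s, cell)  # align timestamp to window start
        totals[key][0] += int(ok)
        totals[key][1] += 1
    return {k: ok / n for k, (ok, n) in totals.items()}

events = [(0, "cell-A", True), (30, "cell-A", False), (70, "cell-A", True)]
print(success_rate_kpi(events))
# {(0, 'cell-A'): 0.5, (60, 'cell-A'): 1.0}
```

In Flink SQL the same logic would be a `GROUP BY` over a `TUMBLE` window, which is what lets thousands of such KPIs be declared in SQL rather than hand-written code.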
Developing and Operating a real-time data pipeline at Microsoft's scale - lessons from the last 7 years
Microsoft is a data-driven company. All client-side software is well instrumented and emits telemetry. Designing, developing and operating (in a DevOps model) a big data pipeline that gathers this data at Microsoft's scale (100k+ Azure cores, 13 data centers, hundreds of PBs) is a great learning opportunity. In this presentation I will show what we've learned over the last 7 years and describe the DevOps process we use. This will be a journey spanning our design principles, testing approach, ops mindset (monitoring, automation, continuous improvement), our rollout strategy across 13 data centers, and more.
Understanding Query Semantics at eBay
- User queries at an e-commerce site exhibit a plethora of information, ranging from brands and sizes to intent or even desires. A better understanding of users' intent leads to a better user experience.
- How can we exploit plain natural-language text to extract semantic and syntactic information? And how can such information help us improve site behaviour?
- We will talk about enhanced language processing and understanding techniques for semantifying queries using advanced sequence learning techniques.
- We will also discuss how to design an offline evaluation to quantify the model performance at a large scale.
Analytics Translator: The New Must-Have Role for Data-Driven Businesses
Auditing your data and answering the lifelong question: is it the end of the day yet?
In this talk I’m going to present the design process behind Nielsen’s data auditing system, Life Line - from tracking and producing, to analysing and storing auditing information, using technologies such as Kafka, Avro, Spark, AWS Lambda functions and complex SQL queries. The data auditing project was one of our main pillars in 2020; the extensive design process we went through paid off and tremendously raised the quality of our data. We’re going to cover:
* A lot of data arrival and integrity pain points
* Designing your metadata and the use of AVRO
* Producing and consuming auditing data
* Designing and optimizing your auditing table - what does this data look like anyway?
* Creating an alert based monitoring system and some other add-ons
* Answering the most important question of all - is it the end of the day yet?
Lessons Learned from Containerizing Data Infrastructure at Uber
Since data infrastructure was set up at Uber, we have been managing our own server fleet. Age-old practices of managing hosts posed several challenges that stood in the way of innovation.
We did a complete ground-up re-architecture of our deployment stack, embraced the DevOps model and automated away operational tasks. This effort gained us benefits across several areas (efficiency, security, etc.) and strategically positioned us to leverage the cloud.
In this talk, we'll briefly discuss the challenges we faced as part of our containerization journey, our strategies/solutions to overcome these challenges and mainly focus on lessons we learned along the way.
Dashboarding Nightmares: What most people forget to scope
Data Mesh in Practice - How to set up a data driven organization
The Data Mesh paradigm is a strong candidate to supersede the centralized data lake and data warehouse as the dominant architectural patterns in data and analytics. It promotes the concept of domain-focused Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgment of data ownership.
Through personal experience with applying the Data Mesh concept in practice, as well as dedicated field research, the presenter discovered the most common pain points at different stages of the journey and identified successful approaches to overcome those challenges. In this talk, you will gain both technical and organizational insights ranging from companies that are just starting to promote a mindset shift of working with data, to companies that are already in the process of transforming their data infrastructure landscape, to advanced companies that are working on federated governance setups for a sustainable data-driven future.
Moving Apple AI/ML Data Infrastructure into the Cloud
Feed your model with Feast Feature Store
What is a feature store? Why do we need it? How do we use it? In this session, I would like to show how to use the Feast feature store to build a complete MLOps process: starting with fetching historical data and model training, through the model versioning and deployment process, and finally online feature materialization and real-time model inference.
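The offline-to-online flow the abstract describes can be illustrated with a toy in-memory model. This is not Feast code (Feast exposes this flow through its `FeatureStore` API); the feature names and data below are invented to show the two access paths and the materialization step between them:

```python
# A toy, in-memory illustration of the feature-store flow: an offline store of
# historical values, point-in-time reads for training, and materialization of
# the latest values into an online store for inference. Hypothetical data.
OFFLINE = {  # (user, timestamp) -> feature values
    ("user_1", 1): {"avg_order_value": 40.0},
    ("user_1", 2): {"avg_order_value": 55.0},
}

def get_historical_features(entity_rows):
    """Point-in-time join: latest value at or before the requested timestamp,
    so training data never leaks features from the future."""
    out = []
    for user, ts in entity_rows:
        candidates = [(t, f) for (u, t), f in OFFLINE.items() if u == user and t <= ts]
        out.append(max(candidates)[1] if candidates else None)
    return out

def materialize(online_store):
    """Push the latest offline value per user into the online store."""
    for (user, ts), feats in sorted(OFFLINE.items()):
        online_store[user] = feats  # later timestamps overwrite earlier ones
    return online_store

online = materialize({})
print(get_historical_features([("user_1", 1)]))  # [{'avg_order_value': 40.0}]
print(online["user_1"])                          # {'avg_order_value': 55.0}
```

The key property to notice: training reads are as-of a timestamp, while serving reads always see the freshest materialized value.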
Building a backbone of a data-driven enterprise: Big Delta Lake
Becoming a data-driven organization is a hot topic that keeps a lot of companies busy. Many leaders and executives see value in unleashing their analytics potential and using it to drive business growth.
We are going to talk and share experiences about how we…
• democratized data across the different levels of the organization by consolidating, integrating and automating data workflows into a single Data Lake;
• designed and implemented a cloud-based, scalable and secure architecture;
• impacted business performance by unlocking various use cases;
• executed a project based on the consolidated data for one of the largest nutrition brands.
Analytical cubes in the service of data analysis
In the world of AI, ML and Big Data analysis, we have forgotten about our main clients - people who are not interested in querying databases or waiting for the result of data preparation - they want to easily play with data themselves.
In this presentation, I will explain to you:
- what analytical cubes are
- who will use them and how they can do it
- what the differences are between Apache Kylin and Microsoft Analysis Services, and how to prepare a cube in these environments
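The core idea behind analytical cubes is precomputing aggregates over every combination of dimensions, so end users can slice and dice without writing queries. The sketch below is a deliberately tiny stdlib illustration of that rollup (the dimensions and sales figures are invented; Kylin and Analysis Services do this at scale with far smarter storage):

```python
# Tiny illustration of what an analytical cube precomputes: one aggregate per
# combination of dimension values, with "*" standing for "all". Toy data only.
from itertools import combinations
from collections import defaultdict

ROWS = [  # (country, year, product, sales)
    ("PL", 2021, "A", 10), ("PL", 2021, "B", 5), ("DE", 2021, "A", 7),
]
DIMS = ("country", "year", "product")

def build_cube(rows):
    cube = defaultdict(int)
    for country, year, product, sales in rows:
        values = dict(zip(DIMS, (country, year, product)))
        # add this row's sales to every subset of dimensions it belongs to
        for r in range(len(DIMS) + 1):
            for group in combinations(DIMS, r):
                key = tuple(values[d] if d in group else "*" for d in DIMS)
                cube[key] += sales
    return cube

cube = build_cube(ROWS)
print(cube[("PL", "*", "*")])  # 15 : all Polish sales
print(cube[("*", "*", "*")])   # 22 : grand total
```

Because every rollup is already materialized, answering "Polish sales in 2021" is a dictionary lookup, which is exactly the interactivity the talk's "play with data" audience needs.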
Ingesting trillions of events per day with Apache Spark
Streaming trillions of messages per day (and meeting the agreed SLAs) is a challenging job. In this session I will present how we handle critical aspects such as reliability, performance, skewed data, debugging and performance testing. This session is targeted at software engineers passionate about performance and handling large amounts of data.
How to take advantage of AI benefits in the financial sector?
The use of Machine Learning (ML) in the financial sector may encounter a number of difficulties. ML, especially Deep Learning, requires a lot of computing power and a huge amount of memory. A mid-sized financial institution in Poland can take advantage of both cloud services and on-premises legacy architecture by building a hybrid solution using:
- cloud-managed infrastructure services, which can help reduce operational costs and provide ready-to-use infrastructure just in time
- cloud service offerings: Hadoop plus Spark processing, TensorFlow and Deep Learning
In my talk I will explain how to satisfy legal restrictions, security and PII requirements and build an ML architecture that provides business value within an acceptable project budget.
Privacy-preserving machine learning with TensorFlow and Google Cloud Platform
* The new privacy and information security challenges introduced by Machine Learning such as data breaches or privacy loss.
* The concepts and tools to address these issues in the TensorFlow framework
* Example architectures for training and serving privacy-preserving machine learning models.
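One building block behind privacy-preserving ML is differential privacy: calibrated noise added to anything derived from the data. The sketch below shows the Laplace mechanism on a simple aggregate (TensorFlow Privacy applies the same idea to gradients via DP-SGD); the function name and data are illustrative, not from any library:

```python
# A minimal sketch of the differential-privacy idea behind privacy-preserving
# training: release an aggregate with Laplace noise scaled to its sensitivity.
import random

def private_mean(values, epsilon, lo, hi):
    """Differentially private mean of bounded values (Laplace mechanism)."""
    clipped = [min(max(v, lo), hi) for v in values]
    # changing one record moves the mean by at most this much:
    sensitivity = (hi - lo) / len(clipped)
    # difference of two Exp(1) draws is a standard Laplace sample
    noise = random.expovariate(1) - random.expovariate(1)
    return sum(clipped) / len(clipped) + noise * sensitivity / epsilon

random.seed(0)
print(private_mean([1.0, 2.0, 3.0], epsilon=1.0, lo=0.0, hi=5.0))
```

Smaller epsilon means more noise and stronger privacy; clipping to [lo, hi] is what bounds the sensitivity, mirroring gradient clipping in DP-SGD.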
How to keep the Data Lake clean instead of ending up with the Data Swamp using Data Layers a.k.a Bronze / Silver / Gold
See what’s underground via Machine Learning eyes (powered by cloud solutions)
- How to go from building one deep-learning model per month to evaluating and deploying hundreds of them in a single week?
- Building an MLOps solution with CI/CD practices using the CDK
- How to detect underground structures from a bunch of radar signals and no labels?
- Should we avoid manual steps in an automatic Machine Learning pipeline?
- Can we use Lambda aliases to differentiate between dev and prod environments?
Scaling your data lake with Apache Iceberg
- Common issues with data lakes
- What is Apache Iceberg, and what problems does it solve?
- Building CDC archive at Shopify using Iceberg
- Management / considerations when using Iceberg
- A brief intro into what's next on deck for Shopify + Iceberg (Type-1 dimensions using Iceberg's V2 spec with row-level deletion)
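The "Type-1 dimension" pattern in the last bullet means the latest CDC record simply overwrites the existing row, with no history kept, which is what Iceberg's V2 row-level deletes make efficient. A plain-dict sketch of that merge logic (the record shape and ops are invented for illustration, not Shopify's schema):

```python
# Sketch of a Type-1 dimension update: apply a CDC batch where the latest
# record overwrites the current row. Plain dicts stand in for Iceberg tables.
def apply_cdc(dimension, cdc_batch):
    """cdc_batch: list of (op, key, row); 'I'/'U' upsert, 'D' delete."""
    for op, key, row in cdc_batch:
        if op == "D":
            dimension.pop(key, None)  # row-level delete
        else:
            dimension[key] = row      # Type-1: overwrite, no history kept
    return dimension

dim = {"sku-1": {"name": "widget", "price": 10}}
batch = [("U", "sku-1", {"name": "widget", "price": 12}),
         ("I", "sku-2", {"name": "gadget", "price": 7}),
         ("D", "sku-2", None)]
print(apply_cdc(dim, batch))  # {'sku-1': {'name': 'widget', 'price': 12}}
```

In a lakehouse table without row-level deletes, each of these operations would force rewriting whole data files; V2 delete files are what make this cheap.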
26.04.2022 - WORKSHOP DAY
9.00 - 16.00
3-4 PARALLEL WORKSHOPS (independent workshops, paid entry) | on-site, WARSAW
19.00 - 22.00
EVENING MEETING WITH SPEAKERS on-site, WARSAW
on-site (Warsaw, hotel ****) for about 200 participants and online for about 300 participants - live speakers from the conference room in the hotel, with parallel streaming online
27.04.2022 - 1ST CONFERENCE DAY | ONLINE
9.15 - 9.40
PLENARY SESSION
9.45 - 10.10
KEYNOTE PRESENTATION
10.15 - 11.10
INDEPENDENT PRESENTERS
11.40 - 13.20
PARALLEL SESSIONS (4 TRACKS)
13.20 - 14.15
LUNCH BREAK
14.15 - 15.55
CASE STUDY (3 TRACKS)
15.55 - 16.15
BREAK
PEER2PEER SHARING
16.15 - 17.15
ROUNDTABLES (ONSITE)
19.00 - 22.00
EVENING PARTY | on-site, WARSAW
28.04.2022 - 2ND CONFERENCE DAY | ONLINE
9.30 - 12.00
WORKSHOPS (4 TRACKS)
12.00 - 13.00
BREAK
13.00 - 13.10
OPENING
13.10 - 13.35
KEYNOTE SESSION
13.40 - 14.10
PARALLEL SESSIONS (4 TRACKS)
14.15 - 14.45
CASE STUDY (3 TRACKS)
PEER2PEER SHARING
14.45 - 15.40
ROUNDTABLES (ONLINE)
15.40 - 16.45
CASE STUDY (3 TRACKS)
16.50 - 17.20
KEYNOTE SESSION
17.20 - 17.30
SUMMARY & CLOSING