February 23, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 1
9:00 – 13:00
3 WORKSHOPS - DAY I
19.00 - 21.00
EVENING MEETING
(speaker’s presentation + discussion)
Pandemic, data and analytics – how might we know what happens next with Covid-19?
Special evening meeting prior to the BigData Technology Warsaw Summit.
There are smart people and great research teams working on forecasting models for pandemic developments. What data do they use, which models, how can the problem be approached, how accurate are the forecasts, and what are the major challenges here? How does the big data community contribute to fighting Covid-19? These are the questions we would like to address during this unique online meeting. We have invited very special guests (including experts from MOCOS and ICM UW) – everyone is encouraged to participate in the discussion and ask questions!
♦ What makes the field of pandemic modelling and simulation so interesting and challenging? How to predict risks using available data and proper modelling?
♦ How is it done – large scale geographical microsimulation model for pandemics?
♦ What can we do for better pandemic forecasting and for predicting the efficiency of various countermeasures to slow the pandemic down? Is AI/ML enough?
In the meeting agenda:
18.45 – 19.00
Networking online
19.00 – 19.05
Opening remarks
What makes the field of pandemic modelling and simulation so interesting and challenging.
Evention
19.05 – 19.25
How it is done – large scale geographical microsimulation model for pandemics.
ICM University of Warsaw
19.25 – 19.30
Short Q&A
19.30 – 19.45
What can we do for better pandemic forecasting and for predicting the efficiency of various countermeasures to slow it down? Is AI/ML any good for it?
MOCOS
19.45 – 20.00
Computational side of the algorithm used by MOCOS Group
MOCOS Group
20.00 – 20.10
Q&A
20.10 – 20.30
Collaborative forecasting of COVID-19: Assembling, comparing and combining short-term predictions.
Heidelberg Institute for Theoretical Studies (HITS), Karlsruhe Institute of Technology (KIT)
20.30 - 21.00
Open discussion for everybody
February 24, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 2
9:00 – 13:00
3 WORKSHOPS - DAY II
19.00 - 20.00
EVENING MEETING
(speaker’s presentation + discussion)
All about jobs in the BigData industry in the (post-)COVID-19 world.
Special evening meeting prior to BigData Technology Warsaw Summit.
Let’s talk about the current situation on the job market for BigData professionals. What is hot and what is not? Who are employers searching for now, and how do they do it? What are the expectations and requirements? Has the pandemic changed the jobs landscape much? What difference does the ‘remoteness’ of work make? What are the future trends in the way we work together?
During the meeting there will be a discussion with top managers from companies actively acquiring new talent on the BigData market as well as managing Big Data teams.
The meeting is organized in partnership with ING Tech Poland.
In the meeting agenda:
19.00 - 19.10
Welcome Address
ING Tech Poland
19.10 - 19.30
Data in the labour market – salaries and trends
Hays Poland
19.30 - 20.00
Panel discussion with representatives of BigData and AI enterprises recruiting technical people
ING Banking Technology Platform, ING
Disney Streaming Services
Allegro
Panel chair:
Evention
February 25, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 1
12.30 - 13.00
TIME FOR NETWORKING ONLINE
13.00 - 13.10
CONFERENCE OPENING
GetInData
Evention
PLENARY SESSION
13.10 - 13.35
5 big data trends that redefine the Edge-to-AI journey
During the session we will discuss the key trends redefining the way companies manage the data and analytics lifecycle. The presenters will explain:
♦ the importance of disaggregation of compute and storage,
♦ advancements in stateful processing in Kubernetes,
♦ the growing role of cloud and real-time processing for businesses in Poland.
Keywords: #DataArchitecture #Kubernetes #Streaming #MachineLearning #Cloud #BusinessAgility
13.35 - 13.50 Q&A Session

3Soft
Cloudera
13.35 - 14.00
High-Performance Data Analytics in a Hybrid and Multi-Cloud World
Many enterprises are re-thinking their data analytics strategy. Some plan to stay on-prem for GDPR reasons. Others are all in for full-cloud but want to stay agnostic. And still others require a hybrid approach: run certain workloads on-prem and move others to the cloud to capitalize on cloud economics. With object stores emerging as the main winners in the post-Hadoop era for cost-effective storage, enterprises are adopting them independently from the evolution of their EDW, Data Lakes, and Data Science platforms. Finally, there’s a convergence movement underway, causing enterprises to unify their data analytics platforms (EDW, Data Lakes and Data Science platforms) and support the broadest deployment models. Join us for this session to learn how Vertica can support your vision with a new era of data analytics in a hybrid and cloud-agnostic fashion, supporting a variety of object store technologies.
14.00 - 14.15 Q&A Session
Vertica
SIMULTANEOUS SESSIONS PART I
14.05 - 14.35
Architecture, Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
The Scalable Gaming Analytics Pipeline at Outfit7: The Next Generation
Have you ever wondered how gaming companies build their analytics pipelines? Particularly scalable ones that are able to collect terabytes of data every day? At Outfit7, this is done with a little help from Google Cloud's top services, including Kubernetes, Dataflow, BigQuery, and Cloud Composer. In this presentation, you'll see how the pipeline is built, starting from ingestion in Kubernetes, through to ending in Jupyter, Tableau, and other BI dashboards. You’ll also find out how the team fights downtime with proactive monitoring and integration tests. And last but not least, you’ll hear about the challenges that Outfit7 faced when the amount of data it had to handle skyrocketed during the peak of the COVID-19 quarantine.
Keywords: #googlecloud #bigquery #events #scalable
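For a sense of the BigQuery end of such a pipeline, here is a hedged sketch using the official Python client; the project, dataset, table and schema below are hypothetical, not Outfit7's actual setup:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# hypothetical events table
table_id = "my-project.analytics.game_events"
rows = [{"event": "session_start", "user_id": "u1", "ts": "2021-02-25T12:00:00Z"}]
errors = client.insert_rows_json(table_id, rows)  # streaming insert
assert not errors, errors

# a daily-active-users style query over the ingested events
query = """
    SELECT DATE(ts) AS day, COUNT(DISTINCT user_id) AS dau
    FROM `my-project.analytics.game_events`
    GROUP BY day ORDER BY day
"""
for row in client.query(query).result():
    print(row.day, row.dau)
```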
14.35 - 14.50 Q&A session

Outfit7
Data Quality with 100+ PB: Solved Challenge at Criteo
Data Quality is paramount – we all agree on that – and isn't straightforward even with small data sets. When working with over 120PB of data on Hadoop and thousands of jobs, I can tell you firsthand, it's a challenge! We started to tackle this at Criteo 2 years ago, and I have some tangible results I'll be happy to share. We'll go through this journey, from collecting data, detecting suspect behaviors, and alerting users on data quality incidents, to integrating these new checks into the Criteo Data Platform.
Keywords: #dataquality #dataplatform #metrics #hive
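As a hedged illustration of the kind of checks described (not Criteo's actual platform code), a naive volume-and-null-ratio guard might look like this in PySpark; the paths and thresholds are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# hypothetical partitioned dataset
today = spark.read.parquet("hdfs:///events/day=2021-02-25")
yesterday = spark.read.parquet("hdfs:///events/day=2021-02-24")

total = today.count()
null_ratio = today.filter(F.col("user_id").isNull()).count() / max(total, 1)

# naive checks: a sudden volume drop or a null spike flags an incident
if total < 0.5 * yesterday.count() or null_ratio > 0.01:
    raise RuntimeError(
        f"data quality alert: rows={total}, null_ratio={null_ratio:.4f}"
    )
```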
14.35 - 14.50 Q&A session

Criteo
Doctolib
MLOps journey in H&M
In this session you will learn how H&M evolves its reference architecture covering the entire MLOps stack, addressing common challenges in AI and machine learning products such as development efficiency, end-to-end traceability, and speed to production. This architecture has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment, and engineers to manage large-scale model training and model serving pipelines with full traceability.
The presenting team is currently responsible for ensuring that best practices and the reference architecture are implemented across all product teams to accelerate H&M Group's data-driven business decision making.
Keywords: #MLOps #AIAtScale #MachineLearning #Engineering #DataScience
14.35 - 14.50 Q&A session

H&M
Building recommender systems: from algorithms to production
Machine learning-powered systems have become an essential part of most businesses. One such example is recommender systems, which adapt to customer behavior to provide an organic way to make domains like clothes, books, or music explorable. In order to successfully put such systems into production, we need to bridge the gap from the raw mathematical models and algorithms to robust and scalable software systems. In this talk, we start out with core approaches to recommender systems like collaborative filtering or click probability prediction, and follow this journey to explore how theory and practice come together.
Keywords: #machinelearning, #recommendations, #production, #architecture
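To make the bridge from algorithm to code concrete, here is a toy collaborative-filtering sketch: plain matrix factorization trained with SGD in numpy. It illustrates the core idea only; a production system would use dedicated libraries and serving infrastructure:

```python
import numpy as np

# toy user-item rating matrix; 0 means "not observed"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

k, lr, reg = 2, 0.01, 0.05
rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

for _ in range(2000):  # SGD over the observed entries only
    for u, i in zip(*R.nonzero()):
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

print(np.round(P @ Q.T, 1))  # predicted ratings, including unobserved cells
```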
14.35 - 14.50 Q&A session

14.40 - 15.10
Architecture, Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
Welcome to the MLOps candy shop and choose your flavour!
Operationalizing machine learning (feature delivery, model training, deployment and serving) is nowadays one of the most challenging areas in fast-growing data-driven companies. The variety of open source components (Kubeflow, MLflow, Kedro to name a few) and the set of specialized managed services provided by every major cloud provider drive solution architects nuts.
At GetInData we have a solution for it – we call it the GetInData MLOps Platform: a set of reusable components, following the Unix toolset pattern ("do one thing and be best at it") and portable to any environment. Also – thanks to loose coupling – it is adjustable to clients' current and future ML-related challenges, like a candy shop where the first person needs super-fast online predictions, the second requires robust hyperparameter tuning for the best possible models, and the third aims for scalable collaboration on feature extraction across many data science teams.
During the presentation we will show you two components we're really excited about – the Kedro-Kubeflow integration and a Feast-based feature store – how we implement them and what clients use them for. Welcome to our MLOps candy shop that no pandemic can close 😉
Keywords: #MachineLearning #MLops #FeatureStore #Kubeflow #Kedro #OpenSource #Feast
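For flavour, a minimal sketch of what serving features from a Feast-backed feature store can look like; the feature names and entity are hypothetical, and the exact calls depend on the Feast version:

```python
from feast import FeatureStore

# assumes a Feast feature repository in the current directory
store = FeatureStore(repo_path=".")

# fetch online features for a single entity (names are illustrative)
features = store.get_online_features(
    features=["user_stats:clicks_7d", "user_stats:orders_30d"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
print(features)
```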
15.10 - 15.25 Q&A session

GetInData
GetInData
Popmon - population shift monitoring made easy
Tracking model performance is crucial to guarantee that a model running in production behaves as designed initially. Changes in the incoming data can affect the performance and make predictions unreliable. Given that input data often change over time, it is important to track changes in both input distributions and delivered predictions periodically, and to act on them when they differ significantly – for example, to diagnose and retrain an incorrect model in production. To make monitoring both more consistent and semi-automatic, at ING WBAA we have developed a generic Python package called popmon to monitor the stability of data populations over time, using techniques from statistical process control: https://github.com/ing-bank/popmon. popmon works with both pandas and Spark datasets. It creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using static or dynamic monitoring business rules.
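A minimal sketch of how popmon is typically invoked on a pandas DataFrame; the CSV path and the "date" column are hypothetical, and the exact options are documented in the project README:

```python
import pandas as pd
import popmon  # registers the .pm_stability_report accessor on DataFrames

# hypothetical dataset with a timestamp column named "date"
df = pd.read_csv("flight_delays.csv", parse_dates=["date"])

# build histograms per time slice and run the stability comparisons
report = df.pm_stability_report(time_axis="date", time_width="1w")
report.to_file("stability_report.html")  # self-contained HTML report
```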
15.10 - 15.25 Q&A session
ING WBAA
ING WBAA
ModelOps – Operationalizing Modern Analytics & AI
Modern software development requires a comprehensive approach with well-defined processes of code development, testing, and deployment. That is where the DevOps methodology comes in. It helps software developers seamlessly move their newly implemented features from development to production. But what about other areas of information technology like advanced analytics, artificial intelligence, and machine learning? That’s where ModelOps comes into play – an approach that takes what’s best in DevOps and expands it to apply to the analytical world. During the presentation, we will show how continuous integration and deployment, with the help of specialized tools designed for analytical purposes, can be leveraged to implement such an approach. This strategy can dramatically reduce the time-to-value for analytical assets developed within the organization, ensuring all those assets are methodically managed and safely updated with the most advanced, well-tested analytical models.
Keywords: #AI #Analytics #ML #MachineLearning #DeepLearning #ModelOps #DevOps #XOps #MLOps
15.10 - 15.25 Q&A session

SAS Institute
SAS Institute
Thrive in the Data Age: how Siemens and BMW Group leverage machine learning for cybersecurity, operations and business use cases using Splunk
Machine learning is essential to solving use cases in cybersecurity, operations and various lines of business. This talk provides you with an overview of Splunk’s big data and machine learning technologies that are used to solve real-world use cases. We dive into the technical details of two selected customer examples of applied machine learning. First, we explain how a datacenter division of Siemens uses Splunk with unsupervised and supervised machine learning approaches in cybersecurity to uncover anomalies and to automate the classification of security events. We highlight the technical details of how this use case has been addressed with Splunk’s Machine Learning Toolkit. Second, we explain how BMW Group’s Innovation Lab developed a Predictive Testing Strategy with Splunk and a deep learning approach. Details are provided on how the Deep Learning Toolkit App for Splunk was used to build and evaluate a TensorFlow-based model to solve that use case. We conclude the session with an outlook and a wrap-up of all available technical resources.
Keywords: #datascience #ai #machinelearning #deeplearning #cybersecurity #operations #analytics
15.10 - 15.25 Q&A session

Splunk
15.10 - 15.15
TECHNICAL BREAK
ROUNDTABLE SESSIONS PART I
15.15 - 16.05
Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to their group. Secondly, participants can meet and talk with the leaders/hosts of the roundtable discussions – selected professionals with vast knowledge and experience.
You can choose among such roundtable subjects:
1. Managing a Big Data project – how to make it all work well together?
Kambi
2. Big Data on Kubernetes
Kubernetes was created for ‘stateless’ apps, not ‘stateful’ ones – so why should we consider it for BigData? Stateful apps need persistent volume support, but many databases do not support it yet – so how can companies overcome this challenge? Is Spark + HDP the only reasonable solution for data transformation on K8s? What about other solutions – does it make sense to consider any others? Let’s refer to Telco 5G requirements – everything must run on K8s, with Hadoop being replaced by object storage solutions that can be orchestrated by K8s to simplify the overall architecture. Finally, what about K8s on-prem vs. in the cloud – which direction is the way to go?
Vertica
3. Data discovery – building trust around your data
Building trust in data falls under one of the four main pillars of a good data setup – Data Governance. In this roundtable we will do a quick overview of the four pillars and how to go about building a trustworthy setup. Topics will cover data lineage, accuracy and completeness vs cost, the available toolkit, tactics for recording 'truth' vs business interpretation, and how to build a setup that improves over time rather than degrades.
Tesco Bank
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing are nowadays more and more popular and important. There are a lot of use cases: data capture, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning such as customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let’s talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. From on-premise to the cloud: an end to end cloud migration journey
GetInData
6. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
Warsaw University of Technology
7. Transactional Data Lakes with Apache Spark (and Delta Lake, Apache Hudi and Apache Iceberg)
There is a trend in the big data management space to add features we all know from relational databases, most notably ACID transactions and versioning. That's the main focus of the open source projects Delta Lake, Apache Hudi and Apache Iceberg. They are storage layers on Hadoop DFS-like file systems and object stores that, together with Apache Spark's capabilities, allow building "reliable data lakes at scale". You're invited to discuss the pros and cons of each and how to use them effectively in your big data projects. All are equally welcome regardless of their experience and expertise. Let's share what we've already learnt and further deepen our understanding by learning from others. (A minimal Delta Lake sketch appears after this list.)
8. Operationalizing Analytics – sharing experience and best practices
The promise and potential business value of analytics is endless, which is why companies have spent the last decade investing in the right people, data, processes, and enabling technology. Yet studies show that less than 50% of the best models get deployed, 90% of models take more than three months to deploy, and 44% of models take over seven months to be put into production.
SAS
SAS
9. Monitoring performance of ML models
Monitoring an ML model running online on production data can be a challenge. Let's discuss the biggest difficulties and how to manage them, and what kinds of tools you use to detect problems with the input data and the results.
ING WBAA
10. We've got a model! What are the next challenges of deploying it at scale?
Training a good ML model is only the beginning of the journey. The next question is: how to integrate it with production systems robustly and effectively? Let's discuss your experience with ML model deployment challenges like continuous model training, training-serving skew, data drift, and model serving infrastructure.
OpenX
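Following up on roundtable 7, here is a minimal Delta Lake sketch with PySpark showing an ACID write and time travel. It assumes the delta-core package is on the classpath, and the path is hypothetical:

```python
from pyspark.sql import SparkSession

# requires e.g. --packages io.delta:delta-core_2.12:<version>
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events"  # hypothetical location
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)  # version 1

# time travel: read the first version back
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```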
16.05 - 16.10
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART II
16.15 - 16.45
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Casting the Spell: Druid in Practice
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use-cases, including in-flight analytics, reporting and building target audiences. The common challenge across these use-cases is counting distinct elements in real-time at scale. We’ve been using Druid to solve these problems for the past 5 years, and have gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
Keywords: #BigData #ApacheDruid #RealtimeAnalytics #DataArchitecture #DataEngineering
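For instance, the distinct-count use case maps naturally onto Druid SQL's DataSketches functions. A hedged sketch of querying a Druid router over HTTP follows; the host, datasource and column names are hypothetical, and it assumes the DataSketches extension is loaded:

```python
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql/"  # router endpoint, hypothetical host

query = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour,
       APPROX_COUNT_DISTINCT_DS_HLL(user_id) AS unique_users
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
"""

resp = requests.post(DRUID_SQL, json={"query": query})
resp.raise_for_status()
for row in resp.json():
    print(row)
```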
16.45 - 17.00 Q&A Session

Nielsen Identity
Imply
BigFlow – A Python framework for data processing on the Google Cloud Platform
You will learn about a tool that can improve your big data projects on GCP. Unified structure, configuration, versioning, build, deployment, and more, available for Dataflow/Dataproc/BigQuery.
Keywords: #gcp #python #dataflow #dataproc #bigquery
16.45 - 17.00 Q&A Session

Allegro
Training and deploying machine learning models with Google Cloud Platform
In my presentation I would like to present some approaches, good practices and Google Cloud components that we use in Sotrender to effectively train and deploy our machine learning models, which are used to analyze Social Media data. I will discuss which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Keywords: #mlops #gcp #python #nlp #computervision
16.45 - 17.00 Q&A Session

Sotrender
How to optimize the time needed to find and understand data as part of a BigData project.
Many BigData projects focus on implementing technological solutions, forgetting their purpose, i.e. the needs or applications they are to serve.
Investments (BigData projects have quite high budgets) are often made mainly in the IT area, ignoring individual areas of business activity, and do not generate much profit from a business point of view, creating a high risk of failure for the entire project. During the presentation, we will talk about the factors that pose a threat to BigData projects. Together, we will consider what analysts need, who BigData's "client" is, and what they expect. How do you achieve effective cooperation between the analyst and the "client" by implementing data management – but most of all, by implementing an interface that bridges the accumulated knowledge about the data and the data recipient? We will show how much AI can support the use of Big Data's potential. Can you get information about your data over your morning coffee? Yes: with the Clarite AI Data Assistant, all you need to do is ask a question about the data in natural language. With it, you will easily enter the era of human-data communication and big data.
Keywords: #dataplatform #datamanagement #businessdata #AI #DataGovernance #WatsonKnowledgeCatalog #KnowYourData #AIClariteAssistant
16.45 - 17.00 Q&A Session

Clarite Polska
Clarite Polska
16.50 - 17.20
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Data lineage and observability with Marquez and OpenLineage
Data is increasingly becoming core to many products, whether to improve recommendations for users, get insights into how they use the product, or use machine learning to improve the experience. This creates a critical need for understanding how data flows through our systems. Data pipelines must be auditable, reliable and run on time. Tracking lineage and metadata is the underlying foundation that enables many data-related use cases. It provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It enables governance and compliance and generally helps you keep your data running. Marquez is an open source project, part of LF AI, which instruments data pipelines to collect lineage and metadata and enable those use cases. It provides context by making dependencies across organisations and technologies visible, and enables lineage governance and discovery.
Keywords: #lineage #observability #dataops
17.20 - 17.35 Q&A session

Datakin
Top 5 Spark anti-patterns that will bite you at scale!
This session looks at real-world systems that use Java big data technologies such as Spark, Hadoop, Cassandra and Kafka, and examines the comedic and sometimes disastrous effects when the code is executed. Session attendees will walk away with an enhanced understanding of how to work with and effectively use these technologies.
Keywords: #java #bigdata #spark #hadoop
17.20 - 17.35 Q&A session

Causal Mediation Analysis in the E-Commerce Industry
Causal mediation analysis is a formal statistical framework for revealing the underlying causal mechanism in randomized controlled experiments. The causal mechanism refers to a process in which the treatment affects the outcome through intermediate variables called mediators. Causal mediation analysis has been widely employed in various disciplines. However, it has not been applied to online A/B tests, the large-scale online randomized controlled experiments in the daily practice of the internet industry – perhaps because online A/B tests in the internet industry are primarily for evaluation: estimating and testing the average treatment effect. In this talk, we will discuss two of our recent works on the development of causal mediation analysis for producing insights for search and recommendation systems in the e-commerce industry.
Keywords: #causalinference #A/Btests #informationretrieval #searchmetrics #evaluation
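As a textbook illustration of the idea (not the speakers' method): in a linear model, the total effect decomposes into a direct effect plus an indirect effect through the mediator, estimated by the product of coefficients. A simulated sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
T = rng.integers(0, 2, n).astype(float)      # randomized treatment
M = 0.8 * T + rng.normal(size=n)             # mediator, e.g. engagement
Y = 0.5 * M + 0.3 * T + rng.normal(size=n)   # outcome, e.g. purchases

a = LinearRegression().fit(T[:, None], M).coef_[0]   # T -> M
fit = LinearRegression().fit(np.c_[T, M], Y)
direct, b = fit.coef_[0], fit.coef_[1]               # direct T -> Y, M -> Y

print(f"indirect a*b ~ {a * b:.3f}, direct ~ {direct:.3f}, "
      f"total ~ {a * b + direct:.3f}")  # recovers ~0.4, ~0.3, ~0.7
```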
17.20 - 17.35 Q&A session

Causal Inference and Experimentation, Udemy
Foundations of Data Teams
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure. This talk will cover the importance of a solid foundation and what management should do to fix it.
Keywords: #management #datateams #dataengineers #datascientists #operations
17.20 - 17.35 Q&A session

Big Data Institute
PLENARY SESSION
17.25 - 17.55
The Journey to Data Cloud
Snowflake is the leading data platform for the cloud era. I will present its features as a modern data warehouse, uniquely exploiting the cloud capabilities to meet growing users' needs. Then I will discuss how it became the foundation of the Data Cloud, a revolutionary solution that opens the world's data to all organizations.
Keywords: #cloud #SQL #analytics, #scalability #datasharing #datawarehouse
17.55 - 18.10 Q&A session

Snowflake
17.55 - 18.30
SUMMARY OF THE DAY, PRIZE GIVEAWAY AND A SURPRISE*!
GetInData
Evention
+ *🎵 Live DJ performance especially for the participants of the meeting - DJ Michał Stochalski
He sets trends, creates new DJ sets and constantly improves his music skills, according to his motto "Excellence is earned throughout life". As a Video DJ he presents a combination of image and sound, mixing live music with corresponding video clips.
February 26, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 2
09.00 - 09.05
OPENING OF THE SECOND DAY
PLENARY SESSION
09.05 - 09.35
Fast growth iteration via A/B testing
Online A/B testing gives people a powerful tool to quickly examine whether their product hypotheses are true. It is no secret that most tech giants, such as Google and Atlassian, grow their products by leveraging this tool in fast iterations. In this talk, we will first review the product growth process used at Google and Atlassian, and then zoom in to give some tips on the common problems encountered when conducting A/B experiments. Lastly, we will spend some time discussing with the participants their experiences and learnings from conducting A/B experiments.
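As a small worked example of the evaluation step (not Atlassian's tooling), a two-proportion z-test on made-up conversion counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([420, 481])     # control, treatment (made-up numbers)
visitors = np.array([10_000, 10_000])

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p = {p_value:.3f}")  # reject H0 at alpha=0.05 if p < 0.05
```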
Atlassian
SIMULTANEOUS SESSIONS PART III
09.40 - 10.10
Architecture, Operations & Cloud
AI, ML and Data Science
AI, ML and Data Science II
Data Engineering
Management of a cloud Data Lake in practice: How to manage 1000s of ETLs using Apache Spark
Nowadays the problem of processing speed is seemingly solved. Unless you process tens of petabytes, an off-the-shelf toolset will suffice for most problems. Currently the main challenges in data lake systems are in the field of data governance:
• how do you make sure data is discoverable, reusable, up to date and of high quality?
• how to avoid huge technical debt when developing massive number of complex data flows?
• how to guarantee that the project can scale despite having access to very scarce human resources and technical talent?
The goal of this talk is to showcase how to design a data lake management system that is scalable in the broadest meaning of the word: one that not only scales with the growth of the data, but also with the growth of the complexity of the whole enterprise. The talk will outline the business reasoning, key design principles and the technical solution. Expect some (but not too many) nerdy details related to the Apache Spark implementation 😊
Keywords: #DataGovernance #DataLake #DataQuality #Cloud #ApacheSpark #Azure #DataBricks
10.10 - 10.25 Q&A Session

DXC Luxoft
Building an analytics platform from scratch while developing production solutions on top of it.
Story of a year-long journey with an Asian telecom of creating a positive feedback loop between building a data analytics platform and moving analytics into it.
Like in every great story, you can expect:
introduction - a reasonably well-defined scope and set of main characters,
turning points - how life took us on a journey of changed ownership, discovering new needs, learning by teaching, and bold goals achieved through progress over perfection,
conclusion - how it all came together to change the way the analytics team works and productizes its models.
This story includes real-world analytics use-cases such as: cost of network incidents, ARPU (Average Revenue Per User), NBO (Next Best Offer), cost of service, RFM (Recency, Frequency, Monetary Value), churn and a few more.
Everything seasoned with open-source (Spark, Presto, Nbdev) and latest ML Ops technologies (KubeFlow, Feast).
Keywords: #ModelProductization #MachineLearning #FeatureStore #KubeFlow #DataScience #OpenSource #OnPremise
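As one concrete example from the use-case list above, RFM features are straightforward to derive. A minimal pandas sketch on made-up orders data (not the client's actual pipeline):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_id": [10, 11, 12, 13, 14, 15],
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-02-01", "2020-12-20",
         "2021-01-15", "2021-02-10", "2021-02-20"]),
    "amount": [50.0, 30.0, 200.0, 20.0, 25.0, 40.0],
})

now = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since last order
    frequency=("order_id", "nunique"),                       # number of orders
    monetary=("amount", "sum"),                              # total spend
)
print(rfm)
```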
10.10 - 10.25 Q&A Session

GetInData
GetInData
When HR meets Artificial Intelligence.
Digital Transformation takes a lot of our processes to new levels. Especially since the eruption of the pandemic, we need to look for new solutions to old problems. For example, if we open up to remote IT specialists, we can easily get 10x-100x more candidates while keeping the same recruitment procedures.
During my presentation, I would like to share ideas of how AI-driven tools are and can be used in HR and Talent Management:
- Current possibilities of what is feasible (including some demos),
- Where we can take these tools in the foreseeable future,
- What the current technical challenges are,
- Impact challenges, especially from an ethical standpoint.
Keywords: #artificialintelligence #nlp #digitaltransformation #futureofwork #hcm
10.10 - 10.25 Q&A Session

Talent Alpha
AppDynamics platform with massively scalable big data infrastructure components to handle large numbers of events, metrics, and metadata.
The session will focus on:
• AppDynamics platform with massively scalable big data infrastructure components to handle large numbers of events, metrics, and metadata.
• Machine learning baselining for APM metrics and advanced business analytics.
• Anomaly Detection and Automated Transaction Diagnostics
• Intuitive, powerful drill-down data visualizations
Keywords: #APM #EUM #IoT #BiQ #Analytics #ML #AI #BizDevOps #Kafka #AWS #AIOps #AD #ATD
10.10 - 10.25 Q&A Session

Cisco
10.15 - 10.45
Architecture, Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science
Data Engineering
Expanding your data & analysis ecosystem with public cloud
Going to the cloud sounds fantastic; giving people new opportunities – priceless. The public cloud gives you a lot of ready-made systems that you want to use. But when you already have a working environment based on your own data center, a lot of data, and plenty of people using these tools, your work will bring a lot of fascinating challenges: transferring data, moving from one tool to another, selecting tools without consternation, cost optimization, policies, security, and many others.
Keywords: #hadoop #spark #airflow #gcp #bigquery #composer #dataproc #data analysis
10.45 - 11.00 Q&A Session

Allegro
Streaming SQL - Be Like Water My Friend
Data has to be processed fast, so that a firm can react to changing business conditions in real time. Streaming SQL gives us the possibility to make stream processing available to a broader audience, and it also makes it easier to access data streams. This presentation will not only give you a brief overview of the data and streaming architecture at InnoGames, but also introduce you to the idea of Streaming SQL in general and how it is implemented in Apache Flink. Furthermore, it shows actual examples of how to use Flink SQL, so that you are hopefully inspired to consider this rather new technology to tackle your data challenges.
Keywords: #streaming #streamingsql #flink #dataflow #flinksql
10.45 - 11.00 Q&A Session

InnoGames GmbH
Make it personal: reinforcement learning for mere mortals
During this session we will reflect upon the importance of personalization in e-commerce. What challenges occurred as a result of bridging the gap between Google's AlphaGo and the real world? Furthermore, we will discuss Vowpal Wabbit: the Swiss army knife of ML algorithms.
To sum up, we will introduce a case study exercise, during which participants will create a personalized user experience on a webpage.
Keywords: #personalization #ecommerce #vowpalwabbit #reinforcementlearning #opensource
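A hedged taste of Vowpal Wabbit's contextual-bandit mode; the features are invented, and the exact Python API varies between VW releases:

```python
import vowpalwabbit  # pip install vowpalwabbit

# epsilon-greedy contextual bandit over 2 actions
vw = vowpalwabbit.Workspace("--cb_explore 2 --epsilon 0.2 --quiet")

# VW text format: action:cost:probability | context features
vw.learn("1:0.0:0.5 | segment=new device=mobile")  # action 1 was good (cost 0)
vw.learn("2:1.0:0.5 | segment=new device=mobile")  # action 2 was bad (cost 1)

probs = vw.predict("| segment=new device=mobile")  # probability mass over actions
print(probs)
```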
10.45 - 11.00 Q&A Session

eBay Classifieds Group

OLX Group
10.45 - 10.50
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART IV
10.55 - 11.25
Architecture, Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
Presto: SQL-on-Anything & Anywhere
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale at organizations including Airbnb, Comcast, Facebook, FINRA, GrubHub, LinkedIn, Lyft, Netflix, Twitter, Uber, and Zalando, Presto has experienced unprecedented growth in popularity in both on-premise and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in object storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines. In particular, Starburst developed a native integration for Presto that leverages Delta-specific performance optimizations.
Join this session and hear how Starburst Presto deployed on Azure Kubernetes Service (AKS) serves as a fast SQL query engine over data in ADLS, and enables query-time correlations between IoT data in Delta Lake, customer data in SQL Server, and web log data in Elasticsearch.
You will also gain best practices and real-life insights and lessons learned from production deployments of this integration.
Keywords: #presto #sql #analyticsanywhere #azure
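Connecting to a Presto/Trino-style cluster from Python is a few lines with the DB-API client; the hostname, catalog and table below are hypothetical:

```python
import trino  # pip install trino; the legacy presto-python-client works similarly

conn = trino.dbapi.connect(
    host="presto.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)
cur = conn.cursor()
cur.execute("SELECT status, count(*) AS hits FROM logs GROUP BY status")
for status, hits in cur.fetchall():
    print(status, hits)
```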
11.25 - 11.40 Q&A session

Starburst
Starburst
Complex event-driven applications with Kafka Streams
At Simply Business we have built a rich stateful application using Kafka Streams to manage leads that are served by our 300-person-strong UK call centre. This application combines many different data points from different services and has a few internal data stores to operate. Our initial design of the application became quite complex, so we had to make changes to ensure scalability and reliability. In the talk I'll present the techniques we used to simplify it.
But that's not all. I'll also talk about other key components like the schema registry, which helped with the robustness of the solution, and then lead scoring, which helped to increase the return on call spend by over 50%.
Keywords: #data-streaming #kafkastreams #schemaregistry #domainevents #evolutionaryarchitecture
11.25 - 11.40 Q&A session

Simply Business
How to build a state-of-the-art weather forecasting AI service
Weather forecasting is important in many fields, and minor improvements in accuracy can have a considerable business impact. Today, weather forecasting is performed using computationally expensive mathematical models based on the Navier-Stokes and mass continuity equations, the first law of thermodynamics, and the ideal gas law. These models simulate the physical world and are making use of expensive supercomputers. Alternatively, AI can be used to learn from data and produce forecasts in a fraction of the time the physical simulations require, and in many cases at a higher degree of accuracy. In this presentation, I will show how an AI was used to produce competitive forecasts, using state-of-the-art AI models and neural architecture search, and how I used React to prototype a weather forecasting service.
11.25 - 11.40 Q&A session

Peltarion
Big Data Instruments and Partnerships - Microsoft ecosystem update
This session will be focused on the strategic side of our big data investments, from two major angles – instruments that we have – and partnerships – how we are impacting and enriching the external ecosystem. It will be a session in-between tech and business.
11.25 - 11.40 Q&A session

Microsoft Russia, Microsoft
11.30 - 12.00
Architecture, Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
AWS Spot instances price prediction - towards cost optimization for Big Data
Analytical data processing has become the cornerstone of today's business success, and it is facilitated by Big Data platforms that offer virtually limitless scalability. However, minimizing the total cost of ownership (TCO) for the infrastructure can be challenging. By analyzing spot instance price history using ARIMA models, it is feasible to leverage the discounted prices of the cloud spot market with a limited risk of analytical job termination. In particular, we evaluated savings opportunities when using Amazon EC2 spot instances compared to on-demand resources. During the presentation we show the evaluation of univariate spot price regression models to forecast future prices, and we confirm the feasibility of short-term spot price prediction with real data from AWS. This confirms cost savings opportunities of up to 80% compared to on-demand, within 1% of the absolute minimum.
Keywords: #TCO #CloudComputing #ARIMA #AWS #Spot
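A minimal sketch of the kind of univariate ARIMA forecast described, fitted on synthetic data standing in for a spot-price history; the order and series are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic hourly "spot price" series standing in for real AWS data
rng = np.random.default_rng(7)
idx = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
prices = pd.Series(
    0.10 + 0.01 * np.sin(np.arange(len(idx)) / 24)  # daily seasonality
    + rng.normal(0, 0.002, len(idx)),               # noise
    index=idx,
)

model = ARIMA(prices, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=24)  # next 24 hours
print(forecast.head())
```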
12.00 - 12.15 Q&A Session

Nowa Era
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Evolving Bolt from batch jobs to real-time stream processing - migration, lessons learned, value unleashed
We would like to invite you to discuss how Bolt migrated from batch and synchronous processing to real-time and asynchronous. During our session we will review and evaluate the obstacles we faced along the way and the lessons we learned. We will also focus on the unleashed value of real-time data.
Keywords: #kafka #streaming #data #realtime
12.00 - 12.15 Q&A Session

Bolt
How do NoMagic robots improve thanks to a Software 2.0 improvement cycle supported by an in-house data engine?
A typical Software 2.0 improvement cycle at NoMagic:
* analyse when a data driven algorithm performs poorly
* gather and label data that will improve the algorithm or propose a modified algorithm
* train the model
* test the model
* deploy to production
Making this cycle as frictionless as possible was the focus of the NoMagic ML team in 2020.
I will share with you what we achieved and how it changed the way we work at NoMagic.
12.00 - 12.15 Q&A Session
NoMagic
How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects
During this session, we aim to review the technical and organisational challenges we faced while building a complex AI-based app with a short time-to-market. How was it additionally influenced by the dispersion of the teams involved around the world (10 time zones)? What unpredictable events affected our plans, e.g. the COVID pandemic or the development vendor changing in the middle of the project? We will evaluate examples of failures and successes, and the lessons we learned in the process. We invite you to a broader discussion.
Keywords: #datascience #ai #machinelearning #agile #projectmanagement
12.00 - 12.15 Q&A Session

Pearson
Pearson
12.00 - 12.05
BREAK
ROUNDTABLE SESSIONS PART II
12.05 - 12.55
Parallel roundtable discussions are the part of the conference that engages all participants. They serve several purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to their group. Secondly, participants can meet and talk with the leaders/hosts of the roundtable discussions – selected professionals with vast knowledge and experience.
There will be 2 rounds of discussion, hence every conference participant can take part in 2 discussions.
You can choose among such roundtable subjects:
1. Building a world-class Big Data team during the COVID-19 pandemic - recruiting, training, collaborating.
Last year forced us to change a lot in how we work. A lot of us had to switch to working/studying from home, some needed to freeze hiring, others - to redefine onboarding. As hard as last year was, it was also a time of innovation. Join the session to exchange lessons learned and ideas for building a world-class Big Data team leveraging “the new normal”. Everybody is welcome - the more diverse experiences the better.
Zendesk
2. Big Data on Kubernetes
GetInData
3. Best tools for alerting and monitoring of the data platforms
Have you ever been woken up in the middle of the night by a screaming PagerDuty alert on your mobile, 99+ notifications on the {YOUR_PIPELINE_NAME}_alerts Slack channel and tens of graphs in Grafana looking like an undreamt work of van Gogh? If yes, welcome – me too. For an engineer working on a Data Platform it is easy to create a new pipeline, a new dataset or any new integration, especially now in the cloud era. But it is not easy to have a proper monitoring and alerting system ensuring that any potential issues/incidents are solved as quickly as possible, so that the offering of our Data Platform is always top quality. In this session we will discuss tools for building a monitoring and alerting system that is efficient, easy to understand, supervises exactly what we want, notifies the people we want, is not too noisy, and scales well with ever-growing data.
Bolt
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing are nowadays more and more popular and important. There are a lot of use cases: data capture, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning such as customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let’s talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. Using the public cloud effectively and cost-efficiently
According to Unisys's Cloud Barometer study, only a third of organizations have seen great improvements to their organizational effectiveness as a result of Cloud adoption. What are good practices to be part of those organizations? Let's discuss how to use the public Cloud effectively and cost-efficiently.
TCL Research Europe
6. Building AI/ML systems: from algorithms to production
We're facing very different challenges when writing a scientific paper and when building a production ML system. Things get even more complex when a single project involves both research and application. It's generally understood yet often overlooked: let's get talking! How to scope an ML project? How to get the data yet avoid biases and those multi-million-euro GDPR penalties? What models work in real-world scenarios? How to handle model deployment? And who do you need on your team to succeed?
7. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
Warsaw University of Technology
9. Data Strategy. The Game.
The format of this roundtable discussion is a game in which you, as Chief Data Officer, have a mission to implement strategic initiatives for a $2.7B electronics manufacturer (please watch the short video in the Tech Zone for more details). You will have a chance to learn how to maximize business value from data, how to design and execute a Data Strategy, which strategy approach is best, and how your decisions influence others within the organization.
SoftServe
10. Distributed Big Data processing in the cloud – is Hadoop still an option?
Joint Cloudera & 3Soft roundtable to discuss practical experience and highlights of providing self-service access to integrated, secured, multi-function analytics based on Hadoop, cloud-native offerings or custom-tailored solutions. Let us share our knowledge on how to enjoy consistent data security, governance, lineage, and control, while deploying the powerful, easy-to-use solutions business users require and eliminating their need for shadow IT solutions.
3Soft
Cloudera
11. Snowflake Data Cloud – possibilities and limitations. How can I judge whether this is a value proposition for me and my organization?
Snowflake
Snowflake
12.55 - 13.00
TECHNICAL BREAK
PLENARY SESSION
13.00 - 13.30
Lessons from building large-scale, real-time machine learning systems
Unity Ads helps publishers and advertisers reach their business goals, and machine learning is at the core of our product. In this presentation, I will first give an overview of the machine learning systems we built for real-time ads bidding, which process tens of thousands of ad auction requests per second. Then, I will share several generalizable lessons we learned in making our systems performant from a machine learning perspective and scalable from an engineering perspective.
13.30 - 13.45 Q&A Session
Unity
13.30 - 13.45
CLOSING & SUMMARY, PRIZE GIVEAWAY
GetInData
Evention
*Times may vary.
Flink SQL in 2021: Time to show off!
Four years ago, the Apache Flink community started adding SQL support to ease and unify the processing of static and streaming data. Today, Flink runs business critical batch and streaming SQL queries at Alibaba, Huawei, Lyft, Uber, Yelp, and many others. Although the community made significant progress in the past years, there are still many things on the roadmap and the development is still speeding up. This session will focus on a comprehensive demo of what is possible with Flink SQL in 2021.
Based on a realistic use case scenario, we'll show how to define tables which are backed by various storage systems and how to solve common tasks with streaming SQL queries. We will demonstrate Flink's Hive integration and show how to define and use user-defined functions. We'll close the session with an outlook of upcoming features.
Keywords: #flink #flinksql #streamprocessing #unifieddataprocessing #apache
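For a taste of what such a demo involves, here is a minimal PyFlink sketch that declares a table over the built-in datagen connector and runs a windowed streaming SQL query. The classic GROUP BY tumbling-window syntax is shown; details vary by Flink version:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(settings)

# a source table backed by the datagen connector (random rows)
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING,
        proc_time AS PROCTIME()
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# one-minute tumbling window over processing time
t_env.execute_sql("""
    SELECT user_id,
           TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, TUMBLE(proc_time, INTERVAL '1' MINUTE)
""").print()
```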

Ververica
Managing Big Data projects in a constantly changing environment - good practices, use cases
The nature of Big Data projects is nowadays one of a kind – they are not like the data warehousing initiatives of the old days, nor like cloud-native application projects, at least not yet. A variety of technologies, complicated architectures and a rapidly changing landscape are just a few of the challenges that IT departments face in such projects. Add the number of stakeholders from different departments involved, and the fact that a Big Data project is sometimes more like R&D with an unpredictable outcome, and you get a mix where the objectives can easily be lost. It is not a surprise that up to 85% of Big Data projects were pure failures (Gartner 2016).
In this talk we will share our experience in planning and executing Big Data initiatives in organisations, with some use cases and good practices in mind.
Keywords: #agile #teammanagement #goodpractices #usecases

GetInData
GetInData
Artificial Intelligence - Building in-house AI capabilities from scratch at Philip Morris International
We will start by sharing how our team is structured and what it is that we deliver, and we will continue by sharing more about our journey and the challenges we faced within a big corporation until we reached a good level of maturity inside the organization. An exposition of tangible use cases will follow, and we will take the other half of the session to talk about technical details, such as the technology stack, CI/CD pipelines, MLOps, and others that help us accelerate delivery.
Keywords: #AI #DL #AIBusiness #Innovation #Productivity #Disruption

Philip Morris International
PMI
Simplifying Stateful Serverless Architectures
Platforms like Knative and FaaS have solved most of the challenges of dealing with stateless applications. Still, when it comes to managing state, developers quickly end up designing and maintaining a complicated architecture without achieving consistency guarantees in the presence of failure. Stateful Functions (StateFun) – developed under the umbrella of Apache Flink – provides consistent messaging and durable state without compromising the serverless experience. Like a database, it exposes its capabilities to application developers in a platform- and language-agnostic manner: StateFun does not mind if you deploy your application as a set of Python functions on your preferred FaaS platform, as a single Spring Boot application on Kubernetes, or as a mixture of both.
In this demo-centric session you will learn about the core ideas behind the project and you will see how to write, deploy and monitor a simple Stateful Functions application.
Keywords: #ApacheFlink #Serverless #Kubernetes #Event-Driven #Scale
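To convey the flavour of a StateFun function in the Python SDK, here is a sketch modelled on the project's examples; the typename, value spec and serving details are version-dependent and illustrative:

```python
from statefun import StatefulFunctions, ValueSpec, IntType

functions = StatefulFunctions()

# a greeter that keeps a per-user counter in durable state
@functions.bind(typename="example/greeter",
                specs=[ValueSpec(name="seen", type=IntType)])
async def greeter(context, message):
    seen = context.storage.seen or 0
    context.storage.seen = seen + 1  # persisted by the StateFun runtime
    print(f"Hello {context.address.id} for the {seen + 1}th time!")

# in a real deployment, functions are served over HTTP, e.g. by wiring a
# RequestReplyHandler into an aiohttp or Flask endpoint
```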

Ververica
CICD Pipeline and delivery of Apache Spark Applications on the cloud using AWS
The session will start with a quick background introduction to the CSU data lake architecture and DataOps framework, where we will discuss the principles of CI/CD and a process overview, and the development unit and integration testing pipeline. Furthermore, we will walk through the process and demonstrate how we use AWS CodeCommit and CodeBuild to automate testing and code coverage. The next part of the meeting will focus on the production deployment pipeline, featuring an overview of the process and a demonstration of how we use AWS CodeCommit, CodeBuild and CodePipeline to deploy Spark applications to the production environment.
Keywords: #Automation #CICD #HigherEd #Spark

California State University
Data Warehouse Development Lead.
Building scalable and testable data pipeline through a data pipeline domain specific language
Data pipeline architecture, design and builds have similar concerns as any software product development. The purpose of the presentation is to uncover these concerns and present one of the solutions. The presentation covers aspects of the data pipeline such as:
1. Configuration driven composable data pipeline
2. Testable data pipeline through specification language such as Gherkin
3. Design of the pipeline to solve for the data pipeline concerns
Keywords: #DataEngineering #ComposableDataPipeline #SOLIDPrincipleInDataPipelineDesign #GHERKINandDataPipelineSpecification #BDDTDDinDataPipeline
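As an illustration of point 2, here is a Gherkin scenario paired with behave step definitions for a toy aggregation step; the feature text lives in a .feature file, and all names are illustrative, not the speaker's actual DSL:

```python
# features/steps/pipeline_steps.py -- pairs with this Gherkin scenario:
#
#   Feature: daily revenue aggregation
#     Scenario: summing two orders
#       Given the raw orders
#         | order_id | amount |
#         | 1        | 10.0   |
#         | 2        | 5.0    |
#       When the revenue pipeline runs
#       Then the total revenue is 15.0

from behave import given, when, then

@given("the raw orders")
def step_orders(context):
    # behave exposes the scenario's data table as context.table
    context.orders = [row.as_dict() for row in context.table]

@when("the revenue pipeline runs")
def step_run(context):
    context.total = sum(float(o["amount"]) for o in context.orders)

@then("the total revenue is {expected:f}")
def step_check(context, expected):
    assert abs(context.total - expected) < 1e-9
```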

Independent speaker
Creating Confidence in Data at Klarna - A Case Study in Automatic Data Validation
We can all agree that when making big decisions, you want to make them with confidence. In the world of art, this means buying your painting from a well-known auction house instead of from the back of an old car. In the world of data, this means validating your data before using it.
During this session, we will show you how to quickly create confidence in data using automatic data validation. We will describe the validation process that we are using at Klarna, and show how this enables rapid improvements of big data transformations.
Keywords: #data-transformation #transformation-improvement #data-confidence #automatic-data-validation #data-validation-tool

Klarna Bank AB
Klarna Bank AB
Modern radars: from simple signal processing towards modern complex data analytics with deep learning
During this session, we will concentrate on the status of data science & AI in geophysics/geology. We will discuss and present analytical and software challenges with regard to multidimensional radar data. Furthermore, the session will conclude with an evaluation of applying open-source big data technologies to solve complex analytical problems.
Keywords: #radars #deeplearning #bigdata #opensource

SGPR.TECH
SGPR.TECH
Organising the chaos - metadata in action
Several key issues arise when building data-driven products and services: primarily searching for data sets, understanding the possibilities and limitations of individual sets, gaining access to data, and using data in a controlled and transparent way.
Keywords: #metadata
Ab Initio Software
Building Data Ingestion Platform using Hadoop
The state of data platforms in the tech industry. ING WBDM's vision of the future of data ingestion. Highlights of the ING Data Ingestion Platform's main components and features. The Hadoop and FOSS revolution has reshaped the data engineering landscape. Using virtual and physical machines to give life to a high-availability, disaster-recovery-ready platform. In the search for creating a cutting-edge data platform at ING, we are faced with challenging new requirements such as cloud-ready deployments into production, whilst ensuring proper data governance, risk and security principles. Please join us in this session, where we will share ING WBDM's experience of how to make a data platform based on open source components both enterprise- and cloud-ready, with an overview of the current state and vision of our platform.
Keywords: #dataingestion #hadoop #nifi

ING
AWS Serverless Pipelines
How to use AWS managed services for near real-time data processing, and the design of a recommendations system.
Keywords: #serverless #cloud #AWS #microservices #recommendersystems
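A minimal sketch of the serverless building block involved – an AWS Lambda handler consuming a Kinesis batch. The event shape is the standard Kinesis trigger format; the processing itself is illustrative:

```python
import base64
import json

def handler(event, context):
    """Triggered by a Kinesis stream; records arrive base64-encoded."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # illustrative: emit a recommendation event, write to DynamoDB, etc.
        print(payload.get("user_id"), payload.get("item_id"))
    return {"processed": len(event["Records"])}
```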

StepStone Services
Battle lessons for machine data in an Oil Refinery
During this session, we will review the machine data scenario in oil refineries, spreading the knowledge out to SMEs. We will contemplate the myth of industrial interoperability. Furthermore, we will find out how to store the data without spending a fortune, and tackle the question of analytics tooling: Excel vs R.
Keywords: #OPC #timeseries #cloudopex #networksniffing #Excel

CEPSA
Datumize
Data Strategy. The Game!
This is an extension of, and further context for, Taras Bachynskyy's roundtable "Data Strategy. The Game!" – a very inspiring approach to unlocking the hidden value of data. How do you maximize business value? Which strategy approach is best? How do your decisions influence others within the organization in the context of data?
SoftServe
Predicting effectiveness of marketing campaigns on Facebook platform
We will learn about data acquisition, in particular the Facebook Marketing API data structure, while defining the target variable. Furthermore, this session aims to discuss the wide-and-deep architecture for modeling categorical, text and image data together, as well as global XAI with Shapley values and local XAI with Anchors.
Keywords: #WideandDeep #XAI #Explainability #Facebook #Marketing

Sotrender
Sotrender
Sotrender
Common mistakes that make your chart hard to understand with practical solutions to avoid them.
In the world of big data, data visualization tools and technologies are essential for analyzing massive amounts of information. Although data visualizations are commonly used, they are often inaccurate and misleading. To support data-driven decisions, it's crucial to create reliable charts that leave no space for misunderstanding. There are mistakes that can be easily avoided, so let me show you how to do this!
Keywords: #datavisualization #datadesign #dataliteracy #uidesign #graphicacy

Freelancer
Top 10 Big Data Systems Pitfalls - war stories and lessons learned
Would you like to hear war stories about the design and implementation of Big Data solutions? Have you ever wondered why the Big Data ecosystem is changing so rapidly and is not yet stable? How to choose the proper technologies and solutions to achieve project goals? I do not promise any silver bullets. I simply want to share the experience I gained during the last 5 years of crafting data-intensive applications.
There are not many experienced Big Data architects in this JVM world. There are even fewer who are willing to share their lessons learned and tell true war stories. This talk is going to address common problems which happen all the time in Big Data software development.
Keywords: #BigDataWarStories #DataAsAService #BigDataROI #LessonsLearned

Datumo