Edition 2021 - Big Data Technology Warsaw Summit
February 23, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 1
9:00 – 13:00
3 WORKSHOPS - DAY I
19.00 - 21.00
EVENING MEETING
(speaker’s presentation + discussion)
Pandemic, data and analytics – how might we know what happens next with Covid-19?
A special evening meeting prior to the Big Data Technology Warsaw Summit.
There are smart people and great research teams working on forecasting models for pandemic developments. What data do they use, which models do they apply, how can the problem be approached, how accurate are the forecasts, and what are the major challenges? How does the big data community contribute to fighting Covid-19? These are the questions we would like to address during this unique online meeting. We have invited very special guests (including experts from MOCOS and ICM UW) – everyone is encouraged to join the discussion and ask questions!
♦ What makes the field of pandemic modelling and simulation so interesting and challenging? How to predict risks using available data and proper modelling?
♦ How is it done – a large-scale geographical microsimulation model for pandemics?
♦ What can we do to improve pandemic forecasting and to predict the efficiency of various countermeasures to slow it down? Is AI/ML enough?
In the meeting agenda:
18.45 – 19.00
Networking online
19.00 – 19.05
Opening remarks
What makes the field of pandemic modelling and simulation so interesting and challenging.
Evention
19.05 – 19.25
How it is done – large scale geographical microsimulation model for pandemics.
ICM University of Warsaw
19.25 – 19.30
Short Q&A
19.30 – 19.45
What can we do for better pandemic forecasting and for predicting the efficiency of various countermeasures to slow it down? Is AI/ML any good for it?
MOCOS
19.45 – 20.00
Computational side of the algorithm used by MOCOS Group
MOCOS Group
20.00 – 20.10
Q&A
20.10 – 20.30
Collaborative forecasting of COVID-19: Assembling, comparing and combining short-term predictions.
Heidelberg Institute for Theoretical Studies (HITS), Karlsruhe Institute of Technology (KIT)
20.30 - 21.00
Open discussion for everybody
February 24, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 2
9:00 – 13:00
3 WORKSHOPS - DAY II
19.00 - 20.00
EVENING MEETING
(speaker’s presentation + discussion)
The meeting is organized in partnership with ING Tech Poland.
In the meeting agenda:
19.00 - 19.10
Welcome Address Speech
ING Hubs Poland
19.10 - 19.30
Data in the labour market – salaries and trends
Hays Poland
19.30 - 20.00
Panel discussion with representatives of Big Data and AI enterprises recruiting technical people
ING Banking Technology Platform, ING
Disney Streaming Services
Allegro
Panel chair:
Evention
February 25, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 1
12.30 - 13.00
TIME FOR NETWORKING ONLINE
13.00 - 13.10
CONFERENCE OPENING
GetInData | Part of Xebia
Evention
PLENARY SESSION
13.10 - 13.35
5 big data trends that redefine Edge to AI journey
During the session we will discuss the key trends redefining the way companies manage data and analytics lifecycle. The presenters will explain:
♦ the importance of disaggregation of compute and storage,
♦ advancements in stateful processing in Kubernetes,
♦ growing role of cloud and real-time processing for businesses in Poland.
Keywords: #DataArchitecture #Kubernetes #Streaming #MachineLearning #Cloud #BusinessAgility
13:35 - 13:50 Q&A Session
3Soft
Cloudera
13.35 - 14.00
High-Performance Data Analytics in a Hybrid and Multi-Cloud World
Many enterprises are re-thinking their data analytics strategy. Some plan to stay on-prem for GDPR reasons. Others are all in for full-cloud but want to stay agnostic. And still others require a hybrid approach: run certain workloads on-prem and move others to the cloud to capitalize on cloud economics. With object stores emerging as the main winners in the post-Hadoop era for cost-effective storage, enterprises are adopting them independently from the evolution of their EDW, Data Lakes, and Data Science platforms. Finally, there’s a convergence movement underway, causing enterprises to unify their data analytics platforms (EDW, Data Lakes and Data Science platforms) and supporting the broadest deployment models. Join us for this session to learn how Vertica can support your vision with a new era of data analytics in a hybrid and cloud-agnostic fashion, supporting a variety of object store technologies.
14:00 - 14:15 Q&A Session
Vertica
SIMULTANEOUS SESSIONS PART I
14.05 - 14.35
Architecture Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
Outfit7
Criteo
Doctolib
MLOps journey in H&M
In this session you will learn how H&M evolves a reference architecture covering the entire MLOps stack, addressing common challenges in AI and machine learning products such as development efficiency, end-to-end traceability, and speed to production. This architecture has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment and engineers to manage large-scale model training and model serving pipelines with full traceability.
The presenting team is currently responsible for ensuring that best practices and the reference architecture are implemented across all product teams to accelerate the H&M Group's data-driven decision-making journey.
Keywords: #MLOps #AIAtScale #MachineLearning #Engineering #DataScience
14.35 - 14.50 Q&A session
H&M
Building recommender systems: from algorithms to production
Machine learning-powered systems have become an essential part of most businesses. One such example is recommender systems, which adapt to customer behavior to provide an organic way to make domains like clothes, books, or music explorable. In order to successfully put such systems into production, we need to bridge the gap from the raw mathematical models and algorithms to robust and scalable software systems. In this talk, we start out with core approaches to recommender systems like collaborative filtering or click probability prediction, and follow this journey to explore how theory and practice come together.
Keywords: #machinelearning, #recommendations, #production, #architecture
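To make the collaborative-filtering idea mentioned in the abstract concrete, here is a minimal, self-contained sketch that recommends an item from a toy user-item interaction matrix; the data and variable names are illustrative only and are not taken from the talk.

    import numpy as np

    # Toy user-item interaction matrix (rows: users, columns: items);
    # 1 means the user clicked/bought the item, 0 means no interaction.
    interactions = np.array([
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 1],
    ])

    # Item-item cosine similarity: normalize item vectors, then take dot products.
    item_vectors = interactions.T.astype(float)
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    similarity = (item_vectors / norms) @ (item_vectors / norms).T

    # Score unseen items for user 0 by summing similarities to items they interacted with.
    user = interactions[0]
    scores = similarity @ user
    scores[user == 1] = -np.inf  # do not recommend items already seen
    print("Recommended item index for user 0:", int(np.argmax(scores)))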
14.35 - 14.50 Q&A session
14.40 - 15.10
Architecture Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
Welcome to MLOps candy shop and choose your flavour!
Operationalizing machine learning (feature delivery, model training, deployment and serving) is nowadays one of the most challenging areas in fast-growing, data-driven companies. The variety of open-source components (Kubeflow, MLflow, Kedro, to name a few) and the set of specialized managed services provided by every major cloud provider drive solution architects nuts.
At GetInData we have a solution for it - we call it the GetInData MLOps Platform: a set of reusable components, following the Unix toolset philosophy ("do one thing and do it best") and portable to any environment. Thanks to loose coupling, it is also adjustable to clients' current and future ML-related challenges - like a candy shop, where the first person needs super-fast online predictions, the second requires robust hyperparameter tuning for the best possible models, and the third aims for scalable collaboration on feature extraction across many data science teams.
During the presentation we will show you two components we're really excited about - the Kedro-Kubeflow integration and a Feast-based feature store - how we implement them and what clients use them for. Welcome to our MLOps candy shop that no pandemic can close 😉
Keywords: #MachineLearning #MLops #FeatureStore #Kubeflow #Kedro #OpenSource #Feast
15.10 - 15.25 Q&A session
GetInData
GetInData | Part of Xebia
Popmon - population shift monitoring made easy
Tracking model performance is crucial to guarantee that a model running in production behaves as designed initially. Changes in the incoming data can affect the performance and make predictions unreliable. Given that input data often change over time, it is important to periodically track changes in both input distributions and delivered predictions, and to act on them when they differ significantly - for example, to diagnose and retrain an incorrect model in production. To make monitoring both more consistent and semi-automatic, at ING WBAA we have developed a generic Python package called popmon to monitor the stability of data populations over time, using techniques from statistical process control: https://github.com/ing-bank/popmon. popmon works with both pandas and Spark datasets. It creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using static or dynamic monitoring business rules.
Keywords: #
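A minimal usage sketch of the popmon package described above, assuming a pandas DataFrame with a date column; exact parameter names and defaults may differ between popmon versions.

    import pandas as pd
    import popmon  # registers the .pm_stability_report accessor on pandas DataFrames

    # Hypothetical dataset with a time axis and a couple of features to monitor.
    df = pd.DataFrame({
        "date": pd.date_range("2021-01-01", periods=100, freq="D"),
        "amount": range(100),
        "channel": ["web", "app"] * 50,
    })

    # Build time-sliced histograms and compare their profiles and distributions over time.
    report = df.pm_stability_report(time_axis="date", time_width="1w")
    report.to_file("stability_report.html")  # self-contained HTML monitoring report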
15.10 - 15.25 Q&A session
ING WBAA
ING WBAA
ModelOps – Operationalizing Modern Analytics & AI
Modern software development requires a comprehensive approach with well-defined processes of code development, testing, and deployment. That is where DevOps methodology comes in. It helps software developers in the seamless moving of their newly implemented features from development to production. But what about other areas of information technologies like advanced analytics, artificial intelligence, and machine learning? That’s where ModelOps comes to play – an approach that takes what’s best in DevOps and expands it to apply in the analytical world. During the presentation, we will show how continuous integration and deployment with the help of specialized tools, specifically designed for analytical purposes, can be leveraged to implement such an approach. This strategy can dramatically reduce the time-to-value for analytical assets developed within the organization, ensuring all those assets are methodically managed and safely updated with the most advanced, well tested analytical models.
Keywords:#AI #Analytics #ML #MachineLearning #DeepLearning #ModelOps #DevOps #XOps #MLOps
15.10 - 15.25 Q&A session
SAS Institute
SAS Institute
Thrive in the Data Age: how Siemens and BMW Group leverage machine learning for cybersecurity, operations and business use cases using Splunk
Machine learning is essential for solving use cases in cybersecurity, operations and various lines of business. This talk provides you with an overview of Splunk's big data and machine learning technologies that are used to solve real-world use cases. For example, we dive into the technical details of two selected customer examples of applied machine learning. First, we explain how a datacenter division of Siemens uses Splunk with unsupervised and supervised machine learning approaches in cybersecurity to uncover anomalies and to automate the classification of security events. We highlight the technical details of how this use case has been addressed with Splunk's Machine Learning Toolkit. Second, we explain how BMW Group's Innovation Lab developed a Predictive Testing Strategy with Splunk and a deep learning approach. Details are provided on how the Deep Learning Toolkit App for Splunk was used to build and evaluate a TensorFlow-based model to solve that use case. We conclude the session with an outlook and a wrap-up on all available technical resources.
Keywords: #datascience #ai #machinelearning #deeplearning #cybersecurity #operations #analytics
15.10 - 15.25 Q&A session
Splunk
15.10 - 15.15
TECHNICAL BREAK
ROUNDTABLE SESSIONS PART I
15.15 - 16.05
Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to the group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion – selected professionals with vast knowledge and experience.
1. Managing a Big Data project – how to make it all work well together?
Spotify
2. Big Data on Kubernetes
Kubernetes was created for 'stateless' apps, not 'stateful' ones – so why should we consider it for Big Data? 'Stateful' apps come with persistent volume support, but many databases do not support it yet – so how can companies overcome this challenge? Is Spark + HDP the only reasonable solution for data transformation on K8s? What about other solutions – does it make sense to consider any others? Let's look at Telco 5G requirements – everything must run on K8s – where Hadoop is being replaced by an object storage solution that can be orchestrated by K8s to simplify the overall architecture. Finally, what about K8s on-prem vs. in the cloud – which direction is the way to go?
Vertica
3. Data discovery – building trust around your data
Building trust in data falls under one of the four main pillars of a good data setup - Data Governance. In this roundtable we will do a quick overview of the 4 pillars and how to go about building a trustworthy setup. Topics will cover data lineage, accuracy and completeness vs cost, toolkit available, tactics on recording 'truth' vs business interpretation and how to build a setup that will improve over time, rather than degrade.
Tesco Bank
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing is nowadays more and more popular and important. There are a lot of use cases: data capturing, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning like customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let's talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. From on-premise to the cloud: an end to end cloud migration journey
GetInData
6. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
MI2.AI
7. Transactional Data Lakes with Apache Spark (and Delta Lake, Apache Hudi and Apache Iceberg)
There is a trend in big data management space to add features we all know from relational databases, most notably ACID transactions and versioning. That's the main focus of open source projects Delta Lake, Apache Hudi and Apache Iceberg. They are storage layers on Hadoop DFS-like file systems and object stores that together with Apache Spark's capabilities allow building "reliable data lakes at scale". You're invited to discuss the pros and cons of each and how to use them effectively in our big data projects. All are equally welcome regardless of their experience and expertise. Let's share what we've already learnt and further deepen our understanding learning from others.
8. Operationalizing Analytics – sharing experience and best practices
The promise and potential business value of analytics is endless, which is why companies have spent the last decade investing in the right people, data, processes, and enabling technology. Yet studies show that less than 50% of the best models get deployed, 90% of models take more than three months to deploy and 44% of models take over seven months to be put into production.
SAS
SAS
9. Monitoring performance of ML models
Monitoring ML models running online on production data can be a challenge. Let's discuss the biggest difficulties and how to manage them, and what kind of tools you are using to detect problems with the input data and the results.
ING WBAA
10. We've got a model! What are the next challenges of deploying it at scale?
Training a good ML model is only the beginning of the journey. The next question is: how to integrate it with production systems robustly and effectively? Let's discuss your experience with ML model deployment challenges like continuous model training, training-serving skew, data drift, and model serving infrastructure.
GetInData | Part of Xebia
16.05 - 16.10
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART II
16.15 - 16.45
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Casting the Spell: Druid in Practice
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use-cases, including inflight analytics, reporting and building target audiences. The common challenge of these use-cases is counting distinct elements in real-time at scale. We've been using Druid to solve these problems for the past 5 years, and gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
Keywords: #BigData #ApacheDruid #RealtimeAnalytics #DataArchitecture #DataEngineering
16.45 - 17.00 Q&A Session
Nielsen Identity
Databricks
BigFlow – A Python framework for data processing on the Google Cloud Platform
You will learn about a tool that can improve your big data projects on GCP. Unified structure, configuration, versioning, build, deployment, and more, available for Dataflow/Dataproc/BigQuery.
Keywords: #gcp #python #dataflow #dataproc #bigquery
16.45 - 17.00 Q&A Session
Allegro
Training and deploying machine learning models with Google Cloud Platform
In my presentation I would like to present some approaches, good practices and Google Cloud components that we use in Sotrender to effectively train and deploy our machine learning models, which are used to analyze Social Media data. I will discuss which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Keywords: #mlops #gcp #python #nlp #computervision
16.45 - 17.00 Q&A Session
Sotrender
How to optimize the time needed to find and understand data as part of a Big Data project.
Many BigData projects focus on implementing technological solutions, forgetting their purpose, i.e. the needs or applications they are to serve.
Investments (Big Data projects have quite high budgets) are often made mainly in the IT area, ignoring individual areas of business activity, and do not generate much profit from a business point of view, which creates a high risk of failure for the entire project. During the presentation, we will talk about the factors that pose a threat to Big Data projects. Together, we will consider what analysts need, who Big Data's "client" is, and what they expect. How do you achieve effective cooperation between the analyst and the "client" by implementing data management, and above all by implementing an interface that bridges the accumulated knowledge about the data and the data recipient? We will show how much AI can support the use of Big Data's potential. Can you get information about your data while having your morning coffee? Yes – with the Clarite AI Data Assistant, all you need to do is ask a question about the data in natural language. With that, you will easily enter the era of human-data communication and big data.
Keywords: #dataplatform #datamanagement #businessdata #AI #DataGovernance #WatsonKnowledgeCatalog #KnowYourData #AIClariteAssistant
16.45 - 17.00 Q&A Session
Clarite Polska
Clarite Polska
16.50 - 17.20
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Data lineage and observability with Marquez and OpenLineage
Data is increasingly becoming core to many products. Whether to improve recommendations for users, getting insights on how they use the product or using machine learning to improve the experience. This creates a critical need for understanding how data is flowing through our systems. Data pipelines must be auditable, reliable and run on time. Tracking lineage and metadata is the underlying foundation that enables many use cases related to data. It provides understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It enables governance and compliance and generally helps you keep you data running. Marquez is an open source project part of the LF AI which instruments data pipelines to collect lineage and metadata and enable those use cases. It provides context by making visible dependencies across organisations and technologies and enables lineage governance and discovery.
Keywords: #lineage #observability #dataops
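To give a flavour of the lineage metadata such a system collects, the sketch below posts a hypothetical OpenLineage-style run event over HTTP; the endpoint path, namespaces and dataset names are assumptions for illustration and are not taken from the talk.

    import json
    from datetime import datetime, timezone
    from uuid import uuid4

    import requests

    # Hypothetical lineage event: one pipeline run reading a raw table and writing a cleaned one.
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-scheduler",
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "analytics", "name": "daily_orders_clean"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "clean.orders"}],
    }

    # Assumed endpoint of a metadata service collecting lineage events (placeholder URL).
    resp = requests.post(
        "http://localhost:5000/api/v1/lineage",
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()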
17.20 - 17.35 Q&A session
Datakin
Causal Mediation Analysis in the E-Commerce Industry
Causal mediation analysis is a formal statistical framework to reveal the underlying causal mechanism in randomized controlled experiments. The causal mechanism is referred to as a process that the treatment affects the outcome through some intermediate variables that can be referred to as mediators. Causal mediation analysis has been widely employed in various disciplines. However, it has not been applied to online A/B tests, the large scale online randomized controlled experiments in the daily practice of the internet industry. Perhaps it is because online A/B tests in the internet industry are primarily for evaluation: estimating and testing the average treatment effect. In this talk, we will discuss two of our recent works on the development of causal mediation analysis for producing insights for search and recommendation systems in the e-commerce industry.
Keywords: #causalinference #A/Btests #informationretrieval #searchmetrics #evaluation
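For readers unfamiliar with the framework, here is a toy product-of-coefficients sketch of mediation analysis on simulated data; it only illustrates the direct/indirect effect decomposition, not the estimators used in the talk.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 10_000

    # Simulated experiment: treatment T affects mediator M, which affects outcome Y.
    T = rng.integers(0, 2, size=n).astype(float)
    M = 0.5 * T + rng.normal(size=n)             # mediator model: M = a*T + noise
    Y = 0.3 * T + 0.8 * M + rng.normal(size=n)   # outcome model: Y = c'*T + b*M + noise

    a = LinearRegression().fit(T.reshape(-1, 1), M).coef_[0]
    outcome = LinearRegression().fit(np.column_stack([T, M]), Y)
    c_direct, b = outcome.coef_

    print(f"direct effect   ~ {c_direct:.2f}")   # expected ~0.3
    print(f"indirect effect ~ {a * b:.2f}")      # expected ~0.5 * 0.8 = 0.4
    print(f"total effect    ~ {c_direct + a * b:.2f}")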
17.20 - 17.35 Q&A session
Causal Inference and Experimentation, Udemy
Foundations of Data Teams
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure. This talk will cover the importance of a solid foundation and what management should do to fix it.
Keywords: #management #data teams #data engineers #data scientists #operations
17.20 - 17.35 Q&A session
Big Data Institute
PLENARY SESSION
17.25 - 17.55
The Journey to Data Cloud
Snowflake is the leading data platform for the cloud era. I will present its features as a modern data warehouse, uniquely exploiting the cloud capabilities to meet growing users' needs. Then I will discuss how it became the foundation of the Data Cloud, a revolutionary solution that opens the world's data to all organizations.
Keywords: #cloud #SQL #analytics, #scalability #datasharing #datawarehouse
17.55 - 18.10 Q&A session
Snowflake
17.55 - 18.30
SUMMARY OF THE DAY, PRIZE GIVEAWAY AND A SURPRISE*!
GetInData | Part of Xebia
Evention
* Live DJ performance especially for the participants of the meeting - DJ Michał Stochalski
He sets trends, creates new DJ sets and constantly improves his music skills, according to his motto "Excellence is earned throughout life". As a Video DJ he presents a combination of image and sound mixing live music with corresponding clips.
February 26, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 2
09.00 - 09.05
OPENING OF THE SECOND DAY
PLENARY SESSION
09.05 - 09.35
Fast growth iteration via A/B testing
Online A/B testing gives people a powerful tool to quickly examine whether a product hypothesis is true or not. It is no secret that most tech giants, such as Google and Atlassian, grow their products by leveraging this tool in fast iterations. In this talk, we will first review the product growth process used at Google and Atlassian, and then zoom in to give some tips on the common problems encountered when conducting A/B experiments. Lastly, we will spend some time discussing with the participants their experience and learnings from conducting A/B experiments.
Atlassian
SIMULTANEOUS SESSIONS PART III
09.40 - 10.10
Architecture Operations & Cloud
AI, ML and Data Science
AI, ML and Data Science II
Data Engineering
Management of a cloud Data Lake in practice: How to manage 1000s of ETLs using Apache Spark
Nowadays the problem of speed of processing is seemingly solved. Unless you process tens of petabytes, an off-the-shelf toolset will suffice for most problems. Currently the main challenges in data lake systems are in the field of data governance:
• how do you make sure data is discoverable, reusable, up to date and of high quality?
• how to avoid huge technical debt when developing massive number of complex data flows?
• how to guarantee that the project can scale despite having access to very scarce human resources and technical talent?
The goal of this talk is to showcase how to design a data lake management system that is scalable in the broadest meaning of the word: one that not only scales with the growth of the data, but also with the growth of the complexity of the whole enterprise. The talk will outline the business reasoning and key design principles as well as the technical solution. Expect some (but not too many) nerdy details related to the Apache Spark implementation.
Keywords: #DataGovernance #DataLake #DataQuality #Cloud #ApacheSpark #Azure #DataBricks
10.10 - 10.25 Q&A Session
DXC Luxoft
Building an analytics platform from scratch while developing production solutions on top of it.
Story of a year-long journey with an Asian telecom of creating a positive feedback loop between building a data analytics platform and moving analytics into it.
Like in every great story, you can expect:
introduction - reasonably well defined scope and set of main characters,
turning points - how life took us on a journey of changed ownership, discovering new needs, learning by teaching and bold goals achieved by doing progress over perfection,
conclusion - how it all came together to change the way the analytics team works and productizes its models.
This story includes real-world analytics use-cases such as: cost of network incidents, ARPU (Average Revenue Per User), NBO (Next Best Offer), cost of service, RFM (Recency, Frequency, Monetary Value), churn and a few more.
Everything seasoned with open-source (Spark, Presto, Nbdev) and latest ML Ops technologies (KubeFlow, Feast).
Keywords: #ModelProductization #MachineLearning #FeatureStore #KubeFlow #DataScience #OpenSource #OnPremise
10.10 - 10.25 Q&A Session
GetInData
GetInData
When HR meets Artificial Intelligence.
Digital transformation takes a lot of our processes to new levels. Especially since the eruption of the pandemic, we need to look for new solutions to old problems. For example, if we open up to remote IT specialists, we can easily get 10x-100x more candidates while keeping the same recruitment procedures.
During my presentation, I would like to share ideas of how AI-driven tools are and can be used in HR and Talent Management:
- Current possibilities of what is feasible (including some demos),
- Where we can take these tools in the foreseeable future,
- What are the current technical challenges,
- Impact challenges, especially from an ethical standpoint.
Keywords: #artificialintelligence #nlp #digitaltransformation #futureofwork #hcm
10.10 - 10.25 Q&A Session
Revolut
Cisco
10.15 - 10.45
Architecture, Operation and Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science
Data Engineering
Expanding your data & analysis ecosystem with public cloud
Going to the cloud sounds fantastic; giving people new opportunities - priceless. The public cloud gives you a lot of ready-made systems that you want to use. But when you already have a working environment based on your own data center, a lot of data, and plenty of people using these tools, your work will involve a lot of fascinating challenges: transferring data, moving from one tool to another, selecting tools without consternation, cost optimization, policies, security, and many others.
Keywords: #hadoop #spark #airflow #gcp #bigquery #composer #dataproc #data analysis
10.45 - 11.00 Q&A Session
Allegro
Streaming SQL - Be Like Water My Friend
Data has to be processed fast, so that a firm can react to changing business conditions in real time. Streaming SQL gives us the possibility to make stream processing available to a broader audience, but it also makes it easier to access data streams. This presentation will not only give you a brief overview of the data and streaming architecture at InnoGames, but also introduce you to the idea of Streaming SQL in general and how it is implemented in Apache Flink. Furthermore, it shows actual examples of how to use Flink SQL, so that you hopefully are inspired to consider this rather new technology to tackle your data challenges.
Keywords: #streaming #streamingsql #flink #dataflow #flinksql
10.45 - 11.00 Q&A Session
InnoGames GmbH
Make it personal: reinforcement learning for mere mortals
During this session we will reflect upon the importance of personalization in e-commerce. What challenges occurred as a result of bridging the gap between Google's AlphaGo and the real world? Furthermore, we will discuss Vowpal Wabbit: the Swiss army knife of ML algorithms.
To sum up, we will introduce a case study exercise, during which participants will be creating a personalized user experience on a webpage.
Keywords: #personalization #ecommerce #vowpalwabbit #reinforcementlearning #opensource
10.45 - 11.00 Q&A Session
eBay Classifieds Group
OLX Group
10.45 - 10.50
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART IV
10.55 - 11.25
Architecture Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
Presto: SQL-on-Anything & Anywhere
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale at organizations including Airbnb, Comcast, Facebook, FINRA, GrubHub, LinkedIn, Lyft, Netflix, Twitter, Uber, and Zalando, Presto has experienced unprecedented growth in popularity in both on-premise and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in object storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines. In particular, Starburst developed a native integration for Presto that leverages Delta-specific performance optimizations.
Join this session and hear how Starburst Presto deployed on Azure Kubernetes Service (AKS) serves as a fast SQL query engine over data in ADLS, and enables query-time correlations between IoT data in Delta Lake, customer data in SQL Server, and web log data in Elasticsearch.
You will also gain best practice and real-life insights and lessons learned from production deployment of this integration.
Keywords: #presto #sql #analyticsanywhere #azure
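As a rough illustration of the federated querying described above, the sketch below issues a cross-source join through the Python DB-API client for Presto/Trino; the host, catalog and table names are placeholders and assume those connectors are configured on the cluster.

    import trino  # Presto/Trino Python DB-API client (package name may differ by distribution)

    # Hypothetical catalogs: 'delta' and 'sqlserver' must be configured on the cluster;
    # host, schema and table names below are illustrative only.
    conn = trino.dbapi.connect(
        host="presto.example.com", port=8080, user="analyst",
        catalog="delta", schema="iot",
    )
    cur = conn.cursor()

    # Query-time join across two data sources, as described in the session abstract.
    cur.execute("""
        SELECT c.customer_name, COUNT(*) AS readings
        FROM delta.iot.sensor_readings r
        JOIN sqlserver.crm.customers c ON r.customer_id = c.id
        GROUP BY c.customer_name
        ORDER BY readings DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)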
11.25 - 11.40 Q&A session
Starburst
Starburst
Complex event-driven applications with Kafka Streams
At Simply Business we have built a rich stateful application using Kafka Streams to manage leads that are served by our 300-person-strong UK call centre. This application combines many different data points from different services and has a few internal data stores to operate. Our initial design of the application became quite complex, so we had to make changes to ensure scalability and reliability. In the talk I'll present the techniques we used to simplify it.
But that's not all. I'll also talk about other key components like the schema registry, which helped with the robustness of the solution, and lead scoring, which helped to increase the return on call spend by over 50%.
Keywords: #data-streaming #kafkastreams #schemaregistry #domainevents #evolutionaryarchitecture
11.25 - 11.40 Q&A session
Simply Business
How to build a state-of-the-art weather forecasting AI service
Weather forecasting is important in many fields, and minor improvements in accuracy can have a considerable business impact. Today, weather forecasting is performed using computationally expensive mathematical models based on the Navier-Stokes and mass continuity equations, the first law of thermodynamics, and the ideal gas law. These models simulate the physical world and are making use of expensive supercomputers. Alternatively, AI can be used to learn from data and produce forecasts in a fraction of the time the physical simulations require, and in many cases at a higher degree of accuracy. In this presentation, I will show how an AI was used to produce competitive forecasts, using state-of-the-art AI models and neural architecture search, and how I used React to prototype a weather forecasting service.
11.25 - 11.40 Q&A session
Peltarion
Big Data Instruments and Partnerships - Microsoft ecosystem update
This session will be focused on the strategic side of our big data investments, from two major angles – instruments that we have – and partnerships – how we are impacting and enriching the external ecosystem. It will be a session in-between tech and business.
11.25 - 11.40 Q&A session
Microsoft Russia, Microsoft
11.30 - 12.00
Architecture Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
AWS Spot instances price prediction - towards cost optimization for Big Data
Analytical data processing has become the cornerstone of today's businesses' success, and it is facilitated by Big Data platforms that offer virtually limitless scalability. However, minimizing the total cost of ownership (TCO) for the infrastructure can be challenging. By analyzing spot instance price history using ARIMA models, it is feasible to leverage the discounted prices of the cloud spot market with a limited risk of analytical job termination. In particular, we evaluated savings opportunities when using Amazon EC2 spot instances compared to on-demand resources. During the presentation we show the evaluation of univariate spot price regression models that forecast future prices, and we confirm the feasibility of short-term spot price prediction using real data from AWS. This confirms cost-saving opportunities of up to 80% compared to on-demand pricing, within 1% of the absolute minimum.
Keywords: #TCO #CloudComputing #ARIMA #AWS #Spot
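A minimal sketch of the univariate forecasting idea on a synthetic price series, assuming statsmodels; the actual model orders and evaluation used in the research may differ.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic hourly spot-price history standing in for the AWS price feed.
    rng = np.random.default_rng(42)
    hours = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
    prices = pd.Series(0.10 + 0.005 * rng.standard_normal(len(hours)).cumsum(), index=hours)

    # Fit a simple ARIMA(1,1,1) model and forecast the next 24 hours of prices.
    model = ARIMA(prices, order=(1, 1, 1))
    fitted = model.fit()
    forecast = fitted.forecast(steps=24)
    print(forecast.head())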
12.00 - 12.15 Q&A Session
Nowa Era
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Evolving Bolt from batch jobs to real-time stream processing - migration, lessons learned, value unleashed
We would like to invite you to discuss how Bolt migrated from batch and synchronous to real-time and asynchronous. During our session we will review and evaluate the obstacles we faced along the way and lessons we have learned. We will also focus on the unleashed value of real-time data.
Keywords: #kafka #streaming #data #realtime
12.00 - 12.15 Q&A Session
Bolt
How do NoMagic robots improve thanks to a Software 2.0 improvement cycle supported by an in-house data engine?
A typical Software 2.0 improvement cycle at NoMagic:
* analyse when a data driven algorithm performs poorly
* gather and label data that will improve the algorithm or propose a modified algorithm
* train the model
* test the model
* deploy to production
Making this cycle as frictionless as possible was the focus of the NoMagic ML team in 2020.
I will share with you what we achieved and how it changed the way we work at NoMagic.
12.00 - 12.15 Q&A Session
NoMagic
How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects
During this session, we aim to review the technical and organisational challenges we faced while building a complex AI-based app with a short time-to-market. How was it additionally influenced by the dispersion of the teams involved around the world (10 time zones)? What unpredictable events affected our plans, e.g. the COVID pandemic or the development vendor changing in the middle of the project? We will evaluate examples of failures and successes and the lessons we have learned in the process. We invite you to a broader discussion.
Keywords: #datascience #ai #machinelearning #agile #projectmanagement
12.00 - 12.15 Q&A Session
Pearson
Pearson
12.00 - 12.05
BREAK
ROUNDTABLE SESSIONS PART II
12.05 - 12.55
You can choose among such roundtable subjects:
1. Building a world-class Big Data team during the COVID-19 pandemic - recruiting, training, collaborating.
Last year forced us to change a lot in how we work. A lot of us had to switch to working/studying from home, some needed to freeze hiring, others - to redefine onboarding. As hard as last year was, it was also a time of innovation. Join the session to exchange lessons learned and ideas for building a world-class Big Data team leveraging “the new normal”. Everybody is welcome - the more diverse experiences the better.
Zendesk
2. Big Data on Kubernetes
GetInData
3. Best tools for alerting and monitoring of the data platforms
Have you ever been woken up in the middle of the night by a screaming PagerDuty alert on your mobile, 99+ notifications on the {YOUR_PIPELINE_NAME}_alerts Slack channel and tens of graphs in Grafana looking like undreamt-of van Gogh art? If yes, welcome, me too. For an engineer working on a Data Platform it is easy to create a new pipeline, a new dataset or add any new integration, especially now in the cloud era. But it is not easy to have a proper monitoring and alerting system ensuring that any potential issues/incidents can be solved as quickly as possible, so the offering of our Data Platform is always top quality. In this session we will discuss tools for building a monitoring and alerting system that is efficient, easy to understand, supervises exactly what we want, notifies the ones we want, is not too noisy and scales well with ever-growing data.
Bolt
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing is nowadays more and more popular and important. There are a lot of use cases: data capturing, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning like customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let's talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. Using the public cloud effectively and cost-efficiently
According to Unisys's Cloud Barometer study, only a third of organizations have seen great improvements to their organizational effectiveness as a result of Cloud adoption. What are good practices to be part of those organizations? Let's discuss how to use the public Cloud effectively and cost-efficiently.
TrueBlue
6. Building AI/ML systems: from algorithms to production
We're facing very different challenges when writing a scientific paper and when building a production ML system. Things get even more complex when a single project involves both research and application. It's generally understood yet often overlooked: let's get talking! How do you scope an ML project? How do you get the data yet avoid biases and those multi-million-euro GDPR penalties? What models work in real-world scenarios? How do you handle model deployment? And who do you need on your team to succeed?
7. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
MI2.AI
9. Data Strategy. The Game.
The format of this roundtable discussion is a game in which you, as Chief Data Officer, have a mission to implement strategic initiatives for a $2.7B electronics manufacturer (please watch the short video at the Tech Zone for more details). You will have a chance to learn how to maximize business value from data, how to design and execute a Data Strategy, which strategy approach is best, and how your decisions influence others within the organization.
SoftServe
10. Distributed Big Data processing in the cloud – is Hadoop still an option?
Joint Cloudera & 3Soft roundtable to discuss practical experience and highlights of providing self-service access to integrated, secured, multi-function analytics based on Hadoop, cloud-native offerings or custom-tailored solutions. Let us share our knowledge on how to enjoy consistent data security, governance, lineage, and control, while deploying the powerful, easy-to-use solutions business users require and eliminating their need for shadow IT solutions.
3Soft
Cloudera
11. Snowflake Data Cloud – possibilities and limitations. How can I judge whether this is a value proposition for me and my organization?
Snowflake
Snowflake
12.55 - 13.00
TECHNICAL BREAK
PLENARY SESSION
13.00 - 13.30
Lessons from building large-scale, real-time machine learning systems
Unity Ads helps publishers and advertisers reach their business goals, and machine learning is at the core of our product. In this presentation, I will first give an overview of the machine learning systems we built for real-time ads bidding, which process tens of thousands of ad auction requests per second. Then, I will share several generalizable lessons we learned in making our systems performant from machine learning perspective and scalable from engineering perspective.
13.30 - 13.45 Q&A Session
Unity
13.30 - 13.45
GetInData | Part of Xebia
Evention
*Times may vary.
Flink SQL in 2021: Time to show off!
Four years ago, the Apache Flink community started adding SQL support to ease and unify the processing of static and streaming data. Today, Flink runs business critical batch and streaming SQL queries at Alibaba, Huawei, Lyft, Uber, Yelp, and many others. Although the community made significant progress in the past years, there are still many things on the roadmap and the development is still speeding up. This session will focus on a comprehensive demo of what is possible with Flink SQL in 2021.
Based on a realistic use case scenario, we'll show how to define tables which are backed by various storage systems and how to solve common tasks with streaming SQL queries. We will demonstrate Flink's Hive integration and show how to define and use user-defined functions. We'll close the session with an outlook of upcoming features.
Keywords: #flink #flinksql #streamprocessing #unifieddataprocessing #apache
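For orientation, a minimal PyFlink sketch in the spirit of the session: declaring a Kafka-backed table and running a continuous windowed aggregation with Flink SQL. The connector options, topic and field names are illustrative assumptions and depend on the Flink version and connectors installed.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Streaming table environment (Flink 1.12+ style API).
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = TableEnvironment.create(settings)

    # Hypothetical Kafka-backed source table; requires the Kafka SQL connector on the classpath.
    t_env.execute_sql("""
        CREATE TABLE clicks (
            user_id STRING,
            url     STRING,
            ts      TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'clicks',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json',
            'scan.startup.mode' = 'earliest-offset'
        )
    """)

    # Continuous query: clicks per user per 1-minute tumbling window.
    result = t_env.execute_sql("""
        SELECT user_id,
               TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
               COUNT(*) AS clicks
        FROM clicks
        GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
    """)
    result.print()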
Ververica
GetInData
GetInData
Artificial Intelligence - Building in-house AI capabilities from scratch at Philip Morris International
We will start by sharing how our team is structured and what it is that we deliver, and we will continue by sharing more about our journey and the challenges we faced within a big corporation until we reached a good level of maturity inside the organization. An exposition of tangible use-cases will follow, and we will take the other half of the session to talk about technical details, such as the technology stack, CI/CD pipelines, MLOps, and others that help us accelerate delivery.
Keywords: #AI #DL #AIBusiness #Innovation #Productivity #Disruption
Philip Morris International
PMI
Simplifying Stateful Serverless Architectures
Platforms like KNative and FaaS have solved most of the challenges of dealing with stateless applications. Still, when it comes to managing state, developers quickly end up designing and maintaining a complicated architecture without achieving consistency guarantees in the presence of failure. Stateful Functions (StateFun) - developed under the umbrella of Apache Flink - provides consistent messaging and durable state without compromising the serverless experience. Like a database, it exposes its capabilities to application developers in a platform and language agnostic manner: StateFun does not mind if you deploy your application as a set of Python functions on your preferred FaaS platform, a single Spring Boot application on Kubernetes or a mixture of both.
In this demo-centric session you will learn about the core ideas behind the project and you will see how to write, deploy and monitor a simple Stateful Functions application.
Keywords: #ApacheFlink #Serverless #Kubernetes #Event-Driven #Scale
Ververica
California State University
Data Warehouse Development Lead.
Independent speaker
Creating Confidence in Data at Klarna - A Case Study in Automatic Data Validation
We can all agree that when making big decisions, you want to make them with confidence. In the world of art, this means buying your painting from a well known auction house instead of from the back of an old car. In the world of data, this means validating your data before using it.
During this session, we will show you how to quickly create confidence in data using automatic data validation. We will describe the validation process that we are using at Klarna, and show how this enables rapid improvements of big data transformations.
Keywords: #data-transformation #transformation-improvement #data-confidence #automatic-data-validation #data-validation-tool
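As a toy illustration of the kind of pre-use checks discussed above, here is a small pandas-based validation sketch; Klarna's actual validation tooling is not described in the abstract, so the names and thresholds are hypothetical.

    import pandas as pd

    def validate(df):
        """Return a list of human-readable validation failures for a transformed dataset."""
        failures = []
        if df.empty:
            failures.append("dataset is empty")
        if df["order_id"].duplicated().any():
            failures.append("duplicate order_id values found")
        if df["amount"].lt(0).any():
            failures.append("negative amounts found")
        null_share = df["customer_id"].isna().mean()
        if null_share > 0.01:
            failures.append(f"customer_id null share too high: {null_share:.1%}")
        return failures

    # Hypothetical transformed dataset to validate before it is used downstream.
    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 5.5, 7.25],
        "customer_id": ["a", "b", None],
    })
    print("Validation failures:", validate(df) or "none")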
Klarna Bank AB
Klarna Bank AB
SGPR.TECH
SGPR.TECH
Organising the chaos - metadata in action
Several key issues arise when building data-driven products and services: primarily searching for data sets, understanding the possibilities and limitations of individual sets, gaining access to data, and using data in a controlled and transparent way.
Keywords: #metadata
Ab Initio Software
ING
StepStone Services
CEPSA
Datumize
Data Strategy. The Game!
This is an extension of, and further context for, Taras Bachynskyy's roundtable "Data Strategy. The Game!" – a very inspiring approach to unlocking the hidden value of data. How do you maximize business value? Which strategy approach is best? How do your decisions influence others within the organization in the context of data?
SoftServe
Sotrender
Sotrender
Sotrender
Freelancer
Datumo
Keynote Speakers BDTWS 2021
Speakers
Alex Belotserkovskiy
Microsoft Russia, Microsoft
Johannes Bracher
Heidelberg Institute for Theoretical Studies (HITS), Karlsruhe Institute of Technology (KIT)
Subash D’Souza
California State University
Josef Habdank
DXC Luxoft
Eftim Zdravevski
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Selection Committee BDTWS 2021
Fabian Hueske
Apache Flink
All three workshops will take place on the 23rd and 24th of February
- 2x4 hours on each of these days (8 hours in total per workshop)
BIG DATA ON KUBERNETES
DETAILS:
How to use Kubernetes in AWS and run different Big Data tools on top of it? We simulate a real-world architecture – a real-time data processing pipeline: reading data from web applications, processing it and storing the results in distributed storage. The technologies that we will be using include Kafka, Spark 3.0 and S3. All exercises will be done on remote Kubernetes clusters.
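To give a flavour of the kind of pipeline built in the workshop, here is a minimal PySpark Structured Streaming sketch that reads from Kafka and writes Parquet to S3; the broker, topic and bucket names are placeholders and the workshop's actual code may differ.

    from pyspark.sql import SparkSession

    # Spark 3.x session; on Kubernetes this would be submitted via spark-submit with the k8s master.
    spark = SparkSession.builder.appName("events-to-s3").getOrCreate()

    # Read a stream of events from Kafka (placeholder broker and topic).
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "web-events")
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload", "timestamp")
    )

    # Continuously write the records to S3 as Parquet (requires the hadoop-aws/s3a jars).
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3a://my-bucket/web-events/")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/web-events/")
        .start()
    )
    query.awaitTermination()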
SESSION LEADER:
GetInData
REAL - TIME STREAM PROCESSING
DETAILS:
How to process unbounded streams of data in real-time using popular open-source frameworks? We focus mostly on Apache Flink and Apache Kafka. We simulate a real-world end-to-end scenario – processing logs generated by users interacting with a mobile application in real-time. The technologies that we use include Kafka, Flink, HDFS and YARN. All exercises will be done on remote multi-node clusters.
SESSION LEADERS:
GetInData
GetInData
FOUNDATIONS OF DATA ENGINEERING WITH GOOGLE CLOUD
DETAILS:
While getting familiar with services like Google Cloud Storage, BigQuery or Dataflow, we will walk through the common data flow patterns adopted by companies migrating to the cloud. The workshop will contain a series of exercises that will help you get hands-on experience with Google Cloud Platform, as well as an opportunity to discuss best practices, security, scalability and cost management aspects.
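For orientation, a minimal sketch of querying BigQuery from Python with the official client library; the project, dataset and query are placeholders rather than the workshop exercises.

    from google.cloud import bigquery

    # Uses application-default credentials; the project id is a placeholder.
    client = bigquery.Client(project="my-gcp-project")

    query = """
        SELECT DATE(created_at) AS day, COUNT(*) AS orders
        FROM `my-gcp-project.shop.orders`
        GROUP BY day
        ORDER BY day DESC
        LIMIT 7
    """

    # Run the query and iterate over the result rows.
    for row in client.query(query).result():
        print(row.day, row.orders)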
SESSION LEADERS:
GetInData
ORGANIZERS
Evention
Evention is a company that specializes in increasing the value of ICT business meetings. We strongly believe that business events are an integral and irreplaceable factor in the creation and maintenance of relations and the improvement of communication between companies and the people who create them. We are constantly searching for innovative business meeting formulas to address the current needs, expectations and aspirations of ICT managers.
GetInData | Part of Xebia
GetInData | Part of Xebia is a leading Polish expert company delivering cutting-edge Big Data, Cloud, Analytics, and ML/AI solutions. The company was founded in 2014 by data engineers and today brings together 120 big data specialists. We work with international clients from many industries, e.g. media, e-commerce, retail, fintech, banking, and telco. Our clients are both fast-growing scaleups and large corporations that are leaders in their industries. We maintain a laser focus on data technologies, cultivate a very strong engineering culture and support extensive knowledge sharing both within the company and outside through meetups, conferences and contributions to open source. We are a go-to partner for companies that need tailored and highly scalable data processing and analytics platforms that give a competitive advantage and unlock the full business potential of their data.
GENERAL PARTNERS
Cloudera
At Cloudera, we believe data can make what is impossible today possible tomorrow. We empower people to transform data anywhere into trusted enterprise AI so they can reduce costs and risks, increase productivity, and accelerate business performance. Our open data lakehouse enables secure data management and portable cloud-native data analytics, helping organizations manage and analyze data of all types on any cloud, public or private. With as much data under management as the hyperscalers, we’re the preferred data partner for the top companies in almost every industry. Cloudera has guided the world on the value and future of data and continues to lead a vibrant ecosystem powered by the relentless innovation of the open-source community. Learn more at cloudera.com
3Soft
3Soft supports companies in maximizing their business potential.
We create dedicated solutions that facilitate data management and automate internal and external processes. We enrich the prepared systems with the benefits of artificial intelligence. This enables faster analysis of information, discovery of non-obvious relationships and drawing accurate conclusions.
We are trusted by companies in the area of small and medium-sized enterprises, the largest banks in Poland and world leaders in the retail, fuel and energy, manufacturing and automotive industries.
STRATEGIC PARTNERS
Snowflake
Snowflake enables every organization to mobilize their data with Snowflake’s Data Cloud. Customers use the Data Cloud to unite siloed data, discover and securely share data, power data applications, and execute diverse AI/ML and analytic workloads. Wherever data or users live, Snowflake delivers a single data experience that spans multiple clouds and geographies. Thousands of customers across many industries, including 639 of the 2023 Forbes Global 2000 (G2K) as of July 31, 2023, use Snowflake Data Cloud to power their businesses. Learn more at snowflake.com.
Vertica
Vertica is the unified analytics platform, powering data-driven businesses with predictive insights based on advanced AI and machine learning, at blazing speed, and at petabyte scale. Available in a fully managed SaaS option, or as a customer-managed platform, Vertica offers the widest range of deployment configurations in the data analytics industry. With Vertica, data analytics teams can combine data siloes that are growing exponentially, without moving the data for analytics. They can manage analytic workloads in the public clouds, on-premises, on Hadoop, or any hybrid combination. And with separation of compute and storage, Vertica allows teams to spin up storage and compute resources as you need them, then spin down afterwards to reduce costs. Learn more about us at Vertica.com, and follow us on Twitter @VerticaUnified.
CONTENT PARTNERS
RTB House
RTB House is a global company that provides state-of-the-art retargeting technology for top brands worldwide. Its proprietary ad buying engine is the first and only in the world to be powered entirely by deep learning algorithms, enabling advertisers to generate outstanding results and reach their short, mid and long-term goals. Founded in 2012, RTB House serves over a thousand campaigns across EMEA, APAC and the Americas regions with main locations in New York, London, Tokyo, Singapore, São Paulo, Moscow, Istanbul, Dubai and Warsaw.
Allegro
At Allegro we make apps that, thanks to their scalability and reliability, have gained fans all over Central and Eastern Europe. It was not an easy task. Every day we face challenges in the architecture and design area, as well as in the process of choosing the right technology, ensuring code quality, and, in later phases, implementing and maintaining a product. Allegro Tech is our idea for sharing our experience by organizing conferences, workshops, meetups and hackathons.
You can find more information on our website: www.allegro.tech
Follow us at https://www.meetup.com/allegrotech and join our meetups: https://www.facebook.com/allegro.tech
Cisco
Cisco (NASDAQ: CSCO) is the worldwide technology leader that has been making the Internet work since 1984. Our people, products, and partners help society securely connect and seize tomorrow's digital opportunity today. www.cisco.com
Clarite
Clarite Polska SA is a company with an established position in the advanced IT solution supplier market, listed by GSC as one of the companies operating critical infrastructure in Poland. We are specialized in three areas: Advanced Analytics/Big Data, System Integration, and ECM/BPM. Within the Big Data area, we inspire and advise our customers on solution architecture, verify analytic needs and design end-to-end solutions. Drawing upon our experience along with supporting technologies such as Data Governance tools and Workload Automation, we provide analytical departments with more thorough and up-to-date information on the data they possess, contributing to more efficient Big Data implementations.
The offering is completed by Robotic Process Automation supported by a new generation of intelligent chatbots.
ING Tech Poland
ING Tech Poland is an IT company located in Katowice and Warsaw (Poland), which provides IT and operational services to all ING units worldwide. In terms of IT, we deliver IT security, hosting, remote management and application services. Our operational services are delivered by three units: RiskHub, CardsHub and Know Your Customer (KYC). The first one was established as part of ING Tech Poland and is our Modelling Expertise Centre, shaping the future of risk modelling and data analysis in Poland. Our ambition is to shape the future of risk modelling and data analysis in Poland and to build the position of employer of choice in the Warsaw labour market.
Luxoft
Luxoft, a DXC Technology Company, is a digital transformation and software engineering company that provides customized IT solutions that drive business change for customers in every corner of the globe. Currently, Luxoft employs over 12,700 people in 40 locations around the world. We combine high-quality services and in-depth industry knowledge, specializing in the automotive, financial services, media and telecommunications sectors, as well as many others. We hire experienced specialists, engaging them in long-term projects without slowing down our pace of output, so we can confidently say that we are a stable employer even in difficult times.
Luxoft Poland is well known for consistently high levels of supply, adroitness in complex project management, the talent of its highly qualified experts in the field of digital engineering, exceptional customer orientation, as well as its agility, creativity, and remarkable problem solving capabilities.
Microsoft
Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. Its mission is to empower every person and every organization on the planet to achieve more. Whether you're just starting your developer career or are an experienced professional, our hands-on approach helps you arrive at your goals faster, with more confidence and at your own pace. Visit the home for Microsoft documentation and learning for developers and technology professionals.
#InventwithPurpose
SAS
SAS is the leader in analytics and AI. Through innovative software and services, SAS empowers and inspires customers around the world to transform data into intelligence to make smart decisions and drive relevant change. With SAS, data scientists have a rich set of tools, including statistical methods, optimization, data visualizations, machine learning and deep learning. You can serve predictions and embed AI models in applications through REST APIs and programming languages such as Python, R, Java, Lua and Scala. The SAS platform runs in a variety of environments, whether you are deploying in the cloud or on premises. SAS also provides natural language processing and speech-to-text tools. These methods are accessible through APIs and enable users to build intelligent applications such as chatbots. The SAS platform allows you to operationalize data products at any scale in real time. SAS helps customers at more than 83,000 sites in 147 countries. Incorporated in 1976, the company employs nearly 14,000 people. www.sas.com
SoftServe
SoftServe is more than just technology and ideas – we are the place where talented and ambitious people can develop their passion! We operate at the cutting edge of technology, exploring, transforming, accelerating, and optimizing the way large enterprises and software companies do business. We use the latest technologies and think outside the box. We are a team of 1,000+ people experienced in such areas as Big Data, Data Science, AI, VR, Machine Learning, as well as in core software development, Experience Design, DevOps, and many others. We have offices in 7 locations in Poland – Wroclaw, Gliwice, Bialystok, Warsaw, Gdansk, Lodz and Cracow. And we are still growing!
Splunk
Splunk is the data platform leader for security and observability. Our extensible data platform powers enterprise observability, unified security and limitless custom applications. Splunk helps tens of thousands of organizations turn data into doing so they can unlock innovation, enhance security and drive resilience. More info at www.splunk.com
SUPPORTING PARTNER
Ab Initio
Ab Initio Software is a top supplier of enterprise data and metadata management software. Numerous leading global corporations use Ab Initio software to develop enterprise mission-critical applications with unmatched performance and full scalability. These applications encompass the full spectrum of data processing tasks: from real-time operational systems to batch analytical systems, and from data warehouses to transactional systems.
Core to Ab Initio is a simple idea that everything should be graphical. Applications should be graphical. Rules should be graphical. Orchestration, metadata, data management, and so on – no matter how big or complex, all should be graphical.
With this in mind, we believe that our experience supporting massive scale and data volumes, combined with our proven Hadoop capabilities, forms a solid basis for meeting your needs today and in the future.
PATRONAGE
Big Data Passion
BigDataPassion’s mission is to enable everyone interested to learn more about Big Data, Cloud Computing, Artificial Intelligence, Natural Language Processing, Configuration Management, Application Deployment, as well as other tech news. Technology is our passion and we like to share our knowledge, show new trends and well-known solutions to problems, with a list of the tools needed. We write because we like it – check it out!
Data Science Warsaw
Data Science Warsaw is a community of data scientists based in Warsaw. We are a non-profit professional organisation dedicated to the free, open dissemination of data science. We meet to discuss the tools, methods and technologies used to ingest, transform, explore, analyze and visualize data, obtain predictive and prescriptive insight, develop data products, and exploit business opportunities from data products. The organizers of the Data Science Warsaw meetings are ICM and the Foundation DataSci.
Inhire
Inhire is a platform that helps the best IT specialists get a new job more effectively. It was created to automate recruitment in the tech industry, making it faster, more effective and spam-free for candidates and companies. Inhire is different on many levels:
– Anonymous: you can browse matched job offers completely anonymously; only when you are interested in a job offer do you reveal your data to companies, with a 1-click application.
– Automated: candidates specify their skills and expectations and our system presents them with currently open positions that meet their requirements. This way candidates are always up to date with the available offers out there.
– No spam: we respect candidates’ time. We only notify them about perfect job matches waiting for them just one click away.
Over 400 purely tech companies from Poland and abroad are already looking for IT specialists at Inhire.
Interdisciplinary Center for Mathematical and Computational Modelling (ICM), University of Warsaw
The Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, has become one of the top High Performance Computing (HPC) centres in Poland and supports approximately 1,000 users from Poland in the domains of Big Data, HPC, cloud services and storage. The popular meteo.pl weather portal has been using weather forecasts computed at ICM. ICM researchers study problems related to civil aviation (collaborating with ICAO) and the modelling of social processes, and have most recently been working on the ICM Epidemiological Model for the COVID-19 epidemic in Poland. ICM took part in securing access for Polish scientists to the entire body of scientific literature, including over 8,000 journal titles, by maintaining the Virtual Library of Science. The ICM networking team has participated in a number of cutting-edge networking solutions, for both high-throughput and low-latency requirements. Check ICM’s projects here: https://expodubai.icm.edu.pl/
SysOps/DevOps Polska
The SysOps/DevOps Polska Foundation (SO/DO for short) is the largest community of system and network administrators, DevOps engineers and other IT specialists in Poland. For over 9 years we have brought together experts who help each other solve problems at work and turn that into offline relationships. Today it is a community of more than 30,000 professionals who share their experience and insights.
SO/DO's mission is education and community building. In addition to Facebook groups, regular in-person meetups, hundreds of talks on YouTube and training sessions, we also run other non-standard projects.
The London Java Community (LJC)
The London Java Community (LJC) is a group of Java enthusiasts who are interested in benefiting from shared knowledge in the industry. Through our forum and regular meetings you can keep in touch with the latest industry developments, learn new Java (and other JVM) technologies, meet other developers, discuss technical and non-technical issues, and network further throughout the Java community.
The National Information Processing Institute
The National Information Processing Institute is an interdisciplinary scientific institute and a leader in software development for Polish science and higher education. We gather knowledge on almost every Polish scientist, their projects, and their research apparatus. Gathering, analysing, and compiling information on the research and development sector allows us to influence the direction of Polish scientific policy. We develop intelligent information systems both for the public sector and for commercial use. The key areas of research at the institute include: machine learning algorithms, natural language processing algorithms, sentiment analysis, neural networks, discovering knowledge from text data, human-computer trust, computer-assisted decision-making systems, and artificial intelligence. Our research is driven by interdisciplinarity and is conducted in seven laboratories, which employ specialists in various fields. Our team of information technology experts is supported by economists, sociologists, lawyers, statisticians, and psychologists. This fusing of different scientific approaches is conducive to in-depth analysis of research issues and is a driving force for innovation.
Warsaw Data Tech Talks
Warsaw Data Tech Talks (formerly the Warsaw Hadoop User Group, WHUG) was one of the first European Hadoop supporters. Since the group began operating in April 2012, we have successfully organized 36 meetings and convinced 1,974 people to join the group. We have had the pleasure of hosting companies such as Spotify, Criteo, GetinData, dataArtisans, GridGain, TouK and many others.
Mark Lyons
Chief Product Officer, Vertica
The future is unified
Many enterprises are re-thinking their data analytics strategy. Some plan to stay on-prem, others are all in for full-cloud, and still others require a hybrid approach. They are adopting object stores independently while unifying their data analytics platforms and supporting the broadest deployment models. We talked to Mark Lyons, Chief Product Officer at Vertica and one of the keynote speakers at Big Data Tech Warsaw 2021, about a new era of data analytics, the hybrid and cloud-agnostic approach, unified solutions and the future of big data analytics.
IT is more and more hybrid these days. What does it mean from the big data analytics perspective? What are the main challenges of adding cloud to the mix?
Mark Lyons [ML]: One of the big challenges of moving to a more hybrid data architecture is when the way data is analyzed in the cloud doesn’t match the way your company is accustomed to analyzing data on premises. When the analytics consumers in an organization have to change the way they do things depending on where the data is stored, it causes problems for everyone. When you have weird restrictions like “egress fees” that make it cost the company money to move data from one location to another, this can also be a big barrier to making analytics ubiquitous and decisions data-driven.
Does the multi-cloud approach present additional challenges?
ML: The key to making a multi-cloud approach work, other than having an analytics platform that works on multiple clouds, is having a single pane of glass to manage it all. If you can manage analytics clusters from one interface, it vastly simplifies things. Where you’re doing the analysis, whether it’s on AWS, Google or Alibaba, doesn’t matter nearly as much if you can spin clusters up and down, troubleshoot, optimize, and otherwise track things regardless of location. That simplifies hybrid as well, if your single management interface works without regard to deployment platform.
And what are the other main challenges the customers or users face?
ML: Change is the biggest challenge any company with any longevity faces. If you’re a tiny startup, you might grow exponentially and have to grow your data architecture to match. If you’re all in the cloud, some new regulation may require you to move your analytics on-premises. That report you’ve been generating weekly, well, now the C-suite wants to see it updated hourly, oh, and can you have it project a prediction forward three weeks? The only constant in data management is that nothing stays constant.
How has Vertica evolved to help businesses tackle these challenges?
ML: Vertica is a single, software-only code base, a single RPM if you will. This means that it works exactly the same on-premises as on any cloud – AWS, Google, Azure or Alibaba. It has a single Management Console for all of an organization’s databases, regardless of location. It even allows you to hibernate a database – shut all compute down so it just stores data – on-premises, and revive it in the cloud, hibernate it in one cloud, revive it in another. The database works, regardless of deployment environment.
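The hibernate/revive idea rests on separating ephemeral compute from durable storage. As a purely conceptual sketch (this is not Vertica's actual API or tooling; every class and method name below is hypothetical), the mechanics could be pictured like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectStoreCatalog:
    """Durable state: data files and catalog live in S3/GCS/on-prem object storage."""
    location: str                      # e.g. "s3://analytics-bucket/db" (illustrative)

@dataclass
class ComputeCluster:
    """Ephemeral state: stateless nodes that can run on any platform."""
    platform: str                      # "on-prem", "aws", "azure", ...
    nodes: int

class Database:
    def __init__(self, catalog: ObjectStoreCatalog):
        self.catalog = catalog
        self.compute: Optional[ComputeCluster] = None

    def hibernate(self) -> None:
        # Shut all compute down; only the object store keeps the data.
        self.compute = None

    def revive(self, platform: str, nodes: int) -> None:
        # Point a fresh compute cluster at the same durable catalog.
        self.compute = ComputeCluster(platform=platform, nodes=nodes)

db = Database(ObjectStoreCatalog("s3://analytics-bucket/db"))
db.revive(platform="on-prem", nodes=3)   # run analytics on-premises
db.hibernate()                           # pay only for storage
db.revive(platform="aws", nodes=6)       # wake the same data up in the cloud

The design point the sketch tries to capture is that only the object-store location is long-lived; the compute cluster is disposable, which is what makes moving between on-premises and any cloud possible.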
What is next? How do you see the future of big data analytics? How will Vertica evolve?
ML: The concept of a separate data lake and data warehouse, a separate analytics platform on-premises and on the cloud, a separate platform for business intelligence and data science, these are all becoming obsolete. A single analytics solution that works for whatever analytics your organization needs to do, reaches whatever data you need to analyze, and deploys wherever you need to work, that’s what we’re becoming.
What technologies do you see as key drivers of change in this area? AI/ML? Automation? RPA? HPC? Exascale?
ML: Machine learning and advanced analytics like time series analysis and geospatial analysis are the future. Vertica is already leading the market in these capabilities, but we intend to expand further in that area. Expect to see more automation to make Vertica simpler to manage and deploy, and even more added to our already market-leading analytics, no matter what deployment model you want to use.
Krzysztof Gawroński
AppDynamics, Cisco Architect for the EMEA region, Cisco
The smart way to intelligently monitor applications
Business digitalization transforms application performance monitoring. In an interconnected, increasingly complex, ever-growing, and globalizing environment, the traditional approach is no longer enough. To keep up with the rate of change, organizations desperately need highly automated, AI-driven open platforms for IT operations. AppDynamics is an intelligent, flexible, and massively scalable SaaS platform that provides Big Data infrastructure components to handle large numbers of events, metrics, and metadata - says Krzysztof Gawroński, AppDynamics, Cisco Architect for the EMEA region.
What are today’s key challenges for application monitoring?
Krzysztof Gawroński [KG]: Let me start with complexity. Applications are deployed in hybrid, multi-cloud, or data-center environments. Today an application has on average about 15 internal or third-party APIs, so complexity keeps growing. Then we have a growing number of cloud and SaaS initiatives, which creates more and more dependence on the internet. Many organizations adopt a SaaS-first approach, so applications are migrated from data centers to public clouds. Today’s business environment is hybrid, composed of external cloud services and on-prem software. We also live in a global economy. The number of end users is constantly growing, and another aspect is the Covid-19 crisis. Many people started to work from home, which increases the challenge of monitoring applications, because users access the company’s systems from home and mobile devices. Because of that, we must monitor not just internal systems and networking, but also end-user devices and internet performance. There is much more data that needs to be collected and understood.
How can AIOps help?
KG: AIOps is essentially big data plus machine learning and artificial intelligence functionality that helps IT organizations process data efficiently. The biggest problem is analyzing vast amounts of data, and AI reduces the human resources required to analyze data and find issues. It dramatically accelerates root cause analysis and makes it more accurate. It can also help identify problems before they have a significant impact on business or end users. Finally, AI helps with the consolidation and aggregation of alerts and monitoring tools.
What are the building blocks for AIOps?
KG: We need to collect this data, and it comes from many technologies. So we need a scalable metrics ingestion module and, of course, a data platform to store all this information. We also need a query platform to be able to access the data that we collect, and a user interface platform for our operators to find all the interesting information related to the data. We need machine learning or artificial intelligence that will analyze the data and inform us about anomalies, or maybe discover patterns that are interesting from a business perspective. What is also important is an action platform, or APIs via which we can integrate AppDynamics with external systems.
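Taken together, these building blocks form a simple pipeline: ingest metrics, store them, query them, run an anomaly check, and trigger an action. A minimal sketch of that pipeline is shown below; metric names, thresholds and the alert hook are illustrative assumptions, not AppDynamics' actual APIs.

import time
from collections import defaultdict
from statistics import mean, stdev

class MetricStore:
    """Data platform: keeps (timestamp, value) points per metric name."""
    def __init__(self):
        self._points = defaultdict(list)

    def ingest(self, name, value, ts=None):
        self._points[name].append((ts or time.time(), value))

    def query(self, name):
        return [v for _, v in self._points[name]]

def is_anomalous(history, current, sigmas=3.0):
    """Stand-in for the ML component: flag points far from the historical mean."""
    if len(history) < 10:
        return False            # not enough history to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(current - mu) > sigmas * sd

def act(metric, value):
    """Action platform: here just print; in practice call an external API or webhook."""
    print(f"ALERT: {metric} looks anomalous at {value}")

store = MetricStore()
for v in [102, 98, 101, 99, 100, 103, 97, 100, 101, 99]:
    store.ingest("checkout.response_time_ms", v)   # hypothetical metric name

latest = 250
if is_anomalous(store.query("checkout.response_time_ms"), latest):
    act("checkout.response_time_ms", latest)
store.ingest("checkout.response_time_ms", latest)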
AppDynamics is an AIOps SaaS cloud platform. What does its architecture look like?
KG: If you evaluate or select AppDynamics, it can be deployed on premises or in a public or private cloud, but we encourage customers to go with the AppDynamics SaaS deployment model. It is a more flexible solution. Processing big data efficiently requires a scalable cloud foundation. We are in the AWS cloud. Agents send metrics via F5 firewalls. We also use, just to name a few, Kafka, Kubernetes, and Elasticsearch storage. All these technologies support containers or clusters and provide us with flexibility and scalability. Our platform is also open, so it is easy to integrate with other solutions or products.
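On the ingestion side, one of the technologies named above, Kafka, is commonly fed by lightweight producers. As an assumed example only (the topic name, payload fields and broker address are illustrative, and this is not AppDynamics' agent protocol), publishing a metric point from Python could look like this:

# Minimal sketch of publishing one metric point to a Kafka topic, as one
# building block of a scalable ingestion layer.
# Requires: pip install kafka-python

import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

metric_point = {
    "name": "checkout.response_time_ms",             # hypothetical metric name
    "value": 118.4,
    "timestamp": time.time(),
    "tags": {"app": "web-shop", "tier": "backend"},  # illustrative payload shape
}

producer.send("metrics", metric_point)               # "metrics" topic is an assumption
producer.flush()                                     # make sure the point is delivered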
How is AI or machine learning used in AppDynamics?
KG: We use machine learning extensively. A good example is automatic baselining for all collected APM metrics and advanced business analytics. For every APM, custom or BiQ-related metric, we want to know what the expected value should be. Our machine learning engine calculates the expected level for every metric hour by hour, based of course on historical data. There are different flavors of baselines. It can be done on a daily basis, which tells us what is normal for a given metric for a given hour of the day. It can also be a weekly baseline, because what is normal at nine o'clock on Monday can be different from nine o'clock on Saturday or Sunday.
Once the baseline is established, we can calculate the deviation from the norm. This is fully dynamic. There is no static threshold here, and it is automatically calculated for thousands or even millions of metrics, out of the box. There is no need to set it up. If anything deviates from the norm by more than a given value, an alert can go off.
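As a rough illustration of that idea (not AppDynamics' actual engine; the weekly bucketing and the three-sigma rule below are assumptions), a per-hour-of-week baseline and a dynamic deviation check could be sketched like this:

from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

def weekly_key(ts: datetime):
    # "Nine o'clock on Monday" can differ from nine o'clock on Saturday or Sunday.
    return (ts.weekday(), ts.hour)

def build_baseline(history):
    """history: iterable of (datetime, value) pairs. Returns per-bucket (mean, std)."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[weekly_key(ts)].append(value)
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0) for k, v in buckets.items()}

def deviation(baseline, ts, value):
    """How many standard deviations the value sits from the expected level."""
    mu, sd = baseline.get(weekly_key(ts), (value, 0.0))
    return 0.0 if sd == 0 else abs(value - mu) / sd

# Usage: fire an alert when the metric drifts more than 3 sigmas from what is
# normal for this hour of this weekday (the threshold is an illustrative choice):
# if deviation(baseline, datetime.now(), current_value) > 3.0: fire_alert()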
What happens when an anomaly is detected?
KG: We collect so-called snapshots, which are yet another type of data, next to metrics. Snapshots are a representation of the state of the systems or applications at the time an anomaly was detected. We only store snapshots of transactions that happen during an anomaly.
Traditionally the problem was that even if anomalies were automatically detected and snapshots were stored, IT operations had to analyze all these snapshots. It was not easy to conclude what the real issue was.
AI comes into the picture again. Now all these snapshots are analyzed. Automated transaction diagnostics identify precisely where the bottleneck is, which dramatically speeds up finding the root cause. The efficiency of drilling down into performance data is also greatly improved thanks to advanced dynamic visualizations.
What type of visualizations do you use?
KG: We use many different visualizations: application flow maps showing all application components, user experience journey maps, business journey widgets, or funnels. You can choose a flow map of all the components of your application, a flow map of one specific business transaction, or a flow map within a snapshot. We can also drill down from these views. Maps are built fully dynamically and automatically.