Edition 2021 - Big Data Technology Warsaw Summit
February 23, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 1
9:00 – 13:00
3 WORKSHOPS - DAY I
19.00 - 21.00
EVENING MEETING
(speaker’s presentation + discussion)
Pandemic, data and analytics – how might we know what happens next with Covid-19?
A special evening meeting prior to the Big Data Technology Warsaw Summit.
There are smart people and great research teams working on forecasting models for pandemic developments. What data do they use, which models do they apply, how can the problem be approached, how accurate are the forecasts, and what are the major challenges? How does the big data community contribute to fighting Covid-19? These are the questions we would like to address during this unique online meeting. We have invited very special guests (including experts from MOCOS and ICM UW) – everyone is encouraged to join the discussion and ask questions!
♦ What makes the field of pandemic modelling and simulation so interesting and challenging? How to predict risks using available data and proper modelling?
♦ How is it done – a large-scale geographical microsimulation model for pandemics?
♦ What can we do to improve pandemic forecasting and to predict the efficiency of various countermeasures to slow it down? Is AI/ML enough?
In the meeting agenda:
18.45 – 19.00
Networking online
19.00 – 19.05
Opening remarks
What makes the field of pandemic modelling and simulation so interesting and challenging.
Evention
19.05 – 19.25
How it is done – large scale geographical microsimulation model for pandemics.
ICM University of Warsaw
19.25 – 19.30
Short Q&A
19.30 – 19.45
What can we do for better pandemic forecasting and for predicting the efficiency of various countermeasures to slow it down? Is AI/ML any good for it?
MOCOS
19.45 – 20.00
Computational side of the algorithm used by MOCOS Group
MOCOS Group
20.00 – 20.10
Q&A
20.10 – 20.30
Collaborative forecasting of COVID-19: Assembling, comparing and combining short-term predictions.
Heidelberg Institute for Theoretical Studies (HITS), Karlsruhe Institute of Technology (KIT)
20.30 - 21.00
Open discussion for everybody
February 24, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT WORKSHOPS DAY 2
9:00 – 13:00
3 WORKSHOPS - DAY II
19.00 - 20.00
EVENING MEETING
(speaker’s presentation + discussion)
The meeting is organized in partnership with ING Tech Poland.
In the meeting agenda:
19.00 - 19.10
Welcome Address Speech
ING Hubs Poland
19.10 - 19.30
Data in the labour market – salaries and trends
Hays Poland
19.30 - 20.00
Panel discussion with representatives of Big Data and AI enterprises recruiting technical people
ING Banking Technology Platform, ING
Disney Streaming Services
Allegro
Panel chair:
Evention
February 25, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 1
12.30 - 13.00
TIME FOR NETWORKING ONLINE
13.00 - 13.10
CONFERENCE OPENING
GetInData | Part of Xebia
Evention
PLENARY SESSION
13.10 - 13.35
5 big data trends that redefine Edge to AI journey
During the session we will discuss the key trends redefining the way companies manage data and analytics lifecycle. The presenters will explain:
♦ the importance of disaggregation of compute and storage,
♦ advancements in stateful processing in Kubernetes,
♦ growing role of cloud and real-time processing for businesses in Poland.
Keywords: #DataArchitecture #Kubernetes #Streaming #MachineLearning #Cloud #BusinessAgility
13:35 - 13:50 Q&A Session
3Soft
Cloudera
13.35 - 14.00
High-Performance Data Analytics in a Hybrid and Multi-Cloud World
Many enterprises are re-thinking their data analytics strategy. Some plan to stay on-prem for GDPR reasons. Others are all in for full-cloud but want to stay agnostic. And still others require a hybrid approach: run certain workloads on-prem and move others to the cloud to capitalize on cloud economics. With object stores emerging as the main winners in the post-Hadoop era for cost-effective storage, enterprises are adopting them independently from the evolution of their EDW, Data Lakes, and Data Science platforms. Finally, there’s a convergence movement underway, causing enterprises to unify their data analytics platforms (EDW, Data Lakes and Data Science platforms) and supporting the broadest deployment models. Join us for this session to learn how Vertica can support your vision with a new era of data analytics in a hybrid and cloud-agnostic fashion, supporting a variety of object store technologies.
14:00 - 14:15 Q&A Session
Vertica
SIMULTANEOUS SESSIONS PART I
14.05 - 14.35
Architecture Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
Outfit7
Criteo
Doctolib
MLOps journey in H&M
In this session you will learn how H&M evolves a reference architecture covering the entire MLOps stack, addressing common challenges in AI and machine learning products such as development efficiency, end-to-end traceability, and speed to production. This architecture has been adopted by multiple product teams managing hundreds of models across the entire H&M value chain. It enables data scientists to develop models in a highly interactive environment and engineers to manage large-scale model training and model serving pipelines with full traceability.
The presenting team is currently responsible for ensuring that best practices and the reference architecture are implemented across all product teams to accelerate the H&M Group's data-driven decision-making journey.
Keywords: #MLOps #AIAtScale #MachineLearning #Engineering #DataScience
14.35 - 14.50 Q&A session
H&M
Building recommender systems: from algorithms to production
Machine learning-powered systems have become an essential part of most businesses. One such example is recommender systems, which adapt to customer behavior to provide an organic way to make domains like clothes, books, or music explorable. In order to successfully put such systems into production, we need to bridge the gap from the raw mathematical models and algorithms to robust and scalable software systems. In this talk, we start out with core approaches to recommender systems like collaborative filtering or click probability prediction, and follow this journey to explore how theory and practice come together.
Keywords: #machinelearning, #recommendations, #production, #architecture
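To make the collaborative-filtering idea mentioned in the abstract concrete, here is a minimal, self-contained sketch that recommends an item from a toy user-item interaction matrix; the data and variable names are illustrative only and are not taken from the talk.

    import numpy as np

    # Toy user-item interaction matrix (rows: users, columns: items);
    # 1 means the user clicked/bought the item, 0 means no interaction.
    interactions = np.array([
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 1],
    ])

    # Item-item cosine similarity: normalize item vectors, then take dot products.
    item_vectors = interactions.T.astype(float)
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    similarity = (item_vectors / norms) @ (item_vectors / norms).T

    # Score unseen items for user 0 by summing similarities to items they interacted with.
    user = interactions[0]
    scores = similarity @ user
    scores[user == 1] = -np.inf  # do not recommend items already seen
    print("Recommended item index for user 0:", int(np.argmax(scores)))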
14.35 - 14.50 Q&A session
14.40 - 15.10
Architecture Operations & Cloud
Data Engineering
MLOps
AI, ML and Data Science
Welcome to MLOps candy shop and choose your flavour!
Operationalizing machine learning (feature delivery, model training, deployment and serving) is nowadays one of the most challenging areas in fast-growing, data-driven companies. The variety of open-source components (Kubeflow, MLflow, Kedro, to name a few) and the set of specialized managed services provided by every major cloud provider drive solution architects nuts.
At GetInData we have a solution for it - we call it the GetInData MLOps Platform: a set of reusable components, following the Unix toolset philosophy ("do one thing and do it best") and portable to any environment. Thanks to loose coupling, it is also adjustable to clients' current and future ML-related challenges - like a candy shop, where the first person needs super-fast online predictions, the second requires robust hyperparameter tuning for the best possible models, and the third aims for scalable collaboration on feature extraction across many data science teams.
During the presentation we will show you two components we're really excited about - the Kedro-Kubeflow integration and a Feast-based feature store - how we implement them and what clients use them for. Welcome to our MLOps candy shop that no pandemic can close 😉
Keywords: #MachineLearning #MLops #FeatureStore #Kubeflow #Kedro #OpenSource #Feast
15.10 - 15.25 Q&A session
GetInData
GetInData | Part of Xebia
Popmon - population shift monitoring made easy
Tracking model performance is crucial to guarantee that a model running in production behaves as designed initially. Changes in the incoming data can affect the performance and make predictions unreliable. Given that input data often change over time, it is important to periodically track changes in both input distributions and delivered predictions, and to act on them when they differ significantly - for example, to diagnose and retrain an incorrect model in production. To make monitoring both more consistent and semi-automatic, at ING WBAA we have developed a generic Python package called popmon to monitor the stability of data populations over time, using techniques from statistical process control: https://github.com/ing-bank/popmon. popmon works with both pandas and Spark datasets. It creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using static or dynamic monitoring business rules.
Keywords: #
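A minimal usage sketch of the popmon package described above, assuming a pandas DataFrame with a date column; exact parameter names and defaults may differ between popmon versions.

    import pandas as pd
    import popmon  # registers the .pm_stability_report accessor on pandas DataFrames

    # Hypothetical dataset with a time axis and a couple of features to monitor.
    df = pd.DataFrame({
        "date": pd.date_range("2021-01-01", periods=100, freq="D"),
        "amount": range(100),
        "channel": ["web", "app"] * 50,
    })

    # Build time-sliced histograms and compare their profiles and distributions over time.
    report = df.pm_stability_report(time_axis="date", time_width="1w")
    report.to_file("stability_report.html")  # self-contained HTML monitoring report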
15.10 - 15.25 Q&A session
ING WBAA
ING WBAA
ModelOps – Operationalizing Modern Analytics & AI
Modern software development requires a comprehensive approach with well-defined processes of code development, testing, and deployment. That is where DevOps methodology comes in. It helps software developers in the seamless moving of their newly implemented features from development to production. But what about other areas of information technologies like advanced analytics, artificial intelligence, and machine learning? That’s where ModelOps comes to play – an approach that takes what’s best in DevOps and expands it to apply in the analytical world. During the presentation, we will show how continuous integration and deployment with the help of specialized tools, specifically designed for analytical purposes, can be leveraged to implement such an approach. This strategy can dramatically reduce the time-to-value for analytical assets developed within the organization, ensuring all those assets are methodically managed and safely updated with the most advanced, well tested analytical models.
Keywords:#AI #Analytics #ML #MachineLearning #DeepLearning #ModelOps #DevOps #XOps #MLOps
15.10 - 15.25 Q&A session
SAS Institute
SAS Institute
Thrive in the Data Age: how Siemens and BMW Group leverage machine learning for cybersecurity, operations and business use cases using Splunk
Machine learning is essential for solving use cases in cybersecurity, operations and various lines of business. This talk provides you with an overview of Splunk's big data and machine learning technologies that are used to solve real-world use cases. For example, we dive into the technical details of two selected customer examples of applied machine learning. First, we explain how a datacenter division of Siemens uses Splunk with unsupervised and supervised machine learning approaches in cybersecurity to uncover anomalies and to automate the classification of security events. We highlight the technical details of how this use case has been addressed with Splunk's Machine Learning Toolkit. Second, we explain how BMW Group's Innovation Lab developed a Predictive Testing Strategy with Splunk and a deep learning approach. Details are provided on how the Deep Learning Toolkit App for Splunk was used to build and evaluate a TensorFlow-based model to solve that use case. We conclude the session with an outlook and a wrap-up on all available technical resources.
Keywords: #datascience #ai #machinelearning #deeplearning #cybersecurity #operations #analytics
15.10 - 15.25 Q&A session
Splunk
15.10 - 15.15
TECHNICAL BREAK
ROUNDTABLE SESSIONS PART I
15.15 - 16.05
Parallel roundtable discussions are the part of the conference that engages all participants. They serve a few purposes. First of all, participants have the opportunity to exchange opinions and experiences about a specific issue that is important to the group. Secondly, participants can meet and talk with the leader/host of the roundtable discussion – selected professionals with vast knowledge and experience.
1. Managing a Big Data project – how to make it all work well together?
Spotify
2. Big Data on Kubernetes
Kubernetes was created for 'stateless' apps, not 'stateful' ones – so why should we consider it for Big Data? 'Stateful' apps come with persistent volume support, but many databases do not support it yet – so how can companies overcome this challenge? Is Spark + HDP the only reasonable solution for data transformation on K8s? What about other solutions – does it make sense to consider any others? Let's look at Telco 5G requirements – everything must run on K8s – where Hadoop is being replaced by an object storage solution that can be orchestrated by K8s to simplify the overall architecture. Finally, what about K8s on-prem vs. in the cloud – which direction is the way to go?
Vertica
3. Data discovery – building trust around your data
Building trust in data falls under one of the four main pillars of a good data setup - Data Governance. In this roundtable we will do a quick overview of the 4 pillars and how to go about building a trustworthy setup. Topics will cover data lineage, accuracy and completeness vs cost, toolkit available, tactics on recording 'truth' vs business interpretation and how to build a setup that will improve over time, rather than degrade.
Tesco Bank
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing is nowadays more and more popular and important. There are a lot of use cases: data capturing, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning like customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let's talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. From on-premise to the cloud: an end to end cloud migration journey
GetInData
6. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
MI2.AI
7. Transactional Data Lakes with Apache Spark (and Delta Lake, Apache Hudi and Apache Iceberg)
There is a trend in big data management space to add features we all know from relational databases, most notably ACID transactions and versioning. That's the main focus of open source projects Delta Lake, Apache Hudi and Apache Iceberg. They are storage layers on Hadoop DFS-like file systems and object stores that together with Apache Spark's capabilities allow building "reliable data lakes at scale". You're invited to discuss the pros and cons of each and how to use them effectively in our big data projects. All are equally welcome regardless of their experience and expertise. Let's share what we've already learnt and further deepen our understanding learning from others.
8. Operationalizing Analytics – sharing experience and best practices
The promise and potential business value of analytics is endless, which is why companies have spent the last decade investing in the right people, data, processes, and enabling technology. Yet studies show that less than 50% of the best models get deployed, 90% of models take more than three months to deploy and 44% of models take over seven months to be put into production.
SAS
SAS
9. Monitoring performance of ML models
Monitoring ML models running online on production data can be a challenge. Let's discuss the biggest difficulties and how to manage them, and what kind of tools you are using to detect problems with the input data and the results.
ING WBAA
10. We've got a model! What are the next challenges of deploying it at scale?
Training a good ML model is only the beginning of the journey. The next question is: how to integrate it with production systems robustly and effectively? Let's discuss your experience with ML model deployment challenges like continuous model training, training-serving skew, data drift, and model serving infrastructure.
GetInData | Part of Xebia
16.05 - 16.10
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART II
16.15 - 16.45
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Casting the Spell: Druid in Practice
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use-cases, including inflight analytics, reporting and building target audiences. The common challenge of these use-cases is counting distinct elements in real-time at scale. We've been using Druid to solve these problems for the past 5 years, and gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
Keywords: #BigData #ApacheDruid #RealtimeAnalytics #DataArchitecture #DataEngineering
16.45 - 17.00 Q&A Session
Nielsen Identity
Databricks
BigFlow – A Python framework for data processing on the Google Cloud Platform
You will learn about a tool that can improve your big data projects on GCP. Unified structure, configuration, versioning, build, deployment, and more, available for Dataflow/Dataproc/BigQuery.
Keywords: #gcp #python #dataflow #dataproc #bigquery
16.45 - 17.00 Q&A Session
Allegro
Training and deploying machine learning models with Google Cloud Platform
In my presentation I would like to present some approaches, good practices and Google Cloud components that we use in Sotrender to effectively train and deploy our machine learning models, which are used to analyze Social Media data. I will discuss which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Keywords: #mlops #gcp #python #nlp #computervision
16.45 - 17.00 Q&A Session
Sotrender
How to optimize the time needed to find and understand data as part of a Big Data project.
Many BigData projects focus on implementing technological solutions, forgetting their purpose, i.e. the needs or applications they are to serve.
Investments (Big Data projects have quite high budgets) are often made mainly in the IT area, ignoring individual areas of business activity, and do not generate much profit from a business point of view, which creates a high risk of failure for the entire project. During the presentation, we will talk about the factors that pose a threat to Big Data projects. Together, we will consider what analysts need, who Big Data's "client" is, and what they expect. How do you achieve effective cooperation between the analyst and the "client" by implementing data management, and above all by implementing an interface that bridges the accumulated knowledge about the data and the data recipient? We will show how much AI can support the use of Big Data's potential. Can you get information about your data while having your morning coffee? Yes – with the Clarite AI Data Assistant, all you need to do is ask a question about the data in natural language. With that, you will easily enter the era of human-data communication and big data.
Keywords: #dataplatform #datamanagement #businessdata #AI #DataGovernance #WatsonKnowledgeCatalog #KnowYourData #AIClariteAssistant
16.45 - 17.00 Q&A Session
Clarite Polska
Clarite Polska
16.50 - 17.20
Data Engineering I
Data Engineering II
MLOps/ AI, ML and Data Science
Data Strategy and ROI
Data lineage and observability with Marquez and OpenLineage
Data is increasingly becoming core to many products. Whether to improve recommendations for users, getting insights on how they use the product or using machine learning to improve the experience. This creates a critical need for understanding how data is flowing through our systems. Data pipelines must be auditable, reliable and run on time. Tracking lineage and metadata is the underlying foundation that enables many use cases related to data. It provides understanding of dependencies between many teams consuming and producing data and how constant changes impact them. It enables governance and compliance and generally helps you keep you data running. Marquez is an open source project part of the LF AI which instruments data pipelines to collect lineage and metadata and enable those use cases. It provides context by making visible dependencies across organisations and technologies and enables lineage governance and discovery.
Keywords: #lineage #observability #dataops
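To give a flavour of the lineage metadata such a system collects, the sketch below posts a hypothetical OpenLineage-style run event over HTTP; the endpoint path, namespaces and dataset names are assumptions for illustration and are not taken from the talk.

    import json
    from datetime import datetime, timezone
    from uuid import uuid4

    import requests

    # Hypothetical lineage event: one pipeline run reading a raw table and writing a cleaned one.
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-scheduler",
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "analytics", "name": "daily_orders_clean"},
        "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "clean.orders"}],
    }

    # Assumed endpoint of a metadata service collecting lineage events (placeholder URL).
    resp = requests.post(
        "http://localhost:5000/api/v1/lineage",
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()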
17.20 - 17.35 Q&A session
Datakin
Causal Mediation Analysis in the E-Commerce Industry
Causal mediation analysis is a formal statistical framework to reveal the underlying causal mechanism in randomized controlled experiments. The causal mechanism is referred to as a process that the treatment affects the outcome through some intermediate variables that can be referred to as mediators. Causal mediation analysis has been widely employed in various disciplines. However, it has not been applied to online A/B tests, the large scale online randomized controlled experiments in the daily practice of the internet industry. Perhaps it is because online A/B tests in the internet industry are primarily for evaluation: estimating and testing the average treatment effect. In this talk, we will discuss two of our recent works on the development of causal mediation analysis for producing insights for search and recommendation systems in the e-commerce industry.
Keywords: #causalinference #A/Btests #informationretrieval #searchmetrics #evaluation
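For readers unfamiliar with the framework, here is a toy product-of-coefficients sketch of mediation analysis on simulated data; it only illustrates the direct/indirect effect decomposition, not the estimators used in the talk.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 10_000

    # Simulated experiment: treatment T affects mediator M, which affects outcome Y.
    T = rng.integers(0, 2, size=n).astype(float)
    M = 0.5 * T + rng.normal(size=n)             # mediator model: M = a*T + noise
    Y = 0.3 * T + 0.8 * M + rng.normal(size=n)   # outcome model: Y = c'*T + b*M + noise

    a = LinearRegression().fit(T.reshape(-1, 1), M).coef_[0]
    outcome = LinearRegression().fit(np.column_stack([T, M]), Y)
    c_direct, b = outcome.coef_

    print(f"direct effect   ~ {c_direct:.2f}")   # expected ~0.3
    print(f"indirect effect ~ {a * b:.2f}")      # expected ~0.5 * 0.8 = 0.4
    print(f"total effect    ~ {c_direct + a * b:.2f}")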
17.20 - 17.35 Q&A session
Causal Inference and Experimentation, Udemy
Foundations of Data Teams
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure. This talk will cover the importance of a solid foundation and what management should do to fix it.
Keywords: #management #data teams #data engineers #data scientists #operations
17.20 - 17.35 Q&A session
Big Data Institute
PLENARY SESSION
17.25 - 17.55
The Journey to Data Cloud
Snowflake is the leading data platform for the cloud era. I will present its features as a modern data warehouse, uniquely exploiting the cloud capabilities to meet growing users' needs. Then I will discuss how it became the foundation of the Data Cloud, a revolutionary solution that opens the world's data to all organizations.
Keywords: #cloud #SQL #analytics, #scalability #datasharing #datawarehouse
17.55 - 18.10 Q&A session
Snowflake
17.55 - 18.30
SUMMARY OF THE DAY, PRIZE GIVEAWAY AND A SURPRISE*!
GetInData | Part of Xebia
Evention
* Live DJ performance especially for the participants of the meeting - DJ Michał Stochalski
He sets trends, creates new DJ sets and constantly improves his music skills, according to his motto "Excellence is earned throughout life". As a Video DJ he presents a combination of image and sound mixing live music with corresponding clips.
February 26, 2021
BIG DATA TECHNOLOGY WARSAW SUMMIT DAY 2
09.00 - 09.05
OPENING OF THE SECOND DAY
PLENARY SESSION
09.05 - 09.35
Fast growth iteration via A/B testing
Online A/B testing gives people a powerful tool to quickly examine whether a product hypothesis is true or not. It is no secret that most tech giants, such as Google and Atlassian, grow their products by leveraging this tool in fast iterations. In this talk, we will first review the product growth process used at Google and Atlassian, and then zoom in to give some tips on the common problems encountered when conducting A/B experiments. Lastly, we will spend some time discussing with the participants their experience and learnings from conducting A/B experiments.
Atlassian
SIMULTANEOUS SESSIONS PART III
09.40 - 10.10
Architecture Operations & Cloud
AI, ML and Data Science
AI, ML and Data Science II
Data Engineering
Management of a cloud Data Lake in practice: How to manage 1000s of ETLs using Apache Spark
Nowadays the problem of speed of processing is seemingly solved. Unless you process tens of petabytes, an off-the-shelf toolset will suffice for most problems. Currently the main challenges in data lake systems are in the field of data governance:
• how do you make sure data is discoverable, reusable, up to date and of high quality?
• how to avoid huge technical debt when developing massive number of complex data flows?
• how to guarantee that the project can scale despite having access to very scarce human resources and technical talent?
The goal of this talk is to showcase how to design a data lake management system that is scalable in the broadest meaning of the word: one that not only scales with the growth of the data, but also with the growth of the complexity of the whole enterprise. The talk will outline the business reasoning and key design principles as well as the technical solution. Expect some (but not too many) nerdy details related to the Apache Spark implementation.
Keywords: #DataGovernance #DataLake #DataQuality #Cloud #ApacheSpark #Azure #DataBricks
10.10 - 10.25 Q&A Session
DXC Luxoft
Building an analytics platform from scratch while developing production solutions on top of it.
Story of a year-long journey with an Asian telecom of creating a positive feedback loop between building a data analytics platform and moving analytics into it.
Like in every great story, you can expect:
introduction - reasonably well defined scope and set of main characters,
turning points - how life took us on a journey of changed ownership, discovering new needs, learning by teaching and bold goals achieved by doing progress over perfection,
conclusion - how it all came together to change the way the analytics team works and productizes its models.
This story includes real-world analytics use-cases such as: cost of network incidents, ARPU (Average Revenue Per User), NBO (Next Best Offer), cost of service, RFM (Recency, Frequency, Monetary Value), churn and a few more.
Everything seasoned with open-source (Spark, Presto, Nbdev) and latest ML Ops technologies (KubeFlow, Feast).
Keywords: #ModelProductization #MachineLearning #FeatureStore #KubeFlow #DataScience #OpenSource #OnPremise
10.10 - 10.25 Q&A Session
GetInData
GetInData
When HR meets Artificial Intelligence.
Digital transformation takes a lot of our processes to new levels. Especially since the eruption of the pandemic, we need to look for new solutions to old problems. For example, if we open up to remote IT specialists, we can easily get 10x-100x more candidates while keeping the same recruitment procedures.
During my presentation, I would like to share ideas of how AI-driven tools are and can be used in HR and Talent Management:
- Current possibilities of what is feasible (including some demos),
- Where we can take these tools in the foreseeable future,
- What are the current technical challenges,
- Impact challenges, especially from an ethical standpoint.
Keywords: #artificialintelligence #nlp #digitaltransformation #futureofwork #hcm
10.10 - 10.25 Q&A Session
Revolut
Cisco
10.15 - 10.45
Architecture, Operation and Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science
Data Engineering
Expanding your data & analysis ecosystem with public cloud
Going to the cloud sounds fantastic; giving people new opportunities - priceless. The public cloud gives you a lot of ready-made systems that you want to use. But when you already have a working environment based on your own data center, a lot of data, and plenty of people using these tools, your work will involve a lot of fascinating challenges: transferring data, moving from one tool to another, selecting tools without consternation, cost optimization, policies, security, and many others.
Keywords: #hadoop #spark #airflow #gcp #bigquery #composer #dataproc #data analysis
10.45 - 11.00 Q&A Session
Allegro
Streaming SQL - Be Like Water My Friend
Data has to be processed fast, so that a firm can react to changing business conditions in real time. Streaming SQL gives us the possibility to make stream processing available to a broader audience, but it also makes it easier to access data streams. This presentation will not only give you a brief overview of the data and streaming architecture at InnoGames, but also introduce you to the idea of Streaming SQL in general and how it is implemented in Apache Flink. Furthermore, it shows actual examples of how to use Flink SQL, so that you hopefully are inspired to consider this rather new technology to tackle your data challenges.
Keywords: #streaming #streamingsql #flink #dataflow #flinksql
10.45 - 11.00 Q&A Session
InnoGames GmbH
Make it personal: reinforcement learning for mere mortals
During this session we will reflect upon the importance of personalization in e-commerce. What challenges occurred as a result of bridging the gap between Google's AlphaGo and the real world? Furthermore, we will discuss Vowpal Wabbit: the Swiss army knife of ML algorithms.
To sum up, we will introduce a case study exercise, during which participants will be creating a personalized user experience on a webpage.
Keywords: #personalization #ecommerce #vowpalwabbit #reinforcementlearning #opensource
10.45 - 11.00 Q&A Session
eBay Classifieds Group
OLX Group
10.45 - 10.50
TECHNICAL BREAK
SIMULTANEOUS SESSIONS PART IV
10.55 - 11.25
Architecture Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
Presto: SQL-on-Anything & Anywhere
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale at organizations including Airbnb, Comcast, Facebook, FINRA, GrubHub, LinkedIn, Lyft, Netflix, Twitter, Uber, and Zalando, Presto has experienced unprecedented growth in popularity in both on-premise and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in object storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines. In particular, Starburst developed a native integration for Presto that leverages Delta-specific performance optimizations.
Join this session and hear how Starburst Presto deployed on Azure Kubernetes Service (AKS) serves as a fast SQL query engine over data in ADLS, and enables query-time correlations between IoT data in Delta Lake, customer data in SQL Server, and web log data in Elasticsearch.
You will also gain best practice and real-life insights and lessons learned from production deployment of this integration.
Keywords: #presto #sql #analyticsanywhere #azure
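As a rough illustration of the federated querying described above, the sketch below issues a cross-source join through the Python DB-API client for Presto/Trino; the host, catalog and table names are placeholders and assume those connectors are configured on the cluster.

    import trino  # Presto/Trino Python DB-API client (package name may differ by distribution)

    # Hypothetical catalogs: 'delta' and 'sqlserver' must be configured on the cluster;
    # host, schema and table names below are illustrative only.
    conn = trino.dbapi.connect(
        host="presto.example.com", port=8080, user="analyst",
        catalog="delta", schema="iot",
    )
    cur = conn.cursor()

    # Query-time join across two data sources, as described in the session abstract.
    cur.execute("""
        SELECT c.customer_name, COUNT(*) AS readings
        FROM delta.iot.sensor_readings r
        JOIN sqlserver.crm.customers c ON r.customer_id = c.id
        GROUP BY c.customer_name
        ORDER BY readings DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)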
11.25 - 11.40 Q&A session
Starburst
Starburst
Complex event-driven applications with Kafka Streams
At Simply Business we have built a rich stateful application using Kafka Streams to manage leads that are served by our 300-person-strong UK call centre. This application combines many different data points from different services and has a few internal data stores to operate. Our initial design of the application became quite complex, so we had to make changes to ensure scalability and reliability. In the talk I'll present the techniques we used to simplify it.
But that's not all. I'll also talk about other key components like the schema registry, which helped with the robustness of the solution, and lead scoring, which helped to increase the return on call spend by over 50%.
Keywords: #data-streaming #kafkastreams #schemaregistry #domainevents #evolutionaryarchitecture
11.25 - 11.40 Q&A session
Simply Business
How to build a state-of-the-art weather forecasting AI service
Weather forecasting is important in many fields, and minor improvements in accuracy can have a considerable business impact. Today, weather forecasting is performed using computationally expensive mathematical models based on the Navier-Stokes and mass continuity equations, the first law of thermodynamics, and the ideal gas law. These models simulate the physical world and are making use of expensive supercomputers. Alternatively, AI can be used to learn from data and produce forecasts in a fraction of the time the physical simulations require, and in many cases at a higher degree of accuracy. In this presentation, I will show how an AI was used to produce competitive forecasts, using state-of-the-art AI models and neural architecture search, and how I used React to prototype a weather forecasting service.
11.25 - 11.40 Q&A session
Peltarion
Big Data Instruments and Partnerships - Microsoft ecosystem update
This session will be focused on the strategic side of our big data investments, from two major angles – instruments that we have – and partnerships – how we are impacting and enriching the external ecosystem. It will be a session in-between tech and business.
11.25 - 11.40 Q&A session
Microsoft Russia, Microsoft
11.30 - 12.00
Architecture Operations & Cloud
Streaming and Real-Time Analytics
AI, ML and Data Science I
Data Strategy and ROI
AWS Spot instances price prediction - towards cost optimization for Big Data
Analytical data processing has become the cornerstone of today's businesses' success, and it is facilitated by Big Data platforms that offer virtually limitless scalability. However, minimizing the total cost of ownership (TCO) for the infrastructure can be challenging. By analyzing spot instance price history using ARIMA models, it is feasible to leverage the discounted prices of the cloud spot market with a limited risk of analytical job termination. In particular, we evaluated savings opportunities when using Amazon EC2 spot instances compared to on-demand resources. During the presentation we show the evaluation of univariate spot price regression models that forecast future prices, and we confirm the feasibility of short-term spot price prediction using real data from AWS. This confirms cost-saving opportunities of up to 80% compared to on-demand pricing, within 1% of the absolute minimum.
Keywords: #TCO #CloudComputing #ARIMA #AWS #Spot
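A minimal sketch of the univariate forecasting idea on a synthetic price series, assuming statsmodels; the actual model orders and evaluation used in the research may differ.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic hourly spot-price history standing in for the AWS price feed.
    rng = np.random.default_rng(42)
    hours = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
    prices = pd.Series(0.10 + 0.005 * rng.standard_normal(len(hours)).cumsum(), index=hours)

    # Fit a simple ARIMA(1,1,1) model and forecast the next 24 hours of prices.
    model = ARIMA(prices, order=(1, 1, 1))
    fitted = model.fit()
    forecast = fitted.forecast(steps=24)
    print(forecast.head())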
12.00 - 12.15 Q&A Session
Nowa Era
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Evolving Bolt from batch jobs to real-time stream processing - migration, lessons learned, value unleashed
We would like to invite you to discuss how Bolt migrated from batch and synchronous to real-time and asynchronous. During our session we will review and evaluate the obstacles we faced along the way and lessons we have learned. We will also focus on the unleashed value of real-time data.
Keywords: #kafka #streaming #data #realtime
12.00 - 12.15 Q&A Session
Bolt
How do NoMagic robots improve thanks to a Software 2.0 improvement cycle supported by an in-house data engine?
A typical Software 2.0 improvement cycle at NoMagic:
* analyse when a data driven algorithm performs poorly
* gather and label data that will improve the algorithm or propose a modified algorithm
* train the model
* test the model
* deploy to production
Making this cycle as frictionless as possible was the focus of the NoMagic ML team in 2020.
I will share with you what we achieved and how it changed the way we work at NoMagic.
12.00 - 12.15 Q&A Session
NoMagic
How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects
During this session, we aim to review the technical and organisational challenges we faced while building a complex AI-based app with a short time-to-market. How was it additionally influenced by the dispersion of the teams involved around the world (10 time zones)? What unpredictable events affected our plans, e.g. the COVID pandemic or the development vendor changing in the middle of the project? We will evaluate examples of failures and successes and the lessons we have learned in the process. We invite you to a broader discussion.
Keywords: #datascience #ai #machinelearning #agile #projectmanagement
12.00 - 12.15 Q&A Session
Pearson
Pearson
12.00 - 12.05
BREAK
ROUNDTABLE SESSIONS PART II
12.05 - 12.55
You can choose among such roundtable subjects:
1. Building a world-class Big Data team during the COVID-19 pandemic - recruiting, training, collaborating.
Last year forced us to change a lot in how we work. A lot of us had to switch to working/studying from home, some needed to freeze hiring, others - to redefine onboarding. As hard as last year was, it was also a time of innovation. Join the session to exchange lessons learned and ideas for building a world-class Big Data team leveraging “the new normal”. Everybody is welcome - the more diverse experiences the better.
Zendesk
2. Big Data on Kubernetes
GetInData
3. Best tools for alerting and monitoring of the data platforms
Have you ever been woken up in the middle of the night by a screaming PagerDuty alert on your mobile, 99+ notifications on the {YOUR_PIPELINE_NAME}_alerts Slack channel and tens of graphs in Grafana looking like undreamt-of van Gogh art? If yes, welcome, me too. For an engineer working on a Data Platform it is easy to create a new pipeline, a new dataset or add any new integration, especially now in the cloud era. But it is not easy to have a proper monitoring and alerting system ensuring that any potential issues/incidents can be solved as quickly as possible, so the offering of our Data Platform is always top quality. In this session we will discuss tools for building a monitoring and alerting system that is efficient, easy to understand, supervises exactly what we want, notifies the ones we want, is not too noisy and scales well with ever-growing data.
Bolt
4. Stream processing engines – features, performance, comparison
Stream processing and real-time data processing is nowadays more and more popular and important. There are a lot of use cases: data capturing, marketing, sales and business analysis, monitoring and reporting, troubleshooting systems, and real-time machine learning like customer/user activity (personalization and recommendation), fraud detection, and real-time stock trades. There are a lot of stream data processing frameworks, like Spark (Structured) Streaming, Flink, Storm, Amazon Kinesis, … Let's talk about them, try to compare them, and list pros and cons in terms of various problems and challenges like throughput, performance, latency, system recovery and so on.
BAE Systems Applied Intelligence
5. Using the public cloud effectively and cost-efficiently
According to Unisys's Cloud Barometer study, only a third of organizations have seen great improvements to their organizational effectiveness as a result of Cloud adoption. What are good practices to be part of those organizations? Let's discuss how to use the public Cloud effectively and cost-efficiently.
TrueBlue
6. Building AI/ML systems: from algorithms to production
We're facing very different challenges when writing a scientific paper and when building a production ML system. Things get even more complex when a single project involves both research and application. It's generally understood yet often overlooked: let's get talking! How do you scope an ML project? How do you get the data yet avoid biases and those multi-million-euro GDPR penalties? What models work in real-world scenarios? How do you handle model deployment? And who do you need on your team to succeed?
7. MLOps - how to support the life-cycle of ML models
MLOps is a hot trend concerning the end-to-end lifecycle of ML models from conception to model building and monitoring to decommissioning. How do you govern this lifecycle? Which methodologies and solutions are worth using? What mistakes should be avoided? Let's exchange experiences!
MI2.AI
9. Data Strategy. The Game.
The format of this roundtable discussion is a game in which you, as Chief Data Officer, have a mission to implement strategic initiatives for a $2.7B electronics manufacturer (please watch the short video at the Tech Zone for more details). You will have a chance to learn how to maximize business value from data, how to design and execute a Data Strategy, which strategy approach is best, and how your decisions influence others within the organization.
SoftServe
10. Distributed Big Data processing in the cloud – is Hadoop still an option?
Joint Cloudera & 3Soft roundtable to discuss practical experience and highlights of providing self-service access to integrated, secured, multi-function analytics based on Hadoop, cloud-native offerings or custom-tailored solutions. Let us share our knowledge on how to enjoy consistent data security, governance, lineage, and control, while deploying the powerful, easy-to-use solutions business users require and eliminating their need for shadow IT solutions.
3Soft
Cloudera
11. Snowflake Data Cloud – possibilities and limitations. How can I judge whether this is a value proposition for me and my organization?
Snowflake
Snowflake
12.55 - 13.00
TECHNICAL BREAK
PLENARY SESSION
13.00 - 13.30
Lessons from building large-scale, real-time machine learning systems
Unity Ads helps publishers and advertisers reach their business goals, and machine learning is at the core of our product. In this presentation, I will first give an overview of the machine learning systems we built for real-time ads bidding, which process tens of thousands of ad auction requests per second. Then, I will share several generalizable lessons we learned in making our systems performant from machine learning perspective and scalable from engineering perspective.
13.30 - 13.45 Q&A Session
Unity
13.30 - 13.45
GetInData | Part of Xebia
Evention
*Times may vary.
Flink SQL in 2021: Time to show off!
Four years ago, the Apache Flink community started adding SQL support to ease and unify the processing of static and streaming data. Today, Flink runs business critical batch and streaming SQL queries at Alibaba, Huawei, Lyft, Uber, Yelp, and many others. Although the community made significant progress in the past years, there are still many things on the roadmap and the development is still speeding up. This session will focus on a comprehensive demo of what is possible with Flink SQL in 2021.
Based on a realistic use case scenario, we'll show how to define tables which are backed by various storage systems and how to solve common tasks with streaming SQL queries. We will demonstrate Flink's Hive integration and show how to define and use user-defined functions. We'll close the session with an outlook of upcoming features.
Keywords: #flink #flinksql #streamprocessing #unifieddataprocessing #apache
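For orientation, a minimal PyFlink sketch in the spirit of the session: declaring a Kafka-backed table and running a continuous windowed aggregation with Flink SQL. The connector options, topic and field names are illustrative assumptions and depend on the Flink version and connectors installed.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Streaming table environment (Flink 1.12+ style API).
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = TableEnvironment.create(settings)

    # Hypothetical Kafka-backed source table; requires the Kafka SQL connector on the classpath.
    t_env.execute_sql("""
        CREATE TABLE clicks (
            user_id STRING,
            url     STRING,
            ts      TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'clicks',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json',
            'scan.startup.mode' = 'earliest-offset'
        )
    """)

    # Continuous query: clicks per user per 1-minute tumbling window.
    result = t_env.execute_sql("""
        SELECT user_id,
               TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
               COUNT(*) AS clicks
        FROM clicks
        GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
    """)
    result.print()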
Ververica
GetInData
GetInData
Artificial Intelligence - Building in-house AI capabilities from scratch at Philip Morris International
We will start by sharing how our team is structured and what it is that we deliver, and we will continue by sharing more about our journey and the challenges we faced within a big corporation until we reached a good level of maturity inside the organization. An exposition of tangible use-cases will follow, and we will take the other half of the session to talk about technical details, such as the technology stack, CI/CD pipelines, MLOps, and others that help us accelerate delivery.
Keywords: #AI #DL #AIBusiness #Innovation #Productivity #Disruption
Philip Morris International
PMI
Simplifying Stateful Serverless Architectures
Platforms like KNative and FaaS have solved most of the challenges of dealing with stateless applications. Still, when it comes to managing state, developers quickly end up designing and maintaining a complicated architecture without achieving consistency guarantees in the presence of failure. Stateful Functions (StateFun) - developed under the umbrella of Apache Flink - provides consistent messaging and durable state without compromising the serverless experience. Like a database, it exposes its capabilities to application developers in a platform and language agnostic manner: StateFun does not mind if you deploy your application as a set of Python functions on your preferred FaaS platform, a single Spring Boot application on Kubernetes or a mixture of both.
In this demo-centric session you will learn about the core ideas behind the project and you will see how to write, deploy and monitor a simple Stateful Functions application.
Keywords: #ApacheFlink #Serverless #Kubernetes #Event-Driven #Scale
Ververica
California State University
Data Warehouse Development Lead.
Independent speaker
Creating Confidence in Data at Klarna - A Case Study in Automatic Data Validation
We can all agree that when making big decisions, you want to make them with confidence. In the world of art, this means buying your painting from a well known auction house instead of from the back of an old car. In the world of data, this means validating your data before using it.
During this session, we will show you how to quickly create confidence in data using automatic data validation. We will describe the validation process that we are using at Klarna, and show how this enables rapid improvements of big data transformations.
Keywords: #data-transformation #transformation-improvement #data-confidence #automatic-data-validation #data-validation-tool
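As a toy illustration of the kind of pre-use checks discussed above, here is a small pandas-based validation sketch; Klarna's actual validation tooling is not described in the abstract, so the names and thresholds are hypothetical.

    import pandas as pd

    def validate(df):
        """Return a list of human-readable validation failures for a transformed dataset."""
        failures = []
        if df.empty:
            failures.append("dataset is empty")
        if df["order_id"].duplicated().any():
            failures.append("duplicate order_id values found")
        if df["amount"].lt(0).any():
            failures.append("negative amounts found")
        null_share = df["customer_id"].isna().mean()
        if null_share > 0.01:
            failures.append(f"customer_id null share too high: {null_share:.1%}")
        return failures

    # Hypothetical transformed dataset to validate before it is used downstream.
    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 5.5, 7.25],
        "customer_id": ["a", "b", None],
    })
    print("Validation failures:", validate(df) or "none")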
Klarna Bank AB
Klarna Bank AB
SGPR.TECH
SGPR.TECH
Organising the chaos - metadata in action
Several key issues arise when building data-driven products and services: primarily searching for data sets, understanding the possibilities and limitations of individual sets, gaining access to data, and using data in a controlled and transparent way.
Keywords: #metadata
Ab Initio Software
ING
StepStone Services
CEPSA
Datumize
Data Strategy. The Game!
This is an extension of, and further context for, Taras Bachynskyy's roundtable "Data Strategy. The Game!" – a very inspiring approach to unlocking the hidden value of data. How do you maximize business value? Which strategy approach is best? How do your decisions influence others within the organization in the context of data?
SoftServe
Sotrender
Sotrender
Sotrender
Freelancer
Datumo
Keynote Speakers BDTWS 2021
Speakers
Alex Belotserkovskiy
Microsoft Russia, Microsoft
Johannes Bracher
Heidelberg Institute for Theoretical Studies (HITS), Karlsruhe Institute of Technology (KIT)
Subash D’Souza
California State University
Josef Habdank
DXC Luxoft
Eftim Zdravevski
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Selection Committee BDTWS 2021
Fabian Hueske
Apache Flink
All three workshops will take place on the 23rd and 24th of February
- 2x4 hours on each of these days (8 hours in total per workshop)
BIG DATA ON KUBERNETES
DETAILS:
How to use Kubernetes in AWS and run different Big Data tools on top of it? We simulate a real-world architecture – a real-time data processing pipeline: reading data from web applications, processing it and storing the results in distributed storage. The technologies that we will be using include Kafka, Spark 3.0 and S3. All exercises will be done on remote Kubernetes clusters.
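To give a flavour of the kind of pipeline built in the workshop, here is a minimal PySpark Structured Streaming sketch that reads from Kafka and writes Parquet to S3; the broker, topic and bucket names are placeholders and the workshop's actual code may differ.

    from pyspark.sql import SparkSession

    # Spark 3.x session; on Kubernetes this would be submitted via spark-submit with the k8s master.
    spark = SparkSession.builder.appName("events-to-s3").getOrCreate()

    # Read a stream of events from Kafka (placeholder broker and topic).
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "web-events")
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload", "timestamp")
    )

    # Continuously write the records to S3 as Parquet (requires the hadoop-aws/s3a jars).
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3a://my-bucket/web-events/")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/web-events/")
        .start()
    )
    query.awaitTermination()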
SESSION LEADER:
GetInData
REAL - TIME STREAM PROCESSING
DETAILS:
How to process unbounded streams of data in real-time using popular open-source frameworks? We focus mostly on Apache Flink and Apache Kafka. We simulate a real-world end-to-end scenario – processing logs generated by users interacting with a mobile application in real-time. The technologies that we use include Kafka, Flink, HDFS and YARN. All exercises will be done on remote multi-node clusters.
SESSION LEADERS:
GetInData
GetInData
FOUNDATIONS OF DATA ENGINEERING WITH GOOGLE CLOUD
DETAILS:
While getting familiar with services like Google Cloud Storage, BigQuery or Dataflow, we will walk through the common data flow patterns adopted by companies migrating to the cloud. The workshop will contain a series of exercises that will help you get hands-on experience with Google Cloud Platform, as well as an opportunity to discuss best practices, security, scalability and cost management aspects.
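For orientation, a minimal sketch of querying BigQuery from Python with the official client library; the project, dataset and query are placeholders rather than the workshop exercises.

    from google.cloud import bigquery

    # Uses application-default credentials; the project id is a placeholder.
    client = bigquery.Client(project="my-gcp-project")

    query = """
        SELECT DATE(created_at) AS day, COUNT(*) AS orders
        FROM `my-gcp-project.shop.orders`
        GROUP BY day
        ORDER BY day DESC
        LIMIT 7
    """

    # Run the query and iterate over the result rows.
    for row in client.query(query).result():
        print(row.day, row.orders)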
SESSION LEADERS:
GetInData
ORGANIZERS
Evention
Evention is a company that specializes in increasing the value of ICT business meetings. We strongly believe that business events are an integral and irreplaceable factor in the creation and maintenance of relations and the improvement of communication between companies and the people who create them. We are constantly searching for innovative business meeting formulas to address the current needs, expectations and aspirations of ICT managers.
GetInData | Part of Xebia
GetInData | Part of Xebia is a leading Polish expert company delivering cutting-edge Big Data, Cloud, Analytics, and ML/AI solutions. The company was founded in 2014 by data engineers and today brings together 120 big data specialists. We work with international clients from many industries, e.g. media, e-commerce, retail, fintech, banking, and telco. Our clients are both fast-growing scaleups and large corporations that are leaders in their industries. We maintain a laser focus on data technologies, cultivate a very strong engineering culture and support extensive knowledge sharing both within the company and outside through meetups, conferences and contributions to open source. We are a go-to partner for companies that need tailored and highly scalable data processing and analytics platforms that give a competitive advantage and unlock the full business potential of their data.
GENERAL PARTNERS
Cloudera
At Cloudera, we believe data can make what is impossible today possible tomorrow. We empower people to transform data anywhere into trusted enterprise AI so they can reduce costs and risks, increase productivity, and accelerate business performance. Our open data lakehouse enables secure data management and portable cloud-native data analytics, helping organizations manage and analyze data of all types on any cloud, public or private. With as much data under management as the hyperscalers, we’re the preferred data partner for the top companies in almost every industry. Cloudera has guided the world on the value and future of data and continues to lead a vibrant ecosystem powered by the relentless innovation of the open-source community. Learn more at cloudera.com
3Soft
3Soft supports companies in maximizing their business potential.
We create dedicated solutions that facilitate data management and automate internal and external processes. We enrich the prepared systems with the benefits of artificial intelligence. This enables faster analysis of information, discovery of non-obvious relationships and drawing accurate conclusions.
We are trusted by companies in the area of small and medium-sized enterprises, the largest banks in Poland and world leaders in the retail, fuel and energy, manufacturing and automotive industries.
STRATEGIC PARTNERS
Snowflake
Snowflake enables every organization to mobilize their data with Snowflake’s Data Cloud. Customers use the Data Cloud to unite siloed data, discover and securely share data, power data applications, and execute diverse AI/ML and analytic workloads. Wherever data or users live, Snowflake delivers a single data experience that spans multiple clouds and geographies. Thousands of customers across many industries, including 639 of the 2023 Forbes Global 2000 (G2K) as of July 31, 2023, use Snowflake Data Cloud to power their businesses. Learn more at snowflake.com.
Vertica
Vertica is the unified analytics platform, powering data-driven businesses with predictive insights based on advanced AI and machine learning, at blazing speed, and at petabyte scale. Available in a fully managed SaaS option, or as a customer-managed platform, Vertica offers the widest range of deployment configurations in the data analytics industry. With Vertica, data analytics teams can combine data siloes that are growing exponentially, without moving the data for analytics. They can manage analytic workloads in the public clouds, on-premises, on Hadoop, or any hybrid combination. And with separation of compute and storage, Vertica allows teams to spin up storage and compute resources as you need them, then spin down afterwards to reduce costs. Learn more about us at Vertica.com, and follow us on Twitter @VerticaUnified.
CONTENT PARTNERS
RTB House
RTB House is a global company that provides state-of-the-art retargeting technology for top brands worldwide. Its proprietary ad buying engine is the first and only in the world to be powered entirely by deep learning algorithms, enabling advertisers to generate outstanding results and reach their short, mid and long-term goals. Founded in 2012, RTB House serves over a thousand campaigns across EMEA, APAC and the Americas regions with main locations in New York, London, Tokyo, Singapore, São Paulo, Moscow, Istanbul, Dubai and Warsaw.
Allegro
At Allegro we make apps that, thanks to their scalability and reliability, have gained fans all over Central and Eastern Europe. It was not an easy task. Every day we face challenges in the architecture and design area, as well as in the process of choosing the right technology, ensuring code quality, and, in later phases, implementing and maintaining a product. Allegro Tech is our idea for sharing our experience by organizing conferences, workshops, meetups and hackathons.
You can find more information on our website: www.allegro.tech
Follow us at https://www.meetup.com/allegrotech and join our meetups: https://www.facebook.com/allegro.tech
Cisco
Cisco (NASDAQ: CSCO) is the worldwide technology leader that has been making the Internet work since 1984. Our people, products, and partners help society securely connect and seize tomorrow's digital opportunity today. www.cisco.com
Clarite
Clarite Polska SA is a company with an established position in the advanced IT solution supplier market, listed by GSC as one of the companies operating critical infrastructure in Poland. We are specialized in three areas: Advanced Analytics/Big Data, System Integration, and ECM/BPM. Within the Big Data area, we inspire and advise our customers on solution architecture, verify analytic needs and design end-to-end solutions. Drawing upon our experience along with supporting technologies such as Data Governance tools and Workload Automation, we provide analytical departments with more thorough and up-to-date information on the data they possess, contributing to more efficient Big Data implementations.
The offering is completed by Robotic Process Automation supported by a new generation of intelligent chatbots.
ING Tech Poland
ING Tech Poland is an IT company located in Katowice and Warsaw (Poland), which provides IT and operational services to all ING units worldwide. In terms of IT, we deliver IT security, hosting, remote management and application services. Our operational services are delivered by three units: RiskHub, CardsHub and Know Your Customer (KYC). The first one was established as part of ING Tech Poland and is our Modelling Expertise Centre, shaping the future of risk modelling and data analysis in Poland. Our ambition is to shape the future of risk modelling and data analysis in Poland and to build the position of employer of choice in the Warsaw labour market.
Luxoft
Luxoft, a DXC Technology Company, is a digital transformation and software engineering company that provides customized IT solutions that drive business change for customers in every corner of the globe. Currently, Luxoft employs over 12,700 people in 40 locations around the world. We combine high-quality services and in-depth industry knowledge, specializing in the automotive, financial services, media and telecommunications sectors, as well as many others. We hire experienced specialists, engaging them in long-term projects without slowing down our pace of output, so we can confidently say that we are a stable employer even in difficult times.
Luxoft Poland is well known for consistently high levels of supply, adroitness in complex project management, the talent of its highly qualified experts in the field of digital engineering, exceptional customer orientation, as well as its agility, creativity, and remarkable problem solving capabilities.
Microsoft
Microsoft enables digital transformation for the era of an intelligent cloud and an intelligent edge. Its mission is to empower every person and every organization on the planet to achieve more. Whether you're just starting your developer career or are an experienced professional, our hands-on approach helps you arrive at your goals faster, with more confidence and at your own pace. Visit the home for Microsoft documentation and learning for developers and technology professionals.
#InventwithPurpose
SAS
SAS is the leader in analytics and AI. Through innovative software and services, SAS empowers and inspires customers around the world to transform data into intelligence to make smart decisions and drive relevant change. With SAS, data scientists have a rich set of tools, including statistical methods, optimization, data visualizations, machine learning and deep learning. You can serve predictions and embed AI models in applications through REST APIs and programming languages such as Python, R, Java, Lua and Scala. The SAS platform runs in a variety of environments, whether you are deploying in the cloud or on premises. SAS also provides natural language processing and speech-to-text tools. These methods are accessible through APIs and enable users to build intelligent applications such as chatbots. The SAS platform allows you to operationalize data products at any scale in real time. SAS helps customers at more than 83,000 sites in 147 countries. Incorporated in 1976, the company employs nearly 14,000 people. www.sas.com
SoftServe
SoftServe is more than just technology and ideas – we are the place where talented and ambitious people can develop their passion! We operate at the cutting edge of technology, exploring, transforming, accelerating, and optimizing the way large enterprises and software companies do business. We use the latest technologies and think outside the box. We are a team of 1,000+ people experienced in such areas as Big Data, Data Science, AI, VR, Machine Learning, as well as in core software development, Experience Design, DevOps, and many others. We have offices in 7 locations in Poland – Wroclaw, Gliwice, Bialystok, Warsaw, Gdansk, Lodz and Cracow. And we are still growing!
Splunk
Splunk is the data platform leader for security and observability. Our extensible data platform powers enterprise observability, unified security and limitless custom applications. Splunk helps tens of thousands of organizations turn data into doing so they can unlock innovation, enhance security and drive resilience. More info at www.splunk.com
SUPPORTING PARTNER
Ab Initio
Ab Initio Software is a top supplier of enterprise data and metadata management software. Numerous leading global corporations use Ab Initio software to develop enterprise mission-critical applications with unmatched performance and full scalability. These applications encompass the full spectrum of data processing tasks: from real-time operational systems to batch analytical systems, and from data warehouses to transactional systems.
Core to Ab Initio is a simple idea that everything should be graphical. Applications should be graphical. Rules should be graphical. Orchestration, metadata, data management, and so on – no matter how big or complex, all should be graphical.
With this in mind, we believe that our experience supporting massive scale and data volumes, combined with our proven Hadoop capabilities, forms a solid basis for meeting your needs today and in the future.
PATRONAGE
Big Data Passion
BigDataPassion’s mission is to enable everyone interested to learn more about Big Data, Cloud Computing, Artificial Intelligence, Natural Language Processing, Configuration Management, Application Deployment, as well as other tech news. Technology is our passion and we like to share our knowledge, show new trends and well-known solutions to problems, with a list of the tools needed. We write because we like it – check it out!
Data Science Warsaw
Data Science Warsaw is a community of data scientists based in Warsaw. We are a non-profit professional organisation dedicated to the free, open dissemination of data science. We meet to discuss the tools, methods and technologies used to ingest, transform, explore, analyze and visualize data, obtain predictive and prescriptive insight, develop data products, and exploit business opportunities from data products. The organizers of the Data Science Warsaw meetings are ICM and the Foundation DataSci.
Inhire
Inhire is a platform that helps the best IT specialists get a new job more effectively. It was created to automate recruitment in the tech industry, making it faster, more effective and spam-free for candidates and companies. Inhire is different on many levels:
– Anonymous: you can browse matched job offers completely anonymously; only when you are interested in a job offer do you reveal your data to companies, with a 1-click application.
– Automated: candidates specify their skills and expectations and our system presents them with currently open positions that meet their requirements. This way candidates are always up to date with the available offers out there.
– No spam: we respect candidates’ time. We only notify them about perfect job matches waiting for them just one click away.
Over 400 purely tech companies from Poland and abroad are already looking for IT specialists at Inhire.
Interdisciplinary Center for Mathematical and Computational Modelling (ICM), University of Warsaw
The Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, has become one of the top High Performance Computing (HPC) centres in Poland and supports approximately 1,000 users from Poland in the domains of Big Data, HPC, cloud services and storage. The popular meteo.pl weather portal has been using weather forecasts computed at ICM. ICM researchers study problems related to civil aviation (collaborating with ICAO) and the modelling of social processes, and have most recently been working on the ICM Epidemiological Model for the COVID-19 epidemic in Poland. ICM took part in securing access for Polish scientists to the entire body of scientific literature, including over 8,000 journal titles, by maintaining the Virtual Library of Science. The ICM networking team has participated in a number of cutting-edge networking solutions, for both high-throughput and low-latency requirements. Check ICM’s projects here: https://expodubai.icm.edu.pl/
SysOps/DevOps Polska
The SysOps/DevOps Polska Foundation (SO/DO for short) is the largest community of system and network administrators, DevOps engineers and other IT specialists in Poland. For over 9 years we have brought together experts who help each other solve problems at work and turn that into offline relationships. Today it is a community of more than 30,000 professionals who share their experience and insights.
SO/DO's mission is education and community building. In addition to Facebook groups, regular in-person meetups, hundreds of talks on YouTube and training sessions, we also run other non-standard projects.
The London Java Community (LJC)
The London Java Community (LJC) is a group of Java enthusiasts who are interested in benefiting from shared knowledge in the industry. Through our forum and regular meetings you can keep in touch with the latest industry developments, learn new Java (and other JVM) technologies, meet other developers, discuss technical and non-technical issues, and network further throughout the Java community.
The National Information Processing Institute
The National Information Processing Institute is an interdisciplinary scientific institute and a leader in software development for Polish science and higher education. We gather knowledge on almost every Polish scientist, their projects, and their research apparatus. Gathering, analysing, and compiling information on the research and development sector allows us to influence the direction of Polish scientific policy. We develop intelligent information systems both for the public sector and for commercial use. The key areas of research at the institute include: machine learning algorithms, natural language processing algorithms, sentiment analysis, neural networks, discovering knowledge from text data, human-computer trust, computer-assisted decision-making systems, and artificial intelligence. Our research is driven by interdisciplinarity and is conducted in seven laboratories, which employ specialists in various fields. Our team of information technology experts is supported by economists, sociologists, lawyers, statisticians, and psychologists. This fusing of different scientific approaches is conducive to in-depth analysis of research issues and is a driving force for innovation.
Warsaw Data Tech Talks
Warsaw Data Tech Talks (formerly the Warsaw Hadoop User Group, WHUG) was one of the first European Hadoop supporters. Since the group began operating in April 2012, we have successfully organized 36 meetings and convinced 1,974 people to join the group. We have had the pleasure of hosting companies such as Spotify, Criteo, GetinData, dataArtisans, GridGain, TouK and many others.
Mark Lyons
Chief Product Officer, Vertica
The future is unified
Many enterprises are re-thinking their data analytics strategy. Some plan to stay on-prem, others are all in for full-cloud, and still others require a hybrid approach. They are adopting object stores independently while unifying their data analytics platforms and supporting the broadest deployment models. We talked to Mark Lyons, Chief Product Officer at Vertica and one of the keynote speakers at Big Data Tech Warsaw 2021, about a new era of data analytics, the hybrid and cloud-agnostic approach, unified solutions and the future of big data analytics.
IT is more and more hybrid these days. What does it mean from the big data analytics perspective? What are the main challenges of adding cloud to the mix?
Mark Lyons [ML]: One of the big challenges of moving to a more hybrid data architecture is when the way data is analyzed in the cloud doesn’t match the way your company is accustomed to analyzing data on premises. When the analytics consumers in an organization have to change the way they do things depending on where the data is stored, it causes problems for everyone. When you have weird restrictions like “egress fees” that make it cost the company money to move data from one location to another, this can also be a big barrier to making analytics ubiquitous and decisions data-driven.
Does the multi-cloud approach present additional challenges?
ML: The key to making a multi-cloud approach work, other than having an analytics platform that works on multiple clouds, is having a single pane of glass to manage it all. If you can manage analytics clusters from one interface, it vastly simplifies things. Where you’re doing the analysis, whether it’s on AWS, Google or Alibaba, doesn’t matter nearly as much if you can spin clusters up and down, troubleshoot, optimize, and otherwise track things regardless of location. That simplifies hybrid as well, if your single management interface works without regard to deployment platform.
And what are the other main challenges the customers or users face?
ML: Change is the biggest challenge any company with any longevity faces. If you’re a tiny startup, you might grow exponentially and have to grow your data architecture to match. If you’re all in the cloud, some new regulation may require you to move your analytics on-premises. That report you’ve been generating weekly, well, now the C-suite wants to see it updated hourly, oh, and can you have it project a prediction forward three weeks? The only constant in data management is that nothing stays constant.
How has Vertica evolved to help businesses tackle these challenges?
ML: Vertica is a single, software-only code base, a single RPM if you will. This means that it works exactly the same on-premises as on any cloud – AWS, Google, Azure or Alibaba. It has a single Management Console for all of an organization’s databases, regardless of location. It even allows you to hibernate a database – shut all compute down so it just stores data – on-premises, and revive it in the cloud, hibernate it in one cloud, revive it in another. The database works, regardless of deployment environment.
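The hibernate/revive idea rests on separating ephemeral compute from durable storage. As a purely conceptual sketch (this is not Vertica's actual API or tooling; every class and method name below is hypothetical), the mechanics could be pictured like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectStoreCatalog:
    """Durable state: data files and catalog live in S3/GCS/on-prem object storage."""
    location: str                      # e.g. "s3://analytics-bucket/db" (illustrative)

@dataclass
class ComputeCluster:
    """Ephemeral state: stateless nodes that can run on any platform."""
    platform: str                      # "on-prem", "aws", "azure", ...
    nodes: int

class Database:
    def __init__(self, catalog: ObjectStoreCatalog):
        self.catalog = catalog
        self.compute: Optional[ComputeCluster] = None

    def hibernate(self) -> None:
        # Shut all compute down; only the object store keeps the data.
        self.compute = None

    def revive(self, platform: str, nodes: int) -> None:
        # Point a fresh compute cluster at the same durable catalog.
        self.compute = ComputeCluster(platform=platform, nodes=nodes)

db = Database(ObjectStoreCatalog("s3://analytics-bucket/db"))
db.revive(platform="on-prem", nodes=3)   # run analytics on-premises
db.hibernate()                           # pay only for storage
db.revive(platform="aws", nodes=6)       # wake the same data up in the cloud

The design point the sketch tries to capture is that only the object-store location is long-lived; the compute cluster is disposable, which is what makes moving between on-premises and any cloud possible.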
What is next? How do you see the future of big data analytics? How will Vertica evolve?
ML: The concept of a separate data lake and data warehouse, a separate analytics platform on-premises and on the cloud, a separate platform for business intelligence and data science, these are all becoming obsolete. A single analytics solution that works for whatever analytics your organization needs to do, reaches whatever data you need to analyze, and deploys wherever you need to work, that’s what we’re becoming.
What technologies do you see as key drivers of change in this area? AI/ML? Automation? RPA? HPC? Exascale?
ML: Machine learning and advanced analytics like time series analysis and geospatial analysis are the future. Vertica is already leading the market in these capabilities, but we intend to expand further in that area. Expect to see more automation to make Vertica simpler to manage and deploy, and even more added to our already market-leading analytics, no matter what deployment model you want to use.
Krzysztof Gawroński
AppDynamics, Cisco Architect for the EMEA region, Cisco
The smart way to intelligently monitor applications
Business digitalization transforms application performance monitoring. In an interconnected, increasingly complex, ever-growing, and globalizing environment, the traditional approach is no longer enough. To keep up with the rate of change, organizations desperately need highly automated, AI-driven open platforms for IT operations. AppDynamics is an intelligent, flexible, and massively scalable SaaS platform that provides Big Data infrastructure components to handle large numbers of events, metrics, and metadata - says Krzysztof Gawroński, AppDynamics, Cisco Architect for the EMEA region.
What are today’s key challenges for application monitoring?
Krzysztof Gawroński [KG]: Let me start with complexity. Applications are deployed in hybrid, multi-cloud, or data-center environments. Today an application has on average about 15 internal or third-party APIs, so complexity keeps growing. Then we have a growing number of cloud and SaaS initiatives, which creates more and more dependence on the internet. Many organizations adopt a SaaS-first approach, so applications are migrated from data centers to public clouds. Today’s business environment is hybrid, composed of external cloud services and on-prem software. We also live in a global economy. The number of end users is constantly growing, and another aspect is the Covid-19 crisis. Many people started to work from home, which increases the challenge of monitoring applications, because users access the company’s systems from home and mobile devices. Because of that, we must monitor not just internal systems and networking, but also end-user devices and internet performance. There is much more data that needs to be collected and understood.
How can AIOps help?
KG: AIOps is essentially big data plus machine learning and artificial intelligence functionality that helps IT organizations process data efficiently. The biggest problem is analyzing vast amounts of data, and AI reduces the human resources required to analyze data and find issues. It dramatically accelerates root cause analysis and makes it more accurate. It can also help identify problems before they have a significant impact on business or end users. Finally, AI helps with the consolidation and aggregation of alerts and monitoring tools.
What are the building blocks for AIOps?
KG: We need to collect this data, and it comes from many technologies. So we need a scalable metrics ingestion module and, of course, a data platform to store all this information. We also need a query platform to be able to access the data that we collect, and a user interface platform for our operators to find all the interesting information related to the data. We need machine learning or artificial intelligence that will analyze the data and inform us about anomalies, or maybe discover patterns that are interesting from a business perspective. What is also important is an action platform, or APIs via which we can integrate AppDynamics with external systems.
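Taken together, these building blocks form a simple pipeline: ingest metrics, store them, query them, run an anomaly check, and trigger an action. A minimal sketch of that pipeline is shown below; metric names, thresholds and the alert hook are illustrative assumptions, not AppDynamics' actual APIs.

import time
from collections import defaultdict
from statistics import mean, stdev

class MetricStore:
    """Data platform: keeps (timestamp, value) points per metric name."""
    def __init__(self):
        self._points = defaultdict(list)

    def ingest(self, name, value, ts=None):
        self._points[name].append((ts or time.time(), value))

    def query(self, name):
        return [v for _, v in self._points[name]]

def is_anomalous(history, current, sigmas=3.0):
    """Stand-in for the ML component: flag points far from the historical mean."""
    if len(history) < 10:
        return False            # not enough history to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(current - mu) > sigmas * sd

def act(metric, value):
    """Action platform: here just print; in practice call an external API or webhook."""
    print(f"ALERT: {metric} looks anomalous at {value}")

store = MetricStore()
for v in [102, 98, 101, 99, 100, 103, 97, 100, 101, 99]:
    store.ingest("checkout.response_time_ms", v)   # hypothetical metric name

latest = 250
if is_anomalous(store.query("checkout.response_time_ms"), latest):
    act("checkout.response_time_ms", latest)
store.ingest("checkout.response_time_ms", latest)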
AppDynamics is an AIOps SaaS cloud platform. What does its architecture look like?
KG: If you evaluate or select AppDynamics, it can be deployed on premises or in a public or private cloud, but we encourage customers to go with the AppDynamics SaaS deployment model. It is a more flexible solution. Processing big data efficiently requires a scalable cloud foundation. We are in the AWS cloud. Agents send metrics via F5 firewalls. We also use, just to name a few, Kafka, Kubernetes, and Elasticsearch storage. All these technologies support containers or clusters and provide us with flexibility and scalability. Our platform is also open, so it is easy to integrate with other solutions or products.
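On the ingestion side, one of the technologies named above, Kafka, is commonly fed by lightweight producers. As an assumed example only (the topic name, payload fields and broker address are illustrative, and this is not AppDynamics' agent protocol), publishing a metric point from Python could look like this:

# Minimal sketch of publishing one metric point to a Kafka topic, as one
# building block of a scalable ingestion layer.
# Requires: pip install kafka-python

import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

metric_point = {
    "name": "checkout.response_time_ms",             # hypothetical metric name
    "value": 118.4,
    "timestamp": time.time(),
    "tags": {"app": "web-shop", "tier": "backend"},  # illustrative payload shape
}

producer.send("metrics", metric_point)               # "metrics" topic is an assumption
producer.flush()                                     # make sure the point is delivered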
How is AI or machine learning used in AppDynamics?
KG: We use machine learning extensively. A good example is automatic baselining for all collected APM metrics and advanced business analytics. For every APM, custom or BiQ-related metric, we want to know what the expected value should be. Our machine learning engine calculates the expected level for every metric hour by hour, based of course on historical data. There are different flavors of baselines. It can be done on a daily basis, which tells us what is normal for a given metric for a given hour of the day. It can also be a weekly baseline, because what is normal at nine o'clock on Monday can be different from nine o'clock on Saturday or Sunday.
Once the baseline is established, we can calculate the deviation from the norm. This is fully dynamic. There is no static threshold here, and it is automatically calculated for thousands or even millions of metrics, out of the box. There is no need to set it up. If anything deviates from the norm by more than a given value, an alert can go off.
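As a rough illustration of that idea (not AppDynamics' actual engine; the weekly bucketing and the three-sigma rule below are assumptions), a per-hour-of-week baseline and a dynamic deviation check could be sketched like this:

from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev

def weekly_key(ts: datetime):
    # "Nine o'clock on Monday" can differ from nine o'clock on Saturday or Sunday.
    return (ts.weekday(), ts.hour)

def build_baseline(history):
    """history: iterable of (datetime, value) pairs. Returns per-bucket (mean, std)."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[weekly_key(ts)].append(value)
    return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0) for k, v in buckets.items()}

def deviation(baseline, ts, value):
    """How many standard deviations the value sits from the expected level."""
    mu, sd = baseline.get(weekly_key(ts), (value, 0.0))
    return 0.0 if sd == 0 else abs(value - mu) / sd

# Usage: fire an alert when the metric drifts more than 3 sigmas from what is
# normal for this hour of this weekday (the threshold is an illustrative choice):
# if deviation(baseline, datetime.now(), current_value) > 3.0: fire_alert()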
What happens when an anomaly is detected?
KG: We collect so-called snapshots, which are yet another type of data, next to metrics. Snapshots are a representation of the state of the systems or applications at the time an anomaly was detected. We only store snapshots of transactions that happen during an anomaly.
Traditionally the problem was that even if anomalies were automatically detected and snapshots were stored, IT operations had to analyze all these snapshots. It was not easy to conclude what the real issue was.
AI comes into the picture again. Now all these snapshots are analyzed. Automated transaction diagnostics identify precisely where the bottleneck is, which dramatically speeds up finding the root cause. The efficiency of drilling down into performance data is also greatly improved thanks to advanced dynamic visualizations.
What type of visualizations do you use?
KG: We use many different visualizations: application flow maps showing all application components, user experience journey maps, business journey widgets, or funnels. You can choose a flow map of all the components of your application, a flow map of one specific business transaction, or a flow map within a snapshot. We can also drill down from these views. Maps are built fully dynamically and automatically.