Here are some highlights of the presentations from the 2021 edition!
Data Quality with 100+ PB: Solved Challenge at Criteo
Data quality is paramount (we all agree on that) and isn't straightforward even with small data sets. When working with over 120 PB of data on Hadoop and thousands of jobs, I can tell you firsthand, it's a challenge! We started to tackle this at Criteo two years ago, and I have some tangible results I'll be happy to share. We'll go through this journey, from collecting data, to detecting suspect behaviors, to alerting users about data quality incidents, while integrating these new checks into the Criteo Data Platform.
Criteo
How to build a state-of-the-art weather forecasting AI service
Weather forecasting is important in many fields, and minor improvements in accuracy can have a considerable business impact. Today, weather forecasting is performed using computationally expensive mathematical models based on the Navier-Stokes and mass continuity equations, the first law of thermodynamics, and the ideal gas law. These models simulate the physical world and run on expensive supercomputers. Alternatively, AI can be used to learn from data and produce forecasts in a fraction of the time the physical simulations require, and in many cases with a higher degree of accuracy. In this presentation, I will show how AI was used to produce competitive forecasts, using state-of-the-art AI models and neural architecture search, and how I used React to prototype a weather forecasting service.
Peltarion
Foundations of Data Teams
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure. This talk will cover the importance of a solid foundation and what management should do to fix it.
Keywords: #management #datateams #dataengineers #datascientists #operations
Big Data Institute
Presto: SQL-on-Anything & Anywhere
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale at organizations including Airbnb, Comcast, Facebook, FINRA, GrubHub, LinkedIn, Lyft, Netflix, Twitter, Uber, and Zalando, Presto has experienced unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL, and RDBMS data stores.
Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in Object Storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines. In particular, Starburst developed a native integration for Presto that leverages Delta-specific performance optimizations.
Join this session to hear how Starburst Presto deployed on Azure Kubernetes Service (AKS) serves as a fast SQL query engine against data in ADLS, and enables query-time correlations between IoT data in Delta Lake, customer data in SQL Server, and web log data in Elasticsearch.
You will also gain best-practice and real-life insights and lessons learned from a production deployment of this integration.
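To make the "query-time correlation" idea concrete, here is a minimal sketch of a federated query issued through the Presto Python client. All catalog, schema, table, and column names are invented for illustration; the talk's actual schemas are not public.

```python
# Hypothetical federated query via the Presto Python client (presto-python-client).
# Catalog, schema, table, and column names are illustrative only.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="delta",
    schema="iot",
)
cur = conn.cursor()
# One query correlating three sources, each exposed as a Presto catalog:
# Delta Lake on ADLS, SQL Server, and Elasticsearch.
cur.execute("""
    SELECT c.customer_id,
           count(*)            AS sensor_events,
           count(w.request_id) AS web_hits
    FROM delta.iot.sensor_readings r
    JOIN sqlserver.crm.customers c ON r.device_owner = c.customer_id
    LEFT JOIN elasticsearch.logs.web_requests w ON w.user_id = c.customer_id
    GROUP BY c.customer_id
""")
for row in cur.fetchall():
    print(row)
```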
Keywords: #presto #sql #analyticsanywhere #azure
Starburst
Make it personal: reinforcement learning for mere mortals
During this session we will reflect upon the importance of personalization in e-commerce. What challenges arose as a result of bridging the gap between Google's AlphaGo and the real world? Furthermore, we will discuss Vowpal Wabbit, the Swiss army knife of ML algorithms.
To sum up, we will introduce a case study exercise, during which participants will create a personalized user experience on a webpage.
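For a taste of what Vowpal Wabbit's contextual-bandit mode looks like, here is a minimal sketch using the VW Python bindings (pyvw API as of the ~2021 releases). The feature names, actions, and costs are made up, not from the talk.

```python
# A minimal contextual-bandit sketch with Vowpal Wabbit's Python bindings.
from vowpalwabbit import pyvw

vw = pyvw.vw("--cb_explore 2 --epsilon 0.1 --quiet")

# Learn from logged interactions, format: "action:cost:probability | features"
# (lower cost is better; action 2 was shown with probability 0.5 and rewarded).
vw.learn("1:0.0:0.5 | segment=returning device=mobile")
vw.learn("2:-1.0:0.5 | segment=new device=desktop")

# Predict a probability distribution over the 2 actions for a fresh context.
pmf = vw.predict("| segment=new device=mobile")
print(pmf)  # e.g. [0.05, 0.95] -- epsilon-greedy exploration probabilities
```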
Keywords: #personalization #ecommerce #vowpalwabbit #reinforcementlearning #opensource
eBay Classifieds Group
BigFlow – A Python framework for data processing on the Google Cloud Platform
You will learn about a tool that can improve your big data projects on GCP. Unified structure, configuration, versioning, build, deployment, and more, available for Dataflow/Dataproc/BigQuery.
Keywords: #gcp #python #dataflow #dataproc #bigquery
Allegro
Battle lessons for machine data in an Oil Refinery
During this session, we will review the machine data scenario in oil refineries, spreading the knowledge out to SMEs. We will contemplate the myth of industrial interoperability. Furthermore, we will find out how to store the data without spending a fortune, and answer the analytics question: Excel vs R?
Keywords: #OPC #timeseries #cloudopex #networksniffing #Excel
CEPSA
Datumize
The Scalable Gaming Analytics Pipeline at Outfit7: The Next Generation
Have you ever wondered how gaming companies build their analytics pipelines? Particularly scalable ones that are able to collect terabytes of data every day? At Outfit7, this is done with a little help from Google Cloud's top services, including Kubernetes, Dataflow, BigQuery, and Cloud Composer. In this presentation, you'll see how the pipeline is built, starting from ingestion in Kubernetes, through to ending in Jupyter, Tableau, and other BI dashboards. You’ll also find out how the team fights downtime with proactive monitoring and integration tests. And last but not least, you’ll hear about the challenges that Outfit7 faced when the amount of data it had to handle skyrocketed during the peak of the COVID-19 quarantine.
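As a rough illustration of how Cloud Composer can tie such stages together, here is a minimal Airflow 2.x DAG sketch; the DAG ID, task names, and task bodies are placeholders, not Outfit7's actual pipeline code.

```python
# An illustrative Cloud Composer (Airflow 2.x) DAG wiring ingestion-style
# stages together; all names and bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_dataflow_ingestion(**_):
    ...  # e.g. launch the Dataflow job that processes raw ingested events

def load_into_bigquery(**_):
    ...  # e.g. load the processed output into partitioned BigQuery tables

with DAG(
    dag_id="gaming_analytics_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="dataflow_ingestion",
                            python_callable=run_dataflow_ingestion)
    load = PythonOperator(task_id="bigquery_load",
                          python_callable=load_into_bigquery)
    ingest >> load  # downstream BI tools and notebooks read the BigQuery tables
```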
Keywords: #googlecloud #bigquery #events #scalable
Outfit7
Casting the Spell: Druid in Practice
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use-cases, including inflight analytics, reporting, and building target audiences. The common challenge of these use-cases is counting distinct elements in real-time at scale. We've been using Druid to solve these problems for the past 5 years, and have gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years. We will cover the following topics:
* Data modeling
* Ingestion
* Retention and deletion
* Query optimization
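As a flavor of the distinct-counting use-case above, here is a hedged sketch that runs an approximate distinct count through Druid's SQL HTTP endpoint; the broker address, datasource, and column names are invented for illustration.

```python
# Approximate distinct counts via Druid's SQL endpoint (/druid/v2/sql).
import requests

sql = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour,
       APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM events
GROUP BY 1
ORDER BY 1
"""
resp = requests.post("http://druid-broker:8082/druid/v2/sql",
                     json={"query": sql}, timeout=30)
resp.raise_for_status()
for row in resp.json():
    print(row)  # e.g. {'hour': '2021-02-25T13:00:00.000Z', 'unique_users': 1042}
```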
Keywords: #BigData #ApacheDruid #RealtimeAnalytics #DataArchitecture #DataEngineering
Nielsen Identity
Imply
Running Apache NiFi in the shadow of cloud-native applications
Applications can fall short of the requirements of cloud-native in reasonable ways and for good reasons. In this talk I discuss the properties of cloud-native applications and look at where Apache NiFi can satisfy these criteria and be deployed, configured, and extended to be cloud-smart if not cloud-native. The audience will come away with in-depth knowledge about running Apache NiFi at scale in the cloud as well as a general understanding of considerations for running other non-cloud-native applications in public clouds.
Keywords: #apachenifi #nifi #cloudnative #cloud #oss
Microsoft
Artificial Intelligence - Building an in-house team from scratch in Philip Morris International
This session covers:
* Building an AI team from scratch in a big corporation
* The challenges faced during its adoption across the organization
* The technology stack and use of the cloud that help us deliver fast, supported by CI/CD pipelines
Keywords: #AI #DL #AIBusiness #Innovation #Productivity #Disruption
Philip Morris International
PMI
CICD Pipeline and delivery of Apache Spark Applications on the cloud using AWS
The session will start with a quick background introduction to the CSU data lake architecture and DataOps framework, where we are going to discuss the principles of CI/CD and a process overview, plus the development, unit, and integration testing pipeline. Furthermore, we will walk through the process and demonstrate how we use AWS CodeCommit and CodeBuild to automate testing and code coverage. The next part of the session will focus on the production deployment pipeline, featuring an overview of the process and a demonstration of how we use AWS CodeCommit, CodeBuild, and CodePipeline to deploy Spark applications to the production environment.
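To illustrate the kind of unit test such a CodeBuild stage might execute, here is a minimal pytest-based PySpark sketch; the dedupe_events transformation is invented for illustration, not taken from the CSU codebase.

```python
# A pytest-based PySpark unit test runnable on a local SparkSession.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return (SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate())

def dedupe_events(df):
    # Transformation under test: keep one row per event_id.
    return df.dropDuplicates(["event_id"])

def test_dedupe_events(spark):
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")], ["event_id", "payload"])
    assert dedupe_events(df).count() == 2
```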
Keywords: #Automation #CICD #HigherEd #Spark
California State University
Modern radars: from simple signal processing towards modern complex data analytics with deep learning
During this session, we will concentrate on the status of data science & AI in geophysics/geology. We will discuss and present analytical and software challenges related to multidimensional radar data. The session will conclude with an evaluation of applying open-source big data technologies to solve complex analytical problems.
Keywords: #radars #deeplearning #bigdata #opensource
SGPR.TECH
Common mistakes that make your chart hard to understand, with practical solutions to avoid them
In the world of big data, data visualization tools and technologies are essential for analyzing massive amounts of information. Although data visualizations are commonly used, they are often inaccurate and misleading. To support data-driven decisions, it's crucial to create reliable charts that leave no room for misunderstanding. These mistakes can be easily avoided, so let me show you how!
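One classic, easily avoided mistake is the truncated y-axis. The sketch below (with made-up numbers, not from the talk) contrasts a misleading chart with a reliable one.

```python
# Truncated y-axes exaggerate small differences; the numbers are made up.
import matplotlib.pyplot as plt

labels = ["Q1", "Q2"]
values = [98, 100]

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(8, 3))
ax_bad.bar(labels, values)
ax_bad.set_ylim(97, 101)   # truncated axis: Q2 looks several times larger
ax_bad.set_title("Misleading")
ax_good.bar(labels, values)
ax_good.set_ylim(0, 110)   # baseline at zero: an honest comparison
ax_good.set_title("Reliable")
plt.tight_layout()
plt.show()
```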
Keywords: #datavisualization #datadesign #dataliteracy #uidesign #graphicacy
Freelancer
Evolving Bolt from batch jobs to real-time stream processing - migration, lessons learned, value unleashed
We would like to invite you to discuss how Bolt migrated from batch, synchronous processing to real-time, asynchronous processing. During our session we will review and evaluate the obstacles we faced along the way and the lessons we have learned. We will also focus on the value unleashed by real-time data.
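For a sense of what the asynchronous side of such a migration looks like, here is a minimal consumer sketch using the confluent-kafka Python client; the broker address, topic, group, and handler are placeholders, not Bolt's actual code.

```python
# A minimal stream-consumer loop with the confluent-kafka client.
from confluent_kafka import Consumer

def handle_event(payload: bytes) -> None:
    ...  # application-specific: enrich, score, or forward the event

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "ride-events-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ride-events"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # React to each event as it arrives instead of waiting for a batch run.
        handle_event(msg.value())
finally:
    consumer.close()
```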
Keywords: #kafka #streaming #data #realtime
Bolt
Top 5 Spark anti-patterns that will bite you at scale!
This session looks at real-world systems that use Java big data technologies such as Spark, Hadoop, Cassandra and Kafka, and examines the comedic and sometimes disastrous effects when the code is executed. Session attendees will walk away with an enhanced understanding of how to work with and effectively use these technologies.
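To give one concrete example of the genre (not necessarily one of the talk's five), here is a common Spark anti-pattern and its fix, sketched in PySpark.

```python
# Anti-pattern: collecting a large dataset to the driver and aggregating in
# plain Python, instead of letting Spark aggregate in a distributed fashion.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# Anti-pattern: df.collect() materializes every row on the driver and will
# run out of memory at real scale.
# totals = {}
# for row in df.collect():
#     totals[row["key"]] = totals.get(row["key"], 0) + 1

# Scalable version: the aggregation runs distributed across executors.
totals = df.groupBy("key").count()
totals.show(5)
```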
Keywords: #java #bigdata #spark #hadoop
Streaming SQL - Be Like Water My Friend
Data has to be processed fast, so that a firm can react to changing business conditions in real time. Streaming SQL gives us the possibility to make stream processing available to a broader audience, and it also makes it easier to access data streams. This presentation will not only give you a brief overview of the data and streaming architecture at InnoGames but also introduce you to the idea of Streaming SQL in general and how it is implemented in Apache Flink. Furthermore, it shows actual examples of how to use Flink SQL, so that you are hopefully inspired to consider this rather new technology to tackle your own data challenges.
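For a taste of Streaming SQL, here is a hedged sketch using recent PyFlink releases (1.13+ Table API); the Kafka connector options and table schema are invented, not InnoGames' actual setup.

```python
# A Flink SQL sketch with PyFlink's Table API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare a Kafka-backed stream as a SQL table with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE player_events (
        player_id STRING,
        event_type STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'player-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# A continuous query: logins per player per one-minute tumbling window.
t_env.execute_sql("""
    SELECT player_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS logins
    FROM player_events
    WHERE event_type = 'login'
    GROUP BY player_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()
```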
Keywords: #streaming #streamingsql #flink #dataflow #flinksql
InnoGames GmbH
How to plan the unpredictable? 7 good practices of organisation and management of fast-paced large-scale R&D projects
During this session, we aim to review the technical and organisational challenges we faced while building a complex AI-based app with a short time-to-market. How was it additionally influenced by the dispersion of the involved teams around the world (10 time zones)? What unpredictable events affected our plans, e.g. the COVID pandemic or the development vendor changing in the middle of the project? We will evaluate examples of failures and successes, and the lessons we have learned in the process. We invite you to a broader discussion.
Keywords: #datascience #ai #machinelearning #agile #projectmanagement
Pearson
Predicting effectiveness of marketing campaigns on Facebook platform
We will learn about data acquisition, in particular the Facebook Marketing API data structure, while defining the target variable. Furthermore, this session aims to discuss a wide-and-deep architecture for modeling categorical, text, and image data together, as well as global XAI with Shapley values and local XAI with Anchors.
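As a rough sketch of what a wide-and-deep model can look like in Keras (a generic illustration, not Sotrender's actual architecture; all feature dimensions are invented):

```python
# A generic wide-and-deep binary classifier in Keras.
from tensorflow.keras import layers, Model

wide_in = layers.Input(shape=(500,), name="wide_crosses")    # sparse crosses
deep_in = layers.Input(shape=(64,), name="deep_embeddings")  # dense embeddings

deep = layers.Dense(128, activation="relu")(deep_in)
deep = layers.Dense(64, activation="relu")(deep)

# The wide part memorizes feature crosses; the deep part generalizes.
merged = layers.concatenate([wide_in, deep])
output = layers.Dense(1, activation="sigmoid", name="campaign_effective")(merged)

model = Model(inputs=[wide_in, deep_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```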
Keywords: #WideandDeep #XAI #Explainability #Facebook #Marketing
Sotrender
Building scalable and testable data pipeline through a data pipeline domain specific language
Data pipeline architecture, design, and builds have similar concerns as any software product development. The purpose of the presentation is to uncover those concerns and present one of the solutions. The presentation covers aspects of data pipelines such as:
1. Configuration-driven composable data pipeline (see the sketch after this list)
2. Testable data pipeline through specification language such as Gherkin
3. Design of the pipeline to solve for the data pipeline concerns
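A minimal sketch of item 1, assuming each step is a named pure function over rows, composed in the order a config lists them; the step registry and config below are invented for illustration.

```python
# A configuration-driven composable pipeline in plain Python.
from typing import Callable, Dict, List

Row = dict
Step = Callable[[List[Row]], List[Row]]

REGISTRY: Dict[str, Step] = {
    "drop_nulls": lambda rows: [r for r in rows if None not in r.values()],
    "uppercase_names": lambda rows: [{**r, "name": r["name"].upper()} for r in rows],
}

def build_pipeline(step_names: List[str]) -> Step:
    """Compose registered steps in the order the config lists them."""
    def run(rows: List[Row]) -> List[Row]:
        for name in step_names:
            rows = REGISTRY[name](rows)
        return rows
    return run

# The "config" could equally come from YAML, or be driven by Gherkin scenarios.
pipeline = build_pipeline(["drop_nulls", "uppercase_names"])
print(pipeline([{"name": "ada"}, {"name": None}]))  # [{'name': 'ADA'}]
```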
Keywords: #DataEngineering #ComposableDataPipeline #SOLIDPrincipleInDataPipelineDesign #GHERKINandDataPipelineSpecification #BDDTDDinDataPipeline
Independent speaker
Data lineage and observability with Marquez
Data is increasingly becoming core to many products, whether to improve recommendations for users, to get insights into how they use the product, or to use machine learning to improve the experience. This creates a critical need for understanding how data is flowing through our systems. Data pipelines must be auditable, reliable, and run on time. Tracking lineage and metadata is the underlying foundation that enables many data-related use cases. It provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It enables governance and compliance, and generally helps you keep your data running. Marquez is an open-source project, part of LF AI, which instruments data pipelines to collect lineage and metadata and enable those use cases. It provides context by making dependencies visible across organisations and technologies, and enables lineage governance and discovery.
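As a hedged sketch of how lineage metadata can reach Marquez, the snippet below posts a minimal OpenLineage-style run event to a local Marquez lineage endpoint over plain HTTP; the URL, namespaces, job, and dataset names are invented for illustration.

```python
# Posting a minimal OpenLineage-style run event to a Marquez instance.
import uuid
from datetime import datetime, timezone
import requests

event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_orders_etl"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "marts.daily_orders"}],
    "producer": "https://example.com/my-pipeline",
}
resp = requests.post("http://localhost:5000/api/v1/lineage",
                     json=event, timeout=10)
resp.raise_for_status()
```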
Keywords: #lineage #observability #dataops
Datakin
Diftong: a tool for validating big data workflows
During this session, we will contemplate a complex data landscape: how do you make big decisions with confidence? Furthermore, regarding big data validation, we will answer the question: how do you build confidence in data? Finally, we will talk about the Diftong tool, which automates the validation process and allows for a more agile way of updating data transformation workflows with preserved data quality.
Keywords: #Dataworkflow #Dataquality #Datavalidation #Agiledatavalidationprocess #Datavalidationtool
Klarna Bank AB
Training and deploying machine learning models with Google Cloud Platform
In my presentation I would like to share some approaches, good practices, and Google Cloud components that we use at Sotrender to effectively train and deploy the machine learning models we use to analyze social media data. I will discuss which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Keywords: #mlops #gcp #python #nlp #computervision
Sotrender
AWS Serverless Pipelines
How to use AWS managed services for near real-time data processing, illustrated by the design of a recommendation system.
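A minimal sketch of one serverless stage, assuming events arrive through a Kinesis-triggered AWS Lambda (standard Kinesis event shape); the processing logic is a placeholder, not StepStone's actual design.

```python
# An AWS Lambda handler consuming records from a Kinesis stream.
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded inside the record.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        process_click_event(payload)

def process_click_event(payload: dict) -> None:
    ...  # e.g. update user features that feed the recommendation model
```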
Keywords: #serverless #cloud #AWS #microservices #recommendedsystems
StepStone Services
Telecom systems Degradation Prediction using ML
We predict the unavailability (service degradation/outage events) of a Charging System in advance, using historical fault & performance data with an ML model. This results in a highly improved user experience for pre-paid calls, a 30-40% reduction in man-hours for L2 support engineers on operations teams, and more efficient services that lead to higher customer retention.
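As a generic illustration of the modelling step (not Ericsson's actual pipeline), the sketch below trains a classifier on stand-in fault/performance features; all data here is synthetic, generated only to make the sketch runnable.

```python
# A classifier over synthetic stand-ins for fault/performance counters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Columns: alarms_last_hour, cpu_util, mem_util, failed_requests
X = rng.random((1000, 4))
y = (X[:, 0] + X[:, 3] > 1.2).astype(int)  # stand-in degradation label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))
```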
Keywords: #Telecom #Datascience #Operationsmanagement #ArtificialIntelligence #MLops
Ericsson India R&D
MLOps journey in H&M
In this session you will learn how H&M evolves a reference architecture covering the entire MLOps stack, addressing a few common challenges in AI and machine learning products, like development efficiency, end-to-end traceability, and speed to production. This architecture has been adopted by multiple product teams managing 100s of models across the entire H&M value chain; it enables data scientists to develop models in a highly interactive environment, and enables engineers to manage large-scale model training and model serving pipelines with full traceability.
The presenting team is currently responsible for ensuring that best practices and the reference architecture are implemented across all product teams to accelerate H&M Group's data-driven business decision-making journey.
Keywords: #MLOps #AIAtScale #MachineLearning #Engineering #DataScience
H&M
Building a complex stream-processing app with Kafka Streams
At Simply Business we have built a rich stateful application using Kafka Streams to manage the leads that will be served by our 300-person-strong UK call centre. This application combines X different data points from different services, operates many internal data stores, and has been in production for over 2 years. Thanks to the increased observability and the ability to ingest lead-scoring data from another streaming ML project, we have been able to increase the return of each call by over 60%.
Keywords: #data-streaming #kafkastreams #schemaregistry #domainevents #evolutionaryarchitecture
Simply Business
Building Data Ingestion Platform using Hadoop
We will cover the state of data platforms in the tech industry, ING WBDM's vision for the future of data ingestion, and highlights of the ING Data Ingestion Platform's main components and features. The Hadoop and FOSS revolution has reshaped the data engineering landscape, using virtual and physical machines to give life to a high-availability, disaster-recovery-ready platform. In the search for creating a cutting-edge data platform at ING, we are faced with challenging new requirements such as cloud-ready deployments into production, whilst ensuring proper data governance, risk, and security principles. Please join us in this session, where we will share ING WBDM's experience of how to make a data platform based on open-source components both enterprise- and cloud-ready, with an overview of the current state and vision of our platform.
Keywords: #dataingestion #hadoop #nifi
ING
AWS Spot instances price prediction - towards cost optimization for Big Data
Analytical data processing has become the cornerstone of today's business success, and it is facilitated by Big Data platforms that offer virtually limitless scalability. However, minimizing the total cost of ownership (TCO) for the infrastructure can be challenging. By analyzing spot instance price history using ARIMA models, it is feasible to leverage the discounted prices of the cloud spot market with a limited risk of analytical job termination. In particular, we evaluated savings opportunities when using Amazon EC2 spot instances compared to on-demand resources. During the presentation we show the evaluation of univariate spot price regression models to forecast future prices, and we confirm the feasibility of short-term spot price prediction using real data from AWS. This confirms cost savings opportunities of up to 80% compared to on-demand, within 1% of the absolute minimum.
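A hedged sketch of the univariate forecasting step with statsmodels' ARIMA; the price series below is synthetic, not real AWS spot-price history, and the (p, d, q) order is chosen only for illustration.

```python
# Univariate spot-price forecasting with an ARIMA model.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Stand-in for the hourly price history of one instance type in one AZ.
idx = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
prices = pd.Series(0.10 + 0.001 * np.random.randn(len(idx)).cumsum(), index=idx)

model = ARIMA(prices, order=(2, 1, 1))
fit = model.fit()
forecast = fit.forecast(steps=24)  # expected prices for the next 24 hours
print(forecast.head())
```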
Keywords: #TCO #CloudComputing #ARIMA #AWS #Spot
Nowa Era
CogniTrek Corp, MAGIX.AI, Saints Cyril and Methodius University
Data Philosophy in e-Commerce
Data is the key to building any component in e-commerce. The way of designing, gathering, storing, and using data directly affects the quality of the components we are building. This talk will use examples of real components in e-commerce companies to present some rules and principles of using data. Most examples in the talk are related to search, personalization, and recommendation.
Keywords: #dataalgorithms #ecommercesearch #personalization #recommendation
Managing Big Data projects in a constantly changing environment - good practices, use cases
Big Data projects are nowadays one of a kind: they are not like the data warehousing initiatives of the old days, nor like cloud-native application projects, at least not yet. A variety of technologies, complicated architectures, and a rapidly changing landscape are just a few of the challenges that the IT department faces in such projects. When you add the number of stakeholders involved from different departments, and the fact that a Big Data project is sometimes more like R&D with an unpredictable outcome, this makes a mix where the objectives can easily be lost. It is not a surprise that up to 85% of Big Data projects were pure failures (Gartner 2016).
In this talk we will share our experience in planning and executing Big Data initiatives in organisations, with some use cases and good practices in mind.
Keywords: #agile #teammanagement #goodpractices #usecases
GetInData
February 23, 2021 - Big Data Technology Warsaw Summit Workshops Day 1
9:00 - 13:00 Workshops part I
20:00 - 21:00 Evening meeting (speaker's presentation + discussion)
February 24, 2021 - Big Data Technology Warsaw Summit Workshops Day 2
9:00 - 13:00 Workshops part II
20:00 - 21:00 Evening meeting (speaker's presentation + discussion)
February 25, 2021 - Big Data Technology Warsaw Summit Day 1
12:30 - 13:00 Networking online
13:00 - 14:00 Plenary Session
14:00 - 15:10 Simultaneous sessions Part I (2 presentations in 4 parallel tracks)
15:15 - 16:10 Roundtables sessions Part I
16:15 - 17:25 Simultaneous sessions Part II (2 presentations in 4 parallel tracks)
17:25 - 17:30 Summary
19:30 - 20:30 Evening meeting, including Prize Giveaway
February 26, 2021 - Big Data Technology Warsaw Summit Day 2
9:00 - 9:30 Plenary Session (1 presentation x 25 minutes)
9:30 - 10:40 Simultaneous sessions Part I (2 presentations in 4 parallel tracks)
10:40 - 11:50 Simultaneous sessions Part II (2 presentations in 4 parallel tracks)
12:00 - 12:55 Roundtables sessions Part II
12:55 - 13:45 Plenary Session (2 presentations x 20 minutes)
13:45 - 13:55 Closing & Summary, Prize Giveaway
In the Big Data Technology Warsaw Summit 2021 agenda, presentations belong to one of the following tracks:
Architecture, Operations, and Cloud
This track is dedicated to architects, administrators, and experts with DevOps skills who are interested in technologies and best practices for designing, building, operating, and securing their Big Data infrastructures and platforms in enterprise environments, both on-premises and in the cloud.
Data Engineering
This track is the place for engineers to learn about tools, techniques, and battle-proven solutions to collect, store, and process large amounts of data. It covers topics like data collection, ingestion, ETL, job scheduling, metadata and schema management, distributed processing engines, distributed datastores, and more.
Streaming and Real-Time Analytics
This track covers technologies, techniques, and valid use-cases for building event streaming systems and implementing real-time applications that enable actionable insights and interactions not previously possible with classic batch systems. This includes solutions for data stream ingestion and applying various real-time algorithms and machine learning models to process events coming from IoT sensors, devices, front-end applications, and users.
MLOps
This new track focuses on the full life-cycle of ML models, from experimentation and feature engineering, through model training, to productization. It describes real-world use-cases, technologies for building your own AI/ML platforms and feature stores, as well as other technical challenges that need to be solved to avoid hidden technical debt in ML projects.
AI, ML and Data Science
This track includes real-world case studies demonstrating how data & technology are used together to address a wide range of complex problems in the domain of machine learning, artificial intelligence, and data science.
Data Analytics, BI & Visualisation
This track focuses on day-to-day analytics including SQL & Python-based solutions for data analytics, productive BI solutions as well as convenient tools for data visualization.
Data Strategy and ROI
This track is for data and business professionals who are interested in learning how data and analytics can be used to generate growth, added value, and positive financial impact. It will contain presentations about real-world use cases that cover useful data-focused solutions, new business models, and various data monetization strategies. Presentations will also explain the technical, cultural, and leadership aspects that are key to successful Big Data initiatives at enterprises, avoiding wasted money and achieving a positive return on investment (ROI).
Academia, the incubating projects, and POCs
This track contains presentations about innovative use-cases and solutions that are still in the research or incubating phases and that, when eventually completed, can inspire the community or positively impact the big data & cloud landscape.