JOIN Online Webinars and WIN a ticket for Big Data Tech 2023!
Big Data Technology Warsaw Summit is around the corner, but before we meet in Warsaw, we are happy to invite you to two free online webinars!
We have prepared two events, each featuring two presentations in the field of data, analytics, ML and cloud. On February 16th we will host guests from Big Data Institute and Dremio.io, and on March 9th the experts from GetInData | Part of Xebia. Please check the details of the presentations below.
During the events you will not only be able to exchange knowledge with the experts, but also take part in our competition and WIN an invitation to the Big Data Technology Summit 2023, plus access to the recordings from the last edition!
Sign up once and get an invitation to both events!
Data Lakes have been built to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard has been the Hive table format, which addresses some of these problems, but falls short at data, user, and application scale.
Apache Iceberg is an open table format designed specifically to address these problems. However, querying hundreds of petabytes demands optimized query speed, especially as data accumulates over time: you may end up with a lot of small files, and your data may no longer be optimally laid out for queries.
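To see why small files hurt, it helps to think of compaction as a grouping problem: many small files are rewritten into fewer files near a target size. The sketch below is illustrative only - real Iceberg compaction is driven by the `rewrite_data_files` maintenance procedure and also accounts for partitions and delete files; the greedy grouping here is a simplified stand-in:

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into rewrite tasks of roughly target size.

    Illustrative sketch only -- not Iceberg's actual planning algorithm.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 100 files of 8 MB each collapse into just 2 rewrite tasks
# near the 512 MB target, instead of 100 tiny reads per scan.
plan = plan_compaction([8] * 100)
```

Fewer, larger files mean fewer file-open operations and less planning overhead per query, which is the core benefit compaction delivers.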
In this talk we will go through the various data and file optimization strategies available by default in Apache Iceberg - such as compaction, hierarchical sorting, and Z-order clustering - that help achieve robust performance in data lakes.
Specifically, we will cover:
- Small file problem in Iceberg: Compaction strategy
- Reorganization of data within data files
- Sorting, Hierarchical sorting
- Problems with normal sorting strategies
- Z-order clustering for multiple dimensions
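The core idea behind Z-order clustering is to interleave the bits of several column values into a single sort key (a Morton code), so rows that are close in *both* dimensions end up close together on disk - unlike a plain lexicographic sort, which only clusters on the leading column. A minimal two-dimensional sketch of the idea (not Iceberg's actual implementation):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative ints into a Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions carry x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions carry y
    return key

# Sorting by the interleaved key keeps neighbors in BOTH dimensions
# near each other, so range filters on either column prune more files.
points = [(0, 0), (3, 1), (1, 3), (2, 2), (0, 3), (3, 0)]
points.sort(key=lambda p: z_order_key(*p))
```

In Iceberg, this ordering is applied when rewriting data files, so min/max statistics per file stay tight for every Z-ordered column, not just the first sort column.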
Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure.
This talk will cover the importance of a solid foundation and what management should do to fix a weak one. To illustrate, I'll share a real-life analogy showing how we can be misled and what that means for our success rates.
We will talk about the teams within data teams: data science, data engineering, and operations. This will include detailing what each team is, what it does, and the unique skills it requires. It will also cover what happens when a team is missing and the effect on the other teams.
The analogy comes from my own experience with a house that had major cracks in the foundation. We were simply going to remodel the kitchen; we were never told about the cracks, and the house needed a completely new foundation. In a similar way, most managers think adding advanced analytics such as machine learning is a simple addition (remodeling the kitchen). However, management is never told that you need all three data teams to do it right. Instead, management has to go all the way back to the foundation and fix it. If they don't, the house (team) will crumble under the strain.