Yuan Jiang is a senior staff engineer leading the storage engine team for the Interactive Analytics product at Alibaba. At Big Data Tech Warsaw Summit 2020, he told a story of a large-scale real-time data warehouse product developed in-house. He talked solutions for a real-time data warehouse, its architecture, and typical scenarios.
What is Interactive Analytics service, and is it available for the general public?
It is a sub-second real-time data warehouse. It offers the ability to analyze a massive amount of data interactively, and it is fully compatible with PostgreSQL. We combined large scale Computational Storage – low cost, high performance, and high availability with highly-performed ingestion and query that offers low latency, high throughput, and high concurrency.
Interactive Analytics is widely used inside Alibaba. It is adopted internally by Search, Recommendation, and Ads products and also is available in AliCloud. Lots of customers use it as a private cloud. The public cloud is in the beta stage. It is available in Chinese. We are actually in the process of translating it to English, and we plan general availability in a couple of months.
What are the primary technology highlights of Interactive Analytics?
The goal was to build a low cost for user, large scale computational storage. It is serverless, and user doesn’t have to worry about buying servers and the standard set of things. We will take care of that, and it’s low-cost. It’s a cloud-native so we can do a separation of compute and storage and also unify the storage from both streaming processing and batch processing. For the engineering effort, we build a C++ Native Execution Engine and Query Optimizer and also a storage engine.
Why did the company decide to build this service from the ground up?
We were looking for a product for simple queries. But there was nothing on the market. First, at Alibaba, there were internal talks about some existing computational storage systems, mainly open source products. We took a close look at Apache databases – HBase and Cassandra. They are suitable for highly concurrent, simple queries. They are also called NoSQL databases, but today often they have some SQL layers – at HBase, they have Apache Phoenix, and Cassandra has CQL. Internally you have a straightforward interface, so it’s very suitable for simple queries. We also looked at products like Druid and Apache Kudu. Actually, Druid is the system that we used inside Alibaba before we build Interactive Analytics. They are not suitable for simple queries. They are best for complex big scans. And we wanted to create a system to support both scenarios.
But you also decided to use some existing systems for this product?
Before we do the system, we had to make a decision: should we build our own client system, so the user writes the new code to deal with it? We decided that it’s not good for the user to adopt a new system. It’s actually really hard, so we looked at the existing systems. We found that we can just leverage the PostgreSQL system to do that. It’s already widely used, and there are lots of tools. We have a command line, clients, and Tableau. So you will speed up the adoption rate.
What was the main challenge for the Interactive Analytics team?
Alibaba is the biggest e-commerce company in China. It has close to 1 bln customers. So every day we have a lot of data coming. We just needed extreme performance. So in every component in our system, we want to get the best performance. That is why we support row-oriented and column-oriented storage.
We have vectorization in query execution, high concurrency, and we efficiently use compute resources.
We use C++ to make sure stable low latency; we cost-base query optimize and leverage the character of storage. We have highly efficient resource management and scheduling service.
What are the typical business scenarios for the service?
The first typical business scenario is online processing, real-time A/B testing. We start at the activity logs the user clicks, and it goes through the internal DataHub. Then you go through the real-time compute – inside Alibaba, we use the enhanced version of Flink. It offers performance ten times higher than open-source Apache Flink, and five to ten times higher than Apache Spark for some performance metrics. Then the data goes to the Interactive Analytics, and immediately it can be used for real-time reporting and also customer computer models to analyze user behavior.
The second business scenario is offline processing acceleration. We have MaxCompute. It is Alibaba’s version of the MapReduce system. It offers more performance and more scale than the existing open-source version of Apache Hive. So user store the data in MaxCompute. It’s a MapReduce job to get the result. However, they have high latency. So if the user wants to speed up to get the quick response, we can ingest part of the Maxcompute data and then use some other tool like Tableau, SmartBI, Fine Report, Quick BI or Data Service to use it immediately.
The third scenario is when we want to combine offline processing and data streaming processing data into one place so that it can immediately be used in the dashboard and real-time reporting and online application.
So, in summary, the scenarios we try to support are dashboard, real-time BI reporting, user profiling, and monitoring, and alerting. The system we build has already been used inside Alibaba for e-commerce, IoT, and the financial department. It is also used outside our organization. We sold it for some customers as a private cloud to use it in the area of Public Safety.