Welcome#
Apache Kyuubi™ is a distributed and multi-tenant gateway to provide serverless SQL on Data Warehouses and Lakehouses.
Kyuubi builds distributed SQL query engines on top of various kinds of modern computing frameworks, e.g., Apache Spark, Flink, Doris, Hive, Trino, and StarRocks, etc., to query massive datasets distributed over fleets of machines from heterogeneous data sources.
The Kyuubi Server lane of the below swimlane divides our prospective users into end users and administrators. On the one hand, it hides the technical details of computing and storage from the end users. Thus, they can focus on their business and data with familiar tools. On the other hand, it hides the complexity of business logic from the administrators. Therefore, they can upgrade components on the server side with zero maintenance downtime, optimize workloads with a clear view of what end users are doing, ensure authentication, authorization, and auditing for both cluster and data security, and so forth.
In general, the complete ecosystem of Kyuubi falls into the hierarchies shown in the above figure, with each layer loosely coupled to the other. It’s a child’s play to combine some of the components above to build a modern data stack. For example, you can use Kyuubi, Spark and Iceberg to build and manage Data Lakehouse with pure SQL for both data processing, e.g. ETL, and online analytics processing(OLAP), e.g. BI. All workloads can be done on one platform, using one copy of data, with one SQL interface.
A Unified Gateway#
The Server module plays the role of a unified gateway. The server enables simplified, secure access to any cluster resource through an entry point to deploy different workloads for end(remote) users. Behind this single entry, administrators have a single point for configuration, security, and control of remote access to clusters. And end users have an improved experience with seamless data processing with any the Kyuubi engine they need.
Application Programming Interface#
End users can use the application programming interface listed below for connectivity and interoperation between supported clients and a Kyuubi server. The current implementations are:
- Hive Thrift Protocol
A HiveServer2-compatible interface that allows end users to use a thrift client(cross-language support, both tcp and http), a Java Database Connectivity(JDBC) interface over thrift, or an Open Database Connectivity (ODBC) interface over a JDBC-to-ODBC bridge to communicate with Kyuubi.
- RESTful APIs
It provides system management APIs, including engines, sessions, operations, and miscellaneous ones.
It provides methods that allow clients to submit SQL queries and receive the query results, submit metadata requests and receive metadata results.
It enables easy submission of self-contained applications for batch processing, such as Spark jobs.
- MySQL Protocol
A MySQL-compatible interface that allows end users to use MySQL Connectors, such as Connector/J, to communicate with Kyuubi.
- We’ve planned to add more
Please join our mailing lists if you have any ideas or questions.
Multi-tenancy#
Kyuubi supports the end-to-end multi-tenancy. On the control plane, the Kyuubi server provides a centralized authentication layer to reduce the risk of data and resource breaches. It supports various protocols, such as LDAP and Kerberos, for securing networking between clients and servers. On the data plane, the Kyuubi engines use the same trusted client identities to instantiate themselves. The resource acquirement and data and metadata access all happen within their own engine. Thus, cluster managers and storage providers can easily guarantee data and resource security. Besides, Kyuubi also provides engine authorization extensions to optimize the data security model to fine-grained row/column level. Please see the security page for more information.
High Availability#
Kyuubi is designed with High availability (HA), ensuring it operates continuously without failure for a designated period. HA works to provide Kyuubi that meets an agreed-upon operational performance level.
- Load balancing
It becomes necessary for Kyuubi in a real-world production environment to ensure high availability because of multi-tenant access.
It effectively prevents single point of failures.
It helps achieve zero downtime for planned system maintenance
- Failure detectability
Failures and system load of kyuubi server and engines are visible via metrics, logs, and so forth.
Serverless SQL and More#
Serverless SQL on Lakehouses makes it easier for end users to gain insight from the data universe and optimize data pipelines. It enables:
The same user experience as an RDBMS using familiar SQL for various workloads.
Extensive and secure data access capability across diverse data sources.
High performance on large volumes of data with scalable computing resources.
Besides, Kyuubi also supports submissions of code snippets and self-contained applications serverlessly for more advanced usage.
Ease of Use#
End users could have an optimized experience exploring their data universe in a serverless way, using either JDBC + SQL or REST + code. For most scenarios, the superpower of corresponding engines, such as Spark, and Flink, is no longer necessary. That is, most work related to deployment, runtime optimization, etc., should be done by professionals on the Kyuubi server side. It is suitable for the following scenarios:
- Basic discovery and exploration
Quickly reason about the data in various formats (Parquet, CSV, JSON, text) in your data lake in cloud storage or an on-prem HDFS cluster.
- Lakehouse formation and analytics
Easily build an ACID table storage layer via Hudi, Iceberg, Delta Lake or/and Paimon.
- Logical data warehouse
Provide a relational abstraction on top of disparate data without ETL jobs, from collecting to connecting.
Run Anywhere at Any Scale#
Most of the Kyuubi engine types have a distributed backend or can schedule distributed tasks at runtime. They can process data on single-node machines or clusters, such as YARN and Kubernetes. Besides, the Kyuubi server also supports running on bare metal or in a docker.
High Performance#
Query performance is one of the critical factors in implementing Serverless SQL. Implementing serviceability on state-of-the-art query engines for bigdata lays the foundation for us to achieve this goal:
State-of-the-art query engines
Multiple application for high throughput
Sharable execution runtime for low latency
Server-side global and continuous optimization
Auxiliary performance plugins, such as Z-Ordering, Query Optimizer, and so on
Another goal of Serverless SQL is to make end users need not or rarely care about tricky performance optimization issues.