1. Terminologies
1.1. Kyuubi
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
1.1.1. JDBC
The Java Database Connectivity (JDBC) API is the industry standard for database-independent connectivity between the Java programming language and a wide range of databases: SQL databases and other tabular data sources, such as spreadsheets or flat files. The JDBC API provides a call-level API for SQL-based database access.
JDBC technology allows you to use the Java programming language to exploit “Write Once, Run Anywhere” capabilities for applications that require access to enterprise data. With a JDBC technology-enabled driver, you can connect all corporate data even in a heterogeneous environment.
https://www.oracle.com/java/technologies/javase/javase-tech-database.html
Typically, there is a gap between business development and big data analytics. Coupling the two tightly makes the resulting system difficult to operate and optimize. Decoupling them lets each deliver its full value: business experts can stay focused on their own business development, while big data engineers can continuously optimize server-side performance and stability. Kyuubi connects the two through an easy-to-use JDBC interface.
1.1.1.1. Apache Hive
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Kyuubi supports the Hive JDBC driver, which helps you migrate your slow queries from Hive to Spark SQL without changing the client.
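Because Kyuubi accepts the Hive JDBC driver, a client addresses it with a HiveServer2-style URL. The following is a minimal sketch of how such a URL is assembled, not an official utility: the helper name `build_jdbc_url`, the host `kyuubi.example.com`, and the session variable shown are hypothetical, and the port `10009` is assumed to be the server's binding port in your deployment.

```python
def build_jdbc_url(host, port=10009, database="default", session_vars=None):
    """Assemble a HiveServer2-style JDBC URL: jdbc:hive2://host:port/db;key=value;...

    The default port is an assumption; use whatever port your server binds to.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    # Session variables are appended after the database, separated by semicolons.
    for key, value in (session_vars or {}).items():
        url += f";{key}={value}"
    return url


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(build_jdbc_url("kyuubi.example.com"))
    print(build_jdbc_url("kyuubi.example.com", session_vars={"user": "alice"}))
```

The resulting string can be passed to any tool that loads the Hive JDBC driver, so an existing Hive client only needs its connection URL changed.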
1.1.1.2. Apache Thrift
The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.
1.1.2. Server
A server is a daemon process that handles concurrent connection and query requests and converts these requests into various operations against the query engines to complete the responses to clients.
Aliases: Kyuubi Server / Kyuubi Instance / k.i.
1.1.3. ServerSpace
A ServerSpace is used to register servers and expose them together as a service layer to clients.
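The register-and-expose idea can be illustrated with a toy model. This sketch is an in-memory stand-in, not Kyuubi's implementation: the class `ServerSpace`, its methods, and the host names are hypothetical, and a real deployment would keep the registry in a coordination service rather than in a Python set.

```python
import random


class ServerSpace:
    """Toy in-memory registry of server instances under one namespace."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._instances = set()

    def register(self, host, port):
        # A server announces itself when it starts.
        self._instances.add((host, port))

    def unregister(self, host, port):
        # A server removes itself when it stops; unknown entries are ignored.
        self._instances.discard((host, port))

    def instances(self):
        # Sorted so that callers see a stable order.
        return sorted(self._instances)

    def pick(self, rng=random):
        # Clients see one service layer and are handed any live instance.
        if not self._instances:
            raise LookupError(f"no servers registered in {self.namespace!r}")
        return rng.choice(self.instances())


if __name__ == "__main__":
    space = ServerSpace("kyuubi")
    space.register("server-a.example.com", 10009)
    space.register("server-b.example.com", 10009)
    print(space.pick())
```

Picking an instance at random is only one possible policy; it is used here because it spreads clients across servers without any shared state.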
1.1.4. Engine
An engine handles all queries through Kyuubi servers. It is created in one Kyuubi server and can be shared with other Kyuubi servers by registering itself to an engine namespace. Its capabilities are mainly powered by Spark SQL.
Aliases: Query Engine / Engine Instance / e.i.
1.1.5. EngineSpace
An EngineSpace is internally used by servers to register and interact with engines.
1.1.5.1. Apache Spark
Apache Spark™ is a unified analytics engine for large-scale data processing.
1.1.6. Multi Tenancy
Kyuubi guarantees end-to-end multi-tenant isolation and sharing in the following pipeline:
Client --> Kyuubi --> Query Engine (Spark) --> Resource Manager --> Data Storage Layer
1.1.7. High Availability / Load Balance
For an enterprise service, meeting SLA commitments is essential. Deploying Kyuubi in High Availability (HA) mode helps you meet them.
1.1.7.1. Apache ZooKeeper
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
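In HA mode, a Hive JDBC client is typically pointed at the ZooKeeper ensemble rather than at a single server, and the driver resolves a live instance from there. The sketch below only assembles such a URL. The helper name `build_ha_jdbc_url` and the `zk1`..`zk3` hosts are hypothetical, and the namespace `kyuubi` is an assumption that must match whatever namespace your servers register under.

```python
def build_ha_jdbc_url(zk_quorum, namespace="kyuubi"):
    """Assemble a Hive JDBC URL that uses ZooKeeper service discovery.

    zk_quorum is a list of "host:port" strings for the ZooKeeper ensemble.
    The namespace default is an assumption, not a confirmed setting.
    """
    if not zk_quorum:
        raise ValueError("at least one ZooKeeper address is required")
    hosts = ",".join(zk_quorum)
    return (
        f"jdbc:hive2://{hosts}/"
        f";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace={namespace}"
    )


if __name__ == "__main__":
    # Hypothetical ensemble addresses, for illustration only.
    print(build_ha_jdbc_url(["zk1:2181", "zk2:2181", "zk3:2181"]))
```

Listing every ensemble member keeps the URL usable when one ZooKeeper node is unavailable.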
1.1.7.2. Apache Curator
Apache Curator is a Java/JVM client library for Apache ZooKeeper, a distributed coordination service. It includes a high-level API framework and utilities to make using Apache ZooKeeper much easier and more reliable. It also includes recipes for common use cases and extensions such as service discovery and a Java 8 asynchronous DSL.
1.2. DataLake & Lakehouse
Kyuubi unifies DataLake & Lakehouse access through pure SQL, and secures that access with authentication and SQL standard authorization.
1.2.1. Apache Iceberg
Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.
1.2.2. Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
1.2.3. Apache Hudi
Apache Hudi ingests and manages storage of large analytical datasets over DFS (HDFS or cloud stores).