Hive#
You may know that Apache Spark has built-in support for accessing Hive tables. It works well in most cases, but is limited to a single Hive Metastore. The Kyuubi Spark Hive Connector (KSHC) implements a Hive connector based on the Spark DataSource V2 API, which supports accessing multiple Hive Metastores in a single Spark application.
Hive Integration#
To enable the integration of the Kyuubi Spark SQL engine and the Hive connector through the Spark DataSource V2 API, you need to:
Reference the Hive connector dependencies
Set the Spark catalog configurations
Dependencies#
The classpath of the Kyuubi Spark SQL engine with the Hive connector enabled consists of
kyuubi-spark-sql-engine-1.10.1_2.12.jar, the engine jar deployed with a Kyuubi distribution
a copy of the Spark distribution
kyuubi-spark-connector-hive_2.12-1.10.1.jar, which can be found on Maven Central
In order to make the Hive connector packages visible to the runtime classpath of engines, use one of the following methods:
Put the Kyuubi Hive connector packages into $SPARK_HOME/jars directly
Set spark.jars=/path/to/kyuubi-hive-connector
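For example, with the second method, a spark-defaults.conf entry might look like the following; the jar path below is illustrative and needs to match your environment:
spark.jars /path/to/kyuubi-spark-connector-hive_2.12-1.10.1.jar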
Note
Starting from v1.9.2 and v1.10.0, the KSHC jars published to Maven Central guarantee binary compatibility across Spark versions, namely Spark 3.3 and onwards.
Configurations#
To activate the functionality of the Kyuubi Spark Hive connector, set the following configurations:
spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port
spark.sql.catalog.hive_catalog.<other.hive.conf> <value>
spark.sql.catalog.hive_catalog.<other.hadoop.conf> <value>
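For example, to access two independent Hive Metastores in a single application, you could register two catalogs; the catalog names and metastore addresses below are illustrative:
spark.sql.catalog.hive_prod org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hive_prod.hive.metastore.uris thrift://prod-metastore-host:9083
spark.sql.catalog.hive_test org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hive_test.hive.metastore.uris thrift://test-metastore-host:9083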
Hive Connector Operations#
Taking CREATE NAMESPACE as an example,
CREATE NAMESPACE hive_catalog.ns;
Taking CREATE TABLE as an example,
CREATE TABLE hive_catalog.ns.foo (
id bigint COMMENT 'unique id',
data string)
USING parquet;
Taking SELECT as an example,
SELECT * FROM hive_catalog.ns.foo;
Taking INSERT as an example,
INSERT INTO hive_catalog.ns.foo VALUES (1, 'a'), (2, 'b'), (3, 'c');
Taking DROP TABLE as an example,
DROP TABLE hive_catalog.ns.foo;
Taking DROP NAMESPACE as an example,
DROP NAMESPACE hive_catalog.ns;
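Since each catalog is addressed by its own name, cross-catalog statements also work. For instance, a hypothetical query copying data between the two illustrative catalogs registered in the configuration example above, assuming a table ns.foo exists in both:
INSERT INTO hive_test.ns.foo SELECT * FROM hive_prod.ns.foo;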
Advanced Usages#
Though KSHC is a pure Spark DataSource V2 connector which isn't coupled with the Kyuubi deployment, due to the implementation inside spark-sql, you should not expect KSHC to work properly with spark-sql, and any issues caused by such a combination won't be considered at this time. Instead, it's recommended to use BeeLine with Kyuubi as a drop-in replacement for spark-sql, or to switch to spark-shell.
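For instance, a minimal sketch of connecting BeeLine to a Kyuubi server; the host is illustrative, and 10009 is the default Kyuubi frontend port:
beeline -u 'jdbc:hive2://kyuubi-server-host:10009/'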
KSHC supports accessing a Kerberized Hive Metastore and HDFS by using a keytab, a TGT cache, or a delegation token. It is not expected to work properly with multiple KDC instances; this limitation comes from the JDK Krb5LoginModule. For such cases, consider setting up cross-realm Kerberos trusts, so that you only need to talk to one KDC.
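A minimal sketch of catalog configurations for a Kerberized Hive Metastore using a keytab; the host, principals, and paths below are illustrative, hive.metastore.sasl.enabled and hive.metastore.kerberos.principal are standard Hive client configurations passed through the catalog prefix, and spark.kerberos.principal/spark.kerberos.keytab are standard Spark configurations:
spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://secure-metastore-host:9083
spark.sql.catalog.hive_catalog.hive.metastore.sasl.enabled true
spark.sql.catalog.hive_catalog.hive.metastore.kerberos.principal hive/_HOST@EXAMPLE.COM
spark.kerberos.principal user@EXAMPLE.COM
spark.kerberos.keytab /path/to/user.keytab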
For the HMS Thrift API used by Spark, it's known that the Hive 2.3.9 client is compatible with HMS from 2.1 to 4.0, and the Hive 2.3.10 client is compatible with HMS from 1.1 to 4.0; such version combinations should cover most cases. For other corner cases, KSHC also supports spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.jars and spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.version, just as the Spark built-in Hive datasource does; you can refer to the Spark documentation for details.
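For instance, a hedged sketch of pointing a catalog at an older metastore with an isolated Hive client resolved from Maven; the version below is illustrative:
spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.version 1.2.2
spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars maven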
Currently, KSHC has not implemented the Parquet/ORC Hive table read/write optimization; in other words, it always uses the Hive SerDe to access Hive tables, so there might be a performance gap compared to the Spark built-in Hive datasource, especially due to the lack of support for vectorized reading. You may also hit bugs caused by the Hive SerDe, e.g. ParquetHiveSerDe cannot read Parquet files in which decimals are written in the int-based format produced by the Spark Parquet datasource writer with spark.sql.parquet.writeLegacyFormat=false.