TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance.
This connector can be used to test the capabilities and query syntax of Spark without configuring access to an external data source. When you query a TPC-H table, the connector generates the data on the fly using a deterministic algorithm.
Go to Try Kyuubi to explore TPC-H data instantly!
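To enable the integration of the Kyuubi Spark SQL engine and TPC-H through the Apache Spark DataSource V2 and Catalog APIs, you need to reference the TPC-H connector dependencies and set the Spark catalog configurations, as described below.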
The classpath of the Kyuubi Spark SQL engine with TPC-H support consists of:
kyuubi-spark-sql-engine-1.7.0-SNAPSHOT_2.12.jar, the engine jar deployed with Kyuubi distributions
a copy of the Spark distribution
kyuubi-spark-connector-tpch-1.7.0-SNAPSHOT_2.12.jar, which can be found in Maven Central
To make the TPC-H connector package visible on the runtime classpath of the engines, use one of these methods:
Put the TPC-H connector package into
To add TPC-H tables as a catalog, we can set the following configurations in
# (required) Register a catalog named `tpch` for the spark engine.
spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog

# (optional) Excluded database list from the catalog, all available databases are:
# sf0, tiny, sf1, sf10, sf30, sf100, sf300, sf1000, sf3000, sf10000, sf30000, sf100000.
spark.sql.catalog.tpch.excludeDatabases=sf10000,sf30000

# (optional) When true, use CHAR/VARCHAR, otherwise use STRING. It affects output of the table schema,
# e.g. `SHOW CREATE TABLE <table>`, `DESC <table>`.
spark.sql.catalog.tpch.useAnsiStringType=false

# (optional) Maximum bytes per task, consider reducing it if you want higher parallelism.
spark.sql.catalog.tpch.read.maxPartitionBytes=128m
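For a quick experiment, the same keys can usually be set for the current session only, because Spark loads catalog plugins lazily from the session configuration. The following is a minimal sketch, assuming the connector jar is already on the engine classpath and that your deployment allows these keys to be set at runtime. Options changed after the catalog has first been used in a session may not take effect.

```sql
-- Register the catalog for this session only.
-- Assumption: the TPC-H connector jar is already on the engine classpath.
SET spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog;

-- Optional: report CHAR/VARCHAR column types instead of STRING.
SET spark.sql.catalog.tpch.useAnsiStringType=true;

-- The catalog is loaded on first use, for example when listing its databases.
SHOW DATABASES IN tpch;
```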
List the databases under the tpch catalog.
SHOW DATABASES IN tpch;
List the tables under the tpch.sf1 database.
SHOW TABLES IN tpch.sf1;
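Before querying a table, it can help to inspect its schema with the DESC and SHOW CREATE TABLE statements mentioned in the configuration comments. A small sketch, using the tiny database from the list of available databases; the string column types reported depend on spark.sql.catalog.tpch.useAnsiStringType.

```sql
-- Show the column names and types of the orders table.
DESC tpch.tiny.orders;

-- Show the full table definition.
SHOW CREATE TABLE tpch.tiny.orders;
```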
Switch current database to tpch.sf1 and run a query against it.
USE tpch.sf1;
SELECT * FROM orders;
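Because the data is generated on the fly, an aggregate over a small database is often a more practical first query than a full SELECT *. The sketch below assumes the standard TPC-H column names (o_orderstatus, o_totalprice); if the columns fail to resolve, check the actual names with DESC.

```sql
-- Count orders and sum their total price per order status.
-- Assumption: standard TPC-H column names; verify with DESC tpch.tiny.orders.
SELECT o_orderstatus,
       count(*)          AS order_count,
       sum(o_totalprice) AS total_price
FROM tpch.tiny.orders
GROUP BY o_orderstatus
ORDER BY o_orderstatus;
```

Using the fully qualified name tpch.tiny.orders means the query does not depend on the current database set by USE.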