🦊 Welcome to Kyuubi’s online documentation ✨, v1.10.0-SNAPSHOT

PyHive#

PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Hive. PyHive can connect with the Kyuubi server serving in thrift protocol as HiveServer2.

Requirements#

PyHive works with Python 2.7 / Python 3. Install PyHive via pip for the Hive interface.

pip install 'pyhive[hive]'

Usage#

Use the Kyuubi server’s host and thrift protocol port to connect.

For further information about usages and features, e.g. DB-API async fetching, using in SQLAlchemy, please refer to project homepage.

DB-API#

from pyhive import hive
cursor = hive.connect(host=kyuubi_host,port=10009).cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())

Use PyHive with Pandas#

PyHive provides a handy way to establish a SQLAlchemy compatible connection and works with Pandas dataframe for executing SQL and reading data via pandas.read_sql.

from pyhive import hive
import pandas as pd

# open connection
conn = hive.Connection(host=kyuubi_host,port=10009)

# query the table to a new dataframe
dataframe = pd.read_sql("SELECT id, name FROM test.example_table", conn)

Authentication#

If password is provided for connection, make sure the auth param set to either CUSTOM or LDAP.

# open connection
conn = hive.Connection(host=kyuubi_host, port=10009, 
                       username='user', password='password', auth='CUSTOM')

Use Kerberos to connect to Kyuubi.

kerberos_service_name must be the name of the service that started the Kyuubi server, usually the prefix of the first slash of kyuubi.kinit.principal.

Note that PyHive does not support passing in principal, it splices in part of principal with kerberos_service_name and kyuubi_host.

# open connection
conn = hive.Connection(host=kyuubi_host, port=10009, auth="KERBEROS", kerberos_service_name="kyuubi")

If you encounter the following errors, you need to install related packages.

thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
yum install -y cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5

Note that PyHive does not support the connection method based on zookeeper HA, you can connect to zookeeper to get the service address via Kazoo.

Code reference https://stackoverflow.com/a/73326589

from pyhive import hive
import random
from kazoo.client import KazooClient
zk = KazooClient(hosts='kyuubi1.xx.com:2181,kyuubi2.xx.com:2181,kyuubi3.xx.com:2181', read_only=True)
zk.start()
servers = [kyuubi_server.split(';')[0].split('=')[1].split(':') 
           for kyuubi_server 
           in zk.get_children(path='kyuubi')]
kyuubi_host, kyuubi_port = random.choice(servers)
zk.stop()
print(kyuubi_host, kyuubi_port)
conn = hive.Connection(host=kyuubi_host, port=kyuubi_port, auth="KERBEROS", kerberos_service_name="kyuubi")