The Share Level Of Kyuubi Engines

The share level of Kyuubi engines describes the relationship between sessions and engines. It determines whether a new session can share an existing backend engine with other sessions. Sessions are the JDBC/ODBC/Thrift connections that end users create from clients, and engines are standalone applications with the full capabilities of Spark SQL or Flink SQL (under development), running on single-node machines or clusters.

The share level of Kyuubi engines works the same way in HA mode and in single-node mode. In other words, an engine is shared cluster-wide by all Kyuubi server peers whenever possible.

Why do we need this feature?

Apache Spark is a unified engine for large-scale data analytics. Using Spark to process data is like driving an all-wheel-drive, high-horsepower supercar. However:

Cars have a limit on their 0-60 times. Similarly, all Spark applications have to warm up before going full speed.
Cars have a constant number of seats and are not allowed to be overloaded. Due to the master-slave architecture of Spark and the resource configured ahead, the overall workload of a single application is predictable.
Cars have various shapes to meet our needs.

With this feature, Kyuubi gives you a more flexible way to handle different big data workloads.

The current supported share levels

The currently supported share levels are:

Share Level	Semantics	Scenario	Isolation Degree	Shareability
CONNECTION	One engine per session	Large-scale ETL; ad hoc	High	Low
USER	One engine per user	Ad hoc; small-scale ETL	Medium	Medium
GROUP	One engine per primary group	Ad hoc; small-scale ETL	Low	High
SERVER_LOCAL	One engine per Kyuubi server	Resource load balancing	Very low	Very high
SERVER	One engine per cluster	Admin	Highest if secured; lowest if unsecured	Admin only, if secured

A higher isolation degree gives better stability to an engine and to the query executions running on it.
Higher shareability means we are more likely to reuse an engine that is already at full speed.
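The table can be read as a rule for which sessions end up on the same engine. The following is a minimal conceptual sketch of that rule, not Kyuubi's implementation: the `Session` type, the `engine_key` function, and the host and user names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical description of a client session (illustration only)."""
    session_id: str
    user: str
    primary_group: str
    server_host: str


def engine_key(level: str, session: Session) -> tuple:
    """Return the identity under which sessions would share one engine.

    Two sessions share an engine exactly when their keys are equal.
    """
    if level == "CONNECTION":
        return (level, session.session_id)   # one engine per session
    if level == "USER":
        return (level, session.user)         # one engine per user
    if level == "GROUP":
        return (level, session.primary_group)  # one engine per primary group
    if level == "SERVER_LOCAL":
        return (level, session.server_host)  # one engine per Kyuubi server
    if level == "SERVER":
        return (level,)                      # one engine per cluster
    raise ValueError(f"unknown share level: {level}")


if __name__ == "__main__":
    a = Session("s1", "tom", "analytics", "kyuubi-1")
    b = Session("s2", "tom", "analytics", "kyuubi-2")
    for level in ("CONNECTION", "USER", "GROUP", "SERVER_LOCAL", "SERVER"):
        print(level, engine_key(level, a) == engine_key(level, b))
```

In this model, the two sessions of user tom share an engine at the USER, GROUP, and SERVER levels, but not at CONNECTION (different sessions) or SERVER_LOCAL (different server instances).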

CONNECTION

Figure.1 CONNECTION Share Level

Each session with the CONNECTION share level has a standalone engine of its own, which no one else can reach. Within the session, a user or client can send multiple operation requests, including metadata calls and queries, to the corresponding engine.

Although it is still an interactive form, this model does allow for more practical batch processing jobs as well.

When the session is closed, the corresponding engine is shut down at the same time.

USER (Default)

Figure.2 USER Share Level

All sessions with USER share level use the same engine if and only if the session user is the same.

Those sessions share the same engine, along with the objects that belong to its one and only SparkContext instance, including Classes/Classloaders, SparkConf, Driver/Executors, Hive Metastore Client, etc. Each session can still have its own SparkSession instance, which holds separate session state, including temporary views, SQL configs, UDFs, etc. Setting kyuubi.engine.single.spark.session to true makes the SparkSession instance a singleton that is shared across sessions.

When a session is closed, the corresponding engine is not shut down. Even after all sessions are closed, the engine stays alive for a time-to-live (TTL) period. This TTL allows new sessions to be established quickly without waiting for an engine to start.
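The TTL behavior can be pictured with a small model. This is a sketch under stated assumptions, not Kyuubi's code: the `EngineLifecycle` class is hypothetical, time is passed in explicitly as seconds, and the 30-minute value is an arbitrary example rather than a documented default.

```python
from typing import Optional


class EngineLifecycle:
    """Hypothetical model of a shared engine that outlives its sessions."""

    def __init__(self, idle_ttl_seconds: float):
        self.idle_ttl_seconds = idle_ttl_seconds
        self.open_sessions = 0
        # Time at which the last session closed; None while sessions are open.
        self.idle_since: Optional[float] = None

    def open_session(self, now: float) -> None:
        self.open_sessions += 1
        self.idle_since = None  # an active session cancels the idle countdown

    def close_session(self, now: float) -> None:
        self.open_sessions -= 1
        if self.open_sessions == 0:
            self.idle_since = now  # countdown starts when the last session closes

    def is_expired(self, now: float) -> bool:
        """True once the engine has been idle for at least the TTL."""
        return (
            self.idle_since is not None
            and now - self.idle_since >= self.idle_ttl_seconds
        )


if __name__ == "__main__":
    engine = EngineLifecycle(idle_ttl_seconds=30 * 60)  # example value only
    engine.open_session(now=0)
    engine.close_session(now=100)
    print(engine.is_expired(now=1000))  # still within the TTL
    print(engine.is_expired(now=1900))  # idle for the full TTL
```

A session that arrives before the TTL elapses reuses the warm engine and resets the countdown, which is the fast-reconnect benefit described above.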

GROUP

Figure.3 GROUP Share Level

An engine is shared by all sessions created by users who belong to the same primary group. The engine is launched with the group name as the effective username, so the group name acts as a special user that can access the compute resources and data of a team. Kyuubi follows the Hadoop GroupsMapping to map a user to a primary group. If the primary group is not found, it falls back to the USER level.

The mechanisms of SparkContext, SparkSession, and TTL work the same way as in the USER share level.
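The fallback rule can be sketched as follows. This is an illustration only: `engine_owner` and the lookup callable are hypothetical names, and treating the first group returned by the mapping as the primary group is an assumption about how the group mapping is consumed.

```python
from typing import Callable, Sequence, Tuple


def engine_owner(
    user: str, lookup_groups: Callable[[str], Sequence[str]]
) -> Tuple[str, str]:
    """Decide whose engine a session joins under the GROUP share level.

    Returns (effective_level, effective_username). Assumption: the first
    group returned by the mapping is the user's primary group.
    """
    groups = lookup_groups(user)
    if groups:
        return ("GROUP", groups[0])
    # No primary group found: fall back to a per-user engine.
    return ("USER", user)


if __name__ == "__main__":
    # Hypothetical in-memory mapping standing in for Hadoop GroupsMapping.
    mapping = {"tom": ["analytics", "staff"], "jerry": ["analytics"]}
    lookup = lambda u: mapping.get(u, [])
    print(engine_owner("tom", lookup))
    print(engine_owner("jerry", lookup))
    print(engine_owner("guest", lookup))
```

Here tom and jerry would land on the same engine owned by analytics, while guest, who has no group, would get a USER-level engine.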

Here is an example to configure HadoopGroupProvider to use LDAP-based group mapping.

Add the properties shown in the example below to the core-site.xml file. You will need to provide the value for the bind user, the bind password, and other properties specific to your LDAP instance, and make sure that object class, user, and group filters match the values specified in your LDAP instance.

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://localhost:389</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=Manager,dc=example,dc=com</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.bind.password</name>
  <value>example</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.user</name>
  <value>(&amp;(objectClass=posixAccount)(cn={0}))</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=posixGroup)</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
  <value>memberuid</value>
</property>

<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
  <value>cn</value>
</property>

Use the applicable instructions to restart the HDFS NameNode and the YARN ResourceManager.
Verify the LDAP group mapping by running the hdfs groups command, which fetches groups from LDAP for the current user. Note that with LDAP group mapping configured, HDFS permissions can leverage groups defined in LDAP for access control.

Tips for authorization in GROUP share level:

Both the session user and the primary group name (as the sparkUser/execute user) are accessible on the engine side. By default, the sparkUser is used to check the YARN/HDFS ACLs. If you want fine-grained access control for the session user, get it from SparkContext.getLocalProperty("kyuubi.session.user") and send it to a security service such as Apache Ranger.
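As an illustration of that tip, the sketch below separates the two identities before handing them to an authorization check. It assumes a PySpark-style SparkContext exposing `sparkUser()` and `getLocalProperty(key)`; `FakeSparkContext` and `authorization_subject` are hypothetical names, and the stub exists only so the example runs without a Spark installation. Falling back to the engine user when the property is missing is an assumption, not documented behavior.

```python
class FakeSparkContext:
    """Hypothetical stand-in for a SparkContext so the sketch runs without Spark."""

    def __init__(self, spark_user, local_properties):
        self._spark_user = spark_user
        self._local_properties = dict(local_properties)

    def sparkUser(self):
        return self._spark_user

    def getLocalProperty(self, key):
        # Like the real method, returns None when the property is not set.
        return self._local_properties.get(key)


def authorization_subject(sc):
    """Collect the identities an engine-side authorizer could check."""
    engine_user = sc.sparkUser()  # the group name under GROUP share level
    session_user = sc.getLocalProperty("kyuubi.session.user")
    return {
        "engine_user": engine_user,
        # Assumption: use the engine user if no session user was propagated.
        "session_user": session_user or engine_user,
    }


if __name__ == "__main__":
    sc = FakeSparkContext("analytics", {"kyuubi.session.user": "tom"})
    print(authorization_subject(sc))
```

Local properties are scoped to the thread that runs the operation, so a lookup like this belongs inside the code path that handles the query.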

SERVER_LOCAL

Figure.4 SERVER_LOCAL Share Level

An engine with the SERVER_LOCAL share level is private to a single Kyuubi Server instance and is shared by all user sessions on that instance. In this mode, an engine instance is created when the Kyuubi Server first needs it and is associated with the host address of that server.

SERVER

Figure.5 SERVER Share Level

This model is essentially the same as Spark Thrift Server with high availability.

Subdomain

For the USER, GROUP, or SERVER share levels, you can additionally use kyuubi.engine.share.level.subdomain to isolate engines. That is, you can create multiple engines for a single user, group, or server (cluster). For example, at the USER share level, you can use kyuubi.engine.share.level.subdomain=sd1 and kyuubi.engine.share.level.subdomain=sd2 to create two standalone engines for user Tom.

The kyuubi.engine.share.level.subdomain should be set in the JDBC connection URL to tell the Kyuubi server which engine you want to use.
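A small helper can make the two-engine example concrete. This is a sketch under assumptions: `kyuubi_jdbc_url` is a hypothetical helper, the host and port are placeholders, and placing session configs after `;#` follows the Hive JDBC URL convention, so check the layout against your client's documentation before relying on it.

```python
from typing import Mapping, Optional


def kyuubi_jdbc_url(
    host: str,
    port: int,
    database: str = "default",
    session_confs: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a JDBC URL that carries session-level configs.

    Assumption: configs are appended as ';#key=value;key=value'.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_confs:
        pairs = ";".join(f"{key}={value}" for key, value in session_confs.items())
        url += f";#{pairs}"
    return url


if __name__ == "__main__":
    # Two URLs for the same user that would target two separate engines.
    for subdomain in ("sd1", "sd2"):
        print(
            kyuubi_jdbc_url(
                "localhost",
                10009,  # placeholder port
                session_confs={
                    "kyuubi.engine.share.level": "USER",
                    "kyuubi.engine.share.level.subdomain": subdomain,
                },
            )
        )
```

Connecting with the first URL and then the second as the same user would, under the USER share level, address the sd1 and sd2 engines respectively.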

Engine Pool

Engine pool works by creating multiple engine subdomains (e.g., engine-pool-0, engine-pool-1) and distributing sessions across these subdomains.

The Engine Pool’s configurations will not take effect when kyuubi.engine.share.level.subdomain is set or kyuubi.engine.share.level is set as CONNECTION.
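The pool mechanics can be modeled in a few lines. This is a conceptual sketch, not Kyuubi's implementation: `pool_subdomains`, `PollingSelector`, and `pool_applies` are hypothetical names, and round-robin is just one possible way to distribute sessions.

```python
import itertools
from typing import List, Optional


def pool_subdomains(size: int, name: str = "engine-pool") -> List[str]:
    """Generate subdomain names such as engine-pool-0, engine-pool-1."""
    return [f"{name}-{index}" for index in range(size)]


class PollingSelector:
    """Hand out pool subdomains in round-robin order (one possible policy)."""

    def __init__(self, subdomains: List[str]):
        self._cycle = itertools.cycle(subdomains)

    def next_subdomain(self) -> str:
        return next(self._cycle)


def pool_applies(share_level: str, subdomain: Optional[str]) -> bool:
    """The pool is ignored for CONNECTION or when a subdomain is set explicitly."""
    return share_level != "CONNECTION" and not subdomain


if __name__ == "__main__":
    selector = PollingSelector(pool_subdomains(2))
    for session in ("s1", "s2", "s3"):
        print(session, "->", selector.next_subdomain())
    print(pool_applies("USER", None), pool_applies("USER", "sd1"))
```

Each generated subdomain behaves like a manually chosen one, so a pool of size two gives a user two engines without the client having to name them.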

Hybrid

All supported share levels can be used together in a single Kyuubi server or cluster.

Conclusion

With this feature, end-users are able to leverage engines in different ways to handle their different workloads, such as large-scale ETL jobs and interactive ad hoc queries.