Apache Hive compatibility
Applies to: Databricks Runtime
Apache Spark SQL in Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs.
Metastore connectivity
See External Apache Hive metastore (legacy) for information on how to connect Databricks to an externally hosted Hive metastore.
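That article covers the details; as a rough sketch, an external metastore is typically wired up through cluster Spark configuration along these lines. All hostnames, credentials, and version numbers below are placeholders, not values from this page:

```ini
# Hypothetical cluster Spark configuration for an external Hive metastore.
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<metastore-host>:3306/metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName <user>
spark.hadoop.javax.jdo.option.ConnectionPassword <password>
```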
Supported Hive features
Spark SQL supports the vast majority of Hive features, such as the following (a combined example appears after the list):
Hive query statements, including:
SELECT
GROUP BY
ORDER BY
CLUSTER BY
SORT BY
All Hive expressions, including:
Relational expressions (=, <=>, ==, <>, <, >, >=, <=, etc.)
Arithmetic expressions (+, -, *, /, %, etc.)
Logical expressions (AND, &&, OR, ||, etc.)
Complex type constructors
Mathematical expressions (sign, ln, cos, etc)
String expressions (instr, length, printf, etc)
User defined functions (UDF)
User defined aggregation functions (UDAF)
User defined serialization formats (SerDes)
Window functions
Joins
JOIN
{LEFT|RIGHT|FULL} OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN
Unions
Sub-queries
SELECT col FROM (SELECT a + b AS col FROM t1) t2
Sampling
Explain
Partitioned tables including dynamic partition insertion
Views
Vast majority of DDL statements, including:
CREATE TABLE
CREATE TABLE AS SELECT
ALTER TABLE
Most Hive data types, including:
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
BINARY
TIMESTAMP
DATE
ARRAY<>
MAP<>
STRUCT<>
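To make the list concrete, the following sketch exercises several of these features together: DDL with complex types and partitioning, dynamic partition insertion, Hive UDF registration, and a query combining a subquery, an outer join, a window function, and SORT BY. All table, column, jar, and class names (sales, staged_sales, regions, com.example.hive.MyLowerUDF) are hypothetical, not from this page:

```sql
-- Hive-format table with complex types and a partition column.
CREATE TABLE sales (
  id     BIGINT,
  amount DOUBLE,
  tags   ARRAY<STRING>,
  attrs  MAP<STRING, STRING>,
  addr   STRUCT<city: STRING, zip: INT>
)
PARTITIONED BY (region STRING)
STORED AS PARQUET;

-- Dynamic partition insertion: the region value is taken from the query output.
-- Depending on configuration, SET hive.exec.dynamic.partition.mode=nonstrict may be required first.
INSERT INTO sales PARTITION (region)
SELECT id, amount, tags, attrs, addr, region
FROM staged_sales;

-- Registering a Hive UDF packaged in a jar (class and path are hypothetical).
CREATE FUNCTION my_lower AS 'com.example.hive.MyLowerUDF'
USING JAR '/path/to/udfs.jar';

-- Subquery, outer join, window function, and SORT BY in one statement.
SELECT t.region,
       t.amount,
       RANK() OVER (PARTITION BY t.region ORDER BY t.amount DESC) AS rnk
FROM (SELECT region, amount FROM sales WHERE amount > 0) t
LEFT OUTER JOIN regions r
  ON t.region = r.name
SORT BY t.region;
```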
Unsupported Hive functionality
The following sections contain a list of Hive features that Spark SQL doesn’t support. Most of these features are rarely used in Hive deployments.
Esoteric Hive features
Union type
Unique join
Column statistics collection: Spark SQL does not currently piggyback scans to collect column statistics; it only supports populating the sizeInBytes field of the Hive metastore
Hive input and output formats
File format for CLI: For results returned to the CLI, Spark SQL supports only TextOutputFormat
Hadoop archive
Hive optimizations
A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model.
Block level bitmap indexes and virtual columns (used to build indexes).
Automatically determine the number of reducers for joins and groupbys: In Spark SQL, you need to control the degree of parallelism post-shuffle using `SET spark.sql.shuffle.partitions=[num_tasks];` (see the sketch after this list).
Skew data flag: Spark SQL does not follow the skew data flag in Hive.
STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
Merge multiple small files for query results: If the result output contains many small files, Hive can optionally merge them into fewer, larger files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
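For the reducer-count item above, this is a minimal sketch of setting post-shuffle parallelism by hand; the table name and the value 200 (Spark's default) are illustrative only:

```sql
-- Spark SQL does not pick reducer counts automatically;
-- choose the post-shuffle partition count yourself.
SET spark.sql.shuffle.partitions=200;

-- The aggregation below then shuffles into 200 partitions.
SELECT region, COUNT(*) AS cnt
FROM sales
GROUP BY region;
```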