Hive table

This article shows how to import a Hive table from cloud storage into Databricks using an external table.

Step 1: Show the `CREATE TABLE` statement

Issue a SHOW CREATE TABLE <tablename> command on your Hive command line to see the statement that created the table.

hive> SHOW CREATE TABLE wikicc;
OK
CREATE  TABLE `wikicc`(
  `country` string,
  `count` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '<path-to-table>'
TBLPROPERTIES (
  'totalSize'='2335',
  'numRows'='240',
  'rawDataSize'='2095',
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'transient_lastDdlTime'='1418173653')

Step 2: Issue a `CREATE EXTERNAL TABLE` statement

If the statement that is returned uses a CREATE TABLE command, copy the statement and replace CREATE TABLE with CREATE EXTERNAL TABLE.

EXTERNAL ensures that Spark SQL does not delete your data if you drop the table.
You can omit the TBLPROPERTIES field.

DROP TABLE wikicc

CREATE EXTERNAL TABLE `wikicc`(
  `country` string,
  `count` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '<path-to-table>'

Step 3: Issue SQL commands on your data

SELECT * FROM wikicc

{ # 's' variable stands for the search page here. TODO: Fix it. #}

DATA+AU Summit 2024