read_kinesis streaming table-valued function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 13.3 LTS and above

Returns a table with records read from Kinesis from one or more streams.

Syntax

read_kinesis ( { parameter => value } [, ...] )

Arguments

read_kinesis requires named parameter invocation.

The only required argument is streamName. All other arguments are optional.

The descriptions of the arguments are brief here. For more details, see the Amazon Kinesis documentation.

There are various connection options to connect and authenticate with AWS. awsAccessKey, and awsSecretKey can either be specified in the function arguments using the secret function, manually set in the arguments, or configured as environment variables as indicated below. roleArn, roleExternalID, roleSessionName can also be used to authenticate with AWS by using instance profiles. If none of these are specified, it will use the default AWS provider chain.

Parameter

Type

Description

streamName

STRING

Required, comma-separated list of one or more kinesis streams.

awsAccessKey

STRING

The AWS Access key, if any. Can also be specified through the various options supported through the AWS default credential provider chain including environment variables (AWS_ACCESS_KEY_ID) and a credential profiles file.

awsSecretKey

STRING

The secret key which corresponds to the access key. Can be specified either in the arguments or through the various options supported through the AWS default credential provider chain including environment variables (AWS_SECRET_KEY or AWS_SECRET_ACCESS_KEY) and a credentials profiles file.

roleArn

STRING

Amazon resource name of the role to assume when accessing Kinesis.

roleExternalId

STRING

Used when delegating access to the AWS account.

roleSessionName

STRING

AWS role session name.

stsEndpoint

STRING

An endpoint for requesting temporary access credentials.

region

STRING

Region for the streams to be specified. The default is the locally resolved region.

endpoint

STRING

regional endpoint for Kinesis data streams. Th default is the locally resolved region.

initialPosition

STRING

Starting position for reading from in the stream. One of: ‘latest’ (default), ‘trim_horizon’, ‘earliest’, ‘at_timestamp’.

consumerMode

STRING

One of: ‘polling’ (default), or ‘EFO’ (enhanced-fan-out).

consumerName

STRING

The name of the consumer. All consumers are prefixed with ‘databricks_’. The default is an empty string.

registerConsumerTimeoutInterval

STRING

the max timeout to wait for the Kinesis EFO consumer to be registered with the Kinesis stream before throwing an error. The default is ‘300s’.

requireConsumerDeregistration

BOOLEAN

true to de-register the EFO consumer on query termination. Default is false.

deregisterConsumerTimeoutInterval

STRING

The max timeout to wait for the Kinesis EFO consumer to be deregistered with the Kinesis stream before throwing an error. The default is ‘300s’.

consumerRefreshInterval

STRING

The interval at which the consumer is checked and refreshed. The default is ‘300s’.

The following arguments are used for controlling the read throughput and latency for Kinesis:

Parameter

Type

Description

maxRecordsPerFetch

INTEGER (>0)

Optional, with a default of 10,000 records to be read per API request to Kinesis.

maxFetchRate

STRING

How fast to prefetch data per shard. A value between ‘1.0’ and ‘2.0’ that’s measured in MB/s. The default is ‘1.0’.

minFetchPeriod

STRING

The maximum wait time between consecutive prefetch attempts. The default is ‘400ms’.

maxFetchDuration

STRING

The maximum duration to buffer prefetched new data. The default is ’10s’.

fetchBufferSize

STRING

The amount of data for the next trigger. The default is ‘20gb’.

shardsPerTask

INTEGER (>0)

The number of Kinesis shards to prefetch from in parallel per spark task. The default is 5.

shardFetchinterval

STRING

How often to poll for resharding. The default is ‘1s’.

coalesceThresholdBlockSize

INTEGER (>0)

The threshold at which automatic coalesce occurs. The default is 10,000,000.

coalesce

BOOLEAN

true to coalesce prefetched requests. The default is true.

coalesceBinSize

INTEGER (>0)

The approximate block size after coalescing. The default is 128,000,000.

reuseKinesisClient

BOOLEAN

true to reuse the Kinesis client stored in the cache. The default is true except on a PE cluster.

clientRetries

INTEGER (>0)

The number of retries in the retry scenario. The default is 5.

Returns

A table of Kinesis records with the following schema:

Name

Data type

Nullable

Standard

Description

partitionKey

STRING

No

A key that is used to distribute data among the shards of a stream. All data records with the same partition key will be read from the same shard.

data

BINARY

No

The kinesis data payload, base-64 encoded.

stream

STRING

No

The name of the stream where the data was read from.

shardId

STRING

No

A unique identifier for the shard where the data was read from.

sequenceNumber

BIGINT

No

The unique identifier of the record within its shard.

approximateArrivalTimestamp

TIMESTAMP

No

The approximate time that the record was inserted into the stream.

The columns (stream, shardId, sequenceNumber) constitute a primary key.

Examples

-- Streaming Ingestion from Kinesis
> CREATE STREAMING TABLE testing.streaming_table AS
    SELECT * FROM STREAM read_kinesis (
        streamName => 'test_databricks',
        awsAccessKey => secret(test-databricks, awsAccessKey),
        awsSecretKey => secret(test-databricks, awsSecretKey),
        initialPosition => 'earliest');

-- The data would now need to be queried from the testing.streaming_table

-- A streaming query when the environment variables already contain AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY configured
> CREATE STREAMING TABLE testing.streaming_table AS
    SELECT * FROM STREAM read_kinesis (
        streamName => 'test_databricks',
        initialPosition => 'earliest');

-- A streaming query when the roleArn, roleSessionName, and roleExternalID are configured
> CREATE STREAMING TABLE testing.streaming_table AS
    SELECT * FROM STREAM read_kinesis (
        streamName => 'test_databricks',
        initialPosition => 'earliest',
        roleArn => 'arn:aws:iam::123456789012:role/MyRole',
        roleSessionName => 'testing@databricks.com');