PySpark custom data sources

Preview

PySpark custom data sources are in Public Preview in Databricks Runtime 15.2 and above. Streaming support is available in Databricks Runtime 15.3 and above.

A PySpark DataSource is created by the Python (PySpark) DataSource API, which enables reading from custom data sources and writing to custom data sinks in Apache Spark using Python. You can use PySpark custom data sources to define custom connections to data systems and implement additional functionality, to build out reusable data sources.

DataSource class

The PySpark DataSource is a base class that provides methods to create data readers and writers.

Implement the data source subclass

Depending on your use case, the following must be implemented by any subclass to make a data source either readable, writable, or both:

Property or Method	Description
`name`	Required. The name of the data source
`schema`	Required. The schema of the data source to be read or written
`reader()`	Must return a `DataSourceReader` to make the data source readable (batch)
`writer()`	Must return a `DataSourceWriter` to make the data sink writeable (batch)
`streamReader()` or `simpleStreamReader()`	Must return a `DataSourceStreamReader` to make the data stream readable (streaming)
`streamWriter()`	Must return a `DataSourceStreamWriter` to make the data stream writeable (streaming)

Note

The user-defined DataSource, DataSourceReader, DataSourceWriter, DataSourceStreamReader, DataSourceStreamWriter, and their methods must be able to be serialized. In other words, they must be a dictionary or nested dictionary that contains a primitive type.

Register the data source

After implementing the interface, you must register it, then you can load or otherwise use it as shown in the following example:

# Register the data source
spark.dataSource.register(MyDataSourceClass)

# Read from a custom data source
spark.read.format("my_datasource_name").load().show()

Example 1: Create a PySpark DataSource for batch query

To demonstrate PySpark DataSource reader capabilities, create a data source that generates example data using the faker Python package. For more information about faker, see the Faker documentation.

Install the faker package using the following command:

%pip install faker

Step 1: Define the example DataSource

First, define your new PySpark DataSource as a subclass of DataSource with a name, schema, and reader. The reader() method must be defined to read from a data source in a batch query.

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType

class FakeDataSource(DataSource):
    """
    An example data source for batch query using the `faker` library.
    """

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "name string, date string, zipcode string, state string"

    def reader(self, schema: StructType):
        return FakeDataSourceReader(schema, self.options)

Step 2: Implement the reader for a batch query

Next, implement the reader logic to generate example data. Use the installed faker library to populate each field in the schema.

class FakeDataSourceReader(DataSourceReader):

    def __init__(self, schema, options):
        self.schema: StructType = schema
        self.options = options

    def read(self, partition):
        # Library imports must be within the method.
        from faker import Faker
        fake = Faker()

        # Every value in this `self.options` dictionary is a string.
        num_rows = int(self.options.get("numRows", 3))
        for _ in range(num_rows):
            row = []
            for field in self.schema.fields:
                value = getattr(fake, field.name)()
                row.append(value)
            yield tuple(row)

Step 3: Register and use the example data source

To use the data source, register it. By default, the FakeDataSource has three rows, and the schema includes these string fields: name, date, zipcode, state. The following example registers, loads, and outputs the example data source with the defaults:

spark.dataSource.register(FakeDataSource)
spark.read.format("fake").load().show()

+-----------------+----------+-------+----------+
|             name|      date|zipcode|     state|
+-----------------+----------+-------+----------+
|Christine Sampson|1979-04-24|  79766|  Colorado|
|       Shelby Cox|2011-08-05|  24596|   Florida|
|  Amanda Robinson|2019-01-06|  57395|Washington|
+-----------------+----------+-------+----------+

Only string fields are supported, but you can specify a schema with any fields that correspond to faker package providers’ fields to generate random data for testing and development. The following example loads the data source with name and company fields:

spark.read.format("fake").schema("name string, company string").load().show()

+---------------------+--------------+
|name                 |company       |
+---------------------+--------------+
|Tanner Brennan       |Adams Group   |
|Leslie Maxwell       |Santiago Group|
|Mrs. Jacqueline Brown|Maynard Inc   |
+---------------------+--------------+

To load the data source with a custom number of rows, specify the numRows option. The following example specifies 5 rows:

spark.read.format("fake").option("numRows", 5).load().show()

+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|  Pam Mitchell|1988-10-20|  23788|   Tennessee|
|Melissa Turner|1996-06-14|  30851|      Nevada|
|  Brian Ramsey|2021-08-21|  55277|  Washington|
|  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
| Douglas James|2007-01-18|  46226|     Alabama|
+--------------+----------+-------+------------+

Example 2: Create PySpark DataSource for streaming read and write

To demonstrate PySpark DataSource stream reader and writer capabilities, create an example data source that generates two rows in every microbatch using the faker Python package. For more information about faker, see the Faker documentation.

Install the faker package using the following command:

%pip install faker

Step 1: Define the example DataSource

First, define your new PySpark DataSource as a subclass of DataSource with a name, schema, and methods streamReader() and streamWriter().

from pyspark.sql.datasource import DataSource, DataSourceStreamReader, SimpleDataSourceStreamReader, DataSourceStreamWriter
from pyspark.sql.types import StructType

class FakeStreamDataSource(DataSource):
    """
    An example data source for streaming read and write using the `faker` library.
    """

    @classmethod
    def name(cls):
        return "fakestream"

    def schema(self):
        return "name string, state string"

    def streamReader(self, schema: StructType):
        return FakeStreamReader(schema, self.options)

    # If you don't need partitioning, you can implement the simpleStreamReader method instead of streamReader.
    # def simpleStreamReader(self, schema: StructType):
    #    return SimpleStreamReader()

    def streamWriter(self, schema: StructType, overwrite: bool):
        return FakeStreamWriter(self.options)

Step 2: Implement the stream reader

Next, implement the example streaming data reader that generates two rows in every microbatch. You can implement DataSourceStreamReader, or if the data source has low throughput and doesn’t require partitioning, you can implement SimpleDataSourceStreamReader instead. Either simpleStreamReader() or streamReader() must be implemented, and simpleStreamReader() is only invoked when streamReader() is not implemented.

DataSourceStreamReader implementation

The streamReader instance has an integer offset that increases by 2 in every microbatch, implemented with the DataSourceStreamReader interface.

class RangePartition(InputPartition):
    def __init__(self, start, end):
        self.start = start
        self.end = end

class FakeStreamReader(DataSourceStreamReader):
    def __init__(self, schema, options):
        self.current = 0

    def initialOffset(self) -> dict:
        """
        Returns the initial start offset of the reader.
        """
        return {"offset": 0}

    def latestOffset(self) -> dict:
        """
        Returns the current latest offset that the next microbatch will read to.
        """
        self.current += 2
        return {"offset": self.current}

    def partitions(self, start: dict, end: dict):
        """
        Plans the partitioning of the current microbatch defined by start and end offset. It
        needs to return a sequence of :class:`InputPartition` objects.
        """
        return [RangePartition(start["offset"], end["offset"])]

    def commit(self, end: dict):
        """
        This is invoked when the query has finished processing data before end offset. This
        can be used to clean up the resource.
        """
        pass

    def read(self, partition) -> Iterator[Tuple]:
        """
        Takes a partition as an input and reads an iterator of tuples from the data source.
        """
        start, end = partition.start, partition.end
        for i in range(start, end):
            yield (i, str(i))

SimpleDataSourceStreamReader implementation

The SimpleStreamReader instance is the same as the FakeStreamReader instance that generates two rows in every batch, but implemented with the SimpleDataSourceStreamReader interface without partitioning.

class SimpleStreamReader(SimpleDataSourceStreamReader):
    def initialOffset(self):
        """
        Returns the initial start offset of the reader.
        """
        return {"offset": 0}

    def read(self, start: dict) -> (Iterator[Tuple], dict):
        """
        Takes start offset as an input, then returns an iterator of tuples and the start offset of the next read.
        """
        start_idx = start["offset"]
        it = iter([(i,) for i in range(start_idx, start_idx + 2)])
        return (it, {"offset": start_idx + 2})

    def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
        """
        Takes start and end offset as inputs, then reads an iterator of data deterministically.
        This is called when the query replays batches during restart or after a failure.
        """
        start_idx = start["offset"]
        end_idx = end["offset"]
        return iter([(i,) for i in range(start_idx, end_idx)])

    def commit(self, end):
        """
        This is invoked when the query has finished processing data before end offset. This can be used to clean up resources.
        """
        pass

Step 3: Implement the stream writer

Now implement the streaming writer. This streaming data writer writes the metadata information of each microbatch to a local path.

class SimpleCommitMessage(WriterCommitMessage):
   partition_id: int
   count: int

class FakeStreamWriter(DataSourceStreamWriter):
   def __init__(self, options):
       self.options = options
       self.path = self.options.get("path")
       assert self.path is not None

   def write(self, iterator):
       """
       Writes the data, then returns the commit message of that partition. Library imports must be within the method.
       """
       from pyspark import TaskContext
       context = TaskContext.get()
       partition_id = context.partitionId()
       cnt = 0
       for row in iterator:
           cnt += 1
       return SimpleCommitMessage(partition_id=partition_id, count=cnt)

   def commit(self, messages, batchId) -> None:
       """
       Receives a sequence of :class:`WriterCommitMessage` when all write tasks have succeeded, then decides what to do with it.
       In this FakeStreamWriter, the metadata of the microbatch(number of rows and partitions) is written into a JSON file inside commit().
       """
       status = dict(num_partitions=len(messages), rows=sum(m.count for m in messages))
       with open(os.path.join(self.path, f"{batchId}.json"), "a") as file:
           file.write(json.dumps(status) + "\n")

   def abort(self, messages, batchId) -> None:
       """
       Receives a sequence of :class:`WriterCommitMessage` from successful tasks when some other tasks have failed, then decides what to do with it.
       In this FakeStreamWriter, a failure message is written into a text file inside abort().
       """
       with open(os.path.join(self.path, f"{batchId}.txt"), "w") as file:
           file.write(f"failed in batch {batchId}")

Step 4: Register and use the example data source

To use the data source, register it. After it is regsitered, you can use it in streaming queries as a source or sink by passing a short name or full name to format(). The following example registers the data source, then starts a query that reads from the example data source and outputs to the console:

spark.dataSource.register(FakeStreamDataSource)
query = spark.readStream.format("fakestream").load().writeStream.format("console").start()

Alternatively, the following example uses the example stream as a sink and specifies an output path:

query = spark.readStream.format("fakestream").load().writeStream.format("fake").start("/output_path")

Troubleshooting

If the output is the following error, your compute does not support PySpark custom data sources. You must use Databricks Runtime 15.2 or above.

Error: [UNSUPPORTED_FEATURE.PYTHON_DATA_SOURCE] The feature is not supported: Python data sources. SQLSTATE: 0A000