Natural language processing

You can perform natural language processing tasks on Databricks using popular open source libraries such as Spark ML and Spark NLP, or proprietary libraries through the Databricks partnership with John Snow Labs.

For examples of NLP with Hugging Face, see Additional resources.

Feature creation from text using Spark ML

Spark ML contains a range of text processing tools for creating features from text columns. You can create input features from text for model training algorithms directly in your Spark ML pipelines. Spark ML supports a range of text processors, including tokenization, stop-word removal, word2vec, and feature hashing.
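
The following is a minimal sketch of such a pipeline (the sample data and column names are hypothetical): it tokenizes a text column, removes stop words, and hashes the remaining terms into fixed-length TF-IDF feature vectors.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Hypothetical sample data with a free-text column named "text"
df = spark.createDataFrame(
    [("Spark ML can turn raw text into numeric features",),
     ("Feature hashing keeps the feature vector size fixed",)],
    ["text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")            # split text into tokens
remover = StopWordsRemover(inputCol="words", outputCol="filtered")   # drop common stop words
hashing_tf = HashingTF(inputCol="filtered", outputCol="raw_features", numFeatures=1024)  # feature hashing
idf = IDF(inputCol="raw_features", outputCol="features")             # re-weight by inverse document frequency

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
features = pipeline.fit(df).transform(df)

display(features.select("text", "features"))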

Training and inference using Spark NLP

You can scale out many deep learning methods for natural language processing on Spark using the open-source Spark NLP library. This library supports standard natural language processing operations such as tokenization, named entity recognition, and vectorization using the included annotators. You can also summarize, perform named entity recognition, translate, and generate text using many pre-trained deep learning models based on Spark NLP’s transformers, such as BERT, T5, and Marian.
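
As a minimal sketch of working with a pretrained model (assuming Spark NLP is already installed on the cluster as described in the requirements in the next section; the pipeline name and output keys are taken from the John Snow Labs model hub), the following downloads an English pipeline and runs named entity recognition on a single string:

from sparknlp.pretrained import PretrainedPipeline

# Download a pretrained English pipeline that tokenizes, lemmatizes, and tags named entities.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# Annotate a single string; the result is a dictionary mapping annotation types to lists.
result = pipeline.annotate("Spark NLP is developed by John Snow Labs and runs on Databricks.")

print(result["entities"])  # named entity chunks found in the text
print(result["token"])     # tokens produced by the tokenizer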

Perform inference in batch using Spark NLP on CPUs

Spark NLP provides many pre-trained models you can use with minimal code. This section contains an example of using the Marian Transformer for machine translation. For the full set of examples, see the Spark NLP documentation.

Requirements

  • Install Spark NLP on the cluster using the latest Maven coordinates for Spark NLP, such as com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0. Your cluster must be started with the appropriate Spark configuration options set for this library to work; see the configuration sketch after this list.

  • To use Spark NLP, your cluster must have the correct .jar file downloaded from John Snow Labs. You can create a new cluster or use an existing cluster running any compatible runtime.
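
As an example of the required Spark configuration, the John Snow Labs documentation recommends cluster settings along the following lines; treat this as a sketch and check the Spark NLP documentation for the values that match your Spark NLP version:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 2000M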

Example code for machine translation

In a notebook cell, install the sparknlp Python library:

%pip install sparknlp

Construct a pipeline for translation and run it on some sample text:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
  .setInputCols("document").setOutputCol("sentence")

marian_transformer = MarianTransformer.pretrained() \
  .setInputCols("sentence").setOutputCol("translation")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, marian_transformer])

data = spark.createDataFrame([["You can use Spark NLP to translate text. " + \
                               "This example pipeline translates English to French"]]).toDF("text")

# Create a pipeline model that can be reused across multiple data frames
model = pipeline.fit(data)

# You can use the model on any data frame that has a "text" column
result = model.transform(data)

display(result.select("text", "translation.result"))
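
Because the fitted pipeline is a standard Spark ML PipelineModel, you can apply it to any other DataFrame that has a text column. For example (the sample sentence is hypothetical):

new_data = spark.createDataFrame([["Spark NLP provides many other pretrained models."]]).toDF("text")

display(model.transform(new_data).select("text", "translation.result"))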

Example: Named entity recognition model using Spark NLP and MLflow

The example notebook illustrates how to train a named entity recognition model using Spark NLP, save the model to MLflow, and use the model for inference on text. Refer to the John Snow Labs documentation for Spark NLP to learn how to train additional natural language processing models.

Spark NLP model training and inference notebook

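The following is a minimal sketch of the MLflow portion of that workflow, not the notebook’s exact code: it logs a fitted Spark ML pipeline model (such as the translation model fit above) to MLflow and reloads it for inference.

import mlflow

# Log the fitted PipelineModel to MLflow so it can be versioned and reloaded later.
with mlflow.start_run():
    model_info = mlflow.spark.log_model(model, artifact_path="spark_nlp_model")

# Reload the logged model and use it for inference on a DataFrame with a "text" column.
loaded_model = mlflow.spark.load_model(model_info.model_uri)
predictions = loaded_model.transform(data)
display(predictions.select("text", "translation.result"))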