You can perform natural language processing tasks on Databricks using popular open-source libraries such as Spark ML and spark-nlp, or proprietary libraries through the Databricks partnership with John Snow Labs.
For examples of NLP with Hugging Face, see Additional resources.
Spark ML contains a range of text processing tools that create features from text columns. You can use them directly in your Spark ML pipelines to build input features for model training algorithms. The supported text processors include tokenization, stop-word processing, word2vec, and feature hashing.
You can scale out many deep learning methods for natural language processing on Spark using the open-source Spark NLP library. This library supports standard natural language processing operations such as tokenizing, named entity recognition, and vectorization using the included annotators. You can also summarize, translate, and generate text using many pre-trained deep learning models based on Spark NLP's transformers, such as BERT, T5, and Marian.
Spark NLP provides many pre-trained models you can use with minimal code. This section contains an example of using the Marian Transformer for machine translation. For the full set of examples, see the Spark NLP documentation.
To use Spark NLP, create or use a cluster running any compatible runtime.
Install Spark NLP on the cluster using the latest Maven coordinates for Spark NLP, such as
In a notebook cell, install the sparknlp Python libraries:
%pip install sparknlp
Construct a pipeline for translation and run it on some sample text:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
  .setInputCols("document").setOutputCol("sentence")

marian_transformer = MarianTransformer.pretrained() \
  .setInputCols("sentence").setOutputCol("translation")

pipeline = Pipeline().setStages([document_assembler, sentence_detector, marian_transformer])

data = spark.createDataFrame([["You can use Spark NLP to translate text. " + \
  "This example pipeline translates English to French"]]).toDF("text")

# Create a pipeline model that can be reused across multiple data frames
model = pipeline.fit(data)

# You can use the model on any data frame that has a "text" column
result = model.transform(data)

display(result.select("text", "translation.result"))
The example notebook illustrates how to train a named entity recognition model using Spark NLP, save the model to MLflow, and use the model for inference on text. Refer to the John Snow Labs documentation for Spark NLP to learn how to train additional natural language processing models.