# Tutorial: Analyze data with glm

Learn how to perform linear and logistic regression using a generalized linear model (GLM) in Databricks. `glm` fits a generalized linear model, similar to R’s `glm()`.

**Syntax**: `glm(formula, data, family, ...)`

**Parameters**:

- `formula`: Symbolic description of the model to be fitted, for example `ResponseVariable ~ Predictor1 + Predictor2`. Supported operators: `~`, `+`, `-`, and `.`
- `data`: Any SparkDataFrame
- `family`: String, `"gaussian"` for linear regression or `"binomial"` for logistic regression
- `lambda`: Numeric, regularization parameter
- `alpha`: Numeric, elastic-net mixing parameter

**Output**: MLlib PipelineModel
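To make the `formula` operators concrete, here is a minimal sketch in base R rather than SparkR. It assumes that base R's formula expansion is a fair stand-in for how `.` and `-` behave in the supported subset. The data frame `toy` and its values are made up for illustration.

```r
# Hypothetical toy data; column names mirror a few diamonds columns.
toy <- data.frame(
  price = c(326, 334, 423, 500),
  carat = c(0.23, 0.21, 0.29, 0.31),
  depth = c(61.5, 59.8, 62.4, 63.3),
  table = c(55, 61, 58, 57)
)

# `.` stands for every column other than the response; `-` drops a term.
all_terms  <- attr(terms(price ~ ., data = toy), "term.labels")
some_terms <- attr(terms(price ~ . - table, data = toy), "term.labels")

print(all_terms)   # predictors selected by `price ~ .`
print(some_terms)  # predictors left after removing `table`
```

Inspecting the expanded terms like this is a quick way to check which predictors a formula selects before fitting a model.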

This tutorial shows how to perform linear and logistic regression on the diamonds dataset.

## Load diamonds data and split into training and test sets

```
library(SparkR)

# Read the diamonds.csv dataset as a SparkDataFrame
diamonds <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
# Rename the unnamed column to rowID
diamonds <- withColumnRenamed(diamonds, "", "rowID")

# Split the data into a training set (a random sample of roughly 70% of the rows,
# drawn without replacement) and a test set (the remaining rows)
trainingData <- sample(diamonds, FALSE, 0.7)
testData <- except(diamonds, trainingData)

# Exclude rowIDs (the first column)
trainingData <- trainingData[, -1]
testData <- testData[, -1]

# Print the row counts of the full, training, and test sets
print(count(diamonds))
print(count(trainingData))
print(count(testData))
```

```
head(trainingData)
```

## Train a linear regression model using `glm()`

This section shows how to predict a diamond’s price from its features by training a linear regression model using the training data.

The data contains a mix of categorical features (such as cut, with values Ideal, Premium, Very Good, …) and continuous features (such as depth and carat). SparkR encodes the categorical features automatically, so you don’t have to encode them manually.
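As a rough illustration of what encoding a categorical feature means, the following base R sketch expands a `cut` column into indicator columns with `model.matrix()`. This is only an analogy: the details of SparkR's own encoding (for example, which level serves as the baseline) may differ, and the `toy` data frame is hypothetical.

```r
# Hypothetical toy data with one categorical and one continuous feature.
toy <- data.frame(
  cut   = factor(c("Ideal", "Premium", "Very Good", "Ideal")),
  carat = c(0.23, 0.21, 0.29, 0.31)
)

# Build the numeric design matrix that a model would actually be fitted on.
design <- model.matrix(~ cut + carat, data = toy)

# One indicator column per non-baseline level of `cut`, plus `carat`.
print(colnames(design))
```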

```
# Family = "gaussian" to train a linear regression model
lrModel <- glm(price ~ ., data = trainingData, family = "gaussian")
# Print a summary of the trained model
summary(lrModel)
```
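The comment above says that the gaussian family yields a linear regression. A small base R sketch can confirm this on a local data frame, assuming base R's `stats::glm()` as a stand-in for the SparkR function; the `toy` values are made up.

```r
# Hypothetical toy data: price as a function of carat.
toy <- data.frame(
  carat = c(0.2, 0.3, 0.5, 0.7, 1.0),
  price = c(350, 520, 1400, 2600, 5100)
)

# Ordinary least squares and a gaussian GLM fitted to the same data.
fit_lm  <- stats::lm(price ~ carat, data = toy)
fit_glm <- stats::glm(price ~ carat, data = toy, family = gaussian)

# The two fits should agree up to numerical tolerance.
same_coefficients <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_coefficients)
```

The `stats::` prefix is used so that the base R functions are called even when SparkR is attached.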

Use `predict()` on the test data to see how well the model works on new data.

**Syntax:** `predict(model, newData)`

**Parameters:**

- `model`: MLlib model
- `newData`: SparkDataFrame, typically your test set

**Output:** `SparkDataFrame`

```
# Generate predictions using the trained model
predictions <- predict(lrModel, newData = testData)
# View predictions against price column
display(select(predictions, "price", "prediction"))
```

Evaluate the model by computing the prediction errors and the root mean squared error (RMSE).

```
# Compute the per-row error (actual price minus predicted price)
errors <- select(predictions, predictions$price, predictions$prediction,
                 alias(predictions$price - predictions$prediction, "error"))
display(errors)
# Calculate RMSE
head(select(errors, alias(sqrt(sum(errors$error^2, na.rm = TRUE) / nrow(errors)), "RMSE")))
```
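The RMSE calculation above takes the square root of the mean squared error. The same arithmetic can be followed by hand in this base R sketch, which uses made-up `actual` and `predicted` vectors in place of the SparkDataFrame columns.

```r
# Hypothetical actual and predicted prices.
actual    <- c(300, 500, 700, 900)
predicted <- c(310, 480, 730, 900)

# Per-row error, as in the "error" column above.
error <- actual - predicted

# RMSE: square the errors, average them, take the square root.
rmse <- sqrt(sum(error^2) / length(error))
print(rmse)
```

Here the squared errors sum to 1400, so the RMSE is the square root of 350, roughly 18.7, in the same units as the price.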

## Train a logistic regression model using `glm()`

This section shows how to train a logistic regression model on the same dataset to predict a diamond’s cut based on some of its features.

Logistic regression in MLlib supports binary classification. To test the algorithm in this example, subset the data to work with two labels.

```
# Subset data to include rows where diamond cut = "Premium" or diamond cut = "Very Good"
trainingDataSub <- subset(trainingData, trainingData$cut %in% c("Premium", "Very Good"))
testDataSub <- subset(testData, testData$cut %in% c("Premium", "Very Good"))
```

```
# Family = "binomial" to train a logistic regression model
logrModel <- glm(cut ~ price + color + clarity + depth, data = trainingDataSub, family = "binomial")
# Print summary of the trained model
summary(logrModel)
```

```
# Generate predictions using the trained model
predictionsLogR <- predict(logrModel, newData = testDataSub)
# View predictions against label column
display(select(predictionsLogR, "label", "prediction"))
```
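For intuition about where a 0/1 `prediction` comes from, the following base R sketch applies the logistic function to a few linear scores and thresholds the resulting probabilities. The 0.5 threshold is an assumption for illustration, and the `score` values are made up.

```r
# Hypothetical linear scores (intercept plus weighted features) for four rows.
score <- c(-2, 0, 0.5, 3)

# The logistic function maps each score to a probability in (0, 1).
prob <- 1 / (1 + exp(-score))

# Assumed threshold of 0.5: predict 1 when the probability exceeds it, else 0.
pred <- as.numeric(prob > 0.5)

print(prob)
print(pred)
```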

Evaluate the model by computing the absolute difference between each label and its prediction.

```
# Compute the absolute error per row: 0 when the prediction matches the label, 1 otherwise
errorsLogR <- select(predictionsLogR, predictionsLogR$label, predictionsLogR$prediction,
                     alias(abs(predictionsLogR$label - predictionsLogR$prediction), "error"))
display(errorsLogR)
```
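Because each `error` value is 0 for a correct prediction and 1 for an incorrect one, the error column can be summarized as an accuracy figure. The base R sketch below shows the arithmetic on made-up `label` and `prediction` vectors; it assumes labels coded as 0 and 1.

```r
# Hypothetical 0/1 labels and predictions.
label      <- c(0, 1, 1, 0, 1)
prediction <- c(0, 1, 0, 0, 1)

# Absolute error per row, as in the "error" column above.
error <- abs(label - prediction)

# The mean error is the misclassification rate; accuracy is its complement.
accuracy <- 1 - mean(error)
print(accuracy)
```

With one wrong prediction out of five rows, the misclassification rate is 0.2 and the accuracy is 0.8.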