Reading the training dataset

The dataset ships as an Excel file, Cryotherapy.xlsx, which contains the data along with the text of the data usage agreement. I therefore copied just the data into a CSV file named Cryotherapy.csv. Let's start by creating a SparkSession, the gateway to Spark:

import org.apache.spark.sql.SparkSession

// Local session that uses all available cores
val spark = SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/temp")
  .appName("CryotherapyPrediction")
  .getOrCreate()

import spark.implicits._

Next, let's read the training set into a DataFrame, treating the first row as a header and letting Spark infer the column types:

var CryotherapyDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/Cryotherapy.csv")
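To make the effect of the two reader options concrete, here is a minimal sketch on a throwaway file. The file contents, the column names id and score, and the names demoSpark and demoDF are made up for illustration and are not part of the Cryotherapy data.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// getOrCreate() reuses an already running session if there is one.
val demoSpark = SparkSession.builder
  .master("local[*]")
  .appName("CsvOptionsDemo")
  .getOrCreate()

// Hypothetical two-column CSV written to a temporary file.
val tmp = Files.createTempFile("csv-options-demo", ".csv")
Files.write(tmp, "id,score\n1,0.5\n2,1.25\n".getBytes("UTF-8"))

val demoDF = demoSpark.read
  .option("header", "true")      // first line supplies the column names
  .option("inferSchema", "true") // extra pass over the data to guess types
  .csv(tmp.toString)

// With these values, id is inferred as an integer and score as a double.
demoDF.printSchema()
```

Without the inferSchema option, the reader would treat every column as a string, so the numeric features would have to be cast by hand later.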

Let's check whether the CSV reader parsed the data properly, including the header and the column types:

CryotherapyDF.printSchema()

The schema output, shown in the screenshot, confirms that the column types of the Spark DataFrame have been correctly identified. Also, as expected, all the features for the ML algorithms are numeric (that is, integers or doubles).
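The same check can be done in code rather than by reading the printed schema. The sketch below is one way to do it; nonNumericColumns is a hypothetical helper, not something from Spark or from this chapter, and the toy DataFrame with columns i, d, and s exists only to exercise it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.NumericType

// Hypothetical helper: names of the columns whose type is not numeric.
def nonNumericColumns(df: DataFrame): Seq[String] =
  df.schema.fields
    .filterNot(_.dataType.isInstanceOf[NumericType])
    .map(_.name)
    .toSeq

// Toy DataFrame with made-up columns: an Int, a Double, and a String.
val checkSpark = SparkSession.builder
  .master("local[*]")
  .appName("NumericCheck")
  .getOrCreate()
val toyDF = checkSpark
  .createDataFrame(Seq((1, 0.5, "a"), (2, 1.5, "b")))
  .toDF("i", "d", "s")

// Only the string column is reported.
println(nonNumericColumns(toyDF).mkString(", ")) // prints: s
```

On the DataFrame from this section, nonNumericColumns(CryotherapyDF) should come back empty if every column was inferred as an integer or a double.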

We can preview the dataset with the show() method, which takes the number of rows to print; here, let's say 5:

CryotherapyDF.show(5)

The output of the preceding line of code shows the first five samples of the DataFrame: