Hello! I wrote last month about inferSchema performance when using Databricks to read a CSV into a dataframe. Long story short: it’s better to declare the schema of the dataframe before reading the CSV, because if you let Databricks infer the schema it has to read the entire contents of the CSV once just to work out the data types. All very good, but this can become problematic when you are reading a particularly wide CSV, especially if you do not need all of the columns, only a subset of them.

However, there is a solution to this: if you use withColumn when reading the CSV to cast the columns you need to the right data types, and then select only those columns into the final dataframe, you get the benefit of not needing to declare a full schema whilst getting execution times close to those you would see if you did declare one.

The notebook below has three cells. The first sets inferSchema to true so that no schema is declared; on a small cluster this takes about 7 seconds to run. The second cell uses a declared schema and takes about 3 seconds. Obviously, with a larger CSV the benefits are more pronounced. The final cell uses withColumn to cast the columns we want from the CSV and then selects them into the final dataframe; this took about 4 seconds to run. The good thing about this method is that you get the dataframe you want without the overhead of columns you never wanted in the first place.
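
If you want to reproduce the timings on your own cluster, a rough harness along these lines should do it. This is my sketch rather than part of the original notebook: wrap each cell's read the same way, and treat the numbers as relative, since they depend on cluster size.

import time

csv_dir = "/databricks-datasets/bikeSharing/data-001/hour.csv"

# inferSchema forces Spark to scan the file when the read is issued, so timing
# the read call itself captures the inference cost
start = time.perf_counter()
_ = spark.read.option('header', 'true').option('inferSchema', 'true').csv(csv_dir)
print(f"inferSchema read: {time.perf_counter() - start:.2f}s")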

# Databricks notebook source
csv_dir = "/databricks-datasets/bikeSharing/data-001/hour.csv"
csv_options = {
          'inferSchema': 'true',
          'multiline': 'true',
          'header': 'true',
          'encoding': 'UTF8'
      }
infer_schema_df = spark.read.options(**csv_options).csv(csv_dir)
display(infer_schema_df)

# COMMAND ----------
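
# Not part of the original post: a quick check on what the inference pass
# decided, for comparison with the schema declared by hand in the next cell.
infer_schema_df.printSchema()

# COMMAND ----------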

from pyspark.sql.types import *


schema = StructType([
    StructField("instant", IntegerType(), True),
    StructField("dteday", TimestampType(), True),
    StructField("season", IntegerType(), True),
    StructField("yr", IntegerType(), True),
    StructField("mnth", IntegerType(), True),
    StructField("hr", IntegerType(), True),
    StructField("holiday", IntegerType(), True),
    StructField("weekday", IntegerType(), True),
    StructField("workingday", IntegerType(), True),
    StructField("weathersit", IntegerType(), True),
    StructField("temp", DoubleType(), True),
    StructField("atemp", DoubleType(), True),
    StructField("hum", DoubleType(), True),
    StructField("windspeed", DoubleType(), True),
    StructField("casual", IntegerType(), True),
    StructField("registered", IntegerType(), True),
    StructField("cnt", DoubleType(), True)
])

csv_dir = "/databricks-datasets/bikeSharing/data-001/hour.csv"
csv_options = {
          'inferSchema': 'false',
          'multiline': 'true',
          'header': 'true',
          'encoding': 'UTF8'
      }
no_infer_df = spark.read.options(**csv_options).csv(csv_dir,schema=schema)
display(no_infer_df)

# COMMAND ----------
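
# An aside, and my sketch rather than the original post's: since Spark 2.3 the
# same schema can be passed as a DDL-formatted string, which is far less typing
# than building a StructType by hand. Reuses csv_dir and csv_options from above.
ddl_schema = ("instant INT, dteday TIMESTAMP, season INT, yr INT, mnth INT, "
              "hr INT, holiday INT, weekday INT, workingday INT, weathersit INT, "
              "temp DOUBLE, atemp DOUBLE, hum DOUBLE, windspeed DOUBLE, "
              "casual INT, registered INT, cnt DOUBLE")
ddl_df = spark.read.options(**csv_options).schema(ddl_schema).csv(csv_dir)
display(ddl_df)

# COMMAND ----------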

from pyspark.sql.types import *
from pyspark.sql.functions import *
csv_dir = "/databricks-datasets/bikeSharing/data-001/hour.csv"
csv_options = {
          'inferSchema': 'false',
          'multiline': 'true',
          'header': 'true',
          'encoding': 'UTF8'
      }
single_col_df = spark.read.options(**csv_options).csv(csv_dir).withColumn('dteday', col('dteday').cast(TimestampType())).select('dteday')
display(single_col_df)

# COMMAND ----------
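
# Extending the final cell (again my sketch, not the original post's): the casts
# can go straight into the select, which avoids chaining one withColumn call per
# column when you want several. cast() keeps the original column names.
subset_df = (
    spark.read.options(**csv_options)
    .csv(csv_dir)
    .select(
        col('dteday').cast('timestamp'),
        col('cnt').cast('double'),
    )
)
display(subset_df)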