Spark Code Cheatsheet

Word count in single line

 rdd.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Count number of lines in file
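
A minimal PySpark sketch, assuming `sc` is an existing SparkContext (as in the pyspark shell) and `path` is a placeholder for your own file. The helper name `count_lines` is hypothetical. `textFile` yields one record per line, so `count()` gives the number of lines.

```python
def count_lines(sc, path):
    """Return the number of lines in the text file at `path`.

    `sc` is assumed to be an existing SparkContext; `path` is a
    placeholder for your own file location.
    """
    # textFile() produces an RDD with one element per line of the file
    return sc.textFile(path).count()
```

In the pyspark shell this would be called as `count_lines(sc, csv_path)`.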


Display Spark version

print(sc.version)

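Read csv file into dataframe in spark with header

# sqlContext is assumed to be an existing SQLContext
df = (sqlContext.read.format('com.databricks.spark.csv')
      .option("header", "true")
      .load(csv_path))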

Read csv file into dataframe in spark without header
First you need to specify a schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("Field0", IntegerType()),
    StructField("Field1", StringType()),
    StructField("Field2", StringType())
])

Then read in the data:

df = (sqlContext.read.format('com.databricks.spark.csv')
      .load(csv_path, schema=schema))

Handle commas in embedded quoted strings when reading in data

This is done by specifying an escape option:

df = (sqlContext.read.format('com.databricks.spark.csv')
      .load(csv_path, schema=schema, escape='"'))

Obtain distinct values of column in Spark DF

df_raw.select('student_id').distinct().take(200)

Convert list of DataFrame Rows to Python list

student_ids = [x.student_id for x in df_raw.select('student_id').distinct().take(200)]

Filter out rows based on values in list
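
A minimal sketch, assuming Spark 1.5 or later (where `Column.isin` is available). The helper name `filter_by_values` and its `keep` flag are hypothetical. Pass `keep=False` to drop the rows whose value is in the list instead of keeping them.

```python
def filter_by_values(df, column, values, keep=True):
    """Keep (or drop) the rows of `df` whose `column` value is in `values`."""
    # isin() builds a boolean Column that is true where the value is in the list
    condition = df[column].isin(list(values))
    # ~ negates the Column expression, so matching rows are dropped instead
    return df.filter(condition if keep else ~condition)
```

For example, `filter_by_values(df_raw, 'student_id', student_ids)` keeps only the rows for the listed students.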


Save dataframe to csv (only for small datasets)

# Many Files
# Single File
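
A minimal sketch of both variants, assuming Spark 1.4 or later (for `df.write`) and the same `com.databricks.spark.csv` package used for reading. The helper name `save_csv` and the `out_dir` argument are hypothetical. The single-file variant relies on `coalesce(1)`, which is one common approach and not necessarily the one originally intended here.

```python
def save_csv(df, out_dir, single_file=False):
    """Write `df` as CSV under the directory `out_dir`."""
    # coalesce(1) merges all partitions so only one part file is written;
    # all data then flows through a single task, hence small datasets only
    target = df.coalesce(1) if single_file else df
    (target.write
           .format('com.databricks.spark.csv')
           .option('header', 'true')
           .save(out_dir))
```

Either way Spark writes a directory: the many-files variant produces one part file per partition, and the single-file variant produces one part file inside `out_dir`.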

Count number of distinct composite keys in data frame:

df.select('major','gender').distinct().count()
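
Note that `distinct().count()` returns how many different key combinations exist. If a per-key tally is wanted instead, a `groupBy` is one way to get it. A minimal sketch, where the helper name `count_per_key` is hypothetical:

```python
def count_per_key(df, *key_columns):
    """Return one row per distinct key combination with its row count."""
    # groupBy(...).count() adds a column named "count" holding the
    # number of rows in each group
    return df.groupBy(*key_columns).count()
```

For example, `count_per_key(df, 'major', 'gender').show()` prints one line per (major, gender) pair.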

Start pyspark in python notebook mode

export PYSPARK_DRIVER_PYTHON=ipython;pyspark