Spark Code Cheatsheet

Word count in a single chain

 rdd.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey((count, accum) => count + accum)
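Conceptually, flatMap splits each line into words, map pairs each word with a 1, and reduceByKey sums the counts per word. A plain-Python sketch of the same pipeline (no Spark required; the `lines` sample is made up):

```python
lines = ["to be or not to be"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one
```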

Count number of lines in file

rdd.count

Display Spark version

print(sc.version)

Read csv file into dataframe in spark with header

df = (spark.read.option("header", "true")
      .option("delimiter", ",")
      .csv(csv_path))

Read csv file into dataframe in spark without header
First you need to specify a schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("Field0", IntegerType()),
    StructField("Field1", StringType()),
    StructField("Field2", StringType())
])

Then read in the data:

df = (spark.read.format('com.databricks.spark.csv')
      .load(csv_path, schema=schema))

(On Spark 2.0+, the built-in format('csv') works without the Databricks package.)


Handle commas in embedded quoted strings when reading in data

This is done by specifying an escape option:

df_raw = (spark.read.format('com.databricks.spark.csv')
          .load(csv_path, schema=schema, escape='"'))
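To see why quote handling matters, here is a plain-Python sketch using the standard csv module: a quoted field containing a comma stays one field instead of being split in two (the sample row is made up):

```python
import csv
import io

# A row whose second field contains an embedded comma inside quotes
raw = 'id,note\n1,"hello, world"\n'

rows = list(csv.reader(io.StringIO(raw)))
# The quoted comma does not split the field
```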

Obtain distinct values of column in Spark DF

student_ids=df_raw.select('student_id').distinct().take(200)  # at most 200 Rows

Convert list of DataFrame Rows to Python list

student_ids=[x.student_id for x in df_raw
            .select('student_id').distinct().take(200)]
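Each element returned by take() is a Row whose columns are readable as attributes; the comprehension just pulls one attribute out of each. A namedtuple can stand in for Row to show the pattern without Spark (the data below is illustrative):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row with a student_id column
Row = namedtuple("Row", ["student_id"])
rows = [Row(student_id=1), Row(student_id=2), Row(student_id=2)]

# Pull the attribute out of each Row
student_ids = [x.student_id for x in rows]

# Dedupe, mimicking distinct()
unique_ids = sorted(set(student_ids))
```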

Filter out rows based on values in list

from pyspark.sql.functions import col

filtered_df = df_raw.where(col('student_id').isin(student_ids))
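isin() keeps only the rows whose column value appears in the given list. The equivalent plain-Python filter over a list of dicts (sample data is hypothetical):

```python
rows = [
    {"student_id": 1, "name": "Ann"},
    {"student_id": 2, "name": "Bob"},
    {"student_id": 3, "name": "Cho"},
]
student_ids = [1, 3]

# where(col('student_id').isin(student_ids)) keeps matching rows
filtered = [r for r in rows if r["student_id"] in student_ids]
```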

Save dataframe to csv (only for small datasets)

# Many files (one per partition)
filtered_df.write.csv(output_csv)

# Single file
(filtered_df.coalesce(1).write
    .option("header", "true").csv(output_csv))

Count distinct values of a composite key in a data frame:

df.select('major','gender').distinct().count()

Count the number of occurrences of each composite key:

df.groupBy('major','gender').count()
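A plain-Python sketch of both counts over (major, gender) pairs, using made-up data: a Counter gives the occurrences per composite key (the groupBy-style count), and its length is the number of distinct keys:

```python
from collections import Counter

# Sample (major, gender) pairs; data is made up
rows = [("math", "F"), ("math", "M"), ("math", "F"), ("cs", "F")]

# Occurrences per composite key
per_key = Counter(rows)

# Number of distinct composite keys
n_distinct = len(per_key)
```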

Start pyspark with IPython as the driver shell

export PYSPARK_DRIVER_PYTHON=ipython; pyspark
