Word count in single line
rdd.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey((count, accum) => count + accum)
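Since the rest of this sheet is mostly PySpark, here is the same word count as a Python sketch (lambdas replace the Scala arrows; rdd is assumed to be an RDD of text lines):
counts = rdd.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda count, accum: count + accum)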
Count number of lines in file
rdd.count
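The RDD snippets above assume an existing RDD of lines; a minimal way to build one (file_path is a placeholder):
rdd = sc.textFile(file_path)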
Display Spark version
print(sc.version)
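On Spark 2+ the same property is also available on the SparkSession:
print(spark.version)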
Read csv file into dataframe in spark with header
df=spark.read.option("header","true").option("delimiter",",").csv(csv_path)
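To infer column types instead of reading every column as a string, add the inferSchema option (this costs an extra pass over the data):
df=spark.read.option("header","true").option("inferSchema","true").csv(csv_path)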
Read csv file into dataframe in spark without header
First, you need to specify a schema (note the required imports):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("Field0", IntegerType()),
    StructField("Field1", StringType()),
    StructField("Field2", StringType())
])
Then read in the data:
df=spark.read.format('com.databricks.spark.csv').load(csv_path, schema=schema)
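com.databricks.spark.csv is the legacy external package name; on Spark 2+ the built-in reader does the same thing:
df=spark.read.csv(csv_path, schema=schema)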
Handle commas in embedded quoted strings when reading in data
This is done by specifying an escape option:
df_raw=spark.read.format('com.databricks.spark.csv').load(csv_path, schema=schema, escape='"')
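For example, with escape='"' a row like the one below parses cleanly; the comma inside the quotes is handled by the quoting itself, and the doubled quotes become a literal quote character (hypothetical data):
# input row:  1,"Doe, Jane ""JD""",Math
# parsed as:  (1, 'Doe, Jane "JD"', 'Math')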
Obtain distinct values of column in Spark DF
student_ids=df_raw.select('student_id').distinct().take(200)
Convert list of DataFrame Rows to Python list
student_ids=[x.student_id for x in df_raw.select('student_id').distinct().take(200)]
Filter rows to those whose values are in a list
from pyspark.sql.functions import col
filtered_df=df_raw.where(col('student_id').isin(student_ids))
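To drop the listed values instead of keeping them, negate the condition with ~:
filtered_df=df_raw.where(~col('student_id').isin(student_ids))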
Save dataframe to csv (only for small datasets)
# Many files
filtered_df.write.csv(output_csv)
# Single file
filtered_df.coalesce(1).write.option("header","true").csv(output_csv)
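If the output path may already exist, a write mode can be set; this sketch overwrites any existing data at output_csv:
filtered_df.coalesce(1).write.mode('overwrite').option("header","true").csv(output_csv)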
Count number of distinct composite keys in a data frame:
df.select('major','gender').distinct().count()
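If you instead want the number of occurrences per composite key, group and count (a sketch using the same columns):
df.groupBy('major','gender').count().show()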
Start pyspark in IPython mode
export PYSPARK_DRIVER_PYTHON=ipython; pyspark
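For a Jupyter notebook rather than the IPython shell, the usual pattern is (assuming jupyter is installed):
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark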
Display spark dataframe with all columns using pandas
import pandas as pd
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', 100)
df.limit(5).toPandas()