Before performing any analysis or modelling on a dataset, it often requires preprocessing. Manipulating rows and columns allows you to clean the data, handle missing values, filter out irrelevant information, and transform the data into a suitable format for analysis.
Mastering these skills enables you to work efficiently with large-scale datasets and derive meaningful insights for various data-driven tasks.
- Create a Dataframe
- Add, update, and remove rows in DataFrame
- Add, update and remove columns in DataFrame
- Rename columns in Dataframe
Create a DataFrame
Step1 : import libraries
from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import col, when
Step 2: Create a spark session and then the DataFrame
spark = SparkSession.builder.appName('PySpark_Scratch_tutorial').getOrCreate() # create a dataframe from scratch df = spark.createDataFrame([ (32, "Alice",170,3000), (35, "Bob",185,3200), (28, "John",181,3500), (24, "Jan",190,2800), (25, "Mary",155,2800), ( 33, "Nina",165,3100), ], schema=["age", "name","height","salary"]) df.show() # you should see this +---+-----+------+------+ |age| name|height|salary| +---+-----+------+------+ | 32|Alice| 170| 3000| | 35| Bob| 185| 3200| | 28| John| 181| 3500| | 24| Jan| 190| 2800| | 25| Mary| 155| 2800| | 33| Nina| 165| 3100| +---+-----+------+------+
Add, Update, and Remove Rows
1. Add a new row
For example, let’s add a new row about Ivy’s information
# New row about Ivy new_row = Row(25, 'Ivy',170, 0) # Add the new row to the original DataFrame df_new = df.unionAll(spark.createDataFrame([new_row], df.schema)) df_new.show() # you should see this +---+-----+------+------+ |age| name|height|salary| +---+-----+------+------+ | 32|Alice| 170| 3000| | 35| Bob| 185| 3200| | 28| John| 181| 3500| | 24| Jan| 190| 2800| | 25| Mary| 155| 2800| | 33| Nina| 165| 3100| | 25| Ivy| 170| 0| +---+-----+------+------+
2. Update a row
Since Ivy’s salary is 0, it is less likely to be true. We can assign a new value and update the row.
df_new = df_new.withColumn('salary', when(col('name') == 'Ivy', 1000).otherwise(col('name'))) df_new.show() # you should see this +---+-----+------+------+ |age| name|height|salary| +---+-----+------+------+ | 32|Alice| 170| Alice| | 35| Bob| 185| Bob| | 28| John| 181| John| | 24| Jan| 190| Jan| | 25| Mary| 155| Mary| | 33| Nina| 165| Nina| | 25| Ivy| 170| 1000| +---+-----+------+------+
3. Delete a row
You can remove rows based on certain conditions by using the filter
function.
# Remove the row with Name 'Ivy' df_filtered = df_new.filter(col('name') != 'Ivy') df_filtered.show() # you should see this +---+-----+------+------+ |age| name|height|salary| +---+-----+------+------+ | 32|Alice| 170| Alice| | 35| Bob| 185| Bob| | 28| John| 181| John| | 24| Jan| 190| Jan| | 25| Mary| 155| Mary| | 33| Nina| 165| Nina| +---+-----+------+------+
Add, Update, and Remove Columns
4. Add a new column
We can add a new column called “bonus”, the bonus value is 10% of salary.
# Add a new column called bonus df = df.withColumn("bonus", df['salary']/10) df.show() # you should see this +---+-----+------+------+-----+ |age| name|height|salary|bonus| +---+-----+------+------+-----+ | 32|Alice| 170| 3000|300.0| | 35| Bob| 185| 3200|320.0| | 28| John| 181| 3500|350.0| | 24| Jan| 190| 2800|280.0| | 25| Mary| 155| 2800|280.0| | 33| Nina| 165| 3100|310.0| +---+-----+------+------+-----+
We can add another column called “gender” and assign gender value (male and female) to each person. It’s called “Add column with conditions“.
# Define conditions for gender assignment gender_condition = when(df["name"].isin(["Alice", "Mary", "Nina"]), "Female").otherwise("Male") # Add a new column "gender" based on the conditions df = df.withColumn("gender", gender_condition) df.show() # you should see this +---+-----+------+------+-----+------+ |age| name|height|salary|bonus|gender| +---+-----+------+------+-----+------+ | 32|Alice| 170| 3000|300.0|Female| | 35| Bob| 185| 3200|320.0| Male| | 28| John| 181| 3500|350.0| Male| | 24| Jan| 190| 2800|280.0| Male| | 25| Mary| 155| 2800|280.0|Female| | 33| Nina| 165| 3100|310.0|Female| +---+-----+------+------+-----+------+
5. Update a column
I want to add 2cm to Mary and Nina because their height are lower than 170.
It is a good example of update a column based on conditions.
#Update the values of the "height" column based on a condition df_v1 = df.withColumn("height_v1", when(col("height") < 170, col("height") + 2).otherwise(col("height"))) df_v1.show() # you should see this +---+-----+------+------+-----+------+---------+ |age| name|height|salary|bonus|gender|height_v1| +---+-----+------+------+-----+------+---------+ | 32|Alice| 170| 3000|300.0|Female| 170| | 35| Bob| 185| 3200|320.0| Male| 185| | 28| John| 181| 3500|350.0| Male| 181| | 24| Jan| 190| 2800|280.0| Male| 190| | 25| Mary| 155| 2800|280.0|Female| 157| | 33| Nina| 165| 3100|310.0|Female| 167| +---+-----+------+------+-----+------+---------+
6. Delete a column
You can remove a column from a DataFrame by using the drop()
function.
df_v1 = df_v1.drop('height_v1') # the column "height_v1" is gone df_v1.show() # you should see this +---+-----+------+------+-----+------+ |age| name|height|salary|bonus|gender| +---+-----+------+------+-----+------+ | 32|Alice| 170| 3000|300.0|Female| | 35| Bob| 185| 3200|320.0| Male| | 28| John| 181| 3500|350.0| Male| | 24| Jan| 190| 2800|280.0| Male| | 25| Mary| 155| 2800|280.0|Female| | 33| Nina| 165| 3100|310.0|Female| +---+-----+------+------+-----+------+
7. Rename a column
You can rename a column from a DataFrame by using withColumnRenamed()
function.
In this case, I will change ‘gender’ to ‘Gender’
# rename column df_v1 = df_v1.withColumnRenamed('gender', 'Gender') # print all columns df_v1.columns # you should see this ['age', 'name', 'height', 'salary', 'bonus', 'Gender']