Skip to content

#3 Manipulate Rows and Columns in PySpark

Before performing any analysis or modelling on a dataset, it often requires preprocessing. Manipulating rows and columns allows you to clean the data, handle missing values, filter out irrelevant information, and transform the data into a suitable format for analysis.

Mastering these skills enables you to work efficiently with large-scale datasets and derive meaningful insights for various data-driven tasks.

Create a DataFrame

Step1 : import libraries

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col, when

Step 2: Create a spark session and then the DataFrame

spark = SparkSession.builder.appName('PySpark_Scratch_tutorial').getOrCreate()

# create a dataframe from scratch
df = spark.createDataFrame([
    (32, "Alice",170,3000), (35, "Bob",185,3200),
    (28, "John",181,3500), (24, "Jan",190,2800),
    (25, "Mary",155,2800), ( 33, "Nina",165,3100),
    ], 
    schema=["age", "name","height","salary"])
df.show()

# you should see this
+---+-----+------+------+
|age| name|height|salary|
+---+-----+------+------+
| 32|Alice|   170|  3000|
| 35|  Bob|   185|  3200|
| 28| John|   181|  3500|
| 24|  Jan|   190|  2800|
| 25| Mary|   155|  2800|
| 33| Nina|   165|  3100|
+---+-----+------+------+

Add, Update, and Remove Rows

1. Add a new row

For example, let’s add a new row about Ivy’s information

# New row about Ivy
new_row = Row(25, 'Ivy',170, 0)

# Add the new row to the original DataFrame
df_new = df.unionAll(spark.createDataFrame([new_row], df.schema))
df_new.show()

# you should see this
+---+-----+------+------+
|age| name|height|salary|
+---+-----+------+------+
| 32|Alice|   170|  3000|
| 35|  Bob|   185|  3200|
| 28| John|   181|  3500|
| 24|  Jan|   190|  2800|
| 25| Mary|   155|  2800|
| 33| Nina|   165|  3100|
| 25|  Ivy|   170|     0|
+---+-----+------+------+

2. Update a row

Since Ivy’s salary is 0, it is less likely to be true. We can assign a new value and update the row.

df_new = df_new.withColumn('salary', when(col('name') == 'Ivy', 1000).otherwise(col('name')))
df_new.show()

# you should see this
+---+-----+------+------+
|age| name|height|salary|
+---+-----+------+------+
| 32|Alice|   170| Alice|
| 35|  Bob|   185|   Bob|
| 28| John|   181|  John|
| 24|  Jan|   190|   Jan|
| 25| Mary|   155|  Mary|
| 33| Nina|   165|  Nina|
| 25|  Ivy|   170|  1000|
+---+-----+------+------+

3. Delete a row

You can remove rows based on certain conditions by using the filter function.

# Remove the row with Name 'Ivy'
df_filtered = df_new.filter(col('name') != 'Ivy')
df_filtered.show()
# you should see this
+---+-----+------+------+
|age| name|height|salary|
+---+-----+------+------+
| 32|Alice|   170| Alice|
| 35|  Bob|   185|   Bob|
| 28| John|   181|  John|
| 24|  Jan|   190|   Jan|
| 25| Mary|   155|  Mary|
| 33| Nina|   165|  Nina|
+---+-----+------+------+

Add, Update, and Remove Columns

4. Add a new column

We can add a new column called “bonus”, the bonus value is 10% of salary.

# Add a new column called bonus
df = df.withColumn("bonus", df['salary']/10)
df.show()

# you should see this
+---+-----+------+------+-----+
|age| name|height|salary|bonus|
+---+-----+------+------+-----+
| 32|Alice|   170|  3000|300.0|
| 35|  Bob|   185|  3200|320.0|
| 28| John|   181|  3500|350.0|
| 24|  Jan|   190|  2800|280.0|
| 25| Mary|   155|  2800|280.0|
| 33| Nina|   165|  3100|310.0|
+---+-----+------+------+-----+

We can add another column called “gender” and assign gender value (male and female) to each person. It’s called “Add column with conditions“.

# Define conditions for gender assignment
gender_condition = when(df["name"].isin(["Alice", "Mary", "Nina"]), "Female").otherwise("Male")

# Add a new column "gender" based on the conditions
df = df.withColumn("gender", gender_condition)
df.show()

# you should see this
+---+-----+------+------+-----+------+
|age| name|height|salary|bonus|gender|
+---+-----+------+------+-----+------+
| 32|Alice|   170|  3000|300.0|Female|
| 35|  Bob|   185|  3200|320.0|  Male|
| 28| John|   181|  3500|350.0|  Male|
| 24|  Jan|   190|  2800|280.0|  Male|
| 25| Mary|   155|  2800|280.0|Female|
| 33| Nina|   165|  3100|310.0|Female|
+---+-----+------+------+-----+------+

5. Update a column

I want to add 2cm to Mary and Nina because their height are lower than 170.

It is a good example of update a column based on conditions.

#Update the values of the "height" column based on a condition
df_v1 = df.withColumn("height_v1", when(col("height") < 170, col("height") + 2).otherwise(col("height")))
df_v1.show()

# you should see this
+---+-----+------+------+-----+------+---------+
|age| name|height|salary|bonus|gender|height_v1|
+---+-----+------+------+-----+------+---------+
| 32|Alice|   170|  3000|300.0|Female|      170|
| 35|  Bob|   185|  3200|320.0|  Male|      185|
| 28| John|   181|  3500|350.0|  Male|      181|
| 24|  Jan|   190|  2800|280.0|  Male|      190|
| 25| Mary|   155|  2800|280.0|Female|      157|
| 33| Nina|   165|  3100|310.0|Female|      167|
+---+-----+------+------+-----+------+---------+

6. Delete a column

You can remove a column from a DataFrame by using the drop() function.

df_v1 = df_v1.drop('height_v1') # the column "height_v1" is gone
df_v1.show()

# you should see this
+---+-----+------+------+-----+------+
|age| name|height|salary|bonus|gender|
+---+-----+------+------+-----+------+
| 32|Alice|   170|  3000|300.0|Female|
| 35|  Bob|   185|  3200|320.0|  Male|
| 28| John|   181|  3500|350.0|  Male|
| 24|  Jan|   190|  2800|280.0|  Male|
| 25| Mary|   155|  2800|280.0|Female|
| 33| Nina|   165|  3100|310.0|Female|
+---+-----+------+------+-----+------+

7. Rename a column

You can rename a column from a DataFrame by using withColumnRenamed() function.

In this case, I will change ‘gender’ to ‘Gender’

# rename column
df_v1 = df_v1.withColumnRenamed('gender', 'Gender')
# print all columns
df_v1.columns

# you should see this
['age', 'name', 'height', 'salary', 'bonus', 'Gender']