Skip to content

#4 Process Datetime in PySpark


In PySpark, datetime operations are efficiently handled using functions from pyspark.sql.functions. These functions facilitate various common operations such as extracting components like year, month, day, from datetime values.

Some common operations include:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# create a sparksession

spark = SparkSession.builder \
    .appName("PySpark Datetime Tutorial") \
    .getOrCreate()

Create a DataFrame with datetime string

data = [("2023-01-15 10:30:00",), 
        ("2023-02-20 12:45:00",), 
        ("2023-03-25 15:00:00",),
        ("2023-04-12 00:15:00",), 
        ("2023-02-20 02:45:00",), 
        ("2023-09-18 16:44:00",),
        ]
df = spark.createDataFrame(data, ["datetime_string"])

# check the data types
df.dtypes

# you should see this
[('datetime_string', 'string')]

Convert datetime to a specific format

  1. Create a new column called “ts” and cast the the column “datetime_string” from string to timestamp.
df = df.withColumn("ts", df["datetime_string"].cast("timestamp"))
df.dtypes

# you should see this
[('datetime_string', 'string'), ('ts', 'timestamp')]

2. Create a new column called “timestamp” and convert column “datetime_string” from string to a specific timestamp format

# Convert datetime strings to timestamp datatype
df= df.withColumn("timestamp", to_timestamp(df["datetime_string"], "yyyy-MM-dd HH:mm:ss"))
df.show()

#you should see this                                                                            
+-------------------+-------------------+-------------------+
|    datetime_string|                 ts|          timestamp|
+-------------------+-------------------+-------------------+
|2023-01-15 10:30:00|2023-01-15 10:30:00|2023-01-15 10:30:00|
|2023-02-20 12:45:00|2023-02-20 12:45:00|2023-02-20 12:45:00|
|2023-03-25 15:00:00|2023-03-25 15:00:00|2023-03-25 15:00:00|
|2023-04-12 00:15:00|2023-04-12 00:15:00|2023-04-12 00:15:00|
|2023-02-20 02:45:00|2023-02-20 02:45:00|2023-02-20 02:45:00|
|2023-09-18 16:44:00|2023-09-18 16:44:00|2023-09-18 16:44:00|
+-------------------+-------------------+-------------------+

3. Create a new column called “only_date_format"

# Format the timestamp into a only date string format
df = df.withColumn("only_date_format", date_format(df["timestamp"], "yyyy/MM/dd"))
df.select('datetime_string','only_date_format').show()

#you should see this
+-------------------+----------------+
|    datetime_string|only_date_format|
+-------------------+----------------+
|2023-01-15 10:30:00|      2023/01/15|
|2023-02-20 12:45:00|      2023/02/20|
|2023-03-25 15:00:00|      2023/03/25|
|2023-04-12 00:15:00|      2023/04/12|
|2023-02-20 02:45:00|      2023/02/20|
|2023-09-18 16:44:00|      2023/09/18|
+-------------------+----------------+

4. Create a new column called “only_time_format"

# Format the timestamp into a only time string format
df = df.withColumn("only_time_format", date_format(df["timestamp"], "HH:mm:ss"))
df.select('datetime_string','only_time_format').show()

# you should see this
+-------------------+----------------+
|    datetime_string|only_time_format|
+-------------------+----------------+
|2023-01-15 10:30:00|        10:30:00|
|2023-02-20 12:45:00|        12:45:00|
|2023-03-25 15:00:00|        15:00:00|
|2023-04-12 00:15:00|        00:15:00|
|2023-02-20 02:45:00|        02:45:00|
|2023-09-18 16:44:00|        16:44:00|
+-------------------+----------------+

Extract year, month, day, hour, minute and second from timestamp column

# Extract year, month, day
df = df.withColumn('year', year(df['timestamp']))
df = df.withColumn('month', month(df['timestamp']))
df = df.withColumn('day', dayofmonth(df['timestamp']))

# Extract hour, minute, and second components
df = df.withColumn("hour", hour(df["timestamp"]))
df = df.withColumn("minute", minute(df["timestamp"]))
df = df.withColumn("second", second(df["timestamp"]))

df.select('ts','year','month','day','hour','minute','second').show()

# you should see this
+-------------------+----+-----+---+----+------+------+
|                 ts|year|month|day|hour|minute|second|
+-------------------+----+-----+---+----+------+------+
|2023-01-15 10:30:00|2023|    1| 15|  10|    30|     0|
|2023-02-20 12:45:00|2023|    2| 20|  12|    45|     0|
|2023-03-25 15:00:00|2023|    3| 25|  15|     0|     0|
|2023-04-12 00:15:00|2023|    4| 12|   0|    15|     0|
|2023-02-20 02:45:00|2023|    2| 20|   2|    45|     0|
|2023-09-18 16:44:00|2023|    9| 18|  16|    44|     0|
+-------------------+----+-----+---+----+------+------+

Calculate time difference

We can create a new column called “timestamp_plus_5_min“, which adds 5 minute more from the “timestamp” column, and compare the time difference between the two columns.

df = df.withColumn("timestamp_plus_5_min", expr("timestamp + interval 5 minutes"))
df = df.withColumn('time_diff', col('timestamp_plus_5_min') - col('timestamp'))
df.select('timestamp','timestamp_plus_5_min', 'time_diff').show()

# you should see this
+-------------------+--------------------+--------------------+
|          timestamp|timestamp_plus_5_min|           time_diff|
+-------------------+--------------------+--------------------+
|2023-01-15 10:30:00| 2023-01-15 10:35:00|INTERVAL '0 00:05...|
|2023-02-20 12:45:00| 2023-02-20 12:50:00|INTERVAL '0 00:05...|
|2023-03-25 15:00:00| 2023-03-25 15:05:00|INTERVAL '0 00:05...|
|2023-04-12 00:15:00| 2023-04-12 00:20:00|INTERVAL '0 00:05...|
|2023-02-20 02:45:00| 2023-02-20 02:50:00|INTERVAL '0 00:05...|
|2023-09-18 16:44:00| 2023-09-18 16:49:00|INTERVAL '0 00:05...|
+-------------------+--------------------+--------------------+

Compare and Sort Datetime Column

Filter based on Datetime. For example, you can select the rows after “2023-04-01”

# Filter based on datetime
df.select('datetime_string','timestamp').filter(df['timestamp'] > lit('2023-04-01')).show()

#you should see this
+-------------------+-------------------+
|    datetime_string|          timestamp|
+-------------------+-------------------+
|2023-04-12 00:15:00|2023-04-12 00:15:00|
|2023-09-18 16:44:00|2023-09-18 16:44:00|
+-------------------+-------------------+

Sort based on Datetime. The default sort mode is ascending.

# Ascending Sort
df.select('timestamp').orderBy(df['timestamp']).show()
# you should see this
+-------------------+
|          timestamp|
+-------------------+
|2023-01-15 10:30:00|
|2023-02-20 02:45:00|
|2023-02-20 12:45:00|
|2023-03-25 15:00:00|
|2023-04-12 00:15:00|
|2023-09-18 16:44:00|
+-------------------+
# Descending Sort
df.select('timestamp').orderBy(df['timestamp'],ascending = False).show(). 
# you should see this
+-------------------+
|          timestamp|
+-------------------+
|2023-09-18 16:44:00|
|2023-04-12 00:15:00|
|2023-03-25 15:00:00|
|2023-02-20 12:45:00|
|2023-02-20 02:45:00|
|2023-01-15 10:30:00|
+-------------------+