In PySpark, datetime
operations are efficiently handled using functions from pyspark.sql.functions
. These functions facilitate various common operations such as extracting components like year, month, day, from datetime
values.
Some common operations include:
- Formatting: Convert datetime to a specific format.
- Extracting components: Extract year, month, day, hour, minute, second, etc.
- Calculations: Perform arithmetic operations on datetime values.
- Comparisons: Compare datetime values by filtering and Sorting
from pyspark.sql import SparkSession from pyspark.sql.functions import * # create a sparksession spark = SparkSession.builder \ .appName("PySpark Datetime Tutorial") \ .getOrCreate()
Create a DataFrame with datetime string
data = [("2023-01-15 10:30:00",), ("2023-02-20 12:45:00",), ("2023-03-25 15:00:00",), ("2023-04-12 00:15:00",), ("2023-02-20 02:45:00",), ("2023-09-18 16:44:00",), ] df = spark.createDataFrame(data, ["datetime_string"]) # check the data types df.dtypes # you should see this [('datetime_string', 'string')]
Convert datetime to a specific format
- Create a new column called “
ts
” and cast the the column “datetime_string
” from string to timestamp.
df = df.withColumn("ts", df["datetime_string"].cast("timestamp")) df.dtypes # you should see this [('datetime_string', 'string'), ('ts', 'timestamp')]
2. Create a new column called “timestamp
” and convert column “datetime_string
” from string to a specific timestamp format
# Convert datetime strings to timestamp datatype df= df.withColumn("timestamp", to_timestamp(df["datetime_string"], "yyyy-MM-dd HH:mm:ss")) df.show() #you should see this +-------------------+-------------------+-------------------+ | datetime_string| ts| timestamp| +-------------------+-------------------+-------------------+ |2023-01-15 10:30:00|2023-01-15 10:30:00|2023-01-15 10:30:00| |2023-02-20 12:45:00|2023-02-20 12:45:00|2023-02-20 12:45:00| |2023-03-25 15:00:00|2023-03-25 15:00:00|2023-03-25 15:00:00| |2023-04-12 00:15:00|2023-04-12 00:15:00|2023-04-12 00:15:00| |2023-02-20 02:45:00|2023-02-20 02:45:00|2023-02-20 02:45:00| |2023-09-18 16:44:00|2023-09-18 16:44:00|2023-09-18 16:44:00| +-------------------+-------------------+-------------------+
3. Create a new column called “only_date_format"
# Format the timestamp into a only date string format df = df.withColumn("only_date_format", date_format(df["timestamp"], "yyyy/MM/dd")) df.select('datetime_string','only_date_format').show() #you should see this +-------------------+----------------+ | datetime_string|only_date_format| +-------------------+----------------+ |2023-01-15 10:30:00| 2023/01/15| |2023-02-20 12:45:00| 2023/02/20| |2023-03-25 15:00:00| 2023/03/25| |2023-04-12 00:15:00| 2023/04/12| |2023-02-20 02:45:00| 2023/02/20| |2023-09-18 16:44:00| 2023/09/18| +-------------------+----------------+
4. Create a new column called “only_time_format"
# Format the timestamp into a only time string format df = df.withColumn("only_time_format", date_format(df["timestamp"], "HH:mm:ss")) df.select('datetime_string','only_time_format').show() # you should see this +-------------------+----------------+ | datetime_string|only_time_format| +-------------------+----------------+ |2023-01-15 10:30:00| 10:30:00| |2023-02-20 12:45:00| 12:45:00| |2023-03-25 15:00:00| 15:00:00| |2023-04-12 00:15:00| 00:15:00| |2023-02-20 02:45:00| 02:45:00| |2023-09-18 16:44:00| 16:44:00| +-------------------+----------------+
Extract year, month, day, hour, minute and second from timestamp column
# Extract year, month, day df = df.withColumn('year', year(df['timestamp'])) df = df.withColumn('month', month(df['timestamp'])) df = df.withColumn('day', dayofmonth(df['timestamp'])) # Extract hour, minute, and second components df = df.withColumn("hour", hour(df["timestamp"])) df = df.withColumn("minute", minute(df["timestamp"])) df = df.withColumn("second", second(df["timestamp"])) df.select('ts','year','month','day','hour','minute','second').show() # you should see this +-------------------+----+-----+---+----+------+------+ | ts|year|month|day|hour|minute|second| +-------------------+----+-----+---+----+------+------+ |2023-01-15 10:30:00|2023| 1| 15| 10| 30| 0| |2023-02-20 12:45:00|2023| 2| 20| 12| 45| 0| |2023-03-25 15:00:00|2023| 3| 25| 15| 0| 0| |2023-04-12 00:15:00|2023| 4| 12| 0| 15| 0| |2023-02-20 02:45:00|2023| 2| 20| 2| 45| 0| |2023-09-18 16:44:00|2023| 9| 18| 16| 44| 0| +-------------------+----+-----+---+----+------+------+
Calculate time difference
We can create a new column called “timestamp_plus_5_min
“, which adds 5 minute more from the “timestamp
” column, and compare the time difference between the two columns.
df = df.withColumn("timestamp_plus_5_min", expr("timestamp + interval 5 minutes")) df = df.withColumn('time_diff', col('timestamp_plus_5_min') - col('timestamp')) df.select('timestamp','timestamp_plus_5_min', 'time_diff').show() # you should see this +-------------------+--------------------+--------------------+ | timestamp|timestamp_plus_5_min| time_diff| +-------------------+--------------------+--------------------+ |2023-01-15 10:30:00| 2023-01-15 10:35:00|INTERVAL '0 00:05...| |2023-02-20 12:45:00| 2023-02-20 12:50:00|INTERVAL '0 00:05...| |2023-03-25 15:00:00| 2023-03-25 15:05:00|INTERVAL '0 00:05...| |2023-04-12 00:15:00| 2023-04-12 00:20:00|INTERVAL '0 00:05...| |2023-02-20 02:45:00| 2023-02-20 02:50:00|INTERVAL '0 00:05...| |2023-09-18 16:44:00| 2023-09-18 16:49:00|INTERVAL '0 00:05...| +-------------------+--------------------+--------------------+
Compare and Sort Datetime Column
Filter based on Datetime. For example, you can select the rows after “2023-04-01”
# Filter based on datetime df.select('datetime_string','timestamp').filter(df['timestamp'] > lit('2023-04-01')).show() #you should see this +-------------------+-------------------+ | datetime_string| timestamp| +-------------------+-------------------+ |2023-04-12 00:15:00|2023-04-12 00:15:00| |2023-09-18 16:44:00|2023-09-18 16:44:00| +-------------------+-------------------+
Sort based on Datetime. The default sort mode is ascending.
# Ascending Sort df.select('timestamp').orderBy(df['timestamp']).show() # you should see this +-------------------+ | timestamp| +-------------------+ |2023-01-15 10:30:00| |2023-02-20 02:45:00| |2023-02-20 12:45:00| |2023-03-25 15:00:00| |2023-04-12 00:15:00| |2023-09-18 16:44:00| +-------------------+
# Descending Sort df.select('timestamp').orderBy(df['timestamp'],ascending = False).show(). # you should see this +-------------------+ | timestamp| +-------------------+ |2023-09-18 16:44:00| |2023-04-12 00:15:00| |2023-03-25 15:00:00| |2023-02-20 12:45:00| |2023-02-20 02:45:00| |2023-01-15 10:30:00| +-------------------+