Thursday, September 21, 2023

PySpark SQL - split records whose time spans multiple days

Example (illustrative values):

From A (one record spanning several days):

WO | StartDateTime       | EndDateTime
1  | 2023-09-18 20:00:00 | 2023-09-20 06:00:00

To B (one record per calendar day):

WO | new_StartDateTime   | new_EndDateTime
1  | 2023-09-18 20:00:00 | 2023-09-19 00:00:00
1  | 2023-09-19 00:00:00 | 2023-09-20 00:00:00
1  | 2023-09-20 00:00:00 | 2023-09-20 06:00:00

Logic:

1. Use sequence() to generate the interval dates (day boundaries) between the start and end dates.

2. Use explode() to create one duplicate row per interval date.

3. Use lag() and row_number() to create extra date columns for the next step.

4. Use a CASE statement to select different dates based on the row_no columns, which indicate the first/last row.
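
The four steps above can be sketched in plain Python, without a Spark session. This is a minimal sketch, not the post's actual code: the function name and the convention of splitting at midnight are assumptions.

```python
from datetime import datetime, timedelta

def split_across_days(start, end):
    """Split one (start, end) timestamp pair into per-day segments at
    midnight, mirroring the sequence()/explode()/lag() steps above."""
    # Step 1: sequence() -- day boundaries from the day after the start
    # through the day after the end.
    boundaries = []
    d = start.date() + timedelta(days=1)
    while d <= end.date() + timedelta(days=1):
        boundaries.append(datetime.combine(d, datetime.min.time()))
        d += timedelta(days=1)

    # Steps 2-4: one output row per boundary (explode); the first row keeps
    # the original start (row_no_min = 1), the last keeps the original end
    # (row_no_max = 1), and interior rows start at the previous boundary
    # (lag) and end at the current one.
    segments = []
    prev = None  # lag(interval_date)
    for i, boundary in enumerate(boundaries):
        seg_start = start if i == 0 else prev
        seg_end = end if i == len(boundaries) - 1 else boundary
        segments.append((seg_start, seg_end))
        prev = boundary
    return segments
```

A record from 2023-09-18 20:00 to 2023-09-20 06:00 yields three segments; a same-day record passes through as a single segment.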

Code:

with tbl_explode as (
    select *
    -- one day boundary per row, from the day after the start
    -- through the day after the end
    , explode(sequence(to_date(dateadd(day, 1, StartDateTime)),
                       to_date(dateadd(day, 1, EndDateTime)),
                       interval 1 day)) as interval_date
    from tbl_1
    -- only records whose start and end fall on different days need splitting
    where datediff(day, StartDateTime, EndDateTime) > 0
       or cast(StartDateTime as date) <> cast(EndDateTime as date)
),
tbl_time_split as (
    select *
    -- first row keeps the original start; later rows start at the
    -- previous day boundary
    , CASE WHEN row_no_min = 1 THEN StartDateTime
           ELSE to_timestamp(lag_date) END AS new_StartDateTime
    -- last row keeps the original end; earlier rows end at the
    -- current day boundary
    , CASE WHEN row_no_max = 1 THEN to_timestamp(EndDateTime)
           ELSE to_timestamp(interval_date) END AS new_EndDateTime
    from (
        select *
        , lag(interval_date, 1) OVER (PARTITION BY WO ORDER BY interval_date) as lag_date
        , row_number() OVER (PARTITION BY WO, EventDate ORDER BY interval_date) as row_no_min
        , row_number() OVER (PARTITION BY WO, EventDate ORDER BY interval_date desc) as row_no_max
        from tbl_explode
    ) d
)
select * from tbl_time_split
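
To check the window logic of tbl_time_split in isolation, the lag() and row_number() calls can be emulated over an already-exploded row set. This sketch is an assumption-laden stand-in: the sample rows are hypothetical, rows are plain dicts, and everything is partitioned by WO alone (the post also partitions row_number by EventDate).

```python
from datetime import datetime, timedelta

# Hypothetical exploded rows, as tbl_explode would produce them; WO is the
# record key used in the PARTITION BY clauses.
exploded = [
    {"WO": 1,
     "StartDateTime": datetime(2023, 9, 18, 20, 0),
     "EndDateTime": datetime(2023, 9, 20, 6, 0),
     "interval_date": datetime(2023, 9, 19) + timedelta(days=i)}
    for i in range(3)  # sequence(start + 1 day .. end + 1 day)
]

def time_split(rows):
    """Emulate the lag()/row_number() windows and CASE logic of tbl_time_split."""
    by_wo = {}
    for r in rows:
        by_wo.setdefault(r["WO"], []).append(r)
    out = []
    for part in by_wo.values():
        part.sort(key=lambda r: r["interval_date"])  # ORDER BY interval_date
        n = len(part)
        for i, r in enumerate(part):
            lag_date = part[i - 1]["interval_date"] if i > 0 else None
            row_no_min = i + 1   # row_number() ordered ascending
            row_no_max = n - i   # row_number() ordered descending
            new_start = r["StartDateTime"] if row_no_min == 1 else lag_date
            new_end = r["EndDateTime"] if row_no_max == 1 else r["interval_date"]
            out.append({"WO": r["WO"],
                        "new_StartDateTime": new_start,
                        "new_EndDateTime": new_end})
    return out
```

Running it on the three exploded rows reproduces the per-day segments from the example: the first segment keeps the original start, the last keeps the original end, and the middle one runs boundary to boundary.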
