Nan's Blog: September 2023

Wednesday, September 27, 2023

Python - Finding Correlation Between Many Variables (Multidimensional Dataset)

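import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')  # select the Tk backend before importing pyplot
import matplotlib.pyplot as plt

# Load the dataset; the first CSV column becomes the index
data = pd.read_csv('/Users/Downloads/test.csv', index_col=0)

# Pairwise Pearson correlation of the numeric columns
# (numeric_only requires pandas 1.5+; without it, text columns raise an error in pandas 2.x)
corr = data.corr(numeric_only=True)

fig = plt.figure()
ax = fig.add_subplot(111)

# Draw the matrix as a heatmap on a fixed -1..1 color scale
cax = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)

# One tick per column of the correlation matrix, labeled with the column name.
# Use corr.columns rather than data.columns so the labels still line up
# if non-numeric columns were dropped.
ticks = np.arange(len(corr.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)

plt.show()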
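
For reference, each cell in the matrix returned by corr() is, by default, the Pearson correlation coefficient of two columns. Below is a minimal pure-Python sketch of that calculation. The column names and values are made up for illustration and are not taken from test.csv, and pearson() is a hypothetical helper, not a pandas function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample columns
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # rises with a
    "c": [4.0, 3.0, 2.0, 1.0],  # falls as a rises
}
names = list(columns)

# Same layout as the plotted matrix: one row and one column per variable
corr_matrix = [[pearson(columns[r], columns[c]) for c in names] for r in names]
```

Values near 1 or -1 show up as the strongest red or blue cells in the coolwarm heatmap, and the diagonal is always 1 because every column is compared with itself.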

Thursday, September 21, 2023

PySpark SQL - write out records with time across multiple days

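Example:

From A: one record whose start and end times fall on different days.

To B: one row for each day that the record covers.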

Logic:

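1. Use sequence() to generate the daily boundary dates, from the day after the start date to the day after the end date.

2. Use explode() to turn that list into one row per interval date, repeating the record's other columns on each row.

3. Use lag() and row_number() to add helper columns for the next step: the previous interval date, and row numbers counted from the first row and from the last row.

4. Use a CASE statement to pick each row's start and end: the first row keeps the original start time, the last row keeps the original end time, and every other boundary is an interval date.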
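
The four steps above can be sketched in plain Python for a single record, which may make the row-by-row result easier to follow before reading the SQL. This is only an illustration under assumptions: split_by_day() is a hypothetical helper and the timestamps are made up.

```python
from datetime import datetime, time, timedelta

def split_by_day(start, end):
    """Split one start/end pair into pieces that each stay within a calendar day."""
    # Steps 1-2: one midnight boundary per day, from the day after the start
    # to the day after the end (the role of sequence() and explode()).
    first = start.date() + timedelta(days=1)
    n_days = (end.date() - start.date()).days + 1
    boundaries = [
        datetime.combine(first + timedelta(days=i), time.min)
        for i in range(n_days)
    ]

    pieces = []
    for i, boundary in enumerate(boundaries):
        # Step 3: the previous boundary plays the role of lag().
        # Step 4: the first piece keeps the original start and the
        # last piece keeps the original end (the role of CASE).
        piece_start = start if i == 0 else boundaries[i - 1]
        piece_end = end if i == len(boundaries) - 1 else boundary
        pieces.append((piece_start, piece_end))
    return pieces

# Made-up record running from the evening of the 19th to early on the 21st
pieces = split_by_day(datetime(2023, 9, 19, 22, 0), datetime(2023, 9, 21, 3, 0))
for piece_start, piece_end in pieces:
    print(piece_start, "->", piece_end)
```

For this record the sketch yields three pieces: 22:00 to midnight on the 19th, the whole of the 20th, and midnight to 03:00 on the 21st. A record that ends exactly at midnight gets a zero-length final piece in this sketch, which may need filtering out.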

Code:

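-- Fragment of a WITH clause; tbl_1 is assumed to be defined earlier in it
tbl_explode as (
  -- One row per day boundary: midnight after the start through midnight after the end.
  -- StartDateTime and EndDateTime already come through *, so they are not listed again;
  -- listing them twice makes later references to them ambiguous.
  select *
      , explode(sequence(to_date(dateadd(day, 1, StartDateTime)), to_date(dateadd(day, 1, EndDateTime)), interval 1 day)) as interval_date
  from tbl_1
  -- Only records whose start and end fall on different calendar dates
  where datediff(day, StartDateTime, EndDateTime) > 0 or cast(StartDateTime as date) <> cast(EndDateTime as date)
),
tbl_time_split as (
  -- First row keeps the original start; later rows start at the previous boundary.
  -- Last row keeps the original end; earlier rows end at their own boundary.
  select  CASE WHEN row_no_min = 1 THEN StartDateTime
              ELSE to_timestamp(lag_date) END AS new_StartDateTime
        , CASE WHEN row_no_max = 1 THEN to_timestamp(EndDateTime)
              ELSE to_timestamp(interval_date) END AS new_EndDateTime
        , *
  from(
        -- WO and EventDate are taken to identify one original record,
        -- so all three windows use the same partition
        select lag(interval_date, 1) OVER (PARTITION BY WO, EventDate ORDER BY interval_date) as lag_date
              , row_number() OVER (PARTITION BY WO, EventDate ORDER BY interval_date) as row_no_min
              , row_number() OVER (PARTITION BY WO, EventDate ORDER BY interval_date desc) as row_no_max
              , *
        from tbl_explode
      ) d
)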
