Nan's Blog: August 2019

Tuesday, August 27, 2019

Python - Detect outliers using moving average

In this piece of code, you can change window_size to determine how you want to calculate the average. e.g. using 6 months of adjacent data. You can change the sigma_value to determine the abnormal points you want to capture. e.g. if you assume 99.7% of data are normally distributed, then set sigma_value= 3. You can change the start and end values to determined how many months of production you want to see.

from scrapers import config_sql
from itertools import count
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import collections

def moving_average(data, window_size):
weight = np.ones(int(window_size)) / float(window_size)
return np.convolve(data, weight, 'same') # ways to handle edges. the mode are 'same', 'full', 'valid'

def detect_anomalies(y, window_size, sigma):
# slide a window along the input and compute the mean of the window's contents
avg = moving_average(y, window_size).tolist()
residual = y - avg
# Calculate the variation in the distribution of the residual
std = np.std(residual)
return {'standard_deviation': round(std, 3),
'anomalies_dict': collections.OrderedDict([(index, y_i) for index, y_i, avg_i in zip(count(), y, avg) if (y_i > avg_i + (sigma * std)) | (y_i < avg_i - (sigma * std))])} # distance from the mean

# This function is responsible for displaying how the function performs on the given dataset.
def plot_results(x, y, window_size, sigma_value, text_xlabel, text_ylabel, start, end):
plt.figure(figsize=(15, 8))
plt.plot(x, y, "k.")
y_av = moving_average(y, window_size)
try:
plt.plot(x, y_av, color='blue')
plt.plot(x, y, color='green')
plt.xlim(start, end) # this can let you change the plotted date frame
plt.xlabel(text_xlabel)
plt.ylabel(text_ylabel)

events = detect_anomalies(y, window_size=window_size, sigma=sigma_value)
x_anomaly = np.fromiter(events['anomalies_dict'].keys(), dtype=int, count=len(events['anomalies_dict']))
y_anomaly = np.fromiter(events['anomalies_dict'].values(), dtype=float, count=len(events['anomalies_dict']))
print(collections.OrderedDict([(x, y) for index, x, y in zip(count(), x_anomaly, y_anomaly)]))
ax = plt.plot(x_anomaly, y_anomaly, "r.", markersize=12)

# add grid and lines and enable the plot
plt.grid(True)
plt.show()

except Exception as e:
pass

# Main
if __name__ == '__main__':
conn = config_sql.sql_credentials('XXXXX', 'XXXX')
cursor = conn.cursor()

query = "XX where bridge_Id in('8368051','8502207','8369707','8520772','8420250','12776634')"

df = pd.read_sql(query, conn)
cols = ['bridge_id', 'product_name', 'metric_date', 'prod_value']
oil_prod = df.loc[df['product_name'] == 'Gas']
prod_as_frame = pd.DataFrame(oil_prod, columns=['bridge_id', 'product_name', 'metric_date', 'prod_value'])

# get unique list of bridge
rows = cursor.execute(query)
unique_bridge = set(list(zip(*list(rows.fetchall())))[0])

for bridge_id in unique_bridge:
print('bridge_Id: ' + str(bridge_id))
prod = prod_as_frame[prod_as_frame.bridge_id == bridge_id].reindex()
prod['mop'] = range(1, len(prod) + 1)

x = prod['mop']
Y = prod['prod_value']
max_x = max(x) # this can let you change the plotted date frame
print(prod)

# plot the results
plot_results(x, y=Y, window_size=6, sigma_value=3, text_xlabel="MOP", text_ylabel="production", start=1, end=max_x)

Background knowlodge:

np.std
$s={\sqrt {{\frac {1}{N-1}}\sum _{i=1}^{N}(x_{i}-{\bar {x}})^{2}}},$
To quantify the amount of variation or dispersion of dataset.
Low std means data points tend to be close to the mean; high std means the data points are spread out over a wide range of values.
np.Convolve
$(f*g)(t)\triangleq \ \int _{-\infty }^{\infty }f(\tau )g(t-\tau )\,d\tau .$
Is defined as the integral of the product of two functions after one is reversed and shifted.
It is an operation on two functions to produce a third function that express how the shape of one is modified by the other.
Rules of normally distributed data (68-95-99.7 rule)
${\begin{aligned}\Pr(\mu -1\sigma \leq X\leq \mu +1\sigma )&\approx 0.6827\\\Pr(\mu -2\sigma \leq X\leq \mu +2\sigma )&\approx 0.9545\\\Pr(\mu -3\sigma \leq X\leq \mu +3\sigma )&\approx 0.9973\end{aligned}}$ Image result for empirical rule of normally distributed data

Image result for empirical rule of normally distributed data

If a data distribution is approximately normal, then about 68% of the data value are within 1 std of the mean

Thursday, August 1, 2019

SQL - Use of PARSENAME() function

Sometimes when we have pivoted value in excel or csv with title as 'Column 1','Column 2',...'Column n'. But in the meantime, those columns means the dates like 1980-01-01, 1980-02-01. After unpivot the data into sql , the data will be like this:
API date_string Product
053-29070 Column 112 Oil
053-29070 Column 113 Oil
053-29070 Column 118 Oil
053-29070 Column 172 Oil
053-29070 Column 173 Oil
053-29070 Column 177 Oil
053-29070 Column 178 Oil
053-29070 Column 19 Oil
Then how do we know the corresponding year and month?

So here is the trick. Function PARSENAME() can get the specific part of a string delimited by a delimiter. so
for SELECT PARSENAME(12.3,1), you will get 3,
for SELECT PARSENAME(13%12,1), you will get 1

so by using this function together with remainder function, we can get the month of the data.
by using some tricks with number/12, we can get the year of the data.

date_string	date_int	month_indicator	year_indicator	date_int_plus
Column 1	1	1	0	2
Column 2	2	2	0	3
Column 3	3	3	0	4
Column 4	4	4	0	5
Column 5	5	5	0	6
Column 6	6	6	0	7
Column 7	7	7	0	8
Column 8	8	8	0	9
Column 9	9	9	0	10
Column 10	10	10	0	11
Column 11	11	11	0	12
Column 12	12	0	1	13
Column 13	13	1	1	14
Column 14	14	2	1	15
Column 15	15	3	1	16
Column 16	16	4	1	17
Column 17	17	5	1	18
Column 18	18	6	1	19
Column 19	19	7	1	20
Column 20	20	8	1	21
Column 21	21	9	1	22
Column 22	22	10	1	23
Column 23	23	11	1	24

--------------------------update production_month---------------------------------------

so we can set production_month =

case when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 1 then 1

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 2 then 2

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 3 then 3

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 4 then 4

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 5 then 5

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 6 then 6

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 7 then 7

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 8 then 8

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 9 then 9

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 10 then 10

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 11 then 11

when PARSENAME(cast(replace(date_string,'Column ','') as int)%12,1) = 0 then 12

else null

end

--------------------------update production_year---------------------------------------

IF OBJECT_ID('tempdb..#tbl_prod') IS NOT NULL DROP TABLE #tbl_prod

select distinct cast(PARSENAME(cast(replace(date_string,'Column ','') as int)/12,1) as int) as year_indicator, cast(replace(date_string,'Column ','') as int) +1 as date_int_plus

into #tbl_prod

from XXXX with (nolock)

DECLARE @counter int = 0;

WHILE @counter <= (select max(cast(replace(date_string,'Column ','') as int)) from XXXX)

BEGIN

update a

set production_year = 1980 + @counter

from XXXX a ,#tbl_prod b

where b.year_indicator = @counter and cast(replace(a.date_string,'Column ','') as int) =

b.date_int_plus and a.api <> '' and a.volume <> 0 and a.volume is not null ;

SET @counter = @counter + 1;

END

--only column 1 is not assigned a year

update XXXX

set production_year = 1980

where date_string = 'Column 1' and api <> '' and volume is not null and volume <> 0;

Or just the easiest way. Use SELECT DATEADD(MONTH,+1,getdate()) to get the date.

Nan's Blog

Tuesday, August 27, 2019

Python - Detect outliers using moving average

Thursday, August 1, 2019

SQL - Use of PARSENAME() function

Labels

Blog Archive