Machine Learning Prediction


This use case is a customized time series forecasting version of the CNN prediction model from the TensorFlow website. The complete tutorial is available on the TensorFlow website: https://www.tensorflow.org/tutorials/structured_data/time_series

Setup and data

If you run the script below, it prints the first rows of a DataFrame (table).

    # Initial imports
    import os
    import datetime

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import tensorflow as tf

    # Optional: set up plot sizes
    mpl.rcParams['figure.figsize'] = (8, 6)
    mpl.rcParams['axes.grid'] = False

    # Get the data
    zip_path = tf.keras.utils.get_file(
        origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
        fname='jena_climate_2009_2016.csv.zip',
        extract=True)
    csv_path, _ = os.path.splitext(zip_path)

    # Read the data into a pandas DataFrame
    df = pd.read_csv(csv_path)
    # Slice [start:stop:step]: starting from index 5, take every 6th record
    # (the raw data is sampled every 10 minutes, so this keeps one row per hour)
    df = df[5::6]

    date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')
    print(df.head())

Output:

        p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  ...  rho (g/m**3)  wv (m/s)  max. wv (m/s)  wd (deg)
    5     996.50     -8.05    265.38        -8.78  ...       1307.86      0.21           0.63     192.7
    11    996.62     -8.88    264.54        -9.77  ...       1312.25      0.25           0.63     190.3
    17    996.84     -8.81    264.59        -9.66  ...       1312.18      0.18           0.63     167.2
    23    996.99     -9.05    264.34       -10.02  ...       1313.61      0.10           0.38     240.0
    29    997.46     -9.63    263.72       -10.65  ...       1317.19      0.40           0.88     157.0

    [5 rows x 14 columns]

Feature engineering/data standardization

With the quick setup done, we can do some feature engineering/extraction and data manipulation to help the model. There are plenty of options when it comes to feature engineering, and the right choice depends on the case. For this example, it is kept almost exactly the same as in the TensorFlow tutorial.

    # Select as many or as few columns (variables) as you like
    columns = ['p (mbar)', 'T (degC)', 'Tpot (K)', 'Tdew (degC)', 'rh (%)',
               'VPmax (mbar)', 'VPact (mbar)', 'VPdef (mbar)', 'sh (g/kg)',
               'H2OC (mmol/mol)', 'rho (g/m**3)', 'wv (m/s)', 'max. wv (m/s)',
               'wd (deg)']

    # A DataFrame containing the selected columns, named "features"
    features = df[columns]
    features.index = date_time

    # Optional: plot these features
    _ = features.plot(subplots=True)

    # Optional: a more readable plot that uses only the first 480 hourly
    # records (about 20 days) of data
    features = df[columns][:480]
    features.index = date_time[:480]
    _ = features.plot(subplots=True)
    plt.show()

    # Show the statistics of the selected data
    print(df.describe().transpose())

Output:

                       count         mean        std      min      25%      50%       75%      max
    p (mbar)         70091.0   989.212842   8.358886   913.60   984.20   989.57   994.720  1015.29
    T (degC)         70091.0     9.450482   8.423384   -22.76     3.35     9.41    15.480    37.28
    Tpot (K)         70091.0   283.493086   8.504424   250.85   277.44   283.46   289.530   311.21
    Tdew (degC)      70091.0     4.956471   6.730081   -24.80     0.24     5.21    10.080    23.06
    rh (%)           70091.0    76.009788  16.474920    13.88    65.21    79.30    89.400   100.00
    VPmax (mbar)     70091.0    13.576576   7.739883     0.97     7.77    11.82    17.610    63.77
    VPact (mbar)     70091.0     9.533968   4.183658     0.81     6.22     8.86    12.360    28.25
    VPdef (mbar)     70091.0     4.042536   4.898549     0.00     0.87     2.19     5.300    46.01
    sh (g/kg)        70091.0     6.022560   2.655812     0.51     3.92     5.59     7.800    18.07
    H2OC (mmol/mol)  70091.0     9.640437   4.234862     0.81     6.29     8.96    12.490    28.74
    rho (g/m**3)     70091.0  1216.061232  39.974263  1059.45  1187.47  1213.80  1242.765  1393.54
    wv (m/s)         70091.0     1.702567  65.447512 -9999.00     0.99     1.76     2.860    14.01
    max. wv (m/s)    70091.0     2.963041  75.597657 -9999.00     1.76     2.98     4.740    23.50
    wd (deg)         70091.0   174.789095  86.619431     0.00   125.30   198.10   234.000   360.00

[Figure: Plot of entire data set]

[Figure: Plot of initial part of data set]

Cleanup

Most of this section depends heavily on the kind of data being used. This data set describes the weather, and various parameters are used to predict the temperature. Most of this cleanup is therefore specific to this kind of data set and may be completely irrelevant to others.

Wind velocity

The minimum value of both wind velocity columns, wv (m/s) and max. wv (m/s), is -9999, which is clearly an error. Because wind direction is stored in its own column, velocity should never be negative, so replace these values with zero.

    wv = df['wv (m/s)']
    bad_wv = wv == -9999.0
    wv[bad_wv] = 0.0

    max_wv = df['max. wv (m/s)']
    bad_max_wv = max_wv == -9999.0
    max_wv[bad_max_wv] = 0.0

    # Check that the original DataFrame was edited correctly
    print(df['wv (m/s)'].min())
    # Expected value: 0.0

Note: this may not apply to all kinds of data sets.

Changing wind variables

To see why the wind variables need to change, plot them first.

    plt.figure()
    plt.hist2d(df['wd (deg)'], df['wv (m/s)'], bins=(50, 50), vmax=400)
    plt.colorbar()
    plt.xlabel('Wind Direction [deg]')
    plt.ylabel('Wind Velocity [m/s]')
    # If you did not call this earlier:
    # plt.show()

[Figure: Original wind variables]

Without going into too much detail: angles make poor model inputs. 0 degrees and 360 degrees describe the same direction and should sit next to each other, but in this representation they are at opposite ends, with a sharp jump at the 0 degree mark. Also, the direction of the wind does not matter when its velocity is 0 (no wind). So let's change the DataFrame and express the wind as a vector, rather than as a scalar velocity plus a direction in degrees.

    # Remove these columns from the DataFrame and replace them with vectors
    wv = df.pop('wv (m/s)')
    max_wv = df.pop('max. wv (m/s)')

    # Convert to radians
    wd_rad = df.pop('wd (deg)') * np.pi / 180

    # Calculate the wind x and y components
    df['Wx'] = wv * np.cos(wd_rad)
    df['Wy'] = wv * np.sin(wd_rad)

    # Calculate the max wind x and y components
    df['max Wx'] = max_wv * np.cos(wd_rad)
    df['max Wy'] = max_wv * np.sin(wd_rad)

    # Check the histogram again, with a better representation of the wind variables
    plt.figure()
    plt.hist2d(df['Wx'], df['Wy'], bins=(50, 50), vmax=400)
    plt.colorbar()
    plt.xlabel('Wind X [m/s]')
    plt.ylabel('Wind Y [m/s]')
    ax = plt.gca()
    ax.axis('tight')
    # plt.show()

[Figure: Modified wind variables]

The model interprets wind variables in this form better than velocity and direction.

    # Optional: check the DataFrame again to see the modifications made so far
    print(df.head())
    print(df.describe().transpose())

Output:

        p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  ...        Wx        Wy    max Wx    max Wy
    5     996.50     -8.05    265.38        -8.78  ... -0.204862 -0.046168 -0.614587 -0.138503
    11    996.62     -8.88    264.54        -9.77  ... -0.245971 -0.044701 -0.619848 -0.112645
    17    996.84     -8.81    264.59        -9.66  ... -0.175527  0.039879 -0.614344  0.139576
    23    996.99     -9.05    264.34       -10.02  ... -0.050000 -0.086603 -0.190000 -0.329090
    29    997.46     -9.63    263.72       -10.65  ... -0.368202  0.156292 -0.810044  0.343843

    [5 rows x 15 columns]

                       count         mean        std  ...          50%          75%          max
    p (mbar)         70091.0   989.212842   8.358886  ...   989.570000   994.720000  1015.290000
    T (degC)         70091.0     9.450482   8.423384  ...     9.410000    15.480000    37.280000
    Tpot (K)         70091.0   283.493086   8.504424  ...   283.460000   289.530000   311.210000
    Tdew (degC)      70091.0     4.956471   6.730081  ...     5.210000    10.080000    23.060000
    rh (%)           70091.0    76.009788  16.474920  ...    79.300000    89.400000   100.000000
    VPmax (mbar)     70091.0    13.576576   7.739883  ...    11.820000    17.610000    63.770000
    VPact (mbar)     70091.0     9.533968   4.183658  ...     8.860000    12.360000    28.250000
    VPdef (mbar)     70091.0     4.042536   4.898549  ...     2.190000     5.300000    46.010000
    sh (g/kg)        70091.0     6.022560   2.655812  ...     5.590000     7.800000    18.070000
    H2OC (mmol/mol)  70091.0     9.640437   4.234862  ...     8.960000    12.490000    28.740000
    rho (g/m**3)     70091.0  1216.061232  39.974263  ...  1213.800000  1242.765000  1393.540000
    Wx               70091.0    -0.627813   1.987440  ...    -0.633142     0.299975     8.244699
    Wy               70091.0    -0.407068   1.552621  ...    -0.293467     0.450077     7.733831
    max Wx           70091.0    -1.018681   3.095279  ...    -1.117029     0.627619    11.913133
    max Wy           70091.0    -0.733589   2.611890  ...    -0.527021     0.822895    14.302308

Quick summary

- We took the desired features/columns from the original data set.
- We modified the data so that -9999 no longer appears as the minimum wind velocity.
- We changed the wind variables from wind velocity + wind direction (degrees) into a wind velocity vector with x and y components.

More cleanup

The timestamp, whether as a date-time string or as a time in seconds, is not very useful to the model as it is, so we derive features from it instead. We convert the date-time to seconds and then to the sine and cosine of the time of day and the time of year. This simplifies the input to the model and makes the periodicity in the data easier to identify.

    # Convert the timestamps in the data to seconds
    timestamp_s = date_time.map(datetime.datetime.timestamp)

    day = 24 * 60 * 60
    year = (365.2425) * day

    # Make useful features out of this for the DataFrame
    df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
    df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
    df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
    df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))

    # To see how this is useful for the model, try visualizing the data in terms of frequency.
    # This part is only needed if you want to know how the features above help the model.
    # Nothing below changes the data.
    # fft = tf.signal.rfft(df['T (degC)'])
    # f_per_dataset = np.arange(0, len(fft))

    # n_samples_h = len(df['T (degC)'])
    # hours_per_year = 24 * 365.2524
    # years_per_dataset = n_samples_h / (hours_per_year)

    # f_per_year = f_per_dataset / years_per_dataset
    # plt.figure()
    # plt.step(f_per_year, np.abs(fft))
    # plt.xscale('log')
    # plt.ylim(0, 400000)
    # plt.xlim([0.1, max(plt.xlim())])
    # plt.xticks([1, 365.2524], labels=['1/Year', '1/day'])
    # _ = plt.xlabel('Frequency (log scale)')

Preparing data for the model

All the necessary cleanup and modifications to the data are done. Now let's prepare the data for the model.

Split the data

Split the data into training, validation and testing sets with these ratios: training 70%, validation 20%, testing 10%. The split is chronological (no shuffling), so the validation and test sets come from later in time than the training set.

    column_indices = {name: i for i, name in enumerate(df.columns)}

    n = len(df)
    train_df = df[0:int(n * 0.7)]
    val_df = df[int(n * 0.7):int(n * 0.9)]
    test_df = df[int(n * 0.9):]

    num_features = df.shape[1]

Normalize the data

Normalize each column by subtracting the mean and dividing by the standard deviation. Both are computed from the training set only, so the model has no access to the values in the validation and test sets.

    train_mean = train_df.mean()
    train_std = train_df.std()

    train_df = (train_df - train_mean) / train_std
    val_df = (val_df - train_mean) / train_std
    test_df = (test_df - train_mean) / train_std

    df_std = (df - train_mean) / train_std
    df_std = df_std.melt(var_name='Column', value_name='Normalized')

    # Quick view of how the data looks
    plt.figure(figsize=(12, 6))
    ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
    _ = ax.set_xticklabels(df.keys(), rotation=90)
    # plt.show()

[Figure: After normalization]

Functions for making a moving window generator and making plots

This process is one methodology; you can also use other methods. See the TensorFlow website for more information: https://www.tensorflow.org/tutorials/structured_data/time_series

    # Indexes and offsets
    class WindowGenerator():
        def __init__(self, input_width, label_width, shift,
                     train_df=train_df, val_df=val_df, test_df=test_df,
                     label_columns=None):
            # Store the raw data.
            self.train_df = train_df
            self.val_df = val_df
            self.test_df = test_df

            # Work out the label column indices.
            self.label_columns = label_columns
            if label_columns is not None:
                self.label_columns_indices = {name: i for i, name in
                                              enumerate(label_columns)}
            self.column_indices = {name: i for i, name in
                                   enumerate(train_df.columns)}

            # Work out the window parameters.
            self.input_width = input_width
            self.label_width = label_width
            self.shift = shift

            self.total_window_size = input_width + shift

            self.input_slice = slice(0, input_width)
            self.input_indices = np.arange(self.total_window_size)[self.input_slice]

            self.label_start = self.total_window_size - self.label_width
            self.labels_slice = slice(self.label_start, None)
            self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

        def __repr__(self):
            return '\n'.join([
                f'Total window size: {self.total_window_size}',
                f'Input indices: {self.input_indices}',
                f'Label indices: {self.label_indices}',
                f'Label column name(s): {self.label_columns}'])

    # Split window
    def split_window(self, features):
        inputs = features[:, self.input_slice, :]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
                [labels[:, :, self.column_indices[name]] for name in self.label_columns],
                axis=-1)

        # Slicing doesn't preserve static shape information, so set the shapes
        # manually. This way the `tf.data.Datasets` are easier to inspect.
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])

        return inputs, labels

    WindowGenerator.split_window = split_window

    # Plotting
    def plot(self, model=None, plot_col='T (degC)', max_subplots=3):
        inputs, labels = self.example
        plt.figure(figsize=(12, 8))
        plot_col_index = self.column_indices[plot_col]
        max_n = min(max_subplots, len(inputs))
        for n in range(max_n):
            plt.subplot(3, 1, n + 1)
            plt.ylabel(f'{plot_col} [normed]')
            plt.plot(self.input_indices, inputs[n, :, plot_col_index],
                     label='Inputs', marker='.', zorder=-10)

            if self.label_columns:
                label_col_index = self.label_columns_indices.get(plot_col, None)
            else:
                label_col_index = plot_col_index

            if label_col_index is None:
                continue

            plt.scatter(self.label_indices, labels[n, :, label_col_index],
                        edgecolors='k', label='Labels', c='#2ca02c', s=64)
            if model is not None:
                predictions = model(inputs)
                plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                            marker='X', edgecolors='k', label='Predictions',
                            c='#ff7f0e', s=64)

            if n == 0:
                plt.legend()

        plt.xlabel('Time [h]')

    WindowGenerator.plot = plot

    # Creating tf.data.Datasets
    def make_dataset(self, data):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=1,
            shuffle=True,
            batch_size=32,)

        ds = ds.map(self.split_window)

        return ds

    WindowGenerator.make_dataset = make_dataset

    # Add properties to the window generator: train, validation, test, example
    @property
    def train(self):
        return self.make_dataset(self.train_df)

    @property
    def val(self):
        return self.make_dataset(self.val_df)

    @property
    def test(self):
        return self.make_dataset(self.test_df)

    @property
    def example(self):
        """Get and cache an example batch of `inputs, labels` for plotting."""
        result = getattr(self, '_example', None)
        if result is None:
            # No example batch was found, so get one from the `.train` dataset
            result = next(iter(self.train))
            # And cache it for next time
            self._example = result
        return result

    WindowGenerator.train = train
    WindowGenerator.val = val
    WindowGenerator.test = test
    WindowGenerator.example = example

Extremely simplistic model

To understand what is going on, let's define a very simple model that uses the variables currently in place and makes a prediction for the next hour.

    # Optional
    # Very simple model: predicts 1 time step (1 hr) into the future

    # Generate a window for the simple model
    single_step_window = WindowGenerator(
        input_width=1, label_width=1, shift=1,
        label_columns=['T (degC)'])

    # 1 time step is the input, and the prediction is made 1 hr into the future
    # for the field "T (degC)"
    print(single_step_window)

    # Shows the batch size, how many time steps are consumed by the input and
    # predicted by the output, and the number of features used and labels in the
    # output (in our case only 1 label in the output: T (degC))
    for example_inputs, example_labels in single_step_window.train.take(1):
        print(f'Inputs shape (batch, time, features): {example_inputs.shape}')
        print(f'Labels shape (batch, time, features): {example_labels.shape}')

    # Create an extremely basic model to be fed by the window above.
    # It returns the current temperature as the prediction for the next hour.
    class Baseline(tf.keras.Model):
        def __init__(self, label_index=None):
            super().__init__()
            self.label_index = label_index

        def call(self, inputs):
            if self.label_index is None:
                return inputs
            result = inputs[:, :, self.label_index]
            return result[:, :, tf.newaxis]

    baseline = Baseline(label_index=column_indices['T (degC)'])

    baseline.compile(loss=tf.losses.MeanSquaredError(),
                     metrics=[tf.metrics.MeanAbsoluteError()])

    val_performance = {}
    performance = {}
    val_performance['Baseline'] = baseline.evaluate(single_step_window.val)
    # Also record the test score, so that the performance comparison later has
    # matching entries in both dictionaries
    performance['Baseline'] = baseline.evaluate(single_step_window.test, verbose=0)

Output:

    Total window size: 2
    Input indices: [0]
    Label indices: [1]
    Label column name(s): ['T (degC)']
    Inputs shape (batch, time, features): (32, 1, 19)
    Labels shape (batch, time, features): (32, 1, 1)
    439/439 [==============================] - 2s 5ms/step - loss: 0.0128 - mean_absolute_error: 0.0785

To make the model more interesting, repeat this window for 24 hrs instead of just 1 hr.

    # Optional
    wide_window = WindowGenerator(
        input_width=24, label_width=24, shift=1,
        label_columns=['T (degC)'])

    print(wide_window)

    print('Input shape:', single_step_window.example[0].shape)
    print('Output shape:', baseline(single_step_window.example[0]).shape)

    wide_window.plot(baseline)
    # plt.show()

Output:

    Total window size: 25
    Input indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
    Label indices: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
    Label column name(s): ['T (degC)']
    Input shape: (32, 1, 19)
    Output shape: (32, 1, 1)

[Figure: Three batches of the basic model]

Actual CNN model

    MAX_EPOCHS = 20

    # A function to quickly compile and fit a model
    def compile_and_fit(model, window, patience=2):
        early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                          patience=patience,
                                                          mode='min')

        model.compile(loss=tf.losses.MeanSquaredError(),
                      optimizer=tf.optimizers.Adam(),
                      metrics=[tf.metrics.MeanAbsoluteError()])

        history = model.fit(window.train, epochs=MAX_EPOCHS,
                            validation_data=window.val,
                            callbacks=[early_stopping])
        return history

    # How many past time steps the convolution uses
    CONV_WIDTH = 48
    # How many values to predict
    LABEL_WIDTH = 1
    # For plotting purposes
    INPUT_WIDTH = LABEL_WIDTH + (CONV_WIDTH - 1)
    # How many time steps into the future the predicted value lies
    SHIFT = 12

    conv_window = WindowGenerator(
        input_width=INPUT_WIDTH,
        label_width=LABEL_WIDTH,
        shift=SHIFT,
        label_columns=['T (degC)'])

    print(conv_window)

    # CNN model
    conv_model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters=32,
                               kernel_size=(CONV_WIDTH,),
                               activation='relu'),
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dense(units=1, name='predict'),
    ])

    print("Conv model on `conv_window`")
    print('Input shape:', conv_window.example[0].shape)
    print('Output shape:', conv_model(conv_window.example[0]).shape)

    history = compile_and_fit(conv_model, conv_window)

    val_performance['Conv'] = conv_model.evaluate(conv_window.val)
    performance['Conv'] = conv_model.evaluate(conv_window.test, verbose=0)

    conv_model.summary()

    conv_window.plot(conv_model)

Output:

    Total window size: 60
    Input indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
     24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47]
    Label indices: [59]
    Label column name(s): ['T (degC)']
    Conv model on `conv_window`
    Input shape: (32, 48, 19)
    Output shape: (32, 1, 1)
    Epoch 1/20
    1532/1532 [==============================] - 12s 8ms/step - loss: 0.1032 - mean_absolute_error: 0.2499 - val_loss: 0.1030 - val_mean_absolute_error: 0.2499
    Epoch 2/20
    1532/1532 [==============================] - 11s 7ms/step - loss: 0.0859 - mean_absolute_error: 0.2289 - val_loss: 0.0965 - val_mean_absolute_error: 0.2427
    Epoch 3/20
    1532/1532 [==============================] - 12s 8ms/step - loss: 0.0802 - mean_absolute_error: 0.2204 - val_loss: 0.1005 - val_mean_absolute_error: 0.2490
    Epoch 4/20
    1532/1532 [==============================] - 11s 7ms/step - loss: 0.0764 - mean_absolute_error: 0.2152 - val_loss: 0.1084 - val_mean_absolute_error: 0.2593
    437/437 [==============================] - 2s 5ms/step - loss: 0.1084 - mean_absolute_error: 0.2593
    Model: "sequential"
    Layer (type)                 Output Shape              Param #
    conv1d (Conv1D)              (None, 1, 32)             29216
    dense (Dense)                (None, 1, 32)             1056
    predict (Dense)              (None, 1, 1)              33
    Total params: 30,305
    Trainable params: 30,305
    Non-trainable params: 0

[Figure: CNN predictions]

Performance comparison

    # Compare performances
    x = np.arange(len(performance))
    width = 0.3
    metric_name = 'mean_absolute_error'
    metric_index = conv_model.metrics_names.index('mean_absolute_error')
    val_mae = [v[metric_index] for v in val_performance.values()]
    test_mae = [v[metric_index] for v in performance.values()]

    plt.figure()
    plt.ylabel('mean_absolute_error [T (degC), normalized]')
    plt.bar(x - 0.17, val_mae, width, label='Validation')
    plt.bar(x + 0.17, test_mae, width, label='Test')
    plt.xticks(ticks=x, labels=performance.keys(), rotation=45)
    _ = plt.legend()
    plt.show()

[Figure: Performance comparison]

Save the model

Saving a model in Keras is simple.

    # Replace the placeholders with the path (as a string) where you want to save
    # the model, followed by a / and the name of the model
    conv_model.save("{path_to_save_model}/{name_of_saved_model}")

    # How to load a model
    # model = tf.keras.models.load_model("{path_to_save_model}/{name_of_saved_model}")
    # print("loaded")
    # model.summary()

    # How to check all the input and output tensor names
    # print(os.system("saved_model_cli show --dir {path_to_save_model}/{name_of_saved_model} --all"))

Output:

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

    signature_def['__saved_model_init_op']:
      The given SavedModel SignatureDef contains the following input(s):
      The given SavedModel SignatureDef contains the following output(s):
        outputs['__saved_model_init_op'] tensor_info:
            dtype: DT_INVALID
            shape: unknown_rank
            name: NoOp
      Method name is:

    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['conv1d_input'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 48, 19)
            name: serving_default_conv1d_input:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['predict'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 1, 1)
            name: StatefulPartitionedCall:0
      Method name is: tensorflow/serving/predict

    Defined Functions:
      Function Name: '__call__'
        Option #1
          Callable with:
            Argument #1
              inputs: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='inputs')
            Argument #2
              DType: bool
              Value: False
            Argument #3
              DType: NoneType
              Value: None
        Option #2
          Callable with:
            Argument #1
              conv1d_input: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='conv1d_input')
            Argument #2
              DType: bool
              Value: False
            Argument #3
              DType: NoneType
              Value: None
        Option #3
          Callable with:
            Argument #1
              inputs: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='inputs')
            Argument #2
              DType: bool
              Value: True
            Argument #3
              DType: NoneType
              Value: None
        Option #4
          Callable with:
            Argument #1
              conv1d_input: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='conv1d_input')
            Argument #2
              DType: bool
              Value: True
            Argument #3
              DType: NoneType
              Value: None

      Function Name: '_default_save_signature'
        Option #1
          Callable with:
            Argument #1
              conv1d_input: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='conv1d_input')

      Function Name: 'call_and_return_all_conditional_losses'
        Option #1
          Callable with:
            Argument #1
              conv1d_input: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='conv1d_input')
            Argument #2
              DType: bool
              Value: True
            Argument #3
              DType: NoneType
              Value: None
        Option #2
          Callable with:
            Argument #1
              conv1d_input: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='conv1d_input')
            Argument #2
              DType: bool
              Value: False
            Argument #3
              DType: NoneType
              Value: None
        Option #3
          Callable with:
            Argument #1
              inputs: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='inputs')
            Argument #2
              DType: bool
              Value: True
            Argument #3
              DType: NoneType
              Value: None
        Option #4
          Callable with:
            Argument #1
              inputs: TensorSpec(shape=(None, 48, 19), dtype=tf.float32, name='inputs')
            Argument #2
              DType: bool
              Value: False
            Argument #3
              DType: NoneType
              Value: None
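The serving signature above expects a float32 input of shape (batch, 48, 19) and returns a normalized temperature of shape (batch, 1, 1). As a rough sketch of how one input window could be prepared and how a prediction could be converted back to degrees Celsius, consider the NumPy-only example below. The helper names make_input_window and denormalize, the random stand-in data, the zero stand-in prediction, and the t_mean and t_std values are all assumptions made for illustration; in practice you would pass the window to the loaded model and use the 'T (degC)' entries of the training mean and standard deviation computed earlier.

```python
import numpy as np

# Shapes taken from the serving_default signature: (-1, 48, 19) in, (-1, 1, 1) out.
INPUT_WIDTH = 48
NUM_FEATURES = 19


def make_input_window(normalized_rows, input_width=INPUT_WIDTH):
    """Return the latest `input_width` rows as a float32 batch of size 1."""
    rows = np.asarray(normalized_rows, dtype=np.float32)
    if rows.ndim != 2 or rows.shape[0] < input_width:
        raise ValueError(f"need at least {input_width} rows of shape (time, features)")
    # Keep the most recent rows and add a leading batch axis.
    return rows[-input_width:][np.newaxis, ...]


def denormalize(prediction, mean, std):
    """Undo (x - mean) / std for the predicted column."""
    return np.asarray(prediction) * std + mean


# Stand-in data (hypothetical): 100 hourly rows of already-normalized features.
rng = np.random.default_rng(0)
recent = rng.normal(size=(100, NUM_FEATURES))

window = make_input_window(recent)
print(window.shape, window.dtype)

# With a real model this would be something like: prediction = model.predict(window)
# Here an array of zeros with the signature's output shape stands in for it.
prediction = np.zeros((1, 1, 1), dtype=np.float32)

# Hypothetical training statistics for 'T (degC)'.
t_mean, t_std = 9.0, 8.5
temperature = denormalize(prediction, t_mean, t_std)
print(temperature.shape, float(temperature[0, 0, 0]))
```

Because the model was trained on normalized data, both steps matter: the input rows must be normalized with the same training statistics, and the output is only meaningful in degrees Celsius after the inverse transform.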
