Linear Regression Notebook
This page describes how you can use a Jupyter Python notebook, such as from within IBM Data Science Experience, to build a linear regression model with the TensorFlow library. For this regression model, the machine learning task is to predict median housing value from several predictor variables. In the accompanying notebook, the main steps are:
- read training data from a local CSV file into a Pandas dataframe
- use TensorFlow to train the linear regression model with the data
- save the machine learned model to a local dataset
- restore the machine learned model from the local dataset
- perform inferences with the restored machine learned model
As preliminary info, I downloaded the California housing data from the same source as the scikit-learn fetch_california_housing() method. This gave me a CSV file with a sample dataset that maps house prices to several predictor variables such as house age, number of bedrooms, and municipal population. I added a header row based on the domain descriptions of the columns. This CSV file is stored alongside the notebook, and its content is loaded by the first cell:
import pandas as pd
df_data_1 = pd.read_csv('cal_housing_data with headers.csv')
df_data_1.head()
Now, we start getting into interesting code. This first snippet just imports numpy and then extracts data from the Pandas dataframe into numpy arrays, which is what TensorFlow needs as input. Because we'll be training a simple model with only 20,640 rows of data, we're loading it all at once; for larger training sets, you'd load and train in smaller batches (a sketch of that follows the next cell). The "housing_data" array holds the 20,640 values for each of the 8 predictor variables, and "housing_target" is the vector of 20,640 house values that we will be machine learning how to predict.
import numpy as np
# Make a numpy array from the dataframe
data = df_data_1.values
# Separate the 'predictors' (aka 'features') from the dependent variable (aka 'label') that we will learn how to predict
housing_data = np.delete(data, 8, axis=1)              # columns 0-7: the 8 predictors
housing_target = np.delete(data, slice(0, 8), axis=1)  # column 8: the median house values
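As mentioned above, for a larger training set you wouldn't load everything at once. Here's a minimal sketch, using pandas' chunksize parameter, of what chunked loading could look like (the chunk size of 5,000 is an arbitrary choice of mine):
import pandas as pd
import numpy as np
# A sketch of chunked loading; in a real batch-training loop you would
# feed each chunk to the training step instead of accumulating them all
chunks = []
for chunk in pd.read_csv('cal_housing_data with headers.csv', chunksize=5000):
    chunks.append(chunk.values)
data = np.vstack(chunks)  # same array as df_data_1.values above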
The two lines in the next cell are just a little housekeeping to prepare for the machine learning step:
m, n = housing_data.shape  # m = 20,640 samples, n = 8 predictors
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing_data]  # prepend a column of ones for the intercept term
So, now we're going to build the world's simplest machine learning model, because we'd like to see the elements of TensorFlow-based machine learning with as little problem complexity as possible getting in the way of understanding. The word 'tensor' just means n-dimensional array, and TensorFlow is a library that makes it easy to specify a computational 'flow' of tensors and then to execute that flow in the most efficient way possible given the compute power available. In essence, the data scientist describes what computations must occur, and TensorFlow determines how to do them efficiently.
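To make that 'define first, run later' idea concrete, here is a trivial flow of my own (not part of the notebook) that adds two constants; nothing is computed until the session runs:
import tensorflow as tf
# Define a tiny compute graph: no computation happens yet
a = tf.constant(2.0, name="a")
b = tf.constant(3.0, name="b")
total = a + b
# Execute the graph: TensorFlow decides how to run it on the available hardware
with tf.Session() as sess:
    print(sess.run(total))  # 5.0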
We're going to start by defining the 'flow' or computation graph that TensorFlow will run. In this case, we're going to define the compute tree for training a multiple linear regression using the 8 predictor variables and the housing value variable that we'd like to learn how to predict. Here's what that looks like:
import tensorflow as tf
# Make the compute graph
X = tf.constant(housing_data_plus_bias, dtype=tf.float64, name="X")
XT = tf.transpose(X)
y = tf.constant(housing_target.reshape(-1, 1), dtype=tf.float64, name="y")
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)  # (X^T X)^-1 X^T y
The X variable is the matrix of 20,640 samples by 9 columns (the 8 predictors plus the bias column added above). XT is the transpose needed in the linear regression computation. The 'y' variable is the dependent variable, and it is assigned the 20,640 housing values we have in the training data. The 'theta' variable is the vector of linear regression coefficients that results from the series of matrix operations on the right-hand side.
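For reference, the formula being computed is the closed-form 'normal equation' solution to least squares:

$$\hat{\theta} = (X^T X)^{-1} X^T y$$

where X is the bias-augmented predictor matrix and y is the vector of house values.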
It is important to note that the code above just specifies the compute graph, i.e. the tensor flow. To perform the flow, you then run the following code:
# Run the compute graph
with tf.Session() as sess:
    theta_value = theta.eval()
If you then run a line of code to output theta_value, you will get an output like this:
array([[ -3.59402294e+06],
[ -4.28237438e+04],
[ -4.25767219e+04],
[ 1.15630387e+03],
[ -8.18164928e+00],
[ 1.13410689e+02],
[ -3.85350953e+01],
[ 4.83082868e+01],
[ 4.02485142e+04]])
For a linear regression, this is the machine learned model. It gives the coefficients of the linear equation that best fits the training data. Given values for the 8 predictor variables, such as house age and number of bedrooms, these coefficients can be used to predict a house value. We'll see how to do that below. First, though, we're going to see how to save and reload the model in TensorFlow: you would typically want to save a model trained in IBM Data Science Experience or on your laptop, transport it to a production deployment environment, and restore it there so that you can actually use it for inference (prediction).
The first time I ran this notebook, I used this line to create a subdirectory in datasets where I could save the TensorFlow model from this notebook:
!mkdir "../datasets/Linear Regression"
Then, to save the model, I defined a second simple TensorFlow compute graph that just assigns the theta_value vector to a tf.Variable called "model". The code below creates and then executes this simple tensor flow, and then saves the result in the subdirectory created above.
model = tf.Variable(tf.constant(theta_value, dtype=tf.float64), name="model")
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as saver_sess:
    init.run()
    theta_value = model.eval()
    save_path = saver.save(saver_sess, "../datasets/Linear Regression/Linear Regression.ckpt")
The save method we're using here is useful to know about because it is the same "checkpoint" mechanism you would use if you were incrementally training a larger model over many epochs. It's also useful to understand that what gets saved is the compute graph plus the values of the tf.Variable variables defined in the model being checkpointed. In other words, what gets saved is specific to the type of model you're training, because the type of model determines the compute graph, or tensor flow, that you specified. In a neural net, for example, you'd have to save the structure of the net in addition to the weights and biases. For a linear regression, we already know the structure is a linear equation, so we just save the coefficients. Regardless of what is being saved, TensorFlow actually writes four files, as shown by the line of code below and its output:
!ls "../datasets/Linear Regression"
checkpoint Linear Regression.ckpt.index
Linear Regression.ckpt.data-00000-of-00001 Linear Regression.ckpt.meta
Now, suppose you were to move these four files to a production deployment environment. Below is code that you could use to reload the model so that you could use it for inference:
sess_restore = tf.Session()
# Recreate the compute graph from the .meta file, then restore the variable values
saver = tf.train.import_meta_graph('../datasets/Linear Regression/Linear Regression.ckpt.meta')
saver.restore(sess_restore, tf.train.latest_checkpoint('../datasets/Linear Regression/'))
theta_value = sess_restore.run('model:0')
sess_restore.close()
At last, you can perform inferences using the 'theta_value' vector. To simulate making a prediction in the code below, I've used the 0th row of housing_data for the values of the predictor variables. I initialize 'predicted_value' to the constant (intercept) coefficient of the linear equation, and then put the remaining coefficients of theta_value in 'linear_coefficients' to make the loop easier to read. The loop then multiplies each predictor variable value housing_data[0][j] by the corresponding coefficient. (Each coefficient 'c' in the iteration over linear_coefficients is, unfortunately, an array of size 1, so c[0] is used to get the actual value.)
predicted_value = theta_value[0][0]
linear_coefficients = theta_value[1:]
for j, c in enumerate(linear_coefficients):
    predicted_value += c[0] * housing_data[0][j]
If you now run a line of Python code to see the value of predicted_value, you will get output like this:
411111.09606504953
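Equivalently, because the bias column was already prepended in housing_data_plus_bias, the whole prediction collapses to a single dot product; this is a quick way to sanity-check the loop above:
# Vectorized equivalent of the loop: row 0 of the bias-augmented data
# dotted with the full coefficient vector
predicted_value_vec = housing_data_plus_bias[0].dot(theta_value)[0]
print(predicted_value_vec)  # should match predicted_value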
Finally, it's worth noting that, for larger models, you can also use TensorFlow itself to perform the inference. But because this is a linear regression involving only 9 coefficients, using TensorFlow would probably just slow things down. Still, it is an easy tensor flow to write... an exercise for the reader!
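If you want to check your work on that exercise, here is one minimal sketch (the variable names are my own); it computes the same dot product as the loop above, but as a tensor flow:
# Inference as a tensor flow: multiply the bias-augmented predictor row
# by the restored coefficient vector
X_new = tf.constant(housing_data_plus_bias[0].reshape(1, -1), dtype=tf.float64)
theta_restored = tf.constant(theta_value, dtype=tf.float64)
y_pred = tf.matmul(X_new, theta_restored)
with tf.Session() as sess:
    print(sess.run(y_pred))  # should match predicted_value above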