The numpy ndarray class is used to represent both matrices and vectors. To construct a matrix in numpy we pass a list of the matrix's rows to the numpy array constructor.
For example, to construct a numpy array that corresponds to the matrix

$$A = \begin{pmatrix} 1 & -1 & 2 \\ 3 & 2 & 0 \end{pmatrix}$$

we would do
A = np.array([[1,-1,2],[3,2,0]])
Vectors are just arrays with a single column. For example, to construct the vector

$$v = \begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}$$

we would do
v = np.array([[2],[1],[3]])
A more convenient approach is to transpose the corresponding row vector. For example, to make the vector above we could instead transpose the row vector

$$\begin{pmatrix} 2 & 1 & 3 \end{pmatrix}$$

The code for this is
v = np.transpose(np.array([[2,1,3]]))
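Note the double brackets in both constructions: they tell numpy to build a two-dimensional array with a single row, which transpose can then turn into a column. A minimal sketch of the difference (the shape checks here are just for illustration):

import numpy as np

flat = np.array([2,1,3])         # 1-D array, shape (3,)
row = np.array([[2,1,3]])        # 2-D array with one row, shape (1,3)

# Transposing a 1-D array leaves it unchanged
print(np.transpose(flat).shape)  # (3,)
# Transposing the 2-D row produces a proper column vector
print(np.transpose(row).shape)   # (3,1)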
numpy overloads the array index and slicing notations to access parts of a matrix. Indices are zero-based, so to print the bottom right entry of the matrix A (row 1, column 2) we would do
print(A[1,2])
To slice out the second column in the A matrix we would do
col = A[:,1:2]
The first slice selects all rows in A, while the second slice selects just the middle entry in each row. Using the slice 1:2 instead of the plain index 1 keeps the result two-dimensional: A[:,1] would return a flat one-dimensional array rather than a column.
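A short sketch of that difference, assuming the A defined above:

col_2d = A[:,1:2]   # shape (2,1): a proper column
col_1d = A[:,1]     # shape (2,): a flat array
print(col_2d.shape, col_1d.shape)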
To do a matrix multiplication or a matrix-vector multiplication we use the np.dot() function:
w = np.dot(A,v)
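For the A and v constructed above, A has shape (2,3) and v has shape (3,1), so the inner dimensions agree and the product w comes out with shape (2,1). A quick check:

print(w.shape)   # (2, 1)
print(w)         # [[7] [8]]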
One of the more common problems in linear algebra is solving a matrix-vector equation. Here is an example. We seek the vector x that solves the equation
$$A x = b$$

where

$$A = \begin{pmatrix} 2 & 1 & -2 \\ 3 & 0 & 1 \\ 1 & 1 & -1 \end{pmatrix}, \qquad b = \begin{pmatrix} -3 \\ 5 \\ -2 \end{pmatrix}$$
We start by constructing the arrays for A and b.
A = np.array([[2,1,-2],[3,0,1],[1,1,-1]])
b = np.transpose(np.array([[-3,5,-2]]))
To solve the system we do
x = np.linalg.solve(A,b)
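As a quick sanity check (not part of the original computation), we can verify the solution by multiplying it back through A:

print(x)                              # [[ 1.] [-1.] [ 2.]]
print(np.allclose(np.dot(A,x), b))    # True if x solves the system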
In a multiple regression problem we seek a function that can map input data points to outcome values. Each data point is a feature vector $(x_1, x_2, \ldots, x_m)$ composed of two or more data values that capture various features of the input. To represent all of the input data along with the vector of output values we set up an input matrix X and an output vector y:

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Each row of X holds one data point, with a leading 1 so that the model can include a constant intercept term.
In a simple least-squares linear regression model we seek a vector β such that the product Xβ most closely approximates the outcome vector y.
Once we have constructed the β vector we can use it to map input data to predicted outcomes. Given an input vector in the form

$$x = \begin{pmatrix} 1 & x_1 & x_2 & \cdots & x_m \end{pmatrix}$$

we can compute a predicted outcome value

$$\hat{y} = x \beta$$
The formula to compute the β vector is

$$\beta = (X^T X)^{-1} X^T y$$
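This is the standard least-squares solution; it comes from minimizing the squared length of the residual:

$$\min_{\beta} \; \lVert X\beta - y \rVert^2$$

Setting the gradient with respect to β to zero gives

$$2\, X^T (X\beta - y) = 0 \quad\Longrightarrow\quad X^T X\, \beta = X^T y$$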
In our next example program I will use numpy to construct the appropriate matrices and vectors and solve for the β vector. Once we have solved for β we will use it to make predictions for some test data points that we initially left out of our input data set.
Assuming we have constructed the input matrix X and the outcomes vector y in numpy, the following code will compute the β vector:
Xt = np.transpose(X)
XtX = np.dot(Xt,X)
Xty = np.dot(Xt,y)
beta = np.linalg.solve(XtX,Xty)
The last line uses np.linalg.solve to compute β, since the equation

$$\beta = (X^T X)^{-1} X^T y$$

is mathematically equivalent to the system of equations

$$(X^T X)\, \beta = X^T y$$

Solving this system directly is both faster and numerically more stable than explicitly computing the inverse of XᵀX.
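As an aside, numpy also offers np.linalg.lstsq, which solves the least-squares problem directly without forming XᵀX; a minimal alternative sketch, assuming X and y are built as above:

# Solve min ||X*beta - y||^2 directly (SVD-based routine)
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

This yields the same β and is generally preferred when XᵀX is close to singular.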
The data set I will use for this example is the Windsor house price data set, which contains information about home sales in the Windsor, Ontario area. The input variables cover a range of factors that may potentially have an impact on house prices, such as lot size, number of bedrooms, and the presence of various amenities. A CSV file with the full data set is available here. I downloaded the data set from this site, which offers a large number of data sets covering a wide range of topics.
Here now is the source code for the example program.
import csv
import numpy as np

def readData():
    X = []
    y = []
    with open('Housing.csv') as f:
        rdr = csv.reader(f)
        # Skip the header row
        next(rdr)
        # Read X and y
        for line in rdr:
            xline = [1.0]
            for s in line[:-1]:
                xline.append(float(s))
            X.append(xline)
            y.append(float(line[-1]))
    return (X,y)

X0,y0 = readData()

# Convert all but the last 10 rows of the raw data to numpy arrays
d = len(X0)-10
X = np.array(X0[:d])
y = np.transpose(np.array([y0[:d]]))

# Compute beta
Xt = np.transpose(X)
XtX = np.dot(Xt,X)
Xty = np.dot(Xt,y)
beta = np.linalg.solve(XtX,Xty)
print(beta)

# Make predictions for the last 10 rows in the data set
for data,actual in zip(X0[d:],y0[d:]):
    x = np.array([data])
    prediction = np.dot(x,beta)
    print('prediction = '+str(prediction[0,0])+' actual = '+str(actual))
The original data set consists of over 500 entries. To test the accuracy of the predictions made by the linear regression model we use all but the last 10 data entries to build the regression model and compute β. Once we have constructed the β vector we use it to make predictions for the last 10 input values and then compare the predicted home prices against the actual home prices from the data set.
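One simple way (not part of the original program) to summarize the accuracy of these predictions is the mean absolute error over the ten held-out rows; a minimal sketch, reusing X0, y0, d, and beta from the program above:

# Mean absolute error over the held-out data points
errors = []
for data, actual in zip(X0[d:], y0[d:]):
    prediction = np.dot(np.array([data]), beta)[0,0]
    errors.append(abs(prediction - actual))
print('mean absolute error = ' + str(sum(errors)/len(errors)))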
Here are the outputs produced by the program:
[[ -4.14106096e+03]
 [  3.55197583e+00]
 [  1.66328263e+03]
 [  1.45465644e+04]
 [  6.77755381e+03]
 [  6.58750520e+03]
 [  4.44683380e+03]
 [  5.60834856e+03]
 [  1.27979572e+04]
 [  1.24091640e+04]
 [  4.19931185e+03]
 [  9.42215457e+03]]
prediction = 97360.6550969 actual = 82500.0
prediction = 71774.1659014 actual = 83000.0
prediction = 92359.0891976 actual = 84000.0
prediction = 77748.2742379 actual = 85000.0
prediction = 91015.5903066 actual = 85000.0
prediction = 97545.1179047 actual = 91500.0
prediction = 97360.6550969 actual = 94000.0
prediction = 106006.800756 actual = 103000.0
prediction = 92451.6931269 actual = 105000.0
prediction = 73458.2949381 actual = 105000.0
Overall, the predictions are not spectacularly good, but several of them fall reasonably close to the actual prices. Making better predictions from this data will be the subject of the winter term tutorial on machine learning.