Files

Reading data from a file

The example program below demonstrates the basics of working with data that comes from a text file. The program is designed to work with a text file that contains a list of temperature readings, one reading per line. The program will read the readings from the text file, put them in a list, and then determine and print the lowest and highest temperature readings found in the file.

temps = []
f = open('temps.txt')
for line in f.readlines():
    temps.append(float(line))
f.close()

lowest = temps[0]
highest = temps[0]

for t in temps:
    if t < lowest:
        lowest = t
    if t > highest:
        highest = t

print('Lowest temp = '+str(lowest))
print('Highest temp = '+str(highest))

Here are some things to make note of in the program.

The first step in working with a file is to open the file. The open() function opens a file for reading and returns a file object. The parameter to open() specifies the name of the file to open.
f.readlines() returns a list containing every line in the text file. We set up a for loop to iterate over this list of lines.
Each line in the text file is a string. We have to convert that string to a float before appending it to the list of temps.
When we are done reading the data from the file we use the close() method to close the file.
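The conversion step can be tried out without a file. The following is a minimal sketch, assuming a hypothetical list named sample_lines that stands in for what f.readlines() would return. It shows that each line still carries its trailing newline and that float() ignores that whitespace. The built-in min() and max() functions are used here as a shorter alternative to the explicit search loop in the program; they are not part of the original example.

```python
# Hypothetical stand-in for the list that f.readlines() returns;
# note that each line keeps its trailing newline character.
sample_lines = ['72.5\n', '68.0\n', '75.3\n']

# float() ignores leading and trailing whitespace, including '\n'.
temps = [float(line) for line in sample_lines]

# Built-in alternatives to the explicit search loop.
lowest = min(temps)
highest = max(temps)

print('Lowest temp = ' + str(lowest))
print('Highest temp = ' + str(highest))
```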

Since we always have to take care to close a file after we are done working with it, it may be helpful to use an alternative construction to manage opening and closing the file. The Python with construct is useful for this purpose.

In place of

f = open('temps.txt')
for line in f.readlines():
    temps.append(float(line))
f.close()

we can do

with open('temps.txt') as f:
    for line in f.readlines():
        temps.append(float(line))

Once we exit the body of the with construct, the file will get closed for us automatically. In addition, should the program raise an error anywhere in the body, the program will automatically exit the body of the with and close the file for us before the error propagates any further.
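To see the automatic close in action, here is a small sketch under stated assumptions: the file name temps_demo.txt and its contents are made up for the demonstration, and the second line is deliberately not a number so that float() fails partway through the loop.

```python
import os
import tempfile

# Create a throwaway data file (hypothetical contents) with one bad line.
path = os.path.join(tempfile.mkdtemp(), 'temps_demo.txt')
with open(path, 'w') as out:
    out.write('71.2\nnot a number\n65.0\n')

temps = []
try:
    with open(path) as f:
        for line in f.readlines():
            temps.append(float(line))  # raises ValueError on the second line
except ValueError as err:
    print('Conversion failed:', err)

# The with construct closed the file even though the loop ended early.
print('File closed?', f.closed)
print('Readings read before the error:', temps)
```

The closed attribute of the file object reports True here, and only the first reading made it into the list.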

Writing to a file

The next example is a short program that I used to generate some random data for the temps.txt data file.

import random

f = open('temps.txt','w')

for n in range(0,50):
    f.write('{:2.1f}\n'.format(random.random()*100))

f.close()

Here are some things to note in this program.

We make use of the random module to generate the random temperature readings.
As in the previous example, we use the open() function to open a file. The optional second parameter to the open() function is a file mode specifier. Since we are opening this file for writing we use the 'w' mode specifier.
To write text to the file we use the write() method. The parameter we pass to write() is a string of text that we want to have written to the file. We have to take care to make sure that the string ends in the newline character, \n, so that the text gets a line break at the end.
We use the random() function from the random module to generate a random float that is at least 0.0 and strictly less than 1.0. We multiply this random number by 100 to scale it up to the range from 0.0 up to, but not including, 100.0.
As in the previous example, we use the close() method to close the file when we are done writing to the file.
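The with construct from the previous section works for writing as well. The sketch below is one possible variation on the program above, not a replacement for it. It writes to a throwaway file (the name temps_demo.txt is an assumption, chosen so an existing temps.txt is left alone) and then reads the file back to confirm what was written.

```python
import os
import random
import tempfile

# Write to a throwaway file so that an existing temps.txt is not overwritten.
path = os.path.join(tempfile.mkdtemp(), 'temps_demo.txt')

with open(path, 'w') as f:
    for n in range(50):
        # '{:2.1f}' rounds each value to one digit after the decimal point.
        f.write('{:2.1f}\n'.format(random.random() * 100))

# Read the file back to confirm what was written.
with open(path) as f:
    readings = [float(line) for line in f.readlines()]

print(len(readings), 'readings written')
```

Because the format specifier rounds to one decimal place, a value very close to 100 can be written as 100.0 even though random() itself never returns 1.0.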

A generic data reading function

In the next few examples we are going to be reading data from text files. In every case the data will be arranged as a data series with a list of data items on each line of the file. The following Python function will serve as a generic data reading function to load the raw data from the text file. The function reads the individual lines of the input file as text strings and then uses the string split() method to split each line into a list of strings for the individual data items.

def readData(fileName):
    """Generic data reading function:
       reads lines in a text file and
       splits them into lists."""
    data = []
    with open(fileName) as f:
        for line in f.readlines():
            data.append(lineToData(line.split()))
    return data

The next step will typically be to convert the strings in our data lists into a data format that is appropriate for our particular application. For example, in the next example program we are going to replicate the linear regression example I showed a few lectures back. We will be working with an input file, farm.txt, in which each line contains a year followed by a farm population figure, with whitespace separating the two items.

The first entry in the data list returned by the call to split() in readData will look like

["1935","32.1"]

I would like to convert that pair of strings into a tuple containing a combination of an integer and a float. Here is a simple data cleaning function that can perform that transformation:

def lineToData(line):
    """Converts a raw line list into an appropriate data format."""
    return (int(line[0]), float(line[1]))

readData() will then use this lineToData function to put the data in the format that we need.

pairs = readData('farm.txt')
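The two functions can be tried together on a small sample. This is a sketch under an assumption: the file name farm_demo.txt and its two lines are made up for the demonstration, with the first line chosen to match the entry shown above.

```python
import os
import tempfile

def lineToData(line):
    """Converts a raw line list into an appropriate data format."""
    return (int(line[0]), float(line[1]))

def readData(fileName):
    """Reads lines in a text file, splits them, and converts each one."""
    data = []
    with open(fileName) as f:
        for line in f.readlines():
            data.append(lineToData(line.split()))
    return data

# Assumed sample contents: a year and a value separated by a space.
path = os.path.join(tempfile.mkdtemp(), 'farm_demo.txt')
with open(path, 'w') as f:
    f.write('1935 32.1\n1940 30.5\n')

print('1935 32.1\n'.split())   # ['1935', '32.1']
pairs = readData(path)
print(pairs)                   # [(1935, 32.1), (1940, 30.5)]
```

Note that split() with no arguments splits on any run of whitespace and discards the trailing newline, so no separate stripping step is needed.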

Linear regression program

Here now is the program that reads the farm population data and performs the regression analysis on the data. Note the function definitions that help us to perform key parts of the regression computation.
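For reference, the functions in the program correspond to the standard least-squares formulas for fitting the line y = alpha + beta x. They are restated here from the general method rather than quoted from the earlier lecture, with x-bar and y-bar standing for the two means returned by means():

```latex
\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
\alpha = \bar{y} - \beta\,\bar{x}
```

The numerator is what covariance() computes and the denominator is what xVariance() computes. Neither sum is divided by N, since that factor would cancel in the ratio.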

def lineToData(line):
    """Converts a raw line list into an
       appropriate data format."""
    return (int(line[0]), float(line[1]))

def readData(fileName):
    """Generic data reading function:
       reads lines in a text file and
       splits them into lists."""
    data = []
    with open(fileName) as f:
        for line in f.readlines():
            data.append(lineToData(line.split()))
    return data

def means(pairs):
    xSum = 0
    ySum = 0
    for x, y in pairs:
        xSum += x
        ySum += y
    N = len(pairs)
    return xSum / N, ySum / N


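def covariance(pairs, means):
    """Sum of (x - xMean) * (y - yMean); not divided by N."""
    total = 0  # avoids shadowing the built-in sum()
    for x, y in pairs:
        total += (x - means[0]) * (y - means[1])
    return total


def xVariance(pairs, xMean):
    """Sum of squared deviations of x; not divided by N."""
    total = 0
    for x, y in pairs:
        total += (x - xMean) * (x - xMean)
    return total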


def regressionCoeffs(pairs):
    """Computes linear regression coefficients (a,b)
       from a list of (x,y) pairs."""
    m = means(pairs)
    beta = covariance(pairs, m) / xVariance(pairs, m[0])
    alpha = m[1] - beta * m[0]
    return (alpha, beta)


pairs = readData('farm.txt')
a, b = regressionCoeffs(pairs)
for x, y in pairs:
    prediction = a + x * b
    print('Year: {:d} Prediction: {:5.2f} Actual: {:5.2f}'.format(x, prediction, y))

The output produced by this program is

Year: 1935 Prediction: 31.49 Actual: 32.10
Year: 1940 Prediction: 28.56 Actual: 30.50
Year: 1945 Prediction: 25.62 Actual: 24.40
Year: 1950 Prediction: 22.69 Actual: 23.00
Year: 1955 Prediction: 19.76 Actual: 19.10
Year: 1960 Prediction: 16.82 Actual: 15.60
Year: 1965 Prediction: 13.89 Actual: 12.40
Year: 1970 Prediction: 10.96 Actual:  9.70
Year: 1975 Prediction:  8.02 Actual:  8.90
Year: 1980 Prediction:  5.09 Actual:  7.20

The predictions stay within about two points of the actual values, which looks about right for a linear regression.
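One way to make that judgment more concrete is to look at the residuals, the differences between the actual and predicted values. The sketch below assumes that the Actual values printed in the output are the exact inputs from farm.txt, and it recomputes the coefficients inline so that it can run on its own.

```python
# (year, actual) pairs copied from the program output shown above;
# assumed to be the exact values stored in farm.txt.
pairs = [(1935, 32.1), (1940, 30.5), (1945, 24.4), (1950, 23.0), (1955, 19.1),
         (1960, 15.6), (1965, 12.4), (1970, 9.7), (1975, 8.9), (1980, 7.2)]

n = len(pairs)
x_mean = sum(x for x, y in pairs) / n
y_mean = sum(y for x, y in pairs) / n

# Same least-squares formulas that regressionCoeffs() uses.
beta = (sum((x - x_mean) * (y - y_mean) for x, y in pairs)
        / sum((x - x_mean) ** 2 for x, y in pairs))
alpha = y_mean - beta * x_mean

# Residual = actual value minus predicted value.
residuals = [y - (alpha + beta * x) for x, y in pairs]
largest = max(abs(r) for r in residuals)

print('slope = {:.4f}'.format(beta))
print('largest residual = {:.2f}'.format(largest))
```

The residuals of a least-squares fit sum to zero (up to rounding), so the size of the largest one is a more useful quick check than their total.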

Programming Exercise

Write a Python program that reads two lists of integers from files named 'one.txt' and 'two.txt' and then determines which numbers from the first file do not appear in the second file. Construct a list of these numbers and then write the list out to a third file named 'diff.txt'.

To submit your work for grading, compress your entire project folder into a ZIP archive and send me that archive as an attachment to an email message.