List Difference

A series of examples

In the first third of CMSC 250 we are going to focus on object-oriented programming. In this portion of the course I am going to be showing a series of examples that demonstrate how to use some of the collection classes in the Java class library, how to create our own classes, and how to construct more complex programs composed of multiple interacting classes.

Working with Java collection classes

The Java class library contains a number of classes designed to store collections of objects. In the first few lectures in this series I will introduce you to some of the more commonly used collection classes, and I will explain how to select the best collection class for a given application.

This first lecture will introduce you to the ArrayList and TreeSet classes.

A problem

All of the lectures in this first portion of the center around problems. In each lecture I will introduce the problem to be solved and then develop a solution to that problem. Each lecture will also have an assignment at the end that will ask you to solve a closely related problem.

Here is the problem we will tackle in the first lecture. In this problem we have two text files that store lists of positive integers. Each text file contains a list of random integers with no repeats. Our problem is to construct a third file that contains all of the numbers that appear in exactly one of the two lists. In other words, we want to construct the union of the two lists but then take away any numbers that are in the intersection of the two lists.

Generating the lists

The first step in solving this problem is to write a program that can generate the two lists for us. Specifically, we want to be able to generate a list of N random integers that fall in the range from 0 to N-1. Furthermore, we want to have no repeats in the list. (Since we will not put repeats in our list the final list we generate will have fewer than N numbers in it.)

The program to generate the lists

Here is the source code for program that generates these lists for us. I will discuss important parts of the code below.

package difference;

import java.io.File;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Random;

public class MakeFiles {
    /** Generate N random integers in the range from 0 to N
     *  and store them in the file with name filename.
     * 
     *  Skip any duplicates when saving - this will result
     *  in less than N numbers being saved in the file.
     */
    public static void makeNumbers(String filename,int N) {
        Random random = new Random();
        ArrayList<Integer> numbers = new ArrayList<Integer>();
        // Generate N random integers and add them to numbers
        // only if they don't already appear in that list.
        for(int n = 0;n < N;n++) {
            int nextInt = random.nextInt(N);
            if(!numbers.contains(nextInt))
                numbers.add(nextInt);
        }
        // Save the numbers to the file.
        PrintWriter output = null;
        try {
            output = new PrintWriter(new File(filename));
        } catch(Exception ex) {
            ex.printStackTrace();
            return;
        }
        for(int n : numbers) {
            output.println(n);
        }
        output.close();
    }
    
    public static void main(String[] args) {
        final int N = 200000;
        makeNumbers("one.txt",N);
        makeNumbers("two.txt",N);
        System.out.println("Done");
    }
}

Generating random integers

The first technical problem we need to solve in building this program is generating a sequence of random integers. As with many problems in Java programming, the Java class library already has a class that can help with this, the Random class in the java.util package. (Full documentation on this class is available here.)

To use the Random class we start by making an object of type Random.

Random random = new Random();

The Random class offers a number of methods for generating random numbers. To generate random integers, we use one of the nextInt() methods. The first variant of nextInt() takes no parameters and returns a random integer, which may be either positive or negative. This method also places no bound on the size of the integer returned. The second variant takes a positive integer N as its parameter and returns a random positive integer in the range from 0 to N-1. That is the version we will use for our solution.

int nextInt = random.nextInt(N);

Avoiding duplicates

An important requirement for this problem is that the list of integers we generate must not contain any duplicates. To enforce this requirement I will use the following strategy:

Create an empty list to store integers.
Generate N random integers.
For each integer that we generate, first check to see if that integer already appears in the list.
If it does not, add it to the list.
When we have generated all N random integers we will write the contents of the list out to a file.

Working with the ArrayList class

The Java class library contains a number of collection classes designed to store collections of objects. The most basic and most widely used collection class is the ArrayList class, which stores data items in an internal array. The ArrayList class, like all collection classes, is a parameterized class, which means that when we create the ArrayList we have to specify what type of object we are planning to store in the ArrayList. In our example we will do this:

ArrayList<Integer> numbers = new ArrayList<Integer>();

An important restriction we have to respect here is that collection classes can only store objects. This is a problem for us in this application, because we want to store a list of integers. Normally we use the int data type in Java to store integers. Unfortunately, the int data type is an example of a primitive data type, and is not an object type. Fortunately, the Java class library offers a set of special classes that can serve as containers for primitive data types. One of those classes is the Integer class, which is an object that serves as a simple container for a single int. To store our list of integers we will first put each int inside an Integer object, and then we will store that Integer object in our ArrayList.

The ArrayList class offers a number of useful methods that make the ArrayList class a better alternative to simply using an array. In this example we will make use of two of the most important methods of the ArrayList class, the add() method and the contains() method.

Here is the part of our program where I use these two methods.

for(int n = 0;n < N;n++) {
    int nextInt = random.nextInt(N);
    if(!numbers.contains(nextInt))
        numbers.add(nextInt);
}

Our job here is to generate a list of N random integers. The loop here takes care of that. For each random integer we generate we first want to make sure that that number is not a duplicate. To check for that, we use the ArrayList's contains() method to make sure that the integer we generated is not already in the list. If it is not, we use go ahead and use the add() method to add it to the list.

An important technical issue to point our here has to do with the way we are calling the contains() and add() methods. In both cases we are passing the variable nextInt to those methods. This looks like a type mismatch, because nextInt is of type int, while both of these methods actually expect a parameter of type Integer. Java solves this problem for us automatically by using a technique called auto-boxing. If we pass an int to a method that expects an Integer parameter Java will automatically put the int in an Integer and pass that Integer object to the method.

Working with a PrintWriter

Once we have constructed our list of integers we will want to save that list to a text file. The simplest way to write data to a text file in Java is to use the PrintWriter class. This class provides a println() method that we can use to print a line of the data to the text file.

Here is the relevant portion of the program where we set up the PrintWriter and write the integers to the file.

PrintWriter output = null;
try {
    output = new PrintWriter(new File(filename));
} catch(Exception ex) {
    ex.printStackTrace();
    return;
}
for(int n : numbers) {
    output.println(n);
}
output.close();

To create a PrintWriter that writes to a file we use the version of the PrintWriter's constructor that takes a File object as its parameter. We initialize the File object with the name of the file we want to write to.

output = new PrintWriter(new File(filename));

A technical issue we have to contend with here is the fact that the File constructor may throw an exception. This means we have to do something to handle that potential exception. The best way to handle a potential exception is to put the code that may generate the exception into a try/catch block:

try {
    output = new PrintWriter(new File(filename));
} catch(Exception ex) {
    ex.printStackTrace();
    return;
}

If an exception does take place somewhere in the try block we can catch that exception in the catch block. Once we have caught the exception we can ask it to print the details of the exception to the standard output by using the printStackTrace() method. Also, once we learn that our attempt to open the file has failed we pretty much have to give up on writing our numbers out to the file. The simplest way to give up is to simply return from the function we are in, since there is now nothing left for us to do.

Once we set up the PrintWriter it is time to write all of our integers out to the file. Here is the loop I use to do that:

for(int n : numbers) {
    output.println(n);
}

Here I am using a special variant of the Java for loop, called the enhanced for loop. This type of for loop is used quite often whenever we need to iterate over all of the items in a collection. In this example our collection numbers is an ArrayList of Integers. The for loop here iterates over that collection, placing each number in the list in turn in an int variable n. (Note that once again we have a type mismatch here: the ArrayList stores Integer objects, yet we are placing the numbers in an int variable. Once again Java solves this problem for us via auto-unboxing: for each Integer in the list Java will copy the int out of the Integer and store that int in our variable n.)

Once we have written the numbers out to the file we have to take care of one list detail. We need to call the PrintWriter's close() method:

output.close();

This last step is essential because the PrintWriter uses an optimization called a caching strategy. When we ask the PrintWriter to print some text to the file it will not write the text out right away. Instead, it stores text in an internal text cache. When that cache fills up it will write the entire cache out to the file all at once, which is more efficient than writing one line of text at a time to the file. The downside of this optimization is that we don't call the close() method we may end up with some or all of our text still stuck in the cache. Calling close() flushes any remaining text out of the cache and into the file.

Computing the difference list

Once we have our two files of integers we are ready to tackle the main problem, which is the problem of finding all of the numbers that appear in exactly one of the two lists. Since this problem amounts to finding the union of the two lists and then removing all of the numbers that appear in both lists we can use the following strategy to solve our problem:

Put all of the numbers from the first list in an ArrayList.
For each number in the second list we check to see if is already in the ArrayList. If it is, that number is in the intersection and must be removed from the ArrayList. If it is not, we add it to the ArrayList.
After processing each number in the second list we write the contents of the ArrayList out to an output file.

Here now is the code for program that carries out this strategy:

public class Difference {

    /** Program to find the numbers that appear in exactly one of the two
     *  files one.txt and two.txt.
     */
    public static void main(String[] args) {
        /** Read the numbers in the first file and put them in a list. **/
        Scanner input = null;
        try {
            input = new Scanner(new File("one.txt"));
        } catch(Exception ex) {
            ex.printStackTrace();
            return;
        }
        ArrayList<Integer> numbers = new ArrayList<Integer>();
        while(input.hasNextInt()) {
            numbers.add(input.nextInt());
        }
        input.close();
        
        /** Read and process the numbers from the second list. **/
        try {
            input = new Scanner(new File("two.txt"));
        } catch(Exception ex) {
            ex.printStackTrace();
            return;
        }
        // Here is how we will find the numbers we want. For each number 
        // that we read from the second file, we first check to see if 
        // it has already appeared in the first file (and hence is in numbers).
        // If is already in numbers, remove it from numbers, since we don't 
        // want to retain any numbers that appear in both files. If it is not
        // in numbers, add it to numbers, since now we know that the number
        // appears only in file two.
        while(input.hasNextInt()) {
            Integer next = input.nextInt();
            if(numbers.contains(next))
                numbers.remove(next);
            else
                numbers.add(next);
        }
        input.close();
        
        /** Write the results out to a file. **/
        PrintWriter output = null;
        try {
            output = new PrintWriter(new File("result.txt"));
        } catch(Exception ex) {
            ex.printStackTrace();
            return;
        }
        for(int n : numbers) {
            output.println(n);
        }
        output.close();
    }
    
}

A speed problem

The program above solves the problem at hand, but it has one very significant problem. If we increase the size of the two lists, this solution slows down pretty significantly. The source of the trouble lies in this part of the code:

while(input.hasNextInt()) {
    Integer next = input.nextInt();
    if(numbers.contains(next))
        numbers.remove(next);
    else
        numbers.add(next);
}

This loop iterates over all of the numbers in the second list. For each number in the second list we first have to check whether or not it appears in the list already by calling the contains() method. Because the numbers in the list are in no particular order the only way that contains() can determine that a given number is not in the list is to search through all of the items in the list. If the number is in the list, we will typically have to search about half way through the list to locate it. Given all of this, if both lists have ≤ k N items in them the total work done by all of the calls to contains() here is bounded above by k² N². This quadratic scaling behavior is bad news: it means that if we double N the algorithm will take four times as long to process the data, and if we increase N by a factor of ten the algorithm will 100 times as long to process the data.

The code to generate the initial random lists suffers from essentially the same problem.

A fix for the speed problem

Fortuntately our speed problem is really easy to address. All we have to do is to replace the ArrayList that we have been using to store our numbers with a container class that offers a more efficient contains() method. The best class for this purpose is the TreeSet class. A TreeSet stores its data items internally in a specialized structure called a binary search tree. Here is an example of what a binary search tree looks like.

In a tree structure the data is arranged in a collection of nodes. In a binary tree each node can have up to two child nodes that are connected to it. Furthermore, a binary search tree imposes a special ordering property on its children. If a node contains an integer k then its left child has to contain an integer that is less than k and its right child has to contain an integer greater than k. The ordering rule also affects how we add new items to the tree. To add a new item to the tree we search downward from the top (or root) node until we find an empty space for the number. At each node we come to we compare the number we are trying to insert against the number in the current node. If the incoming number is less than the node's number we continue the search in the left child, and if it is greater than the node's number we continue the search in the right child. If we encounter a node that already has the number we are trying to insert we simply stop. If we fall off the end of the tree we simply add a new node at the spot where we fell off.

The algorithm for searching for a number in the tree is similar. Again, we start at the root and use the same procedure we used to insert a new number. If we come to a node that has the number we are searching for we stop and declare success. If we instead fall of the end of the tree we stop and declare failure.

The key advantage of this data structure is that both the algorithm for adding new items and the algorithm for searching take time proportional to log₂ N for a tree containing N items.

Replacing the ArrayList in our two programs above with a TreeSet is absurdly simple. All we have to do is to replace the statement

ArrayList<Integer> numbers = new ArrayList<Integer>();

with

TreeSet<Integer> numbers = new TreeSet<Integer>();

If we look at the bit of code that was giving us problems before

while(input.hasNextInt()) {
    Integer next = input.nextInt();
    if(numbers.contains(next))
        numbers.remove(next);
    else
        numbers.add(next);
}

We will see that we have a while loop that has to process k N numbers. For each of those numbers we have to call contains() on a TreeSet that also contains about k N numbers. The total work done by all of those calls to contains() is now k N log₂( k N), which is proportional to N log₂ N and not the N² we had before.

Why was that so easy?

The fact that we only had to change one line of code to get dramatically better performance from our program is not an accident. What we are seeing here is the result of a carefully thought out design strategy the designers of the Java class library pursued.

Here are the outlines of that design strategy:

We start by constructing an interface that lists all of the methods that we want our collection classes to share.
We then make sure that all of our collection classes implement that same interface.
This will guarantee that all of our classes share a common set of methods, such as add(), contains(), and remove().
When we write code that works with a particular collection class we try to limit ourselves to only using methods from the common interface.
When we decide to switch to a different collection that implements that same interface we won't have to modify our code beyond the statement that created the collection.

You can see how this strategy works in practice by looking at the documentation for the java.util.Collection interface, which is available here. Both the ArrayList and TreeSet classes implement that interface.

Suggested reading