Project: Tag Cloud Generator

Objectives

Familiarity with designing and coding a realistic component-based application program without being provided a skeleton solution.
Familiarity with using "collection" components (e.g., Map and SortingMachine).
Familiarity with using file I/O components (e.g., SimpleReader and SimpleWriter).

Note that your solution may use only components from the components package and those components from the standard Java libraries that have been used in CSE 2221/2231 lectures, labs, or projects (e.g., String). Do not use components from any other library.

The Problem

Write a Java program that generates a tag cloud from a given input text. The solution to this problem should have a lot in common with your solution for the word counter project you did at the start of the course. Here are some initial requirements for the new problem:

The program shall ask the user for the name of an input file, for the name of an output file, and for the number of words to be included in the generated tag cloud (a positive integer, say N).
The program shall treat each file name the user enters as the complete relative or absolute path of the input file or the output file, and shall not augment the given path in any way; e.g., it shall not supply its own filename extension. For example, a reasonable user response for the name of the input file could directly result in the String value "data/importance.txt"; similarly, a reasonable user response for the name of the output file could directly result in the String value "data/importance.html".
In contrast with one or more past projects, the program shall check for invalid input; however, the program may (and probably should) rely on the SimpleReader and SimpleWriter family components to raise an error in response to conditions such as non-existent files or paths.
The input file can be an arbitrary text file. No special requirements are imposed.
The output shall be a single well-formed HTML file displaying the name of the input file in a heading followed by a tag cloud of the N words with the highest count in the input. The words shall appear in alphabetical order, in which, e.g., "bar" comes before "Foo". This is not the lexicographic order provided by the String compareTo method, which puts capitalized words ahead of lower case ones (e.g., "Foo" would come before "bar"). The font size of each word in the tag cloud shall be proportional to the number of occurrences of the word in the input text (i.e., more frequent words will be displayed in a larger font than less frequent ones).
Words contain no whitespace characters. Beyond that, it is up to you to come up with a reasonable definition of what a word is and what characters (in addition to whitespace characters) are considered separators. (For the sample inputs provided, the characters in the string " \t\n\r,-.!?[]';:/()" do a decent job of separating words.)
You must use the SimpleReader and SimpleWriter family components for all the input and output needed.
You must use the Map family components to keep track of the words and their counts.
You must use the SortingMachine family components for all the sorting needed.
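
One possible reading of the separator-based definition of a word is sketched below. It is an illustration under stated assumptions, not a required design: the class name WordSplitSketch and the method words are hypothetical, the separator string is the one suggested above for the sample inputs, and java.util.List stands in for whichever components-package collection your solution actually uses.

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitSketch {

    /** Separator characters suggested in the requirements for the sample inputs. */
    static final String SEPARATORS = " \t\n\r,-.!?[]';:/()";

    /** Returns the maximal runs of non-separator characters in line, in order. */
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < line.length()) {
            // Skip over any separators before the next word.
            while (start < line.length()
                    && SEPARATORS.indexOf(line.charAt(start)) >= 0) {
                start++;
            }
            // Advance to the end of the word (next separator or end of line).
            int end = start;
            while (end < line.length()
                    && SEPARATORS.indexOf(line.charAt(end)) < 0) {
                end++;
            }
            if (end > start) {
                result.add(line.substring(start, end));
            }
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "[It, s, a, well, known, fact, really]"
        System.out.println(words("It's a well-known fact (really)!"));
    }
}
```

Note that with this separator set the apostrophe and the hyphen split "It's" and "well-known" into separate pieces, which is why strings such as "s" and "t" can show up as words.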

These are the stated requirements for your program. If you have questions or need additional details, ask in class.

Setup

You're on your own! However, a good starting point may be your solution to the first project of CSE 2231 (the word counter one). In any case, one member of the team should set up an Eclipse project for this assignment. The project should then be shared with the rest of the team by using the Subversion version control system as learned in the Version Control With Subversion lab.

Method

When you and your teammate(s) are done with the project, decide who is going to submit your solution. That team member should select your Eclipse project (not just some of the files, but the whole project) containing the complete group submission, create a zip archive of it, and submit the zip archive to the Carmen dropbox for this project, as described in Submitting a Project. Note that only one submission per group will be graded: your group can submit as many times as you want, but only the last submission will be kept on Carmen and graded. Under no circumstance will teammates be allowed to submit separate solutions. Make sure that you and your partner(s) agree on what should be submitted.

Your grade will depend not merely on whether the final program meets the initial requirements, of course, but also on the general software quality factors you've learned in CSE 2221/2231: understandability, precision, appropriate use of existing software components, maintainability, adherence to coding standards, efficiency, and so forth.

Some sample input files are available in this folder (they end with .txt). They are books downloaded from www.gutenberg.org. Here is a sample output tag cloud generated from this input file (with N = 100 and converting all words to lower case). The output generated by your program should follow the format of this example, including both <link> tags referring to the tagcloud.css stylesheet file, which make the tag cloud look as it does in the example. The browser needs this simple cascading style sheet (CSS) to display the tag cloud in the sample file correctly. The CSS also defines font sizes f11 through f48 that you can use to control the size of the font for each word.

The example has two <link> tags to make it more robust under different conditions: if one of them fails, the browser falls back on the other. The first-listed tag names a CSS provided on the CSE web site. The second one expects the CSS to be a file in the same folder as the rendered HTML file. If that local file is not present, the browser will use the one from the CSE web site. If the browser is running with no internet connection or the CSE web site is down, it will use the file in the same folder. Of course, if neither copy is available, the browser will render the HTML file without the CSS. If both are available, all is well: the browser will use one of them, and because the two files are identical, it does not matter which.
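
One way to turn a word's count into one of the font sizes f11 through f48 is a linear scale between the smallest and largest counts among the N selected words. The sketch below assumes exactly that; the class and method names are hypothetical, and the handling of the case where every count is the same is an arbitrary choice. How the resulting size is written into the HTML should follow the sample output file, which is not reproduced here.

```java
final class FontSizeSketch {

    /** Smallest and largest font sizes defined by the stylesheet (f11..f48). */
    static final int MIN_FONT = 11;
    static final int MAX_FONT = 48;

    /**
     * Maps count linearly from [minCount, maxCount] onto [MIN_FONT, MAX_FONT].
     * Assumes minCount <= count <= maxCount.
     */
    static int fontSize(int count, int minCount, int maxCount) {
        if (maxCount == minCount) {
            // Assumption: when every word has the same count, any fixed size works.
            return MIN_FONT;
        }
        // Use long arithmetic so very large counts cannot overflow.
        long scaled = (long) (count - minCount) * (MAX_FONT - MIN_FONT)
                / (maxCount - minCount);
        return MIN_FONT + (int) scaled;
    }

    public static void main(String[] args) {
        System.out.println("f" + fontSize(60, 10, 110)); // prints "f29"
    }
}
```

With this scale the least frequent of the N words gets f11 and the most frequent gets f48; integer division rounds the sizes in between down.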

Sorting Words and Counts in Different Ways

Your solution will have to be able to sort the words first in decreasing order of count (to find the N most frequent words), and then in alphabetical order to output the tag cloud. To sort them with a SortingMachine while keeping the words and their counts together, use the Map.Pair<String, Integer> values that come out of the Map you used to count words: put them in a SortingMachine that sorts in decreasing order of count, then remove the top N in order from that first SortingMachine and put them into a second SortingMachine that sorts in alphabetical order of word. All that is needed to get this to work is two nested classes implementing the Comparator<Map.Pair<String, Integer>> interface: one that compares the values (counts) in the pairs and the other that compares the keys (words) in the pairs. (You have seen implementations of the Comparator interface before in several lectures and labs, e.g., SortingMachine, Queue Insertion Sort, and Queue Quicksort.)
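
The two-stage sort described above can be sketched as follows. To keep the example runnable without the components package, it uses stand-ins that are NOT allowed in the actual project: java.util.Map.Entry plays the role of Map.Pair, and java.util.PriorityQueue plays the role of SortingMachine (add corresponds to insertion mode, poll to removing the first entry in extraction mode). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TwoStageSortSketch {

    /** Orders pairs by decreasing count (the value of the pair). */
    static final class CountDescending
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            return b.getValue().compareTo(a.getValue());
        }
    }

    /** Orders pairs alphabetically by word (the key), ignoring case. */
    static final class WordAlphabetical
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            return a.getKey().compareToIgnoreCase(b.getKey());
        }
    }

    /** Returns the n most frequent pairs, in alphabetical order of word. */
    static List<Map.Entry<String, Integer>> topNAlphabetical(
            Map<String, Integer> counts, int n) {
        // Stage 1: sort every pair by decreasing count.
        PriorityQueue<Map.Entry<String, Integer>> byCount =
                new PriorityQueue<>(new CountDescending());
        byCount.addAll(counts.entrySet());
        // Stage 2: move the top n pairs into an alphabetical sorter.
        PriorityQueue<Map.Entry<String, Integer>> byWord =
                new PriorityQueue<>(new WordAlphabetical());
        for (int i = 0; i < n && !byCount.isEmpty(); i++) {
            byWord.add(byCount.poll());
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        while (!byWord.isEmpty()) {
            result.add(byWord.poll());
        }
        return result;
    }
}
```

If several words tie for the N-th highest count, which of them make the cut depends on the order in which the first sorter returns equal elements; this sketch makes no attempt to control that.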

Note that the documentation for the Comparator interface includes a warning about using a comparator capable of imposing an ordering that is not consistent with equals in certain situations. A comparator c is consistent with equals if c.compare(e1, e2) == 0 has the same boolean value as e1.equals(e2) for all e1 and e2. For this project, it is not going to be a problem if your comparators are not consistent with equals because all the OSU components are designed to work correctly as long as the comparator provides a valid total preorder (possibly not consistent with equals). However, in the next project, once you start using the Java standard components, this could become an issue. If you want to avoid any future problems, you may want to ensure that your two implementations of the Comparator interface are indeed consistent with equals.
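
One way to make a comparator consistent with equals is to break ties on everything that equals looks at. The sketch below does this for the alphabetical ordering: it compares words ignoring case, then by exact spelling, then by count, so compare returns 0 only when both the word and the count are equal. As before, java.util.Map.Entry is a stand-in for Map.Pair and the class names are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Comparator;
import java.util.Map;

final class ConsistentOrderSketch {

    /** Alphabetical order of word, with tie-breakers so that 0 means equal. */
    static final class WordThenCount
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            // Primary order: alphabetical, ignoring case.
            int result = a.getKey().compareToIgnoreCase(b.getKey());
            if (result == 0) {
                // Tie-breaker 1: exact spelling ("Foo" vs. "foo").
                result = a.getKey().compareTo(b.getKey());
            }
            if (result == 0) {
                // Tie-breaker 2: the count.
                result = a.getValue().compareTo(b.getValue());
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Comparator<Map.Entry<String, Integer>> order = new WordThenCount();
        Map.Entry<String, Integer> upper = new SimpleImmutableEntry<>("Foo", 3);
        Map.Entry<String, Integer> lower = new SimpleImmutableEntry<>("foo", 3);
        // prints "false": the spellings differ, so the pairs are not equal
        System.out.println(order.compare(upper, lower) == 0);
        // prints "true": same word and same count
        System.out.println(
                order.compare(upper, new SimpleImmutableEntry<>("Foo", 3)) == 0);
    }
}
```

The same idea applies to the count comparator: compare counts first, then fall back on the word.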

Additional Activities

Here are some possible additional activities related to this project. Any extra work is strictly optional, for your own benefit, and will not directly affect your grade.

Modify the program so that common words (such as "a", "the", and "and") and strings that are not words (such as "t" and "s") are not included in the tag cloud.
Modify the program so that it is case-insensitive, i.e., the words "hello" and "HeLLo" would be counted as the same word.
Modify the case-insensitive program so that the capitalization displayed in the output is the one that occurs most often among the different capitalizations of the same word.
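
The last two activities can be sketched as follows, assuming the words have already been extracted from the input. This is only one possible approach: the class and method names are hypothetical, java.util.Map and java.util.HashMap stand in for the Map family components, and ties between equally common capitalizations are resolved in favor of the one that reaches the highest count first.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CaseInsensitiveCountSketch {

    /** Counts words ignoring case; the keys of the result are lower case. */
    static Map<String, Integer> countIgnoringCase(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            if (counts.containsKey(key)) {
                counts.put(key, counts.get(key) + 1);
            } else {
                counts.put(key, 1);
            }
        }
        return counts;
    }

    /** Returns the capitalization of lowerKey that occurs most often in words. */
    static String mostCommonSpelling(Iterable<String> words, String lowerKey) {
        Map<String, Integer> spellings = new HashMap<>();
        String best = lowerKey;
        int bestCount = 0;
        for (String word : words) {
            if (word.toLowerCase(Locale.ROOT).equals(lowerKey)) {
                int count = spellings.getOrDefault(word, 0) + 1;
                spellings.put(word, count);
                if (count > bestCount) {
                    best = word;
                    bestCount = count;
                }
            }
        }
        return best;
    }
}
```

For the words "hello", "HeLLo", "Hello", "Hello", the first method counts "hello" four times and the second returns "Hello".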