Project: Tag Cloud Generator
Objectives
- Familiarity with designing and coding a realistic component-based application program without being provided a skeleton solution.
- Familiarity with using "collection" components (e.g., Map and SortingMachine).
- Familiarity with using file I/O components (e.g., SimpleReader and SimpleWriter).
Note that in your solution you may use only components from the
components package and components from the standard Java libraries
that have been used in CSE 2221/2231 in lectures/labs/projects (e.g.,
String). Do not use components from any other libraries.
The Problem
Write a Java program that generates a tag cloud from a given input text. The solution to this problem should have a lot in common with your solution for the word counter project you did at the start of the course. Here are some initial requirements for the new problem:
- The program shall ask the user for the name of an input file, for the name of an output file, and for the number of words to be included in the generated tag cloud (a positive integer, say N).
- The program shall treat the user's input as the complete relative or absolute path of the input file (or of the output file) and shall not augment the given path in any way; e.g., it shall not supply its own filename extension. For example, a reasonable user response for the name of the input file could directly result in the String value "data/importance.txt"; similarly, a reasonable user response for the name of the output file could directly result in the String value "data/importance.html".
- In contrast with one or more past projects, the program shall check for invalid input; however, the program may (and probably should) rely on the SimpleReader and SimpleWriter family components to raise an error in response to conditions such as non-existent files or paths.
- The input file can be an arbitrary text file. No special requirements are imposed.
- The output shall be a single well-formed HTML file displaying the name of the input file in a heading followed by a tag cloud of the N words with the highest count in the input. The words shall appear in alphabetical order (in which, e.g., "bar" comes before "Foo", not the lexicographic order provided by the String compareTo method, which would put capitalized words ahead of lower case ones, e.g., "Foo" would come before "bar"). The font size of each word in the tag cloud shall be proportional to the number of occurrences of the word in the input text (i.e., more frequent words will be displayed in a larger font than less frequent ones).
- Words contain no whitespace characters. Beyond that, it is up to you to come up with a reasonable definition of what a word is and what characters (in addition to whitespace characters) are considered separators. (For the sample inputs provided, the characters in the string " \t\n\r,-.!?[]';:/()" do a decent job of separating words.)
- You must use the SimpleReader and SimpleWriter family components for all the input and output needed.
- You must use the Map family components to keep track of the words and their counts.
- You must use the SortingMachine family components for all the sorting needed.
These are the stated requirements for your program. If you have questions or need additional details, ask in class.
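As an illustration of the separator requirement above, here is a minimal sketch of splitting one line of text into words. The class name WordSplitter and its method names are hypothetical, and the sketch uses java.util collections only so that it compiles on its own; in the project itself you would use the corresponding components from the components package instead.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of any provided library). */
final class WordSplitter {

    /** Separator characters suggested above for the sample inputs. */
    static final String SEPARATORS = " \t\n\r,-.!?[]';:/()";

    private WordSplitter() {
    }

    /** Builds a set containing every character of the given string. */
    static Set<Character> separatorSet(String separators) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < separators.length(); i++) {
            set.add(separators.charAt(i));
        }
        return set;
    }

    /** Returns the maximal runs of non-separator characters in line, in order. */
    static List<String> words(String line, Set<Character> separators) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < line.length()) {
            // Skip a run of separator characters.
            while (start < line.length()
                    && separators.contains(line.charAt(start))) {
                start++;
            }
            // Scan to the end of the word that begins at start (if any).
            int end = start;
            while (end < line.length()
                    && !separators.contains(line.charAt(end))) {
                end++;
            }
            if (end > start) {
                result.add(line.substring(start, end));
            }
            start = end;
        }
        return result;
    }
}
```

With this separator set, the apostrophe splits "It's" into "It" and "s", which is one reason the optional activities at the end suggest filtering out strings such as "s" and "t".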
Setup
You're on your own! However, a good starting point may be your solution to the first project of CSE 2231 (the word counter one). In any case, one member of the team should set up an Eclipse project for this assignment. The project should then be shared with the rest of the team by using the Subversion version control system as learned in the Version Control With Subversion lab.
Method
When you and your teammate(s) are done with the project, decide who is going to submit your solution. That team member should select your Eclipse project (not just some of the files, but the whole project) containing the complete group submission, create a zip archive of it, and submit the zip archive to the Carmen dropbox for this project, as described in Submitting a Project. Note that only one submission per group is kept: your group can submit as many times as you want, but only the last submission will remain on Carmen and be graded. Under no circumstance will teammates be allowed to submit separate solutions. Make sure that you and your partner(s) agree on what should be submitted.
Your grade will depend not merely on whether the final program meets the initial requirements, of course, but also on the general software quality factors you've learned in CSE 2221/2231: understandability, precision, appropriate use of existing software components, maintainability, adherence to coding standards, efficiency, and so forth.
Some sample input files are available in this folder (they end
with .txt). They are books downloaded from
www.gutenberg.org. Here is a sample output
tag cloud generated from this input
file (with N = 100 and converting all words to
lower case). The output generated by your program should follow the
format of this example, including both <link> tags
referring to the tagcloud.css stylesheet file,
which make the tag cloud look as it does in the example. This simple
cascading style sheet (CSS) is used by the browser to format tag clouds;
this file is needed by the browser to correctly display the tag cloud in
the sample file. It also defines font sizes f11 through f48 that you
can use to control the size of the font for each word. (This example has
two <link> tags for the purpose of being more robust in different
conditions. A web browser will try the other <link> tag if one of them
fails. The first-listed tag names a CSS provided on the CSE web site.
The second one expects the CSS to be a file in the same folder as the
rendered html file. If this file is not present, the browser will use
the one from the CSE web site. If the browser is currently running with
no internet connection or the CSE web site is down, it will use the file
in the same folder. Of course, if both conditions fail, the browser will
render the html file without the CSS. If both conditions hold, all is
well: the browser will use one of them. (In this case, the two files are
identical, so which one is used does not matter.))
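One possible way to choose among the font sizes f11 through f48 is to scale each word's count linearly between the smallest and largest counts among the N selected words. The sketch below assumes that interpretation of "proportional"; the class FontSizes and its methods are hypothetical names, and the exact markup to emit should be taken from the sample output rather than from this code.

```java
/** Hypothetical helper (not part of any provided library). */
final class FontSizes {

    /** Smallest and largest font sizes defined by tagcloud.css. */
    static final int MIN_FONT = 11;
    static final int MAX_FONT = 48;

    private FontSizes() {
    }

    /**
     * Maps count linearly from [minCount, maxCount] onto [MIN_FONT, MAX_FONT].
     * Assumes minCount <= count <= maxCount.
     */
    static int fontSize(int count, int minCount, int maxCount) {
        if (maxCount == minCount) {
            // All selected words are equally frequent; avoid dividing by zero.
            return MIN_FONT;
        }
        long range = MAX_FONT - MIN_FONT;
        long offset = range * (count - minCount) / (maxCount - minCount);
        return (int) (MIN_FONT + offset);
    }

    /** Returns the CSS class name for the given count, e.g., "f29". */
    static String fontClass(int count, int minCount, int maxCount) {
        return "f" + fontSize(count, minCount, maxCount);
    }
}
```

The integer division rounds down, so only the most frequent word(s) receive f48; returning the smallest size when all counts are equal is an arbitrary choice made for this sketch.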
Sorting Words and Counts in Different Ways
Your solution will have to be able to sort the words first in decreasing
order of count (to find the N most frequent words), and then in
alphabetical order to output the tag cloud. To sort them with a
SortingMachine while keeping the words and their counts together, you
should use the Map.Pair<String, Integer>s that come out of the Map
you used to count words, and put those in a SortingMachine that sorts
them in decreasing order of count; then remove the top N in order from
the first SortingMachine and put them into a second SortingMachine
that sorts them in alphabetical order of word. All that is needed to get
this to work is two nested classes implementing the
Comparator<Map.Pair<String, Integer>> interface: one that compares the
values (counts) in the pairs and the other that compares the keys
(words) in the pairs. (You have seen implementations of the Comparator
interface before in several lectures and labs, e.g., SortingMachine,
Queue Insertion Sort, Queue Quicksort, etc.)
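The two-stage sort described above can be sketched as follows. So that the example compiles without the components package, it uses java.util.Map.Entry<String, Integer> as a stand-in for Map.Pair<String, Integer> and List.sort as a stand-in for the two SortingMachines; those substitutions, the class name TagCloudOrderings, and the method topWordsAlphabetical are assumptions made for illustration only. In your project, the comparators would be declared over Map.Pair and would use its accessor methods in place of getKey and getValue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (java.util stand-ins for the OSU components). */
final class TagCloudOrderings {

    private TagCloudOrderings() {
    }

    /** Orders pairs by decreasing count (value). */
    static final class CountDescending
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            return b.getValue().compareTo(a.getValue());
        }
    }

    /** Orders pairs alphabetically by word (key), ignoring case. */
    static final class WordAlphabetical
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            return a.getKey().compareToIgnoreCase(b.getKey());
        }
    }

    /** Returns the n most frequent words (or fewer), in alphabetical order. */
    static List<String> topWordsAlphabetical(
            List<Map.Entry<String, Integer>> pairs, int n) {
        // Stage 1: sort by decreasing count and keep the first n pairs.
        List<Map.Entry<String, Integer>> byCount = new ArrayList<>(pairs);
        byCount.sort(new CountDescending());
        List<Map.Entry<String, Integer>> top = new ArrayList<>(
                byCount.subList(0, Math.min(n, byCount.size())));
        // Stage 2: sort the survivors alphabetically by word.
        top.sort(new WordAlphabetical());
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Integer> pair : top) {
            words.add(pair.getKey());
        }
        return words;
    }
}
```

Note that WordAlphabetical relies on compareToIgnoreCase rather than compareTo, which is what puts "bar" before "Foo" as the requirements ask.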
Note that the documentation for the Comparator
interface includes a warning about using a comparator capable of
imposing an ordering that is not consistent with equals in certain
situations. A comparator c is consistent with equals if
c.compare(e1, e2) == 0 has the same boolean value as e1.equals(e2)
for all e1 and e2. For this project, it is not going to be a problem
if your comparators are not consistent with equals because all the OSU
components are designed to work correctly as long as the comparator
provides a valid total preorder (possibly not consistent with equals).
However, in the next project, once you start using the Java standard
components, this could become an issue. If you want to avoid any future
problems, you may want to ensure that your two implementations of the
Comparator interface are indeed consistent with equals.
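One way to make such comparators consistent with equals is to break ties using the remaining parts of the pair, so that compare returns 0 only when both the word and the count are equal. The sketch below again uses java.util.Map.Entry<String, Integer> as a stand-in for Map.Pair<String, Integer>, and the class names are hypothetical; it assumes that two pairs are equal exactly when their keys are equal and their values are equal.

```java
import java.util.Comparator;
import java.util.Map;

/** Hypothetical illustration (java.util.Map.Entry stands in for Map.Pair). */
final class ConsistentOrderings {

    private ConsistentOrderings() {
    }

    /** Decreasing count; ties broken by the word itself (case-sensitive). */
    static final class CountThenWord
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            int result = b.getValue().compareTo(a.getValue());
            if (result == 0) {
                result = a.getKey().compareTo(b.getKey());
            }
            return result;
        }
    }

    /** Alphabetical ignoring case; ties broken by exact word, then count. */
    static final class WordThenCount
            implements Comparator<Map.Entry<String, Integer>> {
        @Override
        public int compare(Map.Entry<String, Integer> a,
                Map.Entry<String, Integer> b) {
            int result = a.getKey().compareToIgnoreCase(b.getKey());
            if (result == 0) {
                // "Foo" and "foo" differ as Strings, so they must not tie.
                result = a.getKey().compareTo(b.getKey());
            }
            if (result == 0) {
                result = a.getValue().compareTo(b.getValue());
            }
            return result;
        }
    }
}
```

Each tie-breaker only refines the primary ordering, so the words still come out in alphabetical (or decreasing-count) order; the difference is that pairs such as ("Foo", 3) and ("foo", 3) no longer compare as equal.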
Additional Activities
Here are some possible additional activities related to this project. Any extra work is strictly optional, for your own benefit, and will not directly affect your grade.
- Modify the program so that common words (such as, for example, "a", "the", "and", etc.) and strings that are not words (such as, for example, "t", "s", etc.) are not included in the tag cloud.
- Modify the program so that it is case-insensitive, i.e., the words "hello" and "HeLLo" would be counted as the same word.
- Modify the case-insensitive program so that the capitalization displayed in the output is the one that occurs most often among the different capitalizations of the same word.
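For the last two optional activities, one possible approach is to count each capitalization separately under a lower-cased key and then report the total under the most frequent capitalization. The sketch below is only an illustration: the class CaseInsensitiveCounts and its method are hypothetical, it uses java.util maps so that it compiles on its own (the project itself would use the Map family components), and breaking ties between equally frequent capitalizations by compareTo is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration (java.util stand-ins for the OSU components). */
final class CaseInsensitiveCounts {

    private CaseInsensitiveCounts() {
    }

    /**
     * Returns a map from the most frequent capitalization of each word to the
     * total count of that word over all of its capitalizations.
     */
    static Map<String, Integer> count(List<String> words) {
        // lower-cased word -> (capitalization as seen -> count)
        Map<String, Map<String, Integer>> variants = new HashMap<>();
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            Map<String, Integer> forms = variants.get(key);
            if (forms == null) {
                forms = new HashMap<>();
                variants.put(key, forms);
            }
            Integer old = forms.get(word);
            forms.put(word, old == null ? 1 : old + 1);
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> forms : variants.values()) {
            String best = null;
            int bestCount = 0;
            int total = 0;
            for (Map.Entry<String, Integer> form : forms.entrySet()) {
                total += form.getValue();
                // Prefer the higher count; on a tie, the smaller String.
                boolean better = form.getValue() > bestCount
                        || (form.getValue() == bestCount
                                && form.getKey().compareTo(best) < 0);
                if (better) {
                    best = form.getKey();
                    bestCount = form.getValue();
                }
            }
            result.put(best, total);
        }
        return result;
    }
}
```

For example, the words "Hello", "hello", "HeLLo", "hello" would be reported as the single entry "hello" with count 4.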