CSE 17
Not so long ago, text messages could only be sent using the numeric key pad of the cell phone. Even
today, less expensive phones rely on this means for texting. In this assignment, we will look at the
problem of decoding text messages using the T9 system. Unlike the multi-tap system, T9 requires only a
single key press for each letter, and then uses a dictionary of words and their frequencies to choose the
most likely word given a sequence of key presses. The keypad is mapped to letters as shown below:
Thus, we would press “2” once if we want “A”, “B” or “C”. Of course, this means that some words would
have identical sequences of key presses. For example, “pay”, “raw” and “say” all map to the sequence
“7”,”2”, “9”. T9 assumes you want the most frequently used word, and then typically provides an option
to cycle through alternative words in order of decreasing frequency. If the word “say” was the most
frequently used word, then this would be the first option. We will represent a user’s entry as a string
consisting of the key presses. For clarity, the space between words will simply be represented by spaces.
In order to solve this program you will need to process a word dictionary text file that I will provide. This
file has the 5000 most frequently-used words in the English language, sorted by descending frequency.
There is one word per line, with multiple fields separated by whitespace. The format is:
rank word part-of-speech-code frequency dispersion
For our purposes, you will only need to be concerned with the word and the frequency fields. The latter is
an integer representing the number of occurrences in a large collection of English documents, and will be
less than 100 million.
To solve this problem, you will need to write the four classes below. Each class must be defined in a
separate file. It is strongly recommended that you write the classes in the order described, and that you
compile and test each class after you write it. Remember, it is critical that the classes and methods have
the exact signatures (visibility, names, number and type of parameters, etc.) as described below. However,
you may create additional fields and methods if you find that helpful. When your program is compiled,
there should not be any warnings that can be eliminated by instantiating the generic class with a concrete
type. However, there may be some warnings that are unavoidable. CSE 17, Fall 2014
2
Word A simple class representing an entry in our word dictionary.
Implements the Comparable<Word> interface.
-word: String The word.
-freq: int The word’s frequency.
+Word(word:String, freq:int) The constructor. Sets word and freq.
+getWord(): String Returns the word.
+getFreq(): int Returns the frequency of the word.
+getKeySequence():String Returns the sequence of numbers that would be used to enter the
word as a text message in T9. Note, this is returned as a string.
Thus, for the word “pay,” the string “729” would be returned.
+toString():String Returns a string of the form: word (code = key-sequence, freq =
freq) where word and freq are the fields of the Word, and key-
sequence is the corresponding key sequence of the word.
+compareTo(w:Word):int Implements the standard semantics of compareTo. The current
word is less than w if its key sequence comes before w or if they
have the same key sequence and its frequency is higher than w’s.
The word is greater than w if its key sequence comes after w or if
they have the same key sequence and its frequency is less than
w’s. Otherwise, the two words are equal.
+encode(c:char):char Given a character c, returns the character for the corresponding
numeric key press. For example, if c=’A’ or c=’B’ or c=’C’ then
returns the character ‘2’.
SelectionSort This class is used to sort an ArrayList using Selection Sort.
There are no fields.
+sort<E extends Comparable<E>>
(list:ArrayList<E>):void
This is a generic method that sorts list using the Selection Sort
algorithm. Your implementation should be written to work with
parameterized ArrayLists, as opposed to converting the ArrayList
to an array and reusing the book’s code as-is. The sort should
work for any Comparable E.
TextDecoder This is the main class of the application.
There are no fields.
+main(args:String[]):void The entry point to the application. Loads a FreqDictionary from
the file wordFreq.txt (this file is available on Course Site). Then
prompts the user to enter a line of numeric key presses, with each
“word” separated by spaces. It then translates these numbers into a
text message where each word is predicted to be the most frequent
word that matches the key sequence. Regardless of user input, the
program not crash, and words that can’t be translated should be
displayed as a sequence of question marks. The program should
also handle the FileNotFoundException robustly.
CSE 17, Fall 2014
3
FreqDictionary A dictionary of words and their frequencies. The dictionary is
organized to support efficient translation of key sequences to
words.
+MAX_LENGTH: int A constant for the length of the longest word that will appear in
the dictionary. It should be set to 10.
-wordLists:ArrayList<Word>[] An array of lists of words. Each element of the array is an
ArrayList that only has words of the same length (from 1 to
MAX_LENGTH, depending on the list). These words will be
sorted first by their key press sequence and then by descending
frequency.
+FreqDictionary() The constructor. Should initialize the ArrayLists in wordLists.
+loadFromTextFile(freqFile:File):
void
Reads from a text file formatted as described above and adds the
words into the appropriate element of wordLists depending on
lengths. Any words that are too long will be skipped. After all lists
are created, uses selection sort to sort them first by their key press
sequence (ascending) and then by descending frequency. It should
throw the FileNotFoundException if unable to open the file.
+getTopKWords(keySeq:String,
k:int): Word[]
For the given sequence of presses keySeq, returns up to k Words
that best match the sequence, in order of descending frequency.
This method should use a binary search to locate a corresponding
word in the appropriate element (i.e., ArrayList) of wordLists. It
will then need to locate the most frequent words with that
sequence, and then find the next k-1 most frequent words. The
length of the returned array should be the same as the number of
matching elements, which may be less than k.
+decodeWord(keySeq:String):
String
Returns the most frequent word that matches the presses keySeq.
Make use of existing methods as much as possible, and avoid
writing redundant code. If there is no matching word, then returns
a sequence of question marks of the same length as keySeq.
In order to write this program correctly, you need to understand how the wordLists field is
organized. Consider the following diagram:
wordLists:
...
ear(code=327, freq=29597)
far(code=327, freq=16006)
eat(code=328, freq=73505)
fat(code=328, freq=24844)
fat(code=328, freq=14826)
day(code=329, freq=432773)
...
...
club(code=2582, freq=30800)
blue(code=2583, freq=47622)
clue(code=2583, freq=8732)
blue(code=2583, freq=7671)
coal(code=2625, freq=8928)
boat(code=2628, freq=32079)
...
0 1 2 3 4
... CSE 17, Fall 2014
4
Note, that each element of wordLists is a list of words (in fact, each is an ArrayList). The first
element (index 0) would have a list of words of length 1 (e.g., “I” and “a”). For each word in the
dictionary, we know its string, its key sequence (referred to as code above) and its frequency.
Note, that the lists are neither sorted alphabetically nor by frequency. Instead, they are sorted
first by code, and then by descending frequency. This organization makes it easy to find the best
matching word for a given numeric sequence. First, use the length of the sequence to determine
which list to look into. Then use binary search to find a word whose code matches the sequence.
Then examine its neighbors to find the most frequent matching word and other matching words.
Because, we have sorted first by key presses then by descending frequency, the earliest match to
a given key press sequence will be the most frequent word. Note, because our dictionary file has
separate entries for different parts-of-speech (e.g., noun vs. verb) of the same word, there may be
duplicates in your lists, such as “fat” or “blue” above. This is okay.
An example interaction is given below (user input is given in bold):
> run TextDecoder
Enter a sequence of key presses (separate words by spaces):
7625 263 7655
Your text message is:
rock and roll
Note, due to other words with the same key sequences, alternative phrases would be, “sock and
roll”, “soak and poll”, “rock and poll”, etc. However, each of these would be less likely because
the words are not as common. You are only expected to return the phrase with the most common
word for each set of key presses. You should identify each word in a context independent
manner: that is, do not use other words in the phrase to determine the likelihood of possible
interpretation of the key presses, only use the frequency information from the dictionary. You do
not need to recognize punctuation. Note, although your main does not demonstrate finding the
top K matching words for a given sequence (it only needs to find the top word), we will test and
grade this feature with our own programs. Therefore make sure that it works correctly.
At a minimum provide a Javadoc comment explaining what each method does, and another one
that summarizes each class. Include additional comments for lines that are especially
complicated. Also, at the top of each class include a comment with the following form, although
the program description only needs to appear in TextDecoder.java:
/*
CSE 17
Your name
Your user id
[with assistance from tutor name (tutor e-mail address)] if tutor used
Program #4 DEADLINE: November 13, 2014
Program: T9 Text Messaging
*/
Once the program compiles, runs, and has been tested to your satisfaction, use SSH to connect to
your account on sunlab.cse.lehigh.edu. Create a subdirectory of cse017.144 called P4, and
transfer all of your .java files to this subdirectory. Be very careful to ensure that you have
spelled the directories correctly and used the correct case for all letters in directory and file
names. At the time of the deadline, an email will be sent to you either confirming that your files
have been collected or stating that the files have not been collected.