Wednesday, September 1, 2010

Word Length Frequency

// word_len_histo.cpp : reads words and lists distribution
//                      of word lengths.
// Fred Swartz, 2002-09-01

// This would be nice to turn into an OO program, where
// a class represented a distribution of values.
// Some elements which are globals here would turn into
// private member elements in the class (eg, valueCount).


//--- includes
#include <iostream>
#include <iomanip>
#include <cctype>
using namespace std;

//--- prototypes
void  countValue(int cnt);
float getAverage();

//--- constants
const int BINS = 21;  // how many numbers can be counted

//--- globals
int valueCount[BINS]; // bins used for counting each number
int totalChars = 0;   // total number of characters

//=========================================================== main
int main() {

    char c;              // input character
    int  wordLen = 0;    // 0 if not in word, else word length

    //--- Initialize counts to zero
    for (int i=0; i
        valueCount[i] = 0;
    }

    //--- Read chars in loop and decide if in a word or not.
    while (cin.get(c)) {
        if (isalpha(c)) { // letters are in words, so
            wordLen++;    // add one to the word length
        } else {
            countValue(wordLen); // end of word
            wordLen = 0;  // not in a word, set to zero
        }
    }
    countValue(wordLen);  // necessary if word ended in EOF

    //--- print the number of words of each length
    cout << "Why does this line disappear?" << endl;
    cout << "Word length    Frequency" << endl;
    for (int j=1; j
        cout << setw(6) << right << j << "       " 
             << setw(8) << right << valueCount[j] << endl;
    }

    //--- print average length
    cout << "\nAverage word length: " << getAverage() << endl;

    return 0;
}//end main


//==================================================== countValue
void countValue(int cnt) {
    if (cnt > 0) {
        // this must be the end of a word
        if (cnt > 20) {
            cnt = 20;  // longer than 20 counts as 20
        }
        valueCount[cnt]++; // count in correct bin
    }
    totalChars += cnt;
}//end countWord


//==================================================== getAverage
float getAverage() {
    int totalCount  = 0;

    for (int i=0; i
        totalCount  += valueCount[i];
    }
    if (totalCount > 0) {
        return (float)totalChars/totalCount;
    } else {
        return 0.0;
    }
}//end getAverage


Alternative approach
Suppose that we have a very short paragraph like this "Roses are red. Violets are blue. This verse doesn't rhyme. And neither does this one," how can we find and save all different word lengths and their frequencies of occurrence? For example, "red" is 3 letter long and there are a total of five letters of this length (are, red, are, and, one) in the paragraph. Then one of the items in our cache should be 3 and 5
Just like other problems where we have to keep track of the number of occurrences, we should use a hash table. The algorithm is like this:
public HashMap  countWordLengthFrequency(String[] paragraph)
{
    HashMap  frequencyTable = new HashMap();

    for (int i = 0; i < paragraph.length; i++)
    {
      if (!frequencyTable.containsKey(paragraph[i].length()))
        frequencyTable.put(paragraph[i].length(), 1);
      else
      {
        Integer count = frequencyTable.get(paragraph[i].length()) + 1;
        frequencyTable.put(paragraph[i].length(), count);
      }
    }

    return frequencyTable;
}
Explanation:we just loop through the words in the paragraph. For each word, we check to see if its length is already in the hash table. Therefore, there are two cases. 1) If the word's length is already in the table, we increase the frequency count by one. 2) Otherwise, we hash the length into the table and set the frequency to 1 because this is the first time this length is hashed into the table.
Obviously, the time complexity is O(n) because we have to check every word in the paragraph. Moreover, in the worst case, the space complexity is O(n). That's when every word in the paragraph has a different length than the other words. If you know any better algorithm for this problem

0 comments:

Post a Comment