Friday, May 20, 2011

Tries

Trie (pronounced as try even though it stands for tree and retrieval) is a prefix tree data structure. You can read more on it on topcoder and wiki.

Trie mostly has a root node as an empty string ("") and each node has 26 child pointers letters (A to Z) at least. As we add words to it, the trie grows. Here is an example of how a trie looks (from www.danvk.org) after adding a few words to it. Note that the complete words are double rounded.
It is mainly used for longest prefix (ex: IP matching) and for dictionaries <word, meaning> pairs. For dictionary it has advantages like spell-checking, other nearest words etc.
Usually, the trie node needs to have the children pointers at the least and it can have more data members as per the problem.

class trieNode {
trieNode[26] children
char c //
int countPrefix // words with prefix from root to current node
int countWord // complete words for this current node
meaning // pointer to meanings of this node's word (dictionary)
}


At the minimum, we need the below functions. Depending upon the problem, we add more data members, change the functions.


initialize(trieNode)
addWord(trieNode, word) //called as addWord(root, word)


Efficiency for a building a trie is O(n) for n strings as we have to add all n strings to the trie. For finding a string in a trie, it takes O(l) worst case where l is the length of the string.
Let's take a problem where we want to find the "longest prefix of a given list of strings" for given percentage of strings (i.e longest prefix for 50% of given strings). We would first form the trie with the addWord(root, word) function and then use longestPrefix().
Once the trie is formed, the problem becomes quite simple (like finding the max from a list of numbers). We maintain currPath (path from root to current node i.e. current prefix) and longestPrefix (longest prefix so far). From node's countPrefix and root's countPrefix, we calculate the node's percentage. Based on these two conditions, we change longestPrefix accordingly as we traverse the whole tree by DFS. Note that we first go deep into the tree and then check for conditions as ancestor nodes within the longestPrefix have the same prefixCount (or greater but we prefer longer prefix than greater prefixCount).

 


class trieNode:
trieNode[26] children
char c
int countPrefix
int countWord

addWord(trieNode node, string word) {
//we are at end of the word
if (word = ""):
node.countWord = node.countWord +1
return

//do stuff with current node
node.countPrefix = node.countPrefix + 1

//go to next node
char first = firstChar(word)
if (node.children[first] == null):
node.children[first] = new trieNode()

addWord(node.children[first], firstCharRemoved(word))
}

addWord(trieNode root, List l) {
for string s in l:
addWord(root, s)
}

string longestPrefix(trieNode root, int percent) {
StringBuffer currPath = new StringBuffer()
StringBuffer longestPrefix = new StringBuffer()
dfs(root, percent, currPath, longestPrefix)
return longestPrefix.toString()
}

dfs(trieNode node, int percent, StringBuffer currPath, StringBuffer longestPrefix) {
// append node char
currPath.append(node.c)

// go as deep as possible as ancestors have the same prefixCount
// as the child within the longestPrefix
for trieNode tn in children:
dfs(tn, percent, currPath, longestPrefix)

if (node.countPrefix/countPrefixForRoot *100 >= percent):
if (currPath.length() > longestPrefix.length()):
longestPrefix.setLength(0)
longestPrefix.append(currPath)

// drop last node char
currPath.deleteCharAt(currPath.length() -1)
}

}

Longest Prefix problem needs O(n) time to build the trie for n strings. Then it dwindles down to DFS with (O(|V| + |E|) time to find the longest prefix.



0 comments:

Post a Comment