Algorithmic horror engine
If you thought my last post was too technical, then you'd best just skip this one.
After a grueling week of upgrading my PC's hardware a dealing with flawed Seagate firmware updates, I decided to celebrate by adding yet another Markov chain feature to KevBot. There's not a lot of story to this. Qedi (who still hasn't updated his website since his reply to one of my very first posts), decided to call me on my claim that I couldn't make a full Markov conditional probability matrix out of KevBot's factoid database. He claimed that a graph-like data structure should be able to store the entire Markov table and also that it could be computed in 1.5 seconds.
With this in mind I took out my whiteboard and started coming up with a data structure that would be suited. Since qedi is interested in this structure, I will be outlining what I required and why I designed the structure the way I did. (The things I do for ex-roommates.)
There were three main requirements for the data structure. The first is the operations that need to be available on the structure: given a word, get the next word; access any random word; write a new Markov relationship. The second is that it has to be fast. All read/write operations have to be of O(1) time complexity. The third is that the structure should be serializable and deserializable so that the probabilities can be stored and retrieved across runs.
Qedi's main idea was to use a directed-graph structure to store the probabilities. Basically, each unique word stores a pointer to each other unique word that can come after this one. This allows O(1)-time access to any given next word. Its disadvantage is it uses O(n)-space (in the number of word relationships), so it can get really big. So each unique word is wrapped in a structure called a MarkovNode, and each MarkovNode stores a list of all words that can possibly follow from the current word, along with their weights.
A mere graph is not enough to satisfy the other requirements. A graph is not inherently searchable for a given word. What's a good structure for retrieving a given wrapper object given its name? There are several. A hash table comes to mind, but one can't guarantee O(1)-time access in every case with a hash table. The structure I chose for this is a trie. Tries aren't as fast on average as hash tables, but they certainly are consistently efficient. Each MarkovNode is referenced by a trie by its name, so that given an input string, the appropriate node can be found.
Tries are terrible at serialization and random access. That's where the final part of the structure comes into play. Each MarkovNode is referenced in a large unsorted ArrayList object. When a random MarkovNode needs to be accessed, it can be found in the ArrayList by index. Additionally, the ArrayList is very easy to serialize: for each node in the list, output the word and all of its children. Rebuilding the Markov graph from this kind of list is trivial. Serialization and deserialization are both O(n) operations (where n is the number of relationships), which is pretty much as good as can be expected.
Since the data structure is completely different from the existing structures KevBot is using, I decided to not use the existing database to populate it. I decided to use all input from the IRC channel to drive the structure instead of just what I've arbitrarily called factoids. Every time someone says something in the channel (#keveX on irc.slashnet.org), the sentence is read by the trie. Each word in the sentence is converted to a MarkovNode (a node is created and added to the ArrayList if it doesn't already exist for a given word) as found in the trie, and the MarkovNode objects are chained together based on the order of the words in the new sentence.
To generate output, one need only find a MarkovNode and follow along its children according to their weights. Finding a MarkovNode can be done randomly using an index into the ArrayList, or using a string by following the trie structure. Both operations are O(1) in the number of relationships.
So, in conclusion, large is one is something that the end up the algorithm.