It's been a while since I have visited a subject close to my heart: Markov chains. In case you're new here (doubtful) or just love my writing so much that you want to read my earlier posts on the subject (also doubtful) feel free: first I explain Markov chains in lay terms, then I improved my original algorithm, I completely changed the way I was collecting and organizing data and then finally I replaced that structure with a MySQL table (which I haven't posted about yet).
Lately KevBot has been having trouble with its chaining because its word-pair table was approaching three million records. This table was a slapped-together idea that I whipped up on the fly because the previous graph was taking up too much space. The table was as simple as they come: one record for every recorded word pair. That means that every time someone says anything, a record is added to this table for every two words he said, even if those pairs were already there before. In fact, repetition was necessary to this structure because that is how the weight problem was solved.
See, if one pair of words is more common (statistically speaking) than another, KevBot needs to be able to pick that word pair over the other. So KevBot needs some way to assign extra weight to the more common word pairs. The simplest and most naive way to do this is to just repeat records: if a record is common, it will be there more often, so a random selection by definition accounts for the weights without any extra work. It's a very easy solution, but it doesn't scale well: every time someone says something it adds more records to the table.
This worked just fine in the short term, when there wasn't much data. But as data began piling up it became obvious that my naive structure wouldn't be able to cut it. A big part of my problem was my choice of primary key: since the word pairs could repeat, I could not use the word pair as my primary key: I had to add an artificial index to the table and use that. That made the types of queries I needed to make most often very slow, since they weren't working on the structure of the table but rather supplementary indices.
So I had two problems to solve: one of size and one of efficiency. The solution to the first was obvious: add a new column to the table to indicate the weight of a particular word pair, so I could collapse repeated word pairs into one record. Then my word pairs would be unique and I could use them as my primary key, thereby solving the second problem. I created the necessary tables and populated them with my existing data, made the changes to KevBot's code so that it could pick word pairs based on their weights (thereby making KevBot more complicated than it was before) and I was done.
Instead of inserting every new pair, KevBot would have to first check to see whether the pair exists. If it doesn't, it can just add the record normally with a weight of 1. If the pair does exist, then KevBot has to increase the weight of that pair. You know how much programmers love to increment numbers. But since the word pair is the primary key of the table, it's very efficient to check to see if it exists. So even though KevBot is now running two queries instead of one, overall it's still faster than it was before.
The performance boost is amazing. KevBot can now create sentences nearly instantly again instead of taking sometimes up to 30 seconds to produce very long strings. The astute reader will notice that I made a structure smaller and faster at the same time, which seems to go against the time and space trade-off. However, I'd like to point out that it's not the size of the structure that made it slow, but rather the efficient primary key that allows it to be queried quickly. The changes to KevBot I had to make actually make it slower (all else being equal), but the database performance boost more than offsets that small loss.
KevBot has something it would like to say:
<KevBot> Yes it does seem pretty large compared to yours.
<KevBot> a touch of spicy montreal on top of the model m from the machine grinds up to your house.
<KevBot> I want to play. I have a state which votes for one party in national elections and another one.
Since I had this new structure I had a new idea: using Markov chains to turn KevBot into a doppelganger. Usually, KevBot pulls in all data from a channel and stores it in a single table; that way its output statistically resembles the ramblings of the channel itself rather than the individuals in it. So the things I say in the channel mix with the things qedi says the way KevBot stores it. Now that I had this new structure I could easily add a new column to the primary key to represent the user who created the word pair!
Each user would therefore have its own subset of word pairs and weights that KevBot could use to not only emulate the channel as a whole but also individual users. The result is just as fast as the regular Markov table and often just as weird:
<KevBot>And now it learns at a linear rate and now shit has been a fun coding project at work.
<KevBot>What in the channel in order to have this agreement. It piles up all it wants, and I are disconnected.
That sure sounds like me. Once again I invite you to swing by #kevex on irc.slashnet.org to interact with my creation. It will be statistically welcoming. As more people interact with it, it gains more data and produces better output, so come on down!