Friday, July 5, 2013

The RankedSet: Optimization's Best Friend


This post is the second post about the 'New Genetic Algorithm' that I posted about previously.

What We Need

See, when you're running the genetic algorithm -- that nifty way to solve problems by computer -- you need to store different solutions to the same problem. For example, suppose I need dinner for my family tonight. Do I cook at home, grab some fast food, or buy a TV dinner at the local grocery market? These are different solutions to the same problem: my family needs food.

While the genetic algorithm (any genetic algorithm, including this new one) is running, it needs to maintain a "population" of such solutions. It keeps around, say, 100 solutions, kills off some, and adds new ones, but it always stores so many solutions (e.g. 100) at all times. In my new genetic algorithm, it will be important for reasons explained in a later post for the algorithm to be able to figure out the rank of each solution. Rather than knowing how much better one solution is from another, I feel it is important only to tell which solution is better than the other. When we make a buying decision, it does not matter if one option is a little bit better or way better, we usually just take the better option. This idea is important, and in the algorithm it will become important to be able to find a solution within the "population" and be able also to delete it based on its rank. I need to be able to say to the data structure, "give me the 45th best solution" and out it pops. I also need to be able to say, "kill off the 32nd solution". So, in order to store the population so that I can easily find, delete (kill off), insert (introduce) a solution into my population, the question is: what is the best way to store my population in a computer?

It turns out, there really isn't a data structure in any standard library I've found that can do the job. By the way, for those purist logicians who are craving a formal specification of what 'the job' is, sink your teeth into this:

  • The data structure needs to store information about each solution. If this is the 'bring home the bacon' example, it needs to store stuff like 'store location', 'what to buy', etc.
  • The data structure needs to store information about the rank (e.g., best, second-best, 43rd-best) of each solution.
  • The data structure needs to have efficient retrieval, deletion, and insertion. It needs to be able to delete a solution given its rank, and retrieve a solution given its rank.

Fanfare, Please: Enter the RankedSet

Why We Like It

The RankedSet will solve the problems I have stated above. You can store a solution to a problem in it, you can tell the set how great the solution is, and it will assign ranks for you based on how good each solution is. It provides retrieval and deletion based on rank and all retrieval, deletion, and insertion runs in logarhmic time. For those non-nerds out there, that means it does what we want it to, and it does it at speeds that take the cake.

Here's how it works: when we insert a solution into the set, we give it two values: how good the solution is, which we will call the index, and then the information about the solution itself, which we will call the payload. Since we'll rank the solutions in the set based on the first value, it needs to be a number -- a score, or perhaps in some cases, a monetary amount, etc. The payload can be anything we want, but in this case it is information about the different solution that we will put here.

How It Works

But how does it work, you say. That's the juicy part, right?

Computer Scientists know that what I am about to describe, is, basically, a Binary Search Tree, or BST. Each solution has at most two "children" solutions--one with a greater index than itself, and one with a smaller index.
With each solution is stored three things:
  1. The solution information itself.
  2. How "good" the solution is.
  3. How many other solutions are "downstream", or, how many "descendants" the solution has (including itself).
It's number 3 in the above list that's the interesting bit. every other part of the description of how the solutions are stored is just a regular ol' BST, easily found in most any standard library in most any computer language. But no BST I've ever found has #3 in the above list stored in it, and so cannot function as a RankedSet.

Note that the value for #3 can never be empty. If there is a solution stored with no "children" solutions, #3 value is just "1", since it is the number of solutions that are descended from a node, including itself.

If every solution in the tree stores #3, then it becomes easy for the BST to look up a solution based on rank. Say that I need the 13th best solution in the BST. I ask the "root" node (the very first solution, with no parents and all children) to give me the 13th best. It asks the 'greater' of its children how many children it has. Say that number is 10. To few. It asks its lesser child again how many children that it has, which might be 20. So, it tells its lesser child to find the 2nd best solution in its family (13 minus the 10 best on the greater side, take away the root node itself, which leaves 2). If that sound hokey to you, then just trust me: it works, and it's fast.

Further explanation would boggle the mind of those of my non-egg-head readers, and is not necessary for those already familiar with binary search trees. Suffice it to say that caching the number of descendants of a solution, including the solution itself, allows us to find and delete solutions based on rank.

So What's It All Mean?


It means that we can now construct an algorithm that does not have to concern itself with how great a solution is, only if it is the best one. At the same time, it also means that the algorithm only has to know how good a solution is when it is inserted into the population, and from then on it need only talk about weeding out solutions that are weak, like the 75th best solution, and promoting the strong ones, like the 2nd solution. It means that it can talk about the rank of those solutions without ever actually retrieving them until it is time to kill the weak and create the strong. This makes our algorithm really fast and very effective.

Up Next

In the next post I will talk about how the algorithm uses the RankedSet, how it chooses how many weak to kill off and spawn, how it chooses how many solutions to mutate, and how easy those choices make it for programmers to use the new genetic algorithm to solve the next big problems.