Chouyu

chouyu_31


The ravings of a sane person.

Sometimes filled with information.


Using data structures properly; making it fast
First, a bit of context.

One of the more-requested features of the asynchronous sockets library available with Python is a scheduler (I maintain the async sockets library). That is, being able to say "execute operation X in Y seconds". Once you have that behavior, two other behaviors become really useful to have: "I know I said run the operation in Y seconds; make that Z seconds", and "don't run X at all". These are generally referred to as 'schedule', 'reschedule', and 'cancel'.

The schedule operation is very easy to implement. You take your minheap[1], put pairs of items into it (absolute time to execute, operation), and any time your current time passes the smallest absolute time in the heap, you remove and execute items until their time is greater than the current time. Simple, fast, and effective.
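That schedule-only behavior can be sketched in a few lines with Python's heapq module. This is a minimal illustration, not the actual library API; the class and method names are my own.

```python
import heapq
import itertools
import time

# Minimal schedule-only event queue built on a minheap.
# Names here are illustrative, not any real library's API.
class Scheduler:
    def __init__(self):
        self.heap = []  # (absolute time, tie-breaker, operation) triples
        self.counter = itertools.count()  # keeps equal times comparable

    def schedule(self, delay, operation):
        # Store the absolute time to execute, not the relative delay.
        when = time.time() + delay
        heapq.heappush(self.heap, (when, next(self.counter), operation))

    def run_ready(self):
        # Pop and execute every operation whose time has passed.
        now = time.time()
        while self.heap and self.heap[0][0] <= now:
            _, _, operation = heapq.heappop(self.heap)
            operation()
```

The tie-breaking counter just keeps two events with the same time from being compared by their (non-comparable) operations.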

But what about reschedule and cancel? Well, there's the slow way, the fast but not great method, and then there's what I would call the right way.

The slow way (for rescheduling and canceling) pretty much involves scanning through your entire heap (which takes time linearly proportional to the length of the heap), removing your item, and running the heapify() operation (which also takes linear time). Generally, we don't like to do operations that are linearly dependent on the size of anything (unless absolutely necessary), because when things grow to 1 million or 1 billion items, those operations get *really* slow. Twisted[2] and sched.py (available in Python) both do this. I know, a big wtf from me.
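For concreteness, here is roughly what that slow cancel looks like: a linear scan to find the entry, then a linear-time heapify() to restore the invariant. A sketch only; the function name and entry layout are my own.

```python
import heapq

# The 'slow' cancel: O(n) scan plus O(n) heapify.
# Heap entries are (time, operation) pairs; my own layout, for illustration.
def cancel_slow(heap, operation):
    for i, (when, op) in enumerate(heap):
        if op == operation:
            heap[i] = heap[-1]   # fill the hole with the last entry
            heap.pop()
            heapq.heapify(heap)  # restore the heap invariant, O(n)
            return True
    return False
```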

What about the fast but not great method? Basically, when you cancel an item, you mark it as "don't execute this when it comes up", and when you reschedule, you first cancel the original, then insert a copy with the new time. It's not great because if you end up doing a lot of reschedules, your heap can grow. Now, the basic schedule operation doesn't grow linearly, it grows logarithmically. That is, whenever you double the number of items in the heap, it takes about one more operation to insert a new item or remove the smallest item. Not a big deal, because 1 million items takes about 20 operations and 1 billion takes about 30, but if your application is hugely dominated by reschedules or cancels, the dead entries can be overwhelming for your memory. If you are smart, you can pay attention to how many canceled items are in the heap, and clear them out when canceled > total/log(total). This tweak is known as amortization: the clear-out takes time linear in the length of the heap, but if you divide that total work among the canceled items that triggered it, it comes to only log(total) time per canceled item!
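The lazy-cancel approach above can be sketched as follows. The entry layout, names, and the exact trimming threshold are my own assumptions, illustrating the technique rather than any shipped implementation.

```python
import heapq
import itertools
import math

# Lazy cancellation: entries are never removed mid-heap; they are marked
# dead and skipped when popped, with an occasional amortized rebuild.
class LazyHeap:
    def __init__(self):
        self.heap = []
        self.dead = 0
        self.counter = itertools.count()  # tie-breaker for equal times

    def schedule(self, when, operation):
        entry = [when, next(self.counter), operation, True]  # True = alive
        heapq.heappush(self.heap, entry)
        return entry

    def cancel(self, entry):
        if entry[3]:
            entry[3] = False  # mark dead; leave it in place
            self.dead += 1
            self._maybe_trim()

    def reschedule(self, entry, when):
        # Cancel the original, then insert a copy with the new time.
        self.cancel(entry)
        return self.schedule(when, entry[2])

    def _maybe_trim(self):
        # Amortized cleanup: rebuild once dead entries dominate,
        # here when dead > total / log2(total).
        total = len(self.heap)
        if total > 1 and self.dead > total / math.log(total, 2):
            self.heap = [e for e in self.heap if e[3]]
            heapq.heapify(self.heap)
            self.dead = 0

    def pop_ready(self, now):
        ready = []
        while self.heap and self.heap[0][0] <= now:
            entry = heapq.heappop(self.heap)
            if entry[3]:
                ready.append(entry[2])
            else:
                self.dead -= 1  # a dead entry left the heap on its own
        return ready
```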

Now we come to the right way. There is actually a data structure that was pretty much designed for this, called a "pairing heap". Schedules, reschedules, and cancels are all O(logn), but the problem is that the standard implementation from the literature internally uses a tree, which requires creating tree nodes, linking them, and keeping track of the links. It's not a nightmare, but it's not terribly elegant.

What would Josiah do? Josiah would make a new data structure, and did. Basically you get all the benefits of the pairing heap from the literature (O(logn) schedules, reschedules, and cancels) without any of the nastiness; no trees. It is implemented by using a hash table to keep track of the location of every item in the binary heap (you know that item X_7 is at position 23, for example), so that when you need to reschedule (or cancel), you know which item to change and where to adjust within the heap. I actually implemented it publicly in December 2006, but I originally implemented it for closed-source contract software in the summer of 2004. It probably existed before then, but I didn't find it in the literature (kind of like the O(1) LRU/MRU algorithm and data structure that I posted like 5 years ago in the Python cookbook).
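A minimal sketch of that idea: a binary heap whose sift operations keep a dict of item positions up to date, so reschedule and cancel can locate any entry in O(1) and repair the heap in O(logn). This is my own illustration of the technique, not the author's published implementation; all names are mine.

```python
# Binary heap + position map: the dict tracks where each item lives in
# the heap list, and every swap updates it, so any entry can be found
# and adjusted without scanning.
class IndexedHeap:
    def __init__(self):
        self.heap = []      # list of [time, item] pairs
        self.position = {}  # item -> index into self.heap

    def schedule(self, when, item):
        self.position[item] = len(self.heap)
        self.heap.append([when, item])
        self._sift_up(len(self.heap) - 1)

    def cancel(self, item):
        i = self.position.pop(item)
        last = self.heap.pop()
        if i < len(self.heap):
            # Move the last entry into the hole, then restore the invariant.
            self.heap[i] = last
            self.position[last[1]] = i
            self._sift_down(self._sift_up(i))

    def reschedule(self, item, when):
        i = self.position[item]
        self.heap[i][0] = when
        self._sift_down(self._sift_up(i))  # may need to move either way

    def pop_smallest(self):
        when, item = self.heap[0]
        self.cancel(item)
        return when, item

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.position[self.heap[i][1]] = i
        self.position[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2
        return i

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and self.heap[left][0] < self.heap[smallest][0]:
                smallest = left
            if right < n and self.heap[right][0] < self.heap[smallest][0]:
                smallest = right
            if smallest == i:
                return i
            self._swap(i, smallest)
            i = smallest
```

The cost, as discussed in the comments below, is a dict update on every swap, which matters far more in a compiled language than in Python.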


Why do I bring this up? Like I said, people want a scheduler within the asynchronous sockets library. One of my motivators, a fellow from Italy named Giampaolo, has implemented his own scheduler using the 'slow' version, based on what is available in Twisted. That Twisted uses the slow method is unfortunate. That Giampaolo has chosen to reimplement the slow method is even more unfortunate. I was originally considering putting the pairing heap implementation that I have into the Python standard library, out of a sense of "my way is better, stronger, and faster". But the latter is only true asymptotically. In real life, because it's all implemented in Python, it's probably a bit slower than the "slow" method for most event queues smaller than a few thousand items. If I were to spend the next couple of weeks re-implementing it in C it would be faster, but I'm not sure that I want to do that.


What to do. I've decided that I'm going to update Python's sched.py module to offer a reschedule method, make it use the fast but not great method (with the trimming feature), and add a few other convenience features (locking support for multi-threaded apps, getting the items that are ready to execute now, ...). Then I'm going to hook it into Python's asynchronous sockets library rather than my own, because it will be 10-100 times faster in practice than my pure Python implementation. That's not to say that a proper pairing heap implementation wouldn't be useful in Python, it's just not the correct solution for Python at this point in time. And with the way that I'm going to be fixing sched.py, practically speaking, it's a pairing heap under a different name.



[1] Mark C. Chu-Carroll talks about heaps here.

[2] Twisted is an asynchronous everything framework for Python. It includes hooks for sockets, GUI libraries, and many other things.

Python usually sticks to one implementation that works well for common cases. You call list.sort() and most people don't need to stop and think about what sorting algorithm it uses. And of course list and hash are like the swiss army knives of Python data structures, although the new collections module shows there's room for some more specialized/optimized data types.

Certainly, people using Python tend to use what is available. But Python typically ships fairly optimized implementations of its data structures, so it's surprising that the sched.py module isn't one of them.

dude

(Anonymous)

2008-07-19 02:00 am (UTC)

WOAH, DUDE!
You've got WAY too much time on your hands to be writing all this.
Go OUTSIDE!

I wrote it at work, as a discussion about what I'm going to be doing with my 20% time that Google gives us to work on free software.

Also, I spend at least an hour outside every day. Riding my bicycle to/from work, surfing, and/or just plain walking on the beach.

I'd have to see actual runtimes to be sure, but I suspect in a compiled programming language the "fast but not great method" (lazy deletion with periodic cleaning) beats the solution of using a binary heap together with a hash table to keep track of heap locations. A lot of data structure research seems to be done in unnecessary fear of hash tables, but if you have to do a hash table access every time you swap two items in your heap, that seems like a big slowdown. For Python, of course, a hash table access is as fast as any other single operation, so less of an issue.

In real-world performance, using a hash table to keep track of locations is definitely a loss in comparison to "fast but not great". And in the case of Python, it wouldn't be so bad, except that the heapq module has a C implementation that only works with list objects as the heap, and is easily 4x faster than the Python implementation. Toss in the simplicity of the "fast but not great" method, and the choice of algorithm/structure becomes clear.