Josiah Carlson (chouyu_31) wrote,
Josiah Carlson
chouyu_31

Using data structures properly; making it fast

First, a bit of context.

One of the more-requested features of the asynchronous sockets library available with Python is a scheduler (I maintain the async sockets library). That is, being able to say "execute operation X in Y seconds". Once you have that behavior, two other behaviors become really useful to have; "I know I said run the operation in Y seconds, make that Z seconds", or "don't run X at all". These are generally referred to as 'schedule', 'reschedule', and 'cancel'.

The schedule operation is very easy to implement. You take your minheap[1], put pairs of items into it (absolute time to execute, operation), and any time your current time passes the smallest absolute time in the heap, you remove and execute items until their time is greater than the current time. Simple, fast, and effective.

But what about reschedule and cancel? Well, there's the slow way, the fast but not great method, and then there's what I would call the right way.

The slow way (for rescheduling and canceling) pretty much involves scanning through your entire heap (which takes time linearly proportionate to the length of the heap), removing your item, and running the heapify() operation (which also takes time linearly proportionate to the length of the heap). Generally, we don't like to do operations that are linearly dependent on the size of anything (unless absolutely necessary), because when things grow to be 1 million or 1 billion items, those operations get *really* slow. Twisted[2] and sched.py (available in Python) does this. I know, a big wtf from me.

What about the fast but not great method? Basically, when you cancel an item, you mark it as "don't execute this when it comes up", and when you reschedule, you first cancel the original, then you make a copy for the new schedule. It's not great because if you end up doing a lot of reschedules, your heap can grow. Now, the basic schedule operation doesn't grow linearly, it grows logarithmically. That is, whenever you double the number of items in the heap, you take about one more operation to insert a new item or remove the smallest item. Not a big deal, because 1 million takes 20 operations, and 1 billion takes 30 operations, but if your application is hugely dominated by reschedules or cancels, this can be overwhelming for your memory. If you are smart, you can pay attention to how many canceled items are in the heap, and clear them out when canceled > total/log(total) (this tweaking is known as amortization, because to clear it out will take time linear in the length of the heap, but if you divide the total amount of work by the number of canceled items, it only takes log(total) time for each of the canceled items!)

Now we come to the right way. There is actually a data structure that was pretty much designed for this called a "pairing heap". Schedules, reschedules, and cancels are all O(logn), but the problem is that standard implementation from the literature internally uses a tree, which requires creating tree nodes, linking them, and paying attention to the links. It's not a nightmare, but it's not terribly elegant.

What would Josiah do? Josiah would make a new data structure, and did. Basically you get all the benefits of the pairing heap from the literature; O(logn) schedules, reschedules, and cancels, without any of the nastiness; no trees. It is implemented by using a hash table to pay attention to the location of every item in the binary heap (you know that item X_7 is at position 23, for example) so that when you need to reschedule (or cancel), you know what item to change, and where to adjust within the heap. I actually implemented it publicly in December 2006, but I originally implemented it for closed-source contract software in the summer of 2004. It probably existed before then, but I didn't find it in literature (kind-of like the O(1) LRU/MRU algorithm and data structure that I posted like 5 years ago in the Python cookbook).


Why do I bring this up? Like I said, people want a scheduler within the asynchronous sockets library. One of my motivators, a fellow from Italy named Giampaolo, has implemented his own scheduler using the 'slow' version, based on what is available in Twisted. That Twisted uses the slow method is unfortunate. That Giampaolo has chosen to reimplement the slow method is even more unfortunate. I was originally considering putting the pair heap implementation that I have in the Python standard library, out of a sense of "my way is better, stronger, and faster". But the latter is only asymptotically. In real-life, because it's all implemented in Python, it's probably a bit slower than the "slow" method for most event queues smaller than a few thousand items. If I were to spend the next couple weeks re-implementing it in C it would be faster, but I'm not sure that I want to do that.


What to do. I've decided that I'm going to update Python's sched.py module to offer a reschedule method, make it use the first fast method (with the trimming feature), and add a few other convenience features (locking support for multi-threaded apps, get the items that are ready to execute now, ...). Then I'm going to hook it into Python's asynchronous sockets library rather than my own, because it will be 10-100 times faster than my pure Python implementation in practice. That's not to say that a proper pairing heap implementation wouldn't be useful in Python, it's just not the correct solution for Python at this point in time. And with the way that I'm going to be fixing sched.py, practically speaking, it's a pair heap under a different name.



[1] Mark C. Chu-Carrol talks about heaps here.

[2] Twisted is an asynchronous everything framework for Python and Java (maybe others). It includes hooks for sockets, GUI libraries, and many other things.
Tags: algorithms, python
Subscribe

  • PyPE 2.9

    It's been a couple years, but I just released PyPE 2.9. It's got features and bugfixes that I've been developing over the last 2 years, many of which…

  • compiler.ast vs. _ast vs. ast

    I had recently re-written PyPE's parser stuff to use compiler.ast, and found it to be pretty convenient. I then did some more reading and remembered…

  • Parsing Python

    As work continues on the next version of PyPE, I've discovered a few cases where having the AST for the module being edited is actually the most…

  • Post a new comment

    Error

    Comments allowed for friends only

    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded 

  • 5 comments