First day in Taiwan...

We made it to Taipei, wandered around the city a bit, took some pictures, got to the hotel, and are tired. See Luan managed to get 7-9 hours of sleep on the plane, I may have managed 3-4. On the upside, I did fix a couple bugs in PyPE in my sleeplessness (though claims of power and/or internet on the flight over were outright lies).

Sadly, See Luan managed to dislocate her knee while standing outside a Starbucks (I was getting her a latte), so much of our day after 12:30PM local time included her limping around. :( We got her a hot bath and shower tonight, put some medicated pads on her knee, and plugged in the fridge so ice is on the way. Tomorrow we'll stop by a pharmacy, where apparently pharmacists can sell the good drugs after prescribing them on the spot. Woo for saving money on not needing to go to a hospital. When we get back to the states, I'm going to insist that she see a doctor and get it checked out.

The injury also resulted in a fall, causing See Luan to throw her new Canon S95 halfway across the street (with lens extended). Aside from the lens not wanting to go back in (it can be forced), some amazing scarring to its metal body, and some serious body warping (some of which I bent back in), the thing still seems to be able to take really nice pictures. I'm going to have to open it up to see if I can fix the lens return (it feels like a bit of metal is bent and is catching just before getting all the way in), which might be an evening project if we can find mini-screwdrivers.

Oh, and the food is cheap. In LA, I'd pay like $9 for a good bowl of ramen. Here? $3. We're going to catch dinner at an "American" grill downstairs because See Luan is hobbling around, but I guess they are pretty popular in the area.

I'll blog after dinner and a shower about my first impressions of Taiwan.

Please pardon me while I rant for a moment about Blogspot

One thing that is a source of annoyance to me on a regular basis is Blogspot's inability to perform formatting correctly. For example, let's say that we want to have three paragraphs with a blank line between them. In a sane blog software, you would type the three paragraphs, and include a blank line between them. Simple, right? Yes.

What happens when you do the same thing with Blogspot? Well, that depends. Have you gone to the "edit HTML" window? Yes? Ok, then those blank lines turn themselves into 2 blank lines. Doesn't make sense now, does it? Of course not. But that doesn't matter, because that's just the way Blogspot works.

Well, what if you want to throw in some <pre> tags? You know, because you want to embed some code. You can't enter it into the "compose" window, because *it* doesn't understand what you mean. And if you drop into the HTML editor, you are confronted with some horrible excuse for html formatting, where "blank" lines are really two or three spans, and if you try to clean it up to save your sanity, you've now just destroyed any chance at having proper formatting at all.

So what do you do? You copy all of what you wrote out, paste it into notepad (or something else that doesn't keep rich formatting), type your formatting in manually, re-paste it into the html window and... it adds line breaks where there were none.

The solution that I've discovered works pretty well is to type everything out in a plain text editor, add any links, etc., by hand. Leave line breaks where you want them (don't add tags to do it, just leave blank lines), add other extra formatting (like pre tags), and paste it into the "HTML edit" box. Then it seems to behave mostly sane. Which I guess I'll have to take.

Feel free to go back to your regularly scheduled reading.

(no subject)

It's been a couple months since I posted here in Livejournal. I had been syndicating my tech blogging over here, but I stopped doing that because the formatting was awful, and it was too much of a pain to re-do it for every post. But as big_bad_al offered, I probably shouldn't be syndicating my thoughts all over anyways.

As previously stated, this will be where I post more personal, less technical stuff. Which I don't really post a lot of. And all the really juicy details (which I don't really post anymore) are usually off in friends-locked posts.

That said, See Luan and I are doing well (we're off to Taiwan, Malaysia, and Bali mid November to early December for ), work at is going well (I get to build a lot of interesting stuff, technically), my family is doing well, ... Not really a lot to report.

On a completely random note, I really like the movement of the insert key on the Logitech MK320 and similar Logitech keyboard/mouse combos, as the arrangement of the secondary block of keys is *really* nice. Also, I've been very satisfied with the longevity of the combos in terms of durability and battery life.

Windows file permission and indexing service hell...

At some point in a person's life, if they are using a multi-user operating system, they will come into some sort of file permissions hell. Linux, Mac, and other Unix-like operating systems usually have "sudo", which allows you to momentarily tell the system, "To hell with this file permissions bunk, do what I say!" If I remember correctly, Windows 2000 and later had a similar kind of thing, "Run as Administrator", which when used properly, is very effective.

Since getting my most recent computer (Core 2 Quad 2.4 GHz, 4 GB RAM, GeForce 8800 GTS 512 MB, ...), and moving everything that has ever been on any computer I owned to this one beast (which is faster than all other machines I've ever owned, combined), I discovered a new file permission hell that I've never experienced before. I couldn't even list the contents of directories, never mind moving, deleting, or otherwise mangling them.

Ultimately the problem came down to ownership. Each Windows installation generates its own security identifier (SID), and each local user account (even if it has the same username) gets a different SID derived from it. Since I'd pushed no less than 5 different Windows installations' worth of data (2x Win2k, 3x WinXP) to this one box, spelunking around and re-organizing stuff usually resulted in various prompts of "are you sure you want to elevate your permissions to do this?" in Windows 7. Run as Administrator did not help. If only I had chown...

If you're a random internet surfer, please be aware that the following procedures may reset the modified date/time on your files. If you care about those dates/times, don't do this. Ready?

1. Right-click on whatever ancestral folder you want to alter ownership on (I did this on the root of my data drive), and hit Properties.
2. Go to the "Security" tab, and click the "Advanced" button down on the right.
3. Now the "Owner" tab, then the "Edit" button down on the bottom.
4. Pick yourself as the owner, and check the box that says "Replace owner on subcontainers and objects".
5. Now hit the Ok button, and confirm everything as necessary. In a few minutes, you will have solved your ownership dilemma.
6. Close those properties dialogs.
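
For the command-line inclined, Windows ships a takeown tool that is roughly the chown I was wishing for. Here is a minimal Python sketch of driving it, under a few assumptions: build_takeown_command and take_ownership are names I made up for illustration, the path is just an example, and it has to be run from an elevated prompt. I used the dialog steps above myself, so treat this as an untested alternative rather than what I actually did.

```python
import subprocess


def build_takeown_command(path):
    """Build a takeown invocation that recursively takes ownership of path.

    /F names the target, /R recurses into subfolders, and /D Y answers
    "yes" when a folder cannot be listed without taking ownership first.
    """
    return ["takeown", "/F", path, "/R", "/D", "Y"]


def take_ownership(path):
    # Hypothetical wrapper: only works on Windows, from an
    # administrator prompt. Raises CalledProcessError on failure.
    return subprocess.run(build_takeown_command(path), check=True)


# Only build the command here; nothing is executed.
command = build_takeown_command("D:\\data")
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to eyeball the command before letting it loose on a whole drive.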

Now with that problem solved, what about the background Windows Indexing stuff? What? You disabled Windows Indexing Service in Windows 7 you say? Yes and no. There is a successor called "Windows Search", and that thing will touch any file you haven't explicitly disabled. Removing the Indexer was *supposed* to stop it, but it doesn't. And if you remove Windows Search, you can't even search through file names. If you want full-text indexing, by all means, keep it. Me? I find "instant" file searching terribly unnecessary (never mind basically broken with most of these systems). Also, the files I want to full-text search are only a few megs, and I touch them repeatedly with 'ack' (it's like grep, only with all of the good options already selected... its only issue is that it's written in Perl :P ).

1. Go to the root of your drive again, right-click on any (relatively small) folder, select Properties.
2. Hit the "Advanced" button. Notice that "Allow files in this folder to have contents indexed in addition to file properties"? Toggle it, just to make a change.
3. Hit ok twice.
4. It should pop up with a dialog that asks if you want to apply it to that folder, or all subfolders and files too. Say Ok to that too.
5. You may need to give it administrator privileges, and you may need to "Ignore All" for permissions issues.
6. When it gets done, select every folder in the root of your drive, right-click, Properties.
7. Advanced button again, then make sure that the "Allow files in this folder..." checkbox is unchecked.
8. Hit Ok twice, tell it to apply it to everything, confirm for admin privileges as necessary, and ignore anything that you don't have permission to do.

At this point, most of my friends will probably follow up with, "Josiah, you're still using Windows?" Easily 95% of my time at a computer is spent in my editor PyPE or a browser (99% of my browser time is in Chrome). My editor works better in Windows than on Linux (it is atrocious on OS X), and Chrome on Windows is faster than any browser anywhere else*. Sure, I develop for Linux, but that's what a VM is for (VirtualBox FTW). Throw in NATTed internet, Samba over a second host-only connection, and configuring git to ignore any 'chmod a+x' equivalent changes, and I get the best of both worlds: my primary tools work amazing, and my dev environment can run exactly what our servers run without altering my personal workflow.

* I had a former coworker who benchmarked browsers one time and found that Chrome in Windows XP inside a VirtualBox VM was faster than any browser on the host OS X Core2 Duo laptop, or on the faster Core2 Quad Linux desktop.

I also do weekly snapshots of my dev environment so if a stray system update hoses everything, I spend about 15 minutes troubleshooting, and if that fails, I quick backup my home directory, hit the "go back to a previous snapshot" button, and restore the home directory. What about Windows 7 choking you say? Not an issue. I reboot every 2 months to pick up 2 months worth of Windows patches, even with a standby/wakeup cycle twice a day. I did that for a year and a half on Windows XP too.

For those using other systems effectively, I'm very happy for you. For me, Windows Host + Linux VM is the right answer. I hope you can understand :)

[A Dash of Technology] Building a search engine using Redis and redis-py

For those of you who aren't quite so caught up in the recent happenings in the open source server software world, Redis is a remote data structure server. You can think of it like memcached with strings, lists, sets, hashes, and zsets (hashes that you can sort by value). All of the operations that you expect are available (list push/pop from either end, sorting lists and sets, sorting based on a lookup key/hash, ...), and some that you wouldn't expect (set intersection/union, zset intersection/union with 3 aggregation methods, ...). Throw in master/slave replication, on-disk persistence, clients for most major modern languages, a fairly active discussion group to help you as necessary, and you can have a valuable new piece of infrastructure for free.

I know what you are thinking. Why would we want to build a search engine from scratch when Lucene, Xapian, and other software is available? What could possibly be gained? To start, simplicity, speed, and flexibility. We're going to be building a search engine implementing TF/IDF search using Redis, redis-py, and just a few lines of Python. With a few small changes to what I provide, you can integrate your own document importance scoring, and if one of my patches gets merged into Redis, you could combine TF/IDF with your pre-computed Pagerank... Building an index and search engine using Redis offers so much more flexibility out of the box than is available using any of the provided options. Convinced?

First things first, you need to have a recent version of Redis installed on your platform. Until 2.0 is released, you're going to need to use git head, as we'll be using some features that were not available in a "stable" release (though I've used the 1.3.x series for months). After Redis is up and running, go ahead and install redis-py.

If you haven't already done so, have a read of another great post on doing fuzzy full-text search using redis and Python over on's blog. They use metaphone/double metaphone to extract how a word sounds, which is a method of pre-processing to handle spelling mistakes. The drawback to the metaphone algorithms is that they can be overzealous in their processing, and can result in poor precision. In the past I've used the Porter Stemming algorithm to handle tense normalization (jump, jumping, jumped all become jump). Depending on the context, you can use one, neither, or both to improve search quality. For example, in the context of machine learning, metaphone tends to remove too many features from your documents to make LSI or LDA clustering worthwhile, though stemming actually helps with clustering. We aren't going to use either of them for the sake of simplicity here, but the source will point out where you can add either or both of them in order to offer those features.

Have everything up and running? Great. Let's run some tests...
>>> import redis
>>> r = redis.Redis()
>>> r.sadd('temp1', '1')
>>> r.sadd('temp2', '2')
>>> r.sunion(['temp1', 'temp2'])
set(['1', '2'])
>>> p = r.pipeline()
>>> r.scard('temp1')
>>> p.scard('temp1')
<redis.client.Pipeline object at 0x022EC420>
>>> p.scard('temp2')
<redis.client.Pipeline object at 0x022EC420>
>>> p.execute()
[1, 1]
>>> r.zunionstore('temp3', {'temp1':2, 'temp2':3})
>>> r.zrange('temp3', 0, -1, withscores=True)
[('1', 2.0), ('2', 3.0)]

Believe it or not, that's more or less the meat of everything that we're going to be using. We add items to sets, union some sets with weights, use pipelines to minimize our round-trips, and pull the items out with scores. Of course the devil is in the details.

The first thing we need to do in order to index documents is to parse them. What works fairly well as a start is to only include alpha-numeric characters. I like to throw in apostrophes for contractions like "can't", "won't", etc. If you use the Porter Stemmer or Metaphone, contractions and ownerships (like Joe's) can be handled automatically. Pro tip: if you use stemming, don't be afraid to augment your stemming with a secondary word dictionary to ensure that what the stemmer produces is an actual base word.

In our case, because indexing and index removal are so similar, we're going to overload a few of our functions to do slightly different things, depending on the context. We'll use the simple parser below as a start...
import collections
import re

NON_WORDS = re.compile("[^a-z0-9' ]")
# stop words pulled from the below url
STOP_WORDS = set('''a able about across after all almost also am
among an and any are as at be because been but by can cannot
could dear did do does either else ever every for from get got
had has have he her hers him his how however i if in into is it
its just least let like likely may me might most must my neither
no nor not of off often on only or other our own rather said say
says she should since so some than that the their them then
there these they this tis to too twas us wants was we were what
when where which while who whom why will with would yet you
'''.split())

def get_index_keys(content, add=True):
    # Very simple word-based parser.  We skip stop words and
    # single character words.
    words = NON_WORDS.sub(' ', content.lower()).split()
    words = [word.strip("'") for word in words]
    words = [word for word in words
                if word not in STOP_WORDS and len(word) > 1]
    # Apply the Porter Stemmer here if you would like that
    # functionality.
    # Apply the Metaphone/Double Metaphone algorithm by itself,
    # or after the Porter Stemmer.
    if not add:
        return words
    # Calculate the TF portion of TF/IDF.  The counts are floats
    # so that the division below is never integer division.
    counts = collections.defaultdict(float)
    for word in words:
        counts[word] += 1
    wordcount = len(words)
    tf = dict((word, count / wordcount)
                for word, count in counts.items())
    return tf

In document search/retrieval, stop words are those words that are so common as to be mostly worthless to indexing or search. The set of common words provided is a little aggressive, but it also helps to keep searches directed to the content that is important.
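
To make the TF part concrete, here is a small self-contained sketch of the same counting on one sentence. It assumes a tiny made-up stop word set (DEMO_STOP_WORDS) rather than the full list, and term_frequencies is just an illustrative name, not part of the code in this post.

```python
import collections
import re

# Assumption: a tiny stop word set for illustration only.
DEMO_STOP_WORDS = {"the", "a", "is", "of", "and"}
DEMO_NON_WORDS = re.compile(r"[^a-z0-9' ]")


def term_frequencies(content):
    """Return {word: fraction of the kept words} for one document."""
    words = DEMO_NON_WORDS.sub(" ", content.lower()).split()
    words = [word.strip("'") for word in words]
    # Drop stop words and single-character words.
    words = [word for word in words
             if word not in DEMO_STOP_WORDS and len(word) > 1]
    counts = collections.Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


tf = term_frequencies("The cat and the other cat can't sleep.")
for word, value in sorted(tf.items()):
    print(word, value)
```

Five words survive the filtering here ("cat" twice, plus "other", "can't", and "sleep"), so "cat" ends up with a TF of 2/5 and the rest with 1/5 each.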

In your own code, feel free to tweak the parsing to suit your needs. Phrase parsing, url extraction, hash tags, @tags, etc., are all very simple and useful additions that can improve searching quality on a variety of different types of data. In particular, don't be afraid to create special tokens to signify special cases, like "has_url" or "has_attachment" for email indexes, "is_banned" or "is_active" for user searches.
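
As a rough sketch of the special-token idea, the function below pulls out a "has_url" marker plus any #tags and @tags before the regular word parser gets a chance to strip the punctuation. The regular expressions and the special_tokens name are my own assumptions for illustration; real-world URL and tag matching usually needs more care (email addresses, for example, would show up as @tags here).

```python
import re

# Assumption: deliberately simple patterns, good enough for a demo.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def special_tokens(content):
    """Return the set of extra index tokens found in content."""
    tokens = set()
    if URL_RE.search(content):
        tokens.add("has_url")
    # Remove urls first so that "#fragment" parts are not read as tags.
    text = URL_RE.sub(" ", content)
    tokens.update("#" + tag.lower() for tag in HASHTAG_RE.findall(text))
    tokens.update("@" + name.lower() for name in MENTION_RE.findall(text))
    return tokens


tokens = special_tokens("Thanks @Josiah, see http://example.com/a#b #Redis")
print(sorted(tokens))
```

Each of these tokens can then be indexed like any other word, for example with a fixed score instead of a computed TF.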

Now that we have parsing, we merely need to add our term frequencies to the proper redis zsets. Just like getting our keys to index, adding and removing from the index are almost identical, so we'll be using the same function for both tasks...

def handle_content(connection, prefix, id, content, add=True):
    # Get the keys we want to index.
    keys = get_index_keys(content)
    # Use a non-transactional pipeline here to improve
    # performance.
    pipe = connection.pipeline(False)
    # Since adding and removing items are exactly the same,
    # except for the method used on the pipeline, we will reduce
    # our line count.
    if add:
        pipe.sadd(prefix + 'indexed:', id)
        for key, value in keys.items():
            # Legacy redis-py 2.x argument order (name, member,
            # score); redis-py 3.0 and later expect
            # pipe.zadd(name, {member: score}) instead.
            pipe.zadd(prefix + key, id, value)
    else:
        pipe.srem(prefix + 'indexed:', id)
        for key in keys:
            pipe.zrem(prefix + key, id)
    # Execute the insertion/removal.
    pipe.execute()
    # Return the number of keys added/removed.
    return len(keys)

In Redis, pipelines allow for the bulk execution of commands in order to reduce the number of round-trips, optionally including non-locking transactions (a transaction will fail if someone modifies keys that you are watching; see the Redis wiki on its semantics and use). For Redis, fewer round-trips translate into improved performance, as the slow part of most Redis interactions is network latency.

The entirety of the above handle_content() function basically just added or removed some zset key/value pairs. At this point we've indexed our data. The only thing left is to search...
import math
import os

def search(connection, prefix, query_string, offset=0, count=10):
    # Get our search terms just like we did earlier...
    keys = [prefix + key
            for key in get_index_keys(query_string, False)]
    if not keys:
        return [], 0
    total_docs = max(
        connection.scard(prefix + 'indexed:'), 1)
    # Get our document frequency values...
    pipe = connection.pipeline(False)
    for key in keys:
        pipe.zcard(key)
    sizes = pipe.execute()
    # Calculate the inverse document frequencies...
    def idf(count):
        # Calculate the IDF for this particular count
        if not count:
            return 0
        # float() avoids integer division under Python 2.
        return max(math.log(float(total_docs) / count, 2), 0)
    idfs = map(idf, sizes)
    # And generate the weight dictionary for passing to
    # zunionstore.
    weights = dict((key, idfv)
            for key, size, idfv in zip(keys, sizes, idfs)
                if size)
    if not weights:
        return [], 0
    # Generate a temporary result storage key
    temp_key = prefix + 'temp:' + os.urandom(8).encode('hex')
    try:
        # Actually perform the union to combine the scores.
        known = connection.zunionstore(temp_key, weights)
        # Get the results.
        ids = connection.zrevrange(
            temp_key, offset, offset+count-1, withscores=True)
    finally:
        # Clean up after ourselves.
        connection.delete(temp_key)
    return ids, known

Breaking it down, the first part parses the search terms the same way as we did during indexing. The second part fetches the number of documents that have that particular word, which is necessary for the IDF portion of TF/IDF. The third part calculates the IDF, packing it into a weights dictionary. Then finally, we use the ZUNIONSTORE command to take individual TF scores for a given term, multiply them by the IDF for the given term, then combine based on the document id and return the highest scores. And that's it.
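
If you want to check the arithmetic without a Redis server handy, the scoring can be mimicked with plain dictionaries. This is only a sketch under stated assumptions: the index contents and document ids are made up, weighted_union stands in for ZUNIONSTORE with weights and its default SUM aggregation, and I use math.log2 here for tidy numbers.

```python
import math

# Assumption: a made-up index of term -> {doc_id: TF}, standing in
# for the per-term zsets. Four documents are indexed in total.
index = {
    "redis": {"doc1": 0.5, "doc2": 0.25},
    "search": {"doc2": 0.5},
}
TOTAL_DOCS = 4


def weighted_union(index, weights):
    """Sum TF * weight per document, like a weighted zset union."""
    scores = {}
    for term, weight in weights.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * weight
    return scores


def rank(index, total_docs, terms, count=10):
    # IDF per term: log2(total documents / documents with the term).
    weights = {}
    for term in terms:
        doc_freq = len(index.get(term, {}))
        if doc_freq:
            weights[term] = max(math.log2(total_docs / doc_freq), 0.0)
    scores = weighted_union(index, weights)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:count]


results = rank(index, TOTAL_DOCS, ["redis", "search"])
print(results)
```

With these numbers, "redis" appears in 2 of 4 documents (IDF 1) and "search" in 1 of 4 (IDF 2), so doc2 scores 0.25 * 1 + 0.5 * 2 = 1.25 and outranks doc1 at 0.5 * 1 = 0.5.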

No, really. Those snippets are all it takes to build a working and functional search engine using Redis. I've gone ahead and tweaked the included snippets to offer a more useful interface, as well as a super-minimal test case. You can find it as this Github Gist.

A few ideas for tweaks/improvements:

  • You can replace the TF portion of TF/IDF with the constant 1. Doing so allows us to replace the zset document lists with standard sets, which will reduce Redis' memory requirements significantly for large indexes. Depending on the documents you are indexing/searching, this can reduce or improve the quality of search results significantly. Don't be afraid to test both ways.

  • Search quality on your personal site is all about parsing.
    • Parse your documents so that your users can find them in a variety of ways. As stated earlier: @tags, #tags, ^references (for twitter/social-web like experiences), phrases, incoming/outgoing urls, etc.

    • Parse your search queries in an intelligent way, and do useful things with it. If someone provides "web history search +firefox -ie" as a search query, boost the IDF for the "firefox" term and make the IDF negative for the "ie" term. If you have tokens like "has_url", then look for that as part of the search query.
      • If you are using the TF weight as 1, and have used sets, you can use the SDIFF command to explicitly exclude those sets of documents with the -negated terms.

  • There are three commands in search that are executed outside of a pipeline. The first one can be merged into the pipeline just after, but you'll have to do some slicing. The ZUNIONSTORE and ZRANGE calls can be combined into another pipeline, though their results need to be reversed with respect to what the function currently returns.

  • You can store all of the keys indexed for a particular document id in a set. Un-indexing any document then can be performed by fetching the set names via SMEMBERS, followed by the relevant ZREM calls, the one 'indexed' SREM call, and the deletion of the set that contained all of the indexed keys. Also, if you get an index call for a document that is already indexed, you can either un-index and re-index, or you can return early. It's up to you to determine the semantics you want.
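
The "+firefox -ie" idea from the list above could look something like the sketch below. It assumes a boost factor of 2.0 (pick your own), and parse_query is a hypothetical helper that would need to run before the regular word parser, since that parser strips the + and - characters. The returned multipliers are meant to be applied to each term's IDF before the weights go to ZUNIONSTORE.

```python
def parse_query(query, boost=2.0):
    """Split a query into {term: multiplier} for scaling each term's IDF.

    +term gets the boost, -term gets a negative multiplier, and
    everything else is left at 1.0.
    """
    multipliers = {}
    for token in query.lower().split():
        sign = token[0] if token[0] in "+-" else ""
        term = token[len(sign):]
        if not term:
            # A bare "+" or "-" carries no term; skip it.
            continue
        if sign == "+":
            multipliers[term] = boost
        elif sign == "-":
            multipliers[term] = -1.0
        else:
            multipliers[term] = 1.0
    return multipliers


multipliers = parse_query("web history search +firefox -ie")
print(multipliers)
```

A negative multiplier pushes documents containing the term down the ranking rather than removing them outright; for hard exclusion, the set-based SDIFF approach mentioned above is the better fit.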

There are countless improvements that can be done to this basic index/search code. Don't be afraid to try different ideas to see what you can build.

Using Redis to build search is great for your personal site, your company intranet, your internal customer search, maybe even one of your core products. But be aware that Redis keeps everything in memory, so as your index grows, so do your machine requirements. Naive sharding tricks may work to a point, but there will be a point where your merging will have to turn into a tree, and your layers of merges start increasing your latency to scary levels.
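
For a sense of what the naive merge step looks like, here is a minimal sketch. It assumes each document lives on exactly one shard, that every shard returns its own top (doc_id, score) list, and that the scores are comparable across shards (which really means computing IDF from global document counts, something this sketch does not do). merge_shard_results is a made-up name.

```python
import heapq
import itertools


def merge_shard_results(shard_results, count=10):
    """Combine per-shard [(doc_id, score), ...] lists into one top list."""
    all_hits = itertools.chain.from_iterable(shard_results)
    # Keep the highest-scoring hits across every shard.
    return heapq.nlargest(count, all_hits, key=lambda hit: hit[1])


top = merge_shard_results(
    [[("a", 3.0), ("b", 1.0)], [("c", 2.5)], []], count=2)
print(top)
```

One merge like this is cheap; the latency trouble starts when there are enough shards that merges have to feed into other merges.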

My personal search history:
I first got into the indexing/search world doing some contract work for Affini. At Affini, William I. Chang taught me the fundamentals of natural language indexing and search, and let me run wild to bootstrap an ad targeting system over Livejournal users' interests, location, age, gender, ... Combined with free micropayments, a patent for a method to remove spam email from your inbox, craigslist search subscriptions (delivered to your inbox), and targeted advertising, Affini seemed to be poised to take over... until it didn't. That happens in the startup world.

Back then, there was no Redis. I built our search infrastructure from scratch; parsing in Python, indexing and search using Pyrex and C. The same system wrapped our email storage backend to allow for email searching, and wrapped our incoming craigslist ads to allow us to direct subscription messages via live search. These problems are much easier with Redis.

New tech blog? Yes.

I've been posting to livejournal for a few months shy of 8 years now. For the 3-4 years before, I posted chunks of content to my own personal web page. Livejournal was a huge step up in the right direction. Over that time, my LJ has been a collection of opinions, politics, technology, personal posts, etc. I'm going to change that a bit.

I just started a new tech blog, . Posts made there will be syndicated here automatically with the subject prefix of [A Dash of Technology]. I'm not going anywhere, I'm just pulling my tech-related stuff over there (because it may actually be interesting), which will arrive here almost immediately, and I'll leave any other personal stuff here.

I also have a twitter account that is going to be tech/work related microblogs , along with my personal Facebook for smaller stuff that is posted here. For the nerdier among you, I had to draw a diagram to make sure that my posts were syndicating out to the right places without duplication.

Hello new world!

P.S. if someone wants to make a huge amount of money in this social networking space, build a tool that auto-syndicates between all of the major services.

Unethical? Really?

Today I was called unethical. Why? Because I had posted an ad to give away my bed. In the ad I said "First come, first served."

A woman had called about an hour after I had posted to schedule a time to check it out and maybe pick it up... tomorrow morning at 9AM. Her friend would be coming along to help her. I got two other calls. The second was a guy with a truck who would be coming by in a half hour to pick it up, while I was at my old place...

So I said sure, come by, take it away. He did. I felt a little bad, but not really. It's a free bed, "First come, first served." The guy showed up first, he got it. Done deal.

I call the woman back, she sounds very happy to hear my voice. I tell her that she's probably not going to be as happy with me in a minute. I apologize, and explain that a guy was in the neighborhood and picked up the bed about an hour before. She says "no you didn't!", followed by "how could you do that!", "you're unethical!", "we made arrangements!", ... I tried to explain that the ad said "first come, first served". She tried to claim that because she had called first, that is first served.

Personally, I can't believe how angry she got at me for giving away something that I had every right to give away. How many times have I called places for something that I would *pay* for, made arrangements to check it out, and had someone else buy it first? Countless. Have I ever been angry? Of course not! You can't get angry at people for selling (or giving away) something they have the right to sell (or give away) at their discretion.

Next time I'm just going to leave it in an alley and post the details on CL. It will save me so much hassle. Weirdo CL people.

Dear Lazyweb

I've got a Dell U2410 widescreen monitor at work. When I disconnect the cable from my computer, the monitor goes into a "hey, no cable is connected" screen, and stays that way. It never goes into standby.

Dell's forums have examples of people who have run into the exact same issues with the same monitor. I have a coworker with the same issue, and two coworkers whose identical monitors *do* standby after a few minutes of having no inputs. I'm guessing it's a monitor revision issue (the two that standby are older monitors, the two that don't are newer).

Anyone else have the same issues and/or know how to solve them?

Update 9/12: I discovered the difference months ago. I was using the DisplayPort input for the monitor, another who had the same problem was using an analog VGA cable, but the two that weren't having issues were using the mini-DVI connectors for their MacBooks to attach to the DVI adapter in the monitor. Removing the adapter on the other monitors gives them the same "no standby" problem. I've gotten used to turning the monitor on and off regularly, so it's no longer really an issue.

Max Lemky

A family friend died today, Max Lemky. I found out because my aunt pinged me on Facebook when I just happened to be at my computer.

It's hard to say how much this will affect my family, as I've known Max my entire life. He was one of those old-school hippies, buying, fixing, modifying, selling VW bugs and vans for as long as I can remember. I'd see him every couple years, hanging out in Sturgis when we were all around, maybe at his mother's place in Thief River Falls, MN; maybe even sleeping on the couch for a weekend or week while trying to sell off one of his newest acquisitions (at least in the 80's and early 90's).

He was always great to talk to. He had stories about parties, about people he'd met, about having to deal with mechanical breakdowns, all of it and more. In the 90's, he met a nice woman and fathered a son, Sammy, among her two previous children (they were great btw, very friendly and well-behaved, even in the insanity of Sturgis camping). In 1998 he gave me an awesome 60's vintage Yamaha tube amplifier he'd picked up for cheap in his travels. He'd already picked up some Marshall stack amps, and had a collection of classic rare guitars that he couldn't fit in his van(s) in his travels. Speaking of which, he could play. Damn he could play.

Max was one of my dad's best friends, and a great friend of the family. He's survived by his son Sammy and mother Mable, and untold amounts of partially-working vehicles, guitars, amps, ... Max, I'm sorry to see you go, and I'm so sorry it's been so long since I've seen you. Good night dear friend.