
Improving search with machine learning

Ever since I wrote a small client/server demonstrating AI::Categorizer for Sydney University's Web Engineering group (where AI::Categorizer's author Ken Williams was researching at the time) I have been interested in the practical application of machine learning and AI in particular (you can see Ken's PowerPoint presentation about the demo and theses here if you're interested).

Lately I have been wondering why I don't see it appearing in search solutions more obviously. Sure, there's Google personalized search, but I'm not sure I've seen amazing improvements. Certainly I have not been made aware of what those improvements are (and Google's public releases don't seem to mention AI).

I can think of any number of ways that existing, proven AI techniques could improve search, so I've decided to do something about it!

First off, there is the issue of irrelevant topics returned in my queries. A classic example is "Java". We all know it's a somewhat popular programming language. It's a type of coffee too, right? But it's also the name of the most populous island on earth (home to the capital of the world's 4th most populous nation), which also has some active volcanoes. Want to know more about it? Well, if you go to Google and search for "Java" you're going to need more patience than me - I got to search result page 10 with no mention of anything other than the programming language (with the exception of one Wikipedia result, on page 3 I think).

So I should learn to write better search queries like "Java island". Or use a directory like Yahoo!.

Those are pretty much my options currently. What about combining the two? What if I could refine my search results by selecting from the top categories represented in the result set, much like you do when you search yellowpages.com.au for "tyres" - you can drill down to "tyres - retail and fitting". So assuming I'm Google and already have an index of the web, all I need is to categorize every page on the web.
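
To make the drill-down idea a bit more concrete, here is a minimal Python sketch. It assumes each search result already carries a category label; the result titles, the category names and the helper names top_categories and refine are all made up for illustration.

```python
from collections import Counter

# Hypothetical search results for "Java": (title, category) pairs.
# The categories are assumed to have been assigned already.
RESULTS = [
    ("Java SE downloads", "Computers/Programming"),
    ("Learn Java in 21 days", "Computers/Programming"),
    ("Java (island) travel guide", "Regional/Asia/Indonesia"),
    ("Java coffee beans", "Shopping/Food/Coffee"),
    ("Java language tutorial", "Computers/Programming"),
    ("Volcanoes of Java", "Regional/Asia/Indonesia"),
]


def top_categories(results, n=3):
    """Return the n most common categories in the result set, with counts."""
    return Counter(category for _, category in results).most_common(n)


def refine(results, category):
    """Keep only the titles filed under the chosen category."""
    return [title for title, cat in results if cat == category]


if __name__ == "__main__":
    # Show the categories the searcher could drill down into.
    for category, count in top_categories(RESULTS):
        print(f"{category} ({count})")
    # Drill down to the island rather than the language.
    print(refine(RESULTS, "Regional/Asia/Indonesia"))
```

The counting is the easy part; the hard part is getting a category onto every page in the first place.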

Armed with the already human-categorized data set (or corpus) that dmoz.org handily offers in RDF form, I can train my machine learning categorizer robot. Then I can run that categorizer over the search result set and I have everything I need.
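
As a rough illustration of the train-then-categorize step, here is a toy naive Bayes categorizer in Python. This is not my prototype and not AI::Categorizer's API: the class name, the training snippets and the three categories are invented for the example, and a real run would train on the dmoz data rather than five hand-written lines.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def categorize(self, text):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # prior for this category
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up training snippets standing in for human-categorized pages.
TRAINING = [
    ("java compiler class object virtual machine", "Computers"),
    ("programming language code library java api", "Computers"),
    ("island indonesia jakarta volcano java sea", "Regional"),
    ("travel indonesia island beaches temples", "Regional"),
    ("coffee beans roast espresso java brew", "Food"),
]

categorizer = NaiveBayesCategorizer()
for text, category in TRAINING:
    categorizer.train(text, category)

if __name__ == "__main__":
    print(categorizer.categorize("active volcano on the island of java"))
    print(categorizer.categorize("java api for the virtual machine"))
```

With this tiny training set the first query lands in "Regional" and the second in "Computers", even though both contain the word "java".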

So how effective could that be? Well, the rough prototype that I banged together over the weekend has only been trained on 500 documents. I'm running low on my DSL quota for the month, so I'm going to wait until nearer the end of my billing cycle before letting it loose on the whole of the dmoz data. Similar machine learning experiments I have seen, though, have consistently resulted in the 98-99% accuracy band.

The other obvious way to improve results is to apply AI to the data set made up of [my search terms, which link(s) I chose]. I guess this is what Google personal search does, but there is so much more you could do with this data - like using clustering to provide Amazon-style suggestions "other people whose search terms are similar to yours found this link useful".
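
Here is a small Python sketch of that suggestion idea. To keep it short it swaps proper clustering for a simpler nearest-neighbour lookup using Jaccard similarity between sets of search terms. The click log, the example.org links, the threshold and the function names are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical click log: (search terms, link the searcher chose).
CLICK_LOG = [
    ("java island volcano", "example.org/java-volcanoes"),
    ("java island travel", "example.org/visit-java"),
    ("indonesia java travel", "example.org/visit-java"),
    ("java programming tutorial", "example.org/learn-java"),
    ("java compiler download", "example.org/jdk"),
]


def jaccard(a, b):
    """Overlap between two term sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest(query, log, threshold=0.3, n=3):
    """Rank links chosen by people whose search terms resemble this query."""
    terms = set(query.lower().split())
    scores = Counter()
    for past_query, link in log:
        similarity = jaccard(terms, set(past_query.lower().split()))
        if similarity >= threshold:
            # Each sufficiently similar past search votes for its link.
            scores[link] += similarity
    return [link for link, _ in scores.most_common(n)]


if __name__ == "__main__":
    print(suggest("java island holiday travel", CLICK_LOG))
```

Searches that only share the word "java" fall below the threshold, so the programming links are not suggested to someone planning a holiday.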

I'm pretty excited - this is great stuff and very applicable to our current data-laden world. Plus I just like data :)

02:05 PM, 20 Nov 2006 by Mark Aufflick

Everyday Parallelization

I just got back from some last minute purchases at our local Coles metro and it struck me how much I enjoy the checkout procedure.

After you have loaded all the goods out of your basket onto the bench, there are always a few moments while the cashier scans the remaining items. With the Coles system, I can swipe my frequent-shopper card, then my credit card and finally select the desired account.

It's a classically parallelizable situation. Two interdependent resources with multiple duties, some distinct, some shared. One resource (the cashier) almost always takes longer, so it makes sense to allow the other resource (the shopper, i.e. you) to complete as many tasks as possible while it would otherwise be in a blocked state.
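
The saving is easy to put in numbers. The Python sketch below compares the two arrangements; the durations are invented and only the shape of the comparison matters.

```python
# Made-up durations in seconds for the two resources at the checkout.
CASHIER_SCANNING = 30
SHOPPER_STEPS = {
    "swipe frequent-shopper card": 5,
    "swipe credit card": 5,
    "select account": 3,
}


def serial_time(cashier, shopper_steps):
    """The shopper waits for scanning to finish before touching the terminal."""
    return cashier + sum(shopper_steps.values())


def overlapped_time(cashier, shopper_steps):
    """Both work at once, so the slower of the two sets the total."""
    return max(cashier, sum(shopper_steps.values()))


if __name__ == "__main__":
    print("serial:", serial_time(CASHIER_SCANNING, SHOPPER_STEPS))
    print("overlapped:", overlapped_time(CASHIER_SCANNING, SHOPPER_STEPS))
```

With these numbers the shopper's 13 seconds disappear entirely into the cashier's 30.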

It reminds me of discussions I used to have with my friend Matt about the implicit least-cost decisions involved in everyday situations, such as picking pedestrian routes.

Does anyone else have examples of well designed human processes that exhibit good parallelization? Or the opposite?

PS: apologies for gratuitously making up words from the parallel stem!

01:12 AM, 20 Nov 2006 by Mark Aufflick
