Improving search with machine learning
Ever since I wrote a small client/server demonstrating AI::Categorizer for Sydney University's Web Engineering group (where AI::Categorizer's author, Ken Williams, was researching at the time), I have been interested in the practical application of machine learning and AI in particular (you can see Ken's PowerPoint presentation about the demo and theses here if you're interested).
Lately I have been wondering why I don't see it appearing in search solutions more obviously. Sure, there's Google personalized search, but I'm not sure I've seen amazing improvements. Certainly I have not been made aware of what those improvements are (and Google's public releases don't seem to mention AI). I can think of any number of ways that existing, proven AI techniques could improve search, so I've decided to do something about it!

First off, there is the issue of irrelevant topics returned in my queries. A classic example is "Java". We all know it's a somewhat popular programming language. It's a type of coffee too, right? But it's also the name of the most populous island on Earth (and holds the capital of the world's 4th most populous nation), which also contains some active volcanoes. Want to know more about it? Well, if you go to Google and search for "Java" you're going to need more patience than me - I got to search result page 10 with no mention of anything other than the programming language (with the exception of one Wikipedia result on page 3, I think).

So I should learn to write better search queries like "Java island". Or use a directory like Yahoo!. Those are pretty much my options currently. What about combining the two? What if I could refine my search results by selecting from the top categories represented in the result set, much like you do when you search yellowpages.com.au for "tyres" - you can drill down to "tyres - retail and fitting".

So assuming I'm Google and already have an index of the web, all I need is to categorize every page on the web. Armed with the already human-categorized data set (or corpus) that dmoz.org handily offers in RDF form, I can train my machine learning categorizer robot. Then I can run that categorizer over the search result set and I have everything I need. So how effective could that be?
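To make the refine-by-category idea concrete, here is a minimal sketch in Python. It is not my prototype and it does not use AI::Categorizer: it is a tiny hand-rolled naive Bayes categorizer, and the training snippets, category labels and search results are all made up for illustration (they only stand in for the dmoz.org corpus and a real result set). The names `NaiveBayesCategorizer` and `facet` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Toy multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs
        self.vocab = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def categorize(self, text):
        # Ignore words never seen in training; they carry no evidence.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Made-up stand-in for a human-categorized corpus (NOT real dmoz data).
TRAINING = [
    ("java programming language compiler class object virtual machine",
     "Computers/Programming"),
    ("java applet jdk source code developer api", "Computers/Programming"),
    ("java indonesia island jakarta volcano travel population",
     "Regional/Asia"),
    ("island bali indonesia volcano merapi eruption map", "Regional/Asia"),
    ("java coffee beans roast espresso arabica brew", "Recreation/Food"),
    ("coffee roast espresso cafe beans cup", "Recreation/Food"),
]

# Made-up result snippets for the query "Java".
RESULTS = [
    "Download the Java developer kit and API source code",
    "Java virtual machine and compiler tutorial",
    "Mount Merapi volcano on the island of Java, Indonesia",
    "Buy fresh roast Java coffee beans",
]


def facet(results, categorizer):
    """Group result snippets by their predicted category."""
    by_category = defaultdict(list)
    for snippet in results:
        by_category[categorizer.categorize(snippet)].append(snippet)
    return dict(by_category)


categorizer = NaiveBayesCategorizer()
for text, category in TRAINING:
    categorizer.train(text, category)

facets = facet(RESULTS, categorizer)

# Show categories by how many results they hold, like a drill-down menu.
for category, snippets in sorted(facets.items(), key=lambda kv: -len(kv[1])):
    print(f"{category} ({len(snippets)})")
```

Picking one of the printed categories and showing only `facets[category]` is the drill-down step. A real version would categorize full pages rather than snippets, and would need far more training data than six lines.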
Well, my rough prototype that I banged together over the weekend has only been trained on 500 documents, because I'm running low on my DSL quota for the month. I'm going to wait until nearer the end of my billing cycle before letting it loose on the whole of the dmoz data, but similar machine learning experiments I have seen have consistently landed in the 98-99% accuracy band.

The other obvious way to improve results is to apply AI to the data set made up of [my search terms, which link(s) I chose]. I guess this is what Google personalized search does, but there is so much more you could do with this data - like using clustering to provide Amazon-style suggestions: "other people whose search terms are similar to yours found this link useful".

I'm pretty excited - this is great stuff and very applicable to our current data-laden world. Plus I just like data :)

02:05 PM, 20 Nov 2006 by Mark Aufflick

Everyday Parallelization
I just got back from some last-minute purchases at our local Coles metro and it struck me how much I enjoy the checkout procedure.
After you have loaded all the goods out of your basket onto the bench, there are always a few moments while the cashier scans the remaining items. With the Coles system, I can swipe my frequent-shopper card, then my credit card, and finally select the desired account while that is happening.

It's a classically parallelizable situation: two interdependent resources with multiple duties, some distinct, some shared. One resource (the cashier) almost always takes longer, so it makes sense to allow the other resource (the shopper) to complete as many tasks as possible while it (you) would otherwise be in a blocked state.

It reminds me of discussions I used to have with my friend Matt about the implicit least-cost decisions involved in everyday situations, such as picking pedestrian routes. Does anyone else have examples of well-designed human processes that exhibit good parallelization? Or the opposite?

PS: apologies for gratuitously making up words from the parallel stem!

01:12 AM, 20 Nov 2006 by Mark Aufflick