
Improving search with machine learning

Ever since I wrote a small client/server demonstrating AI::Categorizer for Sydney University's Web Engineering group (where AI::Categorizer's author, Ken Williams, was researching at the time), I have been interested in the practical application of machine learning and AI in particular (you can see Ken's PowerPoint presentation about the demo and theses here if you're interested).

Lately I have been wondering why I don't see it appearing in search solutions more obviously. Sure, there's Google personalized search, but I'm not sure I've seen amazing improvements. Certainly I have not been made aware of what those improvements are (and Google's public releases don't seem to mention AI).

I can think of any number of ways that existing, proven AI techniques could improve search, so I've decided to do something about it!

First off, there is the issue of irrelevant topics returned in my queries. A classic example is "Java". We all know it's a somewhat popular programming language. It's a type of coffee too, right? But it's also the name of the most populous island on Earth (home to the capital of the world's 4th most populous nation), and it has some active volcanoes. Want to know more about it? Well, if you go to Google and search for "Java" you're going to need more patience than me - I got to search result page 10 with no mention of anything other than the programming language (with the exception of one Wikipedia result, on page 3 I think).

So I should learn to write better search queries like "Java island". Or use a directory like Yahoo!.

Those are pretty much my options currently. What about combining the two? What if I could refine my search results by selecting from the top categories represented in the result set, much like you do when you search yellowpages.com.au for "tyres" - you can drill down to "tyres - retail and fitting". So assuming I'm Google and already have an index of the web, all I need is to categorize every page on the web.
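To make the "drill down by category" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the result titles, the category names and the helper functions top_categories and refine are all made up, and the categories are assigned by hand rather than by a categorizer.

```python
from collections import Counter

# Hypothetical search results for "java": (title, category) pairs.
# The categories are hand-assigned here purely for illustration.
results = [
    ("Java SE downloads", "Computers/Programming"),
    ("Java tutorial for beginners", "Computers/Programming"),
    ("Java, Indonesia travel guide", "Regional/Asia"),
    ("Volcanoes of Java", "Science/Earth Sciences"),
    ("Java coffee beans", "Shopping/Food"),
    ("Java API reference", "Computers/Programming"),
]

def top_categories(results, n=3):
    """Return the n most common categories in the result set, with counts."""
    return Counter(category for _, category in results).most_common(n)

def refine(results, category):
    """Keep only the titles filed under the chosen category."""
    return [title for title, cat in results if cat == category]

facets = top_categories(results)             # what the user gets to pick from
island_hits = refine(results, "Regional/Asia")  # what they see after picking
```

The point is only that once every result carries a category, offering the refinement is cheap: one count over the result set and one filter.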

Armed with the already human-categorized data set (or corpus) that dmoz.org handily offers in RDF form, I can train my machine learning categorizer robot. Then I can run that categorizer over the search result set and I have everything I need.
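As a rough idea of what "training a categorizer" involves, here is a small naive Bayes text classifier in plain Python. This is a sketch under assumptions, not my prototype: naive Bayes is just one common choice for text categorization, the four training documents are invented stand-ins for the dmoz data, and the names NaiveBayes, train and classify are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocab = set()                       # every word seen in training

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for category in self.doc_counts:
            # log prior: how common the category is in the training set
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # log likelihood of each word, smoothed so unseen words
                # do not zero out the whole category
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

# Invented training data standing in for a human-categorized corpus.
corpus = [
    ("java compiler class object programming language code", "Computers"),
    ("python programming code compiler software", "Computers"),
    ("island volcano indonesia jakarta travel java", "Regional"),
    ("island beach travel capital population", "Regional"),
]

classifier = NaiveBayes()
for text, category in corpus:
    classifier.train(text, category)

island_guess = classifier.classify("volcano on the island of java")
code_guess = classifier.classify("java programming language compiler")
```

With a corpus this tiny the result proves nothing about accuracy; it only shows the shape of the train-then-classify loop that would be run over each page in a result set.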

So how effective could that be? Well, my rough prototype, which I banged together over the weekend, has only been trained on 500 documents. I'm running low on my DSL quota for the month, so I'm going to wait until nearer the end of my billing cycle before letting it loose on the whole of the dmoz data. Similar machine learning experiments I have seen, though, have consistently landed in the 98-99% accuracy band.

The other obvious way to improve results is to apply AI to the data set made up of [my search terms, which link(s) I chose]. I guess this is what Google personal search does, but there is so much more you could do with this data - like using clustering to provide Amazon-style suggestions "other people whose search terms are similar to yours found this link useful".
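Here is a toy Python sketch of the "people with similar search terms found this link useful" idea. To be clear about the swap: this is not real clustering, just a nearest-neighbour lookup using Jaccard similarity between the sets of search terms, which is about the simplest thing that shows the principle. The click log, the URLs, the 0.3 threshold and the function names are all made up.

```python
# Hypothetical click log: (search terms, link the user chose).
click_log = [
    ("java island volcano", "http://example.org/volcanoes-of-java"),
    ("java island travel", "http://example.org/java-travel-guide"),
    ("java programming tutorial", "http://example.org/learn-java"),
    ("java compiler download", "http://example.org/jdk"),
]

def jaccard(query_a, query_b):
    """Shared terms divided by all distinct terms across the two queries."""
    a, b = set(query_a.split()), set(query_b.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest(query, log, threshold=0.3):
    """Links chosen by people whose queries resemble this one, best first."""
    scored = [(jaccard(query, terms), link) for terms, link in log]
    kept = [(score, link) for score, link in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in kept]

suggestions = suggest("java island history", click_log)
```

In this made-up log, the two "island" queries share two of four distinct terms with the new query and pass the threshold, while the programming queries share only "java" and are dropped.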

I'm pretty excited - this is great stuff and very applicable to our current data-laden world. Plus I just like data :)

10:05 PM, 19 Nov 2006 by Mark Aufflick

You only needed to 'ask'

http://www.ask.com/web?q=java See the options in the right column.

by Unregistered Visitor on 11/21/06

Mmm yes

Yes, excellent. I think I might be changing search engines! This reminds me of a quote in the Google book where a professor was trying out Sergey Brin and Larry Page's test engine, googled his name and got a useful result. "That never happens with any other search engine," he said. Results are where it's at for web search. I think the opportunity is there for someone (maybe ask.com or maybe a smart aggregator like topix.com) to really steal some of Google's market share.

by Mark Aufflick on 11/21/06
