January 4, 2004

Googlebot the Nomad

So we've got the Deskbar and the Google Toolbar and they're great. I'm assuming one of you geniuses can probably fill me in on a larger question, though: Can Google ever make a desktop search engine, not merely a web client, with indexing of my local content like email or files? I am just curious, because it seems as if any effort to do so would reveal the PageRank algorithm, either through someone hacking the engine executable itself or by analyzing the data index that the bot would create.

Does this mean that poor Googlebot is cursed to wander the net all his days, never finding a home on a single machine? Never resting? Poor weary Googlebot.

23 Comments

Wouldn't such reverse engineering already be possible by anyone with access to a Google Search Appliance?

I would think, jjg, that would be a tad hard, since their software algorithms would be hard-wired as hardware instructions. Try writing an algorithm that searches a data index in a language like assembler (MC68K). One way, however, would be to find out where the appliances are manufactured (Taiwan/China?) and pursue it from that angle, since those guys would have the blueprint ... if you want to!!

Since Google has expanded to company intranets and other "local" pools of files and already indexes files other than html (doc, pdf, for a start), I should think they could definitely produce a version to help you access and search your own files. The reason they likely don't want to is just as you describe.

Equally, however, why would your desktop version of Google need PageRank? Surely the most relevant results for you on your own HDD are going to be all about when files were created, how often they have been accessed, where the search terms appear, etc., rather than where you have linked to them yourself from other documents on your HDD?
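To make that idea concrete, here is a minimal sketch of a link-free local ranking that uses only the signals mentioned above: creation time, access count, and where the search terms appear. Everything in it is an assumption for illustration (the `LocalFile` fields, the weights, the 30-day half-life); it is not how any shipping desktop search tool scores files.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class LocalFile:
    path: str
    text: str
    created: float      # Unix timestamp (hypothetical metadata field)
    access_count: int   # how often the file was opened (hypothetical)


def local_score(f, query, now, half_life_days=30.0):
    """Score one file using term position, recency and usage. No links involved."""
    terms = query.lower().split()
    name = f.path.lower()
    body = f.text.lower()
    name_hits = sum(1 for t in terms if t in name)   # term appears in the path
    body_hits = sum(body.count(t) for t in terms)    # term appears in the content
    if name_hits == 0 and body_hits == 0:
        return 0.0                                   # no match at all
    age_days = max(0.0, (now - f.created) / 86400.0)
    recency = 0.5 ** (age_days / half_life_days)     # halves every half_life_days
    usage = math.log1p(f.access_count)               # diminishing returns on opens
    # Arbitrary illustrative weights: a path match counts more than a body match.
    return 3.0 * name_hits + math.log1p(body_hits) + recency + usage


def search(files, query, now=None):
    """Return paths of matching files, best score first."""
    now = time.time() if now is None else now
    scored = [(local_score(f, query, now), f.path) for f in files]
    return [path for score, path in sorted(scored, reverse=True) if score > 0]
```

A recently created, frequently opened file with the term in its name will outrank an old, untouched file that merely mentions the term, which is roughly the behaviour described above.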

Since PageRank depends on linking between pages, wouldn't it be ineffective on a single machine where documents are very unlikely to be pointing at each other?
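A small sketch shows why. The code below is the simplified, publicly described form of PageRank (a damped power iteration over a link graph), not Google's production ranking, which is not public. The function name, the damping value of 0.85, and the toy graphs are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each document to the list of documents it links to."""
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Documents with no outgoing links spread their rank evenly over all documents.
        dangling = sum(rank[d] for d in docs if not links.get(d))
        new_rank = {d: (1.0 - damping) / n + damping * dangling / n for d in docs}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# A tiny "web": both a and b link to c, and c links back to a.
web = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})

# A typical hard drive: three documents, none of which link to each other.
local = pagerank({"x": [], "y": [], "z": []})
```

In the linked graph, c ends up ranked above a, and a above b. In the unlinked one, every document gets exactly the same score, so the ranking carries no information and you are back to keyword matching and file metadata.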

OK, I've never fully understood this word: what are algorithms?

I gotta go with Kiran... there's really no point in having PageRank on a single computer.

More to the point, could Google make any money from it? This is likely to be their bigger question as the IPO comes up. In the end, Google is a media supplier just like a TV station. Should we have NBC "installed" in our TVs?

If you are using Windows, try X1. It is one of the most useful pieces of software for Windows. Google should have done this.

It wouldn't make sense for Google to use much of its web searching expertise for local searching. It's a completely different animal. Instead it should simply turn a few of its researchers loose on the local search problem.

I think something important to remember is that PageRank is not a static secret 'formula' for Google. It's more a term describing their philosophy of search. The basic idea of PageRank, that a document's importance is determined by the documents that point or link to it, is an open concept.

However, the specific implementation of the concept is the secret sauce that makes Google magic. For the intranet box that Google sells (and a hypothetical hard drive search service), I would bet that the algorithm is pretty straightforward. While I'm sure that you won't find it spelled out on their site anywhere, I would bet that it's not considered a top secret at Google. In fact, if you bought one of the Google appliances and had it search a set of known documents, you could probably determine the algorithm. Google is not losing any sleep over this, however, for if you had that same appliance index the internet, you would not end up with a mirror of google.com.

The prime reason that the google.com algorithm is secret (and the reason that Google is so much more useful to us than all the previous search engines) is both the novelty of using pointing documents to determine relevancy and, more importantly, the anti-gaming features they continually add. Whenever someone can game a search engine, the usefulness of the results decreases markedly. Witness that even Google has been struggling with this (Google bombs, link farms, etc.). However, unlike a lot of the players before them, they have made it a top priority to adjust the algorithm to defeat the gamers.

It's these 'adjustments' to the algorithm that are necessarily secret. And secret not just so that no one can take it and create another google.com, but so that malicious players cannot exploit it. PageRank, as it is implemented on google.com, is the combination of the linking-document relevancy philosophy plus the 'secret' anti-gaming features.

A local search engine does not have to deal with malicious players. (On the assumption that you would not game your own data to make it less useful to you)

Therefore, I think that Google could conceivably create a local search tool without too much worry about it being reverse engineered and ruining their core business.

Why they haven't, I don't know. Maybe it's floating around Google Labs and will come out soon. Maybe not. It's not quite their core business. But I doubt the reason they have not released it is a worry about revealing the secret of PageRank, for PageRank is not a static secret like the Coca-Cola formula.

I guess I should follow this up with IANAGE (I am not a Google engineer). This is all conjecture. But what the heck, this is the internet. :-)

I don't think that a personal solution like that is far off. I think I heard something about the next Windows doing that and making it available to the internet (with permission for specific files, etc.) and incorporating P2P technology at some level for this. Sounds interesting; how much I would let loose on the internet is a different story.

Take a peek at dashboard. Only for Gnome on Linux but then not everyone runs winders :-)

I should reveal that I'm probably skewed in my perception of the utility of a local googlebot because I run a local web server that hosts documents that are link-heavy. So desktop googlebot would actually be useful for me, as opposed to most desktop users.

I did consider the Google appliance as a gateway to people sussing out the ranking system, but given the cost of entry and its black box nature, I think it would require an order of magnitude more effort by an individual than the inevitable decentralized cracking effort that would come from geeks if an installable googlebot were released.

"I dont think that a personal solution like that is far off. I think I heard something about the next Windows doing that."

Hah, hah. That's 2006 at the earliest!

Well, the current Windows also has something that can index your HDD and lets you search it. It is called the Indexing Service. And there are freely available plug-ins that will allow the Indexing Service to index PDFs, etc. If you have a custom file format, you can write your own plug-in too...

Googlebot: it keeps going and going and going...

What would be the point of using PageRank to index the files on a single local computer? What Google would likely offer (if they were smart) would be a cheaper version of something like dtSearch, which is excellent -- but expensive -- for indexing and then searching local files.

But how interlinked are our local files? Would the PageRank philosophy work on personal files, even network shared project files? I don't leave a comment in a Word document to a PNG saying "Good PNG with orange and peach colours".

I imagine even if the linkage is high enough, that to fully get at the linkage value the search would have to go beyond individual Word documents and text files and up into our Exchange servers, Project plans, our Dreamweaver templates and VS.NET solutions.

There does not seem to be enough written-down linkage in our personal space. It is mainly in our heads and discussed at board room tables.

I think never. Try other software for making a search engine on your computer. As I remember, you can try yandex.ru; there is a special tool there for making a local search engine.

Apparently LookSmart used to offer a local search utility like the one you describe (or was it AltaVista?).

In any case, this "local search" feature is rumoured to be part of the forthcoming "Longhorn" OS from M$. Indeed, there has even been speculation that it might be a peer-to-peer search engine -- so that even *I* could search your emails ;)

Dear Anil,

Look at our site, www.nt4me.com; you don't have to wait until Longhorn is out. We have a desktop search engine using natural language processing that works as if a friend were taking your commands and carrying out your wishes on your computer. Take a look... This program is going to be the future desktop.

You sad men... is this how you want to spend the rest of your days? Choose life.

If you are running a local web server, I'd suggest a search engine like swish-e. I use it on my public website and it's great. It only deals with keyword relevance. Like the others, I think something like PageRank is useless for most non-webservers.
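For anyone wondering what "keyword relevance only" looks like in practice, here is a rough sketch: an inverted index with a simple term-frequency times inverse-document-frequency score. This is a generic illustration and not swish-e's actual implementation; the function names and the scoring formula are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {document name: number of occurrences}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][name] = count
    return index


def keyword_search(index, n_docs, query):
    """Rank documents by summed term frequency, weighted so rare terms count more."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        idf = math.log(1.0 + n_docs / len(postings))
        for name, count in postings.items():
            scores[name] += count * idf
    # Highest score first; ties broken alphabetically by name.
    return sorted(scores, key=lambda name: (-scores[name], name))
```

Note that nothing here looks at which document links to which, so it works the same on a web server's document root and on a folder of unlinked files.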
