Google: A Brain Extension

Over the last few days a subtle error on the part of our home’s gateway to the Internet rendered Google inaccessible. I was in agony. I was unable to write reports and actually felt rather stupid without it. I then realized I had effectively undergone a temporary lobotomy: Google is a part of my brain.

For those of you unfamiliar with the website, Google is the web’s finest search engine. It’s hard to describe exactly how much more accurate Google is at finding information than other search engines, like AltaVista (and its Google clone), HotBot, Excite, Lycos, or Northern Light. I would recommend simply trying a few searches on each. Any single example I could give, you might claim was contrived, so go do it yourself.

Google loads quickly, finds relevant results, and has a clean layout that lets your eyes rest on the data, not some blinking advertisement. Google, unlike any of the other “search portals,” is not “sticky.” They have been criticized heavily by investors for it, but the short of it is that they are not trying to distract you from what you’re looking for: they want you to get there! In fact, they even have an “I’m Feeling Lucky” button to take you right to the first site that matched your request without popping up a search results page at all.

No, I’m not paid by Google. I’m just a rabid fan. Strangely enough, so is everyone else who has come to their yet-to-be-advertised website. Even my mother, not generally one to “ooh” and “aah” at web technologies, let out a gasp after doing her first search on Google.

Why rant and rave about a search engine, though? Because it is paramount to the success of the Internet as a whole that search engines work well. A recent article on Newsbytes indicated that as many as two million people in the UK alone have stopped using the Internet regularly because it was too difficult for them to find the information they wanted! Without search engines, the Internet becomes impossible to navigate.

The Internet makes it possible for just about all participants to publish profuse quantities of information. That is important and allows for a truly democratic medium. The problem is in filtering: everyone should have a say, but those with particularly interesting things to say should be heard. Specifically, if you are looking for them, or the things that they are talking about, you should be able to find their information.

When a search engine works well, it becomes more than a web page that can find things. In combination with high-speed, always-on Internet access, it becomes an extension of your brain. In conversation, in reading, and in private thought, a new term or concept can be quickly discovered and researched, allowing users to become instant experts in broadly diverse areas.

This really is what I think was imagined by Diderot and his Enlightenment-era contemporaries when they put together the first modern encyclopedias: a vast, accessible index of human knowledge made even more vast and even more accessible by modern technology. Douglas Engelbart, the inventor of the computer mouse, hyperlinking, and drag-and-drop, envisioned the computer as a prosthetic addition to the human mind, capable of extending human powers of recollection, organization, and communication. Google, being effectively instantaneously available to anyone with a high-speed network connection, is a prosthetic addition to our knowledge base. Just glancing over at my roommate’s compressed-air duster, I see that the propellant used is 1,1,1,2-Tetrafluoroethane; type, click, and now available to me are chemical reference sheets, combustibility tests, safety data, and its molecular formula (C2H2F4), to name a few of the immediately available references. Everything in my daily life that boggles me or interests me can now be an avenue of discovery. And when (or should I say if) I have any useful contributions to make to our collective body of knowledge, I post them on my site and Google automatically sucks them into its archive.

In short, by enabling the Internet to act as the human race’s collective body of knowledge, engines like Google enable us to advance faster, learn faster, and contribute more usefully and plentifully to Mankind’s corpus of knowledge.

Let the second Renaissance begin!

The Peril Of Using ETags In A Cluster

Apache administrators: beware ETags if you have more than one webserver! (If you only have one webserver this article will not be useful to you.)

HTTP/1.1 added the “ETag” response header to allow a server to define its own way of uniquely identifying a point-in-time version of a specific file. The ETag is unstructured data; it’s just a string. The client, when re-requesting a document, submits an “If-None-Match” header containing the ETag it has cached. If this value does not match the server’s current ETag for the file, the server must retransmit the document, even if the HTTP/1.0 “If-Modified-Since” header exactly matches the “Last-Modified” date of the file.
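The decision the server makes can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Apache's actual code: the function name respond_status and its arguments are made up for this example, weak validators are ignored, and the date check is a plain string comparison rather than real date parsing.

```python
def respond_status(request_headers, current_etag, last_modified):
    """Return 304 (use your cached copy) or 200 (full retransmit).

    Hypothetical helper for illustration only. request_headers is a
    plain dict of the client's request headers.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When the client sends an ETag, the ETag comparison decides
        # the outcome; the modification date is not consulted.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or current_etag in candidates:
            return 304
        return 200

    # No ETag from the client: fall back to the HTTP/1.0-style date check.
    if request_headers.get("If-Modified-Since") == last_modified:
        return 304
    return 200
```

Note that a request whose If-Modified-Since date matches perfectly still gets a full 200 response if the ETag it carries differs from the server's.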

This wouldn’t be so bad as-is if it weren’t for the way that Apache implements ETag support by default. The default setting is to incorporate the file’s last modification date, its current size, and its Unix inode. The first two make sense; I can understand wanting to make sure that both the last-modified time and the size match what’s on the client. But incorporating the inode leads to some very bad behavior on clusters, because a given file, such as LOGO.JPG, might have the same size and modification time on all of the webservers of the cluster, but its inode number will almost certainly be different on each one.
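To make the problem concrete, here is a small Python sketch of an ETag assembled from those three attributes. The function make_etag and the hex-with-dashes layout are assumptions chosen for illustration; they are not guaranteed to match the exact string Apache emits.

```python
def make_etag(inode, size, mtime, use_inode=True):
    """Build an ETag-like string from file attributes.

    Illustrative only: the layout (hex fields joined by dashes, wrapped
    in double quotes) is an assumption, not Apache's exact format.
    """
    fields = []
    if use_inode:
        fields.append(format(inode, "x"))
    fields.append(format(size, "x"))
    fields.append(format(mtime, "x"))
    return '"' + "-".join(fields) + '"'


# Same file contents on two servers: identical size and mtime,
# but each filesystem assigned its own inode number.
server_a = make_etag(inode=1001, size=4096, mtime=978307200)
server_b = make_etag(inode=2002, size=4096, mtime=978307200)
```

Here server_a and server_b differ only because of the inode field; calling make_etag with use_inode=False yields the same string on both servers.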

This means that if you have four web servers, three times out of four when a client connects to a random web server, the client’s stored ETag will not match the server’s and the server will needlessly be forced to retransmit the file to the client. As the number of web servers grows, the situation quickly approaches the point where effectively no caching is happening at all.
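The arithmetic behind "three times out of four" is simple enough to write down. The sketch below assumes the load balancer picks a server uniformly at random for every request and that every server produces a different ETag for the same file; etag_miss_rate is a name invented for this example.

```python
from fractions import Fraction


def etag_miss_rate(num_servers):
    """Fraction of repeat requests that land on a server whose ETag
    differs from the one the client cached.

    Assumes uniformly random server selection and a distinct ETag
    per server for the same file.
    """
    if num_servers < 1:
        raise ValueError("need at least one server")
    return Fraction(num_servers - 1, num_servers)
```

With four servers the miss rate is 3/4, and with twenty it is 19/20, so the cache becomes nearly useless as the cluster grows.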

This is all compounded by a bug that I found in Internet Explorer 5 and 6, where if the downloaded file’s Last-Modified header matches the If-Modified-Since header it sent in the request, IE doesn’t bother to update its cached ETag. This means that even if you were to force IE to keep connecting to the same server (with the same inode for the file, etc.), once it’s made up its mind about an ETag it won’t change it until the Last-Modified time changes!

To fix this insanity, stick the following line in your Apache httpd.conf:

FileETag MTime Size

This will tell Apache to construct ETags based on only the modification time and the file size; specifically, it prevents Apache from using the inode of the file in the ETag. Then touch all of your files to update their last-modified times. The next time a client goes to your page, they’ll re-download the files, since the last-modified time changed, but then they will have the “simplified” ETag (without an inode) and they won’t have to download the file again until the file actually next changes. Your pages will be much snappier! 🙂
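If you would rather script the "touch everything" step than do it by hand, something like the following Python sketch would work. The function name touch_tree and the example path in the comment are hypothetical; point it at your own document root. On a cluster, make sure the files end up with the same modification time on every server (for example, by touching them once and then syncing with timestamps preserved), or the MTime part of the ETag will differ between machines.

```python
import os


def touch_tree(root):
    """Set the modification time of every regular file under root to now.

    Returns the number of files touched. Hypothetical helper for
    illustration; it follows the default os.walk behavior and does not
    descend into symlinked directories.
    """
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                # Passing None sets both access and modification time to now.
                os.utime(path, None)
                touched += 1
    return touched


# Example call (the path is an assumption, not from the article):
# touch_tree("/var/www/html")
```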