Mirroring Web Content

originally posted to the Apache Development mailing list

Hello! I apologize if this has been discussed in this fashion many times, but I have attempted to read around and wasn’t able to find any direct indication that it has. Please flame me off-list for my naïveté.


THE MIRRORING PROBLEM

As a website’s popularity grows, it becomes increasingly desirable to have “mirrors” of the website located in various places, in order to spread the processing and bandwidth expense of serving a page across many servers and to reduce the path length traversed by a packet going from server to client. The Apache Group itself uses mirrors, as do the Qmail and Postfix projects, the Linux Kernel site, and innumerable other popular websites.

There are several ways to inform a client that a file is available from alternate servers:

  1. Click On It Yourself. This approach, the one used by most Open Source project pages, involves presenting a clickable list of mirrors in the HTML body; it is assumed that a “kind” user will find a mirror instead of downloading from the main site. Some sites, like http://qmail.org/, somewhat enforce this usage pattern by prompting for a location before a user can engage the site. Some, like Apache, use a dynamic list of mirrors to reduce the probability that some poor singular mirror listed first will get all the traffic. This approach is nicely centralized and easy to administer, but is a pain for the user. Cookies to remember a user’s preferred location might help make localization a one-time effort rather than a continuous one. This is also not a standards-based approach: every website must go it alone. Thankfully, this is not hard.
  2. Use Clever DNS Servers. This is somewhat the IRC-server approach, and more so the approach that Akamai adopted. Most large-scale commercial websites use “clever DNS” servers that can make a reasonable guess as to which webservers are likely to be closest to you and return their IP addresses. This requires no client-side intelligence or user interaction. The seamless, scalable, and elegant nature of this approach has made it strongly compelling for the commercial web. I don’t know of any Open Source DNS software capable of location-based IP issuance; I would love to hear of any. This approach is equally centralized but requires control over the DNS server, something that many small to midsized websites don’t have. Getting a “smarter DNS” that did proximity-based IP address returns into ISPs wouldn’t even require sites to modify their DNS records, and could be a real coup. But this approach also requires mirroring the site in its entirety.
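
    For illustration only, here is the core of such a proximity lookup in Python. Everything here is made up for the example (the prefixes, the mirror addresses); a real server would wrap this in actual DNS request handling:

        # geodns_sketch.py -- toy illustration of proximity-based answers.
        # The prefix table and mirror addresses are invented for the example.
        import ipaddress

        # Map client network prefixes to the mirror we guess is closest.
        PREFIX_TO_MIRROR = {
            ipaddress.ip_network("192.0.2.0/24"): "198.51.100.10",    # e.g., UK clients
            ipaddress.ip_network("203.0.113.0/24"): "198.51.100.20",  # e.g., AU clients
        }
        DEFAULT_MIRROR = "198.51.100.1"  # fall back to the main site

        def answer_for(client_ip):
            """Return the A-record value a "clever DNS" would hand this client."""
            addr = ipaddress.ip_address(client_ip)
            for net, mirror in PREFIX_TO_MIRROR.items():
                if addr in net:
                    return mirror
            return DEFAULT_MIRROR

        print(answer_for("192.0.2.55"))  # -> 198.51.100.10
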
  3. Use HTTP Redirects. This approach is not used nearly as often as the first two. A script could be written to redirect a web browser wanting to download a given file to a specific mirror where the file resides. This has the advantage of not requiring all files to be on all mirrors, or even the same set of files on all mirrors. It does require writing some (simple) new software to manage the connection redistribution; this could be an Apache module. One of its actions could be simply to let the request be served by the local host until some bandwidth/CPU/memory threshold was crossed, at which point it could begin dishing out redirects to mirrors likely to be near the requestor (see the sketch below). This approach is more powerful than the above two (it’s seamless, but doesn’t require mirroring the whole site). It would work best as an Apache module, which would require control over the web server being used to service requests; a user could theoretically instead serve their entire site through a CGI performing the same function, but that would probably require changing the site’s layout and would involve a great deal of work on their part.
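
    As a sketch of what that conditional redirection might look like (my own illustration, not existing Apache code; the load threshold, mirror list, and file root are all made up), a tiny Python WSGI handler:

        # redirect_sketch.py -- hypothetical threshold-based redirector.
        # Unix-only (os.getloadavg); mirrors and threshold are invented.
        import os
        import random
        from wsgiref.simple_server import make_server

        MIRRORS = ["http://mirror1.example.org", "http://mirror2.example.org"]
        LOAD_THRESHOLD = 2.0  # 1-minute load average above which we redirect

        def app(environ, start_response):
            load = os.getloadavg()[0]
            path = environ["PATH_INFO"]
            if load < LOAD_THRESHOLD:
                # Cheap enough to serve locally: hand back the file ourselves.
                with open("files" + path, "rb") as f:
                    body = f.read()
                start_response("200 OK", [("Content-Type", "application/octet-stream")])
                return [body]
            # Overloaded: bounce the client to a mirror carrying the same path.
            # (A smarter version would pick the mirror nearest the requestor.)
            start_response("302 Found", [("Location", random.choice(MIRRORS) + path)])
            return []

        if __name__ == "__main__":
            make_server("", 8000, app).serve_forever()
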
  4. Use HTTP Headers. The next approach is to use two new fields in the HTTP response to a HEAD request: “X-Mirrored-By” and “X-MD5”. A sample HTTP request/response:
        [client] HEAD /very/big.movie HTTP/1.1
        [client] Host: MovieServer.com
        [client]
        [server] HTTP/1.1 200 OK
        [server] Content-Length: 205392839
        [server] Content-Type: video/quicktime
        [server] X-Mirrored-By: http://mirror.in.co.uk/movserv/the.movie
        [server] X-Mirrored-By: http://downunder.com.au/mirrors/ms/funny.mov
        [server] X-Mirrored-By: http://friend.in.co.tw/movies/big.movie
        [server] X-MD5: 5FD298A9782394C2

    This would enable the client to find the mirror closest to it and possibly even download the file simultaneously from multiple locations. The MD5 checksum and content length would ensure that the end result was correct, something that the other methods above don’t provide.

    This approach has not yet been implemented; I would like to bring it up for discussion with you, the Apache developers. It could be used today with setups that allow websites to control their own headers.

    I’ve reviewed the HTTP 305 (Use Proxy) status code, which seemed like it might be a good fit for this sort of thing, but I then discovered that only proxies are allowed to transmit that code.
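
    To make the proposal concrete, here is a rough sketch of the client side (hypothetical: it assumes the header names proposed above, that X-MD5 carries the full hex digest, and a mirror-selection policy of simply “first one that works”):

        # mirror_client_sketch.py -- rough client-side illustration of the
        # proposed X-Mirrored-By / X-MD5 headers.
        import hashlib
        import urllib.request

        def fetch_with_mirrors(url):
            # HEAD the origin to learn the mirror list and expected checksum.
            head = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(head) as resp:
                mirrors = resp.headers.get_all("X-Mirrored-By") or []
                expected_md5 = resp.headers.get("X-MD5")

            # Try mirrors first, falling back to the origin itself.
            for candidate in mirrors + [url]:
                try:
                    with urllib.request.urlopen(candidate) as resp:
                        data = resp.read()
                except OSError:
                    continue  # mirror unreachable; try the next one
                # Verify the download against the origin's advertised checksum.
                if expected_md5 is None or hashlib.md5(data).hexdigest() == expected_md5.lower():
                    return data
            raise OSError("no mirror delivered a file matching the advertised MD5")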

  5. Use an Orthogonal Peer-to-Peer System

    Finally, some recent companies, such as RedSwoosh, have begun rolling out technologies that intercept HTTP requests and attempt to service them on their own network, using the URL as a content key instead of a destination. These new-style networks have the advantage of not having to conform to existing client-server expectations in the HTTP world and can easily benefit from increased security, multipoint downloads, and so forth, often without requiring any changes at all to the webserver. The downside is the loss of definitive control over the locations from which a file is being distributed, and a dependence upon systems that may be neither open nor standards-based and may only run on certain platforms.
  6. Use a Generic Index Into Orthogonal Systems

    Bitzi, as an example, provides XML tags that can specify various properties of a file. An intelligent client could do an HTTP HEAD on the web server, grab the MD5 or Tiger-Tree hash of the file to be downloaded, grab the Bitzi tag based on that hash, and query various P2P networks (Gnutella, FastTrack/Morpheus/Kazaa, AudioGalaxy, etc.) for the file as reported by users of Bitzi. This is a much more ad hoc situation, perhaps better suited to users producing or mirroring informal rich media files. The server-side implementation, however, would only require sending back an MD5 hash of the file.

Thoughts? There’s certainly a good deal of work cut out here for the Open Source community. It’s quite likely that software already exists to do most of what I’ve discussed here and that I’m simply unaware of it. The Apache module to do conditional redirection is the one that I’m currently most excited about.

Please upbraid me now.

Yours,
David E. Weekly


Tribooting Apple’s Titanium Laptop

sidestory (how i got my laptop)

I recently had the joy of acquiring an Apple Titanium G4 Laptop. This is
kind of interesting in and of itself, because I’m a Linux junkie who
reluctantly uses Windows desktops for client work; a year ago I would
have laughed at you if you had told me I’d be craving an Apple machine.
But between the one-two punch of Apple’s gorgeous notebook design and its
release of OS/X, a consumer-deployed Unix with a slick front end, I was hooked.
I talked my boss into the necessity of porting our company’s software to
OS/X and expensed the purchase. (Hey, I did do the port a week
after I got the box. =) )

I waited on edge for weeks for the laptop to arrive; I had bought it
through the Apple Developer Discount program, so it didn’t really cost
my company very much at all. Unfortunately, that meant that shipping
dates could range up to two months after the order was put in. Finally,
the box came. Having no interest in OS/9, I immediately plopped in the
OS/X CD to install a real operating system. Hurrah! A real laptop!

Urg, sort of.

The first-generation DVD drives that Apple decided to slam into these
thin laptops were too thin. So thin that the CDs would grind up
against the roof of the drive and be unable to spin, making a horrendous
WHIZZAWHIZZAWHIZZAWHIZZA at around 100dB. A few coworkers from surrounding
cubes ducked their heads in: “Hey, what’s that sound?”

I had to very sheepishly explain to them that my brand new shiny toy had
shipped broken. Doh! I sent it back the very next day for repair. I didn’t
see it again for a month. So you can imagine how happy I was to finally
get it back: after three months of waiting, I had a functional laptop. =)

installing linux

Last weekend, I went to DEFCON, a computer security convention. A really surprising number of people there had Apple laptops, and a large portion were running Linux or OpenBSD on them! I thought, “Hey, I could do that, too!” As soon as I got back home, I set off to triboot my computer between OS/X, Linux, and OS/9. (Even though I don’t care for OS/9 much, OS/X can’t play DVDs yet, and a lot of system updates (for the firmware, etc.) are released under OS/9 only.)

I grabbed the Debian PPC ISOs, but the installer was rather unfriendly and
kept puking on me, even when I put in the special boot options to tell it
to use the Open Firmware graphics only. Whenever I got to the partitioning
step,
it would tell me I had partitions that were hundreds of gigabytes large.
If I tried to format any partitions, it would crawl, a sector every other
second, through what it claimed to be billions of sectors. I gave up after
five hours and decided to do some more research.

I looked a bit at LinuxPPC, but was told that it’s pretty wildly unstable and uncomfortable. Yellow Dog 2.0 had been getting some positive vibes from the people I consulted, so I downloaded the ISO and burned myself an install CD.

Separately, I had been having issues running “Classic” (OS/9) from within
OS/X, and I was told it was a very good idea to have them on separate
partitions. I backed up my handful of interesting data on the box, wiped
my partition table using the disk utility that came with my Titanium’s
“Software Restore” CD, and allocated four partitions: 3GB for OS/9, 11GB
for OS/X, 5.5GB for Linux, and a 200MB swap partition for Linux. As it
turned out, that was one too few! Yellow Dog Linux wanted an additional
10MB “boot partition”. Apple seems to create a whole bevy of little
partitions, so it ended up that my Linux root is on the 11th partition!
Crazy. So you need to allocate five partitions by hand to run the whole
setup properly.
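
To recap, the final hand-allocated layout looked like this (Apple’s own
little partitions fill the earlier slots, which is why the numbers start
so high):

    partition  9:   3GB    OS/9
    partition 10:   11GB   OS/X
    partition 11:   5.5GB  Linux root
    plus a 10MB Linux “boot partition” (this is where yaboot ends up)
    and a 200MB Linux swap partition
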

I reinstalled OS/9, ran Software Update to update my firmware and the OS,
installed OS/X, ran Software Update a few times, installed my OS/X development
environment and configured my laptop for NIS, and then proceeded to install
Linux. There were some issues with the YDL install, for sure (it died
repeatedly with weird errors when I tried to tell it what NIS domain I was
in), and it only supported a text-mode install, but it seemed to generally go
alright.

At the end of the install, I made the mistake of choosing to boot MacOS by
default, figuring I’d be presented with a little menu at boot asking which
partition (OS/9, OS/X, Linux) I wanted to boot into. I rebooted into OS/9.
Whoops.

juggling multiple OSes

Incidentally, to get into Open Firmware, reboot, wait for the reboot noise
to finish, then quickly press (in order) Apple-Option-O-F. Many guides
get this wrong and tell you to hold down all four buttons as you are rebooting.
You really have to depress them (in order) right after the reboot to get into
Open Firmware. It really weirded me out that Open Firmware is a Forth
interpreter. Damn it, you’re not supposed to be able to interactively program
your computer at the BIOS level! =) Freaky.

It’s good to memorize which partitions your OSes are on. For me, it was easy:
OS/9 was on partition 9, OS/X on partition 10, and Linux on 11. So from Open
Firmware I could boot any of the three:

    boot hd:9,\\:tbxi     (OS/9)
    boot hd:10,\\:tbxi    (OS/X)
    boot hd:11,\yaboot    (Linux)

(OS/9 and OS/X use “:tbxi” as their loader, whereas Linux uses “yaboot”.) I
could now boot into any of three operating systems on my computer! All of
them work perfectly! I’ve even got sound under Linux! (Although sleep is
still a little buggy…)

It gets stranger yet: in OS/X you can run OS/9; it’s called “Classic mode”.
In Linux, you can also run OS/9 – it’s called Mac-On-Linux. This lets you run
MacOS on PowerPC Linux really fast, since it isn’t emulating anything; it
just runs the OS directly! So you can run your different OSes inside each
other. =) XDarwin lets you run (and compile) X Window System programs on
OS/X, just to keep things interesting. Oh, and then there’s the GNU-Darwin
ports collection with several thousand BSD packages for OS/X (and native
Darwin). Whee!

Dave’s nifty Tip of the Day: in OS/X, type “>console” as your username at login to get a graphics-free login prompt! =)

More info later: the short of it is that a tri-boot is quite comfortably
possible, and it’s fun to run KDE 2.1 on your Titanium laptop! (And boy is it
fast!) Oh, and OS/X is cool, 95% POSIX-compliant (eck – almost there, guys!),
and has crappy man pages. All hopefully fixed soon, except for the fsck’ing
Alt-Tab issue (it cycles linearly through windows instead of using a
most-recent-first stack!). Good job overall, though!

Later! =)

Presumed Backing

I sometimes find myself on the receiving end of (and am indeed guilty of perpetuating) a phenomenon that I would like to call “Presumed Backing.”

It is neatly summarized as follows: the more people approach you with a certain fact, the more likely you are to believe it is true, because you implicitly believe that there are an increasing number of “backing facts” not explicitly stated. If John tells you “Sue ran down the street naked!” and Jane tells you “Sue ran down the street naked!”, you may begin to believe that Sue did run down the street naked. If a third person tells you the same, the event becomes more credible still. Why? Because you presume that each person has verified that the event happened, and you presume their verifications to be independent.

The interesting thing about Presumed Backing is that it can achieve critical mass. With a sufficient number of initial believers in an event, even a few relatively skeptical individuals may be won over through sheer numbers. As these skeptics fall, doubt is introduced into ever more skeptical circles, letting a convinced populace truly replace the presentation of facts.

This is how rumors get started. A devious individual plants an idea into a handful of impressionable peers’ heads. Together, these peers begin to sow the idea into the minds of the populace, and so on and so forth. Once a rumor is widespread, we often implicitly believe there to be equally widespread substantiation of the allegations, even when the truth may be that the ill-advised “fact” had a single author.

This is also how chain mail works: you receive a letter (e.g., “A little boy is dying of cancer…email this to a dozen of your friends and the Red Cross will give a penny for every ten people that get this email”) directly from a friend. You are much more likely to presume the email to be factual since it came from someone you know: you presume there exists backing evidence, and you may well forward the email along yourself.

Once you recognize the process of Presumed Backing, it’s easy to stop. All you have to do is inquire as to the direct evidence. “Who saw Sue run down the street naked? Are they sure it was Sue? Are they sure she was naked?” — asking Sue herself might not be a bad idea either. This eliminates false assumptions about the strength of the verification undergone by your peers and will keep you from contributing to rumors and other “false memes”.

If you redistribute information, it is wholly your responsibility to check that information for correctness.