Introduction
Long before websites, back in the dark ages of the BBS, the internet had (well, it’s still there!) a distributed messaging system called usenet. It carried countless topics on just about everything, full of all kinds of incredible conversations. Before the walled gardens, and before it was easy to run an individual bulletin board, the internet prided itself on having one big global distributed messaging system. One thing that was always taken for granted was that it was too big to save: every site had a finite amount of very expensive disk space and would only keep recent articles, so whatever you put out there would probably be erased.
But it turns out that the zoology department at the University of Toronto had a tape budget, and was in fact archiving everything it could. In all they amassed 141 tapes spanning from February 1981 (though those are not Usenet posts, just internal University netnews stuff) all the way up to about midnight of July 1, 1991!
While the archive was made available to a few people in 2001, it was made generally available in 2009, and then in 2011 on archive.org, where I downloaded a copy of it. There is some interesting backstory over on Dogcow land, as it took quite a bit of effort to get the data off the tapes, after which it was slowly released out into the wild.
As mentioned on the archive.org site:
This is a collection of .TGZ files of very early USENET posted data provided by a number of driven and brave individuals, including David Wiseman, Henry Spencer, Lance Bailey, Bruce Jones, Bob Webber, Brewster Kahle, and Sue Thielen.
OK, so a few months back I had set up AltaVista personal desktop search along with the UTZOO usenet archive, for the purpose of using something more sophisticated than grep while maintaining that legacy/retro feel of using outdated technology. To recap, the first challenge is that the desktop search product is only meant to be used from the desktop of a Windows 98/NT 4.0 workstation. It uses a super ancient version of Java as the webserver, and they chose to bind it to 127.0.0.1:6688. So the first thing to get around that was to build an stunnel tunnel, allowing me to effectively connect to the webserver remotely. And since the server assumes it’s being accessed locally, I had to use Apache with mod_rewrite to set up some simple regular expressions to massage the pages into something that would be usable from a non-local machine.
So with that word salad up, let’s have a brief picture!
Flow diagram
Stepping it up
On my ‘general’ hosting machine, I use haproxy to reverse proxy multiple sites out of a single address. This is a super simple solution that allows me to have all kinds of different backends using various hosting platforms, such as Apache 1.3 on Windows NT 3.1. So for this to work I just needed to create an altavista.superglobalmegacorp.com DNS record, and then add the following to the haproxy config:
frontend named-hosts
    bind 172.86.179.14:80
    acl is_altavista hdr_end(host) -i altavista.superglobalmegacorp.com
    use_backend altavista if is_altavista

backend altavista
    balance roundrobin
    option httpclose
    option forwardfor
    server debian8 10.0.0.18:80 check maxconn 10
So as you can see it’s really simple: it looks for the string ‘altavista.superglobalmegacorp.com’ at the end of the host header, and then sends the request to the backend, which has a single web server, in this case a lone Debian server, aptly named debian8, that throttles after 10 concurrent connections.
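To make the matching rule concrete, here is a minimal Python sketch of the kind of test that ‘hdr_end(host) -i’ describes: a case-insensitive check on the end of the host header. It is only an illustration and not how haproxy is implemented; the names hdr_end_matches, pick_backend and RULES are made up for this example.

```python
# Illustrative sketch only: mimics a case-insensitive "ends with" test on
# the Host header. Function and variable names here are hypothetical.

def hdr_end_matches(host_header: str, suffix: str) -> bool:
    """Return True if host_header ends with suffix, ignoring case."""
    return host_header.lower().endswith(suffix.lower())


def pick_backend(host_header, rules, default=None):
    """Return the backend name of the first rule whose suffix matches."""
    for suffix, backend in rules:
        if hdr_end_matches(host_header, suffix):
            return backend
    return default


# One rule, mirroring the single acl/use_backend pair in the config above.
RULES = [("altavista.superglobalmegacorp.com", "altavista")]

if __name__ == "__main__":
    print(pick_backend("AltaVista.SuperGlobalMegaCorp.com", RULES))
    print(pick_backend("example.org", RULES))
```

Because this is a pure suffix test, a host value with a port tacked on the end (say ‘altavista.superglobalmegacorp.com:80’) does not match in this sketch.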
The next thing to do was generate a self-signed SSL cert, which wasn’t too hard. The stunnel installer has a profile ready to go, so it was only a matter of finding a version of OpenSSL that’ll run on NT 4. As this isn’t public encryption, I really don’t care about it using crap certs.
The Debian server is where all the regex magic is, along with the stunnel client that connects to the NT 4.0 Workstation:
client = yes
debug = 0
cert = /etc/stunnel/stunnel.pem
[altavista]
accept = 127.0.0.1:8080
connect = 10.0.0.19:8443
Likewise on NT stunnel will need a config like this:
cert = c:\stunnel\stunnel.pem
; Some performance tunings
socket = l:TCP_NODELAY=1
socket = r:TCP_NODELAY=1
; Some debugging stuff useful for troubleshooting
debug = 0
output = c:\stunnel\stunnel.log.txt
[altavista]
accept = 8443
connect = 127.0.0.1:6688
Now with that in place, I can hit my personal AltaVista search. The next insane thing was to rename all the files from the UTZOO dump, adding a .txt extension, and then re-encode them in MS-DOS CR/LF format. I used ‘find -type f’ to find the files, and then a simple exec to rename them with a .txt extension. Then it was only a matter of using ZIP to compress the archives, transferring them to Windows NT, and running UNZIP on them with the -a flag to convert them into CR/LF ASCII files on Windows. This took a tremendous amount of time, as there are about 2.1 million files in the archive.
With the files on Windows, I now had to run the indexer.
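For anyone who wants to do the rename and line-ending conversion in one pass, here is a rough Python sketch. This is not what I ran (that was find, ZIP and UNZIP -a as described above), just an alternative under a few assumptions: the articles are plain text with Unix LF line endings, the whole tree lives under one directory, and the names to_crlf and convert_tree are made up for the example.

```python
import os


def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CR/LF without doubling existing CR/LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_tree(root: str) -> int:
    """Rewrite every file under root as '<name>.txt' with CR/LF endings.

    Files already ending in .txt are skipped, so a re-run does not touch
    files that were converted earlier. Returns the number of files converted.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                continue
            src = os.path.join(dirpath, name)
            dst = src + ".txt"
            with open(src, "rb") as fh:
                data = fh.read()
            with open(dst, "wb") as fh:
                fh.write(to_crlf(data))
            os.remove(src)  # drop the original once the copy is written
            count += 1
    return count


if __name__ == "__main__":
    # "utzoo" is a placeholder path for wherever the archive was unpacked.
    print(convert_tree("utzoo"))
```

One caveat with the sketch: anything in the dump that already happens to end in .txt is skipped entirely, so it would keep whatever line endings it came with.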
Indexed in under 7 hours!
While I originally had an IIS 4.0 instance on the same NT 4.0 Workstation serving up the result files, I thought it would make more sense to just serve them from the UTZOO mirror server I have in the same colocation, so it’d be much faster. That way only the queries are relying on servers in Hong Kong, instead of being 100% located in the United States.
So here we go, my search portal for all that ancient usenet goodness:
altavista.superglobalmegacorp.com
If you are hoping for the wealth of knowledge to be gained from people posting on usenet from 1980 to 1991 then this is your ticket. Keep in mind that usenet being usenet, there are discussions on everyone and everything, and like all other forums, before you know it it’ll end with calling people Hitler, and how the Amiga is the greatest computer ever (well it was!).
The story of AltaVista is somewhat interesting, but much like how Digital screwed up the Alpha market by trying to hoard high end designs, they also didn’t set the search people free to focus on search. And the intranet stuff was crazy expensive: look at this ad from 1996, which translates to a minimum of $10,000 USD a year to run a single search engine! But as we all know, the distributed model of Google won search, and AltaVista never had a chance, as it was caught up in the Compaq/HP mess and then spun out to be quickly absorbed by Yahoo.
Meanwhile it appears the original owners of altavista.com, AltaVista Technology, Inc. of California, are actually still in business. If anyone cares I’ll put the installation files, and some of the configs, in this directory.