Planet.python.org

Andrew Dalke: Calvin Mooers

2014-06-19

Mooers became an information scientist at the time when information
science was just getting started. I came across his name in "An efficient
design for chemical structure searching" by Feldman and Hodes,
JCICS 1975 15 (3) pp 147-152. The paper is basically built on work
Mooers had done a few decades earlier, and includes a magic value of
0.69 as the "Mooers limit." I made a mental note to follow up on that
later. This essay is a result of that followup, and is a small
biography of Mooers' involvement with chemical documentation.

Influence of Mooers: connection tables, screens, and canonicalization

As I read more of the literature, I realized that Mooers had a big
influence on the early decades of cheminformatics. He seems to have
been the first person to use connection tables for coding molecular
information on a computer, the first to describe substructure
enumeration-based screens, and the first to consider a canonical
representation of a molecule.

Here are some quotes to show you what I mean.

connection table and screens

Perhaps the founding paper of cheminformatics is Ray and Kirsch's "Finding
Chemical Records by Digital Computers" in Science 25 October 1957,
pp 814-819. It describes a project to use computers for substructure
search of the patent database. It has the earliest use I know for
"screen" (or "screening device"), as used for substructure search.

It contains two references to Mooers, the first concerning a
connection table:

An example of a code suitable for machine searching was described by
Mooers in the "Zatopleg" (5) system of ciphering structural
formulas. Mooers' method of representing compounds provided the basis
for representing the input data in the SEAC structure search routine
described below. Methods for actually searching such data had to be
developed.

where (5) is C. N. Mooers, Ciphering Structural Formulas –
the Zatopleg System (Zator Co., Cambridge, Mass., 1951).

The second concerns a substructure screen, which Ray and Kirsch call
"Mooers' N-tuple descriptors":

It has been suggested by Mooers (7) that, for purposes of retrieval,
complex structures such as chemical diagrams can be represented in
terms of a list of, say, all of the triples of atoms and bonds
occurring within the structure. Thus, chloral (Fig. 3) would be
described as consisting of combinations of the triples in the following
list: Cl-C; C-C; -C-; -C=; C-H; C=O.

where (7) is C. N. Mooers, "Information retrieval on structural
content," in Information Theory (Academic Press, New York,
1956), pp. 121-134. (See below for the full citation; there are many
books titled "Information Theory".)
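To make the N-tuple idea concrete, here is a minimal Python sketch of how such a fragment list could be generated and then used as a screen. The atom and bond encoding, the names `fragments` and `passes_screen`, and the rule of sorting symbols so that each fragment has one spelling are my own assumptions for illustration. They are not taken from Mooers or from Ray and Kirsch.

```python
from itertools import combinations

# Hypothetical encoding of chloral, CCl3-CHO: a list of atom symbols plus
# bonds as (atom index, atom index, bond symbol), "-" single and "=" double.
CHLORAL_ATOMS = ["C", "Cl", "Cl", "Cl", "C", "H", "O"]
CHLORAL_BONDS = [
    (0, 1, "-"), (0, 2, "-"), (0, 3, "-"),  # the three C-Cl bonds
    (0, 4, "-"),                            # C-C
    (4, 5, "-"),                            # C-H
    (4, 6, "="),                            # C=O
]


def fragments(atoms, bonds):
    """Return the set of atom-bond-atom and bond-atom-bond fragments."""
    found = set()
    incident = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        # atom-bond-atom: sort the symbols so "Cl-C" and "C-Cl" are one key
        left, right = sorted((atoms[a], atoms[b]))
        found.add(left + order + right)
        incident[a].append(order)
        incident[b].append(order)
    for i, orders in incident.items():
        # bond-atom-bond: every pair of bonds that meet at the same atom
        for pair in combinations(orders, 2):
            first, second = sorted(pair)
            found.add(first + atoms[i] + second)
    return found


def passes_screen(query, target):
    """True if every fragment of the query also occurs in the target."""
    return fragments(*query) <= fragments(*target)


if __name__ == "__main__":
    print(sorted(fragments(CHLORAL_ATOMS, CHLORAL_BONDS)))
    carbonyl = (["C", "O"], [(0, 1, "=")])
    print(passes_screen(carbonyl, (CHLORAL_ATOMS, CHLORAL_BONDS)))
```

For chloral this yields the same six fragments as the quoted list, with "Cl-C" spelled "C-Cl" because of the sorting rule. A screen like this can only rule a record out. A record that passes still needs a full atom-by-atom substructure test.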

Cossum, Krakiwsky, and Lynch in "Advances in
Automatic Chemical Substructure Searching Techniques", J. Chem. Doc.,
1965, 5 (1), pp 33-35 reaffirm that Mooers' "code suitable for
machine searching" is actually a connection table:

The connection table which we use is a development of that first
suggested by Mooers, and tested in search by Ray and Kirsch.

canonical connection table

People usually refer to Morgan's "The Generation
of a Unique Machine Description for Chemical Structures",
J. Chem. Doc., 1965, 5 (2), pp 107-113 as the first paper to describe
a molecular canonicalization algorithm. In the paper, Morgan writes:

Since it can be shown that the set is finite for any graph composed of
a finite number of nodes, it is possible to select the unique table by
generating all members of the set, lexicographically ordering the
members of the set based on the characters involved in the
description, and then selecting the first member of the resulting list
as the unique table. This concept is a restatement of a technique
proposed by C. N. Mooers for generating a unique cipher based on a
process of making all possible "cuts" and comparing the resulting
ciphers. [10, 11].

In this case, [10] is the same "Ciphering Structural Formulas –
The Zatopleg System" as before, and [11] is C. N. Mooers, "Generation
of Unique Ciphers for a Finite Network," Zator Technical Bulletin
No. 49, Zator Co., 79 Milk St., Boston 9, Mass.
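The "generate every member, sort, take the first" idea in Morgan's description can be sketched in a few lines of Python. This is only a brute-force illustration under my own assumptions: the table layout and the name `canonical_table` are hypothetical, and it does not reproduce Mooers' "cuts" procedure, which I have not been able to read. It also takes n! steps, so it is only usable for very small graphs.

```python
from itertools import permutations


def canonical_table(atoms, bonds):
    """Try every atom numbering and keep the lexicographically smallest table.

    atoms: list of atom symbols.
    bonds: list of (atom index, atom index, bond symbol) tuples.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new number given to the atom numbered `old`
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        table = (tuple(new_atoms), tuple(new_bonds))
        if best is None or table < best:
            best = table
    return best


if __name__ == "__main__":
    # The heavy atoms of ethanol, numbered in two different ways
    print(canonical_table(["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")]))
    print(canonical_table(["O", "C", "C"], [(0, 1, "-"), (1, 2, "-")]))
```

Both inputs describe the same molecule, so both calls return the same table, while a different molecule with the same atoms (for example C-O-C) returns a different one. That property is what makes the table usable as a unique description.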

I can't help but conclude that Mooers' ideas had a big influence on
the early days of cheminformatics.

Speaking of Morgan, I came across a reference to the Morgan
canonicalization method in an essay by Charles Davis titled "Indexing and Index
Editing at Chemical Abstracts before the Registry System":

Dyson, however, is remembered for having succeeded in lobbying for his
system of linear notation, which won the approval of the International
Union of Pure and Applied Chemistry (1961). This triumph over the more
popular Wiswesser notation was something of a pyrrhic victory since
linear notation ultimately would never be important to CAS. Other
organizations, especially the Institute for Scientific Information,
would go on to use Wiswesser notation, and it became an industrial
standard for those who needed linear notation in their work (Davis &
Rush, 1974a). However, the principal reason that CAS did not use
either system was that during the early 1960s a young mathematician
named Harry Morgan developed the famous algorithm that led to the
Registry System (Morgan, 1965; Davis & Rush, 1974b).

... Lynch's paper (M. F. Lynch, personal communication, 15 November
2002) expands on events during this era; moreover, he makes it clear
that the Morgan algorithm was actually a revised version of the Gluck
algorithm developed at DuPont.

When I read Gluck's paper about the Du Pont system, I wondered what
algorithm they used in 1964 in order to canonicalize their molecules,
since it was a year before Morgan's paper. Now I know! (For what it's
worth, the priority rule is based on publication, not use, so we're
still correct in citing Morgan, not Gluck.)

Who was Calvin Mooers?

Calvin Mooers and his wife gave an oral history of
their - mostly Calvin's - lives. I used that as the scaffolding for
this mini-biography.

He worked for the Naval Ordnance Laboratory during World War II. At
the time, von Neumann was an advisor for the Navy. Von Neumann wanted
a computer, and convinced the Navy to build one. The Navy decided to
do so at NOL. Atanasoff, who invented the first electronic digital
computer, was put in charge of the project.

At about 26, he decided to go to graduate school at MIT. He wanted to
apply some of the skills he had from working with computers,
considered a few possibilities, and decided to work on library
science. His wartime experience showed that library systems didn't
handle the enormous amount of new publications which didn't fit into
the existing classification system.

In his history he recounts: "Not long after (at MIT), I went to a
lecture by Claude Shannon that he gave about 'information theory.' One
of the conclusions of the lecture was that a random process had the
statistics required for passing the highest quantity of
information." Shannon and Weaver's The Mathematical Theory of
Communication (1949) and Norbert Wiener's Cybernetics
(1948) mark the dawn of the information theory age. Wiener was also at
MIT, and Mooers in one of his papers notes that both books are very
interesting.

His Master's thesis, Application of random
codes to the gathering of statistical information, describes his
"zatocoding" system of superimposed codes, which I'll get back to in a
future essay. Appendix Z of the thesis was originally presented at an
American Chemical Society meeting.

Indeed, he has a long association with the ACS. At MIT he met the
chemist James Perry, who was interested in "chemical literature" and
punch card-based information systems for chemical literature. Perry
arranged things so that Mooers could present his early ideas at the
ACS in 1947. This interaction helped lead to Zatocoding.

He continued to be associated with the ACS. For example, he presented
"Making Information Retrieval Pay" before the Chemical Literature Division
of the ACS in September 1950, which described his audacious plan to
index the Library of Congress catalog on a set of Zatocoded punch
cards, and build a mechanical search engine that could conduct a
search within a few minutes.

(I'll add that he is widely acknowledged for coining the term
"information retrieval" and presenting it earlier that year. He also
coined the term "descriptor":

Before 1948 the word did not exist, of course, and was not in the
dictionary. It's now in the dictionary and most people don't know that
it was my neologism. I made it up because I wanted the new word to
mean exactly what I described and, unfortunately, that never
happened. That is, the word descriptor now means almost anything.

According to the Wikipedia entry for index term, Mooers'
definition is in particular used about a preferred term from a
thesaurus.)

He started a company to commercialize his ideas. The first sale was to
Merck, Sharp and Dohme (the US Merck, known still as "MSD" outside of
the US). I assume it was a punchcard-based indexing system for
chemical record lookup. In any case, it was during this period, around
1950, that he did most of his thinking about applying computers to
chemistry records, which resulted in his Zatopleg system.

I think he didn't continue with chemistry in large part because he
wanted to work on larger problems of library and information
management, and of thinking machines. In the previously mentioned
"Making Information Retrieval Pay", he proposed a DOKEN
("documentary engine"), a mechanical retrieval engine capable of
searching a 100 million record catalog in 2 minutes, or the 10 million
catalogued items of the Library of Congress in 10 seconds. I don't
think chemistry documentation was as interesting to him.

In any case, history shows he was right. It took another 10 years
before cheminformatics as its own field really started, and there were
and are a lot more library systems than molecular database systems.

For various reasons, his business didn't do so well. It seems like
librarians didn't like his ideas much. Not only did he want to replace
a lot of manual indexing systems with random numbers, but the random
numbers didn't make much sense to a non-mathematician. I've read some
of his papers, and his style is an unfortunate combination of abstruse
and opinionated that could put people on edge. It is also the
combination that can energize people. The trick is to energize the
people who will pay you money.

Other pioneers

My view is that while his specific background put him ahead of most
others, in how to think about information theory and computing
devices, there were many others very close behind him, and who were
less prickly to deal with.

Hans Peter Luhn

For example, at the suggestion of the Hollerith Company, Dr. Dyson
presented to IBM his ideas for using punched cards [containing
molecules in Dyson notation]. Accordingly, he and Mr. Peter Luhn (of
IBM) built, in 1949, a machine which would sort free field code
cards. In this way, the Luhn scanner came into operation. (From
the book "Survey of Chemical Notation Systems.") This led to
[Luhn's] interest in literary data processing.

"Mr. Peter Luhn" here is Hans Peter Luhn, another pioneer of
information science. Among other things, he invented the checksum
algorithm used in every credit card. He kept in touch with Dyson. Luhn
developed the KWIC permuted index in 1958 or 1960 (sources differ, and
this is too much
of a tangent to track down). Dyson, who by then was in charge of
research and development at Chemical Abstracts, invited Luhn to visit
and to present this work. That's when CAS realized they should be
looking towards computers, and then acquired IBM hardware. One result
of this was the KWIC index for Chemical Titles, which served as a
product to compete with the success of Eugene Garfield's ISI.

Mortimer Taube

Mortimer Taube, who was a librarian before becoming the chief of
general reference and bibliography of the Library of Congress in 1945,
started Documentation Inc in 1952. He had much closer ties to the world of
library science, and to government. Its first customer was the US
military, and it provided library services to NASA when NASA was
created in 1958. The underlying technology was based on "uniterms",
which was presented in a paper titled "Coordinate Indexing of
Scientific Fields", delivered at the Symposium on Mechanical Aids to
Chemical Documentation, Division of Chemical Literature, American
Chemical Society, New York, Sept. 4, 1951.

According to Heting Chu in the book "Information Representation and
Retrieval in the Digital Age", Taube introduced boolean search systems
to information retrieval, in the form of coordinate indexing. (On the
other hand, other sources say that Mooers was the first to propose using
Boolean operations. I think the difference here is between "propose"
and "convince customers to use.")

According to Mooers, in his autobiography (which means he has every
right to tell it the way he wants to):

What [Taube] did was that he cooked up a simplified variety of my
descriptors which he called Uniterms. He was a great salesman and a
smooth talker and he charmed the librarians. He had worked as a
librarian. So he set up Documentation, Inc. which made quite a
commercial splash. Taube's message was that you don't have to worry
about the fact that you can't understand Mooers, you do it the Uniterm
way, you can understand it, and it's easy. So they flocked in his
direction. Well, his methodology can be cynically characterized as
follows: How do you index documents? You take a collection of
documents in a certain field and you give them to somebody that is not
really in that field. You sit him down with a colored pencil and ask
him to go through the documents and to underline every term that he
doesn't understand [laughs], and to use those underlined terms for
index terms. You've heard of key terms, key words? Well, key words are
the direct descendants of Mortimer Taube's Uniterms and have the same
sort of loose-jointed semi-applicability to the field at hand.

If Mooers thinks that's "loose-jointed" then imagine what despair he
might have had with folksonomy.

Chemistry and documentation

Did you notice how Mooers, Luhn, and Taube all had ties with the ACS?

I had always wondered why the "Journal of Chemical Documentation" had
that name. I have a better understanding now. The American
Documentation Institute started in 1937 with money from the Science
Service of the National Academy of Sciences. In the post-war era, the
number and rate of
scientific publications grew enormously, and especially the number of
technical reports. I recall from Eugene Garfield's essays that
Chemical Abstracts at the time was years behind indexing the
literature. This presented a market opportunity for his ISI (Institute
for Scientific Information), which used computers to index the most
popular chemical journals.

The ADI at this time consisted mostly of people with scientific and
technical backgrounds. This caused some animosity, as many of the
newcomers believed that specific library training "was outdated or
unnecessary", while others believed, as the ASIS link quotes from
elsewhere, "documentation was librarianship performed by amateurs."
Also around this time, the American Chemical Society Division of
Chemical Literature group started, as one of many more specialist
groups. There was a large overlap in the readership between these
various organizations.

Then the Soviets launched Sputnik in 1957. The US and other western
governments started pouring money into science and technology. Quoting
Mike Lynch:

There were great stirrings in science information at that time because
of Sputnik, the challenge to the United States from the Soviet Union
in October 1957. Sputnik's beep-beep tones took the world totally by
surprise. When the dust had settled, it became apparent that the
Soviets had published their intentions in the open literature, but the
science information system in the West was in disarray. The system had
not been considered sufficiently important nor was it well enough
funded to keep up with the vast increases in the numbers of scientists
employed and publishing in the postwar period. There was said to be
a cocktail called Sputnik, one part vodka and three parts sour grapes.

The Journal of Chemical Documentation started shortly after Sputnik,
in 1961. My guess, though I've not read any of the American
Documentation articles, is that the topics in each field became too
specialized to have a single, wide-ranging journal. I also guess that
the money going into science research meant more money going towards
developing computers, which meant more drug companies could have a computer.

That said, I still don't understand how the ACS, as compared to any
other field, was so tightly coupled to these three key figures of
information science. Anyone know?

Copyright, patents, and trademarks

I get the feeling that Mooers wanted to follow the ideal of an American
inventor: a thinker who comes up with ideas, patents them, and makes
money by licensing the right to use the patents.

For example, he tried patenting his Zatocoding system. According to
his oral history, it took 23 years for the USPTO to grant the patent,
which was past the time it was commercially viable.

I double-checked. The granted patent is "Battery controlled machine",
US 3521034 A. (I linked to Google instead of the authoritative USPTO
because the latter only has images in a hard-to-use interface, while
Google has an OCR'ed copy on a single page.) Note the text: "This is a
continuation-in-part of application Ser. No. 392,444, filed Nov. 16,
1953, which was a continuation-in-part of application
Ser. No. 774,620, filed Sept. 17, 1947."

In his oral history he comments that he:

... was becoming more and more critical of what I could do in the
library field. That is, by 1960 there were now computers and
"operators" like Herb[ert R. J.] Grosch at General Electric (GE) were
moving in, and being the big boss of the computer at a company and
were going around looking for business. And the library field was
beginning to wake up to the fact [that] there might be something
here. You don't take your business to a little hole in the wall like
Mooers was operating. You take it to GE or you take it to MIT. There
was an "operator" – Overhage at MIT – who set up a big
project, INTREX, to solve all the problems for all time of libraries
with computers at MIT. Herb Grosch was taking contracts at GE. This
was the situation. The result of all of this was that in the mid
1960s, I more or less turned off my public interest to the information
and library field, although I kept following it to some extent in
private, and turned on my interests in programming languages and
TRAC.

It's indeed hard for a lone inventor to compete against someone with
close library and government connections (Taube) or big business
connections (Luhn). That's where the limited monopolies of copyright
and patent might help, but at the time no one, including lawyers,
thought they could be used for software. Instead, he filed for a
trademark on the name "TRAC", to the dismay of many. Quoting from
Mooers:

The first issue of Dr. Dobb's Journal, one of the early publications
in the personal computer field, has a vitriolic editorial against
Mooers and his rapacity in trying to charge people for his computing
language.

Show me the documentation!

For someone so interested in document retrieval, and whose ideas seem
to inspire much of core cheminformatics, it's surprisingly hard to
read his important papers. The most critical ones (for my interests)
were published basically as white papers from his company. They aren't
in WorldCat, or accessible through Google's various search
engines. These are:

Generation of Unique Ciphers for a Finite Network, Zator Technical Bulletin No. 49

Ciphering Structural Formulas – the Zatopleg System, Zator Technical Bulletin No. 59

Finding Chemical Records, Zator Techniques No. 3 (or Zator Technical Bulletin No. 64)

as well as Information retrieval on structural content,
pp. 121-134 from Information theory; papers read at a symposium on
information theory held at the Royal Institution, London, September
12th to 16th, 1955 (Academic Press, New York, 1956). (It appears
that the National Library of Sweden has a copy of this book, so I
didn't put it on the list.)

I'll even go so far as to say that any paper published in the last 20
years, which referenced one of those first three citations, is likely
making the reference second-hand through other papers, and not from
actually reading the original paper. The only possibility I've found
to get copies of the papers is to contact the curators at the Charles
Babbage Institute. They have 39 boxes of his papers, including those
three. I'm waiting for a reply to my email asking how to get a copy.

Zatopleg in the patent literature

Meanwhile, I have some sideways views through the patent records. These are:

US 4118788: Associative information retrieval (1978):

One prior art technique, originated in the 1940's, that is designed to
permit associative retrieval in mechanical type systems rather than in
conjunction with computers, is sometimes referred to as
"Zatocoding". A complete description of a Zatocoding system, including
some of the background mathematics, is contained in British
Pat. No. 681,902 issued to Calvin Mooers on Oct. 29, 1952.

... While many of the features of the Zatocoding system, including the
theory of superimposed coding, may be quite valuable in enabling
associative retrieval, it nevertheless remains that the technique was
generally oriented toward manual type storage systems and was never
expanded so as to be useful in the environment of modern day
computers.

(Note: I can't find this British patent! Can you?)

More importantly for my interests, the patent literature is the only
Internet-accessible source I've found which describes the Zatopleg
system, in US3476311: Two-dimensional structure encoding (1969):

According to the Zatopleg system, a random number is attached to each
atom, which number cannot be assigned more than once within a
molecule. This would be followed by lists showing which other atoms
each atom is linked to. Therefore a Zatopleg code for one atom of a
molecule would consist of: first, an arbitrary number assigned to the
atom within the molecule; second, an identification number for the
kind of atom (e.g. its atomic number); and, third, the numbers of the
atoms to which it is connected.

So close, but frustratingly incomplete.
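Even so, the three-part record in that patent text is enough to sketch a data structure. The following Python is my reading of the description, not a reconstruction of Mooers' actual cipher: the names `AtomRecord` and `encode`, the 1 to 999 label range, and the use of Python's `random` module are all assumptions. The quoted text says nothing about bond types, so the sketch omits them.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomRecord:
    number: int         # arbitrary number, used only once within the molecule
    atomic_number: int  # identifies the kind of atom
    neighbors: tuple    # numbers of the atoms this atom is connected to


def encode(atomic_numbers, bonds, rng=None):
    """Build one record per atom from atomic numbers and index pairs."""
    if rng is None:
        rng = random.Random()
    # sample() draws without replacement, so no label repeats in a molecule
    labels = rng.sample(range(1, 1000), len(atomic_numbers))
    neighbors = {label: [] for label in labels}
    for a, b in bonds:
        neighbors[labels[a]].append(labels[b])
        neighbors[labels[b]].append(labels[a])
    return [
        AtomRecord(label, z, tuple(sorted(neighbors[label])))
        for label, z in zip(labels, atomic_numbers)
    ]


if __name__ == "__main__":
    # Heavy atoms of ethanol: C, C, O with bonds C-C and C-O
    for record in encode([6, 6, 8], [(0, 1), (1, 2)]):
        print(record)
```

Because the labels are arbitrary, two encodings of the same molecule will generally differ, which is exactly why a canonical form is needed before records can be compared directly.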

Eugene Garfield's tribute

Eugene Garfield wrote a tribute to Mooers in The Scientist,
Vol. 11(4), March 17, 1997. If
you've made it this far through my text, you'll be able to understand
nearly all of the context of that tribute; something I couldn't have
done two weeks ago.

Garfield pointed out one last difficulty that Mooers had as an
independent, for-profit researcher:

I remember resenting the fact that he was "selling" us on a
commercial, for-profit product, which I inherently mistrusted. I
hadn't yet overcome the idealistic notion that all good things were
nonprofit, which was probably a reflection of youthful
naïveté. I appreciated how difficult and frustrating it
was to compete with the arrogance and market advantages of these
nonprofit establishments. However, my later experiences with large
government agencies and nonprofit institutions changed that view.

(There may be a transcription error. I think it makes better sense as
"I didn't appreciate how difficult...".)

That's something that I, an independent, for-profit researcher should
bear in mind.

Comments?

This is incomplete research, and I don't know where I'll go with
it. There are a lot of books about the history of information science,
and I know so very little about the topic. I know there are still many
fans of Mooers out there; to them, I hope I did a good job.

My interest is in understanding the evolution of cheminformatics
systems, especially machine-based systems. If you have any details or
comments to add, please do so, or send me email to dalke at
dalkescientific dot com.