2015-06-04

Ben Denckla explains step-by-step how he converted Lionel Lord Tennyson’s 1933 autobiography into an ebook, and in the process updated it for the 21st century.

By Benjamin Denckla

This essay describes how I created an ebook edition of Lionel Lord Tennyson’s 1933 autobiography, From Verse to Worse.

Normally, I create ebooks from sources. By “sources” I mean the computer files that were used to create the paper book. So, I thought I might learn something by creating an ebook edition of an older book for which no such sources exist.

I came across From Verse to Worse while researching some family history. L. H. Tennyson was briefly married to my grandmother’s stepbrother’s widow. This concise statement merits some unpacking. My grandmother, K. N. R. Denckla, was adopted by the metal magnate W. H. Donner. W. H. Donner had a son, Joseph, whose widow, née Carroll Elting, was briefly married to the Baron.

Intrigued by this distant family connection, I got From Verse to Worse by inter-library loan, and was delighted to discover what an interesting and entertaining book it was. So naturally it came to mind when I was trying to think of a book I might convert.

From Verse to Worse is protected by copyright until 2021: 70 years after the death of its author. A big part of this project was determining the copyright holders and contacting them to get permission to publish this ebook. Toward those goals, I was graciously helped by the following persons.

Alan Tennyson

David Tennyson

Patrick John Leslie and other heirs of Mark Aubrey Tennyson

The production process




The ebook production process is usually secret, and the results are usually shoddy. I will share my process here, and I hope readers will find its results to be far from shoddy.

The ebook was created by combining the results of OCR and manual typing of the book. My hope is that the errors of these two processes are largely uncorrelated. If true, the combined process should have fewer errors than either of the two individual processes would have on its own, holding cost constant. In other words, my hope is that it is better to divide a fixed budget between OCR and typing rather than devote all of it to a somewhat higher-quality version of one process or the other.
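The idea of combining two independent transcriptions can be illustrated with a short comparison script. This is only a minimal sketch, not the bespoke tooling used for this book: the function name `find_disagreements` and the sample strings (with one invented OCR error and one invented typing error) are assumptions made for illustration. Wherever the two versions agree, the text is probably right; wherever they differ, a human looks at the scan.

```python
import difflib


def find_disagreements(ocr_text: str, typed_text: str) -> list[tuple[str, str]]:
    """Return (ocr_words, typed_words) pairs wherever the two transcriptions differ."""
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=typed_words, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Each non-equal opcode marks a span that needs a look at the scan.
            disagreements.append(
                (" ".join(ocr_words[i1:i2]), " ".join(typed_words[j1:j2]))
            )
    return disagreements


# Invented samples: the OCR misreads "m" as "rn"; the typist transposes two letters.
ocr = "The mule ran away. Stewart Richardson's false teeth disappeared with the rnule!"
typed = "The mule ran away. Stewart Richardson's flase teeth disappeared with the mule!"

for ocr_words, typed_words in find_disagreements(ocr, typed):
    print(f"OCR: {ocr_words!r}  typed: {typed_words!r}")
```

Because the two kinds of error rarely land on the same word, each process tends to expose the other's mistakes, which is the hope expressed above.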

To create the scans for OCR, I used a Bookeye 4 Basic planetary scanner. I had to do a non-destructive scan since I was using a library copy of the book. An unfortunate limitation of the copy I used is that all the photographic plates have been marked “Univ. of California” with a perforating punch!

For the OCR itself, I used ABBYY FineReader.

For post-OCR processing, I used a variety of software from the Ubuntu distribution of the GNU/Linux operating system, including emacs, aspell, Inkscape, grep, Perl, PHP, and git. I wrote some small bespoke programs for this project, and adapted some others from previous projects. I ran Ubuntu under VirtualBox. I used Bitbucket as a central git repository and issue tracker.

I checked my EPUB files using FlightDeck by eBook Architects. I checked my underlying XHTML files using the W3C Validator Suite.

For manual typing, I hired freelancers on Elance and oDesk (now Upwork) for about $0.25 per 100 words. The book is about 77,000 words long, so this came to about $193. The quality of their work varied widely. One freelancer used OCR despite claiming to have manually typed the chapters assigned. This was easily detectable from the types of errors present. On the bright side, one freelancer’s work was particularly high quality: Gordana Lukić, of Serbia.

Differences from the paper book

I have tried to tastefully choose which aspects of the paper book should be captured, from among those aspects that can be captured. I have tried to implement those choices with skill and grace.

Despite all this, I still regret what was lost in conversion. I suppose I am susceptible to sentimentality for paper books.

One way to mitigate the losses from leaving paper is to increase gains from going digital. For example, for some books, though I think not this one, internal hyperlinks can be added to enhance the reading experience. A more subtle way I found to add digital value to this book was to tweak the underlying representation of the text to improve features such as search and copy & paste.

Note that in this respect, creating an ebook is quite unlike typesetting a paper book. In a paper book, it hardly matters what means are used to achieve an end, and ebook conversion is rarely considered. For example, it hardly matters if a small caps result is achieved using an underlying lowercase representation (h.m.s.). But, for an ebook, it does matter. For example, if the user does a copy & paste of that text without formatting, the letters should be uppercase (H.M.S.). Thus, in this ebook, the underlying representation of such text is uppercase, and its styling not only applies the small caps font variant but also transforms the text to lowercase.
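The small caps point can be made concrete with a sketch. The class name `sc` and the sample sentence are hypothetical, and the stylesheet is only one plausible way to get the effect described; the CSS properties `text-transform` and `font-variant` themselves are standard. The Python part simply strips the markup, which is roughly what a copy & paste without formatting does.

```python
from html.parser import HTMLParser

# Hypothetical stylesheet fragment: the stored text is uppercase, the display is small caps.
STYLESHEET = """
.sc {
  text-transform: lowercase;  /* uppercase source text becomes lowercase for display */
  font-variant: small-caps;   /* lowercase letters are then drawn as small capitals */
}
"""

# Hypothetical markup: the underlying representation is uppercase.
MARKUP = '<p>An abbreviation such as <span class="sc">H.M.S.</span> is stored in uppercase.</p>'


class TextOnly(HTMLParser):
    """Collect only character data, discarding all tags and styling."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(markup: str) -> str:
    """Return the text a reader would get from an unformatted copy & paste."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(plain_text(MARKUP))
```

With the lowercase representation, the same extraction would yield "h.m.s.", which is the outcome the uppercase encoding is meant to avoid.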

The other way I can think of to mitigate the losses from leaving paper is to document some of those losses. These losses will be discussed extensively in what follows. I will also discuss the more positive topics of what was retained, and even what I think was improved (corrections)!

Index section dropped

Perhaps the most notable loss in the ebook edition is the index section. The paper book includes three indices:

General Index

Cricket and Cricketers

Great War

I chose to drop the index section to save time on the conversion project. It is quite a lot of work to truly convert an index. See my discussion below of how an index’s page number references should be converted.

In defense of my omission, an index is less relevant to an ebook since most ebook software can search. Yet, see my discussion below regarding the importance of keeping ebooks “in conversation with” their paper counterparts.

Page numbers

Ebooks based on a paper book can include data about the original page numbering. I chose not to include this data to save time on the conversion project. Even when it is included, I doubt that it can be displayed by a lot of software. But, even if I were sure there was little current support for this data, I’m not sure that would be a good excuse for my omission.

Is an ebook like software, expected to evolve along with its surrounding technology? If so, then perhaps it is valid to say that I expect to add page numbers when they become widely supported.

Or is an ebook more like a paper book, in that, aside from minor fixes, it is usually done once and for all with its first and final release? In that case, its first and final release must strike a tricky balance between the realities of today’s software and some guess about the software of the future.

Obviously, these questions go way beyond page numbers. There’s no easy answer to the question of whether an ebook is more like software or more like a paper book. All I can say for sure is a lot more people in the publishing industry should be deeply engaged with this question. From what I can see, so far most publishers have struck the worst possible balance. Ebooks have inherited the worst part of software, namely, its terrible quality, without taking advantage of software’s saving grace: its ability to evolve towards higher quality. Another way of putting it is that most publishers evolve their ebooks even less than their paper books, but their first ebook releases have none of the quality controls that have made this “release strategy” feasible for paper books!

Swinging back to page numbers, and today’s realities, it is important to keep in mind that most Kindle environments can display paper page numbers, but what is being displayed is not data provided in the ebook that the publisher uploaded to Amazon. It is data that Amazon has independently collected and injected into the ebook. So, for Kindle, it doesn’t matter whether you include page numbers; whether they will appear is out of your control. From Amazon Kindle Publishing Guidelines:

Amazon may make page numbers available for books as additional book metadata. Amazon generates these page numbers based on its own internal technology.

References to page numbers

Now, what about references to page numbers, as in an index, as opposed to the page numbers themselves? I didn’t include such references in this ebook. Most of these references didn’t even have a chance to appear since I didn’t include the index section. The paper book’s only other page number references are in the table of contents and the table of illustrations. I didn’t include those page number references.

But I’m using this afterword as an opportunity to criticize my choices, and discuss these choices in general. So I will pursue the question of the importance of page number references in an ebook. Opinions vary, such that some people think the right version of that question is, how bad is it to include them?

Since we happen to be talking about Amazon already, let’s look at its rather extreme stance:

There should not be any reference to page numbers in the book. Page numbers should not be included in cross-references or the index.

Amazon’s stated reason for this is that “Kindle books do not always map directly to page numbers in physical editions of the book.” I’m not quite sure what they mean by this, but I assume they mean something like “Paper page numbers do not always map directly to a location in a Kindle book.”

If that is what they mean, I don’t think it is a good reason, for several reasons laid out below.

Though paper page numbers do not always map directly to a location in an ebook, they usually do. Occasionally, for example, the order of some text and an image is reversed in an ebook if that order is more logical for the ebook: it usually does not make sense for an image to appear in the middle of a paragraph just because this was the location of a paper page break. In such situations, it is unclear where to identify the paper page breaks in the ebook. But such situations are relatively rare, and in any case they are usually not a problem, due to the next point.

Page number references in paper books are not usually referring to a page of the paper book per se. They are usually referring to specific content on that page. Perhaps Amazon engineers do not understand this, but readers do. Is Amazon’s concern that it is misleading to have page number references whose underlying links lead to locations that do not correspond to the top of that paper page? Indeed not only is it not misleading, it could be considered a bug if the ebook converter did not take the trouble to point the link to, semantically, what was intended. It would be lazy to just point to the content corresponding to the top of the paper page since that is usually only a rough approximation of the location intended.

Finally, a reason for both page numbers and their references is that an ebook does not exist in isolation. An ebook exists, or should exist, “in conversation with” its corresponding paper book.

A digression on page numbers

The IDPF’s EPUB 3 Accessibility Guidelines elaborate on this:

Do page numbers really matter anymore?

Yes. Despite the assertions of the futurists and technophiles, print still reigns supreme. As a result, anyone in a mixed print/digital environment — using an assistive technology or not — needs a way synchronize electronic and print content.

Also note that reading is not a solitary activity. From educational settings to reading clubs, the need to coordinate print and digital is very real.

For some annotated books, it can be useful to use two copies at the same time. This is often the case when the publisher has avoided the challenge of placing the primary text and its annotations within view of each other. If these distant annotations are correlated with the primary text by page numbers, and one of the two copies is to be an ebook, then it is important for that ebook to have page numbers and/or page number references.

Consider this, from Appel’s Annotated Lolita:

In a more perfect world, this edition would be in two volumes, text in one, Notes in the other; placed adjacent to one another, they could be read concurrently. Charles Kinbote in his Foreword to Pale Fire (1962) suggests a solution that closely approximates this arrangement, and the reader is directed to his sensible remarks, which are doubly remarkable in view of his insanity.

Kinbote’s sensible remarks are as follows:

I find it wise in such cases as this to eliminate the bother of back-and-forth leafings by either cutting out and clipping together the pages with the text of the thing, or, even more simply, purchasing two copies of the same work which can then be placed in adjacent positions on a comfortable table—not like the shaky little affair on which my typewriter is precariously enthroned now, in this wretched motor lodge, with that carousel inside and outside my head, miles away from New Wye.

By the way, the ebook of Appel’s Annotated Lolita is not just bad, as most ebooks are, but tragically bad. This is because it makes a valiant attempt to link the annotation, but fails. Thus, unlike most ebooks, its failure has some pathos to it.

Even when well done, linked annotations have their limitations. It is generally preferable for the primary text and its annotations to be within view of each other. Unfortunately this challenge is even greater in an ebook than in a paper book. It is perhaps not surprising that the most advanced ebook-like solutions to this problem come from the world of Bible software. As was the case with the printing press itself, the Bible was one of the first and best applications of ebook-like technology. (Pornography, too, tends to exploit new media early and well.)

Though Bible software greatly pre-dates current ebook standards, these standards did not “learn” from Bible software. In particular, unlike Bible software, there is no generic ebook format that easily allows a text and its annotations to be within view of each other.

Binding, numbering, and paper

After that rather liberal digression, let us return to what was lost in this ebook edition of From Verse to Worse.

Of course, as in every ebook, most of the physical aspects of the paper book were lost, so I will document them here, out of respect not only for the object but also for the anonymous Edinburgh workers who made it.

The book is bound in 18 signatures. The first signature is not marked. It is, implicitly, signature A. All 17 other signatures are marked with the capital letters B through S, skipping J.

It was sometimes the convention to skip J as a signature mark because it was not part of the classical Roman alphabet. Think of Julius written classically: Ivlivs! The book is not long enough for us to learn whether the letter U, too, would have been skipped.

All signatures except the last have 4 bifolios. The last signature has only 2 bifolios. So the book contains 17×4 + 1×2 = 70 bifolios. These 70 bifolios yield 140 folios, or leaves, each having a recto and a verso, yielding 280 pages. The last numbered page is 277.
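The signature arithmetic can be checked mechanically. The sketch below only restates the counts given above; the variable names are illustrative.

```python
import string

# Signature marks: A is implicit (unmarked); B through S, skipping J, are printed.
marked = [letter for letter in string.ascii_uppercase[1:19] if letter != "J"]
signatures = ["A"] + marked

BIFOLIOS_PER_FULL_SIGNATURE = 4
BIFOLIOS_IN_LAST_SIGNATURE = 2

# Every signature except the last is full.
bifolios = (len(signatures) - 1) * BIFOLIOS_PER_FULL_SIGNATURE + BIFOLIOS_IN_LAST_SIGNATURE
leaves = bifolios * 2   # each bifolio is one sheet folded into two leaves
pages = leaves * 2      # each leaf has a recto and a verso

print(len(marked), len(signatures), bifolios, leaves, pages)
```

The 280 pages leave three beyond the last numbered page, 277.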

The pages are counted according to these signatures. In particular, the plates, tipped in by gluing, are neither numbered nor counted in the numbering. In the table of illustrations, the page numbers given are for the pages that precede the plate in question.

The paper of the bifolios is laid or, more likely, the paper is wove but has a laid pattern imparted to it. I lack the expertise to tell the difference. The paper of the plates differs from that of the bifolios. I found no watermarks on either of the papers.

Fonts

The paper book uses a single font, except for the phrase “To the Memory of” on the dedication page. That phrase uses some variant of the blackletter font called Tudor Black. This font was identified for me by the generous experts on the Font ID forum of typophile.com. In this ebook, I chose to capture neither this font excursion, nor the main font.

Like the paper book, the ebook uses italic and small caps variants of its base font. Like the paper book, the ebook uses italics for the common Latin words and abbreviations v., i.e., and via. This may strike some readers as quaint, belabored, or both.

Cover

The cover of the paper book is green cloth on cardboard. The front and back covers are plain. The spine has the labels “From Verse to Worse,” “Lionel Lord Tennyson,” and “Cassell.” The labels are horizontal, in gold capitals. The font is intriguing but I have not identified it. To save time I have not included a scan of the spine, but I should have.

The cover of this ebook will probably be viewed by some as an abomination of graphic design, but it was obtained at excellent value, since only my time was spent on it.

Colophon

I retained the colophon from the paper book. Many ebooks blindly include the paper colophon. In contrast, this ebook notes the parts of the colophon that may be useful for documentation, but are not directly applicable.

I have added two ebook-specific elements to the colophon. One is a copy of the EPUB standard’s modified metadata field in the dcterms (Dublin Core) namespace. This provides a version identifier that is visible even in software that does not expose the modified field to the user. Both the underlying field and this visible copy of it are generated by my build process, so I have reasonable confidence that they will remain in sync with each other and with reality. That is, there is no manual step, as there often is, such as “remember to update the time stamp in files x and y before release.”
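A build step of this kind might look like the following sketch. It is not the actual build process for this ebook: the function names are hypothetical. The `meta` element with `property="dcterms:modified"` and the UTC time stamp format follow the EPUB 3 package metadata convention. The point is that both strings come from a single time stamp, so they cannot drift apart.

```python
from datetime import datetime, timezone


def modified_stamp(moment: datetime) -> str:
    """Format a moment as the UTC time stamp EPUB 3 expects (CCYY-MM-DDThh:mm:ssZ)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def opf_meta(stamp: str) -> str:
    """The machine-readable field for the package document."""
    return f'<meta property="dcterms:modified">{stamp}</meta>'


def colophon_line(stamp: str) -> str:
    """The human-visible copy for the colophon page."""
    return f"[dcterms:modified {stamp}]"


# A real build would use datetime.now(timezone.utc); a fixed moment keeps the sketch repeatable.
build_time = datetime(2015, 5, 14, 22, 15, 37, tzinfo=timezone.utc)
stamp = modified_stamp(build_time)

print(opf_meta(stamp))
print(colophon_line(stamp))
```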

The other ebook-specific element I added was a character support check. This makes it easy for me (or a user) to see whether the more exotic characters of this book are supported by the current software and font in use.
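The check line itself is just text, but a companion listing of its code points can help when a character shows up as an empty box. The sketch below is an assumption about how such a listing could be produced, not part of the ebook; it uses only Python's standard `unicodedata` module.

```python
import unicodedata

# The characters from the colophon's character support check.
CHECK_LINE = "← £ ½ æ œ à ç é è ê î ᶜ ć"


def describe(check_line: str) -> list[str]:
    """List each check character with its code point and Unicode name."""
    # Normalize to NFC so each accented letter is a single code point.
    normalized = unicodedata.normalize("NFC", check_line)
    return [
        f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, 'UNNAMED')}"
        for ch in normalized.split()
    ]


for row in describe(CHECK_LINE):
    print(row)
```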

Below is an example of the colophon. I call it an example of the colophon, rather than the colophon, since the modified time stamp may vary compared to other revisions of the book “in the wild.”

First published 1933

[dcterms:modified 2015-05-14T22:15:37Z]
[character support check: ← £ ½ æ œ à ç é è ê î ᶜ ć]
[The following text does not apply to this edition, but is included for documentation.]

Printed in Great Britain by the Edinburgh Press, Edinburgh

F20.933

Section breaks

The paper book uses vertical whitespace to indicate section breaks within chapters. Since the default styling of HTML often indicates paragraph breaks with such vertical whitespace, I chose to re-represent section breaks in the ebook with horizontal rules like the following.

This avoids ambiguity between section and paragraph breaks.

One could argue that instead I should have reset all default styling and captured the appearance of the book more closely. But, I think that is inappropriate for most ebooks, including this one. Just because an ebook can capture certain aspects of the paper book does not mean it should.

Undoing breaks

Because the book used only whitespace for section breaks, it is impossible to tell whether certain page breaks were also thought of as section breaks. Where a page started with a new paragraph and a new topic, it was tempting to guess that a section break was intended and insert a horizontal rule in the ebook, but I refrained from doing so.

This is similar to the ebook converter’s problem of determining, in a poem, whether a page break is also a stanza break. This is only a problem in poems whose lines are not clearly grouped into stanzas by rhyme, meter, punctuation, or other formal convention or invention.

It is also similar to the ebook converter’s problem of determining whether a hyphen at a line break is hard or soft. This can be a difficult problem, since whether to hyphenate two words or run them together is often a question of taste, and such tastes vary over time, place, author, and editor. For this ebook, if I was lucky enough to find the same word not spanning lines, I used its spelling there as a way to decide on the hardness of the hyphen in the line-spanning case. In other cases, I was forced to guess. Sometimes I was able to use Google Ngrams and Google Books to get a sense of the hyphenation conventions of this book’s time and place. In these cases at least my guess was a somewhat educated one.
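The first heuristic described above, looking for the same word where it does not span lines, can be sketched as follows. The function name and the sample text are invented for illustration, and the real decisions also drew on judgment and outside sources. The sketch assumes the text still contains its original line breaks, so that line-spanning occurrences do not count as evidence.

```python
import re
from typing import Optional


def resolve_line_end_hyphen(first_half: str, second_half: str, book_text: str) -> Optional[str]:
    """Decide how to join a word printed as 'first_half-' at a line end and 'second_half' on the next line.

    Returns the hyphenated or the run-together spelling, whichever the rest of
    the text supports better, or None when the evidence is absent or tied.
    """
    hyphenated = f"{first_half}-{second_half}"
    joined = f"{first_half}{second_half}"

    def count(word: str) -> int:
        # Whole-word, case-insensitive occurrences that do not span a line break.
        pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
        return len(re.findall(pattern, book_text, flags=re.IGNORECASE))

    hard, soft = count(hyphenated), count(joined)
    if hard > soft:
        return hyphenated   # hard hyphen: keep it
    if soft > hard:
        return joined       # soft hyphen: drop it
    return None             # no evidence either way: a human must guess


# Invented sample text, not taken from the book.
sample = "A quiet week-end in the country. Today was fine, and today we played."
print(resolve_line_end_hyphen("week", "end", sample))
print(resolve_line_end_hyphen("to", "day", sample))
print(resolve_line_end_hyphen("motor", "car", sample))
```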

These problems highlight another way in which creating an ebook is quite unlike typesetting a paper book. One of the main tasks of typesetting a paper book is breaking the text into lines and pages, but one of the main tasks of creating an ebook from a paper book is undoing these breaks! Unfortunately, as we have seen, how to undo these breaks can be ambiguous.

Far fewer breaks may need to be undone if one is lucky enough to be creating an ebook from sources, that is, from the same computer files that were used to create the paper book. But, as we have mentioned, in typesetting a paper book, it hardly matters what means are used to achieve an end, and ebook conversion is rarely considered. So, in the sources, line and page breaks are sometimes achieved by unsavory means. This means that even when creating an ebook from sources, line and page breaks can be hard to undo.

With respect to breaks, is ebook conversion just a destructive task, undoing the usually careful work of the typesetter? Asked another way: since a typesetter is also known as a compositor, does that make an ebook converter a decomposer?

The answer is no. Decomposition is a critical natural process, but its overtones are both destructive and unsavory. The ebook converter encodes the book so that it can be re-composed, live, by ebook software. In this process most of the control over line and page breaks is ceded to the software, but not all. The careful ebook converter makes judicious use of non-breaking spaces and other mechanisms to allow or forbid line breaks in certain locations. Similarly, the careful ebook converter forces page breaks in certain locations, and discourages them in others.

There is no guarantee that ebook software will honor such directions, particularly those having to do with page breaks. Still, as I have discussed above, ebooks must be coded with a balance of realism about the present and optimism about the future.

Page breaks

In this ebook I have quite widely discouraged page breaks. I have discouraged page breaks within elements like lists and block quotes. This can result in many pages that stop before they are full. Though this looks unfamiliar from a paper perspective, I think it is appropriate to an ebook. In an ebook, one need not be concerned about wasting paper, though a page turn should be considered to have some cost to the reader. This cost must be balanced against the value of keeping related material, e.g. a list, together.

It is important to note that there are two types of page breaks in a paper book:

a mild one, from verso to recto, i.e. across a spread

a severe one, from recto to verso, i.e. a page turn

On small screens or windows, ebooks are displayed as if they were all recto, to borrow an analogy from Robert Bringhurst. That is to say, an ebook is a bit like a paper book that is only printed on the recto of each page. Regardless of the analogy, my point is that on small devices, all ebook page breaks are severe. On larger screens or windows, the ebook may be displayed in two or more columns. This creates something like the mild page break across a paper spread.

Currently, the ebook coder has no way to distinguish between mild and severe page breaks. So, whereas a typesetter for paper might allow a list to split across a spread, but not from a recto to a verso, the ebook coder must assume the worst and discourage the break unconditionally.

Line breaks

It is desirable for some bits of text to stick together. On the other hand, it is also desirable to provide as many line break opportunities as possible. This is particularly true in the case of an ebook, like this one, that lacks embedded soft hyphens. Since, unfathomably, a lot of Kindle software fully justifies without hyphenating, line break opportunities are sorely needed in an ebook lacking embedded soft hyphens.

This used to be a problem with all Kindle software, but very recently (May 27, 2015) Amazon added hyphenation to some of its Kindle software. Unfathomable mysteries persist, though: it seems that to get hyphenation, not only the Kindle software but also the Kindle books themselves need to be updated. Most (all?) other hyphenation solutions require only the books or the software to be updated, but not both.

Below I describe the line break decisions I’ve made for this ebook, both in terms of line breaks I prevent, and line breaks I allow.

In this ebook I prevent line breaks in the following places.

between initials in names like L. H. Tennyson

in family names starting with de, du, van der, and von

in names in group photo captions

after St. (Saint)

between the last two words or so of a long caption or chapter title
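Two of the rules above are regular enough to sketch in code. This is a minimal illustration, assuming a plain-text pass that swaps ordinary spaces for no-break spaces (U+00A0); the function names are hypothetical, and the sample strings are only test inputs. A pattern this simple will also glue a sentence that happens to end in a capital letter and a dot to the next sentence, so its output would need review.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def glue_initials(text: str) -> str:
    """Replace the space after an initial (capital letter plus dot) when a capital follows."""
    return re.sub(r"(?<=\b[A-Z]\.) (?=[A-Z])", NBSP, text)


def glue_saint(text: str) -> str:
    """Replace the space after 'St.' when a capitalized name follows."""
    return re.sub(r"\bSt\. (?=[A-Z])", "St." + NBSP, text)


for sample in ("L. H. Tennyson", "J. W. H. T. Douglas (Essex)"):
    # Show the no-break spaces visibly as underscores.
    print(glue_initials(sample).replace(NBSP, "_"))

print(glue_saint("St. Paul's").replace(NBSP, "_"))
```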

I prevent other line breaks, but only on an ad hoc basis, i.e. not as the result of any rules. Some of these cases are as follows.

in “I Zingari”

between exclamation marks

before a single letter, J, that ends a paragraph in this afterword

Here is the case of multiple exclamation marks, which also gives some flavor of the book:

Next day a motor ambulance took us to Bailleul pending transfer to Boulogne and England. There I had my first bottle of champagne for months, and met the Rev. F. H. Gillingham, the famous Essex cricketer, who was chaplain in the hospital. He was awfully nice and kind, came and talked to me, and wrote home to my parents about me. I passed the night of the 15th at the Hotel Splendide at Wimereux which I had visited in peace time and which was now a hospital. There I found Miss Evie Gore, a neighbour of ours in the Isle of Wight, who was nursing and was very kind to me. I also found in my ward Stewart Richardson who had come down from the line for a new set of false teeth. His own he had lost in rather a singular fashion. Whilst shaving one morning, he had put them on the back of a mule. The Germans started shelling. The mule ran away. Stewart Richardson’s false teeth disappeared with the mule! ! !

I permit line breaks in many cases where others might prevent them. Many of these cases involve a number that some people feel should stick together with a word, such as the following examples from the book.

85 lb.

365 feet

8.15 a.m.

50 per cent

7 Portland Place

111 not out

10 to 1

George V

July 13

Interrupted paragraphs

In paper books, a paragraph can be interrupted and then resume. The interruption might be something like a list, poem, or block quote. Typically the first line after an interruption is flush left. This tells the reader that this is the resumption of a paragraph rather than the start of a new one. Of course this implies that all paragraphs have an indented first line.

In EPUB-based ebooks like this one, a p (paragraph) element cannot span most interruptions. This is a limitation of the content model of HTML. In that content model, paragraphs can only contain phrasing content. But, most interruptions are not naturally represented as phrasing content.

When an interruption cannot be represented semantically, it can still be mimicked, visually, through styling. I chose not to do so. As with section breaks, where I chose to use horizontal rules, here I sought to be compatible with a variety of paragraph styles. In this case I sought to be compatible with paragraphs with flush left first lines.

I chose to misrepresent the resumption of a paragraph as its own p element. But I also chose to add a leftward arrow (←) to indicate such a resumption. This arrow is added as content, not styling, to be resilient to environments without styling.

I found only two paragraphs in the book that are interrupted, though one is interrupted twice. It is perhaps no coincidence that the three interruptions are all preceded by the somewhat extraordinary combination of a colon followed by an em dash (:—). Since this combination appears nowhere else in the book, my leftward arrows are, technically, redundant. By the way, this combination of colon followed by an em dash required some careful treatment to prevent a line break before the em dash.
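One possible way to keep a colon and the em dash after it on the same line is to place a word joiner (U+2060) between them. This is an assumption offered for illustration, not a description of the treatment actually used in this ebook, and reading systems may differ in how well they honor the character. Wrapping the pair in an element styled not to wrap would be another option.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: zero width, forbids a line break at its position
EM_DASH = "\u2014"


def glue_colon_dash(text: str) -> str:
    """Insert a word joiner between a colon and the em dash that directly follows it."""
    return text.replace(":" + EM_DASH, ":" + WORD_JOINER + EM_DASH)


sample = "as these names will show:" + EM_DASH
glued = glue_colon_dash(sample)

# Show the last three code points so the invisible joiner can be seen.
print([f"U+{ord(ch):04X}" for ch in glued[-3:]])
```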

One interrupted paragraph is as follows.

The rest of the team were a formidable combination, as these names will show:—

J. W. H. T. Douglas (Essex)

Hon. L. H. Tennyson (Hants)

M. C. Bird (Surrey)

[. . .]

← This selection was somewhat criticized at home, [. . .]

By the way, the list above required some careful treatment to encode it, semantically, as an unordered list (ul), while inhibiting the bullets that are usually the default styling for unordered lists.

The other interrupted paragraph is as follows.

One [letter] came from the old Master of Trinity, an intimate friend of my grandfather’s, who wrote on my behalf the following charming letter:—

“Dear Sir,

“Allow me to say a few words in favour of the Hon. Lionel Tennyson, [. . .]

“I am, dear Sir,

“Faithfully yours,

“H. Montagu Butler.”

← Another letter in my favour was sent to the War Office by Canon the Hon. Edward Lyttelton, Headmaster of Eton, whom I have already mentioned once or twice in these Memoirs. It ran:—

“I hereby certify that Mr. L. Tennyson showed himself at Eton [. . .]”

← There was, Colonel Maxse had written, only one vacancy for a probationer in the Coldstream Guards, but I got it.

By the way, the first letter above required some careful treatment in order to allow its imbalance of double quotation marks. This imbalance is inherent to the quaint style of block quoting used in this book. Some care was also required to mimic the cascading indents of the letter’s closing.

Corrections in general

Correction is tricky. The situation with text is similar to that in software, where what seems to be a bug may in fact be a feature, especially when considered from another point of view. The tradition of biblical scholarship gives us a principle worth considering:

lectio difficilior potior

This can be translated as “the more difficult reading is the stronger.” Strict adherence to this principle interprets “stronger” as “preferable.” Here is my interpretation, which uses, or perhaps abuses, the leeway afforded by translation: “the more difficult reading is more often correct than you might expect.” This verges upon the saying, “truth is stranger than fiction.”

While this book is hardly a sacred text, these traditions inform my care for words, which I consider to be a hallmark of civilized behavior, if not holiness. This care for words does not bar their correction. On the contrary, it implies a responsibility to correct, but only to correct responsibly. If possible, corrections should be documented, as I do in the sections that follow.

Though risky, corrections are one of those few areas where an ebook may improve upon its original. If one is lucky, one may fix more than one has broken. I hope all my corrections are correct, and I hope that they outnumber the errors I have undoubtedly introduced elsewhere.

Dots removed

In the paper book, “per cent” has a dot (period) after it in one of the two instances of that phrase. It is merely quaint, not wrong, to use a dot to note that “cent” is an abbreviation for the Latin centum. But, since it was done inconsistently, I felt I had the leeway to drop it. Though the dot is quaint, I felt it would be distracting to the modern reader.

In the paper book, some Roman numerals have dots after them but most do not. It is merely quaint, not wrong, to place a dot after a Roman numeral. It indicates that letters are being used to form a number not a word. Hebrew’s geresh and gershayim are used similarly. But, as with the dotted “per cent,” I felt that inconsistency left me with the leeway to move to the modern, dotless convention.

The I Zingari dot

In the paper book, “I Zingari” has a dot after the “I” in one of the four instances of that phrase. (Two instances are in the index section, so they do not appear in this ebook.) Perhaps this dot was added because the “I” was mistaken as a Roman numeral. Or, perhaps it was added because the “I” was thought to be an abbreviation of the Italian article “Gli” rather than the article “I” in its own right. Indeed “Gli” would be the standard Italian article to use there. But in that case, wouldn’t an apostrophe be more appropriate, as follows?

’I Zingari

The origins of the use of “I” in “I Zingari” are shrouded in mystery. From the I Zingari website:

Having availed himself of a substantial amount of claret, R. P. Long drifted off into a ‘vinous slumber’. When the conversation eventually turned to a name for this fledgling [cricket] club, he deigned to murmur ‘The Zingari, of course.’

The next day, this became ‘I Zingari’, much to the continued confusion of defenders of the definite article and Italian purists.

The I Zingari Wikipedia page claims that the use of “I” rather than “Gli,” though nonstandard, is dialectal. Whether this dialect is simply the one spoken by intoxicated English cricketers, it does not say.

I did find another example of the dotted “I Zingari”: though there is no such dot in its title, see the essay “150 years of I Zingari: the vagrant gypsy life” by John Woodcock. It appeared in the 1995 Wisden Almanack and is archived on the ESPN “cricinfo” website.

Further research on this topic would be absurd, that is to say, right up my alley. Yet, I must press on. As with “per cent” and Roman numerals, I felt that inconsistency left me with the leeway to make my own choice. In this case it was not a choice of quaint v. modern. As far as I can tell it was a choice between a mysterious minority dotted convention and a majority dotless one. I chose to go dotless.

Dots added and dots retained

In the paper book, a few instances of “Mr” and “Mrs” are dotless, but the vast majority are dotted. I eliminated this inconsistency in favor of dots. I left the two instances of “Messrs” as they were printed: undotted.

Before leaving the scintillating topic of dots, I think it is worth mentioning one instance in which I preserved a quaint dot, despite its jarring look to the modern eye. In the paper book, “exams” is dotted, since the word is, or originated as, an abbreviation of “examinations.” Similarly, “exam” is dotted in the index section, though that section is not included in this ebook. Since there is no inconsistency to resolve, I felt I had no leeway to modernize, and retained the dotted form of “exams.” I did mark it with a “sic” in square brackets, and made that “sic” link to this paragraph.

In trying to get a sense of the prevalence of the dotted form of “exam” and “exams,” I ran across the following sentence from the story “The Last Term” in Rudyard Kipling’s Stalky & Co.:

‘An exam.’s an exam.,’ etc., etc.

It fairly bristles with punctuation, does it not?

Though it uses an apostrophe rather than a dot, another word-shortening that, quaintly, is noted by punctuation in this book is as follows.

’planes

Corrections regarding hyphens

In searching for hyphenation inconsistencies I might have introduced while undoing line breaks, I discovered various hyphenation inconsistencies present in the paper book. I resolved the following inconsistencies in favor of the hyphenated rather than run-together form.

fox-hunting

heavy-weight

life-like

light-hearted and light-heartedness

In some of the cases above, the hyphenated form is in the majority. In other cases, where there is no majority, I favored the hyphenated form. The hyphen can help break a line more gracefully if this book is rendered by software that does not do its own hyphenation.
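A search of this kind can be sketched in a few lines of Python. The sketch below is only an illustration, not the procedure actually used for this ebook: the function name `hyphen_inconsistencies` is hypothetical, and it assumes the book’s text is available as one plain string. It reports every word that occurs both hyphenated and run together, along with the count of each form, so that a majority (or a tie) is easy to see.

```python
import re
from collections import Counter

# A "word" here is a run of ASCII letters, optionally joined by single hyphens.
# (An assumption for illustration; accented letters would need a wider pattern.)
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def hyphen_inconsistencies(text):
    """Map each run-together form to the counts of its competing spellings.

    Only words that occur BOTH hyphenated and run together are reported.
    """
    counts = Counter(w.lower() for w in WORD.findall(text))
    report = {}
    for word, n in counts.items():
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        if joined in counts:
            report[joined] = {word: n, joined: counts[joined]}
    return report


sample = "A fox-hunting man went foxhunting; fox-hunting again. A light-hearted day."
print(hyphen_inconsistencies(sample))
```

A sketch like this only compares hyphenated forms with run-together forms. It would not notice a hyphenated form competing with a two-word form, which has to be looked for separately.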

As already noted, I did not embed soft hyphens. This was just to save time. So, unfortunately, lines will not break in the middle of words unless the ebook is rendered by software that does its own hyphenation. Embedded soft hyphens are almost always absent from ebooks, yet that does not excuse my omission, since shoddy quality is also almost always the case, and I seek to rise above that.

On a related note, this book’s CSS neither encourages nor discourages ebook software to do its own hyphenation. In other words, the choice of whether to hyphenate is left up to the defaults of the software, or possibly the choices of the user, if the software provides such choices.

I will now briefly return to my handling of inconsistencies in hard-hyphenation. In contrast to the four words listed above, I opted for the run-together form of “setback.” I cannot justify that choice other than to note that the hyphenated form only appeared in the index section. This is not that compelling a justification since in other situations I counted the index section as an equal partner to the main text in resolving inconsistencies in spelling.

I left the inconsistency of “Littlego” and “Little-go” alone since the hyphenated form only appears in a block quote. It may have been an explicit editorial decision to respect the choice of the author of the block quote, while making a different choice for the main text. This is a tricky area of editorial decision-making: what aspects, if any, of the main text’s style should be imposed on quotes?

I added a hyphen to the one instance of “Carlton Levick” to match the one instance of “Carlton-Levick.” Though this seems to be the arbitrary breaking of a tie, I find it more likely that a hyphen was accidentally dropped from a name than that it was accidentally added.

Misc. corrections

I removed an apostrophe from “it’s bow window.”

Misc. changes

The following are not corrections, in that they don’t address issues that I think are errors. They are additions, deletions, or changes that I felt compelled to make for a variety of reasons, including my own preferences. An ebook converter should usually suppress the urge to humor his own preferences, but in a few cases, I failed to do so.

I added row labels like “Back row,” in square brackets, to the caption for one group photo. Otherwise, the wrapping that can occur when viewing the ebook could make the caption hard to decipher.

I added “Image credit,” in square brackets, to image credits and moved them below all captions. In the paper book they are above all captions, tight up against the image, aligned to its right edge.

I added a few other things in square brackets, mostly in the front matter. Square brackets are not used in the paper book so anything in square brackets is my addition.

I “harmonized” the table of illustrations with the captions for the illustrations.

I did not replicate the paper book’s generous (excessive?) spacing inside quotation marks.

I removed the dot leaders from both tables in the book. They are difficult to replicate in HTML and I didn’t feel they were important.

I added em dashes to separate the names of four cricket players in a caption. In the paper book they are carefully spaced across the page, roughly following their locations in the picture. In an ebook this is difficult to do and in this case would offer little reward.

Quotation marks

I turned one instance of nested double quotes into single quotes. In the following, “Little-go” is in double quotation marks in the paper book.

“Allow me . . . ‘Little-Go,’ . . .”

This instance of nested double quotes was caught by a program I wrote that checks for various forms of balanced punctuation, including double quotation marks, parentheses, and square brackets. These are easy checks to do, yet, to my knowledge, no popular software does them. As a result, one of the many crimes against literacy committed by ebooks is unbalanced punctuation stemming from OCR errors. Unless there is a bug in my program, there are no such errors in this ebook. I am reminded of a famous remark by Donald Knuth, containing an inscrutable mix of hubris and humility:

Beware of bugs in the above code; I have only proved it correct, not tried it.

An imbalance of double quotation marks is allowed in order to capture the quaint style of block quoting used in this book. In a multi-paragraph block quote, all paragraphs except the last lack a right double quotation mark. Or, if you prefer, you can think of it as all paragraphs except the first having an excess left double quotation mark.
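The kind of check described here can be sketched briefly in Python. This is an illustrative sketch, not the program used for this ebook: the names `check_paragraph` and `check_book` are hypothetical, and it assumes the book is available as a list of paragraph strings that use curly double quotation marks. It flags unmatched closers, unclosed openers, and double quotes nested inside double quotes, and it tolerates the block-quote style just described by forgiving an unclosed left double quotation mark when the next paragraph begins with one.

```python
LDQ, RDQ = "\u201c", "\u201d"  # left and right double quotation marks
PAIRS = {"(": ")", "[": "]", LDQ: RDQ}
OPENER_FOR = {close: open_ for open_, close in PAIRS.items()}


def check_paragraph(par):
    """Return a sorted list of (index, description) problems in one paragraph."""
    stack = []     # (index, opener) for marks still awaiting their closer
    problems = []
    for i, ch in enumerate(par):
        if ch in PAIRS:
            if ch == LDQ and any(o == LDQ for _, o in stack):
                problems.append((i, "nested double quote"))
            stack.append((i, ch))
        elif ch in OPENER_FOR:
            if stack and stack[-1][1] == OPENER_FOR[ch]:
                stack.pop()
            else:
                problems.append((i, "unmatched closer"))
    problems.extend((i, "unclosed opener") for i, _ in stack)
    return sorted(problems)


def check_book(paragraphs):
    """Return (paragraph number, index, description) for each problem found."""
    report = []
    for n, par in enumerate(paragraphs):
        following = paragraphs[n + 1] if n + 1 < len(paragraphs) else ""
        for i, what in check_paragraph(par):
            # Allowed: a quote left open because the block quote continues.
            continued = (what == "unclosed opener" and par[i] == LDQ
                         and following.startswith(LDQ))
            if not continued:
                report.append((n, i, what))
    return report
```

Checking paragraph by paragraph keeps one stray mark from producing a cascade of reports for the rest of the book, which is one reason a sketch like this works on a list of paragraphs rather than on the whole text at once.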

Curly braces are not used in this book, though they are candidates for such balance-checking. In languages such as Spanish, exclamation and question marks would be candidates as well! Single quotation marks are difficult to check, since the right-hand one is used as an apostrophe. And, as we shall see below, in this book the left-hand one is used where you might expect a superscript “c”! Such problems can be overcome by introducing a source representation of the book that respects such distinctions as right single quotation mark versus apostrophe. But this book makes sparing use of single quotation marks so it was easier to just check them by eye.

Captured but notable

Though I did capture them in this ebook, a few aspects of the paper book are still worth noting.

The left single quotation mark, or turned comma, is a familiar character, but it is used in what may be an unfamiliar way in this book. It is used where you might expect a superscript “c” in the following names:

M‘Bryan

M‘Gahey

M‘Kay

M‘Lean

It was tempting to replace these instances with a superscript “c,” e.g. one of the following.

MᶜBryan (Unicode)

M<sup>c</sup>Bryan (HTML sup)

But, for all I know, the use of turned comma was a conscious choice, not the result of typographic poverty. So I left them alone. A drawback of this decision is that this ebook, like most ebooks, needs to work with a variety of fonts, and some of these fonts may have a turned comma that is not shaped much like a “c.” Then again, I suspect that the relationship between even the curviest of turned commas and a “c” is pretty distant in most readers’ minds.

An aside: another case where turned comma and superscript “c” can be used somewhat interchangeably is in the transliteration of the Semitic letter ayin. I have seen this cause a lot of confusion. Fortunately, Unicode now has a widely supported code point devoted to this purpose, “modifier letter left half ring.”

The book makes use of the ligatures for ae and oe in the following words:

Cæsar

mediæval

bœuf

manœuvres

The two-em dash, or omission dash, e.g. X——, is a somewhat familiar character, but it poses a challenge in an ebook. It is used in two places in the book (three times in each place). A Unicode code point exists for this dash, but it was added so recently that its presence cannot yet be relied upon. So, I used two em dashes. Depending on the font, these may or may not perfectly join to have the same appearance as a two-em dash. And, reducing letter spacing to overlap them can create an undesirable bump in the middle. (This bump is due to the additive composition of two overlapping anti-aliased strokes. By the way, I think this “gotcha” in the rasterization of vector graphics should be more widely known.)

My solution to these problems is to style the two em dashes to have transparent “color” and line-through decoration (strikethrough). Thus, as with the example of small caps h.m.s., the styled result is preferable, but the unstyled result is still reasonable. Here’s a comparison (of course only applicable to the environment in which you happen to be reading this):

styled: X——

unstyled: X——
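One way the styling described above could be expressed is sketched below in Python, which here merely generates the HTML and CSS. Everything in it is an assumption made for illustration rather than this ebook’s actual markup: the class name `omission`, the nested pair of spans, and the helper `style_omission_dashes` are all hypothetical. The nesting relies on the CSS rule that a line-through takes its color from the element on which it is set, so the outer span draws a visible line while the inner span makes the two em dashes themselves transparent.

```python
import re

EM_DASH = "\u2014"
TWO_EM = re.compile(EM_DASH * 2)

# Hypothetical stylesheet: the outer span draws the line in the normal text
# color; the inner span hides the glyphs, whose join may be imperfect.
CSS = """\
.omission { text-decoration: line-through; }
.omission span { color: transparent; }
"""


def style_omission_dashes(html_text):
    """Wrap each pair of adjacent em dashes in the two nested spans."""
    wrapped = '<span class="omission"><span>' + EM_DASH * 2 + "</span></span>"
    return TWO_EM.sub(wrapped, html_text)


print(style_omission_dashes("Old Mr. X" + EM_DASH * 2 + " stammered."))
```

If the stylesheet is ignored, the reader still sees two ordinary em dashes, which is the reasonable unstyled result mentioned above. How closely the line-through matches the height and weight of a real dash depends on the font and the rendering software.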

Ellipsis

In one paragraph, ellipsis is used to indicate an omission instead of a two-em dash.

I well remember getting into disgrace and being sent up to my room on account of a certain well-known personage, a friend of the poet’s whom I will call Mr. K. . . . Old Mr. K . . . stammered a good deal, not only during his conversation, which was diffuse and ornate, but also during his meals, with somewhat disastrous results to those in his immediate neighbourhood.

Since this seems like an inconsistency, I was tempted to substitute in my approximation of a two-em dash:

I well remember [. . .] Mr. K——. Old Mr. K—— stammered a good deal [. . .]

But I resisted the temptation. It also seems like an inconsistency that in one of the two times this “omission ellipsis” is used, it is tight up against the first letter (K), whereas the other time, it is spaced away from it:

K. . . . Old

K . . . stammered

To me, the tight case seems to indicate (K, period, ellipsis) rather than what I believe is intended, (K, ellipsis, period). But, some styles prohibit such a distinction, insisting that all instances of four dots, whatever their interpretation, be tight up against the preceding text.

There is one example of (ellipsis, period) spacing in the book:

“Yes—he is,” said Vincent, “but still . . . .”

Whether this means that such spacing was allowed, or whether it was an accident, is hard to say. In all these cases, I resisted the temptation to change anything, for thus it was written (sic erat scriptum).

This ebook prevents line breaks inside ellipsis. This should go without saying, but I have chosen to note it since, to my dismay, I have seen ebooks that fail to prevent such breaks. This book also prevents line breaks before ellipsis, so that a wrapped line may not begin with an ellipsis. Though a line break within an ellipsis is simply beyond the pale, the question of whether a line should be able to break before an ellipsis could be classified as a question of taste rather than a question of right and wrong.

This ebook does not use the precomposed ellipsis character. This character does have some merits. In particular, it inherently prevents a line break within itself. But, it is typically spaced too tightly for my taste. And, it prevents the even spacing of groups of four dots, be they period-ellipsis or ellipsis-period. That is, it prevents even spacing from being achieved in a font-independent way, which is of course the way most ebook styling should be done.
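One font-independent way to get both behaviors, no break inside an ellipsis and no break before it, is to join the dots with no-break spaces (U+00A0). The Python sketch below illustrates that idea only. It is an assumption, not necessarily the mechanism used in this ebook (a CSS rule such as `white-space: nowrap` on a wrapping element would be another route), and the function name `protect_ellipses` is hypothetical. It assumes ellipses are typed as three or more dots separated by single ordinary spaces.

```python
import re

NBSP = "\u00a0"  # no-break space

# An optional ordinary space, then three or more dots separated by single spaces.
ELLIPSIS = re.compile(r"( ?)(\.(?: \.){2,})")


def protect_ellipses(text):
    """Replace the spaces inside, and just before, each spaced ellipsis with NBSP."""
    def fix(match):
        lead = NBSP if match.group(1) else ""
        return lead + match.group(2).replace(" ", NBSP)
    return ELLIPSIS.sub(fix, text)


# The tight and the spaced cases discussed above are both handled:
print(repr(protect_ellipses("Mr. K. . . . Old Mr. K . . . stammered")))
```

The space after an ellipsis is left as an ordinary space, so a line may still break there. Because each dot remains a separate character, groups of four dots keep the same even spacing as groups of three.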

Conclusion

I hope this essay gave some insight into what it takes to make a good ebook edition of a paper book without sources. I realize that the level of detail may have been, at times, excruciating. But I hope any suffering I may have inflicted on the reader had a purpose, showing that ebook conversion, if it is to be of good quality, is not a task to be outsourced to the lowest bidder. The task has many worker requirements that make it inappropriate for such commodification. Some of those requirements are as follows.

The worker needs a self-imposed commitment to quality, since the quality of an ebook is expensive to measure, and therefore hard to impose externally.

The worker needs knowledge (or ability to acquire it quickly) in a variety of areas including the following.

Ebook technology, e.g. HTML, CSS, and EPUB3

Typography and orthography of the following:

the paper book’s time and culture: for this book, British English circa 1933, with a smattering of French, and even one instance of Italian (I Zingari)

current paper books, in the cultures of the ebook’s intended audiences

The worker needs that vague, subjective non-commodity called “taste.” Notably, taste is needed to modulate the application of the above-mentioned knowledge. For example, taste is needed in deciding when to mimic the paper book, and when not to do so.

Benjamin Denckla is an independent software engineer specializing in ebook creation without OCR. His particular focus has been ebooks containing Biblical Hebrew. Past phases of his software career have included realtime digital audio processing, build and test systems, and LAMP web sites.

The post How To: Improving a Print Book By Converting it Into an Ebook appeared first on Publishing Perspectives.
