in Uncategorized

Some Additional Thoughts on Large ebook Conversions

I absolutely love exploring books and acquiring new reading material. The quest for more reading material has often lead me all over the public domain-loving internet looking for obscure texts. Gutenberg, the Internet Archive, CCEL and Sacred Texts are among my favorite haunts. I often find myself attempting to convert texts for display on my nook SimpleTouch (this older piece of tech is probably worth its own post at some point). Calibre is, of course, a natural tool of choice, but I have found something odd: when dealing with larger texts, especially those of a more technical nature (as opposed to general fiction), Calibre has very limited options for taking the book from plain text to a formatted version. Most of the options it does present are based heavily on Markdown.

This design choice is a reasonable one, but often breaks down for texts that are not sufficiently close to Markdown. One of my recent conversions is an excellent example of this. I have been looking for good concordances of the Bible for my ereader to help with Bible study and general writing when all I have is a paper notebook and my Nook. It turns out that the options for concordances in either the Barnes and Noble or Amazon stores are relatively limited. So, I turned to CCEL and was attempting to convert “Nave’s Topical Bible.”

When attempting to convert from plain text, one of the biggest difficulties is structure detection. If you look at the Calibre documentation on structure detection (
https://manual.calibre-ebook.com/conversion.html#structure-detection), one of the more obvious things is that the chapter detection occurs after a book has been converted to HTML. There are effectively no options to control structure detection in the conversion from plain text to HTML.

What I wound up doing was falling back on the old txt2html tool, which has some more complete options than those in Calibre. I ended up using commands like the following to convert to HTML manually.

 $ txt2html -pi -pe 0 ttt.txt -H '^[A-Z]$' -H '^\s\s\s[A-Z][A-Za-z- ]+$' > ntt.html

This approach isn’t all gravy. It requires some manual tinkering to find good regexes for each individual book. Moreover, different books require different regexes. Here is another example from a book I converted.

 $ txt2html -pi -pe 0 ntb.txt -H '^[A-Z]$' -H '\s\s\s[A-Z-]+$' > ntb.html

In some cases, I even had to add a level of headers for use in the books.

Write a Comment

Comment

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax