Converting Large Text Files to epub with Calibre

I spent some time debugging a long-standing issue I have had using Calibre to convert large text documents to epubs for viewing on my nook. Typically, I would feed a large text document (several megabytes; the example I was debugging with was 5.5 MB) into Calibre and attempt to convert it to an epub with the default settings. After a lot of churning, Calibre would throw a very deep stack trace with the following message at the bottom:

calibre.ebooks.oeb.transforms.split.SplitError: Could not find reasonable point at which to split: eastons3.html Sub-tree size: 2428 KB

I have long been aware that large HTML documents have to be chunked for epub conversion, although I do not claim to know whether this is mandated by the spec or is simply a practical requirement of individual readers. In either event, readers based on Adobe Digital Editions, like the nook, require chunks no larger than about 260 KB. In this light the error is clear: for some reason, Calibre was unable to create chunks small enough to stay under that limit.
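
To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic; the 260 KB figure is the limit mentioned above, and the function name min_chunks is my own invention, not anything from Calibre.

```python
import math

# Chunk size limit discussed above, in KB.
FLOW_SIZE_KB = 260


def min_chunks(size_kb: float, limit_kb: float = FLOW_SIZE_KB) -> int:
    """Smallest number of files needed so that none exceeds limit_kb."""
    return math.ceil(size_kb / limit_kb)


# The sub-tree named in the error message (2428 KB).
print(min_chunks(2428))        # 10
# The whole 5.5 MB source file, treated as 5.5 * 1024 KB.
print(min_chunks(5.5 * 1024))  # 22
```

In other words, the one sub-tree Calibre complained about needed to be cut into at least ten pieces, and it had nowhere to make the cuts.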

My working assumption had been that Calibre would simply chunk the files at the required size, so that every 260 KB, give or take a bit to find the start of a tag, would become a new file. The default, however, is to split on page breaks. Page break detection is configurable, but defaults to H1 and H2 heading tags in the HTML. When your document is plain text, as opposed to Markdown or some such, few such headings, if any, will be generated. This can cause Calibre to regard the entire document as a single page, which it cannot work out how to split into smaller files.

Converting a large, plain-text document to Markdown or HTML by hand is far too much manual work for someone who simply wants to read an existing document. My approach was much more straightforward: I changed the heuristic used to insert page breaks.

On the Structure Detection tab (when using the GUI), there is an option entitled “Insert page breaks before (XPath Expression):”. I replaced the default (which was the XPath for H1 and H2 tags) with the following:

 //p[position() mod 20 = 0] 

This will insert a page break before every 20th paragraph, which gives Calibre plenty of places to split the document into files under the size limit. The number was utterly arbitrary. Because paragraphs are usually detected well in plain text, this worked fine. My large 5.5 MB file, a copy of Easton’s Bible Dictionary from CCEL, converted as expected.
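
For anyone who prefers the command line, the same setting can be passed to Calibre's ebook-convert tool through its --page-breaks-before option. The following is a minimal Python sketch under some assumptions: the file names are hypothetical, the helper names build_command and break_positions are mine, and break_positions only imitates the XPath predicate for the simple case where all paragraphs are siblings under one parent (which is what I would expect from a converted text file). It prints the command rather than running it.

```python
import shlex


def build_command(src: str, dest: str, every: int = 20) -> list[str]:
    """Build an ebook-convert argv that breaks pages every `every` paragraphs."""
    xpath = f"//p[position() mod {every} = 0]"
    return ["ebook-convert", src, dest, "--page-breaks-before", xpath]


def break_positions(paragraph_count: int, every: int = 20) -> list[int]:
    """1-based positions of sibling paragraphs that the predicate selects."""
    return [n for n in range(1, paragraph_count + 1) if n % every == 0]


# Hypothetical input and output names; substitute your own files.
cmd = build_command("eastons3.txt", "eastons3.epub")

# Print a shell-quoted version of the command for copy and paste.
print(shlex.join(cmd))

# With 65 sibling paragraphs, breaks land before these paragraphs.
print(break_positions(65))  # [20, 40, 60]
```

Changing the `every` argument is the equivalent of editing the number in the XPath expression; a smaller value produces more, smaller files.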