Announcing Blutwurst 0.5

This one really isn’t new – I cut the release over a month ago. Blutwurst is a test data generation program. It is written in Clojure and offers a ready way to create data for use in a database or in unit tests. I plan to write a longer introduction to the application later.

There is a more detailed changelog entry at https://github.com/michaeljmcd/blutwurst/blob/master/CHANGELOG.md, but the big changes were the addition of JSON Schema parsing as another way to define a schema and of XML as a new output format.
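
For anyone unfamiliar with JSON Schema, the idea is that an ordinary schema document describing the shape of your data can drive the generation. A generic draft-04 example of the kind of input this targets (my own illustration, not something lifted from Blutwurst’s documentation) would be:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Customer",
    "type": "object",
    "properties": {
        "name": { "type": "string", "maxLength": 100 },
        "email": { "type": "string", "format": "email" },
        "balance": { "type": "number" }
    },
    "required": ["name", "email"]
}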

https://github.com/michaeljmcd/blutwurst/releases/tag/v0.5.0

Announcing mm2tiddlywikitext v0.1

I’m trying to catch up on writing posts about some of the open source stuff I’ve been building out lately. This is an older one, but still useful (I hope) to someone.

mm2tiddlywikitext is a stylesheet that reformats a FreeMind mind map as TiddlyWiki bulleted text. My main use-case for this is importing my mind maps in a searchable way into TiddlyWiki.

The easiest option is to use it directly from within FreeMind. Go to File > Export > Using XSLT…. In the dialog that pops up, provide the path to stylesheet.xslt and an output location. This will create a TiddlyWiki 5 JSON file that can be imported into TiddlyWiki.
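
If you would rather run the transform outside of FreeMind, any standalone XSLT 1.0 processor should do the same job. With xsltproc, for example, something along these lines should work (the map and output file names here are just placeholders):

 $ xsltproc stylesheet.xslt mymap.mm > tiddlers.json

The resulting JSON file can then be imported into TiddlyWiki the same way as the one produced by FreeMind’s export.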

Release link:

https://github.com/michaeljmcd/mm2tiddlywikitext/releases/tag/v0.1

Importing FreeMind Mind Maps into TiddlyWiki 5

One of the largest improvements I have made to my personal development workflow is keeping a commonplace book of all the things I have been tinkering with, for “fun” and for work. The process has worked best since I started keeping it in a TiddlyWiki, a nice digital format. This may be worth a post of its own at some point, though I have already written at least one post about it.

I still use FreeMind mind maps when doing brainstorming or freewheeling research when things are very much in flux. I tend to love mind maps during this phase, but don’t find them as attractive a tool for longer term knowledge management. This usually leads me to the point where I want to import my notes into TiddlyWiki. Now, it is entirely possible to simply export an image or HTML page from FreeMind and add the file to TiddlyWiki. It is also possible to attach the raw .mm file. In some cases, this may even make sense.

Sometimes, however, it would just make more sense to dump it as an outline in wiki text format. To help with this, I have created an XSLT stylesheet (mostly because I’ve never done real, dedicated work with XSLT) that can be used fairly readily. It is on github at https://github.com/michaeljmcd/mm2tiddlywikitext under an MIT license. One of these days I might package it into a better standalone utility. Maybe not. We’ll see.
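
The stylesheet in the repository emits TiddlyWiki 5’s JSON import format, but the heart of the conversion is simply mapping FreeMind’s nested node elements to bullet depth. A stripped-down sketch of just that mapping (illustrative only; this is not the actual stylesheet) might look like:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- Skip the central node; start with its children. -->
    <xsl:template match="/map">
        <xsl:apply-templates select="node/node"/>
    </xsl:template>

    <!-- One asterisk per level of nesting, then the node's text. -->
    <xsl:template match="node">
        <xsl:value-of select="substring('**********', 1, count(ancestor::node))"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="@TEXT"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:apply-templates select="node"/>
    </xsl:template>
</xsl:stylesheet>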

Creating Integration Tests with JNDI

Technically speaking, an automated test that requires JNDI is not a unit test. As an aside, it is preferable to segregate the portions of the application that access JNDI so that as much of the application as possible remains unit-testable. Nevertheless, if JNDI is used, something must ultimately do it, and it is preferable to be able to test this code prior to its use in production.

So far, the best starting point for me has been to use Simple JNDI to create the provider and allow the rest of the code to work unimpeded. The original Simple JNDI project has decayed a little bit; an update with bug fixes is available on github under a different group id (https://github.com/h-thurow/Simple-JNDI). I also used H2’s in-memory database so that I could put real connection information in the test case.

To get started, I added these dependencies to my pom.xml:

<dependency>
    <groupId>com.github.h-thurow</groupId>
    <artifactId>simple-jndi</artifactId>
    <version>0.12.0</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>1.4.192</version>
    <scope>test</scope>
</dependency>

It took some trial and error to get the configuration right, which is one of the motivations for this writeup. The first thing you need to do is add a jndi.properties file to your test resources. The settings I will discuss below were chosen to emulate the Tomcat server setup that we use here.

java.naming.factory.initial = org.osjava.sj.SimpleContextFactory
org.osjava.sj.root=target/test-classes/config/
org.osjava.sj.space=java:/comp
#org.osjava.sj.jndi.shared=true
org.osjava.sj.delimiter=/

Notice that the root is given relative to the pom. This is one thing that caused me a great deal of grief on the first pass. Another important element was the use of the space option, which was needed to emulate Tomcat’s environment. Within the test resources, I added a config folder, containing a single file: env.properties. This latter file had contents like the following:

org.example.mydatasource/type=javax.sql.DataSource
org.example.mydatasource/url=jdbc:h2:mem:
org.example.mydatasource/driver=org.h2.Driver
org.example.mydatasource/user=
org.example.mydatasource/password=

The data source in question did indeed have a dotted name, which only added to some of the initial confusion. This allowed my code-under-test to work as anticipated. For reference’s sake, this is what the Java code looked like:

private DataSource retrieveDataSource() {
    try {
        // In production this resolves against Tomcat's JNDI tree; in the tests,
        // Simple JNDI builds the same context from jndi.properties on the classpath.
        Context initContext = new InitialContext();
        Context envContext  = (Context)initContext.lookup("java:/comp/env");

        // DATA_SOURCE_NAME is the dotted name from env.properties,
        // e.g. "org.example.mydatasource".
        return (DataSource)envContext.lookup(DATA_SOURCE_NAME);
    } catch(NamingException e) {
        log.error("Error attempting to find data source", e);
        return null;
    }
}
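
To see the whole thing hang together, a test along the following lines should exercise the lookup against the H2 in-memory database. This is a sketch of my own: it assumes JUnit 4 is also on the test classpath, and the class name and constant are purely illustrative.

import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;

import java.sql.Connection;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import org.junit.Test;

public class DataSourceLookupTest {
    // Matches the dotted entry name defined in env.properties above.
    private static final String DATA_SOURCE_NAME = "org.example.mydatasource";

    @Test
    public void dataSourceIsAvailableThroughJndi() throws Exception {
        // Simple JNDI reads jndi.properties from the test classpath automatically.
        Context envContext = (Context) new InitialContext().lookup("java:/comp/env");
        DataSource dataSource = (DataSource) envContext.lookup(DATA_SOURCE_NAME);

        assertNotNull(dataSource);

        // The H2 in-memory URL from env.properties needs no further setup.
        try (Connection connection = dataSource.getConnection()) {
            assertTrue(connection.isValid(1));
        }
    }
}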

Some Additional Thoughts on Large ebook Conversions

I absolutely love exploring books and acquiring new reading material. The quest for more reading material has often led me all over the public domain-loving internet looking for obscure texts. Gutenberg, the Internet Archive, CCEL and Sacred Texts are among my favorite haunts. I often find myself attempting to convert texts for display on my nook SimpleTouch (this older piece of tech is probably worth its own post at some point). Calibre is, of course, a natural tool of choice, but I have found something odd: when dealing with larger texts, especially those of a more technical nature (as opposed to general fiction), Calibre has very limited options for taking the book from plain text to a formatted version. Most of the options it does present are based heavily on Markdown.

This design choice is a reasonable one, but often breaks down for texts that are not sufficiently close to Markdown. One of my recent conversions is an excellent example of this. I have been looking for good concordances of the Bible for my ereader to help with Bible study and general writing when all I have is a paper notebook and my Nook. It turns out that the options for concordances in either the Barnes and Noble or Amazon stores are relatively limited. So, I turned to CCEL and attempted to convert “Nave’s Topical Bible.”

When attempting to convert from plain text, one of the biggest difficulties is structure detection. If you look at the Calibre documentation on structure detection (https://manual.calibre-ebook.com/conversion.html#structure-detection), one of the more obvious things is that the chapter detection occurs after a book has been converted to HTML. There are effectively no options to control structure detection in the conversion from plain text to HTML.

What I wound up doing was falling back on the old txt2html tool, which has more complete options than Calibre does. I ended up using commands like the following to convert to HTML manually.

 $ txt2html -pi -pe 0 ttt.txt -H '^[A-Z]$' -H '^\s\s\s[A-Z][A-Za-z- ]+$' > ntt.html

This approach isn’t all gravy. It requires some manual tinkering to find good regexes, and different books need different ones. Here is another example from a book I converted.

 $ txt2html -pi -pe 0 ntb.txt -H '^[A-Z]$' -H '\s\s\s[A-Z-]+$' > ntb.html

In some cases, I even had to add a level of headings to the text by hand before converting.

Weighing in on JavaScript Package Managers

I have quite recently begun work on an open source project with a node back-end and front-end work planned to be done in React. This is my first full effort to work with the latest and greatest in JavaScript tooling. We use Ext JS and Sencha Cmd at work and, whatever else you want to say about that stack, it is different. My last full-blown front-end development work was done before the real Node boom and I pretty much did it the old fashioned way — namely, downloading minified JavaScript by hand and referencing it in my markup (shaddup ya whippersnappers).

JavaScript saw a real explosion in package managers a few years ago, a natural development for a growing ecosystem that previously had none. Market forces naturally took over and many of the earlier entrants have been culled out of existence. There are really two main options at this point: NPM and Bower. Bower has enjoyed a healthy following, but (by my entirely unscientific survey) it appears as though the NPM uber alles faction within the JavaScript world is growing stronger.

The sentiment is echoed in other places, but http://blog.npmjs.org/post/101775448305/npm-and-front-end-packaging gives a good overview of the fundamental syllogism. It basically goes that package management is hard, NPM is large and established, so you should use NPM everywhere rather than splitting package managers.

The argument has a certain intrinsic appeal – after all, the fewer package managers, the better, right?

The real problem is that it is possible to use NPM as a front-end package manager, but it is deeply unpleasant. Systems like Browserify and Webpack are needed to prepare dependencies for usage on the front-end. These are complex and, to a degree, brittle (I ran into https://github.com/Dogfalo/materialize/issues/1422 while attempting to use Materialize with an NPM application).

Even if one assumes that every package can ultimately be Browserified (and that doesn’t seem like an overly-optimistic assumption), the effort seems to be pure waste. Why would I spend time writing complex descriptors for modules on top of their existing packages? For all its shortcomings, Bower seems more robust. I spent a few hours fiddling with Browserify and Materialize without much success (although I think I do now see how Browserify would work), but mere minutes wiring up Bower.
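
For comparison, wiring Materialize up with Bower amounted to little more than the following (assuming Bower has been installed globally through npm and that the package is registered under the name materialize):

 $ npm install -g bower
 $ bower init
 $ bower install --save materialize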

This does not get into the fact that Browserify/Webpack require additional information to extract CSS, images and web fonts. Even when things are working, it would require constant effort to keep it all up to date.

At the moment, NPM, even NPM 3, simply does not have good answers for setting up front-end development. The NPM proponents really, in my opinion, need to focus on making front-end modules more effective rather than pushing tools that are little more than hacks, like Browserify and Webpack. At this point, I am just going to rock out with Bower. Maybe someday I will be able to trim out Bower — but I would rather spend time coding my application than giving NPM some TLC.

Converting Large Text Files to epub with Calibre

I spent some time debugging a long-standing issue I have had using Calibre to convert large text documents to epubs for viewing on my nook. The normal course of events was that I would feed a large (multi-megabyte; the example I was debugging with was 5.5 MB) text document into Calibre and attempt to convert it to an epub with the defaults. After a lot of churning, Calibre would throw a deep, deep stack trace with the following message at the bottom:

calibre.ebooks.oeb.transforms.split.SplitError: Could not find reasonable point at which to split: eastons3.html Sub-tree size: 2428 KB

I have long been aware that large HTML documents have to be chunked for epub conversion, although I do not claim to know whether this is mandated in the spec or merely allowed and needed as a technical requirement for individual readers. In either event, Adobe Digital Editions devices, like the nook, require chunks of no more than 260 KB. The error is clear in this light: for some reason, Calibre was unable to create small enough chunks to avoid issues.

My working assumption had been that Calibre would chunk the files at the required size. So, every 260KB, give or take a bit to find the start of a tag, would become a new file. The default, however, is to split on page breaks. Page break detection is configurable, but defaults to header-1 and header-2 tags in HTML. When your document is in plain text, as opposed to Markdown or some such, few, if any, such headers will be generated. This can cause Calibre to regard the entire document as a single page, which it cannot determine how to split into smaller files.

Converting a large, plain-text document to Markdown or HTML by hand is a task that is much too manual for someone who simply wants to read an existing document. My approach was much more straightforward. What I did was change the heuristic used to insert page breaks.

On the Structure Detection tab (when using the GUI), there is an option entitled “Insert page breaks before (XPath Expression):”. I replaced the default (which was the XPath for H1 and H2 tags) with the following:

 //p[position() mod 20 = 0] 

This will insert a page break every 20 paragraphs. The number was utterly arbitrary. Because paragraphs are usually well-detected, this worked fine. My large 5.5 MB file, a copy of Easton’s Bible Dictionary from CCEL, converted as expected.

Features & Identity

Recently, I was reading the Wall Street Journal’s article about Facebook working to incorporate Twitter-style hashtags into its platform (source: http://online.wsj.com/article/SB10001424127887323393304578360651345373308.html). The article has comparatively little to say. Like most mainstream treatments of technology, it is mostly a fluff piece, but one thing caught my eye.

The writer and, most likely, Facebook itself have lost sight of vision while staring at features. Twitter’s hashtag concept works because Twitter is built as a broadcast system. What I say to anyone, I say to the world. So, cross-referencing user posts by tag gives me an idea of what everyone on Twitter has to say about a specific topic.

Facebook is not, by design, a broadcast system. It really does aim to be more of a social network. When I use Facebook, the focus is on the set of people that I know. The cross-referencing idea has very limited usefulness in the echo chambers of our own friends, family and acquaintances. For better or worse, we probably already know what they think.

Both Twitter and Facebook need to concentrate on vision, especially the latter, which seems to have the larger share of feature envy. The focus is not on hashtags. It is on whether I want to communicate with a circle of friends or broadcast to the whole world. In all honesty, there is room for both, provided that they can find a way to monetize the affair. This has actually been the sticking point for all social networks so far. They get big, they get popular and they do so with venture capital. Then they collapse when their growth can no longer be maintained. Therein lies the challenge: coming up with a social networking concept that accomplishes the members’ goals in a sustainable way (and, yes, that means making money).

SpiralWeb v0.2 Released

SpiralWeb version 0.2 has just been released. I felt the urge to scratch a few more itches while using it for another project. As with version 0.1, it can be installed from PyPI using pip (a sample command appears after the changelog). The changelog follows:

== v0.2 / 2012-10-08
* bugfix: Exceptions when directory not found
* bugfix: PLY leaks information
* bugfix: Create version flag
* bugfix: Top level exceptions not handled properly
* bugfix: Exceptions when chunk not found
* bugfix: Pip package does not install cleanly
* Change CLI syntax
* Cleanup default help
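
Installation is a one-liner, assuming the package name on PyPI is spiralweb:

 $ pip install spiralweb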

Also of note is the fact that the source code has been moved over to github:

https://github.com/michaeljmcd/spiralweb

Now, off to bed. I have to get to the gym in the morning.