ISO and ANSI are Irritating Me…

Lately, as I have been swinging around various technologies, I have increasingly found myself directed to various standards issued by the ISO. The latest ones (with reasons ranging from curiosity to serious research) include ISO 8879, published in 1986 and specifying SGML, the ancestor of HTML and XML, and ANSI/INCITS 319-1998, which specifies Smalltalk (this one I managed to find on the web in the form of its last draft). Now, maybe I am getting spoiled by the IETF, ECMA, and W3C, which release both their drafts and their final standards for free, but I find the prices that the ISO and ANSI charge ridiculous.

In fact, I would argue that if you have to pay for a digital copy of a standard, then the whole point of standardization has been bypassed. Standardization allows anyone to pick up a copy of the standard and implement what it says with the expectation that it should work with other implementations (hey, that’s the theory; practice varies). Charging for paper copies is, of course, understandable and fair, since it costs money to put those together. But when you charge for access to a standard, and especially when you charge a lot, you dramatically reduce the number of people able to attempt to implement it (slowing the spread of standards-compliant versions of the technology). You also reduce the industry’s ability to check behavior against the specification. Let us say, for example, that the World Wide Web Consortium charged a ballpark $1,000 for a copy of the XML 1.0 specification. I am using some XML parsing library and it does not behave as I expect. If the spec costs $1,000, I cannot afford it, and there are many companies that would refuse to pay for it. After all, how valuable can that one document be? How can I check whether I am correct or the parser is? I can’t.

Moreover, many of the people chairing these boards are professors or researchers in whatever technology they are working on. So their salaries are basically paid by someone else anyway. The rest of the costs of running the organization should be minimal. Both of these organizations are non-profits, ANSI being a private non-profit and the ISO being an NGO, so why charge for the standards? They are not supposed to be making a profit anyway (I, personally, think that non-profits and our entire tax structure are insane, but that is, again, another blog post), so why not disseminate the standards more widely?

This probably sounds like me deviating from my capitalist roots. It is not, really. That would be the case if I wanted some sort of governmental agency to handle standardization (blasphemy!) or some sort of regulation enforcing the behavior I described above, but I believe in nothing of the sort. ANSI and the ISO have every right in the world to try to make an industry out of technology standards, and I, as a consumer, have every right to refuse to buy. It all comes down to supply and demand and, I think, over the long haul the trend will go very much against the ISO and ANSI. I am sure some larger corporations are only too glad to whip out their wallets and pay up for these documents, but they are in the minority.

Over time, I am certain that the ISO and ANSI will be forced to shift towards the policies of the IETF, ECMA, and W3C. For example, when Microsoft decided to publish a standard for C# and the .NET framework, they did not go to the ISO or ANSI. Instead, they went to ECMA. Most likely, this was done for reasons of time and expense, but that too is what differentiates the smaller organizations from the ISO and ANSI. The demand will continue to move in this direction: more efficient standardization, cheaper standardization, and wider availability of said standards. Ultimately, all standards organizations have two bodies of clients: those who publish standards and those who read them. If no one publishes specs through your organization, the rest does not matter. Similarly, if you publish specs that no one reads, you will be irrelevant. It is in the best interests of both parties to have things quick, easy, and cheap. Let us say that Microsoft did not want their spec to be widely available. Why would they get it published at all? It would be easier to just use it for your own dev teams and never let it see the light of day. With the advent of the world wide web, no one needs a standardization committee to make their work widely available. A few minutes and a few dollars and it is up for the world to see. Largely, what these committees offer is prestige, but inaccessible prestige fails to serve either clientele.

Twitter

I separated out my blogs because I believed that most people interested in my programming opinions would not be interested in my personal writings and vice versa. Every once in a while, though, there is a little bit of crossover. I just wrote a post on my personal blog about Twitter, which may be of interest to what little audience I may (but probably do not) have here: http://writing.mad-computer-scientist.com/blog/?p=131

The Days of the Cybersquatter are Limited

Run any number of searches for domain names and you will find most of the truly good ones taken. This is a minor irritant when a legitimate organization or person of some kind is using them. What is annoying are the boatloads of domains that have been claimed by some person or organization that does nothing but put up a page of banner ads on them, then attempts to sell said domain for copious quantities of cash. In short, these entities, known as cybersquatters, claim any domain name that they think someone might possibly want in hopes of cashing in big. Over time, I have heard various ideas for dealing with cybersquatters, most of which involve some regulatory agency (usually ICANN) stepping in. The US has already put some laws into place to help combat this. Really, ICANN, the US, India, and whoever else may take an interest in this should just forget it. The market will work this out, and it has already begun to.

The economics of cybersquatting rely on a given domain name being so important that some other person or entity feels they have no choice but to pay an exorbitant sum to acquire or reacquire it. This motivation is dying, and with it will die the profits, real or imagined, that can be obtained through cybersquatting. For established businesses the motivation is, rightfully, a powerful one. People expect that if they go to ibm.com it will take them to the website of International Business Machines. It is simply too big a company to expect otherwise. But most of these establishments have already acquired the domains they wanted or needed; IBM will not likely lose a domain name any time soon. Even if large sums were once made this way, the big dogs are done playing the game. New businesses by and large cannot afford (or are unwilling to afford) the purchase of a squatted domain. Instead, the choice of business name is made alongside the search for a domain name: if a suitable domain cannot be had, people are moving towards choosing another name rather than paying what amounts to protection money. Squatters are, no doubt, still attracted to this little get-rich-quick scheme because money really has been made that way in the past. But those profits are dwindling and will continue to dwindle until only a few foolish people continue to attempt it. In that day, cybersquatting will be all but dead, without the help of any bumbling, meddling regulatory agency.

Disabling Author Info on Drupal

You know those little headers that say “Authored by at 10:00”? So far, I haven’t had a Drupal setup in which they were actually wanted. I googled it, and the best suggestion I came up with was to add display: none; to the info’s class in a custom CSS file. Not bad, but it seems a little clunky. Someone on StackOverflow almost had it right when they suggested changing the settings of a specific theme. Of course, not every theme (including the ones I have been using) has this option. It turns out, though, that the setting can be changed globally by going to Administer -> Site Building -> Themes and clicking the “Configure” tab. Simply uncheck whichever node types you want to disable the author information for under “Display post information on”.

Much cleaner.
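
If you ever hit a theme that ignores that checkbox entirely, the same effect can be had in code. Here is a rough sketch, assuming a Drupal 6-style theme whose node template prints a $submitted variable; mytheme stands in for the actual theme name:

    <?php
    // In the theme's template.php: blank out the "submitted by" text
    // before the node template gets a chance to print it.
    function mytheme_preprocess_node(&$variables) {
      $variables['submitted'] = '';
    }

The checkbox is still the cleaner route; this is just a fallback.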

A Quick Rant

For a language that is supposed to be web oriented, PHP really stinks for setting up web services. Take the built-in SOAP extension. It lets you set up a request handler and populate it with methods, but it has no mechanism for automatically generating WSDL. What the heck? In ASP.NET, when I code up a class and mark methods as WebMethods, the WSDL is built automatically. With PHP’s SOAP library, you have to provide a URI to the WSDL yourself. In short, you have to use a third-party WSDL generator or write it by hand. Why in the heck would anyone want to do that? And even with a third-party tool, it adds an unnecessary step every time you make changes: you have to regenerate the WSDL if you change the signature of a method, add a new method, or drop an existing one. NuSOAP is a little better, but come on. This is the default library. It is fine for consuming web services, but who would ever want to write a full-blown web service in this environment?
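
To make the complaint concrete, here is roughly what the built-in extension gives you. A minimal sketch; the addNumbers function and the URI are my own illustrations, but SoapServer, addFunction, and handle are the real extension API:

    <?php
    // Non-WSDL mode: the service runs, but clients get no machine-readable
    // contract unless you write (or generate) the WSDL yourself and hand
    // its location to SoapServer in place of the null below.
    function addNumbers($a, $b) {
        return $a + $b;
    }

    $server = new SoapServer(null, array('uri' => 'http://example.com/mathservice'));
    $server->addFunction('addNumbers');
    $server->handle();

Contrast that with an ASP.NET .asmx service, where tacking [WebMethod] onto a method gets the WSDL generated on request, for free.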

I didn’t know you could do this…

I was paging some code through less and accidentally hit the ‘v’ key, and it launched my editor on the file. Unfortunately, it doesn’t work when the file is coming through stdin (though you could rerun the command, redirect the output to a file, and launch the editor on that). Even that case would be easy to implement: dump the input until EOF into a file in the /tmp directory, then launch the editor on it. I pulled open the man page and confirmed the behavior. I guess this just falls under live and learn.
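
Just to show how little is involved, here is a throwaway sketch of that idea (my own toy, not how less actually does it; it assumes $EDITOR is set, and a real pager would also reopen /dev/tty so the editor still has a keyboard after stdin is drained):

    <?php
    // Spill stdin into a temp file, then hand that file to the user's editor.
    $tmp = tempnam(sys_get_temp_dir(), 'pager');
    file_put_contents($tmp, stream_get_contents(STDIN));

    $editor = getenv('EDITOR');
    if ($editor === false || $editor === '') {
        $editor = 'vi';
    }
    passthru($editor . ' ' . escapeshellarg($tmp));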

What it takes to take Google

Since Google’s meteoric rise, many self-proclaimed Google-killers have come along. Obviously, they were more smoke than flame; Google is bigger and badder than ever. The most recent are probably Cuil and Microsoft’s reanointed Bing, but neither has as of yet made any meaningful dent in Google’s size. The reason is simple: the search they offer is inferior to Google’s. Google has diversified a great deal, but their bread and butter is search. If Google lost on the search front, their other applications would not sustain them. Sure, there is some advertising space in Gmail as well as Google Earth, but virtually everyone who owns a PC has used and knows of Google search, while a smaller percentage use Google’s more expansive offerings. So, to defeat Google, one of two things must happen. Either Google has to be beaten on the search front or the internet itself as we know it must become irrelevant. The latter is hard to imagine, but, then again, the fall of the mainframe and the rise of the internet would have been hard to imagine a ways back.

So, as things stand, to take down Google, our hypothetical company must defeat Google in the search arena. Again, Microsoft and Cuil are, so far, thinking along the same lines as Google. The problem is that they are not really building a better search than Google is. Google is not invincible here. With Googlebombing and Googlespamming on the rise, the signal-to-noise ratio in Google searches has dropped off noticeably, and any search engine that mirrors Google’s algorithm will fall to the same problem. Research needs to go into what it takes to build a next generation search engine. In its essence, Google’s algorithm takes some combination of popularity (i.e. links to the page) with the number of times that the search words (with some fuzziness built in) occur. It is actually a very good little algorithm, but we are seeing its weakness. I propose that the next generation search engine (whether it is built by Google or someone else), hereafter titled NGSE, will have to be a little more intelligent. In fact, I would go so far as to say that this NGSE is little more than a massive artificial intelligence riding the back of the spiders that crawl the web now.
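
To caricature that “popularity times relevance” idea (a toy of my own invention, not Google’s actual ranking function):

    <?php
    // Toy ranking: reward pages that many others link to and that mention
    // the query terms often, relative to their length. Real ranking is
    // vastly more involved than this.
    function toyScore($inboundLinks, $termOccurrences, $documentLength) {
        $popularity = log(1 + $inboundLinks);
        $relevance  = $termOccurrences / max(1, $documentLength);
        return $popularity * $relevance;
    }

The weakness lives mostly in that first factor: a link counts the same whether the linker loved the page or was tearing it apart.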

Work in artificial neural networks (ANNs) and pattern matching would be key in something like this. Rather than looking dumbly at what the page says it’s about and at what other people say about it (and you’ll notice the problem with inbound links: a link does not indicate that the person linking actually liked the site; they might be linking to it to run it down), it would try to see whether the page matches the pattern of the person searching. This sort of engine would be based on what you mean, not just what you say. It would take the semantic web to the world, but without requiring the world to adapt to it, as every proposal so far has.

Case in point: Google Images. From firsthand experience, I can say that when I search for images there are almost always better matches for what I was looking for than what Google brought up. If I had to guess, I would say that the alt attributes on the img tags go a long way in determining the ranking. The caveat is obvious: most webmasters do not put alt attributes on all their images, even though they should. Imagine, instead, an ANN that could, with a high degree of success, scan the image and deduce what it is showing. The more precise it was, the better it would be at showing users what they wanted to see.

Whether or not an ANN/spider-based search engine is even really feasible is an open question. Especially for the image matcher discussed above, nothing close has yet been created. Even if we could build such an engine, would its computational cost be prohibitive across the whole internet? After all, one of the keys to Google’s success was their ability to parallelize their algorithm on the massive scales required. Ultimately, something like this would have to be built to defeat Google on their home turf. The way to win when the opposition has a massive advantage is to be significantly better. Parity just doesn’t cut it, and Google got where they are for a reason. Their methods are nothing if not sound.

On Cargo-Culting

Like many a working programmer, I get to see the results of cargo cult programming a lot. To those of us who know better, it is evil, but I decided to sit down and write up a quick article on why it is evil. After all, the very reason that most cargo culters cargo cult is that they do not believe that it is evil.

Here is my composite picture of a cargo cult programmer: our cargo-culter is Joe Cargo (I’m feeling creative today). Joe’s interest in computers is mild. There is probably a fascinating story of how he got stuck in IT in the first place. Maybe it started with setting up a wiki for a few friends, or doing a quick and dirty website for a local ma and pa shop. Perhaps he worked at Megacorp, where the path to IT aid is a mile-long trail of paper, and he got conscripted by his real department to fill the gap in their IT resources, only to find his stopgap skills worth more than whatever it was he got hired for (as though anyone could remember). In any event, he never really moved beyond that point. He surfs the web and slaps together whatever kind of, almost, sort of, probably (if you don’t look at it cockeyed) works to complete the task at hand. He has no formal training and has never given any thought to what “best practices” would be. Manual, repetitive work is a way of life, and he does not give it a second thought; almost every other job in the world is based around repetitious labor, so why should this be any different? Joe meanders from project to project and company to company, always in the dark as to the real world of programming and computer science. If Joe ever met a true practitioner of the craft, he would regard him as a wizard: dark and terrible, but useful.

Joe Cargo is probably fairly proud of his work. Not excited by his craft, but satisfied with a job that he believes is well done. It runs, after all, and there are a lot of lines packed into a lot of files. He probably has no clue that someone more skilled than he is, let’s call him Sam Sixpack, views the whole creation as the spawn of Satan. Sam Sixpack looks at Joe Cargo’s work and sees unnormalized tables, and I don’t mean the kind that should be 5NF. No, I mean the 250-column-wide variety with repetitive data and would-be primary keys based on names that are not always consistent. He sees work that takes hours to run when it should take minutes, worst case. He sees code that has been copied and pasted all over the code base rather than centralized in a function, class, module, or what have you; when Sam sees this, he groans at the hours it will take him to fix or update every single instance of that one block of code. Sam sees code that feels dirty, rather than clean or elegant. It lacks formatting, it rambles, it does unnecessary work. In general, it just does not make sense.

If we assume that this is a reasonably accurate composite of most cargo culters, it is not too hard to examine it and pick out the hows and the whys. The why is easy: cargo culting is the result of laziness. Larry Wall once wrote that one of the virtues of a programmer is laziness, but this is another kind of laziness. The laziness Wall wrote about is that of a programmer who refuses to do work that could be automated and, so, puts in extra work now to save manual effort in the future. Cargo culting is based on a laziness not of overall work, but of the mind. They cannot be bothered to think. They see something, but it would require straining the brain too much to understand it. If everyone took this approach to technology, we would still be pushing rocks around with our bare hands, because no one would have seen the utility in investing in tools.

What can those of us who care more do? Unfortunately, not much. Cargo culters got where they are through sheer sloth. If they could not be bothered to learn on their own when computer books, articles, and resources are plentiful (or even just learn from the code they steal on a regular basis), there will not be much that you can teach them. Those who want to learn learn more from having a knowledgeable person nearby; those who do not will not learn anything either way. In the final analysis, cargo culting is like any other form of laziness in business. It can only be handled by the person in question shaping up or shipping out.

It is sad because it is bad for everyone, whether they realize it or not. Businesses get sloppy, second-rate software. Users get a tool that is often the bane of their very waking existence. Next-gen coders get headaches from trying to clean up the mess sufficiently to keep their own jobs. Finally, the cargo culters themselves get a bad time of it. The fruits of this mudball building cannot be hidden indefinitely, even if the cause of it can. The cargo culter may be fired, or leave under increasing pressure to maintain the impossible. In any event, they do not get the best they could have had if they had built something well. Their own skills (what few they have) decrease in value, since the culter does not learn (if they did, they would not remain cargo culters) or keep up with an ever-changing industry. So, just remember: when you see a cargo culter, you see a history of everyone involved losing.

Stick a fork in Chandler

I wrote not too long ago about my impressions of Chandler and its development after reading the book Dreaming in Code. Now, before I continue, I would like to point out that I understand that the open source world has, by and large, forgotten about Chandler. For good reason, too. It is the open source world’s equivalent of Duke Nukem Forever: well funded, ambitious, and hyped vaporware. So, to some extent, the world has already stuck a fork in Chandler, but bear with me. The interest in Chandler may be minimal, but if you go to OSAF’s web site the dream is clearly still alive. When I logged into my Gmail today and was marking some mailing list messages read, I noticed that there was a new option: “Add to Tasks”. Hmm. Gmail now has an option whereby an e-mail can “become” a task. This sounds strangely like the Chandler concept of stamping, the goal of which was to “knock down the silos” dividing the different types of information. The dream that was behind that software is starting to leak out and spread. Pretty soon, Chandler will have nothing to bring to the table, not even the vision that kept everything going. Gmail is rapidly heading, through evolutionary development, to where Chandler has only dreamed of going.

Philosophical Language

I began reading In the Land of Invented Languages today after hearing about it on Lambda the Ultimate. Currently, I am reading about John Wilkins’s failed attempt (one of, apparently, many) to build a philosophical language. Like several readers at LtU, I found my mind turning to its application to programming. Like the noble readers of that blog, I feel that the correlation between constructed languages (from Elvish and Klingon to Esperanto) and programming languages is a strong one. The irony is that the latter have gained more traction than the former. Many constructed languages, like Wilkins’s, are built around the idea that ambiguity should be removed from language. In programming, this is not a matter of taste: ambiguity must be, and eventually is, removed. In complex languages like C++ (which, I assert, is complex in entirely the wrong way, but that is a post for another time), it may be unclear from the spec how a feature should be implemented, but the implementors ultimately make some decision. So, we have dialects: Visual C++, GNU C++, Borland C++, etc., ad nauseam. In human language, however, ambiguity is not neutral. It is actually a positive. Literature and poetry revel in the ambiguity of language, in puns and rhymes and all those stupid idiosyncrasies. John Wilkins would probably have made one heck of a programmer.

Arika Okrent, the author of In the Land of Invented Languages, points out that Wilkins’s language was a great linguistic study and completely unusable as a spoken tongue. She is right. But a language that is unfit for human speech is not necessarily worthless. As evidence, look at the myriad of computer languages available. These are all useful (well, almost all), but you would never catch me speaking to a person in C#, Java, PHP, Lisp, or what have you. The philosophical language is the kind of thing that computers love: lacking in ambiguity, with adding a new concept as simple as placing a stub in a massive dictionary. The understanding comes almost for free. A great deal of effort has gone into trying to get machines to understand human language, and at the current stage of development it is a lost cause. Hopefully it will not always be, but right now our combination of machine and algorithm cannot untangle the ambiguities of human speech. The example one of my computer science professors used was how a machine would figure out the meaning of the phrase “fruit flies like a banana”. Is it that flies, fruit flies in particular, enjoy bananas? Or that fruits fly through the air as a banana would?

The philosophical programming language might be the next step. True, it might be a little harder to pick up than BASIC or PHP, but it would be a great deal more expressive. I know, this also sounds like it is approaching the heresy of building a DSL and expecting random business personnel to do their own programming. That is not really what I have in mind. The programming would still have to be done by programmers, but it would be more of a dictation and less of a description. As I looked at the excerpts from Wilkins’s tables, I was reminded, strangely, of Prolog predicate evaluation. It would be easy to represent his whole vocabulary as a sequence of facts at the opening of a Prolog program. With an unambiguous grammar, the whole thing could be parsed, understood, and executed.

To the best of my knowledge, this has never been tried. I would love to see a first shot at it, wrinkles and all. Give me a shout if you know of or are working on something like this.