Monday, April 02, 2007

Early Optimization Considered Essential

The ink on the previous posting is not yet dry, or a posting appeared on a German Go site: "Moyo Go Studio partially throws the towel in the ring". The intention was honorable though - drawing attention to the precarious situation the software is/was in, in terms of its commercial survivability, as a consequence of illegal/immoral attacks on me and my activities in the comp. Go scene.


Switching from semi- full-time to "hobby" isn't exactly "partially giving up", at least not with me, I have done lots of things as a hobby, including writing a commercially available 2D-CAD system that rendered faster than AutoCAD at the time.

Since yesterday, I did two things - fix a customers' PC that didn't want to POST - I made 800 NOK (about 100 Euro) on that - and I hobbied a little on Moyo Go. Programming is so much more fun when you don't rely on it to pay the bills!

Enfin, I just hobbied. I hobbied the translation into C++ of a Object-Pascal class definition that I wrote for the TsumeGo (Tactical Go-problem solving) module. The entire class is done (I mean written in Delphi), and now I'm porting it to C++. I said it before, C is beautiful for applications like TsumeGo.

As to the implentation, I have no idea how it works anymore (I only remember it was a work of either a raving madman or a genius ;-) and besides I optimized it so agressively that even if I did remember the particulars, it would be utterly impossible to map sourcecode back to high-level concepts.

That's not so bad, really, because this class is small and does nothing else but maintaining the board state. It's a black box. All I care about is the absolute fastest means of making and unmaking moves, and getting all state info I require. For the rest, all that matters is documenting the data formats. The thing should get the crap debugged out of it (the first time I will use automated testing, it would be insane to omit that in this case) and from that moment on it will be left untouched forever.

I really mean that, because I spent ages in optimizing the thing. Optimizing early is essential in this case, because otherwise you end up building a henhouse instead of a Gitmo, and you'll have to recycle the henhouse later anyway. Speed is not just part of the spec, it is the spec.

I mean, it's easy to hack up a state machine - what it's easy is to design one that maintains everything I need and is lightning fast. Wist lightning fast I mean as fast as possible. This needs explanation. "As fast as possible" depends on the following:

- choice of algorithm
- choice of language
- compiler optimization

The devil is in the details, however. Many people simply design an algorithm using what they learnt at comp. sci or from a book on algorithms (or perhaps they conjure up something clever), but I work a little differently. I look at the hardware at one end, and the problem at the other end, and I base my algorithm on the strengths of the hardware alone. Nothing else. I don't care about easy of implementation, I don't care about how much RAM is required, I don't care about anything, I just care about using the hardware as clever as possible to achieve the highest possible speed, when speed is requirement #1. In move generators, speed is the only requirement. All things being equal, more speed is better. Nobody cares that the code looks ugly when you make moves twice as fast, (and get high-level tactical data twice as fast) as the closest competitor.

So what did I do - I read the AMD optimization guide for their 64-bit CPU's and I based my design on the strengths of 64-bit processors and the peculiarities of caches, buses, RAM, pipeline stalls, instruction latencies and whathaveyou. I came up with a pretty bizarre solution, ending up with code that looks outlandish and, after optimization, does not look at all like a state machine for Go positions. It basically is a bunch of complex logical operations on a bunch of 64-bit registers. And what it yields is linked lists of chains and their adjacent chains (also empty chains) and their properties. Unique is that it generates Nth-order liberties with breakneck speed. Having access to the number of 4th-order libs at the same speed as other engines maintain the same state minus order-libs will yield an enormous advantage. Hence spending months on a fast state machine. Hell, even if it would take years I would still do it.

I had to switch to C for the TsumeGo engine - nothing else compiles to native 64-bit EXE and optimizes that code to the max. I turned off comments on this blog, to avoid having to explain why JITters and GC suck :-) The reason Java is so popular is mainly because the current generation of comp. sci profs don't understand pointers. The reason C# is popular is because C/C++ sucks at writing GUI-heavvy apps. But you can not, I repeat, can not, write a viable Go engine in those languages. To be viable, you have to be able to control what happens on the hardware level. Example: aligned_malloc and inlined prefetch instructions or NOPs. Is there such a thing in Java or C#? My definition of "viable" is the same as that of the top Chess-engine hackers.

The crux of this monologue is that sometimes, speed is the only requirement. And, perhaps counter-intuitively, you should not first make "something that works" and "optimize later", because you simply can't turn a llama into a leopard. Instead of performing expensive plastic surgery on the llama, go the extra mile to get a leopard.