Friday, May 12, 2006

Tactical Musings

Designing a tactical Go module is also about understanding the underlying hardware.

Most design decisions have an impact on performance, and ultimately it is mostly performance that will provide the advantage over the competition or, more interestingly for Go players, greater playing strength, because search algorithms and heuristics are plentiful and the very best are freely available. What gives the edge is the speed of the move/unmove engine, the speed & quality of the plausible move generator/orderer, and the speed & quality of the evaluation function. Clever pruning schemes are public, but the heuristics are, of course, a higher form of black magic.

I have made tremendous progress in the speed & quality of the plausible move generator through the pattern-learning efforts. I've gained experience with the complexity of the efficient state machine required for the board tactics, and now, on a 64-bit architecture, it's "for real".

One of the compromises that needs to be made is the tradeoff between maintaining state and the penalties of branch misprediction. I can avoid if's (= branch mispredictions) by maintaining more state, but maintaining state is expensive, in terms of computation as well as potential cache-thrashing.

The obvious solution would be to use bitmaps exclusively, but this only works in theory. In practice a hybrid solution is optimal, because thanks to Messrs. Kernighan & Ritchie, modern CPUs lack the necessary opcodes. They told Intel: "We won't be using them in our compiler anyway", so Intel, happy not to have to debug and regression-test all that microcode for all eternity (and to free up precious die space), dropped them. By now they could have been optimized enough to be useful. As I have said before, I need popcount, but an instruction that shifts to the next set bit and keeps a running total of the number of shifts in a register would be even more useful.

Chess programmers belong to different "schools": some delta-update everything they need to maintain, others re-calculate things on an as-needed basis, minimizing state size. I am taking the hybrid approach. Things that can be calculated with relative ease, or things that are expected to get an opcode equivalent in the next decade, are calculated as needed. Initial design decisions will be made on the basis of educated guesses. Benchmarking of experimental alternatives will be done as soon as building on a sub-optimal solution would constitute a "point of no return".

I don't like to fill the RAM with the state space of the search tree, but prefer to stay inside the cache at all times, because the bandwidth gap between RAM and cache is steadily widening. I'm betting on ever-expanding L1 caches for the Zobrist hash tables. AMD, with its shorter pipelines, holds the performance advantage for TsumeGo stuff, but Intel wins hands down on cache size. I spend enormous time tweaking the design for the optimum of speed vs. utility. For example, I maintain separate lists of chains based on how many liberties they have, because I often need to iterate through chains in atari. But I do not keep their coordinates, because mutating them after a capture is computationally intensive and requires a lot of state change, potentially thrashing the cache. Instead I do a potentially slow bitmap-to-index conversion, until either the CPU supports it in hardware or I find a better algorithm. For that purpose I have just ordered "Hacker's Delight" from Amazon.

I don't think I'll be divulging more about the tactical module. I'll have to read up on Opteron instruction latency and refresh my bitfucking skills; I'll benchmark endlessly, alternating with plenty of "Archimedes" meditation (bathtub brainstorming between my left & right brain halves :) and slowly something beautiful will emerge. I'm not a singer, a painter or a poet - I can't bring beauty to this world. Instead, perhaps one day I can give something I consider beautiful (in its algorithmic implementation) to a select few Go players.

It will be a long and arduous road, and success is by no means secure. My weak point is my lack of knowledge of Go heuristics, but strong players/Go programmers have already published more such provenly useful heuristics than any Go program implements, and the Go knowledge that can be gained from books is of a much higher level than what is found in current Go programs. I will certainly brush up my playing skills, but I prefer programming Go over playing it.

I believe that the Go playing skill required to write a strong Go program is less than the Chess playing skill required to write a strong Chess program, but that is something for another blog entry. There are a few lights on the horizon for Go AI. A disadvantage is the size of the search space and the need to maintain a lot of wildly changing state during the search; an advantage is the binary, semi-static nature of the beast and the ternary symmetry of the chains, a key to efficient state maintenance.

My next posting might be a benchmark, comparing the speed of the 64-bit design compiled to a reasonably optimized 32-bit target (Delphi) with that of an as-yet poorly optimized 64-bit target (Free Pascal).

I am eager to get some comparative results (I have to code a few more weeks). I expect the 64-bit version to be around 3 times faster.

Update: The code is developing into something that translates well to VHDL. It would run much faster on an ASIC clocked at the same frequency as the CPU, because many of the time-consuming bitboard operations can easily be put into gates. And because the process is massively parallel, it would even be fast on an FPGA.