That book - Hacker's Delight - is everything I hoped it would be. This is what I got out of it:
1. I need only 16 cycles + 1 if to know whether a certain chain is adjacent to a board point. Normally this would take 4 ifs, but a non-taken branch costs at least 10 cycles. My code is at least twice as fast, on average.
2. I found a perhaps novel, if-less way to count the number of trailing 0's in a 64-bit register, five times faster than right-shifting and examining the LSB. The book still used four comparisons to do that. The code that shifts, compares and increments a point coordinate takes five cycles per trailing zero, so the average with sparsely populated registers (a 1 on position 31) is 150 cycles. My code takes 30 cycles for any number of trailing zero's. To arrive at this solution, I modified and combined two techniques described in the book and adapted them to 64-bits. This code efficiently converts bitmaps to coordinate lists.
I am starting to get convinced that MoyoGo's tactical module will be the absolute fastest. I maintain data structures that allow me to determine life/death status in near-theoretical optimum time, when running the most efficient, recently published algorithms.
So the design & implementation process of the TsumeGo module went like this:
A. Find the fastest method of determining L&D in Go literature.
B. Find the data structures one needs to maintain in order to do (A).
C. Implement a state machine that incrementally maintains (B).
D. Use every trick in the book to speed up (C).
There will be (E), namely finding and implementing a set of heuristics to prune the search tree. I already have a move-suggester in the form of a superfast, Zobrist-hash based pattern classification engine, and Go literature will yield the rest. But A, B, C and D are quite independent of E. As long as certain questions can be quickly answered about the state of the state machine, E can be efficiently implemented. I expended a lot of effort reading relevant Go programming literature and studying books on life & death to ensure that compex queries about the state can be executed in minimal time, so I am confident about (E) as well.
By making this code as fast as I can, I am putting distance between me and the competition. It's like putting a 20 GHz CPU in my customer's PC that can only be used to run Moyo Go. Chess programmers spend their prime shaving off cycles, for Go this is as least as important because Go needs a much faster computer to be able to play well. Saying that computers are too slow to play well and that therefore speed is irrelevant is defeatism.