Every serious hacker sooner or later needs the popcount instruction.
This "population count" instruction counts the set bits in a register, and is so useful that the NSA demands that all computers they purchase implement it in hardware.
Decryption, uncompression (decompression is something divers do :), pattern recognition, Zobrist hashing (generating keys in a certain Hamming distance range) and cellular automata all profit from a "popcount" (better to be called BC) instruction, and the time and effort it saves is very substantial, compared to a kludge in MMX or SSE2, where it's done on 64 or 128 bits in parallel.
But even with hundreds of millions of transistors in current CPU's, there isn't a single mainstream desktop CPU that implements this so-called "canonical NSA instruction".
This reeks of a conspiracy.
The Itaniums and some less prolific CPU's implement it (V9 SPARC and Seymore Cray's Cyber, both at the explicit request of the NSA - as well as Sun's UltraSparc, the PowerPC and Dec's Alpha 21264, heck - even the T800 Transputer had it), but neither the latest Pentiums, nor the latest Opterons support it.
Conspicuously absent, in spite of a billion transistors in real estate.
It wouldn't be the first time that a microprocessor manufacturer (Intel) would implement undocumented opcodes, only to "licence" them to major - sworn to secrecy - software producers.
Those opcodes can have aliases, so that when they leak, they can be traced back to the licencee.
That way, Intel could provide Microsoft with a competitive advantage in producing Codecs and unZIPpers (to give but an example), whilst Microsoft would, for Intel's benefit, delay their 64-bit OS for the Opteron for more than a year, to give Intel time to catch up with AMD. Of course this is all conjecture, but it's inspired by the historical fact that a certain CPU manufacturer has sold the specs to undocumented opcodes before, whilst we all know that Microsoft is getting convicted for illegal monopolistic practices on a regular basis.
In the past there was COCOM, which was a set of export restrictions to prevent the Cold War adversary from getting their hands on powerful CPU's, amongst other things. But nowadays the "enemy" is everywhere, the new enemy uses encryption and the NSA would like to keep the edge.
One of the ways to maintain an edge is to either keep certain capabilities of current CPU's a secret, or to prevent these capabilities from being implemented in affordable mainstream PC's altogether.
If mainstream CPU's would have a popcount instruction, I could make a Go position evaluation function run twice as fast as it does now.
In Go, chains of stones have liberties, and when chains are merged or captured, liberties need to be re-counted.
Merging chains can be done very efficiently using bitboards (as used in chess), and it would be ideal to count liberties using this much missed instruction.
At the moment I need to use a parallellized algorithm that uses MMX or SSE2.