Friday, February 28, 2014

FSE tricks - Memory efficient subrange maps

 In the last blog post, we were left with a description of the FSE compression algorithm. However, the associated code was a bit cryptic. So let's spend some time understanding it.

Here are the lines of code :
    nbBitsOut = symbolTT[symbol].minBitsOut;
    nbBitsOut -= (int)((symbolTT[symbol].maxState - *state) >> 31);
    FSE_addBits(bitC, *state, nbBitsOut);
    *state = stateTable[(*state>>nbBitsOut) + symbolTT[symbol].deltaFindState];
The first 3 lines are the easiest to understand.
As suggested in an earlier blog post, the first task is to determine the number of bits to flush. This is one of 2 values, n or n+1, depending on whether the state crosses a threshold.
The threshold is stored in symbolTT[symbol].maxState. The second line looks a bit complex, but it's just a convoluted way to say :
    nbBitsOut += (int)(*state > symbolTT[symbol].maxState);
The >>31 trick transforms a negative number into -1 and a non-negative one into 0, assuming a 32-bit signed value and an arithmetic right shift. Compilers are supposed to apply this kind of transformation automatically, but since my tests showed a performance benefit in coding it explicitly, I did. (Edit : See this complementary information from Arseny Kapoulkine)
The 3rd line just flushes the bits. So we are left with the more complex 4th line.
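
To make the equivalence between the two forms of the second line concrete, here is a minimal sketch. The function names are made up for this illustration and are not part of FSE. It assumes that right-shifting a negative 32-bit signed integer is an arithmetic shift, which the C standard leaves implementation-defined but which mainstream compilers provide.

```c
#include <stdint.h>

/* Branchless form, as in the snippet above. The difference is negative
   exactly when state > maxState; shifting it right by 31 then yields -1
   (assumption: arithmetic shift on a 32-bit signed value), and 0 otherwise. */
static int nbBits_shift(int minBitsOut, int32_t maxState, int32_t state)
{
    int nbBitsOut = minBitsOut;
    nbBitsOut -= (int)((maxState - state) >> 31);
    return nbBitsOut;
}

/* Readable form: add 1 when the state is above the threshold. */
static int nbBits_compare(int minBitsOut, int32_t maxState, int32_t state)
{
    return minBitsOut + (int)(state > maxState);
}
```

Both functions return n at or below the threshold and n+1 above it, so the choice between them is purely a matter of generated code.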

The 4th line performs the conversion from newState to oldState (since we are encoding in the backward direction). Let's describe how it works.

A naive way to do this conversion would be to create conversion tables, one per symbol, providing the destination state for each origin state. It works. It's just wasteful of memory.
Consider for example a 4096-state table, for a 256-symbol alphabet. Each state value uses 2 bytes. This results in 4K * 256 * 2 = 2 MB of memory. This is way too large for any L1 cache, with immediate consequences on performance.

So we need a trick to reduce that amount of memory.
Let's have another look at a sub-range map :

Remember, we have the same destination state for all origin state values within a given sub-range. So what seems clear here is that we can simply shift all origin state values right by 9 bits, and get a much smaller map, with essentially the same information.

It's simple yet extremely effective. We now have a smaller 8-cell map for a symbol of probability 5/4096. The same trick applies to all other symbols, reducing the sum of all sub-range maps to a variable total, somewhere between the number of states and twice the number of states.
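
As a small sketch of this reduced map, assume states numbered 0 to 4095, with the two 9-bit sub-ranges placed first and the three 10-bit ones after them (the layout is an assumption of this example, since it depends on the picture). For readability the map stores the sub-range Id rather than the destination state; the two are in one-to-one correspondence. The names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout for a symbol of probability 5/4096, states 0..4095:
   two 512-wide sub-ranges (9 bits) first, then three 1024-wide ones
   (10 bits). Ids follow the numbering "larger sub-ranges first":
   the 10-bit sub-ranges are 1..3, the 9-bit ones are 4..5. */
static const uint8_t subRangeOfCell[8] = { 4, 5, 1, 1, 2, 2, 3, 3 };

/* Dropping the 9 low bits leaves one of 8 cells. Each 10-bit sub-range
   covers two consecutive cells holding the same Id. */
static int subRange_from_state(uint32_t state)   /* state in 0..4095 */
{
    return subRangeOfCell[state >> 9];
}
```

The duplicated entries for Ids 1 to 3 are the wasted slots that the next refinement removes.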

But we can do even better. Notice that each blue sub-range occupies 2 slots, both providing the same destination state.
Remember that the red area corresponds to n=9 bits, and the blue area corresponds to n+1=10 bits. All we have to do, then, is shift the origin state by this number of bits. Looks complex ? Not really : we have already calculated this number of bits. We just have to use it now.

For this trick to work properly, state values must range not from 0 to 4095, but from 4096 to 8191. If you do this, it results in the following sub-range map :

A few important properties of this transformation :
- There are as many cells as the probability of the symbol. It's not a random example, it's guaranteed to always be the case.
- The first cell Id is the same as the symbol probability (in this example, 5). It's also guaranteed.
- The sub-ranges are now stored in order (from 1 to 5). This is desirable, as it will simplify the creation of the map : we will just store the 5 destination states in order.
- Since each sub-range map now has the same size as its symbol probability, and since the sum of probabilities is equal to the size of the state table, the sum of all sub-range maps is the size of the state table ! We can now store all sub-range maps in a common table, whose size is the number of states.
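
These properties can be checked on the running example with a few lines of C. The sketch below hard-codes the symbol of probability 5 in a 4096-state table: the minimum bit count is 9, and the threshold 5119 follows from the two 512-wide sub-ranges occupying states 4096 to 5119. The constant and function names are made up for this illustration.

```c
#include <stdint.h>

/* Worked example: symbol of probability 5, states numbered 4096..8191.
   2 x 512 + 3 x 1024 = 4096, so states up to 5119 flush 9 bits and
   states above flush 10 bits. */
enum { PROBA = 5, MIN_BITS = 9, MAX_STATE = 5119 };

/* Cell index obtained by shifting the state by the number of bits flushed. */
static uint32_t cell_of_state(uint32_t state)
{
    uint32_t nbBits = MIN_BITS + (state > MAX_STATE);
    return state >> nbBits;
}
```

Running every state from 4096 to 8191 through cell_of_state only ever produces cells 5 to 9 : five cells, the first one equal to the probability. Cells 5, 6 and 7 come from the three 10-bit sub-ranges, cells 8 and 9 from the two 9-bit ones.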

Let's use the previous example again : a 4096-state table, for a 256-symbol alphabet, where each state value uses 2 bytes. The alphabet size now essentially has no impact on memory allocation. This results in 4K * 2 = 8 KB of memory, which is much more manageable, and suitable for an L1 cache.
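
The two memory footprints can be compared with a short sketch. The function names are hypothetical, and the small per-symbol symbolTT entries are deliberately left out of the count.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive approach: one full conversion table per symbol. */
static size_t naive_bytes(size_t tableSize, size_t alphabetSize)
{
    return tableSize * alphabetSize * sizeof(uint16_t);
}

/* Shared table: every symbol's sub-range map stored back to back,
   one 2-byte entry per state in total. */
static size_t shared_bytes(size_t tableSize)
{
    return tableSize * sizeof(uint16_t);
}
```

With tableSize = 4096 and alphabetSize = 256, the first function gives 2 MB and the second 8 KB, a factor equal to the alphabet size.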

We now have all sub-range maps stored in a single common table. We just need to find within it the segment corresponding to the current symbol to be encoded. This is what symbolTT[symbol].deltaFindState does : it provides the offset to find the correct segment within the table.
Hence :
 *state = stateTable[(*state>>nbBitsOut) + symbolTT[symbol].deltaFindState];
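
To see how the pieces fit together, here is a minimal sketch of how such a shared table and its per-symbol descriptors could be built, followed by one encoding step. Only the field names (minBitsOut, maxState, deltaFindState) and the final indexing expression come from the code above. The struct layout, the helper names build_tables and encode_step, and the way minBitsOut is computed are assumptions of this example, not necessarily what the FSE source does.

```c
#include <stdint.h>

#define MAX_SYMBOLS 256

typedef struct {
    int      deltaFindState;  /* start of the symbol's segment, minus its probability */
    uint16_t maxState;        /* states above this threshold flush minBitsOut + 1 bits */
    uint8_t  minBitsOut;
} SymbolTT;

/* tableSymbol[i] is the symbol assigned to state position i (0..tableSize-1);
   counts[s] is the number of positions held by symbol s, i.e. its probability.
   Assumes the counts sum to tableSize and tableLog <= 15. */
static void build_tables(const uint8_t *tableSymbol, unsigned tableLog,
                         const unsigned *counts, unsigned nbSymbols,
                         uint16_t *stateTable, SymbolTT *symbolTT)
{
    unsigned tableSize = 1u << tableLog;
    unsigned cumul[MAX_SYMBOLS + 1];
    unsigned s, i;

    cumul[0] = 0;
    for (s = 0; s < nbSymbols; s++) cumul[s + 1] = cumul[s] + counts[s];

    for (s = 0; s < nbSymbols; s++) {
        unsigned nb = 0;
        if (counts[s] == 0) continue;   /* absent symbol: nothing to describe */
        /* largest nb such that counts[s] << nb does not exceed tableSize */
        while ((counts[s] << (nb + 1)) <= tableSize) nb++;
        symbolTT[s].minBitsOut = (uint8_t)nb;
        symbolTT[s].maxState = (uint16_t)((counts[s] << (nb + 1)) - 1);
        /* cells run from counts[s] to 2*counts[s]-1, so subtracting the
           probability maps the first cell onto the start of the segment */
        symbolTT[s].deltaFindState = (int)cumul[s] - (int)counts[s];
    }

    /* Destination states are stored in order of appearance, each symbol in
       its own segment. States are numbered tableSize .. 2*tableSize-1. */
    for (i = 0; i < tableSize; i++)
        stateTable[cumul[tableSymbol[i]]++] = (uint16_t)(tableSize + i);
}

/* One encoding step: returns the number of low bits of the incoming state
   to flush, and replaces *state with the destination state. */
static unsigned encode_step(uint32_t *state, unsigned symbol,
                            const uint16_t *stateTable, const SymbolTT *symbolTT)
{
    unsigned nbBitsOut = symbolTT[symbol].minBitsOut
                       + (*state > symbolTT[symbol].maxState);
    *state = stateTable[(int)(*state >> nbBitsOut) + symbolTT[symbol].deltaFindState];
    return nbBitsOut;
}
```

For the symbol of probability 5 in a 4096-state table, this gives minBitsOut = 9 and maxState = 5119, matching the example. Note that deltaFindState can be negative for the first symbols, which is why it is kept as a signed value here.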

This trick is extremely significant. In fact, it was the decisive factor in the decision to publish an open source FSE implementation.

Tuesday, February 25, 2014

FSE encoding, Part 2

 In the previous article, we learned how to determine the number of bits to write, and which bits to write, when encoding a symbol using FSE compression. We are therefore left with the conversion exercise, from newState to oldState (remember that we encode in reverse direction). This is the most important step of the compression process.

Remember that we said we could reduce the determination of nbBits to a simple threshold comparison, just by looking at a sub-range map, such as this one :

Well, in fact, thanks to this map, we know quite a bit more : we know exactly which sub-range the current newState falls into.
Remember that we have numbered these sub-ranges, starting with the larger ones :

What we need to know is the ordered positions of the symbol within the state table. Since we have 5 sub-ranges, it simply means the symbol also occupies 5 positions, each directly associated with one sub-range.

And that's it : once we know the sub-range, we know the destination state of the encoding process, oldState.

Encoding is therefore just a repeat of this process :
- get Symbol to encode
- look at current state value
- determine nbBits, flush them
- determine sub-Range Id
- look for Symbol position of same Id : you get your next state
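
The steps above can be sketched directly, without any of the table tricks. This is an unoptimized illustration, not the FSE code : the names are hypothetical, states are assumed to be numbered from 0 to tableSize-1, and the smaller sub-ranges are assumed to come first in the state space.

```c
typedef struct {
    unsigned nbBits;     /* number of low bits of the state to flush */
    unsigned nextState;  /* destination state (oldState) */
} Step;

/* positions[] lists, in increasing order, the `count` state-table positions
   held by the symbol. Sub-range k (1-based, larger sub-ranges first) leads
   to positions[k-1]. Assumes 1 <= count <= tableSize. */
static Step encode_step_direct(unsigned state, const unsigned *positions,
                               unsigned count, unsigned tableLog)
{
    unsigned tableSize = 1u << tableLog;
    unsigned n = 0;
    unsigned nbLarge, nbSmall, threshold, id;
    Step r;

    /* the smaller sub-ranges are 2^n wide, the larger ones 2^(n+1) wide */
    while ((count << (n + 1)) <= tableSize) n++;
    nbLarge = (tableSize >> n) - count;
    nbSmall = count - nbLarge;
    threshold = nbSmall << n;   /* assumed layout: small sub-ranges first */

    if (state < threshold) {
        r.nbBits = n;
        id = nbLarge + 1 + (state >> n);
    } else {
        r.nbBits = n + 1;
        id = 1 + ((state - threshold) >> (n + 1));
    }
    r.nextState = positions[id - 1];
    return r;
}
```

For a symbol of probability 5 in a 4096-state table, this yields n = 9, three large sub-ranges numbered 1 to 3 and two small ones numbered 4 and 5. The caller flushes the low nbBits of the state before moving to nextState.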

If you look into FSE source code, you'll find that the following lines of code perform this job :

    nbBitsOut = symbolTT[symbol].minBitsOut;
    nbBitsOut -= (int)((symbolTT[symbol].maxState - *state) >> 31);
    FSE_addBits(bitC, *state, nbBitsOut);
    *state = stateTable[(*state>>nbBitsOut) + symbolTT[symbol].deltaFindState];

If you feel that the correlation between the text and the code is not obvious, that's because it isn't (especially the last line).
To reach this compact expression, a few non-trivial tricks have been used, reducing the amount of computation and, more importantly, reducing the amount of memory required to describe the sub-range maps.

These will be explained in a later post.