A maximally-dense encoding for n-choose-k

Last month¹, John Regehr asked:

Random question: 32 choose 8 gives less than 11e6 possibilities whereas 24 bits has more than 16e6 choices of value. It seems like there must be a standard fast trick for encoding the choices in the bits. Anyone know it?

John Regehr, Google+ post

As it happens, this is something I’ve occasionally wondered about, too.

The short answer is that there is a standard (though not particularly fast) way to do this called the “Combinatorial Number System”. As usual, Wikipedia has the details, though in this case I found their explanations a little hard to follow, so I’m going to go through it here myself as well.

First, let’s back up for a bit and make sure we know what we’re talking about.

If you wanted a way to compactly encode any combination of — let’s say — 32 choices, your best bet (assuming a uniform distribution) would be to use a 32-element bit-vector, with one bit set per choice made.

If we were implementing that in C++, we could do something like this:

uint32_t encode(const std::set<int>& choices) {
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result |= 1 << choice;
  }
  return result;
}

This pattern is common enough that in practice you’d be more likely to use the bit-vector directly, rather than materialise anything like an explicit set of choice numbers. But while it’s a good solution for the general case, what if your problem involved some fixed number of choices instead of an arbitrary number?

For the purposes of the rest of this discussion, let’s pretend we’re going to make exactly four choices out of the 32 possible. That gives us about 36,000 different possible combinations to encode, which we should be able to fit into a 16-bit value.

Actually, for almost all purposes, there’s still nothing really wrong with the implementation above, even though it is using twice as many bits as needed — unless perhaps you had a very large number of combinations to store (and are willing to trade simplicity for space). However, as we’ll see later, the solution to this problem has a few other uses as well. Plus, I think it’s an interesting topic in itself.

If you just want to read about how the combinatorial number system itself works, feel free to skip the next section, as I’m going to briefly take a look at a ‘better’ (though still non-optimal) alternate to the above.

20 is less than 32

As an improvement to the above one-bit-per-choice implementation, we could simply encode the number of each choice directly using four groups of five bits each:

uint32_t encode(const std::set<int>& choices) {
  assert(choices.size() == 4);
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result = (result << 5) | choice;
  }
  return result;
}

This is still fairly simple, and at 20 bits output, it’s more compact than a 32-element bit-vector, but it’s also larger than our optimal result of (approximately) 16 bits.

We can actually see why this is: the encoding we’ve chosen is too expressive for what we actually need to represent². In this case, there are two ways in which our encoding distinguishes between inputs that could be encoded identically.

The biggest waste occurs because we’re encoding an ordering. For example, \( \lbrace 1, 2, 3, 4 \rbrace \) and \( \lbrace 4, 3, 2, 1 \rbrace \) have different encodings under our scheme, yet really represent the same two combinations of choices.

The second (and much more minor) inefficiency comes from the ability to encode ‘impossible’ combinations like \( \lbrace 4, 4, 4, 4 \rbrace \). Optimally, choices after the first would be encoded using a slightly smaller number of bits, as we have less valid choices to choose from by then.

In this case, we can actually quantify precisely the degree to which this encoding is sub-optimal: being able to represent an ordering means that we have \( 4! = 24 \) times too many encodings for the ‘same’ input, while allowing duplicates means that we have \( 32^4 \div ( 32 \cdot 31 \cdot 30 \cdot 29 ) \approxeq 1.2 \) times too many (i.e. the number of choices we can encode, divided by the number we should be able to encode). Combining the two factors gives us the difference between an optimal encoding and this one.

It’s interesting to think about how we might improve on the above: perhaps we could canonicalise the input by ordering the choices by number, say, then rely on that fact somehow: if we knew that the choices were in decreasing order, for example, it’s clear to see that we could identify the combination \( \lbrace 3, 2, 1, 0 \rbrace \) entirely from the information that the first choice is number 3.

But let’s move on to something that is optimal.

and 16 is less than 20

And so we arrive at the “combinatorial number system”. This system describes a bijection between any combination of (a fixed number of) \(k\) choices and the natural numbers.

Interestingly, this means that this scheme does not depend upon knowing the number of things you’re choosing from: you can convert freely between the two representations just by knowing how many choices you need to make.

First, a brief refresher on binomials. The number of ways we can choose \(k\) things from \(n\) is the binomial coefficient, which can be defined recursively:

\[ \eqalign{ \binom n 0 &= \binom n n = 1 \cr \binom n k &= \binom {n-1} {k-1} + \binom {n-1} k } \]

… or, in a way that’s simpler to compute directly — as long as the numbers involved are small — in terms of factorials:

\[ \binom n k = {n! \over k!(n - k)!} \]

For example, we can compute that there are exactly 35,960 different ways to choose four things from a set of 32:

\[ \binom {32} 4 = {32! \over 4!(32 - 4)!} = 35960 \]

Note that in some cases (for example, the one immediately below), we may also need to define:

\[ \binom n k = 0 \text { when } k \gt n \]

That is, it’s not possible to choose more things from a set than were originally present.

But that’s enough about binomials. The combinatorial number system, then, is then defined as follows:

Given a combination of choices \( \lbrace C_k, C_{k-1}, \dots, C_1 \rbrace \) from \(n\) elements such that \( n \gt C_k \gt C_{k-1} \gt \dots \gt C_1 \ge 0 \), we compute a value \(N\) that encodes these choices as:

\[ N = \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} \]

Somewhat surprisingly, this produces a unique value \( N \) such that:

\[ 0 \le N \lt \binom n k \]

And since the number of possible values of \( N \) is equal to the number of combinations we can make, every \( N \) maps to a valid (and different) combination.

\( N \) will be zero when all the smallest-numbered choices are made (i.e. when \( C_k = k - 1 \) and so \( C_1 = 0 \)), and will reach the maximum value with a combination containing the largest-numbered choices.

We could implement this encoding using something like the following:

uint32_t binom(int n, int k) {
  assert(0 <= k);
  assert(0 <= n);
  if (k > n) return 0;
  if (n == k) return 1;
  if (k == 0) return 1;
  return binom(n-1, k-1) + binom(n-1, k);
}

uint32_t encode(const std::set<int>& choices) {
  std::set<int, std::greater<int>> choices_sorted(
      choices.begin(), choices.end());
  int k = choices.size();

  uint32_t result = 0;
  for (int choice : choices_sorted) {
    result += binom(choice, k--);
  }
  return result;
}

… though in reality, we’d choose a much more efficient way to calculate binomial coefficients, since the current implementation ends up calling binom() a number of times proportional to the resulting \(N\)!

Decoding can operate using a greedy algorithm that first identifies the greatest-used choice number, then successively removes the terms we added previously:

std::set<int> decode(uint32_t N, int k) {
  int choice = k - 1;
  while (binom(choice, k) < N) {
    choice++;
  }

  std::set<int> result;
  for (; choice >= 0; choice--) {
    if (binom(choice, k) <= N) {
      N -= binom(choice, k--);
      result.insert(choice);
    }
  }
  return result;
}

We could also choose to remove the initial loop and just start choice at the greatest possible choice number, if we knew it in advance.

Number systems, how do they work?

As to how all this works, consider the list produced for successive \(N\).

For \(k = 4\), the enumeration of the combinations begins:

\[ \displaylines{ \lbrace 3, 2, 1, 0 \rbrace \cr \lbrace 4, 2, 1, 0 \rbrace \cr \lbrace 4, 3, 1, 0 \rbrace \cr \lbrace 4, 3, 2, 0 \rbrace \cr \lbrace 4, 3, 2, 1 \rbrace \cr \lbrace 5, 2, 1, 0 \rbrace \cr \vdots } \]

As you can see, this list is in order. More specifically: it’s in lexicographic order. This isn’t by coincidence, but is actually a direct result of the way we construct the equation above. Let’s do that.

First, construct an (infinite) list of all possible \(k\)-combinations, with the choices that form each individual combination in descending order, as above. Sort this list in lexicographic order, as if the choices were digits in some number system, again as shown above.

Pick any entry in that sorted list. We’re going to count the number of entries that precede our chosen one.

To do so, for each ‘digit’ (choice) in our chosen entry, count all valid combinations of the subsequence that extends from that digit to the end of the entry, while only making use of the choices with numbers smaller than the currently-chosen one. Sum those counts, and that’s the number of preceding entries.

That sounds much more complex than it really is, but what we’re doing is equivalent to, in the normal decimal number system, saying something like: “the count of numbers less than 567 is equal to the count of numbers 0xx–4xx, plus the count of numbers 50x–55x, plus the count of numbers 560–566”. Except in this case, the ‘digits’ in our numbers are strictly decreasing.

I skipped a step. How do we find the number of combinations for each subsequence? That’s actually easy: if the choice at the start of our subsequence currently has the choice number \(C_i\) and the subsequence is of length \(i\), then the number of lexicographically-smaller combinations of that subsequence, \(N_i\), is the number of assignments we can make of \(C_i\) different choices³ to the \(i\) different positions in the subsequence.

Or alternatively,

\[ N_i = \binom {C_i} i \]

and so:

\[ \eqalign{ N &= \sum_{i=1}^k N_i \cr &= \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} } \]

\(N\), of course, is both the count of entries preceding the chosen entry in our sorted list, and the index that we assign to that entry.

Random

Okay, so perhaps it’s not that common to need to encode a combination as an integer value, but there’s another way to use this to be aware of here: if you pick a random \(N\) and ‘decode’ it, you end up with a uniformly-chosen random combination. That’s something that I have wanted to do on occasion, and it’s not immediately clear how you’d do it efficiently otherwise.

(For completeness, a third usage that Wikipedia mentions is in being able to construct an array indexed by \(N\) in order to associate some information with each possible \(k\)-combination. I can see what they’re saying, but I can’t see many cases where this might actually be something that you need to do.)

Open questions

A simple bit-vector is optimal for representing any number of choices \( [0, n] \) made from \(n\) items. The combinatorial number system is optimal for representing a fixed number of choices \(k\).

What’s an optimal way to represent a bounded number of choices \( [0, k] \)? Or more generally, any arbitrarily-bounded number of choices?

I’m a slow typist. ↩
The other possibility is that the encoding may be wasteful in a straightforward information-theoretic sense, akin to using six bits to represent “a number from 0 to 37” rather than ~5.25. While this is strictly a superset of the over-expressiveness problem mentioned in the main text, it seems useful to differentiate the two, and consider the semantics expressed by the encoded representation we’ve decided upon. ↩
That is, the count of those choice numbers that are smaller than \(C_i\), since choice numbers start from zero. ↩