Saving space with numbers

3/13/2011

A while ago I was presented with a classic “we need to reduce our bandwidth usage” problem. Of course there are a number of techniques to reduce bandwidth, but in order to use the most appropriate tool I needed to understand my problem domain.

In this particular application, the network messages consisted of a large set of Longs and Integers. The problem was made harder by the requirement of a low latency server response. So how do you quickly pack these sorts of data types to take less space?

Examining the data I found out that 90% of the Longs would fit in an integer and some of them in a short. Most systems would represent Longs, Integers and short in 8, 4 and 2 bytes respectively. Surely a long that fits in a short doesn’t require 8 bytes. But how would the receiver determine how many bytes that number is using? Enter Variable Length Quantity, VLQ in short.

VLQ has been around for a while. Wikipedia says WAP used it and google’s protocol buffers use a similar format. In short VLQ, as its name implies, makes the data type’s size variable. Instead of having a number of datatypes (long, int, etc..), each with a fixed byte size, we could have a marker in our byte stream to show where the data unit starts and finishes. The following diagram illustrates the concept. The diagram shows an example where three units of data encoded with a marker signalling where each unit ends and another starts.

A popular scheme is to have the marker represented by the last bit of each byte. In this scheme, this bit is reserved as a marker as it is not used as part of the value. The most significant bit can be set to 1 if more bytes are to come, 0 otherwise. This can be illustrated with an example. Consider the integer 11733. In a most significant first system, the receiver would see a byte stream appear in the following order (left to right)
00000000 00000000 00101101 11010101
Using the most significant bit as a marker, the receiver would get:
11011011 01010101
When the receiver/reader sees the first byte, it knows that more bytes are available since the MSb is set to 1. The receiver/reader stops once it notices a 0 in the first bit. At that point it can remove the markers and compact the received bytes to construct the original data type.
At the sender/.writer side, the leading zeros are discarded, and the stream is broken down in 7bit segments. All bytes are then marked with a 1 in their MSb except the last one.

The ACME guide on how to build your own VLQ reader and writer.

Let’s start with the reader. You’ll need a way to read and remove the marker from the stream and a way to compact the bytes together. The first part is the easiest.

There are probably many ways to do this, but a quick way to test your marker is to ‘AND 128’ your byte and then checking if the result is zero or not:
11011011 AND 10000000 = 10000000
Having a result of 0 means you’ve reached the end. All of this can be written as:

boolean end = ((bytesRead[i] & 128) == 0);

Removing the markers is easier as you can just OR the rest:
11011011 OR 011111111 = 01011011
Translating to code this gets:

bytesRead[i] &= 127;

Rebuilding the original stream involves some thinking. The solution I came up with is to left shift the bytes into place. You start from the last byte read and OR it to a 0 place holder (int, long, etc). Then you left shift the next to last by 7 bits and OR it to the place holder. Do the same with the next byte but left shift it by 14 and so on until you have no bytes left. This is the algorithm I came up with (where x is the result integer):

int j = 0; int x = 0;
for (int i = values.length-1; i >= 0; i--){
x |= values[i] << ( 7*j );
j++;
}

Writing a sender/writer turns out to be simpler than the reader. The sender simply makes space and applies the marker bit. To make space, we need to segment our stream in 7 bit parts. Again the shift operators come in handy. By continually masking the first 7 bits and right shifting them away, we can build an array of bytes to write/send (i is the integer to write):

int j = 0;
while (i != 0){
bytesToWrite[j] |= i & 127;
i >>>= 7;
j++;
}

Notice how it’s using the unsigned right shift. This is to avoid getting stuck in an infinite loop if the number is negative.
To write the bytes (or send over network), we’ll have to start from the back since the receiver is expecting the most significant byte first. When writing we then simply mark all but the last bytes with a 1 in their top bit i.e.:

bytes[i] |= 128;

Download

There you go. You can now build your own VLQ reader/writer. Oh hold on... you can also just use this open source one:
https://sourceforge.net/projects/vlqstream/

0 Comments

*not just a java Blog

Saving space with numbers

Leave a Reply.

Author

Archives