The RPK Image Format

Slim, Streamlined, Speedy

From QOIG to RPK

A Step in the Other Direction

The origin of this project was in the implementation of the QOIG image format as an extension to the Quite OK Image Format. After getting it 75% done, I tweeted the above document to QOI's creator, Dominic Szablewski. His response:

@phoboslab replying to @quintopia: Good read! It’s one of the nicer QOI “extensions” I’ve seen. Though I still think about how QOI could have been simplified even more, instead of extending it :)

He wasn't the only one thinking about that:

@quintopia replying to @phoboslab: Thanks! Our goals were obviously very different, but I think about that too. If I had gone fully down that rabbit hole, I would have made something that looked nothing like QOI.

I singlemindedly focused on improving compression with QOIG. I could outperform QOI 100% of time on compression ratio, and even outperform PNG on some types of images, but there was a cost.

  1. Turning on all the extra compression enhancements increased the time it took to encode by 55% and increased decoding time by 70%.
  2. The extensions required several kilobytes of memory overhead: more than 3 kb for encoding and more than 2kb for decoding.
  3. The added complexity from all the extensions made the codec a nightmare to debug. There were so many little details to get right in order that compression should be maximized to the limits of the format.

Obviously, I was acutely aware of these things, and always in the back of my head, I was wondering what I could do if I had set out to design my own image compression algorithm from scratch rather than trying to make a format based on QOI. In short, what would I do using what I learned in implementing QOIG in order to tackle the same goals Mr. Szablewski was facing down: speed and simplicity in the face of limited resources. In theory, if I had the right idea, it would be really easy to do.

Goals of RPK

I was right. My new idea took a fraction of the time to get working that QOIG did. Part of that was that I could use the same basic framework from my QOIG codec, just tear out the QOIG algorithm and install a new one. Furthermore, since I didn't really care about the header, I had no reason to mess with that. I left that part of the format largely unchanged. But the bigger part of it was that I was intentionally making something so much simpler, which meant less code to write and decreased potential for bugs. We're talking a few days to implement instead of a few weeks. My goal was to beat QOI in terms of speed and simplicity without any particular bloat in the amount of resources needed. While I was much less concerned with compression ratio, I wanted to employ the strategies that seemed to work best for QOIG if I could do so without any added complications. Here's how I planned to do it:

  1. Simplicity: Fewer op codes that require fewer special cases.
  2. Speed: Fewer tests to determine which op codes to issue and less computation required to construct them or to compute colors from them.
  3. Resource streamlining: Limit the amount of memory required for both encoding and decoding.

Format Outline

Ideas That Work

I do not intend this document to be a complete specification of the RPK format. My reference implementation, linked below, is the specification, and a complete description of the encoding scheme can be found in the file rpk.h. Instead, this is merely intended to give some of the rationale for the choices I made. Everything in the format stems from my experience on what worked well in QOI and QOIG and my thoughts on how to simplify those techniques.

Idea #1: Bigger, better color caching

I saw a lot of great improvement in compression ratio in QOIG when I added the larger secondary color caches. I also found that, while there's no beating a hash cache for speed, the performance of a cache for the purpose of compression can vary greatly depending on which hash function is used. While in QOIG, there was nothing I could do about the main cache hash function since I had to match QOI's for compatibility, I was able to tinker with hash functions for the secondary cache and get some minor improvements in terms of compression. However, the secondary cache option was out of the question here as it not only added complexity to the algorithm and took more time, it also meant using the extra resources to store that larger cache. I struck a compromise with RPK's cache scheme:

  1. The main cache is twice the size of QOI's main cache: 128 colors, requiring half a kilobyte of memory.
  2. The hash function, a variant of FNV-1a, hand-optimized for hashing to a single byte, has better dispersion than QOI's. More of the cache gets used sooner and colors live longer in it.

Idea #2: Computing Colors From The Last Color

In most images, it is common that color changes are gradual. Newly seen colors often differ only somewhat from the color adjacent. Three of the ops QOI uses (OP, DIFF, and LUMA) are founded on that assumption and they work out well for it in terms of compression. One of those, RUN, I have essentially kept intact, but more on that in the next section. The other two, LUMA and DIFF, work pretty well as implemented for QOI in terms of compression, so I wanted to include something similar. However, I needed to adapt them for speed. What seems to slow them down on the compression end is testing whether the current and last color are similar enough that they apply. Because they use wrapping component subtraction to determine whether a color component is in the correct range, each component must be compared separately, and two comparisons are required per component (except with alpha, which those ops cannot modify). I wanted to be able to compare colors all at once, one test per op rather than six. The method that occured to me sacrifices a bit of compression power: I simply consider the last two colors as 32-bit integers and XOR them. Then, by masking out the bits I wanted to modify and seeing if anything was left, I know whether an op applies.

I implemented two versions: one that applies when only the last two bits in each component differ, storing all four differences in a single byte, and one that compares the least significant five or six bits of only the RGB components, storing all three differences in two bytes. One might think that this means I need two bytes per new color when only two bits change per channel, or three bytes per new color when only four bits change per channel, as there is no room left in these bytes to store the op code. However, it is actually less than that on average, as will soon be seen.

Idea #3: Long runs

Run-length encoding is probably the second oldest form of lossless data compression and has been used for compressing visual images for over half a century, so it was an obvious pick for inclusion into QOI. However, it wasn't powerful enough. When encoding runs of identical pixels, nothing needs to be buffered. You just tick up a counter. And you would like for that counter to be able to go as high as the length of runs you are likely to see in practice. This is why I added long runs to QOIG and got a great deal of mileage out of them. The secret sauce is being able to encode the length of the run in as few bytes as possible. While there are plenty of encoding schemes that are set up to encode with as few bits as possible (e.g. Golomb-Rice or Elias delta), going down to the granularity of bits loses you the speed and simplicity of byte alignment. With QOIG, I used an entire additional byte to encode 128 additional run lengths, and an entire second additional byte to encode a further 2^15 additional run lengths. I did something similar with RPK: The op code byte can store runs of length up to 15, an optional additional byte, in combination with the last 5 bits of the op code byte, can encode an additional 2^11 lengths, and an optional third byte, in combination with all that precedes, can encode an additional 2^19 lengths. This is more than enough to encode, all at once, even the longest runs in my test images.

Idea #4: Shaving Opcodes From Blocks of Colors and Color Diffs

One of the most profitable schemes, in terms of compression ratio, I incorporated into QOIG was raw blocks, where a single op code could flag an entire block of uncompressed color data. It resulted in such immense savings that it almost entirely on its own inspired the RPK format. After all, what could be faster than just copying color data straight from the file without having to stop to parse any intervening bytes. If nothing can be done to compress any colors in that block, that should be the default action. As such, that mechanic was lifted almost exactly as it was from QOIG. However, blocks of colors have to be buffered by the encoder so that the length of the block can be output before the color data itself, so in keeping with the resource diet RPK is on, the maximum length of such blocks is limited to 32 colors, which only requires storing an additional 128 bytes of data at most.

And with the opcode byte set up to be able to store run lengths already, there's no reason not use that to encode runs of the 1-byte and 2-byte xor diffs as described above. These diffs can also be buffered up to 32 at a time, though in practice there are not so many in a row. However, we save a byte every time there is.

The one little complication that needs to be taken care of in the encoder is ensuring that an ongoing run of 1-byte diffs is not interrupted to issue an INDEX op. On the off-chance that this INDEX is immediately followed by another run of 1-byte diffs, we'll have wasted a byte on the opcode starting the new diff run.

Results

Compression Comparison

Just as my QOIG analysis was primarily concerned with compression ratio, as that was my main goal, this analysis is primarily concerned with speed. However, you probably are interested in how much compression power I lose to achieve that speed, so we'll start with those results in comparison with QOI:


	Image   QOI file size    RPK file size  Ratio (RPK/QOI)
	-------------------------------------------------------
	alphanoise      34630            30553            88.2%
	bricks        2122775          3028219             143%
	diamondtrust   231486           219769            94.9%
	mines            2523             2312            91.6%
	noise          817592           595877            72.9%
	panel          300948           378977             126%
	perlin         261999           198687            75.8%
	plant         1820118          2641952             145%
	rock          2283625          3138075             137%
	thinwall        60697            58384            96.2%
	unknown        240624           259357             108%
	ventwall        84118            91470             109%
	ventwall2      120759           117824            98.5%
	wspeppers      398999           198334            49.7%
                                             

As you can see, many of the file sizes are on par with QOI, but a few are much larger. There is a lot of room here for improvement. I can imagine variations on this format which might do so at a relatively small cost in speed while reducing the memory footprint yet further. Subject of future research perhaps.

Encoding Speed Comparison

For this comparison, I encoded all 14 test PNG images 100 times each and summed the time spent doing so using statistical sampling with gprof. These times indicate time spent within the encoding function alone, not e.g. time spent decoding the input PNG. Neither does it include system time (e.g. time spent writing out the encoded file). The QOIG files were encoded using the settings that achieved best compression for each image. QOI was encoded using my QOIG encoder with the QOI settings. All codecs were compiled using gcc -O3 -pg. This level of optimization removes via loop unswitching the extra checks that my QOIG extensions do during encoding, so it is a fair representation of QOI's performance.


                                     QOI    QOIG     RPK
                                   ---------------------
	Total time (s)             13.06   20.27    7.92
	Average time per call (ms)  9.33   14.48    5.66
	Relative runtime (%)       48.25   59.06   36.18
	                                         

As you can see, RPK encodes in roughly half the time QOI does. In fact, it is the only one of the three for which more time was spent decoding the PNG than encoding the new file.

Decoding Speed Comparison

As before, this comparison does not include system or time spent encoding and deflating the output PNG file. It's worth keeping in mind that the input files in this comparison are not the same. They are the files that resulted from the encodings used in the previous comparison. The total size of the rpk files is about 30% larger than the total size of the qoi files. The qog files are about 10% smaller in total than the qoi files. This means that in order to tie with QOI in decoding, RPK has to process 30% more data in the same amount of time, and therefore needs to process it at least 30% faster.


                                     QOI    QOIG     RPK
                                   ---------------------
	Total time (s)             18.81   31.48   17.27
	Average time per call (ms) 13.44   22.49   12.34
	Relative runtime (%)        3.74    5.99    3.52
											 

Based on this, you can see RPK decodes slightly faster than QOI does despite the larger file sizes. However, while profiling decoding, I saw a lot of variation in comparative decoding time between different files, so I also did a few single image decodes (100 decodes summed):


        Image   QOI(ms) RPK(ms)
	-----------------------
	rock         48    38.9
	plant        55    38.9
	ventwall    4.0     2.3
	panel       4.0     4.4
											

I'd guess an average case is that RPK decodes around 10%-15% faster most of the time.

Reference Implementation

The reference implementation can be found at https://github.com/quintopia/RPK, but the format should not be considered final. It is an experiment, after all, and I might come up with additional tweaks in the future. However, it is still a good starting point for experiments along similar lines. However, the basic structure and interface of the application should remain consistent.

This implementation builds as a streaming converter between PNG and RPK. "Streaming" means that it does not store the input or output files in RAM, which means it can handle very large files. I don't know exactly how large, but the main functionality should continue to work even as the inputs grow to many gigabytes in size. The PNG encoding and decoding is handled by libspng, a fast streaming PNG codec (compresses about 30% faster than libpng). If compiled as rpkconv, it can be called like this:


	rpkconv image.png image.rpk
										

This re-encodes the PNG image as an RPK file.


	rpkconv image.rpk image.png
	                                    

This will decode the RPK file back into a PNG.

Note that the filename endings must exactly match one of the above formats. One must be .png and the other must be .rpk.

Possible Future Changes to the Format and the Codec

  1. Find a more powerful use case for run type 1's op code
  2. Enable longer runs for run types 2 and 3 without buffering using fp seeking
  3. Animated RPK format (whenever libspng adds APNG support)