Nov 19 2021

One of these JPEGs is not like the other

“JPEG”, or the image encoding specification by the “Joint Photographic Experts Group” (JPEG), is a truly universal format at this stage. You really cannot go very far on the internet without seeing a JPEG file. The amount of content encoded in JPEGs must surely be biblical by now. If there is one thing that is going to carry into the future for historians, it will surely be a JPEG decoder.

But all of this assumes that JPEG is just a single “format” (ignoring JPEG2000 here for a moment). Oh boy, would you be wrong if you thought that. You see, multimedia is basically never-ending pain.

Hardware decoding

For almost as long as there has been multimedia compression there have been hardware accelerators for compression formats. These hardware accelerators are the things that allow cheap DVD players, cheap digital TV boxes, and, if you’re lucky, thermally and power efficient HD YouTube playback. However, they often come with drawbacks. Since hardware decoders are harder to design than their software counterparts, they generally come with more bugs.

Hardware JPEG decoders may seem strange at first, since JPEG decoding is already quite fast on modern systems (it was not always). But for a lot of battery-powered applications, fast and low-power JPEG decoding is vital for hitting battery life targets on web browsing workloads. Even most Intel GPUs contain a JPEG decoder.

The hardware decoder I am fighting with today is the subject of a previous blog post called Ludicrously cheap HDMI capture for Linux, in which I found a cheap HDMI <-> Ethernet transmitter and receiver pair on the market that was software decodable and so could be used for HDMI capture on a computer.

My flatmate has looked into the audio format for the receiving units, and we use it to output “holding music” to our amplifier if there is nothing plugged into an HDMI port at a given time.

However, I also wanted to take this a little further and ideally display the current time and playing track. Since we already built a receiver for the video and audio, and then built a transmitter for the audio, surely transmitting JPEGs in the same way can’t be hard, right?

Wrong.

The JFIF innards of a JPEG

So the obvious thing to do here is to encode a JPEG using standard software that can export JPEGs and spit it out in the same framing format. Well, we already tried that in the audio post and it didn’t work. Instead we resorted to replaying a captured JPEG from a real transmitter.

But why didn’t it work? jpg files are jpg files surely?

Wrong!

A quick look at a JPEG from the transmitter and a JPEG made by image/jpeg in Go shows a visible difference:

$ file out.jpg 
out.jpg: JPEG image data, baseline, precision 8, 1920x1080, components 3
$ file working.jpg 
working.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 1920x1080, components 3

Okay. Fine. What is this JFIF thing, and why does its absence seem to stop our ASIC/hardware decoder from displaying the JPEG?

To understand this we need to look into what makes up a JPEG file. JPEGs have a packet-style header that sits on top of the actual DCT data (the actual compressed image bit). There are many types of packets, but a few are critical. Wikipedia has a full list; the critical ones are:

SOI - Start of Image. Parsers need this to know it’s a JPEG
DQT - Define Quantization Table. The tables needed to scale the DCT data back into an image
SOF0 (or SOF2) - Start of Frame. Has data like size and subsampling
DHT - Define Huffman Table. The core image compression data

It is possible to omit the DHT in the case of MJPEG (since it saves space). However, this appears to be uncommon in my experience (I could not find a file that does it) and it breaks non-MJPEG decoders. ffmpeg also offers a way to undo this exact trick with the mjpeg2jpeg bitstream filter.

So what is in our image?

I wrote a small parser/dumper out of github.com/neilpa/go-jfif to see the difference between files:

My file:

-- 5 segments
-- Segment 0 - SOI
-- Segment 1 - DQT
-- Segment 2 - SOF0
-- Segment 3 - DHT
-- Segment 4 - SOS

The hardware encoder JPEG:

-- 9 segments
-- Segment 0 - SOI
-- Segment 1 - APP0
-- Segment 2 - DQT
-- Segment 3 - SOF0
-- Segment 4 - DHT
-- Segment 5 - DHT
-- Segment 6 - DHT
-- Segment 7 - DHT
-- Segment 8 - SOS
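For reference, a dump like the two above does not strictly need a library. The following is a minimal sketch using only the Go standard library; it is not the go-jfif based tool used for this post, and the listSegments name and the small markerNames table are made up for illustration. It assumes every header segment is a 0xFF marker byte, a marker ID, and a 2-byte big-endian length, and it stops at SOS since compressed data follows.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// markerNames only covers the markers discussed in this post.
var markerNames = map[byte]string{
	0xC0: "SOF0", 0xC2: "SOF2", 0xC4: "DHT", 0xD8: "SOI",
	0xDA: "SOS", 0xDB: "DQT", 0xE0: "APP0", 0xE1: "APP1", 0xE2: "APP2",
}

// listSegments walks the JPEG header and returns the marker names in file
// order, stopping at SOS because entropy-coded image data follows it.
func listSegments(data []byte) ([]string, error) {
	if len(data) < 2 || data[0] != 0xFF || data[1] != 0xD8 {
		return nil, fmt.Errorf("missing SOI marker")
	}
	names := []string{"SOI"}
	pos := 2
	for pos+4 <= len(data) {
		if data[pos] != 0xFF {
			return names, fmt.Errorf("expected a marker at offset %d", pos)
		}
		marker := data[pos+1]
		name, ok := markerNames[marker]
		if !ok {
			name = fmt.Sprintf("0x%02X", marker)
		}
		names = append(names, name)
		if marker == 0xDA {
			return names, nil
		}
		// The length field counts its own two bytes but not the marker.
		length := int(binary.BigEndian.Uint16(data[pos+2:]))
		if length < 2 {
			return names, fmt.Errorf("bad segment length at offset %d", pos)
		}
		pos += 2 + length
	}
	return names, fmt.Errorf("no SOS marker found")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: dump <file.jpg>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	names, err := listSegments(data)
	if err != nil {
		fmt.Println("warning:", err)
	}
	fmt.Printf("-- %d segments\n", len(names))
	for i, n := range names {
		fmt.Printf("-- Segment %d - %s\n", i, n)
	}
}
```

This sketch does not handle fill bytes between segments or anything after SOS, so treat it as a quick header inspector rather than a full parser.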

Ok, so mjpeg2jpeg is not needed, as we have a DHT segment. However, it does seem that we are missing an APP0 in our image… Perhaps we need this for the image to render out on the hardware decoder?

The Wikipedia page for the APP0 segment shows it should be quite easy to recreate, as it just has slots for a thumbnail and some basic pixel density data.

With a reasonably simple patch APP0 tags can be written by the golang image/jpeg library.

I hope(?) to submit this to the golang core, since it should not harm any backwards compatibility, and it also helps slightly strange decoders (like my HDMI Ethernet thing) decode the output of image/jpeg.
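If patching the standard library is not an option, a similar effect can be had by post-processing the encoder output. The sketch below is not the patch mentioned above; it is an alternative that splices a minimal JFIF 1.01 APP0 segment (no thumbnail, density 1x1, matching what file reports for working.jpg) in directly after SOI. The addJFIF and jfifAPP0 names are made up for illustration, and I have not verified this exact code against the hardware decoder.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/jpeg"
)

// jfifAPP0 is a minimal JFIF 1.01 APP0 segment with no thumbnail.
var jfifAPP0 = []byte{
	0xFF, 0xE0, // APP0 marker
	0x00, 0x10, // segment length: 16 bytes, including these two
	'J', 'F', 'I', 'F', 0x00, // identifier
	0x01, 0x01, // version 1.01
	0x00,       // density units: 0 means aspect ratio only
	0x00, 0x01, // X density
	0x00, 0x01, // Y density
	0x00, 0x00, // thumbnail width and height
}

// addJFIF returns a copy of jpg with the APP0 segment inserted right after
// SOI. Input that already starts with an APP0 segment is returned as-is.
func addJFIF(jpg []byte) ([]byte, error) {
	if len(jpg) < 4 || jpg[0] != 0xFF || jpg[1] != 0xD8 {
		return nil, fmt.Errorf("not a JPEG: missing SOI marker")
	}
	if jpg[2] == 0xFF && jpg[3] == 0xE0 {
		return jpg, nil
	}
	out := make([]byte, 0, len(jpg)+len(jfifAPP0))
	out = append(out, jpg[:2]...)
	out = append(out, jfifAPP0...)
	out = append(out, jpg[2:]...)
	return out, nil
}

func main() {
	// Encode a small blank image with the stock encoder, then patch it.
	img := image.NewRGBA(image.Rect(0, 0, 16, 16))
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, nil); err != nil {
		panic(err)
	}
	patched, err := addJFIF(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes in, %d bytes out\n", buf.Len(), len(patched))
}
```

Doing this outside the encoder keeps the standard library untouched, at the cost of one extra copy of each frame.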

Ok, but what else can go wrong?

While messing with the JPEG export options of GIMP, I found that the hardware decoder also does not accept extra metadata segments (EXIF, which lives in APP1, and friends such as APP2), and does not deal with 4:4:4 sampling.
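To check which chroma subsampling a given file uses, the per-component sampling factors can be read straight out of the SOF segment. The following is a rough sketch using only the standard library; samplingFactors is a made-up helper name, and it only looks at SOF0 and SOF2. A luma factor of 2x2 with 1x1 chroma corresponds to 4:2:0, while 1x1 for every component corresponds to 4:4:4.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"image"
	"image/jpeg"
)

// samplingFactors returns the "HxV" sampling factor of each component listed
// in the first SOF0 or SOF2 segment of a JPEG.
func samplingFactors(data []byte) ([]string, error) {
	if len(data) < 2 || data[0] != 0xFF || data[1] != 0xD8 {
		return nil, fmt.Errorf("missing SOI marker")
	}
	pos := 2
	for pos+4 <= len(data) && data[pos] == 0xFF {
		marker := data[pos+1]
		if marker == 0xDA {
			break // SOS: compressed data follows, stop looking
		}
		length := int(binary.BigEndian.Uint16(data[pos+2:]))
		end := pos + 2 + length
		if length < 2 || end > len(data) {
			return nil, fmt.Errorf("bad segment length at offset %d", pos)
		}
		if marker == 0xC0 || marker == 0xC2 {
			// Payload: precision (1), height (2), width (2), component
			// count (1), then 3 bytes per component: id, H/V nibbles, table.
			seg := data[pos+4 : end]
			if len(seg) < 6 || len(seg) < 6+3*int(seg[5]) {
				return nil, fmt.Errorf("truncated SOF segment")
			}
			n := int(seg[5])
			out := make([]string, n)
			for i := 0; i < n; i++ {
				hv := seg[6+3*i+1]
				out[i] = fmt.Sprintf("%dx%d", hv>>4, hv&0x0F)
			}
			return out, nil
		}
		pos = end
	}
	return nil, fmt.Errorf("no SOF0/SOF2 segment found")
}

func main() {
	// Inspect what the stock Go encoder produces for a small blank image.
	img := image.NewRGBA(image.Rect(0, 0, 16, 16))
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, nil); err != nil {
		panic(err)
	}
	factors, err := samplingFactors(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println("sampling factors (Y, Cb, Cr):", factors)
}
```

Running this over a file that the hardware refuses and one that it accepts is a quick way to see whether subsampling is the difference.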

It’s a slightly silly situation where you can have the following table of JPEGs all look the same and be called a JPEG but have wildly different decoding compatibility with some bits of kit. For example, all of these JPEGs are slightly different.

If you see a white box, it’s because you don’t support the bizarre Arithmetic coding extension.

Amusingly, even while editing this post, I found that whatever renders thumbnails in Nemo does not like one of those JPEGs:

Winding back the stack, how did I get here?

Right. The reason I’m here is to add the ability for our existing audio playback solution to also show the now playing Artist/Title.

With some small fiddling around with the newly patched JPEG encoder, we now have the Date, Now Playing, and a tribute to the famous generic DVD player screensaver.

Regardless of any of this madness, I think I got off lucky: I don’t have to look into hardware H.264 decoders any deeper! At a previous job I discovered that there is actually no consistent way to decode an H.264 video stream. Don’t ever expect two decoders (especially hardware assisted) to give you identical outputs.

Madness. It turns out lossy compression is lossy! ;)

Until next time!