Stay in your lane, the shifty MIPI bug

Posted on | ~6mins
camera firmware bugs war_stories debugging

WARNING: DRAFT POST

tl;dr While tossing together my last post about ThreadX and debugging the video pipeline, I was reminded of other exciting debugging adventures from our camera work; this is the next in that series.

Disclaimers This bug occurred roughly a decade ago; my notes from debugging it date to October 2010. I no longer have access to the hardware, the source code, the documentation, or my full write-up from the time. I have reconstructed most of this from a couple of note files and some Skype chat logs that somehow stuck around.

The bug report

Scenario: this was one of the first customers using a MIPI1 sensor with COACH at all, and certainly the first to try using the multi-lane MIPI mode. Prior to that, most of our customers were using a much simpler but slower parallel interface2 plus a SPI/I2C/… connection for configuration.

The customer reports that the camera works fine in video mode, but when you try to take a still capture, the image looks foo-bared.

Visually, the images appear to have vertical stripes and random noise. Here’s a generated example:

expected
actual

Quick background on MIPI

MIPI can have 1-4 data lanes; each data lane is a pair of wires carrying a differential signal. COACH supported both 1-lane and 2-lane modes, and both 8-bit and 10-bit readout modes.

(I am not in possession of a MIPI spec, so I am speculating about exactly how these readout modes work, but it should be something like what I describe. If you know better, ping me and I can update the post.)

Let's start with the simplest case: 8-bit data on one lane. To transmit this, the packetizing looks something like (where p# is pixel number #, or later its most significant byte for 10-bit):

Lane1: [  p1  |  p2  |  p3  |  p4  |  p5  | ... |  p$n  ]

If we stick to 8-bit pixels but spread them across two lanes it might look something like:

Lane1: [  p1  |  p3  |  p5 |  p7  |  p9  ]
Lane2: [  p2  |  p4  |  p6 |  p8  |  p10 ]
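
That byte-level round-robin across lanes can be sketched in a few lines of python (a sketch of my understanding, not taken from the MIPI spec; the names are mine):

```python
def split_lanes(stream, num_lanes):
    """Round-robin a byte stream across lanes: byte i goes to lane i % num_lanes."""
    return [stream[i::num_lanes] for i in range(num_lanes)]

# Pixels p1..p10 as plain byte values.
lane1, lane2 = split_lanes(list(range(1, 11)), 2)
# lane1 carries the odd-numbered pixels, lane2 the even-numbered ones.
```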

Now if we want to ship 10-bit raw, unless we were willing to waste an extra 6 bits per pixel, we need to pack the pixels into byte-aligned packets. What MIPI does is snip the least significant two bits from each pixel and batch 4 of those into a byte, so we send 5 bytes per 4 pixels. This might look something like:

     :                             |LSBs  for|
Lane1: [  p1  |  p2  |  p3  |  p4  | 1,2,3,4 |  p5  | ... | p$n  ]
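
In python, the 4-pixels-into-5-bytes packing might look like this (again a sketch of my understanding; in particular, the bit order within the LSB byte is a guess):

```python
def pack_raw10(pixels):
    """Pack 10-bit pixels into bytes: the 8 MSBs of four pixels,
    then one byte holding their four 2-bit LSB remainders."""
    assert len(pixels) % 4 == 0
    out = []
    for i in range(0, len(pixels), 4):
        group = pixels[i:i + 4]
        out.extend(p >> 2 for p in group)   # MSB byte of each pixel
        lsbs = 0
        for j, p in enumerate(group):
            lsbs |= (p & 0b11) << (2 * j)   # 2 LSBs of pixel j
        out.append(lsbs)
    return out

# Four 10-bit pixels become five bytes.
packed = pack_raw10([0x3FF, 0x000, 0x155, 0x2AA])
```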

And finally, if we want to transmit that 10-bit raw across 2 lanes:

Lane1: [  p1  |  p3  |1,2,3,4|  p6  |  p8   ]
Lane2: [  p2  |  p4  |   p5  |  p7  |5,6,7,8]

This final case is what was in use with this particular bug.
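
Chaining the two steps together reproduces that last diagram: pack first, then round-robin the packed bytes across the lanes (my reconstruction, not the spec):

```python
def mipi_2lane_raw10(pixels):
    """RAW10-pack a pixel stream, then split the packed bytes across two lanes."""
    packed = []
    for i in range(0, len(pixels), 4):
        group = pixels[i:i + 4]
        packed += [p >> 2 for p in group]                                 # MSB bytes
        packed.append(sum((p & 3) << (2 * j) for j, p in enumerate(group)))  # LSB byte
    return packed[0::2], packed[1::2]

# With 8 pixels the lanes come out as in the diagram above:
# lane1 = [p1, p3, LSBs(1-4), p6, p8], lane2 = [p2, p4, p5, p7, LSBs(5-8)]
```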

Interestingly, 8-bit across two lanes would have hit the issue as well; however, that mode was not used. Perhaps it was not supported by COACH or the sensor. It would've been easier to debug with, though, so I suspect we tried it.

Hardware

                                   +---------+
+------------+                     |         +---------> DMA
|            +---------------------+   IPP   |
|  sensor    |   MIPI Data         |         |
|            +---------------------+         |
|            |                     |         |
|            +--- MIPI Control ----+         |
+------------+                     +---------+

In the IPP hardware block, where those MIPI lines connect, there would be: a PHY (physical layer) block that handles decoding the differential signal and shifting out bits; a block, maybe the same HW unit, that takes those bits and packages them into bytes; and a block that takes the split 10-bit data as a stream of bytes and restitches the split-off LSBs with the MSBs.

The IPP would do a lot of other work as well, including Lens Shading Correction (LSC), Gamma correction, Dead Pixel Correction (DPC), and many more.

Debugging

The first step was narrowing down what was different between video and still capture mode. This wasn't terribly hard to track down: video mode used 8-bit raw (plus some cropping, sub-sampling, etc.), while still capture used 10-bit raw on a larger frame, and thus had higher bandwidth requirements and so used a 2-lane readout.

Things we discounted fairly quickly:

  • Showed that it was not a sensor issue – the customer had wired the sensor for both parallel and MIPI, so we could check whether the parallel path worked. Also, that video mode was working was a strong signal it wasn't something in the sensor module.
  • Ruled out a signal issue on the MIPI lines (we had to obtain a scope capable of handling the high-data-rate differential signal; pain in the arse).
  • Verified that it was an issue with the input to the IPP – the IPP has other outputs besides the image, including a histogram and a decimated image; these matched the corrupted image output and not the expected image, so we knew the IPP was already processing wrong data.

Thankfully the sensor we were working with, the OV5653, had a test pattern3 mode which output 8 vertical color bars. Wherever possible, removing non-determinism makes things easier to deal with.

With that, and with as much of the IPP image processing disabled as possible (LSC, Gamma, DPC, …) or at least consistently configured, we were in a pretty deterministic state. By hashing a relatively small window in the images, we were able to conclude we were only getting about 32 distinct results.
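
The hashing-and-counting step might look something like this sketch (the function name and window size are mine, not from the original code):

```python
import hashlib

def count_distinct_windows(frames, x=0, y=0, w=64, h=64):
    """Hash a small fixed window of each captured frame and count
    how many distinct results show up across repeated captures."""
    seen = set()
    for frame in frames:  # frame: list of rows of byte values
        crop = bytes(b for row in frame[y:y + h] for b in row[x:x + w])
        seen.add(hashlib.sha256(crop).digest())
    return len(seen)
```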

In parallel to that, we were thinking about the vertical bars we were seeing and what might cause them. The first thing we likely checked was whether the output was shifted by exactly a byte. It was not. Or at least not always.

I remember creating a script (well, C code running on device, but for this post I have recreated the gist of it in python4) that did the following: load a file of a good readout from the test pattern, "reverse" what the IPP, DSI, and PHY did, and generate a pair of "lane" buffers. You can do the same for a bad example and, with not too much code, search a window of 0-32 bit shifts between the lanes to find the one with the minimum bit difference from the original.
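
A minimal sketch of that shift search (my python reconstruction of the gist; the original was C on device, and all names here are mine):

```python
def bit_diff(a, b):
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def best_lane_shift(good_lane, bad_lane, max_shift=32):
    """Try bit shifts 0..max_shift of the bad lane buffer and return
    (shift, diff) for the shift with the fewest bits differing
    from the good lane."""
    n = len(bad_lane) * 8
    bad_bits = int.from_bytes(bad_lane, "big")
    best = None
    for shift in range(max_shift + 1):
        shifted = ((bad_bits << shift) & ((1 << n) - 1)).to_bytes(len(bad_lane), "big")
        d = bit_diff(good_lane, shifted)
        if best is None or d < best[1]:
            best = (shift, d)
    return best
```

Run against a bad capture's lane buffer, a perfect recovery shows up as a shift with zero differing bits; on real data you would instead look for a clear minimum.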

We then also built a script that toggled the sensor, did a single readout of a test pattern, ran it through that algorithm, and built a histogram of which shifts were occurring.

Conclusion

I never really got a satisfactory answer from our HW team as to what the root cause was. Something about not resetting the input shift registers correctly, but I do know that we had to rev the chip; at the least it was a metal fix to get this resolved.

Once I had recognized that it was a shift between lanes, it took a boatload more work to prove to the hardware teams that this was a real issue and not just some configuration problem.


  1. https://en.wikipedia.org/wiki/Camera_Serial_Interface ↩︎

  2. https://www.ftdichip.com/Support/Documents/TechnicalNotes/TN_158_What_Is_The_Camera_Parallel_Interface.pdf ↩︎

  3. https://cdn.sparkfun.com/datasheets/Sensors/LightImaging/OV5640_datasheet.pdf (see: 4.3 test pattern) ↩︎

  4. https://colab.research.google.com/drive/1S1117WWMrzV5jfJbpz9O9xnl5anH1o3C?usp=sharing ↩︎