WARNING: DRAFT POST
(alternative title: How to procrastinate on your 2020 self review by writing one form 2010)
Catching up on cppcast last week, I learned that MSFT acquired ThreadX and it is now Azure-RTOS1. This brought up some old memories.
Once upon a time, in a universe far far away… We built a tool for logging all the context switches occurring on a system based on ThreadX. A while back, I found a patch file containing most of the code necessary to page how we did this back into my brain.
Since at the time it was one of my more favorite hacks, seems worthy of a write up. Per that patch file, this work was done sometime around or before March 2011.
Disclaimers: The original of this code is probably still copyrighted, though the company Zoran has gone through several acquisitions and AFICT no one is selling anything based COACH any longer. A lot of the coding style from this makes me cringe now, I’ve corrected some of the worst parts (Hungarian notation anyone) while extracting from the patch file for this post. I tried to leave any inline comments intact since they were written when this was fresh in heads. It doesn’t appear that all of ThreadX is open source in their repo, however, in researching this post I stumbled on this book which appears to describe most of the missing pieces (e.g. low level interrupts in ThreadX/MIPS, scheduling, etc.), and Google book search on it allowed me to verify most of what I needed. Finally, back when this was originally written, most of the brains of the operation was Yuriy Romanenko, I just glued the finishing sparkle on. :)
System
For this to make any sense you need a little understanding of the assumptions under which it was running. Let us start with a high level description of the system.
The COACH (“Camera On A Chip”) systems that this code ran on had two roughly 240MHz 32bit MIPS CPUs and a boat load of peripherals hardware accelerators for image processing pipelines. The primary CPU was reserved for Zoran’s work and the second one (referred to as the “COP” for CO-Processor) was reserved for our customers use.
Everything on the system ran in the same address space, and everything ran in what a normal human would consider kernel mode. There was no virtual memory or TLB. The scheduler was configured to be preemptable, with round robin scheduling within a priority. A common idiom throughout the code base, was to implement atomic operations by simply disabling interrupts.
Why a context switch logger?
So why bother, why would we want this context switch logger? In a sentence: debugging dropped frames in our video recording pipeline.
The video recording pipeline was designed such that each stage had a thread. These threads would block on a queue2 for work to do (generally waiting for a frame to process), wake up and in most cases configure a piece of hardware to run, kick the hardware and go back to sleep. Some of the hardware had real-time constraints (sensor readout needs buffers, display needs frames, …). Some stages in the pipeline need to run close to their HW’s theoretical max, and need to be configured with very little down time. Some hardware blocks take long CPU time to configure, others less. The entire pipeline in some cases, e.g. 1080p60fps, was close to the total memory bandwidth3 available. For all this, thusly some of the threads in the pipeline were running at different priorities.
Debugging problems with text based logging or our other existing tooling was near impossible. We needed a better tool.
WHAT
With that in mind, what data did we need to debug problems? Here’s the struct we came up with; 16 bytes of data per entry.
struct ContextSwitchEntry {
uint32_t type;
uint32_t counter;
uintptr_t address;
const char *thread_name;
};
// Some "types" of events logged by the context switch logger.
#define TX_DEBUG_WAKE 0x00
#define TX_DEBUG_SLEEP 0x01
#define TX_DEBUG_SYNC_RESUME 0x02
#define TX_DEBUG_ASYNC_RESUME 0x03
#define TX_DEBUG_ISR_RETURN 0x04
#define TX_DEBUG_ISR_ENTER 0x05
#define TX_DEBUG_ISR_EXIT 0x06
#define TX_DEBUG_INT_DISABLE 0x07
#define TX_DEBUG_INT_RESTORE 0x08
#define TX_DEBUG_PREVIOUSLY_DISABLED 0x20
Well, we obviously will need something to indicate where we are in code, what thread we’re running on, what we’re doing, and when we’re doing it.
For the “type” field we defined 8 possible places a thread could be suspended / resumed plus interrupts disabling / enabling. We do not really need a full 32bits for the type flag, but due to alignment and because we will need to access that field sometimes from MIPS asm, using the full uint32 seemed reasonable.
Name | Desc |
---|---|
TX_DEBUG_ISR_ENTER | Entering a high level ISR handler |
TX_DEBUG_ISR_EXIT | Exiting a high level ISR handler |
TX_DEBUG_WAKE | CPU returning from idle sleep (scheduler) |
TX_DEBUG_SLEEP | CPU heading to idle sleep (scheduler) |
TX_DEBUG_SYNC_RESUME | Resuming thread (scheduler) |
TX_DEBUG_ASYNC_RESUME | Resuming thread (scheduler) |
TX_DEBUG_ISR_RETURN | Return to thread after handling interrupt? |
TX_DEBUG_INT_DISABLE | TX_DISABLE, disabling interrupts |
TX_DEBUG_INT_RESTORE | TX_RESTORE, re-enabling interrupts |
To handle the when something occurred, we will store the count register from the MIPS4 coprocessor0
(not to be confused with the COP, second MIPS). Since we’re counting cycles, have 32bits of them, and operating at about 240MHz we roll over roughly every 17.9 seconds (someone check my math?). Our gaps between log entries are on the orders of microseconds so we can safely assume that we will always catch the counter wrapping and handle that in post processing the logs. Another benefit of this timebase is that it is what we used for our text based logger so it would be possible to align with them.
Where were we in the code? Well, for simplicity we simply record a single address representing where we are, would be nice to record a full stack trace, but the log entries needed to be extremely cheap and small (and implementable in asm) and unwinding the stack was out of scope. Mostly the address we record is the return address from our currently instrumented function e.g. __builtin_return_address(0)
or in MIPS asm $ra
register. A few places we do something different to be more meaningful, for ISRs for instance we label the thread as “ISR” and the address is the handler’s address.
What context is the code running in? For this we logged the thread name. Thankfully, in almost all cases the on our system threads were created with static strings for their names; so the data for their names would certainly outlive the thread. This allowed us to simply record a pointer to the thread context’s name which. When saving the recordings we validate that the string is still valid-ish (is printable characters), and that’s all the sanity checking we did. Kinda gross but effective.
CONTROL
The first implementation of this operated in a single one-shot buffer that once full stopped logging. Later we added a cyclic mode that could either be polled from a thread or dumped when a problem occurred (e.g. from the crash handling code!). Alright, so what do we need to control this, we need some pointer to a buffer to store ContextSwitchEntry
structs, a current pointer to tell us where into that buffer we are and a size or end marker to tell us when to stop (or in cyclic mode return to the beginning):
ContextSwitchEntry *profile_switch_log_start = (ContextSwitchEntry *)-1;
ContextSwitchEntry *profile_switch_log_current = (ContextSwitchEntry *)-1;
ContextSwitchEntry *profile_switch_log_end = (ContextSwitchEntry *)-1;
uint32_t profile_switch_log_cyclic = 0;
Then to start the logger we might do something like:
void profile_switch_logging_start(/*take entries and cyclic param*/) {
#if (FEATURE_DEBUG_TX() == FEATURE_EN)
const bool cyclic = false && (FEATURE_DEBUG_TX_ALLOWCYCLIC() == FEATURE_EN);
const int entries = 100000;
ContextSwitchEntry *log_buf =
(ContextSwitchEntry *)malloc(sizeof(ContextSwitchEntry) * entries);
TX_INTERRUPT_SAVE_AREA
TX_DISABLE
profile_switch_log_cyclic = cyclic;
profile_switch_log_start = log_buf;
profile_switch_log_current = log_buf;
profile_switch_log_end = profile_switch_log_current + entries;
TX_RESTORE
#endif
}
And I suppose we want to stop as well:
void profile_switch_logging_stop(ContextSwitchEntry **start,
ContextSwitchEntry **current,
ContextSwitchEntry **end, uint32_t *cyclic) {
#if (FEATURE_DEBUG_TX() == FEATURE_EN)
TX_INTERRUPT_SAVE_AREA
TX_DISABLE
// Save current location
*start = profile_switch_log_start;
*current = profile_switch_log_current;
*end = profile_switch_log_end;
*cyclic = profile_switch_log_cyclic;
// Set to dummy to disable
profile_switch_log_start = (ContextSwitchEntry *)-1;
profile_switch_log_current = (ContextSwitchEntry *)-1;
profile_switch_log_end = (ContextSwitchEntry *)-1;
profile_switch_log_cyclic = 0;
TX_RESTORE
#endif
}
That’s basically it! We’ll need some code to dump those buffers, but can implement that later.
HOW
From the above you would think logging this would be pretty trivial, and you would be right .. mostly. Let’s start with the easy cases.
The basic thing we want to do is:
- Check if we are the main cpu or the COP, if COP bail
- Disable interrupts if they were enabled
- Check if logging is enabled, if not bail
- Check if we’ve reached the end, if not-cyclic bail, if cyclic reset to start of buffer
- Save our data into the current slot (could stash the slot we want to write to and increment then save while interrupts are enabled, but the “work we do is writing 16bytes; e.g. cheap)
- increment current slot idx
- restore interrupt enabled context
Turns out ISR is actually an easy case. Interrupt handling, as is commonly done, in ThreadX are split between the primary (low) (clear bit, enqueue work, etc.), and secondary (high) half to do the work. The secondary handler is written in c. It simply iterates the handlers, if any are “ready” calls their registered function. We will inject a log before and after the call to the functions so we can better understand their timings. This looks something like:
STATIC VOID _tx_isr_handle_list(TX_ISR_PTR isr_node) {
// for each isr_node ...
// if this ISR has been signaled ...
// ...
#if (FEATURE_DEBUG_TX() == FEATURE_EN)
if (!IsCop()) {
#if (FEATURE_DEBUG_TX_ALLOWCYCLIC() == FEATURE_EN)
if (profile_switch_log_cyclic &&
profile_switch_log_current == profile_switch_log_end) {
profile_switch_log_current = profile_switch_log_start;
}
#endif
if (profile_switch_log_current != profile_switch_log_end) {
profile_switch_log_current->uiType = TX_DEBUG_ISR_ENTER;
profile_switch_log_current->uiCounter = _tx_counter_get();
profile_switch_log_current->uiAddress = (UINT)isr_node->tx_isr_func;
profile_switch_log_current->szThreadName = "ISR";
profile_switch_log_current++;
}
}
#endif
// ~~~ Call the ISR's function ~~~
isr_node->tx_isr_func(isr_node->tx_isr_int_id, isr_node->tx_user_data);
#if (FEATURE_DEBUG_TX() == FEATURE_EN)
if (!IsCop()) {
#if (FEATURE_DEBUG_TX_ALLOWCYCLIC() == FEATURE_EN)
if (profile_switch_log_cyclic &&
profile_switch_log_current == profile_switch_log_end) {
profile_switch_log_current = profile_switch_log_start;
}
#endif
if (profile_switch_log_current != profile_switch_log_end) {
profile_switch_log_current->uiType = TX_DEBUG_ISR_EXIT;
profile_switch_log_current->uiCounter = _tx_counter_get();
profile_switch_log_current->uiAddress = (UINT)isr_node->tx_isr_func;
profile_switch_log_current->szThreadName = "ISR";
profile_switch_log_current++;
}
}
#endif
}
Neat, so that code should cover: TX_DEBUG_ISR_ENTER
and TX_DEBUG_ISR_EXIT
.5
Unfortunately, the rest of the points we want to instrument are … in MIPS assembly.
The hard bit is that in several contexts where we want to log, we are limited to the available registers we can safely blow away w/o saving context. We are limited to k0/k1, so we need to implement in a way that doesn’t require writing to more than two registers.
Below is the set of asm macros we came up with to do the heavy lifting. The main entry point is DEBUG_TX_LOG_HERE_ASM
which takes the two names of the registers to use, the identifier for the type we want to use, and an optional “callback” macro to inject at the end taking the same args sans itself.
Much like the c-code we saw above, the first thing DEBUG_TX_LOG_HERE_ASM
does is to check if we are on the COP and bail if so (3 instructions). Next it injects our DEBUG_TX_LOAD_LOG_PTRS
macro which will load up the passed kreg0/kreg1 with the current/end of the ring buffer respectively; in ring buffer mode it also handles the restarting at the beginning (4 instructions, 14 instructions for cyclic). Next we inject _DEBUG_TX_UPDATE_ENTRY_ASM
to write our data to those pointers (10 instructions). And finally we inject the passed macro name (or an empty default) to be called in all cases after the COP check passes. So for a total of 23 or 33 instructions we have our logging; neat!
Lets take a look:
/*
* _DEBUG_TX_UPDATE_ENTRY_ASM: Updates the current entry of context switch log.
* Sets current thread name pointer, return address
* (EPC), thread name, and passed in type.
*
* kreg0 - first register we can use. kreg0 should point to the current entry.
* kreg1 - second register we can use
*/
.macro _DEBUG_TX_UPDATE_ENTRY_ASM kreg0, kreg1, type
/* Write Entry -- kreg0 is pointer to CSWITCHENTRY struct to update */
li \kreg1, \type
sw \kreg1, CSWITCHENTRY__TYPE_OFFSET(\kreg0)
mfc0 \kreg1, $9 /* Get counter value */
sw \kreg1, CSWITCHENTRY__COUNTER_OFFSET(\kreg0) /* Store it to the log */
mfc0 \kreg1, $14 /* Get EPC */
sw \kreg1, CSWITCHENTRY__ADDRESS_OFFSET(\kreg0) /* Store to the log */
la \kreg1, _tx_thread_system_stack_ptr
lw \kreg1, _TX_THREAD_CURRENT_PTR_OFFSET(\kreg1) /* Pickup current thread pointer */
lw \kreg1, TX_THREAD__TX_THREAD_NAME_OFFSET(\kreg1) /* Pickup tx_thread_name pointer */
sw \kreg1, CSWITCHENTRY__THREADNAME_OFFSET(\kreg0) /* Store to the log */
.endm
/*
* DEBUG_TX_LOAD_LOG_PTRS: Loads kreg0 with Current and kreg1 with End
* pointers. Handles restarting cycle if cyclic. Does
* NOT handle verifying we are not on COP first
* kreg0 - first register we can use
* kreg1 - second register we can use
*
* NOTE: Can fall out early to label "4f" In this case we have assumed that
* logging is enabled. This case we can skip the check, BUT we only promise that
* the current entry kreg0 is valid, not kreg1.
*
* For example:
*
* DEBUG_TX_LOAD_LOG_PTRS $10, $11
*
* sub $11, $11, $10
* blez $11, 1f
* nop
* 4:
*
*/
.macro DEBUG_TX_LOAD_LOG_PTRS kreg0, kreg1
/* Get CSW Logger pointer -- */
la \kreg0, profile_switch_log_current
lw \kreg0, 0(\kreg0)
#if (FEATURE_DEBUG_TX_ALLOWCYCLIC() == FEATURE_EN)
/* Cyclic */
la \kreg1, profile_switch_log_cyclic
lw \kreg1, 0(\kreg1)
beqz \kreg1, 3f
nop
la \kreg1, profile_switch_log_end
lw \kreg1, 0(\kreg1)
bne \kreg1, \kreg0, 3f /* if(current != end) not wrapping save at current */
nop
/* We _have_ wrapped around -- Load the start address into kreg0; we will
* use it as current and store back the next entry. */
la \kreg0, profile_switch_log_start
lw \kreg0, 0(\kreg0)
3: /* Cyclic not enabled, we need to re-load the End address */
#endif
la \kreg1, profile_switch_log_end
lw \kreg1, 0(\kreg1)
.endm
.macro default_extra kreg0,kreg1,type
.endm
/*
* DEBUG_TX_LOG_HERE_ASM: Loads kreg0 with Current and kreg1 with End
* pointers. Handles restarting cycle if cyclic.
* Verifies that we're on COP first. Updates next entry
* based on type And finally updates the current log
* pointer.
*
* kreg0 - first register we can use
* kreg1 - second register we can use
* type - CONSTANT value for the type of our ctx switch
* logging_extra - Will execute after logging ALWAYS so long as we're not COP.
*/
.macro DEBUG_TX_LOG_HERE_ASM kreg0, kreg1, type, logging_extra=default_extra
#if (FEATURE_DEBUG_TX() == FEATURE_EN) // Profile context switching
li \kreg0, CPU_ADDR_CPU_ID
lw \kreg0, 0(\kreg0) /* READ_REG(CPU_ADDR_CPU_ID) */
bnez \kreg0, 2f /* No context switch log on cop */
nop
DEBUG_TX_LOAD_LOG_PTRS \kreg0, \kreg1
/* If they are equal, start has reached end; we're done */
beq \kreg0, \kreg1, 1f
nop
4: /* Cyclic enabled, we have not wrapped */
_DEBUG_TX_UPDATE_ENTRY_ASM \kreg0, \kreg1, \type
/* Update CSWLog to next entry */
addi \kreg0, \kreg0, 16
la \kreg1, profile_switch_log_current
sw \kreg0, 0(\kreg1) /* Save back offset structure */
1:
\logging_extra \kreg0, \kreg1, \type
2:
#endif
.endm
And here’s how we called it:
_tx_thread_schedule_loop:
// ...
.macro post_dbg kreg0, kreg1, type
mfc0 \kreg0, $9
la \kreg1, profile_switch_idle_last
sw \kreg0, 0(\kreg1)
.endm
DEBUG_TX_LOG_HERE_ASM k0, k1, TX_DEBUG_SLEEP, post_dbg
// something like: while(nor unable threads) { wait; }
// We log one of the "RESUME" after restoring registers from
// stack so that we have the return address setup to log.
// After waking from the wait loop and restoring a standard stack
DEBUG_TX_LOG_HERE_ASM k0, k1, TX_DEBUG_SYNC_RESUME
// After waking from the wait loop and restoring a short stack
DEBUG_TX_LOG_HERE_ASM k0, k1, TX_DEBUG_ASYNC_RESUME
// ... Somewhere else in the code ..
// Somewhere near primary ISR code / exception vector (exact context missing now??)
_tx_exception_trampoline:
// ...
.macro wake_log_idle_time kreg0,kreg1,type
# Track idle time
la k1, profile_switch_idle_last
lw k0, 0(k1)
sw zero, 0(k1)
beqz k0, 2f
nop
mfc0 k1, $9
subu k0, k1, k0
la k1, profile_switch_idle_total
lw k1, 0(k1)
addu k0, k1, k0
la k1, profile_switch_idle_total
sw k0, 0(k1)
.endm
DEBUG_TX_LOG_HERE_ASM k0, k1, TX_DEBUG_WAKE, wake_log_idle_time
// Somewhere else in the code ...
_tx_thread_no_preempt_restore:
// After restoring stack (so that $ra is setup) and before returning.
DEBUG_TX_LOG_HERE_ASM k0, k1, TX_DEBUG_ISR_RETURN
You can see how we used the “callback” macros to also track the total time spent idling.
For the interrupt control logging we do something slightly different but are still able to reuse some of the code form those macros. We also guarded it with a separate feature flag since this code is use so heavily on the system. In fact, including it for completeness, but I think only once or twice did we enable it to try catching a case where interrupts were disabled for too long. Also, re-reading this, looks like there’s a bug (!) since we mucking with the ring buffer before disabling interrupts?
_tx_thread_interrupt_control:
/* Pickup current interrupt lockout posture. */
mfc0 $8, $12 # Pickup current SR
nop # Delay slot
andi $2, $8, RETURN_MASK # Return value back to caller
#if (FEATURE_DEBUG_TX() == FEATURE_EN) // Profile context switching
# if (FEATURE_DEBUG_TX_TIC() == FEATURE_EN)
/* Check and bail if we're on COP */
li $10, CPU_ADDR_CPU_ID
lw $10, 0($10) # READ_REG(CPU_ADDR_CPU_ID)
bnez $10, 1f # No context switch log on cop
nop
/* Load up current/end pointers into $10/$11 */
DEBUG_TX_LOAD_LOG_PTRS $10, $11
/* Bail if we're at the end */
beq $11, $10, 1f
nop
4:
/* Figure out the type */
li $11, TX_DEBUG_INT_RESTORE # Type 8
li $12, TX_DEBUG_INT_DISABLE # Type 7
/* Conditional use DISABLE marker if we're disabling */
movz $11, $12, $4
ori $12, $11, TX_DEBUG_PREVIOUSLY_DISABLED
/* Conditionally mark with previous state */
movz $11, $12, $2
/* Expected results:
before\now | TX_DISABLE | TX_RESTORE
----------+------------+----------
enabled | 7 | 8
disabled | 39 | 40
*/
/* Pre-emptivly disable interrupts. */
li $12, RETURN_MASK
not $12, $12
and $12, $12, $2
mtc0 $12, $12
/* Save the debug info */
sw $11, CSWITCHENTRY__TYPE_OFFSET($10)
mfc0 $11, $9 # Get counter value
sw $11, CSWITCHENTRY__COUNTER_OFFSET($10) # Store it to the log
sw $ra, CSWITCHENTRY__ADDRESS_OFFSET($10) # Store return addr to the log
la $11, _tx_thread_system_stack_ptr
lw $11, _TX_THREAD_CURRENT_PTR_OFFSET($11) # Pickup current thread pointer
lw $11, TX_THREAD__TX_THREAD_NAME_OFFSET($11) # Pickup tx_thread_name pointer
sw $11, CSWITCHENTRY__THREADNAME_OFFSET($10) # Store it to the log
addi $10, $10, 16
la $11, profile_switch_log_current
sw $10, 0($11) # Save back offset structure
1:
# endif
#endif
// Set the requested bits essentially some masking then: mtc0 $9, $12
Saving results
There were some functions for this. Leaving as an exercise for the reader since I have way exceeded my time box for writing this ..
Viewing results
There was an app.
Well, a python or awk script took these binary logs and the output output of nm from the binary to convert to a log format for an existing application we had. That app looked a bit like a crappy version of what XCode’s Instruments provides for “Points of interest”.. much crappier. It was deigned be an oscilloscope / logic analyser IIRC.
FIN
Not the most elegant thing ever, but was really fun for me at the time as a young developer. Still makes me smile sometimes that there was a time we did “real” programming. Writing this up reminds me of many other war stories from those days.
-
Over a year ago, so not like it’s new news, but was news to me [:shrug:] ↩︎
-
That reminds me of another hack to threadx, we modified the queue system to support “zero-capacity” queues. This operated like a barrier blocking until both threads arrived reduced copying of the frame descriptor objects which had grown rather large. Later IIRC we changed the queues to pass around pointers so the zero-capacity queue hack was less valuable. ↩︎
-
Probably not relevant to this discussion, but including for completeness. All of the hardware blocks' DMA memory bandwidth scheduling was separate from the OS and pre-scheduled into a “SlotMachine” for a particular mode.. calculated in an excel spread sheet … ahhhh the humanity, ahhh the memories … ↩︎
-
IIRC the See MIPS Run book was really good reference. ↩︎