gandrewstone commented Feb 2, 2021

This code optimizes the RMT fill interrupt handler for execution time. It includes some small C optimizations, but most of the gain comes from replacing the C inner loop with unrolled assembly language.

The assembly code writes the RMT pattern for all 32 bits of pixeldata4 into the RMT buffer.
It achieves a jump-free 4 cycles per bit by operating as follows:

First it shifts the target bit into the MSB of reg %3 (not necessary for the first bit, which is already there). Then it executes 2 conditional move operations that copy the correct RMT pattern into a working register, based on the sign of %3. Since we shifted the target bit into the MSB, that bit defines the sign, so the assembly instructions movgez (move if greater than or equal to zero) and movltz (move if less than zero) also have the semantics "move if MSB is 0" and "move if MSB is 1" respectively.
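The selection step can be sketched in portable C. This is not the assembly from the PR: `ITEM_ZERO`, `ITEM_ONE` and `fill_items_msb_first` are hypothetical names with placeholder values, and the ternary only stands in for the movgez/movltz pair (a compiler may or may not turn it into conditional moves).

```c
#include <stdint.h>

/* Placeholder RMT patterns; real values encode chipset-specific pulse timings. */
#define ITEM_ZERO 0x00000001u  /* hypothetical pattern for a 0 bit */
#define ITEM_ONE  0x00000002u  /* hypothetical pattern for a 1 bit */

/* Write one pattern per bit of `word`, most significant bit first,
 * into out[0..31]. */
static void fill_items_msb_first(uint32_t word, uint32_t out[32])
{
    for (int i = 0; i < 32; i++) {
        /* The bit under test sits in the MSB, so a signed comparison
         * against zero reads it: negative means 1, non-negative means 0.
         * (Assumes the usual two's complement conversion to int32_t.) */
        out[i] = ((int32_t)word < 0) ? ITEM_ONE : ITEM_ZERO;
        word <<= 1;  /* move the next bit into the MSB */
    }
}
```

This version stores the items in plain incrementing order and keeps the loop; the unrolled assembly removes the loop and picks a different store offset for each bit.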

Finally we store the working register to memory, indexed by pItem with a specified offset. If the ESP32 were big endian, the offsets would simply increment: 0, 4, 8... However, the ESP32 is little endian, which means the bytes within the 32-bit word are backwards, but the bits within each byte are forwards. Hence the non-incremental store offset order.
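To make the offset order concrete, here is a small sketch that computes it. It assumes the 32-bit word was loaded from four consecutive pixel bytes in memory and that each RMT item is 4 bytes wide; `store_offset` is a hypothetical helper, and the literal offsets used in the PR's assembly are not reproduced here.

```c
#include <stdint.h>

/* Byte offset of the RMT item for the k-th bit shifted out of the MSB
 * (k = 0 is bit 31 of the word). */
static unsigned store_offset(unsigned k)
{
    /* On a little-endian load, the word's top byte is the LAST byte in memory. */
    unsigned byte_in_mem = 3u - (k / 8u);
    /* Within a byte, bits still go out on the wire MSB first. */
    unsigned bit_in_byte = k % 8u;
    /* Item index = position of this bit in wire order; 4 bytes per item. */
    return (byte_in_mem * 8u + bit_in_byte) * 4u;
}
```

Under those assumptions the stores land at offsets 96 to 124, then 64 to 92, then 32 to 60, and finally 0 to 28.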

…sses memory once per 32 bits. Inline the code into the interrupt handler.

samguyer (Owner) commented Feb 8, 2021

@gandrewstone Wow, this is amazing work -- thank you. I would like to do a little testing and benchmarking on this code, but I'd like to merge it in as an option, guarded by ifdef. Do you have a sense of how much faster it is than the baseline C code? I'd be thrilled if it were fast enough that we didn't need to worry so much about other interrupts coming in (especially from WiFi) and disrupting the timing.

gandrewstone (Author) commented Feb 11, 2021

If you look at 1883558 in my repo, you will see a bunch of different variants and some logic that measures timing. You could grab parts of that to do your measurements.

IIRC it was reporting < 1000 cycles whereas the original was more like 1500+, so almost 2x as fast.

But I don't really believe these numbers. During the optimization process, some changes (like removing a function call, or reading all 32 bits in one shot instead of in char chunks) produced a suspiciously small drop in the measured cycle count, compared to what people reported in various ESP32 forum posts about timing measurement. I would be surprised if the compiler optimizer was sophisticated enough to have been doing those things already.

I have an app that is constantly receiving data from the WiFi and pushing it out to the LEDs. I was running 4000 LEDs in an 8x500 configuration, but sometimes I'd try a smaller setup (4x50). I was seeing flashing, often along an entire strand but sometimes not on the first few bulbs. In particular, if the lowest bit of a color was set, I'd see flashes of the subsequent color (in the order the color data is put on the wire). If the lowest bit was NOT set but bit 1 was, I'd see flashes much more rarely. From that I reasoned that something was inserting an extra bit (or sometimes 2) on the wire, shifting the LSB of one color into the MSB of the next. This could be caused by a delay being inserted in the line, or by a spurious final bit.
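That reasoning can be checked with a small host-side simulation. It is only a sketch under a stated assumption: exactly one spurious 0 bit arrives ahead of the data, so every bit slides one position later. `decode_with_extra_bit` is a hypothetical helper, not code from the PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode `n` color bytes as a strand would see them if one spurious 0 bit
 * were inserted in front of the stream. Each output byte then takes its MSB
 * from the LSB of the previous input byte. */
static void decode_with_extra_bit(const uint8_t *in, uint8_t *out, size_t n)
{
    uint8_t carry = 0;  /* the spurious leading bit */
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)((carry << 7) | (in[i] >> 1));
        carry = in[i] & 1u;  /* this byte's LSB spills into the next byte */
    }
}
```

With input {0x01, 0x00, 0x00} the result is {0x00, 0x80, 0x00}: a barely lit first color becomes a bright flash of the next one. With input {0x02, 0x00, 0x00} the result is {0x01, 0x00, 0x00}, so a single extra bit causes no visible flash, and it would take two extra bits to push bit 1 into the next color.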

I eliminated the possibility of a final bit by causing the RMT to issue the 50us latch time itself, rather than doing it by enforcing a delay. This ensured that when the RMT was releasing the line, it wasn't driving high for some reason (I also put in some pull-down resistors). [aside: Having the RMT issue the latch delay is an interesting idea, but since the CPU thread is waiting for RMT completion, it's faster in this architecture to do it your way with the CPU delay lock, since your way lets the CPU work on the changes in parallel with the final 50us delay. However, using an RMT-generated 50us latch time would allow the system to be re-architected so that the RMT subsystem independently and continually updates the strands, with no explicit update call needed. But this would be more like how a video card works, and you'd have similar problems like "ripping" (tearing).]
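For illustration, an RMT-generated latch could be expressed as one extra item that holds the line low. This is a sketch, not the change described above: `rmt_pack` and `latch_item` are hypothetical helpers, the tick rate is an assumed input (RMT source clock divided by the configured divider), and the packing assumes the ESP32 item layout of two halves, each with a 15-bit duration and a 1-bit level.

```c
#include <stdint.h>

/* Pack one 32-bit RMT item: first half in the low 16 bits, second half in
 * the high 16 bits. Durations wider than 15 bits are truncated by the mask. */
static uint32_t rmt_pack(uint32_t dur0, uint32_t lvl0,
                         uint32_t dur1, uint32_t lvl1)
{
    return (dur0 & 0x7FFFu) | ((lvl0 & 1u) << 15) |
           ((dur1 & 0x7FFFu) << 16) | ((lvl1 & 1u) << 31);
}

/* Build an item that keeps the line low for `latch_us` microseconds,
 * split across both halves of the item. */
static uint32_t latch_item(uint32_t latch_us, uint32_t tick_hz)
{
    uint32_t ticks = (uint32_t)(((uint64_t)latch_us * tick_hz) / 1000000u);
    uint32_t half = ticks / 2u;
    return rmt_pack(half, 0u, ticks - half, 0u);
}
```

For example, at an assumed 40 MHz tick rate (25 ns per tick), `latch_item(50, 40000000)` yields two low halves of 1000 ticks each.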

So it had to be either something interrupting the RMT interrupt handler, or the interrupt handler not supplying the data quickly enough.

Anyway, tldr: with this code, I do not see any flashing.

(BTW, if you are using WS2811 strands and still see flashing, keep in mind that you can slowly "blow" the 1st bulb, I think by letting data ride above VCC: for example, if you connect data before you've connected independent power to the bulbs. A bulb "blown" this way becomes very sensitive to voltage changes and so injects spurious bits into the line. The only solution I know of is to cut out the 1st bulb. One possible indication that the first bulb or two are blown is that they spuriously turn on right at power-on, or at least that's what I see.)
