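If possible, use DMA. If you have to use CPU transfers, try the existing library code first. Smart people have spent a lot of time optimizing it. Other than that, write in larger chunks (words or longwords instead of bytes) to reduce the amount of time spent waiting on the (S)DRAM. Use all the normal optimization tricks, like unrolling your loops and taking advantage of the instruction set where possible. Also test on real hardware if you can; emulators rarely take things like memory wait states or pipeline stalls into account.

A naive attempt at a 2-way unrolled version (still writing only a single byte at a time):

Code:

    ! Arguments as the code appears to use them (an assumption, not stated
    ! in the post): r4 = dst, r5 = width, r6 = height, r7 = pitch
        .align 4
    _memclrwh:
        mov     #0, r1          ! r1 = 0, the byte to store
        cmp/eq  r1, r5
        bt      leave           ! nothing to do if r5 == 0
        cmp/eq  r1, r6
        bt      leave           ! nothing to do if r6 == 0
        mov     r5, r0          ! tst #imm only works on r0
        mov     r4, r3
        add     #1, r3          ! r3 = row start + 1 (loop bound)
        add     r5, r4          ! r4 = one past the end of the row
    outer:
        tst     #1, r0          ! T = 1 if r0 is even
        bf/s    odd             ! odd count: enter the loop at its midpoint
        mov     r4, r2          ! (delay slot) r2 = write pointer
    inner:
        mov.b   r1, @-r2        ! first store of the unrolled pair
    odd:
        cmp/hi  r3, r2          ! T = (r2 > r3), unsigned
        bt/s    inner
        mov.b   r1, @-r2        ! (delay slot) second store of the pair
        add     r7, r3          ! move the loop bound to the next row
        dt      r6              ! decrement row counter, T = 1 at zero
        bf/s    outer
        add     r7, r4          ! (delay slot) move the row end to the next row
    leave:
        rts
        nop                     ! (delay slot)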
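To illustrate the "write in larger chunks" advice, here is a rough C sketch of the same rectangle clear that uses aligned 32-bit stores for the middle of each row and falls back to single bytes only at the unaligned edges. The function name `clear_rect_words` and its parameters are made up for this example, and it assumes the destination is ordinary memory that may be accessed as 32-bit words. It is not a tuned routine, just the shape of the idea; the library `memset` may well still be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero a width x height byte rectangle whose rows
 * start `pitch` bytes apart. Assumes the memory may be accessed through
 * uint32_t (e.g. a frame buffer or a word-typed buffer). */
void clear_rect_words(uint8_t *dst, size_t width, size_t height, size_t pitch)
{
    assert(pitch >= width);     /* rows must not overlap */

    for (size_t y = 0; y < height; y++) {
        uint8_t *p = dst + y * pitch;
        size_t n = width;

        /* Head: single bytes until the pointer is 4-byte aligned. */
        while (n > 0 && ((uintptr_t)p & 3u) != 0) {
            *p++ = 0;
            n--;
        }

        /* Body: one aligned 32-bit store per four bytes. */
        uint32_t *w = (uint32_t *)(void *)p;
        for (; n >= 4; n -= 4)
            *w++ = 0;

        /* Tail: whatever is left over (0 to 3 bytes). */
        p = (uint8_t *)w;
        while (n > 0) {
            *p++ = 0;
            n--;
        }
    }
}
```

On the SH the body loop would correspond to `mov.l` stores, which need a 4-byte aligned address; that is why the head loop is there. Whether this actually wins over the byte version depends on the row width and the memory being written, so measure it on real hardware.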