Cache Coherency Sanity Check
Mask of Destiny | Feb 16, 2007
Mask of Destiny | Feb 19, 2007

So has no one here done any multi-processor work on the Saturn?
vbt | Feb 19, 2007

Maybe Rockin'-B has. I made some attempts, but with no success.
antime | Feb 20, 2007
It's safer/more correct to do:

Code:

This way you can't do the addition twice by mistake. However, learning the linker script syntax is worth it IMO.

First, if you discover that a variable isn't shared, you can change it to be cached just by modifying its attribute, rather than finding and changing all the code that accesses it.

Second, grouping all uncached variables together in memory means they won't pollute the cache just by sharing a cache line with a cached variable.

Third, you will need to learn to use these features if you want to use overlays, run code from on-chip memory, etc.

(This doesn't apply to the Saturn, but some cache types write back an entire cache line into memory at once. In these cases the uncached variables must be placed separately from the cached ones, or you may write back stale data. I don't know what core the 32X uses.)
Mask of Destiny | Feb 20, 2007
If I was writing a piece of software from scratch for the Saturn I would certainly take this approach, but in this case I'm thinking about porting a piece of software I wrote with modern multi-processor/multi-core setups (which have hardware cache coherency) in mind to the Saturn. It's mostly just an interesting exercise. One of the design goals of my little programming language is to simplify the utilization of multiple cores/processors, and I thought it would be interesting to see how well it would handle the rather odd setup in the Saturn/32X. Of course, since the current implementation is just an interpreter, the performance will suck compared to C code running on even one processor, but I can still compare the performance of the interpreter running on one processor (with locking removed and nothing accessed via the cache-through area) vs. two.

I appreciate the offer, but I'm a masochist and like to target the bare metal. I have implementations of most of the parts of the C library that I use, with the notable exception of the file functions, for my Sega CD port (though my malloc/free implementation is crappy; I really should write a nice slab allocator).

So you can give a variable a custom attribute and use that to put it in different parts of memory? Nifty. I don't think I'll have much reason to use it though. There aren't a lot of globals in the code, and I'd be hard-pressed to think of any that would really benefit from being cached given the Saturn's setup. I'll probably just set up my linker script so that the .data and .bss sections are accessed via the cache-through region.

I'm assuming that's because access to one of those is faster than the other? Which is which? That's part of what got me thinking along these lines in the first place.

Just the code itself is easily much bigger than the cache (my Sega CD port currently compiles to about 36 or 37K with -O0), so my hope is that some of the performance loss from heap data being uncached will be offset by better cache utilization for the code itself.

Yeah, a write-back cache without hardware cache coherency would be a total nightmare. I'm pretty sure the 32X SH-2s use a write-through cache just like the Saturn. I was under the impression that the 32X uses the same part (or at least a very similar one), just clocked a little slower.
antime | Feb 20, 2007
Strong coherency and ordering as provided by the x86 is really the odd one out. Most architectures have much looser memory models, both for performance and for implementation simplicity. E.g. the PPC architecture manual says that even if a memory page is marked as requiring memory coherency, you still have to explicitly execute a "sync" instruction before the results are visible to other CPUs.

Yes, and it works with functions as well. Look for "Output section attributes" in the ld manual.

Work RAM-L uses DRAM and Work RAM-H uses faster SDRAM. That is also why SCU DMA only works with Work RAM-H.
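For reference, a minimal sketch of the kind of GNU ld script this points at. The region names, sizes, and the `.uncached` section name are my own assumptions, not from any real Saturn toolchain; the addresses assume high work RAM at 0x06000000 with its cache-through mirror at 0x26000000.

```ld
MEMORY
{
    /* Both regions describe the same physical RAM: one through the
       cache, one through the SH-2 cache-through mirror. */
    wram_h  (rwx) : ORIGIN = 0x06000000, LENGTH = 1M
    wram_hu (rw)  : ORIGIN = 0x26000000, LENGTH = 1M
}

SECTIONS
{
    /* Anything tagged __attribute__((section(".uncached"))) in C is
       collected here and linked at the cache-through address. */
    .uncached : { *(.uncached) } > wram_hu

    .text : { *(.text) } > wram_h
    .data : { *(.data) } > wram_h
    .bss (NOLOAD) : { *(.bss) } > wram_h
}
```

With this layout, moving a variable between cached and uncached memory is just a matter of adding or removing the section attribute on its declaration, which is the "change it by modifying its attribute" point from earlier in the thread.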
Mask of Destiny | Feb 20, 2007
I probably should have said modern desktop systems (which are almost all x86 systems these days). I can't think of how you'd manage SMP on a modern CPU without reasonable cache coherency support. Even on x86 you need to use one of the fence instructions to make sure that all pending writes have actually occurred, but that's usually only an issue when you're implementing locking primitives or trying to do atomic updates. Those kinds of limitations are annoying, but at least they can be managed by your locking primitives. It's having to manually flush cache lines that my code really can't handle.

Good to know. Might come in handy for Sega CD work as well.

Also good to know. Thanks!
antime | Feb 20, 2007
It doesn't really require more than invalidating the cache after you've acquired the lock that protects the shared resource. If you use a language that's designed with multithreading/multiprocessing in mind it can even be done automatically. It's only really the C model that's problematic, but it has all kinds of problems with the modern computing environment anyway. As long as all shared data protected by one lock can be packed into a single object it shouldn't be a big task to write locking functions or macros that do the flushing automatically. | |||||||
Mask of Destiny | Feb 20, 2007
Not if the processor uses a write-back cache, which AFAIK is pretty much the norm for processors that are likely to be used in an SMP setup these days. The "distance" between the processor and memory has gotten too great for write-through to be reasonably performant anymore.

Indeed. It gets complicated if the object has pointers to separately allocated heap objects/structs/arrays. I suppose you could require that each separately allocated object has its own lock, but that's not necessarily a good idea from a performance point of view. Locking overhead is enough of a problem as is.