|
| | RockinB said: |
Ensuring that both CPUs don't access the same memory region at the same time (in an unintended manner, e.g. both writing to the same memory) is purely a software problem. You will have to implement some handshaking. It is best when you know exactly which data is accessed by which CPU. At special points you could synchronize the CPUs, e.g. to perform variable updates.
So globals and heap don't need to be accessed cache-through in general. I would recommend using cache-through access only for a very few dedicated variables. Since you synchronize only a couple of times per frame, you could clear the cache lines on the reading CPU (CSH lib of SBL), or clear the whole cache (slCachePurge() or something similar in SGL) when synchronizing.
|
If I were writing a piece of software from scratch for the Saturn I would certainly take this approach, but in this case I'm thinking about porting to the Saturn a piece of software I wrote with modern multi-processor/multi-core setups (which have hardware cache coherency) in mind. It's mostly just an interesting exercise. One of the design goals of my little programming language is to simplify the use of multiple cores/processors, and I thought it would be interesting to see how well it would handle the rather odd setup in the Saturn/32X. Of course, since the current implementation is just an interpreter, the performance will suck compared to C code running on even one processor, but I can still compare the performance of the interpreter running on one processor (with locking removed and nothing accessed via the cache-through area) vs. two.
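For anyone following along, the cache-through area is just an address alias on the SH-2: as far as I know, setting bit 29 of a cached address (OR with 0x20000000) reaches the same memory while bypassing the cache. Here is a minimal sketch of that arithmetic. The function and macro names are made up for illustration, and addresses are plain integers so it also compiles on a host:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: SH7604-style address map, where the top three address bits
 * select the access mode and 001 means cache-through. */
#define SH2_CACHE_THROUGH_BIT 0x20000000u
#define SH2_MODE_MASK         0xE0000000u

/* Map a cached address to its cache-through alias (hypothetical helper). */
uint32_t cache_through_addr(uint32_t cached)
{
    assert((cached & SH2_MODE_MASK) == 0u); /* must be in the cached area */
    return cached | SH2_CACHE_THROUGH_BIT;
}

/* Map a cache-through alias back to the cached address (hypothetical helper). */
uint32_t cached_addr(uint32_t uncached)
{
    assert((uncached & SH2_MODE_MASK) == SH2_CACHE_THROUGH_BIT);
    return uncached & ~SH2_CACHE_THROUGH_BIT;
}
```

So something in high work RAM at 0x06004000 would be reachable uncached at 0x26004000. On the real hardware you would cast the result to a volatile pointer before dereferencing it.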
| |
I have been using the SGL function slSlaveFunc() in the voxel demo, the texture coordinate stuff and the GBC and SNES emus. Based on some SBL dual-CPU code, I have reimplemented and extended the SGL handling of the slave for the SGL replacement. You can have some code, if you like.
|
I appreciate the offer, but I'm a masochist and like to target the bare metal. I have implementations of most of the parts of the C library that I use, with the notable exception of the file functions, for my Sega CD port (though my malloc/free implementation is crappy, I really should write a nice slab allocator).
| |
Use GCC's attribute extension to assign them to one segment with a virtual load address in the uncached region.
|
So you can give a variable a custom attribute and use that to put it in a different part of memory? Nifty. I don't think I'll have much reason to use it though. There aren't a lot of globals in the code and I'd be hard-pressed to think of any that would really benefit from being cached given the Saturn's setup. I'll probably just set up my linker script so that the .data and .bss sections are accessed via the cache-through region.
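For reference, my understanding of the attribute approach is roughly the following. This is only a sketch: the section name `.uncached`, the `UNCACHED` macro and the variable names are all hypothetical, and the linker script (not shown) would still have to place that section at an address in the cache-through region for this to have any effect on the Saturn.

```c
#include <stdint.h>

/* Hypothetical section name; the linker script must map it to the
 * cache-through alias of work RAM for the variables to be uncached. */
#define UNCACHED __attribute__((section(".uncached")))

/* The few variables both CPUs touch go into the dedicated section. */
UNCACHED volatile uint32_t frame_ready = 0;
UNCACHED volatile uint32_t shared_counter = 0;

/* One CPU publishes its work, then raises the flag the other CPU polls. */
void signal_frame_ready(void)
{
    shared_counter++;
    frame_ready = 1;
}
```

Everything without the attribute stays in the normal .data/.bss sections, so the choice between cached and cache-through can be made per variable instead of for the whole program.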
| |
and on the Saturn will be different for workram-H and -L.
|
I'm assuming that's because access to one of those is faster than the other? Which is which?
| |
When going for maximum performance it's also worth remembering that cache invalidation and pulling "read-once"-type data into the cache will lower the cache utilization for the rest of your code and data.
|
That's part of what got me thinking along these lines in the first place. The code alone is much bigger than the cache (my Sega CD port currently compiles to about 36 or 37 KB with -O0), so my hope is that some of the performance lost by leaving heap data uncached will be offset by better cache utilization for the code itself.
| |
This doesn't apply to the Saturn, but some cache types write back an entire cache line into memory at once.
|
Yeah, write-back cache without hardware cache coherency would be a total nightmare.
| |
I don't know what core the 32x uses.
|
I'm pretty sure the 32X SH-2s use write-through cache just like the Saturn. I was under the impression that the 32X uses the same part (or at least a very similar one) just clocked a little slower. |