Reading from the framebuffer?
XL2 - Aug 23, 2018
antime | Aug 23, 2018
SCU DMA can read from the VDP1 back buffer. Since bandwidth isn't free, it'll cost some performance. EDIT: Maybe it would be possible to use the SCU DSP to generate a subsampled version of the back buffer? There's also still a lot of mid-to-late-'90s material on occlusion culling available that may fit the Saturn's limits better than much of what is presented today.
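To make "subsampled" concrete, here is a rough CPU-side sketch that keeps one pixel out of every `step` in each direction. It assumes the back buffer has already been copied somewhere the CPU can read it as a linear array of 16-bit pixels. `subsample_fb` and its parameters are hypothetical names, not SGL functions, and whether the SCU DSP could do the same job is left open.

```c
#include <stdint.h>

/* Copy every `step`-th pixel of every `step`-th row from src (src_w x src_h)
 * into dst, which must hold (src_w / step) * (src_h / step) pixels.
 * Hypothetical helper for illustration; not part of SGL. */
void subsample_fb(const uint16_t *src, int src_w, int src_h,
                  uint16_t *dst, int step)
{
    int dst_w = src_w / step;
    int dst_h = src_h / step;

    for (int y = 0; y < dst_h; y++) {
        /* Start of the source row that feeds destination row y. */
        const uint16_t *row = src + (y * step) * src_w;
        for (int x = 0; x < dst_w; x++)
            dst[y * dst_w + x] = row[x * step];
    }
}
```

With a 512x256 source and a step of 8, the result is 64x32 pixels, or 4 KB at 16 bits per pixel.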
Ponut | Aug 24, 2018
I have a general lack of experience in programming, but a few things might be helpful:
1. Occlusion planes rather than "Z-sorting" would be more performant. My only thought on how to do that is a distance test on the center of each object (mesh) plus an intersection test with the occlusion plane (which would actually be a 2D line). With that you could determine whether the object is past the occlusion plane (in absolute distance) and whether it actually intersects the plane. That can be done entirely on the CPU, but it won't work for entire level meshes, only for smaller game objects.
2. Besides that, the framebuffers are something like 256 KB? Or maybe even 512 KB? In that case, that is too much to go through in a single frame. It might not even work, because the next copy would start before the first has finished. Not good! Instead, you could set up something that copies 1/10th of the framebuffer at a time and performs your sorting on that. Not ideal, but it should work. My idea is like this:
Code:
static Uint8 copytimer = 0; // index of the current chunk, 0..9
if (copytimer >= 10) { copytimer = 0; } // wrap around after the tenth chunk
slDMACopy(framebuffer + (copytimer * 26215), workarea, 26215);
ztSortPolygons(framebuffer + (copytimer * 26215)); // Assume this does its work on 26215 bytes of buffer at a time
copytimer++;
XL2 | Aug 24, 2018
Thanks, both of you. Would it be possible to use SCU DMA to retrieve only the MSB of each pixel and merge these bits into bytes? A 512x256 buffer would then only require 16 KB? And even less if you skip pixels and don't retrieve the whole thing?

I have been scratching my head for months trying to solve the occlusion problem with something that fits both Sonic X-Treme's totally inconsistent maps and something like Quake. A PVS is nice, but it's only as accurate as your map subdivision, which means you need a lot of memory and spend more time searching the nodes and doing frustum culling. A BSP with a PVS or portals is nice too, but it doesn't work well for open-world maps, and since the Saturn has no texture coordinates you end up with really weird walls and floors (like in the Slavedriver games) and lots of vertices and quads to deal with. A portal system requires a lot of manual work and is only good for interior maps. I also thought of placing some occlusion walls manually, or even generating them automatically, but that can quickly kill performance, so you are restricted to only a few walls, unlike with a portal system, and it requires more manual work than automated techniques. I will consider your idea though, Ponut.

A PVS plus a "depth" buffer seems like a good fit, since the game would generate its own (limited) occlusion map from the last frame, but it's by no means simple or super fast. Any better ideas?
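A minimal sketch of the MSB-packing idea from the post above, done on the CPU rather than with SCU DMA. It assumes the framebuffer has already been copied into work RAM as 16-bit pixels. `pack_msb` and its parameters are hypothetical names, not SGL functions. At one bit per pixel, 512x256 pixels come to 512 * 256 / 8 = 16,384 bytes, which matches the 16 KB estimate.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack the most significant bit of each 16-bit pixel into a bitmask,
 * eight pixels per output byte (first pixel lands in the highest bit).
 * pixel_count must be a multiple of 8.
 * Hypothetical helper for illustration; not part of SGL. */
void pack_msb(const uint16_t *pixels, size_t pixel_count, uint8_t *mask)
{
    for (size_t i = 0; i < pixel_count; i += 8) {
        uint8_t bits = 0;
        for (size_t b = 0; b < 8; b++) {
            /* Shift the MSB (bit 15) down to bit 0 and append it. */
            bits = (uint8_t)((bits << 1) | (pixels[i + b] >> 15));
        }
        mask[i / 8] = bits;
    }
}
```

Skipping pixels, as suggested in the post, would shrink the mask further at the cost of coarser occlusion information.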
mrkotfw | Aug 24, 2018
In terms of fetching the MSB, I believe you can use the update features of the SCU DMA to fetch a byte, then skip a byte. With the SCU DSP, it would mean multiple transfers: from the back buffer to HWRAM, then to the DSP data banks.
XL2 | Aug 24, 2018
I guess the SGL function is doing something like that, since you can get a lower-res version of the buffer (like 64x32), but again there is little to no detail on these functions. I'm not even sure whether it's the SH2 reading the buffer or an SCU indirect transfer. Either way, when should I call it? During V-blank in? I tried the PVS technique, but with all the ray tracing it takes hours to build a map, since these aren't corridor maps and it needs to take into account the many areas the camera can go to. I could probably speed it up, but it's a bit insane!
mrkotfw | Aug 25, 2018
Here is what slGetFrameData does. I haven't verified, but I see no calls to SCU DMA.
Code:
antime | Aug 25, 2018
If the function used DMA, it would almost certainly be documented, to prevent conflicts. It doesn't look like you can copy single bytes: the read address increment options are 0 and 4, and the write address increment options do not include one byte.
XL2 | Aug 25, 2018
Wow, thanks a lot, amazing! How did you disassemble the function? I guess I should really try to learn assembly... AFAIK, with SGL, SCU DMA channel 0 is free for the user while the rest are used by SGL. I guess I could just let the slave do it at the start of the game loop while the main CPU is preparing the frustum and other stuff? Even if it doesn't work well for occlusion, sending the framebuffer to a sprite is also a nice effect, so nothing would be lost. Edit: I did manage to increase the speed of the PVS building quite a bit, so it might be a viable solution with RLE compression.
antime | Aug 25, 2018
Objdump... can disassemble object files. Other useful binutils tools include ar... and nm....
mrkotfw | Aug 25, 2018
On MinGW/Cygwin/Unix:
Code:
I've attached a .zip file for you that includes the source.
XL2 | Aug 26, 2018
Thanks a lot. I guess either using the slave to do it or using an SCU DSP transfer would be my best option. I will take a look at these functions later this week. Thanks again.
XL2 | Aug 27, 2018
I think this will give you a better idea of what I'm thinking of doing. It was very easy to implement, but it's still a bit slow. It would complement the PVS, and hopefully I will find a way to subdivide the quads close to the camera to prevent such bad clipping. Anyway: the weird colors you can see are color bank pixels whose MSB I flipped so they could be displayed using a 16-bit sprite. These would be the occludees, while the correctly colored quads would be the occluders. So these huge objects blocking the camera and both sides of a node could at least block some extra geometry. The buffer is currently 88x56, which seems like it could work if the algorithm is conservative (for example, checking a bit more than the quads' boundaries to avoid rejecting too much). Making it fast is a whole other thing, and I'm not sure it can be done, but whatever.
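One way to read the conservative test described above, as a rough sketch: treat the 88x56 buffer as a grid of cells where nonzero means "covered by an occluder", grow the quad's screen-space bounding box by a margin, and only reject the quad if every cell in the grown box is covered. The one-byte-per-cell layout, the `rect_is_occluded` name, and the `margin` parameter are assumptions for illustration. Depth ordering is ignored here, so this alone is not enough to decide visibility.

```c
#include <stdbool.h>
#include <stdint.h>

#define OCC_W 88
#define OCC_H 56

/* Returns true only if every cell of the box [x0,x1] x [y0,y1], grown by
 * `margin` cells on each side and clamped to the buffer, is marked as an
 * occluder (nonzero). Any uncovered cell keeps the quad visible, so errors
 * lean toward drawing too much rather than too little.
 * occ is a row-major array of OCC_W * OCC_H bytes (assumed layout). */
bool rect_is_occluded(const uint8_t *occ,
                      int x0, int y0, int x1, int y1, int margin)
{
    x0 -= margin; y0 -= margin;
    x1 += margin; y1 += margin;

    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 > OCC_W - 1) x1 = OCC_W - 1;
    if (y1 > OCC_H - 1) y1 = OCC_H - 1;
    if (x0 > x1 || y0 > y1)
        return false; /* Entirely off-buffer: leave it to frustum culling. */

    for (int y = y0; y <= y1; y++)
        for (int x = x0; x <= x1; x++)
            if (occ[y * OCC_W + x] == 0)
                return false;
    return true;
}
```

The early exit on the first uncovered cell means visible quads are usually cheap to test; fully hidden ones cost the whole box.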
mrkotfw | Aug 29, 2018
Thanks, that gives me a better idea. There are a few things here...
Are there other areas where performance could be improved? Have you timed your code with the CPU FRT? Have you timed how long it takes to render? The framebuffer idea seems wild.

With the DSP, you have 4 data banks of 256 bytes each (1024 bytes in total). The smallest access is 4 bytes. The DSP can DMA straight into its 4 data banks (and 1 program bank). Though, I've tried to DMA from LWRAM and the Saturn would lock up, so I'm not sure if you'd be able to DMA straight from the B-bus to the DSP data banks. I believe it can do 4 loads in parallel, though some of them go into non-general-purpose registers (A, X, Y, etc.). I just don't know how you would use the DSP for this purpose.

Then there's the slave CPU. You DMA from the VDP1 FB to HWRAM. Then you have to keep the slave off the CPU bus, so you manually copy chunks of the DMA'd FB into the slave's split cache. You have about 2 KiB there. I'm just throwing ideas out there. I don't know if you've done this already, but getting some way to objectively profile the game would be a really good step to take soon.
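On the profiling question, the arithmetic is small enough to sketch. The SH-2 free-running timer counts in 16 bits, so the counter can wrap between a start and an end sample; unsigned 16-bit subtraction handles that. Reading the counter itself is hardware-specific and left out. The function names here, and the divider value in the note below, are assumptions for illustration.

```c
#include <stdint.h>

/* Ticks elapsed between two samples of a 16-bit free-running counter.
 * Correct as long as fewer than 65536 ticks passed between the samples. */
uint16_t frt_elapsed(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Convert ticks to microseconds for a counter clocked at cpu_hz / divider. */
uint32_t frt_ticks_to_us(uint32_t ticks, uint32_t cpu_hz, uint32_t divider)
{
    return (uint32_t)(((uint64_t)ticks * divider * 1000000u) / cpu_hz);
}
```

With a divider of 8, for example, the counter wraps every 65536 * 8 = 524,288 CPU cycles, so longer measurements need a larger divider or an overflow count.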
XL2 | Aug 30, 2018
Quad subdivision is tricky, unless of course you just store different textures and polygons/vertices. That would be the fastest way for sure, but it takes way too much memory, both RAM and VRAM. It's easy to subdivide a sprite along its height (it's what I do for the water animation: I just "scroll" the starting address). I guess doing something like Quake 2 on PS1 could be one workaround, but it's annoying to always encounter loading screens. Creating new vertices and polygons in real time could also be done, and AFAIK that is what the PS1 does with its SDK, even if it's slower than just storing them in RAM, but I'm not sure how to subdivide a sprite horizontally in VRAM. If you change the width, you just end up with a sprite whose lines alternate with the other horizontal half, and changing the pointers leads to the same problem.

As for performance, my CPU code can certainly be improved a lot; I'm not doubting that. The way SGL handles draw commands is that it stores everything in a few buffers: vertex buffer, polygon buffer, z-sort buffer and draw command buffer. When you sync, it just DMAs everything to VRAM in one batch and the CPU moves on. AFAIK, since LWRAM is on the A-bus, I guess you can't DMA directly to the DSP, but I could be wrong.

As for how I know the VDP1 is the bottleneck, I simply have different debug modes: untextured polygons only, wireframe only, Gouraud-shaded textured polygons, etc. The Gouraud-shaded polygons lead to many slowdowns, which don't happen in the other modes. Of course, with better CPU optimizations I could probably do more, but at the same time all this overdraw also increases the CPU load (more vertices and polygons to process for nothing).

But anyway, the reaction to the SAGE demo so far has been negative overall (some people even complain that I'm using 3D models instead of sprites!), and most people just try the demo on their slow PCs with emulators, don't even bother plugging in a controller, and then complain online that it doesn't control well or that it slows down. So I'll just stop wasting time on Sonic Z-Treme and move on to the FPS game. That means a simple portal system could be implemented, so I don't need to overthink an all-around solution. It solves many issues and involves less work in the end, since I don't need to try to mimic a game that wasn't even built for the Saturn, but I might still play with the framebuffer to add some cool effects.
Ponut | Aug 30, 2018
I know it's off-topic, but wow. I would say they have high standards, but maybe in this case I am confusing high with low. (I would say something about performance, but I have an i7 4770K @ 4.5 GHz...) As far as performance goes, I know I don't have much to add. Have you explored the option of only partially calculating the occlusion/PVS each frame? (The idea being that the occlusion is a "buffer" of occluded polygons that is filled partially each frame.)
XL2 | Aug 30, 2018
With a portal system I could easily precalculate the PVS, so at runtime all you need to do is decompress the PVS for your current node/leaf, flag the visible nodes with the current tick count, and then run your BSP/octree normally, skipping the nodes that aren't potentially visible. It speeds up the CPU calculations quite a lot and partially solves the occlusion problem. You can also do like the Slavedriver engine and add user clipping draw commands to prevent even more overdraw, but you would need to clip these against the portals, so I'm not 100% sure it's worth the extra CPU load. SGL has a sorting option where it draws all polygons in front of the previous polygons within the same PDATA, so you could always include a user clip command first for each plane and use the "sort before" option for all the following polygons, which minimizes the overdraw as much as you can on the Saturn. Anyway, I will take my time on this to properly write a BSP compiler and portal generator, so it might take a few months.
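The runtime side of that can be sketched roughly as below. The thread only says the PVS is compressed (RLE was mentioned earlier), so the format here is an assumption: the Quake-style scheme where nonzero bytes are copied and a zero byte is followed by a count of zero bytes to emit. `Node`, `visible_tick`, and both function names are hypothetical, and the input is assumed to be well formed.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t visible_tick; /* Last tick at which this node was in the PVS. */
} Node;

/* Expand a zero-run-length-encoded bit vector into `out` (row_bytes long).
 * Assumed encoding: a nonzero byte is copied as-is; a zero byte is followed
 * by a count N and expands to N zero bytes. */
void pvs_decompress(const uint8_t *in, uint8_t *out, size_t row_bytes)
{
    size_t written = 0;
    while (written < row_bytes) {
        if (*in != 0) {
            out[written++] = *in++;
        } else {
            for (size_t run = in[1]; run > 0 && written < row_bytes; run--)
                out[written++] = 0;
            in += 2; /* Skip the zero marker and its count byte. */
        }
    }
}

/* Stamp every node whose bit is set with the current tick. Traversal code
 * can then skip any node where visible_tick != tick. */
void pvs_mark_visible(const uint8_t *bits, Node *nodes, size_t node_count,
                      uint32_t tick)
{
    for (size_t i = 0; i < node_count; i++) {
        if (bits[i >> 3] & (1u << (i & 7)))
            nodes[i].visible_tick = tick;
    }
}
```

Stamping with a tick avoids having to clear a visibility flag on every node each frame.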
mrkotfw | Aug 30, 2018
Yeah, that's why I asked. Quad division is really tricky, not to mention the fact that there's no hardware UV texture support.
Okay, that makes sense. I guess that keeps you from idling on both the VDP1 and the CPU.
It's on the CPU-bus, sadly.
That's insane. Where is this negative feedback coming from? With anything, you're going to get your percentage of idiots who don't know what they're talking about. You know you've made it to the big leagues when you start getting death threats. Don't let that discourage you. Really, work on what makes you happy. Another thing is that the game looks like a vertical slice rather than a tech demo. Some people may have a hard time understanding that. If it were more of a prototype, it might help people understand that the game is of course still in progress. But then again, people are stupid.