SegaXtreme


Reading from the framebuffer?

XL2 - Aug 23, 2018

		XL2	Aug 23, 2018
		Here is one idea I have to reduce the overdraw in my game: since I'm using RGB codes for objects close to the camera and palette codes for objects further away, I thought about simply using the framebuffer as a pseudo z-buffer. Since I know pixels starting with a 1 (RGB code) are closer to the camera than pixels starting with a 0 (palette code), I could test palette objects against the previous frame to reject them (if the area is entirely covered by 1s, the object is occluded and doesn't need to be rendered). I'm not sure I could do it quickly enough to make it worthwhile, since it would require perspective divisions for the whole mesh, but I could also run the test on the buffered sprite commands and insert a skip command when a quad fails the test. Either way, it's worth testing.
		But here is the main issue: reading from the framebuffer is really slow in the little tests I made. Of course, I don't transfer the whole buffer; I only tested with a smaller buffer (64x32 or so, skipping pixels). Is it possible to do it indirectly (like an SCU DMA transfer) so that the CPU can continue doing other things? What would be the best moment to do it? V-blank in? And since I'm reading that buffer anyway, if I wanted to transfer it to a sprite or scroll layer, what would be the best time to do that?
		SGL has a function to get the framebuffer, which is what I'm using, but there is very little detail on how to use it properly or what it does internally, so I'm not even sure I'm doing it right. Right now it's so slow that I would be better off just creating my own z-buffer. Thanks!

		mrkotfw	Aug 23, 2018
		What is the name of the function to read from the framebuffer? This would be a per-pixel test? Do you control the swapping of the framebuffers yourself? I'm not really sure I understand the idea. What about instead of having a pseudo Z-buffer, use LoD to at least generate billboards, or non-textured versions of the objects/mesh?

XL2

Aug 23, 2018

	mrkotfw said:

I'm already using LOD, reducing the texture quality and geometry, but it's still not enough as there is a lot of overdraw.
Some pixels get overwritten 5 times or more.
It works OK in single player, but when I add Gouraud shading (on the high quality model only) it becomes a bit too much and the framerate drops often.
The function is slGetFrame or something like that.

The idea is that my lod uses palette codes, so the pixels start with a 0, while objects closer to the camera are rgb codes, so the pixels start with a 1 at their msb.
So I could just test the image space data against the lower res buffer from last frame.
If it has a 1 and the quad sees only 1s, that means last frame there was an object closer over there, so I can discard that quad. Since the lod quads are large, it might reduce the artifacts caused by this technique, and of course as you drop the framerate it becomes less reliable.
That buffer would be low res, so I guess it could be cache friendly and allow quick tests.
It's not perfect, the pvs would still be my main solution, but the pvs won't work well in some situations and I will need something else.
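
A minimal sketch of the test described above, assuming the low-res copy of the previous frame is already in work RAM as 16-bit pixels and that the quad's screen-space bounding box has been scaled to buffer coordinates. The function name and parameters are hypothetical and not part of SGL:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when every pixel inside the (clamped)
 * rectangle has its MSB set, i.e. was drawn with an RGB code last frame.
 * buf is the low-res copy of the previous frame, buf_w x buf_h pixels. */
int rect_is_occluded(const uint16_t *buf, int buf_w, int buf_h,
                     int x0, int y0, int x1, int y1)
{
    /* Clamp the box to the buffer. A box that ends up empty is reported
     * as "not occluded"; frustum culling is expected to handle it. */
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 >= buf_w) x1 = buf_w - 1;
    if (y1 >= buf_h) y1 = buf_h - 1;
    if (x0 > x1 || y0 > y1) return 0;

    for (int y = y0; y <= y1; y++) {
        const uint16_t *row = buf + y * buf_w;
        for (int x = x0; x <= x1; x++) {
            if ((row[x] & 0x8000u) == 0)
                return 0;   /* palette-code or empty pixel: quad may be visible */
        }
    }
    return 1;               /* only RGB-code pixels: something closer covered it */
}
```

Growing the box by a pixel or two before calling it makes the test more conservative, at the cost of rejecting fewer quads.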

		antime	Aug 23, 2018
		SCU DMA can read from the VDP1 back buffer. Since bandwidth isn't free, it'll cost some performance. EDIT: Maybe it would be possible to use the SCU DSP to generate a subsampled version of the backbuffer? There's also still a lot of ca. mid- to late 90s material on occlusion culling available, that may fit the Saturn's limits better than a lot of the material that's presented today.

		Ponut	Aug 24, 2018
		So I have a general lack of experience in programming, but I would say a few things that might be helpful:
		1. Occlusion planes rather than "Z-sorting" would be more performant. My only thought of how that would be done is a distance test on the center of objects (meshes) and an intersection test with the occlusion plane (which would actually be a 2D line). With that you could determine whether the object is past the occlusion plane (in absolute distance) and whether it is actually intersecting the plane. A solution that can be done entirely on the CPU. But this won't work for entire level meshes, only smaller game objects.
		2. Besides that, the frame-buffers are something like 256 KB? Or maybe even 512 KB? In that case, that is too much to go through in a single frame. It might not even work because it hits the next copy before the first is finished. Not good! Instead, you would set up something that copies 1/10th of the frame-buffer at a time and performs your sorting on that. Not ideal, but it should work. My idea is like this:

Code:

    static Uint8 copytimer = 0;   // which tenth of the frame-buffer to copy this frame

    if (copytimer >= 10) {        // ten slices, numbered 0 to 9
        copytimer = 0;
    }
    slDMACopy(framebuffer + (copytimer * 26215), workarea, 26215);
    ztSortPolygons(framebuffer + (copytimer * 26215)); // Assume this does its work on 26215 bytes of buffer at a time
    copytimer++;

XL2

Aug 24, 2018

	antime said:

Thanks both of you.
Would it be possible to use SCU DMA to retrieve only the MSB of each pixel and pack these bits into bytes?
Like a 512x256 buffer would only require 16 KB? And even less if you skip pixels and don't retrieve the whole thing?
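
The 16 KB figure follows from one bit per pixel: 512 x 256 / 8 = 16384 bytes. As a rough sketch of the packing itself, done here on the CPU once the 16-bit pixels have been fetched (whether SCU DMA can do any of this is exactly the open question). Function names are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Pack the MSB of each 16-bit pixel into a bitmask, 8 pixels per byte,
 * leftmost pixel in the highest bit. count must be a multiple of 8. */
void pack_msb(const uint16_t *pixels, size_t count, uint8_t *mask)
{
    for (size_t i = 0; i < count; i += 8) {
        uint8_t bits = 0;
        for (int b = 0; b < 8; b++)
            bits = (uint8_t)((bits << 1) | (pixels[i + b] >> 15));
        mask[i / 8] = bits;
    }
}

/* Size in bytes of the packed mask for a w x h buffer. */
size_t mask_size(size_t w, size_t h)
{
    return (w * h) / 8;
}
```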

I have been scratching my head for months to solve the occlusion problem and find something that fits both Sonic X-Treme's totally inconsistent maps and something like Quake.
A pvs is nice, but it's only as accurate as your map subdivision, which means you need a lot of memory and spend more time searching the nodes and doing frustum culling.
A bsp with pvs or portals is nice too, but it doesn't work well for open world maps and since the Saturn has no texture coordinates you end up with really weird walls and floor (like the Slavedriver games) and lots of vertices and quads to deal with.
A portal system requires a lot of manual work and is only good for interior maps.

I also thought of placing some occlusion walls manually, or even doing it automatically, but that can quickly kill performance, so you are restricted to only a few walls (unlike a portal system) and it requires more manual work than automated techniques. I will consider your idea though, Ponut.
A pvs + "depth" buffer seems like a good fit since the game would generate its own (limited) occlusion map from last frame, but it's by no means simple or super fast.
Any better ideas?

		mrkotfw	Aug 24, 2018
		In terms of fetching the MSB, I believe that you can use the update features of the SCU DMA to fetch a byte then skip a byte. With the SCU DSP, it would mean multiple transfers. From the back buffer to HWRAM to the DSP data banks.

XL2

Aug 24, 2018

	mrkotfw said:

I guess the SGL function is doing something like that, since you can get a lower-res version of the buffer (like 64x32), but again there is little to no detail on these functions.
I'm not even sure if it's the SH2 reading the buffer or an SCU indirect transfer.
Either way, when should I call it?
During v-blank in?

I tried the PVS technique. With all the raytracing it takes hours to build a map, since they aren't corridor maps and it needs to take into account the many areas the camera can go to.
I could probably speed it up, but it's a bit insane!

mrkotfw

Aug 25, 2018

Here is what slGetFrameData does. I haven't verified, but I see no calls to SCU DMA.

Code:

    sglC24.o:     file format coff-sh

    Disassembly of section SLPROG:

    00000000 <_slGetFrameData>:
       0: 2f 86    mov.l r8,@-r15
       2: c4 b0    mov.b @(176,gbr),r0
       4: 2f b6    mov.l r11,@-r15
       6: 2f a6    mov.l r10,@-r15
       8: 2f 96    mov.l r9,@-r15
       a: 63 03    mov r0,r3
       c: c6 20    mov.l @(128,gbr),r0
       e: eb ff    mov #-1,r11
      10: 4b 18    shll8 r11
      12: 68 09    swap.w r0,r8
      14: 48 28    shll16 r8
      16: 1b 50    mov.l r5,@(0,r11)
      18: e1 00    mov #0,r1
      1a: 1b 14    mov.l r1,@(16,r11)
      1c: 1b 85    mov.l r8,@(20,r11)
      1e: 40 28    shll16 r0
      20: e2 10    mov #16,r2
      22: 23 28    tst r2,r3
      24: 89 00    bt 28 <gtfd_00>
      26: 40 01    shlr r0

    00000028 <gtfd_00>:
      28: 58 b7    mov.l @(28,r11),r8
      2a: 1b 60    mov.l r6,@(0,r11)
      2c: 1b 14    mov.l r1,@(16,r11)
      2e: 1b 05    mov.l r0,@(20,r11)
      30: da 38    mov.l 114 <IMM_FrameBuffer>,r10 ! 25c80000
      32: d2 37    mov.l 110 <IMM_SPR_EDSR>,r2 ! 25d00010
      34: 67 83    mov r8,r7
      36: 47 01    shlr r7
      38: 59 b7    mov.l @(28,r11),r9

    0000003a <gtfd_10>:
      3a: 60 21    mov.w @r2,r0

    0000003c <gtfd_11>:
      3c: c8 02    tst #2,r0
      3e: 8f 0a    bf.s 56 <gtfd_20>
      40: e1 7f    mov #127,r1

    00000042 <gtfd_12>:
      42: 41 10    dt r1
      44: 8b fd    bf 42 <gtfd_12>
      46: c4 13    mov.b @(19,gbr),r0
      48: e1 80    mov #-128,r1
      4a: 23 18    tst r1,r3
      4c: 8f f5    bf.s 3a <gtfd_10>
      4e: 40 11    cmp/pz r0
      50: 8b 32    bf b8 <gtfd_99>
      52: af f3    bra 3c <gtfd_11>
      54: 60 21    mov.w @r2,r0

    00000056 <gtfd_20>:
      56: 6b 93    mov r9,r11
      58: 4b 01    shlr r11
      5a: e0 08    mov #8,r0
      5c: 23 08    tst r0,r3
      5e: 8f 30    bf.s c2 <gtfd_30>
      60: 45 09    shlr2 r5

    00000062 <gtfd_21>:
      62: 61 b3    mov r11,r1
      64: 41 29    shlr16 r1
      66: 41 18    shll8 r1
      68: 41 08    shll2 r1
      6a: 31 ac    add r10,r1
      6c: 62 73    mov r7,r2
      6e: 63 53    mov r5,r3
      70: 71 0a    add #10,r1

    00000072 <gtfd_22>:
      72: 60 23    mov r2,r0
      74: 40 29    shlr16 r0
      76: 40 00    shll r0
      78: 00 1d    mov.w @(r0,r1),r0
      7a: 32 8c    add r8,r2
      7c: 81 40    mov.w r0,@(0,r4)
      7e: 60 23    mov r2,r0
      80: 40 29    shlr16 r0
      82: 40 00    shll r0
      84: 00 1d    mov.w @(r0,r1),r0
      86: 32 8c    add r8,r2
      88: 81 41    mov.w r0,@(2,r4)
      8a: 60 23    mov r2,r0
      8c: 40 29    shlr16 r0
      8e: 40 00    shll r0
      90: 00 1d    mov.w @(r0,r1),r0
      92: 32 8c    add r8,r2
      94: 81 42    mov.w r0,@(4,r4)
      96: 60 23    mov r2,r0
      98: 40 29    shlr16 r0
      9a: 40 00    shll r0
      9c: 00 1d    mov.w @(r0,r1),r0
      9e: 32 8c    add r8,r2
      a0: 81 43    mov.w r0,@(6,r4)
      a2: 43 10    dt r3
      a4: 8f e5    bf.s 72 <gtfd_22>
      a6: 74 08    add #8,r4
      a8: 46 10    dt r6
      aa: 8f da    bf.s 62 <gtfd_21>
      ac: 3b 9c    add r9,r11
      ae: 69 f6    mov.l @r15+,r9
      b0: 6a f6    mov.l @r15+,r10
      b2: 6b f6    mov.l @r15+,r11
      b4: 00 0b    rts
      b6: 68 f6    mov.l @r15+,r8

    000000b8 <gtfd_99>:
      b8: 69 f6    mov.l @r15+,r9
      ba: 6a f6    mov.l @r15+,r10
      bc: 6b f6    mov.l @r15+,r11
      be: 00 0b    rts
      c0: 68 f6    mov.l @r15+,r8

    000000c2 <gtfd_30>:
      c2: 61 b3    mov r11,r1
      c4: 41 29    shlr16 r1
      c6: 41 18    shll8 r1
      c8: 41 08    shll2 r1
      ca: 31 ac    add r10,r1
      cc: 62 73    mov r7,r2
      ce: 63 53    mov r5,r3

    000000d0 <gtfd_32>:
      d0: 60 23    mov r2,r0
      d2: 40 29    shlr16 r0
      d4: 00 1c    mov.b @(r0,r1),r0
      d6: 32 8c    add r8,r2
      d8: 80 40    mov.b r0,@(0,r4)
      da: 60 23    mov r2,r0
      dc: 40 29    shlr16 r0
      de: 00 1c    mov.b @(r0,r1),r0
      e0: 32 8c    add r8,r2
      e2: 80 41    mov.b r0,@(1,r4)
      e4: 60 23    mov r2,r0
      e6: 40 29    shlr16 r0
      e8: 00 1c    mov.b @(r0,r1),r0
      ea: 32 8c    add r8,r2
      ec: 80 42    mov.b r0,@(2,r4)
      ee: 60 23    mov r2,r0
      f0: 40 29    shlr16 r0
      f2: 00 1c    mov.b @(r0,r1),r0
      f4: 32 8c    add r8,r2
      f6: 80 43    mov.b r0,@(3,r4)
      f8: 43 10    dt r3
      fa: 8f e9    bf.s d0 <gtfd_32>
      fc: 74 04    add #4,r4
      fe: 46 10    dt r6
     100: 8f df    bf.s c2 <gtfd_30>
     102: 3b 9c    add r9,r11
     104: 69 f6    mov.l @r15+,r9
     106: 6a f6    mov.l @r15+,r10
     108: 6b f6    mov.l @r15+,r11
     10a: 00 0b    rts
     10c: 68 f6    mov.l @r15+,r8
        ...

    00000110 <IMM_SPR_EDSR>:
     110: 25 d0    mov.b r13,@r5
     112: 00 10    .word 0x0010

    00000114 <IMM_FrameBuffer>:
     114: 25 c8    tst r12,r5
        ...
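
Read loosely, the inner loops in the listing step through the frame buffer with what look like 16.16 fixed-point accumulators (`shlr16` to get the integer part, `add r8,r2` to advance) and copy one pixel at a time with plain `mov.w` / `mov.b` loads, which would be consistent with a CPU-driven subsampled copy. The following is only a sketch of that general technique under that reading, not a reconstruction of the SGL routine. The function name, parameters and the half-step starting offset are assumptions:

```c
#include <stdint.h>

/* Point-sample a src_w x src_h image of 16-bit pixels (row pitch given in
 * pixels) down to dst_w x dst_h, using 16.16 fixed-point steps and
 * starting half a step in so samples fall near the middle of each cell. */
void subsample16(const uint16_t *src, int src_w, int src_h, int pitch,
                 uint16_t *dst, int dst_w, int dst_h)
{
    uint32_t step_x = ((uint32_t)src_w << 16) / (uint32_t)dst_w;
    uint32_t step_y = ((uint32_t)src_h << 16) / (uint32_t)dst_h;
    uint32_t fy = step_y >> 1;

    for (int y = 0; y < dst_h; y++, fy += step_y) {
        const uint16_t *row = src + (fy >> 16) * (uint32_t)pitch;
        uint32_t fx = step_x >> 1;
        for (int x = 0; x < dst_w; x++, fx += step_x)
            *dst++ = row[fx >> 16];   /* one CPU read per output pixel */
    }
}
```

Every output pixel still costs one uncached read from VDP1 memory, so shrinking the destination size is the main lever on speed with this approach.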

		antime	Aug 25, 2018
		If the function used DMA, it would almost certainly be documented, to prevent conflicts. It doesn't look like you can copy bytes, the read address increment options are 0 and 4, and the write address increment options do not include one byte.

		XL2	Aug 25, 2018
		Wow, thanks a lot, amazing! How did you disassemble the function? I guess I should really try to learn assembly... AFAIK, with SGL, SCU DMA channel 0 is free for the user while the rest are used by SGL. I guess I could just let the slave do it at the start of the game loop while the main CPU is preparing the frustum and other stuff? Even if it doesn't work well for occlusion, sending the framebuffer to a sprite is also a nice effect, so nothing would be lost. Edit: I did manage to increase the speed of the PVS building quite a bit, so it might be a viable solution with RLE compression.

antime

Aug 25, 2018

	XL2 said:

Objdump... can disassemble object files. Other useful binutils tools include ar... and nm....

mrkotfw

Aug 25, 2018

On MinGW/Cygwin/Unix:

Code:

    mkdir libsgl
    cd libsgl
    # "x" extracts the archive members; the path to libsgl.a is an assumption, adjust as needed
    sh-elf-ar x ../libsgl.a
    for obj in *.o; do
        sh-elf-objdump -d "${obj}" > "${obj%%.o}.s"
    done

I've attached a .zip file for you that includes the source.

		XL2	Aug 26, 2018
		Thanks a lot. I guess either using the slave to do it or using an SCU DSP transfer would be my best options. I will be taking a look at these functions later this week. Thanks again.

		XL2	Aug 27, 2018
		I think that will give you a better idea of what I'm thinking of doing, and it was very easy to implement (but it's still a bit slow). It would complement the PVS, and hopefully I will find a way to subdivide the quads close to the camera to prevent such bad clipping. Anyway: the weird colors are color bank pixels whose MSB I flipped so they can be displayed using a 16-bit sprite. These would be the occludees, while the correctly colored quads would be the occluders. So these huge objects blocking the camera and both sides of a node could at least block some extra geometry. This buffer is currently 88x56, which seems like it could work if the algorithm is conservative (for example, checking a bit beyond the quads' boundaries to avoid rejecting too much), but making it fast is a whole other thing and I'm not sure it can be done, but whatever.
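
For reference, the MSB flip used for that debug view can be sketched as below, assuming the low-res copy is an array of 16-bit pixels in work RAM. The function name is hypothetical, and returning the count of flipped pixels is an addition for illustration only:

```c
#include <stdint.h>
#include <stddef.h>

/* Set the MSB on every palette-code pixel so a 16-bit RGB sprite will
 * display it (with "wrong" colours). Returns how many pixels were
 * flipped, i.e. how many occludee pixels the buffer contained. */
size_t flip_palette_msb(uint16_t *pixels, size_t count)
{
    size_t flipped = 0;
    for (size_t i = 0; i < count; i++) {
        if ((pixels[i] & 0x8000u) == 0) {
            pixels[i] |= 0x8000u;
            flipped++;
        }
    }
    return flipped;
}
```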

		mrkotfw	Aug 29, 2018
		Thanks, that gives me a better idea. There's a few things here...
		- Quad subdivision. I'm curious to know what algorithms are available for subdividing quads.
		- You're positive that your bottleneck (currently) is the VDP1 and not something else?
		- I know this is outside the realm of your original question, but what about the way command lists are being passed to the VDP1? Is it that SGL processes a large command list, then triggers the VDP1 to draw, or does it keep the VDP1 fed as much as possible while processing other command lists? As in, process a small batch, have the VDP1 render, and in parallel, process the next batch?
		- Are there other areas to improve on performance? Have you timed your code with the CPU FRT? Have you timed how long it takes to render?
		The framebuffer idea seems wild. With the DSP, you have 4 data banks of 64 32-bit words (256 bytes) each. The smallest access is 4 bytes. You have the ability to DMA straight into the DSP's 4 data banks (and 1 program bank). Though, I've tried to DMA from LWRAM and the Saturn would lock up, so I'm not sure if you'd be able to DMA straight from the B-bus to the DSP data banks. I believe it can do 4 loads in parallel, though some go into non-general purpose registers (A, X, Y, etc.). I just don't know how you would use the DSP for this purpose.
		Then there's the slave CPU. You DMA from the VDP1 FB to HWRAM. Then you have to keep the slave off the CPU bus, so you manually copy chunks of the DMA'd FB into the slave's split cache. You have about 2 KiB there.
		I'm just throwing ideas out there. I don't know if you've done this already, but getting some way to objectively profile the game would be a really good step to take soon.
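
As a small illustration of the timing suggestion: the SH-2 free-running timer is a 16-bit counter, so elapsed time is the wrapped difference of two counter reads, scaled by the clock divider. The helpers below are a sketch with hypothetical names. Reading the counter itself is hardware-specific and left out, and the CPU clock and divider are parameters the caller must supply from the hardware documentation:

```c
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 16-bit counter.
 * Unsigned 16-bit subtraction stays correct across one wraparound. */
uint16_t ticks_elapsed(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Convert counter ticks to microseconds, given the CPU clock in Hz and
 * the counter's clock divider (both supplied by the caller). */
uint32_t ticks_to_us(uint32_t ticks, uint32_t cpu_hz, uint32_t divider)
{
    return (uint32_t)(((uint64_t)ticks * divider * 1000000u) / cpu_hz);
}
```

A section that runs longer than 65536 ticks wraps more than once, so long measurements need a larger divider or an overflow counter.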

XL2

Aug 30, 2018

	mrkotfw said:

Quad subdivision is tricky, unless of course you just store different textures and polygons/vertices. That would be the fastest way for sure, but it takes way too much memory, both RAM and VRAM. It's easy to subdivide a sprite along its height (it's what I do for the water animation, I just "scroll" the starting address). I guess doing something like Quake 2 on PS1 could be one workaround, but it's annoying to always encounter loading screens. Creating new vertices and polygons in realtime could also be done, and is what the PS1 does with its SDK afaik, even if it's slower than just storing them in RAM, but I'm not sure how to subdivide a sprite horizontally in VRAM.
If you change the width, you will just end up with a sprite that will just alternate lines with the other horizontal part, and changing the pointers will lead to the same problem.

As for performance, my CPU code can certainly be improved a lot, I'm not doubting that. How SGL works with the draw commands is that it stores everything in a few buffers: vertex buffer, polygon buffer, z-sort buffer and draw command buffer. When you sync, it just DMAs everything in one batch to VRAM and the CPU moves on.

Afaik, since LWRAM is on the A-bus, I guess you can't directly DMA to the DSP, but I could be wrong.

As for how I know it's the VDP1 that is the bottleneck, I simply have different debug modes : untextured polygons only, wireframe only, gouraud shaded textured polygons, etc.
The Gouraud shaded polygons lead to many slowdowns, which don't happen in the other modes.
Of course, with better CPU optimizations, I could probably do more, but at the same time, all this overdraw is also increasing the CPU load (more vertices and polygons to process for nothing).

But anyway, since the reaction so far with the Sage demo has been negative overall (some people even complain that I'm using 3D models instead of sprites!) and most people just try the demo on their slow PCs with emulators and don't even bother plugging in a controller and then complain online that it doesn't control well or that it slows down, I'll just stop wasting time on Sonic Z-Treme and move on to the FPS game. Which means that a simple portal system could be implemented, so I don't need to overthink for an all-around solution.

That solves many issues and involves less work in the end since I don't need to try to mimic a game not even built for the Saturn, but I might still play with the framebuffer to add some cool effects.

Ponut

Aug 30, 2018

	XL2 said:

I know it's off-topic, but wow. I would say they have high standards but maybe in that case I am confusing high with low.
(I would say something about performance but I have an i7 4770K @ 4.5 GHz..)

As far as performance goes, I know I don't have much to add. Have you explored the option of only partially calculating the occlusion/PVS each frame?
(The idea being the occlusion is a "buffer" of occluded polygons that is filled partially each frame)

XL2

Aug 30, 2018

	Ponut said:

With a portal system I could easily precalculate the PVS, so at runtime all you need to do is uncompress the PVS for your current node/leaf, flag the visible nodes with the current tick count and then run your BSP/octree normally, skipping nodes that aren't potentially visible. It speeds up the CPU calculations quite a lot and it partially solves the occlusion problem. You can also do like the Slavedriver engine and add user clipping draw commands to prevent even more overdraw, but you would need to clip these against the portals, so I'm not 100% sure it's worth the extra CPU load. SGL has a sorting option where it draws all polygons in front of the previous polygons within the same pdata, so you could always include a user clip command first for each plane and use the "sort before" option for all the following polygons, which minimizes the overdraw as much as you can on Saturn.
Anyway, I will take my time on this to properly write a bsp compiler and portal generator, so it might take a few months.
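
A minimal sketch of the runtime side described above, assuming a Quake-style zero-run encoding for the compressed PVS (a non-zero byte is copied as-is, a zero byte is followed by a run length) and one bit per node. The encoding, function names and the `node_tick` array are assumptions for illustration, not the format actually used in the game:

```c
#include <stdint.h>
#include <stddef.h>

/* Decompress a zero-run-encoded visibility row into out (out_bytes long).
 * Returns the number of input bytes consumed. */
size_t pvs_decompress(const uint8_t *in, uint8_t *out, size_t out_bytes)
{
    const uint8_t *start = in;
    size_t o = 0;

    while (o < out_bytes) {
        if (*in != 0) {
            out[o++] = *in++;          /* literal byte: 8 visibility bits */
        } else {
            int run = in[1];           /* zero byte followed by run length */
            in += 2;
            if (run == 0)
                break;                 /* malformed input, stop early */
            while (run-- > 0 && o < out_bytes)
                out[o++] = 0;
        }
    }
    return (size_t)(in - start);
}

/* Stamp every potentially visible node with the current tick. During
 * traversal, a node whose stamp differs from the tick is skipped. */
void pvs_mark_visible(const uint8_t *bits, uint32_t *node_tick,
                      size_t node_count, uint32_t tick)
{
    for (size_t i = 0; i < node_count; i++) {
        if (bits[i >> 3] & (1u << (i & 7)))
            node_tick[i] = tick;
    }
}
```

Stamping with the tick avoids clearing a "visible" flag on every node each frame; only the nodes named in the PVS row are touched.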

mrkotfw

Aug 30, 2018

	XL2 said:

Yeah, that's why I asked. Quad division is really tricky, not including the fact that there's no hardware UV texture support.

	XL2 said:

Okay, that makes sense. I guess that keeps you from idling both on the VDP1 and CPU.

	XL2 said:

It's on the CPU-bus, sadly.

	XL2 said:

That's insane. Where is this negative feedback coming from? With anything, you're going to get your percentage of idiots who don't know what they're talking about. You know you've made it to the big leagues when you start getting death threats. Don't let that discourage you. Really, work on what makes you happy.

Another thing is that the game looks like a vertical slice rather than a tech demo. Some people may have a hard time understanding that. If it was more of a prototype, it might allow people to have a better understanding that the game is of course still in progress. But then again, people are stupid.

	XL2 said: