I hope I've not misunderstood, but bPPU::build_sprite_list(), which converts the sprites from their OAM representation, probably takes quite a bit of time.
It takes about 8% of emulation time in total. I optimized it to insane degrees with the Mednafen author, and it didn't help performance any, but I was still computing it once every scanline.
You're right that a $2104 dirty flag would be a good idea in this case. And to that aim, probably for window mask tables as well. I'll give it a try, thank you.
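A minimal sketch of the dirty-flag idea, just to make it concrete. SpriteCache, write(), prepare_scanline() and the rebuilds counter are names made up for illustration, not bsnes's actual members, and the body of build_sprite_list() is left as a placeholder. It also assumes OAM contents are the only input to the list; anything else the list depends on would have to set the flag too.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the PPU's sprite state (not bsnes's real layout).
struct SpriteCache {
    std::array<uint8_t, 544> oam{};  // 512-byte low table + 32-byte high table
    bool dirty = true;               // start dirty so the first scanline builds the list
    unsigned rebuilds = 0;           // only here to make the saving visible

    // Called for every write that lands in OAM (e.g. through $2104).
    void write(unsigned addr, uint8_t data) {
        addr %= oam.size();
        if (oam[addr] != data) {     // rewriting the same value needs no rebuild
            oam[addr] = data;
            dirty = true;
        }
    }

    // Called once per scanline, before sprites are rendered.
    void prepare_scanline() {
        if (!dirty) return;          // OAM untouched: reuse the previous list
        build_sprite_list();
        dirty = false;
    }

    // Placeholder: the real OAM-to-sprite conversion would go here.
    void build_sprite_list() { ++rebuilds; }
};
```

With something like this, a frame where the game only touches OAM during vblank pays for one conversion instead of one per scanline.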
Then for the up to 2x4 background layers it renders them from back to front into the buffer.
Isn't it expensive to run over the entire scanline twice for each background, and four times for each OAM priority? I figured that would be worse than an if(z > pixel[x].z) { pixel[x].pal = p; pixel[x].z = z; } only when writing.
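To spell out the compare-on-write approach from the line above, here is a rough sketch. Pixel, Scanline and plot() are illustrative names, and treating palette index 0 as transparent is a simplifying assumption, not a statement about how either emulator handles it.

```cpp
#include <array>
#include <cstdint>

struct Pixel {
    uint8_t pal = 0;  // palette index of the current winner
    uint8_t z   = 0;  // priority of the current winner; 0 means nothing drawn yet
};

using Scanline = std::array<Pixel, 256>;

// Write a candidate pixel only if it beats what is already in the buffer.
// Assumption: p == 0 marks a transparent pixel, which never wins.
inline void plot(Scanline& line, unsigned x, uint8_t p, uint8_t z) {
    if (p == 0) return;
    if (z > line[x].z) {
        line[x].pal = p;
        line[x].z = z;
    }
}
```

Because the test travels with every write, the layers can be rendered in any order and each one only has to cross the scanline once.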
Extremely little is tested per pixel, with as much as possible calculated in advance and done in large groups.
Definitely not going for a cycle-renderer, I see ;)
This is how I managed to get screen generation down to 1 to 2 milliseconds per frame.
That's really exceptional, speed wise. I hope you can find a solution with your approach.
But I don't know how I'm going to do windows, offset per tile mode, mosaic, sub-screens, etc.
Little by little. I'm not one to talk about speed, but for me it worked a hell of a lot better to not focus on speed at all on the initial implementation, and then go back and add things like the tiledata decode cache, tilemap cache fetching, RTO cache fetching, etc.
Trust me, those speed hacks based on faulty assumptions will bite you in the ass when you go back to add things like mosaic and RTO.
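As an example of the kind of cache that is safe to bolt on afterwards, here is a minimal sketch of a tiledata decode cache. It only handles 2bpp tiles (16 bytes each, two interleaved bitplane bytes per row, bit 7 leftmost), and TileCache2bpp and its members are hypothetical names for illustration, not bsnes's actual code.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr unsigned kTileCount = 0x10000 / 16;  // 64KB of VRAM, 16 bytes per 2bpp tile

struct TileCache2bpp {
    std::vector<uint8_t> vram = std::vector<uint8_t>(0x10000);
    std::vector<std::array<uint8_t, 64>> pixels =
        std::vector<std::array<uint8_t, 64>>(kTileCount);
    std::vector<bool> valid = std::vector<bool>(kTileCount, false);

    // Every VRAM write invalidates the one tile it touches.
    void write(unsigned addr, uint8_t data) {
        addr &= 0xffff;
        vram[addr] = data;
        valid[addr / 16] = false;
    }

    // Return the decoded 8x8 tile, decoding only if it changed since last use.
    const std::array<uint8_t, 64>& tile(unsigned n) {
        n %= kTileCount;
        if (!valid[n]) {
            decode(n);
            valid[n] = true;
        }
        return pixels[n];
    }

    // Planar to chunky: one 0..3 color value per pixel.
    void decode(unsigned n) {
        const uint8_t* src = &vram[n * 16];
        for (unsigned y = 0; y < 8; ++y) {
            uint8_t p0 = src[y * 2];      // bitplane 0 for this row
            uint8_t p1 = src[y * 2 + 1];  // bitplane 1 for this row
            for (unsigned x = 0; x < 8; ++x) {
                unsigned bit = 7 - x;     // bit 7 is the leftmost pixel
                pixels[n][y * 8 + x] = static_cast<uint8_t>(
                    ((p0 >> bit) & 1) | (((p1 >> bit) & 1) << 1));
            }
        }
    }
};
```

The point is that the cache makes no assumptions about what games do: it only skips work when the underlying VRAM bytes are provably unchanged.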
I almost wanted to delete the source because it made me so angry, discovering more and more and more impossibly complicated features on a 19-year-old chip. It's why I gave up before, and I still have no inspiration.
I know the feeling. About the only advice I have is that the more difficult things get, the more it makes figuring it out in the end that much more rewarding. This is probably the wrong thing to say, but I've found the S-CPU to be just as much of a nightmare when you're trying to get 100% compatibility. The closer you get to perfection, the more random games break. And not the good games you love, but the absolute shit games like Jumbo Ozaki no Hole in One, Super Power Rangers Battle Whateverthefuck, Toy Story, Bugs Bunny, etc. The PPU is extremely forgiving in regards to timing compared to the other chips.
Of course, if you're happy with 99% compatibility sans those special chips, and maybe a few small per-game tweaks for 100%, you're really most of the way there already :)
I do hope that you don't give up, but I won't be selfish enough to say to work on it no matter what. If you don't have the motivation, that's a real problem. It may be best to put it aside for a little while and see how you feel then. Just don't be rash and delete the source. And keep a backup somewhere just in case ;)
I also still don't get the sprites. I've tried that before. I've read Anomie's explanation a hundred times.
I still don't get his FirstSprite+Y notes, but I've never seen a game use it, and in fact even my tests to trigger it haven't worked. I always see the same results in my emulator.
To me the hardest part to understand was the way the final pixel is generated once you know what each BG and OAM pixel should be. The main+sub, window+color-window, priorities, color math both on fixed colors and other layers, special rules for the background, etc etc were so complicated that I literally only solved it by trial and error over several weeks.
It's really tough to explain this stuff in layman's terms. There are so many millions of caveats and edge cases that you can't really explain anything 100% without spanning several pages or relying on the readers' knowledge :/
Which leads us to the problem with source code comments:
Yeah, it could've used a comment or two - but most of the functions require "inside knowledge" like that.
I can never think of a comment that won't be either so terse as to be pointless, or so verbose that the comments will be 5-10x larger than the code, making the functions as a whole very difficult to read.