Latch timing

byuu

Post by byuu »

The crystal is 315*6/88 MHz (~21.477 MHz). One SNES CPU cycle is either 6 (~3.58 MHz), 8 (~2.68 MHz), or 12 (~1.79 MHz) master clock cycles.
The NTSC color subcarrier is 315/88 MHz, so crystals that are multiples of it are abundant. That is likely why the SNES uses one, not to mention it helps with NTSC video timing.
I have no clue whatsoever how PAL SNES units work.
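For reference, the arithmetic behind those numbers works out like this (nothing new, just the values above spelled out):

Code: Select all

#include <stdio.h>

/* The master crystal is the NTSC color subcarrier (315/88 MHz) times 6. */
int main(void) {
    double master = 315.0 / 88.0 * 6.0;                  /* ~21.477 MHz */
    printf("master clock  : %.6f MHz\n", master);
    printf("6-cycle rate  : %.3f MHz\n", master / 6.0);  /* ~3.58 MHz   */
    printf("8-cycle rate  : %.3f MHz\n", master / 8.0);  /* ~2.68 MHz   */
    printf("12-cycle rate : %.3f MHz\n", master / 12.0); /* ~1.79 MHz   */
    return 0;
}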
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

byuusan wrote:I don't know how the values change exactly when you perform a DMA/HDMA transfer.
I've tested this now.

DMA: $43x2-3 are incremented and $43x5-6 are decremented to 0. Nothing else changes.

Direct HDMA: On init, first copy $43x2-3 into $43x8-9 (along with $43x4, this points to the current table index). Then, load one byte from the table into $43xA, and increment $43x8-9. I'm not sure if a 0 causes the HDMA to halt right away, or if it waits until the first transfer (the only reason it'd matter would be the timing). On transfer, transfer the bytes if we just loaded $43xA or if we're in repeat mode (and increment $43x8-9). Then in any case, decrement $43xA, and if it's 0 load another byte from the table and increment $43x8-9 again. Nothing else is affected, notably $43x5-6.

Indirect HDMA: On init, load the indirect address from the table into $43x5-6 (along with $43x7, this points to where the data is read from). On transfer, transfer bytes and increment $43x5-6 instead of $43x8-9. But still increment $43x8-9 when loading into $43xA, and at the same time load the new indirect address into $43x5-6. And when a 0 is loaded into $43xA, we still load $43x5-6 but somewhere an increment gets lost so the 0 ends up in $43x5 and the value that "should" go into $43x5 goes into $43x6 instead (and $43x8-9 ends up pointing two bytes past the 0 instead of 3). So if your HDMA table ends with $00 $01 $02 $03, you'll get $43x5-6=$0100 instead of $0201 and $43x8-9 will point to the $02 instead of the $03.
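If it's easier to read as code, the direct-HDMA bookkeeping above comes out roughly like this (a sketch only: the struct fields and helper functions are hypothetical stand-ins for the $43xN registers, timing isn't modeled, and indirect mode with its lost-increment quirk is left out):

Code: Select all

#include <stdint.h>

/* Hypothetical per-channel state, naming the registers described above:
   $43x2-3 = a_addr, $43x4 = a_bank, $43x5-6 = das (byte count / indirect
   address), $43x8-9 = haddr (table pointer), $43xA = line (counter).    */
struct dma_ch { uint16_t a_addr, das, haddr; uint8_t a_bank, line; };

extern uint8_t read_byte(uint8_t bank, uint16_t addr);     /* hypothetical */
extern void    do_transfer(struct dma_ch *c, uint16_t *p); /* hypothetical:
                  moves one transfer unit, bumping *p as it goes           */

/* Plain DMA: only $43x2-3 and $43x5-6 change. */
void dma_byte(struct dma_ch *c) {
    do_transfer(c, &c->a_addr);   /* $43x2-3 incremented          */
    c->das--;                     /* $43x5-6 decremented toward 0 */
}

/* Direct HDMA init: copy $43x2-3 to $43x8-9, then load one table byte
   into $43xA, advancing $43x8-9. */
void hdma_init_direct(struct dma_ch *c) {
    c->haddr = c->a_addr;
    c->line  = read_byte(c->a_bank, c->haddr++);
}

/* Direct HDMA, per scanline: transfer only if $43xA was just loaded or
   we're in repeat mode (the data comes from the table, so $43x8-9
   advances); then always decrement $43xA and reload it from the table
   when it hits 0. $43x5-6 is never touched.                            */
void hdma_line_direct(struct dma_ch *c, int do_xfer) {
    if (do_xfer)
        do_transfer(c, &c->haddr);
    if (--c->line == 0)
        c->line = read_byte(c->a_bank, c->haddr++);
}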


And for some reason, I had thought that WDM (opcode $42) was a 1-byte opcode rather than 2; WLA has the same bug. I haven't had a chance to time it yet to determine whether it's two reads or a read and an I/O cycle (like NOP).
byuu

Post by byuu »

Excellent, thanks for the $43xx register info. I'm hoping by implementing that, I can fix the bug I have on the FF5 title screen (the reflection of the text isn't appearing all the way).
And for some reason, I had thought that WDM (opcode $42) was a 1-byte opcode rather than 2; WLA has the same bug. I haven't had a chance to time it yet to determine whether it's two reads or a read and an I/O cycle (like NOP).
Yeah. I have WDM as one CPU cycle plus one I/O cycle, but I haven't timed it either. My cross assembler has it wrong too, now that you mention it. I wonder if I should make WDM require/support a signature byte like BRK.
Overload
Hazed
Posts: 70
Joined: Sat Sep 18, 2004 12:47 am
Location: Australia
Contact:

Post by Overload »

byuusan wrote: We have already proven via anomie's test that the longer dot position is different on our systems. We have the same CPU/PPU1/PPU2 (3/1/2), as well. This could also be a result of the copiers we are using, though.
I haven't done any thorough testing yet, but as you would expect CPU version 1 differs from CPU version 2. I have three systems and two copiers that I am using: CPU/PPU1/PPU2 (1/1/1), (2/1/2), and PAL (2/1/3). I have noticed differences between my PAL and Japanese systems. I would guess that there are also differences between US and Japanese systems. The only noticeable difference on a PAL system is it has more scanlines. The timing tests that I run on my V2 Japanese SNES give the same results on my PAL SNES.
byuu

Post by byuu »

Oops, got my version numbers mixed up there. 2/1/3 is what I have.
The only noticeable difference on a PAL system is it has more scanlines.
If PAL has 240 scanlines, what does overscan do on PAL? I would guess it bumps it up to 265, and pushes the onscreen image down by 7 or 8 scanlines, similar to how the image is pushed down on NTSC televisions in overscan mode (thus always showing only 224/448 visible scanlines at a time)?
The timing tests that i run on my V2 Japanese SNES give the same results on my PAL SNES.
I can only guess that's because the SNES operates at some multiple of the PAL frequency, but the dot clock / CPU cycle clock is tied to the counter and not the actual length of time. I'm having a hard time explaining what I mean... the dot clock runs at multiples of 4, the CPU clock at 6, 8, and 12. So even if 12 cycles pass faster or slower on PAL, the CPU sees 12 cycles regardless of PAL/NTSC, and thus your tests match.
The things that should be tested on a PAL unit would be the weird stuff like longer dots, the missing dot on non-interlace scanline 240, stuff like that... I don't understand it, but all of that has to somehow be related to NTSC timing. I doubt any of them would exist in PAL, but PAL may have its own bizarre quirks as well.
OptiRoc
Rookie
Posts: 18
Joined: Mon Jan 10, 2005 12:31 am

Post by OptiRoc »

byuusan wrote:
The only noticeable difference on a PAL system is it has more scanlines.
If PAL has 240 scanlines, what does overscan do on PAL? I would guess it bumps it up to 265, and pushes the onscreen image down by 7 or 8 scanlines, similar to how the image is pushed down on NTSC televisions in overscan mode (thus always showing only 224/448 visible scanlines at a time)?
The available resolutions do not differ between NTSC and PAL units. Most PAL games run in the same mode as the NTSC code they were ported from, i.e. 224 lines. Actually I can't remember a single title that uses the 239-line mode, but I'm sure a couple do. In most cases it would have involved too much work to be worthwhile, both for the artists and the programmers.
byuu

Post by byuu »

Ah yes, that's right... thank you.
By the way, the SNES Test Program uses 239-line mode. But as I've noticed on my NTSC TV, it doesn't actually show all the lines; it's just that the visible image appears a little lower (hence the image itself is pushed up), although I'm sure the other scanlines are still being rendered. Hence it seems almost useless to use this mode on an NTSC TV anyway. No idea what happens in PAL.
Ok, so then PAL just has more scanlines where nothing happens, which compensates for (I'm guessing here) shorter individual scanlines, or the 50/60 Hz thing, or whatever. Fair enough. I can get a doc for the scanline counts/lengths, so there's no need to post that info here for my sake.
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

Looks like WDM is two memory access cycles, about as I expected. I just changed my standard test to insert WDMs instead of NOPs, and each gave an extra .5 dot ;)
byuusan wrote:But as I've noticed on my NTSC TV, it doesn't actually show all the lines; it's just that the visible image appears a little lower (hence the image itself is pushed up), although I'm sure the other scanlines are still being rendered.
Same here.


[later]

This evening's test: I've investigated just how the interlace mode setting affects frame length. It turns out it's impossible to get 1364*263-2 or anything weird like that; the only possibilities are the 3 we know of. The SNES uses the value of $2133 bit 1 at the last moment that $4212 bit 7 is clear to determine the interlace setting for the next frame, at least as far as timing goes. So if overscan is off, a "STA $2133" altering bit 1 that would latch [153.5,$E0] were it an LDA $2137 will change the current frame's length, while 2 master cycles later it will not have an effect on the current frame.

If $2133 is $05 (overscan + interlace) and you write $00 between lines $E0 and $EF, interlace will still be "set", because I suppose overscan turns off first. OTOH, if $2133 is $00 and you write $05 between lines $E0 and $EF, interlace will also be "set", because overscan suddenly 'clears' V-Blank. ... Overscan mode is really just weird and annoying.
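For the frame-length half of that, a minimal sketch (the three lengths are the known ones, in master cycles; which field gets the short or long frame isn't pinned down here):

Code: Select all

/* A sketch of the rule described above, using the three frame lengths we
   already know (assumed values, in master cycles):
     1364*262 - 4 : non-interlace "short" frame (the missing-dot line)
     1364*262     : normal frame
     1364*263     : interlace long frame
   'interlace' is $2133 bit 1 as sampled at the last moment that $4212
   bit 7 (the V-Blank flag) is still clear.                              */
int frame_length(int interlace, int odd_field) {
    if (interlace)
        return odd_field ? 1364 * 263 : 1364 * 262;
    return odd_field ? 1364 * 262 - 4 : 1364 * 262;
}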

I suppose it wouldn't hurt to put together a test that sets interlace at line $E0 and then clears it during V-Blank, and see if we get the (visible) interlace effect or not...
byuu

Post by byuu »

Wow, it's been 17 days now since I've done any timing tests... I hope I'm not getting burned out already, it's too soon for that.
Some DMA tests, now. The test is the same as always: wait a variable number of cycles, execute DMA, and latch. DMA does have 8 master cycles of overhead per channel, and 8 cycles per byte. And there's an overhead for the whole DMA transfer of somewhere between 12 and 24 cycles. This varies depending on just when the transfer begins within a repeating 4-step cycle (e.g. 14-20-20-14, 14-18-18-18, or 16-22-16-14; we have to guess the half-dots anyway). The 4-step pattern varies based on the number of bytes transferred and the number of channels, and on FastROM/SlowROM.
I ran some more tests myself, and this is making more sense to me now.
http://setsuna.the2d.com/files/dma_delay2.txt

You were right that there's an 8-cycle overhead per active channel; I probably hit one of the longer delays in my test, making the 8-cycle overhead for 2 channels vs. 1 channel "disappear".

As far as the overhead for the entire DMA transfer, I'm still working on that.
So far, I've tested this with SlowROM, and the 4-step delay pattern seems to be: { 24, 16, 16, 16 }
I have yet to notice any combination of bytes transferred/number of channels that changes the 4-step pattern.

For my implementation, I created a variable named dma_delay_pos. It is set to 0 upon reset (at master cycle position 188). I also have one main routine that updates the cycle counter position, in that routine I've added:
dma_delay_pos += cycles;
dma_delay_pos &= 7;
The main DMA routine then checks whether the value written to $420b is non-zero; if it is, it adds ((dma_delay_pos >> 1) == 0) ? 24 : 16 master cycles, plus 8 master cycles per active channel.
The >>1 is because the SNES can't step less than 2 master cycles at a time, so it turns the (0-7) range to (0-3), for the 4-step pattern.
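Put together, the scheme looks roughly like this (dma_delay_pos is the variable described above; everything else is hypothetical glue):

Code: Select all

#include <stdint.h>

static unsigned dma_delay_pos = 0;      /* zeroed at reset (master cycle 188) */

void step_cycles(unsigned cycles) {     /* the one routine that advances time */
    /* ... update the master cycle counter here ... */
    dma_delay_pos = (dma_delay_pos + cycles) & 7;
}

unsigned dma_init_delay(uint8_t mdmaen) {  /* the value written to $420b      */
    if (mdmaen == 0) return 0;
    unsigned delay = ((dma_delay_pos >> 1) == 0) ? 24 : 16; /* {24,16,16,16}  */
    for (int ch = 0; ch < 8; ch++)
        if (mdmaen & (1 << ch)) delay += 8;  /* 8 master cycles per channel   */
    return delay;
}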

Also, the DMA base/per-channel initialization moves the pattern-table position ahead as well, obviously. I tested this by performing multiple DMAs and then latching $2137, and my values still matched up.

With that, I was able to match all tests I tried on the SNES through emulation. Next up, I need to try changing the bytes transferred to all different combinations to make sure there aren't any combinations that change the 4-step pattern, but I suspect we'll only need two pattern tables: one for SlowROM, and one for FastROM.
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

byuusan wrote:So far, I've tested this with SlowROM, and the 4-step delay pattern seems to be: { 24, 16, 16, 16 }
Which fits; it's difficult to determine whether the latch is X or X.5 when the delay can change drastically for a .5 dot difference in starting position... I just picked { 22, 16, 14, 16 } because that also fit the data and made it average out to the same as the other two patterns.
I have yet to notice any combination of bytes transferred/number of channels that changes the 4-step pattern.
I didn't either with SlowROM. FastROM will allow you to see the two other patterns.

... I should send you the DMA & HDMA test ROM I put together recently. It doesn't do any of this stuff, but it'll test whether you're updating the registers correctly.
byuu

Post by byuu »

Which fits; it's difficult to determine whether the latch is X or X.5 when the delay can change drastically for a .5 dot difference in starting position...
Well, good to know we aren't getting different results on this after all. I've just been using my old rep x : { lda $0000 } : rep y : { lda $2100 } trick. Add one to x and subtract one from y to move the cycle counter position forward by 0.5 dots. Or you can go back 0.5 dots if you need to.
Also, it's extremely helpful to have the opcodes stepping cycle-by-cycle, so that the sta $420b and lda $2137 write at the (assumed) correct cycle position.
Having the DMA init delay working properly (at least in SlowROM mode) should hopefully make NMI during DMA timing easier to figure out as well.
I didn't either with SlowROM. FastROM will allow you to see the two other patterns.
... damnit. Oh well, I'm sure we'll notice the pattern eventually with lots of patience.
... I should send you the DMA & HDMA test ROM I put together recently. It doesn't do any of this stuff, but it'll test whether you're updating the registers correctly.
This is one of the things I wanted to do at some point in time... make a huge test ROM (or set of ROMs) that tests an emulator's accuracy with all of this stuff we're finding out. Kind of like an extended electronics test. It could cover CPU flag settings, opcode timing, H/DMA timing, counter latch positions, and the like. Most of the really hard timing stuff can only be verified by running your code on a copier and comparing latch positions, so future emulator authors / people without copiers won't be able to verify they've implemented our notes correctly without such a test ROM.
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

byuusan wrote:... damnit. Oh well, I'm sure we'll notice the pattern eventually with lots of patience.
The question is, why the variable delay in the first place? If we could figure that out, we'd probably be set...

[later]

SPC700 timing results: Remember way back when my "12-cycle" test looked like it should really be 13 cycles? It really is: I've timed opcode $DA and it is 5 cycles, rather than 4 as the available documentation claimed. So far though, everything else I've tested was correct (I haven't tested anything involving a call, branch, jump, BRK, SLEEP, STOP, or return, except BRA).
byuu

Post by byuu »

I'd like to start documenting my findings on SNES timing, but I want to come up with a formal method for doing this, so that I don't have to go through and redo everything later on... here's a mockup of what I have in mind:
snes documentation layout demonstration

I doubt the graphical flowchart method will be possible, although I'll probably make a nice version of it with a flowchart app or photoshop and include it as a picture on the main page anyway. I imagine the flowchart will be of a massive size by the time it's complete.

I plan to use the text mode and just subsection everything into categories. The link between DMA and NMI timing is there because I plan to cross-reference the same document in each section, since they're interrelated.

I also plan to make the actual documents themselves using a special scripting language (that PHP will convert to HTML) that will be used to display graphical explanations and cross-reference other documents.

I also want to create a sort of mini-CVS, which I'll do by hand. I plan to keep a timestamp in the filenames and archive all old, outdated versions, in case a revert is ever needed. I also want to keep a main changelog, so that if anyone ever follows the documentation and implements all of the findings, and something is changed later on, they can check the main changelog quickly.

Do you have any input or ideas on what else should be done/not done for this, before I begin?
SPC700 timing results: Remember way back when my "12-cycle" test looked like it should really be 13 cycles? It really is: I've timed opcode $DA and it is 5 cycles, rather than 4 as the available documentation claimed. So far though, everything else I've tested was correct (I haven't tested anything involving a call, branch, jump, BRK, SLEEP, STOP, or return, except BRA).
Makes sense that mov dp,ya would be the same number of cycles as mov ya,dp. But then, (supposedly) mov dp,a = 4 cycles and mov a,dp = 3 cycles, so who knows.

Excellent find, nonetheless. Are we going to try and break the APU down into cycles like we have for the CPU? Or are we just going to go with a 'good enough' per-opcode cycle counter? I don't know how we could figure out the per-cycle mnemonics for each opcode anyway.
byuu

Post by byuu »

Ugh!!!! >__<

Looked into FastROM DMA timing, and realized SlowROM DMA timing was also incorrect.

As usual, here are my logs:
http://setsuna.the2d.com/files/dma_delay_op.txt

It's a lot more complicated than I thought it would be, but after creating a huge list of tests, I was finally able to see the pattern, or so I hope...
I believe that the delay is based around two things.
1) The 4-step pattern position
2) The speed of the next two cycles of the next opcode

The first one I've already explained and you're familiar with, so no need to go over that. We will label these steps as {0}-{3}

As for the second one: first, we know that every single opcode has at least two opcode cycles within it. We also know that the only three possible combinations are 6+6, 8+6, and 8+8. I don't know of any opcode that utilizes 6+8, but I would guess it would be the same as 8+6 anyway. I also know of no way to get n+12 or 12+n for the first two cycles.
I am also making the assumption that only the next two cycles count. It is possible that this applies to the cycle during the DMA enable that actually sets $420b. Right now, I'm using: sta $420b, which always has the last cycle as 6 cycles long. I'm going to try using sta $420b in 16-bit mode later, to see if the extra cycle after the write to $420b counts as one of the next two cycles or not. It probably will.

Next, I also have not tested different transfer sizes (only tried 8 bytes), nor different numbers of channels (only tried 1 channel). I did notice a small quirk where writing 65536 bytes threw off my counter by 1 dot, whereas 65535 worked fine with my old theory. I also didn't notice either of these affecting timing with my old tests either. But it needs to be retested anyway.

With that said, I believe the DMA delay works like this:
Pattern-table {0}:
6+6=12 cycle delay
8+6=18 cycle delay
8+8=24 cycle delay
Pattern-table {1}:
6+6=18 cycle delay
8+6=18 cycle delay
8+8=16 cycle delay
Pattern-table {2}:
6+6=18 cycle delay
8+6=18 cycle delay
8+8=16 cycle delay
Pattern-table {3}:
6+6=18 cycle delay
8+6=12 cycle delay
8+8=16 cycle delay
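If that table holds up, it collapses to a small lookup (just a transcription of the values above — SlowROM, 8 bytes, 1 channel, nothing more):

Code: Select all

/* Rows are pattern-table steps {0}-{3}; columns are the speed of the next
   two cycles: [0] = 6+6, [1] = 8+6, [2] = 8+8. Values in master cycles.  */
static const int dma_init_delay_tbl[4][3] = {
    { 12, 18, 24 },   /* {0} */
    { 18, 18, 16 },   /* {1} */
    { 18, 18, 16 },   /* {2} */
    { 18, 12, 16 },   /* {3} */
};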

Why does it work like this? Most likely because Nintendo knew I'd be reverse-engineering their SNES timing system one day, and wanted to try and drive me insane before I succeeded, so they purposefully made the DMA timing as bizarre as possible. That, or because the sum of all 12 possible cycle delays multiplied by 3.265 (rounded) is the number of the devil. Take your pick :/

{1} and {2} seem identical, but {0} and {3} are unique. Using this table, and the fact that my old test was SlowROM { sta $420b : lda $2137 }, or 8+8, my tests from yesterday match up to these new findings {24, 16, 16, 16}, as well as all of my tests from today.

I was not able to determine the half-dot position for my tests, due to the weird nature of how the pattern tables change the results, and how even the opcode after the DMA enable changes the results as well... so I ran ~50 tests and used educated guesses to come up with the above table, which is the only possible table that satisfies all of the tests I ran.

Whew, I was really afraid I wouldn't be able to figure this out for a few hours there... it still may prove impossible, and many many more tests are needed, but I think I have it now... :/
Please test my theory and let me know what you come up with :D

---

Update: Yay. It's even more complicated than I thought it was. Turns out the number of bytes transferred affects the base initialization delay as well. As I am the one who reverse-engineered this, I'd like to dub them 'rings'. Basically, right now: There are four pattern tables, each pattern table has three timing blocks, each timing block contains three transfer rings. This results in 36 possible base initialization values.
I've updated the above text file to reflect these findings.
ring = bytes_transferred % 3;
In other words, zero bytes transferred = ring 0, one byte transferred = ring 1, two bytes transferred = ring 2, three bytes transferred = ring 0.
I believe it works like this because DMA tries to transfer the data in 24-bit blocks. It's fastest when the transfer total is divisible by three, but if it isn't, then it has to transfer either one or two more bytes afterwards. Of course, the pattern tables and timing blocks play into the equation as well.
Pseudo-code:

Code: Select all

x = transfer_size;               /* $43x5-6; a value of 0 means 65536 bytes */
if(!x) x = 65536;
while(x >= 3) {                  /* move whole 24-bit (3-byte) blocks first */
  transfer_long();
  x -= 3;
}
if(x == 2) { transfer_word(); }  /* then the 1- or 2-byte remainder, if any */
if(x == 1) { transfer_byte(); }
Obviously, this delay probably really exists at the end of the DMA transfer, but I have no way to (dis)prove that yet.

I don't have all of the transfer rings decoded yet. As I was using 8 for my transfers, I only have transfer ring 2 finished (8%3=2), and I did all of pattern table {1} as well.

New table:

Code: Select all

{0}
  6+6={??,??,12} cycles
  8+6={??,??,18} cycles
  8+8={??,??,24} cycles
{1}
  6+6={22,20,18} cycles
  8+6={16,20,18} cycles
  8+8={16,16,16} cycles
{2}
  6+6={??,??,18} cycles
  8+6={??,??,18} cycles
  8+8={??,??,16} cycles
{3}
  6+6={??,??,18} cycles
  8+6={??,??,12} cycles
  8+8={??,??,16} cycles
I am going to go postal if the number of channels plays into the base initialization delay as well. That would bring our total possible initialization values to 144. And if the transfer mode type plays into this... 1152. At which point I will resign myself to just using 18 master cycles for base initialization every time. I also still need to see whether writing a word to $420b counts the last cycle as part of the next-two-cycle pair.
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

I've finished with the SPC700 opcode timing; only the one instruction was wrong in the doc. Of course, I couldn't test STOP or SLEEP since both halt the processor (I suspect SLEEP == WAI, except there's no I to WA for).

I've also begun testing the math instructions. Decimal arithmetic is rather easier than on the 65816: just look at each nybble and the H/C flags. ADDW/SUBW set H based on the high byte, so DAA/DAS are useless there.

MUL is pretty simple as well, only Z and N are affected (both are set based on the high byte output).

DIV is a pain though. Z and N are based on the quotient. V is set if the quotient overflows. H is set oddly; it seems to be based on the input comparison X&$0F > Y&$0F. Division by zero actually gives results (seemingly Y=A and A=$FF-Y). Overflows by just 1 bit give sane results, but beyond that we start getting variations on the divide-by-zero "formula". Anyone have any ideas, before I try to figure it out?

[later]

I remembered to look at TRAC's results from my earlier SPC700 timing tests. The results seem to indicate ~1024700 Hz. It seems the oscillators (quite possibly both of them) have quite a bit of possible variance in them.
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

anomie wrote:Division by zero actually gives results (seemingly Y=A and A=$FF-Y). Overflows by just 1 bit give sane results, but beyond that we start getting variations on the divide-by-zero "formula". Anyone have any ideas, before I try to figure it out?
Ok, I think I have it. It's a fairly standard algorithm, but without any of the "safety" checks for overflow or divide-by-zero, and with the overflow flag hacked in. My code:

Code: Select all

if((Reg.X&0xf)<=(Reg.Y&0xf)){ // why?
    APUSetHalfCarry();
} else {
    APUClearHalfCarry();
}
uint32 yva, x, i; // well, really we want uint17
yva = Reg.YA;
x = Reg.X << 9; // X needs to be aligned at the top of the dividend
for(i=0; i<9; i++){
    yva<<=1; if(yva&0x20000) yva=(yva&0x1ffff)|1; // 17 bit ROL
    if(yva>=x) yva^=1; // Why XOR i don't know, but it's what works
    if(yva&1) yva-=x; // and I guess this was easier than a compound if
}
if(yva&0x100){
    APUSetOverflow();
} else {
    APUClearOverflow();
}
Reg.Y = yva>>9;
Reg.A = yva & 0xff;
Any thoughts?
byuu

Post by byuu »

Over my head, sorry. I'm still working on the CPU side of things...
You may want to try to contact TRAC again; he seems to be the most knowledgeable about the SPC700.
TRAC
SNEeSe Developer
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

anomie wrote:Oh, and note that an IRQ set for (153,240) will not trigger on the short frame in non-interlace mode. And for some reason, an IRQ set for dot 153 on the last scanline of the frame (261, or 262 for interlace mode long frames) will not trigger.
Perhaps the IRQ H-counter does not suffer from the 'manipulation' that the PPU2 H-counter does? Have attempts been made to set IRQs for H=340? I can't seem to remember... :oops:

anomie: great work on tracking down the SPC700 MOVW timing inaccuracy... now that we have that down, any ideas on researching SPC700 bus timing software-side?
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

TRAC wrote:Have attempts been made to set IRQs for H=340? I can't seem to remember... :oops:
No IRQ will trigger for H=340.
any ideas on researching SPC700 bus timing software-side?
You mean getting data along the lines of the GTE datasheet for the 65816? It should be _possible_ since the S-CPU can access the IO regs 2.5 times per SPC700 cycle. The hard part would be synchronizing the tests, accounting for WRAM refresh (which puts a big hole in our sampling), and interpreting the results... A few opcodes at least are fairly obvious; for example it seems that it needs an IO cycle to do the +X in "d+X", but no IO cycle between the read and write in something like "ASL d". Branches take 2 extra cycles when successful (I'd guess it always does the equivalent of the 65816's note #6)...
TRAC
SNEeSe Developer
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

anomie wrote:A few opcodes at least are fairly obvious; for example it seems that it needs an IO cycle to do the +X in "d+X", but no IO cycle between the read and write in something like "ASL d". Branches take 2 extra cycles when successful (I'd guess it always does the equivalent of the 65816's note #6)...
I'm not nearly so convinced. The fact that, as you say, there's 'no IO cycle' between the read and write in 'ASL d', and that other opcodes don't seem to have enough time for one byte transferred per cycle, looks dubious at the very least. Combine this with the fact that estimates of the 'worst case' bandwidth requirements of the SDSP call for considerably more bandwidth than 1.024MB/s, upwards of twice that, and you start suspecting that another solution needs to be found.

I had looked into the schematics for the possibility that the SDSP might have a 16-bit memory interface, but that does not appear to be possible, as the data lines for both 32kx8 RAMs are shown to be tied together. However, our documentation (and some testing or measurement, iirc?) does suggest the SSMP has a 2.048MHz clock input, even though its execution rate is half that. Perhaps the SDSP has a 4.096MB/s data rate, and the SSMP gets 2 accesses per 'instruction cycle'?
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

TRAC wrote:Combine this with the fact that estimates of the 'worst case' bandwidth requirements of the SDSP call for considerably more bandwidth than 1.024MB/s, upwards of twice that, and you start suspecting that another solution needs to be found.
I'm not familiar with these estimates...
However, our documentation (and some testing or measurement, iirc?) does suggest the SSMP has a 2.048MHz clock input, even though its execution rate is half that. Perhaps the SDSP has a 4.096MB/s data rate, and the SSMP gets 2 accesses per 'instruction cycle'?
Well, the oscillator for the whole thing is specified as 24.57MHz. This actually seems to go into two different pins on what seems to be S-DSP, and S-DSP supplies clocks to the CIC, the expansion port, and what seems to be the SPC700 chip. S-DSP also mediates all memory accessed by the SPC700, so a timeslicing arrangement seems easily possible. We have a report that the expansion port clock is around 8.192MHz, 1/3 the oscillator. If that's also the memory access clock, that would allow for 7 DSP accesses to 1 SPC700 access. Beyond that, I know of no other timing measurements...

[later]

I've run some tests on the SPC700 registers, to see what happens if you read a write-only register. $f1 and $fa-c are write-only, reads always seem to return 0. $f2 is read/write. $fd-f are sort of writable, writing resets the counter value just as reading does. $f0 reads back 0, but whether that's because it's write-only or because the read-write default value is 0 I don't know, since anything I tried to write caused strange behaviors...

Also, something interesting about writing $fa-c while the timer is running: Say T=5 and the internal counter is up to 2, then you set T=1. The next increment of $fd-f won't occur until the internal counter has gone all the way to $ff and wrapped back around to 1.

One other thing regarding timers: recall that there's 3 levels. At the bottom, T0 and T1 get a tick every 128 SPC700 cycles, and T2 gets a tick every 16 cycles. At the second level, each tick the internal counter gets incremented. And at the top level, the visible counter in $fd-f gets incremented when the middle level reaches the target. Accessing $fd-f resets only the top level to 0. Enabling the timer (0->1 of the bit in $f1) zeros only the top and middle levels, NOT the bottom level. For some reason, this surprised me...
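Written out, that three-level structure might look something like this (a rough sketch; the names and the reset of the middle counter when it hits the target are assumptions on my part, the rest is just the behavior described above):

Code: Select all

#include <stdint.h>

struct spc_timer {
    unsigned rate;      /* bottom level: 128 SPC700 cycles for T0/T1, 16 for T2 */
    uint8_t  target;    /* $fa-c                                                 */
    uint8_t  stage2;    /* middle level: wrapping 0-255 counter                  */
    uint8_t  stage3;    /* top level: 4-bit counter visible at $fd-f             */
    int      enabled;
};

/* Call once per SPC700 cycle; the bottom level is assumed to free-run off
   the global cycle count, so it is untouched by enable/disable.           */
void timer_step(struct spc_timer *t, unsigned cycle_count) {
    if (!t->enabled || (cycle_count % t->rate) != 0) return;
    if (++t->stage2 == t->target) {         /* middle level reached target  */
        t->stage2 = 0;                      /* assumed reset on match       */
        t->stage3 = (t->stage3 + 1) & 0x0F; /* top level is 4 bits           */
    }
}

uint8_t timer_read(struct spc_timer *t) {   /* read (or write) of $fd-f     */
    uint8_t v = t->stage3;
    t->stage3 = 0;                          /* only the top level resets    */
    return v;
}

void timer_enable(struct spc_timer *t) {    /* 0->1 of the bit in $f1       */
    t->stage2 = 0;                          /* middle level zeroed          */
    t->stage3 = 0;                          /* top level zeroed             */
    t->enabled = 1;                         /* bottom level left running    */
}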
TRAC
SNEeSe Developer
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

anomie wrote:
TRAC wrote: Combine this with the fact that estimates of the 'worst case' bandwidth requirements of the SDSP call for considerably more bandwidth than 1.024MB/s, upwards of twice that, and you start suspecting that another solution needs to be found.
I'm not familiar with these estimates...
For a given audio sample (1/32000 of a second, or 32 SSMP instruction cycles), the following bandwidth requirements can be generated by the SDSP.

echo read: 4 bytes (running total: 4)
echo write: 4 bytes (running total: 8)
8 channels x (loop address 2 + BRR header 1 + 4 BRR samples 2) = 40 bytes (running total: 48)

I'm not sure whether I've missed something, but even that would require a minimum of 1.536MB/s. It's unlikely that it would run at less than twice that (with no real benefit from an asynchronous setup), so that makes 3.072MB/s. To make sense of the SSMP opcode timings, I expect a 4.096MB/s rate equally shared would be sufficient. I don't see any way it could use 8.192MB/s with the SSMP timings and SDSP data we have.

Also, anomie... have you determined any behavior or function of $F8 and $F9?
anomie wrote:Enabling the timer (0->1 of the bit in $f1) zeros only the top and middle levels, NOT the bottom level. For some reason, this surprised me...
Does disabling the timers clear any of the levels? Does the lowest level of the timers run while disabled?
anomie
Lurker
Posts: 151
Joined: Tue Dec 07, 2004 1:40 am

Post by anomie »

TRAC wrote:To make sense of the SSMP opcode timings, I expect a 4.096MB/s rate equally shared would be sufficient. I don't see any way it could use 8.192MB/s with the SSMP timings and SDSP data we have.
OTOH, 8.192 would be more than enough then. Maybe the DSP needs to run at that speed to do whatever it needs to do? I suppose if we really wanted to know the timing, someone with the proper equipment could set up something with nice access patterns and watch the lines.
Also, anomie... have you determined any behavior or function of $F8 and $F9?
Nope. They're read/write, whatever they might be.
anomie wrote:Does disabling the timers clear any of the levels? Do the lowest level of the timers run while disabled?
We can only examine the top level directly. If the middle level is cleared on disable or on enable, there's no observable difference if it doesn't increment while disabled. I suspect the bottom level runs constantly, and that it's just a scaler on the main clock. If you emulated the SPC700 cycle-by-cycle, I suspect you could just do something like "!(Cycles&0x7F)" and "!(Cycles&0x0F)" to tick the middle levels...
TRAC
SNEeSe Developer
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

-- edited many times - hopefully not again [8-May-2005 - 0252 - 0500 GMT]
anomie wrote:We can only examine the top level directly. If the middle level is cleared on disable or on enable, there's no observable difference if it doesn't increment while disabled. I suspect the bottom level runs constantly, and that it's just a scaler on the main clock. If you emulated the SPC700 cycle-by-cycle, I suspect you could just do something like "!(Cycles&0x7F)" and "!(Cycles&0x0F)" to tick the middle levels...
We can't access the lower levels directly, but by 'catching' an increment on one of the timers, the state of the lower levels could possibly be derived.
anomie wrote:One other thing regarding timers: recall that there's 3 levels. At the bottom, T0 and T1 get a tick every 128 SPC700 cycles, and T2 gets a tick every 16 cycles. At the second level, each tick the internal counter gets incremented. And at the top level, the visible counter in $fd-f gets incremented when the middle level reaches the target. Accessing $fd-f resets only the top level to 0. Enabling the timer (0->1 of the bit in $f1) zeros only the top and middle levels, NOT the bottom level. For some reason, this surprised me...
What difference do you mean between the bottom counter and the middle counter, if you say the middle counter gets a tick for every tick of the bottom one?

Do you mean them as the 3 stages of:
Stage 1: 128:1 (T0, T1) or 16:1 (T2) scaler.
Stage 2: 1-256 'divisor', based on a 0-255 wraparound counter and a post-increment comparator.
Stage 3: The 4-bit counter for output ticks from the comparator stage.

Also, about the 4-bit counter - do we know if it wraps around, or saturates?
TRAC
SNEeSe Developer
SNEeSe Developer
Posts: 25
Joined: Sun Nov 14, 2004 12:46 pm
Contact:

Post by TRAC »

anomie wrote:
TRAC wrote:Have attempts been made to set IRQs for H=340? I can't seem to remember... :oops:
No IRQ will trigger for H=340.
Then... one would expect that, while the latches report normal-length dots for the 'short' line on odd non-interlace frames, the 5A22 still treats them as long dots?