Progress on reducing HQ2x to fit in L1 cache
Moderator: ZSNES Mods
Progress on reducing HQ2x to fit in L1 cache
I've been researching alternate methods of implementing the HQ system, initially the HQ2x variant specifically. The goal of this research has been to reduce the total footprint for code and data for the filter to be inside of 8 kilobytes so as to keep the lookup tables and code paths completely inside of L1 cache on all MMX and above processers.
So far this research has born fruit, in the form of a pixel art scaling toolkit I had posted on-line for general use briefly though I've since taken it down. 7 difference-checks and 4 arbitrary blends take a total of 86 clock cycles without any MMX used yet, allowing for 21 clock cycles per output pixel while only using ~4.5kb of code and data total.
So far this research has born fruit, in the form of a pixel art scaling toolkit I had posted on-line for general use briefly though I've since taken it down. 7 difference-checks and 4 arbitrary blends take a total of 86 clock cycles without any MMX used yet, allowing for 21 clock cycles per output pixel while only using ~4.5kb of code and data total.
Last edited by WolfWings on Sun Dec 30, 2012 11:35 pm, edited 5 times in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
-
- Dark Wind
- Posts: 1271
- Joined: Thu Jul 29, 2004 8:58 pm
- Location: Texas
- Contact:
Re: Progress on reducing HQ2x to fit in L1 cache
What are you compiling with? MinGW-GCC 3.4.5 gives meWolfWings wrote:Also, if people can run the MMX Difference code on their own computers on as many different systems as possible, I'd appreciate it. I only have an Athlon 64 to test on, and that can be heavilly lopsided for testing purposes. Especially I'd appreciate results from earlier CPU's.
Code: Select all
C:\src>gcc -Os MMXdiff.c -o MMXdiff.exe
MMXdiff.c: In function `ExpandedDiff':
MMXdiff.c:55: error: incompatible types in assignment
MMXdiff.c:56: error: incompatible types in assignment
MMXdiff.c:57: error: incompatible types in assignment
MMXdiff.c:58: error: incompatible types in assignment
Code: Select all
C:\src>gcc -Os -march=pentium-m MMXdiff.c -o MMXdiff.exe
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0xe8): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0xf8): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0x134): undefined reference to `random'
C:\DOCUME~1\Mitchell\LOCALS~1\Temp/ccu0baaa.o:MMXdiff.c:(.text+0x144): undefined reference to `random'
[u][url=http://bash.org/?577451]#577451[/url][/u]
-
- ZSNES Developer
- Posts: 3904
- Joined: Tue Jul 27, 2004 10:54 pm
- Location: Solar powered park bench
- Contact:
random() is a BSD thing, you'll need *nix to compile it, not Windows.
May 9 2007 - NSRT 3.4, now with lots of hashing and even more accurate information! Go download it.
_____________
Insane Coding
_____________
Insane Coding
I was mistaken, I thought random() was pretty cross-platform. Modified the code on the web-server, replacing it with rand() runs equivilant code, just might not be as random. Though since I'm not seeding the random-number generator with the clock or something similair it's relatively moot. 
Also, yeah, only testing with GCC 4.x (4.1.1 specifically) since 3.x has really poor MMX intrinsic support by comparison. 4.x allows for much cleaner-looking code to compile. I'd need to go through all sorts of hoops that would make the code easier to write in raw assembly to begin with in 3.x, compared to the relatively neat and readable code I can use in 4.x right now aside from the lack of MOVD support to avoid redundant stack usage.

Also, yeah, only testing with GCC 4.x (4.1.1 specifically) since 3.x has really poor MMX intrinsic support by comparison. 4.x allows for much cleaner-looking code to compile. I'd need to go through all sorts of hoops that would make the code easier to write in raw assembly to begin with in 3.x, compared to the relatively neat and readable code I can use in 4.x right now aside from the lack of MOVD support to avoid redundant stack usage.
Last edited by WolfWings on Tue Aug 01, 2006 6:08 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
-
- Dark Wind
- Posts: 1271
- Joined: Thu Jul 29, 2004 8:58 pm
- Location: Texas
- Contact:
Alright. Pentium M 730 (Dothan, 1.6 GHz) w/ 32/32 KB L1 data/code cache. Compiled with MinGW-GCC 4.1.1.
gcc -Os -march=pentium-m MMXdiff.c -o MMXdiff.exe
I've got an Athlon XP and a Pentium 3 at home, I'll run it on those.
gcc -Os -march=pentium-m MMXdiff.c -o MMXdiff.exe
Code: Select all
Ticks: -1124683956 Right: 1073741824 Wrong: 0
Ticks: -1271181139 Right: 1073741824 Wrong: 0
Ticks: -2101298327 Right: 1073741824 Wrong: 0
Ticks: 407186843 Right: 1073741824 Wrong: 0
Ticks: 1018030989 Right: 1073741824 Wrong: 0
Ticks: 1161470693 Right: 1073741824 Wrong: 0
Ticks: -636712411 Right: 1073741824 Wrong: 0
Ticks: 2092338431 Right: 1073741824 Wrong: 0
Ticks: 1054051752 Right: 1073741824 Wrong: 0
Ticks: -884377318 Right: 1073741824 Wrong: 0
Ticks: 561283507 Right: 1073741824 Wrong: 0
Ticks: -1176038292 Right: 1073741824 Wrong: 0
Ticks: 1677575927 Right: 1073741824 Wrong: 0
Ticks: -580191728 Right: 1073741824 Wrong: 0
Ticks: 1777804351 Right: 1073741824 Wrong: 0
Ticks: 192927273 Right: 1073741824 Wrong: 0
[u][url=http://bash.org/?577451]#577451[/url][/u]
Meh...
...bug in the blasted code there. Thanks for the test result, installed a full 32-bit Linux install to track down the bug that I wasn't seeing initially running the code in 32-bit mode on my 64-bit OS install. It wasn't displaying the results properly on non-64bit machines due to a mistake in my code. I uploaded a fixed version of the code that also shows a running average estimated clock-cycle count a couple of minutes ago. I also moved the timing checks inside the calculation subroutine so it will get a much more accurate reading of the clock cycles taken for the actual 'subtraction' and MMX ops.
Last edited by WolfWings on Tue Aug 01, 2006 6:08 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
I compiled it using MinGW with GCC 4.1.1. I use a Pentium 4 "Nocona" CPU @ 3.0 GHz. I only had time for the first nine "laps".
Code: Select all
Ticks: 635047103 Right: 1073741824 Wrong: 0 Average: -3
Ticks: 2109085812 Right: 1073741824 Wrong: 0 Average: -2
Ticks: -1853395371 Right: 1073741824 Wrong: 0 Average: -1
Ticks: 1536322843 Right: 1073741824 Wrong: 0 Average: -2
Ticks: 969834710 Right: 1073741824 Wrong: 0 Average: -3
Ticks: 725194335 Right: 1073741824 Wrong: 0 Average: -3
Ticks: 1200218382 Right: 1073741824 Wrong: 0 Average: -2
Ticks: 807501412 Right: 1073741824 Wrong: 0 Average: -3
Ticks: 696416873 Right: 1073741824 Wrong: 0 Average: -3
That... shouldn't happen. Meh. Okay, try this version instead. http://wolfwings.us/hq/timing.c Skips all the fancy tests, I've accepted the math works, so it removes the right/wrong stuff and just tries to record the ticks accurately over a huge block of the code being run repeatedly.Poobah wrote:I compiled it using MinGW with GCC 4.1.1. I use a Pentium 4 "Nocona" CPU @ 3.0 GHz. I only had time for the first nine "laps".Code: Select all
Ticks: 635047103 Right: 1073741824 Wrong: 0 Average: -3 Ticks: 2109085812 Right: 1073741824 Wrong: 0 Average: -2 Ticks: -1853395371 Right: 1073741824 Wrong: 0 Average: -1 Ticks: 1536322843 Right: 1073741824 Wrong: 0 Average: -2 Ticks: 969834710 Right: 1073741824 Wrong: 0 Average: -3 Ticks: 725194335 Right: 1073741824 Wrong: 0 Average: -3 Ticks: 1200218382 Right: 1073741824 Wrong: 0 Average: -2 Ticks: 807501412 Right: 1073741824 Wrong: 0 Average: -3 Ticks: 696416873 Right: 1073741824 Wrong: 0 Average: -3
Last edited by WolfWings on Tue Aug 01, 2006 6:09 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
Code: Select all
Ticks: 1932806547
Ticks: 1899956382
Ticks: 1929284420
Ticks: 2080034802
Ticks: 2074461807
Ticks: 2109035840
Ticks: 2098081857
Ticks: 1883020288
Ticks: 1882983830
Ticks: -2132344564
Ticks: 1996767124
Ticks: -2098670569
Ticks: 2038388383
Ticks: 1910984104
Ticks: 1929814685
Ticks: 1883466837
In this case, truly randomized sequences weren't really needed, the random numbers were only being used to try to 'poison' read-ahead in caches to rudely simulate the chaotic 'all over the place' nature of most video-game images since they don't do lots of smooth gradiants.byuu wrote:Offtopic note: seeding is usually done with srand(time());
But there's plenty of other ways to do it too, of course

And I still can't figure out why the frelling hell the code's not getting proper 64-bit RDTSC values. =-.-= It should not physically be possibly to get a negative result like it's turning out unless it's only doing the math in 32-bit mode for some reason, and I haven't been able to duplicate that situation with the timing.c code. :-/
Fuck it, just throw 'time ./timing' if you're on a *NIX and lemme know what that says, that'll at least give a reasonable response regardless of the mis-gauged clock-ticks.

Last edited by WolfWings on Tue Aug 01, 2006 6:09 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
Here you go. Compiled with -march=k6-2 for your convenience.
Northwood Celeron, 3020Mhz.
Ticks: -219943712
Ticks: 706787624
Ticks: 1760269224
Ticks: 239852700
Ticks: 23413652
Ticks: 106703928
Ticks: 31305052
Ticks: -179077004
Ticks: -140565660
Ticks: 617559996
Ticks: 1016521984
Ticks: 1007000948
Ticks: 735528072
Ticks: 324340236
Ticks: 314756832
Ticks: 267699984
Yay?
Ticks: -219943712
Ticks: 706787624
Ticks: 1760269224
Ticks: 239852700
Ticks: 23413652
Ticks: 106703928
Ticks: 31305052
Ticks: -179077004
Ticks: -140565660
Ticks: 617559996
Ticks: 1016521984
Ticks: 1007000948
Ticks: 735528072
Ticks: 324340236
Ticks: 314756832
Ticks: 267699984
Yay?
SHREIK!!!!!!! DDdddnnnnnnaaaa! GESTAHLLLLLLLLLL!!!!!!!!
Steelers now officially own your ass.
Steelers now officially own your ass.
-
- ZNES Developer
- Posts: 215
- Joined: Mon Aug 02, 2004 11:22 pm
Edit: Rerun using DMV27's binary.
Athlon XP Barton Core @ 2.0 GHz
Athlon XP Barton Core @ 2.0 GHz
Code: Select all
Ticks: 12274106949
Ticks: 12472828512
Ticks: 11549340469
Ticks: 11431195209
Ticks: 11411786656
Ticks: 11417993042
Ticks: 11419195704
Ticks: 11385457613
Ticks: 11341841106
Ticks: 11392726554
Ticks: 11387325977
Ticks: 11463281467
Ticks: 11411806859
Ticks: 11387981304
Ticks: 11488935477
Ticks: 11370416010
Last edited by DOLLS (J) [!] on Tue Jul 25, 2006 7:00 am, edited 1 time in total.
The RDTSC code is correct, but the printf "%16lli" code does not work properly on Win32. I fixed it by using a double instead of a uint64. The code and a Win32 EXE compiled with GCC 3.4.5 are at http://rapidshare.de/files/26915721/timing.zip.html. I also added some CPUID opcodes for more accurate timing.WolfWings wrote:And I still can't figure out why the frelling hell the code's not getting proper 64-bit RDTSC values. =-.-= It should not physically be possibly to get a negative result like it's turning out unless it's only doing the math in 32-bit mode for some reason, and I haven't been able to duplicate that situation with the timing.c code. :-/
Duron @ 850 MHz
Code: Select all
Ticks: 14898415422
Ticks: 14530683119
Ticks: 14513379681
Ticks: 14531306276
Ticks: 14576906166
Ticks: 14495807632
Ticks: 14548245187
-
- Lurker
- Posts: 165
- Joined: Wed Jul 28, 2004 3:15 am
- Location: The Land of Insanity
- Contact:
Athlon X2 3800+ @ 2.0 GHz
Code: Select all
Ticks: 9025792898
Ticks: 8986723978
Ticks: 9025712758
Ticks: 8529643670
Ticks: 9046469446
Ticks: 9539386198
Ticks: 9525954193
Ticks: 9566102128
Ticks: 9573992158
Ticks: 9552645045
Ticks: 9537216631
Ticks: 9485480727
Ticks: 9480342099
Ticks: 9504240861
Ticks: 9499390360
Ticks: 9026290844
It is better to be silent and thought a fool then to open your mouth and remove all doubt
I am Zophar, Master of Sh*t!
[url=http://archlyn.bravejournal.com]View my blog[/url]
I am Zophar, Master of Sh*t!
[url=http://archlyn.bravejournal.com]View my blog[/url]
Ah, this is another Windowism. You need to use %I64d for 64-bit integers passed to a printf statement. However, using a double would work as well for this case since the va_args list stack push would truncate the double down to 64-bits for you anyway.DMV27 wrote:The RDTSC code is correct, but the printf "%16lli" code does not work properly on Win32.
Good to know %16lli is the unix-equivalent of %I64d, though.
Ticks: 11279449977
Ticks: 11332359781
Ticks: 11269373812
Ticks: 11257814392
Ticks: 11295460487
Ticks: 11242154574
Ticks: 11315805258
Ticks: 11280371652
Ticks: 11275821014
Ticks: 11289480215
Ticks: 11247295621
Ticks: 11295770396
Ticks: 11251983033
Ticks: 11283265198
Ticks: 11260864251
Ticks: 11262269531
AMD Athlon XP 2500+ 1.98Ghz
Ticks: 11332359781
Ticks: 11269373812
Ticks: 11257814392
Ticks: 11295460487
Ticks: 11242154574
Ticks: 11315805258
Ticks: 11280371652
Ticks: 11275821014
Ticks: 11289480215
Ticks: 11247295621
Ticks: 11295770396
Ticks: 11251983033
Ticks: 11283265198
Ticks: 11260864251
Ticks: 11262269531
AMD Athlon XP 2500+ 1.98Ghz
Okay, the results are valid and stable at least now. I was away from the internet for a couple of days, and had my car break down in Arizona from this heat (it's not getting repaired) and I'm back in California now.
Looks like Athlon's are averaging around 12 cycles per lap with the loop overhead and RGBUnpack overhead included, Athlon-64's/Turions/X2's are down around 9 cycles per lap, and the Pentium 4 is up around 18 though I'm not entirely surprised since it's such an infamously difficult case to optimize code fully for the P4. Fully P4-optimized code though, as the VirtualDub author can attest, chews up bytes at a frightening rate.
I've actually worked out a table-free version at this point that surprisingly ends up about the same size, code-wise (100-120 bytes depending on optimization settings), as this tiny-table version. And it removes the table. As it ends up doing more parallel codepaths, it pipelines better from what I can see, though it obviously is doing more math (12 opcodes being the longest 'branch' of the parallel code paths though I may be able to reduce that by shuffling YUV columns around) so it may not be any faster. I also figured out how to short-circuit for the 'identical pixels' case, so that particular case kicks back after nothing more than the equivilant of a compare (a subtraction), and requisite jump. Moved the guard-bits math outside of the comparison after reading and understanding the "Optimal RGB Mixing" articles better. I'll upload the source code for this version later today if folks don't mind running another test program?
EDIT: After reading through various CPU optimization and instruction-pairing documents, I've learned something amusing: It's often faster to store a 32-bit register to memory and perform an MMX operation directly FROM that memory, than to use MOVD as I am right now to copy to an MMX register and manipulate from that. Starting to chew up registers more than before, but still not badly.
Looks like Athlon's are averaging around 12 cycles per lap with the loop overhead and RGBUnpack overhead included, Athlon-64's/Turions/X2's are down around 9 cycles per lap, and the Pentium 4 is up around 18 though I'm not entirely surprised since it's such an infamously difficult case to optimize code fully for the P4. Fully P4-optimized code though, as the VirtualDub author can attest, chews up bytes at a frightening rate.
I've actually worked out a table-free version at this point that surprisingly ends up about the same size, code-wise (100-120 bytes depending on optimization settings), as this tiny-table version. And it removes the table. As it ends up doing more parallel codepaths, it pipelines better from what I can see, though it obviously is doing more math (12 opcodes being the longest 'branch' of the parallel code paths though I may be able to reduce that by shuffling YUV columns around) so it may not be any faster. I also figured out how to short-circuit for the 'identical pixels' case, so that particular case kicks back after nothing more than the equivilant of a compare (a subtraction), and requisite jump. Moved the guard-bits math outside of the comparison after reading and understanding the "Optimal RGB Mixing" articles better. I'll upload the source code for this version later today if folks don't mind running another test program?
EDIT: After reading through various CPU optimization and instruction-pairing documents, I've learned something amusing: It's often faster to store a 32-bit register to memory and perform an MMX operation directly FROM that memory, than to use MOVD as I am right now to copy to an MMX register and manipulate from that. Starting to chew up registers more than before, but still not badly.
Last edited by WolfWings on Tue Aug 01, 2006 6:09 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):
A lot of the "switch statement handlers" have common tails which could be merged together to save a lot of code space.
For example, you could optimize something like this: [example from HQ3X.asm]
to read like this instead:
Given how much code is emitted by those macros, I suspect you could make the whole thing some 30-40% smaller this way with essentially no performance loss.
@MaxSt: I don't know how this code was written but it sure smells like it was generated by a program from some sort of source table. Would it make sense to modify your generator program so that it can identify common tails and automatically combine them as above?
[Edit: You could apply this optimization to the C++ code also, using gotos. An omniscient C++ compiler could do it automatically, but I'd be surprised if real compilers even tried an optimization like that. For readability of the C++ version, you might want to keep the original tail code there after the goto and just comment it out.]
Edit: I had another dumb idea, which could reduce code size especially for HQ3X and HQ4X: un-inline the contents of the PIXEL_nn macros (i.e. each one a call instruction). That would shrink code size drastically but probably slow it down a bit on anything older than a Pentium 4. My guess is that the Pentium 4 would perform about the same on that code; I have not tried it though, and I have no intuition about AMD here.
It might be better to just un-inline some of the PIXEL_nn macros---the ones that are not called very often at runtime (though I imagine those are the least-frequently-occurring in the source code too... oh well). In HQ3X for example, suppose you budget around 3 extra call instructions per flag handler (i.e. around 6 of the pixels for each are still inlined). If you pick the right ones you could probably reduce code size by 30% without slowing it down TOO much.
Maybe that trick is better applied to HQ4X only. You could start with the assumption that only people with a decently fast computer are going to enable HQ4X anyway, and not worry if its 7% slower on older machines...
A lot of the "switch statement handlers" have common tails which could be merged together to save a lot of code space.
For example, you could optimize something like this: [example from HQ3X.asm]
Code: Select all
..@flag89
PIXEL00_1U
PIXEL01_1
PIXEL02_1M
PIXEL10_C
PIXEL11
PIXEL12_C
DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
PIXEL21_C
DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
jmp .loopx_end
..@flag90
DiffOrNot w4,w2,PIXEL00_1M,PIXEL00_2
PIXEL01_C
DiffOrNot w2,w6,PIXEL02_1M,PIXEL02_2
PIXEL10_C
PIXEL11
PIXEL12_C
DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
PIXEL21_C
DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
jmp .loopx_end
Code: Select all
..@flag89
PIXEL00_1U
PIXEL01_1
PIXEL02_1M
..@flag89_tail:
PIXEL10_C
PIXEL11
PIXEL12_C
DiffOrNot w8,w4,PIXEL20_1M,PIXEL20_2
PIXEL21_C
DiffOrNot w6,w8,PIXEL22_1M,PIXEL22_2
jmp .loopx_end
..@flag90
DiffOrNot w4,w2,PIXEL00_1M,PIXEL00_2
PIXEL01_C
DiffOrNot w2,w6,PIXEL02_1M,PIXEL02_2
jmp ..@flag89_tail
@MaxSt: I don't know how this code was written but it sure smells like it was generated by a program from some sort of source table. Would it make sense to modify your generator program so that it can identify common tails and automatically combine them as above?
[Edit: You could apply this optimization to the C++ code also, using gotos. An omniscient C++ compiler could do it automatically, but I'd be surprised if real compilers even tried an optimization like that. For readability of the C++ version, you might want to keep the original tail code there after the goto and just comment it out.]
Edit: I had another dumb idea, which could reduce code size especially for HQ3X and HQ4X: un-inline the contents of the PIXEL_nn macros (i.e. each one a call instruction). That would shrink code size drastically but probably slow it down a bit on anything older than a Pentium 4. My guess is that the Pentium 4 would perform about the same on that code; I have not tried it though, and I have no intuition about AMD here.
It might be better to just un-inline some of the PIXEL_nn macros---the ones that are not called very often at runtime (though I imagine those are the least-frequently-occurring in the source code too... oh well). In HQ3X for example, suppose you budget around 3 extra call instructions per flag handler (i.e. around 6 of the pixels for each are still inlined). If you pick the right ones you could probably reduce code size by 30% without slowing it down TOO much.
Maybe that trick is better applied to HQ4X only. You could start with the assumption that only people with a decently fast computer are going to enable HQ4X anyway, and not worry if its 7% slower on older machines...
Actually, this very concept is the basis for my work. Try this experiment instead: Unwrap the switch statement for each individual pixel. Now compare entire switch statements, you'll suddenly see a LOT of relations between them. And I mean hard, solid simplifications once you break all the pixels apart. Well... once you break things into quarters specifically, breaking them down further doesn't noticably improve HQ4x, though HQ3x you'd end up with the 'overlapping' middles to deal with.mozz wrote:I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):

Last edited by WolfWings on Tue Aug 01, 2006 6:09 pm, edited 1 time in total.
[img]http://wolfwings.us/sigs/WolfWings.jpg[/img]
My thought for HQ3X was that it looks like most of the handlers could be divided into "two halves" (of varying size). E.g. if there are a bunch of handlers that set the top, left, and upper left pixels to the same thing, that would be done in one "half" and the other 5 pixels could be done in the other "half" (the center pixel is always the same in HQ3X).WolfWings wrote:Actually, this very concept is the basis for my work. Try this experiment instead: Unwrap the switch statement for each individual pixel. Now compare entire switch statements, you'll suddenly see a LOT of relations between them. And I mean hard, solid simplifications once you break all the pixels apart. Well... once you break things into quarters specifically, breaking them down further doesn't noticably improve HQ4x, though HQ3x you'd end up with the 'overlapping' middles to deal with.mozz wrote:I'll just toss out an idea that could be used to shrink the assembly versions of the "giant switch statement" in HQ2X/HQ3X/HQ4X (this does not apply to WolfWings' microcoded version though):
You then construct a dispatch table with two entries for each index (pack them into one 32-bit word by making them 16-bit offsets into the code, and as you load them into a register, add a base address to them). During dispatch, load the second one into a register and leave it there. By the time you've run the first handler (which should generally be the shorter of the two, unless its less than 40 cycles long), the jump through a register to the second handler ought to be cheap or free.