Storing pixels as words vs. bytes on a Pentium
category: code [glöplog]
I realized that in my MS-DOS demo for this year's Assembly I stored all the 8-bit pixel values with single char writes to VRAM. How much slower is this compared to first combining four values in a register and then writing them as a single 32-bit integer?
Are there many other common performance-related gotchas one should know about? I don't have my Pentium machine at hand here so I can't do any benchmarks.
You get the needed shifting for free because of the buffering of the bus, so combining bytes to 16 bits at least is definitely worthwhile -- 32 even better. Could be 4x the speed, depending on the video card (I remember some crappier ones only managing 16-bit writes). There are quite a lot of cycles available between writes, so you can probably also empty the main memory buffer for free while you're at it.
(In hindsight, not knowing what the actual effect is and whether there's a main memory buffer etc. it's really hard to say anything conclusive. Sigh.)
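A minimal sketch of the byte-combining idea in C, assuming a little-endian x86 target and a 4-byte-aligned destination. The names `pack4` and `store_span` are made up for illustration, and `dst` merely stands in for the mode 13h framebuffer:

```c
#include <stddef.h>
#include <stdint.h>

/* Combine four 8-bit pixels into one dword. On a little-endian CPU such as
   the Pentium, p0 lands at the lowest address, i.e. the leftmost pixel. */
static uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    return (uint32_t)p0 | ((uint32_t)p1 << 8) |
           ((uint32_t)p2 << 16) | ((uint32_t)p3 << 24);
}

/* Write n_dwords * 4 pixels with one 32-bit store per four pixels.
   Hypothetical helper: src is any 8-bit pixel source in system RAM. */
static void store_span(uint32_t *dst, const uint8_t *src, size_t n_dwords)
{
    size_t i;
    for (i = 0; i < n_dwords; i++, src += 4)
        dst[i] = pack4(src[0], src[1], src[2], src[3]);
}
```

In a real inner loop the four pixels would come straight from the effect code rather than from a source array; the point is only that VRAM sees one write instead of four.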
you save a lot.
i see that the 486 and pentium take only 1 cycle for the move operation, while the 386 takes 2 and sometimes 4, depending on where you load and write, so you will save a lot of cycles since you also halve or quarter the number of loop iterations for the moves.
you can get the intel documentation and compare the cycle counts for the move operations (or any other operation for that matter).
to avoid losing cycles on data-cache misses you could also align your data. im no expert, but i know this can improve the speed a lot since the data gets fetched from the cache and not directly from RAM. that way you don't lose cycles and the bus is spared the slow RAM accesses. see code and data alignment here for example: http://archive.gamedev.net/archive/reference/articles/article206.html or google some more on cache alignment.
for example if you manage data in a struct, it's better for its size to be a power of two than something else.
Code:
struct
{
BYTE a[2];
BYTE b[24];
};
//total size: 26 = BAD.
Code:
struct
{
BYTE a[2];
BYTE b[8];
BYTE c[6];
};
//total size: 16 = GOOD.
in the first example, just insert some extra memory (which you might have a use for later) to align it to the cache. for example, insert BYTE dummy[6]; so the total is 32 bytes for that struct, which is a good number for the cache. there are different cache sizes and so on, though, so you'd better keep your structs compact so you can fill the memory with data as densely as you can.
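The padded version of the first struct could look like the sketch below. The struct name and the compile-time size check are additions for illustration, and 32 bytes is used because that is the Pentium's L1 cache line size. Padding only helps if the array of structs itself starts on a 32-byte boundary.

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* The 26-byte struct from the first example, padded up to 32 bytes. */
struct padded {
    BYTE a[2];
    BYTE b[24];
    BYTE dummy[6];   /* filler, or room for fields added later */
};

/* Compile-time check that also works on old C compilers: the array size
   becomes -1, and compilation fails, if the struct is not exactly 32 bytes. */
typedef char padded_is_32_bytes[(sizeof(struct padded) == 32) ? 1 : -1];
```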
The bottom line is that one write takes the same time no matter if it's a byte or 32 bits, and after writing there are several "free" cycles when the graphics card won't accept more anyway.
Depending on the effect, you may gain some speed or you may not. I tried this stuff back in the day and quite often it was a pain in the ass to get the inner loop to accommodate the shl eax, 8. Great for mode 13h flat fillers though, and probably for unrolled code if you're going for that.
Pentium can even do 64-bit writes in the same time, because it has a 64-bit bus.
That's why the fastest way to memcpy on a Pentium is to use the FPU to load and store 64-bit integers (the only way to do a 64-bit load/store in a single instruction).
You probably gain something when you put two 32-bit stores in the same cycle (U and V pipe), because then the Pentium might be able to group them together in a cache-line update.
In general, pairing instructions on Pentium is very important for performance.
so, how would you blit with arbitrary colorkey transparency with 32 bit wide writes to 13h VGA? I couldn't figure that out...
@visy: you can try to generate AND/XOR masks depending on the pixel color value and then apply them, but that will involve video memory reads (unless a virtual framebuffer in system RAM is used), resulting in a major performance drop.
Or you can use a dedicated transparency bit (ideal for 15bpp hicolor mode, since the MSB is not used by the color components and remains free for other purposes), then generate masks according to this bit's value (it can be done with shifts/bitwise ops to avoid costly branching). I used it in my recent DOS demo, which was released at CC2016 but still isn't uploaded to pouet :D
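A rough sketch of the transparency-bit approach in C, handling two 15bpp pixels per dword. The convention that a set bit 15 means "transparent" and the name `blend15x2` are assumptions for illustration, not taken from the demo mentioned above:

```c
#include <stdint.h>

/* Blend two 15bpp pixels packed in one dword, without branches.
   ASSUMPTION: bit 15 set in a source pixel marks that pixel transparent. */
static uint32_t blend15x2(uint32_t dst, uint32_t src)
{
    uint32_t t    = (src >> 15) & 0x00010001u;  /* 1 per transparent pixel */
    uint32_t mask = (t << 16) - t;              /* per pixel: 1 -> 0xFFFF, 0 -> 0x0000 */
    return (dst & mask) | (src & ~mask);        /* keep dst where src is transparent */
}
```

Since `dst` has to be read, this is best done against a buffer in system RAM, with the finished frame copied to video memory afterwards.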
You can use bitmasking for that.
Off the top of my head...
Say you have a colorkey of 0x55:
colorkey = 0x55555555;
mask = pixel ^ colorkey; // Now all bytes containing the colorkey are 0
temp1 = (mask >> 8) & 0xFF00FF00; // Split into two parts
temp2 = mask & 0x00FF00FF;
temp1 += 0x00FF00FF; // create an overflow in all bytes that were not 0 (so not colorkey)
temp2 += 0x00FF00FF;
temp1 &= 0x01000100; // Save only the overflow bit
temp2 &= 0x01000100;
temp1 -= 0x010001; // Convert to 0x00 or 0xFF mask, where 0x00 is colorkey, 0xFF is no colorkey
temp2 -= 0x010001;
mask = temp1 << 8 | temp2; // Combine again to 32-bit mask
Of course it is best to prepare the mask once, and re-use it when possible (eg when blitting sprites).
Oops, got the order of the shift and the AND wrong on this line:
temp1 = (mask >> 8) & 0xFF00FF00; // Split into two parts
Should be:
temp1 = (mask >> 8) & 0x00FF00FF; // Split into two parts
nice stuff, thanx!
aah scali, perfect :) that's exactly what I was looking for!
Oh, I think I see another error:
temp1 -= 0x010001; // Convert to 0x00 or 0xFF mask, where 0x00 is colorkey, 0xFF is no colorkey
temp2 -= 0x010001;
Should be:
temp1 -= (temp1 >> 8);
temp2 -= (temp2 >> 8);
Anyway, this is just off the top of my head, code not tested... but hopefully you get the idea, and you can fine-tune it for your own needs.
One interesting optimization could be to only use 128 different colours. That way you have a 'spare' bit, and don't need to split it into two dwords.
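Putting the snippet and both corrections together gives something like the sketch below. The function names are made up, the overflow-bit constant is taken to be 0x01000100, and `colorkey_mask7` is only one possible reading of the 128-colour idea:

```c
#include <stdint.h>

/* Per-byte mask for four 8-bit pixels: 0x00 where the pixel equals the
   colorkey, 0xFF elsewhere. */
static uint32_t colorkey_mask(uint32_t pixels, uint8_t key)
{
    uint32_t m  = pixels ^ (key * 0x01010101u);  /* colorkey bytes become 0 */
    uint32_t hi = (m >> 8) & 0x00FF00FFu;        /* bytes 1 and 3 */
    uint32_t lo = m & 0x00FF00FFu;               /* bytes 0 and 2 */
    hi = (hi + 0x00FF00FFu) & 0x01000100u;       /* bit 8 set if the byte was non-zero */
    lo = (lo + 0x00FF00FFu) & 0x01000100u;
    hi -= hi >> 8;                               /* 0x0100 -> 0x00FF, 0 stays 0 */
    lo -= lo >> 8;
    return (hi << 8) | lo;                       /* recombine into one 32-bit mask */
}

/* 128-colour variant: all pixels and the key are <= 0x7F, so bit 7 of every
   byte is spare and the split into two dwords is not needed. */
static uint32_t colorkey_mask7(uint32_t pixels, uint8_t key)
{
    uint32_t m = pixels ^ (key * 0x01010101u);
    uint32_t t = (m + 0x7F7F7F7Fu) & 0x80808080u;  /* bit 7 set if the byte was non-zero */
    return (t << 1) - (t >> 7);                    /* per byte: 0x80 -> 0xFF */
}

/* Apply a mask: take src where the mask is 0xFF, keep dst where it is 0x00. */
static uint32_t blit4(uint32_t dst, uint32_t src, uint32_t mask)
{
    return (dst & ~mask) | (src & mask);
}
```

For sprites the mask can be computed once per dword and stored next to the pixel data, so the blit itself is just the AND/OR in `blit4`.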
naah, i think limiting to 128 colors is not worth it.
Well, limiting to 128 colors was a quick way to do "halfbrite" stuff on VGA -- just use the 7th bit to denote an area that needed to be a different luminance, then copy the first 128 colors to the latter 128 in the palette, but brighter/darker/etc. for the effect. Working closely with your graphician was sometimes necessary for this to be effective, so that you could fit everything into 128.
I'm not sure how to feel about this thread; I thought these tricks were well-known? Maybe it's a new generation making Pentium demos? In any case, the main stuff was already mentioned above: Transfer as much as you can (using the FPU to move QWORDS can be faster than REP MOVSD depending on your code), learn how the U and V pipes work so you can pair your code effectively (or, make sure you use a compiler that understands pipelining), and try to make use of the cache where possible -- for example, Pascal wrote a rotozoomer article (hugi? demonews?) that showed you can do the effect faster if you perform your updates in tiles instead of scanlines, as this exploited cache locality.
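To illustrate the tiles-versus-scanlines point, here is a sketch of a rotozoomer written both ways. It is not the code from the article mentioned above: the 8x8 tile size, the 256x256 texture, the 8.8 fixed-point steps and all names are assumptions, and `int` is taken to be 32 bits.

```c
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 200
#define TILE     8      /* assumed tile size; divides both 320 and 200 */

/* Fetch one texel from a 256x256 texture using 8.8 fixed-point steps. */
static uint8_t roto_texel(const uint8_t *tex, int x, int y,
                          int dudx, int dvdx, int dudy, int dvdy)
{
    uint32_t u = (uint32_t)(x * dudx + y * dudy);
    uint32_t v = (uint32_t)(x * dvdx + y * dvdy);
    return tex[((v >> 8) & 255u) * 256u + ((u >> 8) & 255u)];
}

/* Scanline order: one long diagonal walk through the texture per line. */
static void roto_scanlines(uint8_t *dst, const uint8_t *tex,
                           int dudx, int dvdx, int dudy, int dvdy)
{
    int x, y;
    for (y = 0; y < SCREEN_H; y++)
        for (x = 0; x < SCREEN_W; x++)
            dst[y * SCREEN_W + x] = roto_texel(tex, x, y, dudx, dvdx, dudy, dvdy);
}

/* Tile order: each 8x8 block touches only a small patch of the texture,
   so the texels fetched are more likely to still be in the cache. */
static void roto_tiled(uint8_t *dst, const uint8_t *tex,
                       int dudx, int dvdx, int dudy, int dvdy)
{
    int tx, ty, x, y;
    for (ty = 0; ty < SCREEN_H; ty += TILE)
        for (tx = 0; tx < SCREEN_W; tx += TILE)
            for (y = ty; y < ty + TILE; y++)
                for (x = tx; x < tx + TILE; x++)
                    dst[y * SCREEN_W + x] = roto_texel(tex, x, y, dudx, dvdx, dudy, dvdy);
}
```

Both functions produce the same image; only the order of the memory accesses differs. A real inner loop would step u and v incrementally instead of multiplying per pixel.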
If set up cleverly, bit-masking tricks like the one shown by scali also work pretty well with 15 or 16 bit color modes - looks almost as good as true color and is easier to handle than paletted mode.
On a Pentium and 6x86 it pays off to mix instructions properly so that they can run in parallel in the U and V pipes.
It might be useful to use the EBP and even the ESP register as well if your runtime environment allows it (no problem under Win32, more difficult under a protected mode environment without task switching and active interrupts) in order to feed both pipes.
Regarding vidmem writes: Some platforms might not apply write combining for the video mem, AFAIR e.g. some K6 based machines.
If your target platform has an MMX CPU you can also use the MMX registers for fast memory copies.
for the threadstarter:
instead of STOSB for writing to video RAM at segment 0xA000 (the [ES:EDI] video address), use STOSD, which stores the doubleword in EAX and increments the address register by 4. so you actually plot 4 pixels at once instead of just one. but this is really basic. this is for mode 13h at least.
if the pentium has an instruction that can do this with 64 bits, you could maybe do 8 pixels at once? i don't know which one it is. maybe MMX can do this too.
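In C terms, the STOSD version of a flat fill looks roughly like the sketch below. `fill4` is a made-up name, and `dst` is assumed to be 4-byte aligned and to stand in for the 0xA000 segment:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n_dwords * 4 pixels with one colour, four pixels per store. This is
   roughly what REP STOSD does once the colour byte is replicated into EAX. */
static void fill4(uint32_t *dst, uint8_t color, size_t n_dwords)
{
    uint32_t c4 = color * 0x01010101u;   /* copy the byte into all four lanes */
    size_t i;
    for (i = 0; i < n_dwords; i++)
        dst[i] = c4;
}
```

A full 320x200 mode 13h screen is 64000 bytes, so clearing it takes 16000 of these stores instead of 64000 byte writes.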
Thanks for the tips folks!
I can only warn about using the FPU for memcpy. We tried that back in the day and ran into serious trouble. In some situations it got really slow and data even got corrupted. So we switched to MMX. Later we found out that the reason for both was the FPU throwing exceptions on denormalized values and silently converting SNaNs to QNaNs. /o\
I thought FILD and FISTP only cause exceptions if the FPU stack runs out?
I think you can mask that out too with FSTCW, and similarly you can enable exceptions on a variety of other things with it (overflow/zerodivide/etc) as well.
Quote:
using the FPU for memcpy [...] data got corrupted
True, you can't use fld/fstp as a general memcpy.
But it works for 32bit RGBs where the alpha-component in every 0xaarrggbbaarrggbb is zero.
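One way to see why zero alpha is safe: a 64-bit pattern only reads as a NaN when its 11 exponent bits are all ones, and a zero top byte keeps the exponent at 0x00F or below. The checker below is an illustrative sketch (the name `is_snan_bits` is made up) that flags the patterns an fld/fstp copy would alter. It does not flag denormals, which occur when the top 12 bits are all zero and are the separate slowdown issue mentioned earlier.

```c
#include <stdint.h>

/* Returns 1 if the 64-bit pattern, read as an IEEE 754 double, is a
   signaling NaN: exponent all ones, mantissa non-zero, top mantissa bit clear. */
static int is_snan_bits(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7FFu;
    uint64_t mantissa = bits & 0x000FFFFFFFFFFFFFull;
    return exponent == 0x7FFu && mantissa != 0 && (mantissa >> 51) == 0;
}
```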
the only problem with using MMX in DOS demos nowadays is the almost complete lack of DOSBox support (some custom builds like DOSBox-X support MMX but emulate it extremely slowly, other builds can't cope with it at all), not sure about PCem. So....it's a pity :)