Bad compression ratio for assembler code under crinkler?
category: general [glöplog]
While trying to optimize some 4k intro code by rewriting it in assembler, i ran into a strange result.
It seems that even if assembler gives you a 10-20% gain in uncompressed size, the compressed size is sometimes only 5% better than the C++ version, and in some cases it is even worse, by around 5%! I ran this test on several functions (leaving out intrinsics and small functions like memcpy, which are worth coding in asm) and got the same results each time...
I did the test below on a simple function, and it probably will not reflect the situation for an entire intro developed in asm.
For example, take this simple code for initializing a vertex shader / fragment shader in C++:
Code:
void __forceinline InitProcAdresses() {
    for(int i = 0; i < 12; i++)
        procAdresses[i] = GETPROCADRESS(&procAdressesNames[i][0]);
}

GLuint ShaderCompile(const char *vsh, const char *psh)
{
    InitProcAdresses();
    GLuint shader = glCreateProgram();
    GLuint s = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(s, 1, (const GLchar**)(&vsh), NULL);
    glCompileShader(s);
    glAttachShader(shader,s);
    s = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(s, 1, (const GLchar**)(&psh), NULL);
    glCompileShader(s);
    glAttachShader(shader,s);
    glLinkProgram(shader);
    return shader;
}
The equivalent version in assembler (i didn't test it, the purpose is to have an idea of the size of the code)
Code:
GLuint __declspec( naked ) ShaderCompile(const char *vsh, const char *psh) {
__asm {
push ebp;
push esi;
push edi;
getproc:
mov esi, procAdresses;
mov edi, procAdressesNames;
push edi;
call GETPROCADRESS;
mov [esi], eax;
add esi, 4;
add edi, 24;
cmp edi, procAdressesNames + 12*22;
jl getproc;
// GLuint shader = glCreateProgram();
mov esi, procAdresses;
lodsb;
call eax;
mov ebp, eax; // [shader = EBP]
// GLuint s = glCreateShader(GL_VERTEX_SHADER); [s = EDI]
push GL_VERTEX_SHADER;
lodsb;
call eax;
mov edi, eax;
// glShaderSource(s, 1, (const GLchar**)(&vsh), NULL);
push 0;
lea eax, [esp + 16];
push eax;
push 1;
push edi;
lodsb;
call eax;
// glCompileShader(s);
push edi;
lodsb;
call eax;
// glAttachShader(shader,s);
push edi;
push ebp;
lodsb;
call eax;
// Restart procAdresses
mov esi, procAdresses;
// glCreateShader(GL_FRAGMENT_SHADER);
push GL_FRAGMENT_SHADER;
lodsb;
call eax;
mov edi, eax;
// glShaderSource(s, 1, (const GLchar**)(&vsh), NULL);
push 0;
lea eax, [esp + 20];
push eax;
push 1;
push edi;
lodsb;
call eax;
// glCompileShader(s);
push edi;
lodsb;
call eax;
// glAttachShader(shader,s);
push edi;
push ebp;
lodsb;
call eax;
// glLinkProgram(shader);
push ebp;
lodsb;
call eax;
pop edi;
pop esi;
pop ebp;
ret;
};
}
The fact is that under crinkler the C++ version (compiled with VC2008) compresses better:
Code:
                   Uncompressed  Compressed
ShaderCompile ASM  126 bytes     73 bytes
ShaderCompile C++  143 bytes     67 bytes
My questions are:
- Why such a result? Is it because the code is too small, and should we expect better results on larger code? Since i'm not an x86-killer, are there really some asm tricks & tips to use for smaller code?
- Is it really worth using assembler then? I don't think that 5% makes the difference for an intro
What do you think? What is your practice with this?
Ace: yeah that's not good x86 asm :D
You have the possibility to schedule and group the instructions in ASM so the compression ratio is better, also you can assume that all windows API keep the contents of EBX, ESI & EDI registers intact which can help for even more optimisations.
Shouldn't that be:
lodsd
call eax
?
I didn't even notice that it was for Linux (but i assume there must be some preserved registers as well).
Well it looks based on the source of a certain linux intro which is a constant source of curiosity, it seems ;)
I guess he's talking about Windows though since Crinkler is in the topic.
you're using lots of instructions that the compiler either doesn't use (push with immediate operand, lods*) or uses infrequently (call eax), which means the compressor needs some bytes to adapt to the new opcode distribution at the start of the function, and some more at the end to adapt to the c++ code again. packers don't compress bytes individually, it's all about context.
besides, i haven't checked the compiled function, but i'm pretty certain that it's a lot more regular than your code, which makes it bigger but easier to compress.
that aside, seriously, your asm code is... not good. it's got lots of bugs for one thing (you're using 12*22 but add edi, 24 in the loop - how long are your strings, 22 or 24 bytes? it's lodsd to load a dword, not lodsb; vsh is in [esp+20] after the push 0, not in [esp+16], and similarly for psh with [esp+24]; you reset the pointer to procAddresses, but then it's pointing at glCreateProgram, not glCreateShader as you assume) and it's not particularly well size optimized either.
hitchhikr: look at the code, he does rely on windows preserving ebx, esi, edi and ebp.
anyway, i've tried out how small i can get it without changing the interface, this is the result (also completely untested, of course):
Code:
push ebp;
push esi;
push edi;
mov esi, procAddressesNames;
mov edi, procAddresses;
push edi;
push 12; // update to match new function count here
pop ebp;
getproclp:
push esi;
push eax;
call GETPROCADDRESS;
stosd;
add esi, 24;
dec ebp;
jnz getproclp;
pop esi;
lodsd;
call eax;
xchg eax, edi; // edi=shader
push -2;
pop ebp;
push esi;
shaderlp:
mov esi, [esp];
lea eax, [ebp+GL_VERTEX_SHADER+1];
push eax;
lodsd;
call eax;
push eax;
push edi;
push eax;
push 0;
push dword ptr [esp+44+ebp*4];
push 1;
push eax;
lodsd;
call eax;
lodsd;
call eax;
lodsd;
call eax;
inc ebp;
jnz shaderlp;
pop eax;
push edi;
lodsd;
call eax;
xchg eax, edi;
pop edi;
pop esi;
pop ebp;
ret;
87 bytes uncompressed unless i miscounted somewhere.
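The stack offsets named in the bug list above can be checked with simple arithmetic. This is a minimal sketch, assuming a 32-bit function whose arguments sit right above the return address and callees that pop their own arguments (stdcall, as the GL entry points on Windows do); `arg_offset` is a hypothetical helper for illustration, not something from the code in this thread.

```cpp
#include <cassert>

// Offset from ESP to the Nth stack argument (0-based) of a 32-bit function,
// given how many dwords the function itself has pushed since entry.
// Assumption: arguments of earlier calls were already popped by the callee.
constexpr int arg_offset(int arg_index, int dwords_pushed) {
    return 4                    // return address pushed by the call instruction
         + 4 * arg_index        // earlier arguments
         + 4 * dwords_pushed;   // saved registers and pending pushes
}

// At entry: vsh is at [esp+4], psh at [esp+8].
static_assert(arg_offset(0, 0) == 4, "vsh at entry");
static_assert(arg_offset(1, 0) == 8, "psh at entry");
```

With ebp, esi and edi saved (3 dwords) and the `push 0` already done (4 dwords in total), `arg_offset(0, 4)` gives 20 for vsh and `arg_offset(1, 4)` gives 24 for psh, which matches the correction.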
hitchhikr: linux uses the same register preservation conventions as windows does, the only difference is that one requires the direction flag to be cleared on procedure entry while the other doesn't, i'm not certain which was which :)
I think windows needs a cld.
Nevertheless, both the C & ASM code have more to do with Linux than Crinkler/Windows, as it would be very inefficient on the latter platform.
As for ASM, the ability to directly use a controlled/restricted set of instructions also helps to outmatch any C/C++ code in terms of compressed size, it just takes a little bit more time to craft.
what on earth does crinkler and windows have to do with the efficiency of "C & ASM code"? crinkler doesn't give a shit, and nothing prevents you from doing a context mixing compressor on linux. i actually finished and debugged the kkrunchy 0.23a3 (and following versions) compressor under linux so i could use valgrind (handy to have if you're working with large models in a weird sizeoptimized depacker - it's very easy to accidentally read/write out of bounds).
using a restricted instruction set is not nearly as effective as one would expect because the x86 instruction encoding is quite irregular: e.g. xchg eax, <reg> is 1 byte while xchg <reg>, eax is 2 bytes (as are all forms of mov <reg>, <reg>); you can use signed 8-bit displacements on register-relative addressing, but if the register is esp, it's an entirely different encoding and 1 byte bigger; and so on. even if the instruction looks nearly identical in asm code, it can be entirely different on the opcode level. all these irregularities are why kkrunchys opcode reordering is a relatively large amount of code (>1k before compression); for RISC platforms with orthogonal instruction sets and few different instruction encodings (MIPS, ARM, SPARC), you could get the same effect at a fraction of the size.
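To make the irregularity concrete, here is a small C++ table of 32-bit encodings for a few instructions that show up in this thread. The byte values are the standard IA-32 encodings; the `Encoding` struct and the `encodings()` helper are made up for this illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Encoding {
    const char* text;                 // assembler syntax
    std::vector<std::uint8_t> bytes;  // machine code in 32-bit mode
};

// Instructions that look alike in source but differ at the opcode level.
const std::vector<Encoding>& encodings() {
    static const std::vector<Encoding> table = {
        {"xchg eax, edi",    {0x97}},                          // short form: 0x90 + reg
        {"mov edi, eax",     {0x89, 0xC7}},                    // opcode + ModRM
        {"mov eax, [ebp+8]", {0x8B, 0x45, 0x08}},              // ModRM + disp8
        {"mov eax, [esp+8]", {0x8B, 0x44, 0x24, 0x08}},        // esp base needs a SIB byte
        {"push 1",           {0x6A, 0x01}},                    // imm8 form
        {"push 0x8B31",      {0x68, 0x31, 0x8B, 0x00, 0x00}},  // imm32 form
        {"lodsd",            {0xAD}},
        {"call eax",         {0xFF, 0xD0}},
    };
    return table;
}
```

Reading the sizes off the table: the two register moves differ by one byte, the esp-relative load is one byte longer than the ebp-relative one, and `lodsd; call eax` comes to 3 bytes per API call.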
Quote:
i actually finished and debugged the kkrunchy 0.23a3 (and following versions) compressor under linux
pls to release :(
the compressor. it can still only pack PE executables, not ELFs.
Quote:
what on earth does crinkler and windows have to do with the efficiency of "C & ASM code"? crinkler doesn't give a shit.
I think you didn't get it, by efficiency i meant that Crinkler is handling the API functions loading by itself, something which is done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.
Quote:
using a restricted instruction set is not nearly as effective as one would expect because the x86 instruction encoding is quite irregular: e.g. xchg eax, <reg> is 1 byte while xchg <reg>, eax is 2 bytes (as are all forms of mov <reg>, <reg>); you can use signed 8-bit displacements on register-relative addressing, but if the register is esp, it's an entirely different encoding and 1 byte bigger; and so on. even if the instruction looks nearly identical in asm code, it can be entirely different on the opcode level. all these irregularities are why kkrunchys opcode reordering is a relatively large amount of code (>1k before compression); for RISC platforms with orthogonal instruction sets and few different instruction encodings (MIPS, ARM, SPARC), you could get the same effect at a fraction of the size.
Since we're obviously talking about PC 4k intros here, the instruction set used in such cases is mostly reduced to pushes and calls with a few floating point instructions anyway; the rest of the file is mostly devoted to shaders nowadays, and only the synth would use x86 opcodes (most notably fpu instructions).
Quote:
I think you didn't get it, by efficiency i meant that Crinkler is handling the API functions loading by itself, something which is done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.
Erm, no. GETPROCADDRESS != GetProcAddress (if you look up the function, you'll see that it takes two arguments), it's wglGetProcAddress. The default Win32 OpenGL implementation is still only OGL 1.2, if you want anything more that means ARB extensions, which means wglGetProcAddress. This is not simply a GetProcAddress on OPENGL32.DLL (i.e. you can't just directly import it and screw the middleman); the implementation is in a different DLL that's vendor-specific (nvoglnt.dll for nvidia, don't remember the name for ATI). Also, at least for NV, the functions aren't even exported in the DLL, it's just an internal table of names+function pointers somewhere in the file that crinklers import code can't possibly find or use.
Quote:
done manually in the code above and is (so far) required for Linux small sized prods but would be a waste under Windows in such context.
Well as ryg says the GETPROCADDRESS is wglGetProcAddress here. The cross-over with the linux source code I believe the code to be based on is that by ordering the function table you can do all API calls using 'lodsd; call eax'. This isn't an overhead with the import code I was using under linux as what you get back is a list of function addresses anyway. I think you know this already though as we discussed it on irc ages ago...
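The ordered-table idea can be sketched in C++ as a cursor walking an array of function pointers, which is roughly what `lodsd; call eax` does with ESI. This is only a mock under stated assumptions: the stubs and `fakeGetProcAddress` are hypothetical stand-ins for the GL entry points and wglGetProcAddress, and all stubs share one signature, which the real functions do not.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

using Proc = void (*)();

std::vector<std::string> g_log;  // names of the stubs that ran, in call order

// Hypothetical stubs; no real GL is involved.
void stubCreateProgram() { g_log.push_back("glCreateProgram"); }
void stubCreateShader()  { g_log.push_back("glCreateShader"); }
void stubShaderSource()  { g_log.push_back("glShaderSource"); }
void stubCompileShader() { g_log.push_back("glCompileShader"); }
void stubAttachShader()  { g_log.push_back("glAttachShader"); }
void stubLinkProgram()   { g_log.push_back("glLinkProgram"); }

constexpr int kCount  = 6;
constexpr int kStride = 24;  // one constant for the name slot size, so stride and bound agree

// Names stored in call order, each padded to kStride bytes.
const char kNames[kCount][kStride] = {
    "glCreateProgram", "glCreateShader", "glShaderSource",
    "glCompileShader", "glAttachShader", "glLinkProgram",
};

// Hypothetical stand-in for wglGetProcAddress.
Proc fakeGetProcAddress(const char* name) {
    struct Entry { const char* name; Proc fn; };
    static const Entry registry[] = {
        {"glLinkProgram", stubLinkProgram},     {"glAttachShader", stubAttachShader},
        {"glCompileShader", stubCompileShader}, {"glShaderSource", stubShaderSource},
        {"glCreateShader", stubCreateShader},   {"glCreateProgram", stubCreateProgram},
    };
    for (const Entry& e : registry)
        if (std::strcmp(name, e.name) == 0) return e.fn;
    return nullptr;
}

Proc g_procs[kCount];

void initProcs() {
    for (int i = 0; i < kCount; ++i)
        g_procs[i] = fakeGetProcAddress(kNames[i]);
}

void runShaderSetup() {
    const Proc* cursor = g_procs;
    (*cursor++)();                     // glCreateProgram
    const Proc* shaderStart = cursor;  // now pointing at glCreateShader
    for (int pass = 0; pass < 2; ++pass) {
        cursor = shaderStart;          // rewind to here, not to the table start
        for (int k = 0; k < 4; ++k) (*cursor++)();
    }
    (*cursor++)();                     // glLinkProgram
}
```

Calling `initProcs()` and then `runShaderSetup()` logs the program creation, the four per-shader calls twice, and the link, in that order. The rewind position is the detail that matters: resetting the cursor to the table start would call glCreateProgram where glCreateShader was intended.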
Ah yeah i remember i helped you craft that stuff some time ago, but as i'm old & tired i can't remember everything. These are indeed OGL extension functions, guess i haven't used those for a while.
But still, Windows comes with DirectX which wouldn't require all these imports so for that OS this is not the most efficient solution ;D
not necessarily. d3d has more setup code, you need d3dx to get at the hlsl compiler, and hlsl has more red tape than glsl (you need to declare all the inputs/outputs, for example). or you could use compiled shaders, but they're pretty big.
Thanks hitchhikr, parapete and ryg for your remarks. Yes the original asm code is not good, i agree (but be indulgent, after 20 years of programming in "high level language" it's quite hard to come back to pure assembler - i miss the 16 registers in 68000! ;) ) . As i said, i didn't test it and it was just a quick prototype to evaluate the size (and yes, based on a "you massive clone" sample for calling the gl functions ;) )
Ryg, your version is nice. But as you said, the compression is based on the context, and because the rest of the code is still in c++, the compression is not working well. That's probably why even with your version, i still get 74 bytes compressed with crinkler compared to the 67 bytes in c++. Probably a whole demo coded in asm would be better compressed...
Anyway, i need to practice a bit more x86 asm, it's probably worth it. But i'm also very surprised that the c++ version is going so low after compression, and it's not the first time i have encountered this.
Everybody is using d3dx (you pushed it yourself) and that's the kind of dll that crinkler imports with great benefits.
The size of the shaders may be roughly equivalent all things considered, but i think HLSL has a more relaxed (and smaller) syntax than GLSL, which results in smaller shaders that also pack better (i'll verify that someday, eventually). Also, there are some discrepancies between the ATI & Nvidia GLSL implementations, so some optimizations aren't really safe unless you provide 2 versions of your intro or leave them out, of course.
This is less crucial for 4k intros than for 1k effects, tho.
http://pouet.net/prod.php?which=52938 << HLSL
http://pouet.net/prod.php?which=52940 << GLSL
Pick one.
i didn't push anything, and i'm pretty certain that none of our prods needs d3dx (though i may be mistaken).
Now now, chaps.
@lx: While it might sound silly, there is a significant difference between asm and 100% asm.
You might get some space saving by rewriting parts of your code into asm, but, as you have seen, it is not much. The really big saving will come when you write everything in asm. This will make the code more uniform across the intro, making it compress better.
Also make sure that you stick to a certain coding style all the way. Use the same register for the same thing always. Push the same set of registers onto the stack in every function, even if some are not used. In general, use large but similar constructs to do similar things rather than small but different ones.
Your lodsd, call eax idiom is probably good and compact when this is the dominant way to call API functions, but if you get many more API calls which are called in the usual call [function pointer address] way, it might be better to change it to only use this way.
But whatever you do, don't assume that some way will be better than some other way. Try both.
Happy 100% asm coding! :-D
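The "large but similar beats small but different" advice can be illustrated with a toy adaptive model. This is a rough sketch only, not Crinkler's actual compressor: it assumes a plain order-1 byte model (the previous byte is the context) with an arbitrarily chosen smoothing constant, and it reports the ideal coded size in bits. `ToyModel`, `regularBlock` and `irregularBlock` are names invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr double kAlpha = 1.0 / 16.0;  // smoothing for unseen bytes (arbitrary choice)

// Toy order-1 adaptive model: predicts each byte from the previous one.
class ToyModel {
public:
    ToyModel() : counts_(256 * 256, 0.0), totals_(256, 0.0), context_(0) {}

    // Returns the ideal coded size of `data` in bits and keeps adapting.
    double feed(const std::vector<std::uint8_t>& data) {
        double bits = 0.0;
        for (std::uint8_t byte : data) {
            double& count = counts_[context_ * 256u + byte];
            double& total = totals_[context_];
            bits -= std::log2((count + kAlpha) / (total + 256.0 * kAlpha));
            count += 1.0;
            total += 1.0;
            context_ = byte;
        }
        return bits;
    }

private:
    std::vector<double> counts_;  // counts_[context * 256 + byte]
    std::vector<double> totals_;  // bytes seen so far per context
    unsigned context_;            // previous byte
};

// 48 bytes: "push edi; push 1; lodsd; call eax" repeated 8 times.
std::vector<std::uint8_t> regularBlock() {
    const std::uint8_t construct[6] = {0x57, 0x6A, 0x01, 0xAD, 0xFF, 0xD0};
    std::vector<std::uint8_t> out;
    for (int rep = 0; rep < 8; ++rep)
        out.insert(out.end(), construct, construct + 6);
    return out;
}

// 30 bytes that never repeat, standing in for compact but varied code.
std::vector<std::uint8_t> irregularBlock() {
    std::vector<std::uint8_t> out;
    for (int i = 0; i < 30; ++i)
        out.push_back(static_cast<std::uint8_t>(3 + 7 * i));
    return out;
}
```

Fed to fresh models, the 48-byte regular block costs fewer bits than the 30-byte irregular one, and feeding the regular block to the same model a second time is cheaper still, since the model has already adapted to it. Real numbers will differ, so the "try both" advice still stands.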
Well the issue between NVidia and ATI for GLSL is simple : ATI follows the specification rules, and NVidia makes their own :P
Your best bet without an ATI card to test is AMD GPU ShaderAnalyzer or some such program that compiles the shader code using Catalyst drivers so if your shader compiles in there, it SHOULD work on an ATI card (and it even gives you which cards it's safe on). Problem is, "SHOULD" is not "WILL"...
Anyways another thing to note is that function name loop etc is proven to be larger than directly using the function names in C.
Let me give an example:
Code:
p = ((PFNGLCREATEPROGRAMPROC) wglGetProcAddress("glCreateProgram"))();
s = ((PFNGLCREATESHADERPROC) wglGetProcAddress("glCreateShader"))(GL_VERTEX_SHADER);
((PFNGLSHADERSOURCEPROC) wglGetProcAddress("glShaderSource"))(s,1,&shaders_vertex,NULL);
((PFNGLCOMPILESHADERPROC) wglGetProcAddress("glCompileShader"))(s);
((PFNGLATTACHSHADERPROC) wglGetProcAddress("glAttachShader"))(p,s);
s = ((PFNGLCREATESHADERPROC) wglGetProcAddress("glCreateShader"))(GL_FRAGMENT_SHADER);
((PFNGLSHADERSOURCEPROC) wglGetProcAddress("glShaderSource"))(s,1,&shaders_fragment,NULL);
((PFNGLCOMPILESHADERPROC) wglGetProcAddress("glCompileShader"))(s);
((PFNGLATTACHSHADERPROC) wglGetProcAddress("glAttachShader"))(p,s);
((PFNGLLINKPROGRAMPROC) wglGetProcAddress("glLinkProgram"))(p);
((PFNGLUSEPROGRAMPROC) wglGetProcAddress("glUseProgram"))(p);
Using the strings this way without the getprocaddress loop is actually smaller, though it looks like it wouldn't be.
If you're curious, here's my ASM (NASM) equivalent:
Code:
; Shader init
%ifndef _4K_NO_SHADERS_
; Create program
push sn_glCreateProgram
call wglGetProcAddress
call eax
mov [p],eax
%ifndef _4K_NO_VERTEX_SHADER_
; Vertex shader
push sn_glCreateShader
call wglGetProcAddress
push 8b31h ; GL_VERTEX_SHADER
call eax
mov [pfd],eax
push sn_glShaderSource
call wglGetProcAddress
push byte 0
; Have to pass a pointer pointer (not a typo) because glShaderSource only accepts string arrays.
mov [esi],dword shaders_vertex
push esi
push byte 1
push dword [pfd]
call eax
push sn_glCompileShader
call wglGetProcAddress
push dword [pfd]
call eax
push sn_glAttachShader
call wglGetProcAddress
push dword [pfd]
push dword [p]
call eax
%endif
; Fragment shader
push sn_glCreateShader
call wglGetProcAddress
push 8b30h ; GL_FRAGMENT_SHADER
call eax
mov [pfd],eax
push sn_glShaderSource
call wglGetProcAddress
push byte 0
mov [esi],dword shaders_fragment
push esi
push byte 1
push dword [pfd]
call eax
push sn_glCompileShader
call wglGetProcAddress
push dword [pfd]
call eax
push sn_glAttachShader
call wglGetProcAddress
push dword [pfd]
push dword [p]
call eax
; Link and bind
push sn_glLinkProgram
call wglGetProcAddress
push dword [p]
call eax
push sn_glUseProgram
call wglGetProcAddress
push dword [p]
call eax
%endif
...and the data section:
Code:
%ifndef _4K_NO_SHADERS_
%ifndef _4K_NO_VERTEX_SHADER_
shaders_vertex: ; Vertex shader
db "void main(){"
db "gl_Position=ftransform();"
db "}",0
%endif
shaders_fragment: ; Fragment shader
db "void main(){"
db "gl_FragColor=vec4(1);" ; Write pixel
db "}",0
sn_glCreateProgram: ; Names of shader procs
db "glCreateProgram",0
sn_glCreateShader:
db "glCreateShader",0
sn_glShaderSource:
db "glShaderSource",0
sn_glCompileShader:
db "glCompileShader",0
sn_glAttachShader:
db "glAttachShader",0
sn_glLinkProgram:
db "glLinkProgram",0
sn_glUseProgram:
db "glUseProgram",0
%endif
This seems pretty optimal to me.
Quote:
d3d has more setup code
Are you sure? IIRC my minimal D3D9 setup-code is smaller than my minimal GL setup-code. I mean, it's a call to Direct3DCreate9() followed by a COM-call to IDirect3D9::CreateDevice(). In OpenGL you need to call ChangeDisplaySettings(), ChoosePixelFormat(), SetPixelFormat(), GetDC(), wglCreateContext() and wglMakeCurrent(). I'm not sure if I've ever measured the two snippets against each other, the D3D9-code sure looks a lot simpler.
no, not sure. i have to admit i never measured it :)
I also tend to think the d3d9 startup code is smaller. Also, in case you wanted to make a non shader-based intro (??) and setup some antialiasing, then d3d9 wins by some hundred bytes.