
Bonzomatic Memory Leak Bounty

category: code
 
Maybe you can help me with an issue we've been trying to figure out regarding Bonzomatic.

In short, when we use the latest GLFW version of networked Bonzomatic there is this weird behavior that starts creeping up: the previous frame texture buffer starts getting corrupted. It's only noticeable when we have multiple instances open after a long session (roughly 30 minutes with 20 Bonzomatics open). A single running Bonzomatic instance also starts displaying this behavior, but it typically takes longer to show up.

There is a hotkey to restart the timing on Bonzomatic which clears up the issue (F1 i think, have to double check), which is what we've been using when things start looking weird. We did not test if this also happens with DirectX (we typically recommend people use the GLFW version because the shader compile doesn't risk freezing the machine like DX sometimes does). And we didn't notice any similar leak behavior on the previous version of Bonzomatic, which did not have the previous frame texture available, so no clue if the leak comes from there specifically or from somewhere else, but that's where it first manifests visually on the newer version.

This weird behavior was mostly fine for the online editions: i would just remember to press the hotkey every 30 mins and no one would notice. I was using a 970GTX for those.

However, at Inércia this year we used a 3080TI instead, and for some reason, instead of getting corrupt textures, the entire machine would crash: sometimes within a few seconds of launching all the Bonzomatics, other times lasting over 10-15 minutes without crashing! A setup that crashes the whole machine during an event is not usable. We wanted to run more tests to isolate whether the issue was faulty hardware, the new/different OpenGL drivers for the 3080TI, or Bonzomatic itself, but the machine's motherboard died shortly after Inércia, so we couldn't do any further tests on that specific machine. It's worth mentioning the machine didn't crash at any other point during Inércia, for any other competitions or demos playing (nor during the previous week, when it was being stress tested with videogames and demos); the crash only happened (and multiple times) during the shader royale itself. The Windows error log just reported an OpenGL driver error.

So i'm looking for someone experienced in either patching memory leaks or dealing with weird OpenGL driver issues to take a look at Bonzomatic and figure out what could be done to ensure the leak/crash won't occur in the future.

We're planning to use this setup again at the Revision and Sessions events next month. I'm fairly confident the crash itself was hardware, and we can always fall back to the previous version of Bonzomatic (without the previous frame texture) if the crash shows up when testing the machines we'll use, but either way, getting to the bottom of the leak would be best, to make the setup more robust overall.

Anyone with some time and expertise willing to take a look at this or share their findings? I'll buy beer at Revision for whoever helps resolve this!
added on the 2023-03-22 15:47:51 by psenough
well, I only looked at the code for like 5 minutes, so I might be missing something important here, but:

- The CreateBlaTexture() functions in platform_glfw/Renderer.cpp all allocate new instances of the Texture class
- There is a ReleaseTexture() function, which calls glDeleteTextures
- But it seems the actual Texture objects (class instances) are never freed/deleted

This can be an issue if you (re-)create many new textures over time.

in main.cpp, for example, this seems to leak memory (and it is indeed related to the previous frame texture buffer you mentioned):

Code:
if (Renderer::nSizeChanged)
{
  Renderer::ReleaseTexture(texPreviousFrame);
  texPreviousFrame = Renderer::CreateRGBA8Texture();
  Capture::CaptureResize(Renderer::nWidth, Renderer::nHeight);
}
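
A fix would presumably be to free the instance together with the GL object. Just a sketch of the idea, untested, and with the struct's field name guessed (assuming ReleaseTexture owns the pointer and nobody uses it afterwards):

Code:
void Renderer::ReleaseTexture(Renderer::Texture* tex)
{
  glDeleteTextures(1, (GLuint*)&tex->uiOpenGLID); // field name is a guess
  delete tex; // also free the heap-allocated Texture instance
}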
added on the 2023-03-23 08:50:27 by arm1n
That does look like a leak. But nSizeChanged is only set to true when the window size changes (in window_size_callback), and the Texture struct is very small, so I don't think it's what's causing the issue.
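
For reference, roughly what happens in the GLFW backend; paraphrased sketch, not the verbatim repo code:

Code:
// paraphrased sketch of the resize path, not the verbatim repo code:
static void window_size_callback(GLFWwindow* window, int width, int height)
{
  Renderer::nWidth = width;
  Renderer::nHeight = height;
  Renderer::nSizeChanged = true; // picked up once per frame in main.cpp
}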

I somehow wonder about this motherboard having died... the particular kind of stress bonzomatic causes on the GPU may have caused the machine to heat up more than in the other situations. Or some fan connector bugged out during the party, given that the machine was stress tested before. Just a thought.

The issue could also be unrelated to the corrupted textures observed on the older card.

psenough, can you repro it with another machine?
added on the 2023-03-23 09:30:15 by jco
arm1n: thank you for looking into this! not sure of the best way to patch it, but it's a good reference.

jco: after the initial crash we lowered the max temp of the card during the royale, to make sure it wasn't an overheating problem causing the crash, and we also checked the motherboard temperature levels and they looked fine, but yeah, it could still have been something like that. it crashed anyway, even with the lower max temperature.

as for reproducing it on other machines: the crash, no, i haven't seen it crash another machine since. the leak starting to corrupt the texture, yes: for example the fieldfx orgas saw the same issue when they organized a similar event, and some of the bonzomatic stream coders have also seen it occur after long usage (over an hour). but it has only led to crashes during inércia. we assume the leak is behind the crash because the windows error report said opengl driver error when crashing, but yeah, i agree it could be unrelated to the leak. then again, why didn't the machine crash at any other point during the event, when we were playing demos that pushed the cpu and gpu just as much? mysteries of computing :)
added on the 2023-03-23 09:49:36 by psenough
Why do you keep using the word "memory leak" when I can't see anything in the description that seems to indicate this being an actual memory leak?

A memory leak isn't just "something going wrong over time"; it's a specific class of bugs where an application drains the system of memory over time. Has this actually been analyzed closely enough to show that it is indeed a memory leak? A typical way to analyze this would be to run under a leak detector, or to track process memory usage over time. Video memory leaks are often a bit harder to track, because GPU memory often doesn't show up in such tools.
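
For example, on Windows with MSVC, even the debug CRT can give a basic heap-leak report at exit — a minimal sketch, and note it only covers CPU-side allocations, not GL objects:

Code:
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

int main()
{
  // report any heap blocks still allocated when the process exits
  _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
  // ... run the application ...
  return 0;
}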

The reason I'm asking is that if you tell people this is a memory leak issue and it isn't, you might end up wasting everyone's time and getting no closer to a solution.
added on the 2023-03-23 10:36:15 by kusma
yeah, what kusma said. I also doubt it's a memory leak; even a buffer overrun corrupting the framebuffer textures is highly unlikely, because usually you can't write to them from the CPU side anyway.
The functions GrabFrame() and CopyBackbufferToTexture() looked OK to me at first glance.

As it stands it could be any number of things including GL driver bugs, hardware failures, or problems in completely non-rendering related code parts.
added on the 2023-03-23 10:56:17 by arm1n
Some screenshots of what the corruption looks like would be useful in trying to diagnose potential culprits, BTW.
added on the 2023-03-23 10:58:49 by kusma
it's the only way i found to describe it. it might indeed not be technically accurate if it only affects the texture and doesn't show any symptoms of actual RAM use increasing, i'll concede you that.

i'm not very experienced in this area, as you probably guessed. the texture starts getting garbled with the text font and then white/noise, so it seemed like a memory leak to me; i didn't really make the distinction in my head between a "regular" memory leak and a video memory leak.

i remember monitoring RAM use over time (in task manager) during the first tests with > 12 bonzomatics, and memory use was stable, not really moving upwards the way memory leaks do. but i'm unsure if that test was before or after the previous frame buffer update, probably before; i'll do another one today to double check. i haven't tried any dedicated leak detection or tracking tool, never had to do any of that professionally, so no clue what's the best/easiest/most accurate for this case in particular, which is why i'm asking for help publicly.
added on the 2023-03-23 11:00:06 by psenough
The symptom, from what i remember and understood, was like this:

==
- At the beginning, everything works as expected: you can use texPreviousFrame and it behaves properly
- But at some point (we can't determine exactly when or what triggers it), it stops working and somehow gets switched to the "hidden font texture".
- As this texture is a vec4 with just the alpha component set up, it can also "fail" silently, and it's not easy to "screenshot". It doesn't create a particular artifact on the screen or anything like that, it just "swaps" the texture.
==

As a side note, I guess this font texture is used for the text editor rendering, and I call it "hidden" because you can access it from the fragment shader by adding an extra sampler / texture with whatever name.
added on the 2023-03-23 11:23:37 by totetmatt
Quote:
the texture starts getting garbled with the text font and then white/noise, so it seemed like a memory leak to me

Wouldn't a video memory leak result in API calls that fail with "out of memory", rather than memory corruption?
added on the 2023-03-23 12:29:16 by absence
been running multiple instances of bonzomatic offline with the shaders from inércia for a few hours now on my 970GTX, trying to replicate the issue and get a capture of it. the issue hasn't come up yet after 2 hours of random use (similar to what i did at the royale).

maybe it's the more constant recompiles from livecoding changes that act as a trigger. i'm also running this at a lower resolution than usual, starting to wonder if that could be why it's not happening so far.

looking at memory use in task manager, it's been steady, no increase throughout the whole session. so yeah, not a traditional memory leak.

totetmatt is working on a stress test script that simulates the recompiles, to try to get a way to reproduce the issue directly.
added on the 2023-03-23 13:14:34 by psenough
tried another session, this time with totetmatt's scripts simulating live recompiles for multiple users every second. i also switched window resolutions a lot more, by hiding and showing a user, which forces window resizing; abusing that action might be the culprit. managed to replicate the bug at 50:12 on this video (still uploading while i type this). You can see the corruption creeping up as font/text characters on some of the shaders getting recompiled every second, and on rimina's static offline shader on the bottom right (picked because we knew it was one of the shaders that would show the bug) you can see the bug makes everything get darker; when i force Rimina's shader to recompile, it then gets extremely bright instead (still not what it should be). Deleting that single bonzomatic window and adding it back as a new window is the only way to get the shader working properly again; the other instances keep the visual bug.
added on the 2023-03-23 14:56:45 by psenough
By latest version, do you mean master of the repository you linked? Did the issue appear with older versions, or was it recently introduced? To me the described behavior sounds like the sampler setup gets mixed up; maybe the text-editor renderer and the on-demand shader recompiling interfere with each other and corrupt state due to some unfortunate timing.
added on the 2023-03-24 04:04:27 by LJ
yeah, I also have the gut feeling there is something going on in between the

Code:
Renderer::StartTextRendering();
...
Renderer::EndTextRendering();


sandwich. There is even a suspicious comment:

Code: // avoid a strange triangular glitch on coder name when it's the last thing drawn
added on the 2023-03-24 09:32:43 by arm1n
(in main.cpp)
added on the 2023-03-24 09:33:37 by arm1n
Quote:
Why do you keep using the word "memory leak" when I can't see anything in the description that seems to indicate this being an actual memory leak?


kusma:
That Texture struct is newed, but the previous one is not deleted when the window size changes; that certainly looks like a few bytes of heap leaking each time it happens. The GL texture DOES get destroyed.
I don't think it's related to the crash though, and it would also take forever to even be noticeable.

For the crash: we need a repro. I'd probably start by looking for GL-specific issues, race conditions, uninitialized stuff, things that can cause undefined behavior.
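
For the GL side specifically, a debug context can make the driver report what it's unhappy about before things blow up. A sketch, not code from the repo (assumes a GL 4.3 / KHR_debug capable context and loaded function pointers):

Code:
#include <cstdio>

// driver messages (errors, warnings, perf hints) end up here:
static void APIENTRY DebugCallback(GLenum source, GLenum type, GLuint id,
  GLenum severity, GLsizei length, const GLchar* message, const void* userParam)
{
  fprintf(stderr, "GL debug: %s\n", message);
}

// call glfwWindowHint(GLFW_OPENGL_DEBUG_CONTEXT, GLFW_TRUE) before
// glfwCreateWindow(), then after context creation:
void EnableGLDebugOutput()
{
  glEnable(GL_DEBUG_OUTPUT);
  glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); // report on the offending call
  glDebugMessageCallback(DebugCallback, NULL);
}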
added on the 2023-03-24 10:21:06 by jco
Copying my investigation from Discord:
Quote:
So I had a look with my poor and shitty C and OpenGL skills


# Window resize vs texPreviousFrame
When there is a window resize, it looks like it's not behaving the way you told me, ps; it actually recreates a new texture at every change:

github.com/TheNuSan/Bonzomatic/blob/master/src/main.cpp#L888-L893

So texPreviousFrame is recreated with the new resolution at each window resize.

I don't know if the window event handler is multithreaded, but if it is, it means that while resizing, things could enter a state where texPreviousFrame doesn't match the frame size, which would be unexpected behaviour?

Possible but then i saw...

# textureUnit limit?
When a texture is created, there is this "textureUnit" counter which is incremented (and never decremented):

github.com/TheNuSan/Bonzomatic/blob/master/src/platform_glfw/Renderer.cpp#L760

If I understand correctly, it's used during the rendering phase to select the texture unit to bind to (again, i'm an OpenGL n00b). But could it be that it eventually hits some limit like 32 or 64 (or whatever the texture unit limit on the GPU is), so that resizing during a live session strikes this limit and makes the textureUnit unusable? That would explain:
- why the other textures are OK and it somehow switches to the hidden font texture
- why it also impacts other shaders when using the loader / VJ script, because even hiding = resizing, so the unit keeps incrementing


(No glöp points for links, so sorry for no automatic link)

Nusan had a closer look, and it looks like this might indeed be the issue. A release with a patch has been built and will be tested to confirm whether it fixes the issue.
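
For reference, the actual unit limit can be queried from the driver, and i guess the general shape of such a fix is to recycle units instead of incrementing forever. A hypothetical sketch with made-up names, NOT the actual patch (assumes GL headers/loader are included):

Code:
#include <stack>

static int textureUnitCounter = 0;
static std::stack<int> freeTextureUnits;

static int AcquireTextureUnit()
{
  if (!freeTextureUnits.empty())
  {
    int unit = freeTextureUnits.top(); // reuse a previously released unit
    freeTextureUnits.pop();
    return unit;
  }
  // how many combined texture image units this GPU/driver actually allows:
  GLint maxUnits = 0;
  glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS, &maxUnits);
  if (textureUnitCounter >= maxUnits)
    return -1; // out of units: the old code would silently run past this
  return textureUnitCounter++; // otherwise hand out a fresh one
}

static void ReleaseTextureUnit(int unit)
{
  freeTextureUnits.push(unit); // make it available again
}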
added on the 2023-03-24 10:56:07 by totetmatt
tested it again today: managed to reproduce the bug after 30 seconds of constant resizing, and the patched version doesn't show any symptoms even after over 2 mins of constant resizing. so it seems to be patched!!! \o/

thank you everyone for your insights, it helped get this ironed out! it had been creeping around for over a year without anyone knowing the exact culprit. and now it's gone!
added on the 2023-03-24 13:28:19 by psenough
OK, so it seems this was a texture-unit overrun instead of a memory leak. They're not completely unrelated, but that's a good find! Great work, people!
added on the 2023-03-24 14:17:38 by kusma
Nice, that was interesting to follow.
added on the 2023-03-27 20:11:56 by jco
