17:29 imirkin: skeggsb: i'm planning on looking into more HDR-related bits for nouveau. can i assume this is an area you have not really looked at / are not interested in looking at yourself?
17:29 imirkin: [beyond making the LUT stuff work properly a while back]
17:52 karolherbst: imirkin: it's on the things we want to do at least
17:52 imirkin: ok. well, i don't want to collide with other peoples' work
17:52 imirkin: esp since my work is going to proceed at a highly unpredictable rate
17:53 karolherbst: yeah.. let me check
17:53 karolherbst: doesn't seem to be part of what we want to do for Nouveau (we as RH graphics team)
17:54 imirkin: ok
17:56 karolherbst: and as far as I can tell, I doubt skeggsb was working on it.. but to be sure you can always wait until he responds
17:57 imirkin: nah, i mean i'll start doing little things
17:57 imirkin: if it turns out he already has a series, so be it
19:27 karolherbst: imirkin: ohh, btw, I was posting a series for optimizing constants into the driver const buffer, and I was wondering if you have any opinion on that: https://lists.freedesktop.org/archives/mesa-dev/2019-March/216041.html
19:27 airlied: imirkin: we did buy skeggsb a HDR monitor :)
19:28 karolherbst: airlied: I also have an eGPU case :p
19:28 karolherbst: I still haven't gotten around to writing any meaningful patch yet
19:28 karolherbst: (although I kind of understand the brokenness of nouveau when it comes to just removing the device)
19:29 karolherbst: but yeah.. I kind of suspected that skeggsb would like to look into it
19:30 * airlied could get an egpu case but I've got no thunderbolt machines :-P
19:31 karolherbst: usually they are quite nice as you can change your GPU without rebooting, but yeah, without a TB port you are kind of screwed
19:31 karolherbst: get a better motherboard :p
19:33 * karolherbst trying to fix this silly HMM memory double free bug
19:33 imirkin: airlied: heh finally :)
19:37 imirkin: karolherbst: not a huge fan of the fact that you reupload imms on every validation
19:37 karolherbst: yeah...
19:37 karolherbst: I don't like that either
19:37 imirkin: :)
19:37 karolherbst: I was mostly writing the code to see what the benefit would be
19:37 imirkin: calim had previously either done or thought of doing something like that
19:37 karolherbst: maybe we should collect the 1000 most used immediates and have a static table? dunno
19:37 imirkin: i eventually removed it since it was only like 1% there in the code
19:38 imirkin: the c14_imms thing or whatever
19:38 karolherbst: ahh
19:38 imirkin: it's a lot more beneficial for nv50 actually
19:38 imirkin: since it's a LOT more restrictive in terms of where imms can go
19:38 karolherbst: mhh, interesting
19:38 karolherbst: but maybe with a static table we could have a good enough benefit? dunno what the widely used constants are
19:39 imirkin: nah
19:39 karolherbst: would save us the reupload
19:39 imirkin: i like the general approach
19:39 imirkin: but if you could keep better track of which shaders' imms you had most recently uploaded, i'd feel a little better
19:39 karolherbst: reworking insnCanLoad was fun
19:40 karolherbst: imirkin: I was also thinking about having one table for the entire runtime and just append at the end and deduplicate, etc...
19:41 karolherbst: I think I actually added deduplication already
19:41 karolherbst: and if we run out of space, we run out of space
19:41 imirkin: could make sense
19:41 imirkin: otoh i dunno what the benefit is
19:42 karolherbst: less reuploads
19:42 karolherbst: or none
19:42 imirkin: should be the same amount...
19:42 karolherbst: why?
19:42 karolherbst: with the current approach, we have to reupload the entire imm buffer
19:42 imirkin: and as you link diff shaders together, could be trouble
19:42 karolherbst: but if we have one buffer for _all_ shaders?
19:42 imirkin: oh, you mean GLOBALLY?
19:42 karolherbst: yeah
19:43 karolherbst: so we just upload new imms
19:43 karolherbst: well, if the buffer gets bigger, we just upload more data
19:43 karolherbst: otherwise we just keep whatever is in there
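The global-table idea from the lines above can be sketched roughly as follows: one append-only table shared by all shaders, deduplicated on insert, where only the newly appended tail needs to be uploaded. This is a toy model, not nouveau code; the names `ImmTable`, `add`, `pending_upload` and the capacity are all hypothetical.

```python
class ImmTable:
    """Append-only, deduplicated table of 32-bit immediates (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of entries (assumed limit)
        self.values = []              # entries in the order they would be uploaded
        self.index = {}               # value -> slot, used for deduplication
        self.uploaded = 0             # how many entries are already on the GPU

    def add(self, imm):
        """Return the slot holding imm, or None if the table is full."""
        imm &= 0xFFFFFFFF
        slot = self.index.get(imm)
        if slot is not None:
            return slot               # already present, nothing new to upload
        if len(self.values) >= self.capacity:
            return None               # "if we run out of space, we run out of space"
        slot = len(self.values)
        self.values.append(imm)
        self.index[imm] = slot
        return slot

    def pending_upload(self):
        """Return (offset, new entries) and mark them as uploaded."""
        start = self.uploaded
        self.uploaded = len(self.values)
        return start, self.values[start:]
```

A shader whose immediates are all present already would then trigger an empty upload, which is the "less reuploads, or none" case. A caller that gets `None` back would have to leave that immediate in the instruction stream.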
19:43 imirkin: and do some kind of clever thing about keeping stuff around? dunno.
19:43 imirkin: also messes bigtime with the shader cache
19:43 karolherbst: on maxwell we have 18 buffers.. but still only 8 for compute shaders :/
19:43 karolherbst: mhhh, do we actually have a shader cache so far?
19:44 imirkin: i thought we did?
19:44 karolherbst: we only have the TGSI one, right?
19:44 imirkin: maybe i thought wrong
19:44 karolherbst: yeah.. because if we did, we would be able to serialize and deserialize our shader
19:44 karolherbst: and I am sure we don't support that
19:44 karolherbst: (also we had this discussion already about caching native shaders)
19:45 imirkin: maybe not =/
19:45 karolherbst: but yeah... that would be a concern indeed
19:45 imirkin: we should though
19:45 imirkin: there's no reason we can't
19:45 karolherbst: yeah...
19:45 karolherbst: we need it for CL anyway (kind of)
19:45 imirkin: should be trivial to implement. just have to serialize the code, relocs, and fixups
19:45 karolherbst: yeah
19:45 karolherbst: I have some WIP stuff somewhere
19:51 karolherbst: imirkin: I kind of hoped you would have some very clever idea on how to make it suck less..
19:51 imirkin: hm
19:52 imirkin: actually i dunno that what you have is all bad... dunno.
19:52 karolherbst: maybe I should just go ahead and benchmark it somehow... two shaders with tons of constants, switching between them or something and see how much it impacts draws/sec or so
19:53 imirkin: or just something like unigine
19:53 karolherbst: maybe
19:54 karolherbst: I think I actually did and it didn't have any significant impact
19:54 karolherbst: mhh, maybe we should have some constbuffer_size/draw counter? That would be a fun thing to have
19:57 karolherbst: or maybe check out how nvidia is doing that, as they do this optimization as well afaik
20:36 endrift: Is there a list of maximum supported versions of GLES in nouveau for different cards?
20:36 endrift: I'm curious what the maximum supported ES version of the TX1 is
20:36 endrift: I can make a 3.1 context but I haven't tried 3.2
20:37 imirkin: ES 3.2 should be supported on all fermi+, i believe
20:37 imirkin: at least nominally
20:38 endrift: cool thanks
20:38 endrift: Not sure what was added in 3.2 but it's good to know
20:38 imirkin: hasn't been updated in a while, but there's also https://people.freedesktop.org/~imirkin/glxinfo/
20:38 endrift: I might be writing Switch homebrew :)
20:38 imirkin: ES3.2 = ES 3.1 + AEP, basically
20:38 endrift: I see AEP and I think Freenode user aep in #qt <_<
20:38 imirkin: ANDROID_extension_pack_es31a or whatever
20:38 endrift: aha
20:39 endrift: btw should I try my Nvidia cards in my G5 again? It's been a long time and I heard some stuff might have gotten fixed
20:39 imirkin: you were the guy with a NV40 AGP, right?
20:39 endrift: There's a radeon in there right now but it only supports GL 2.0
20:39 endrift: not a guy but yes
20:39 imirkin: the individual, sorry :)
20:40 endrift: yes, that was me :)
20:40 imirkin: so unfortunately in the meanwhile my own G5 died
20:40 endrift: oh no :(
20:40 imirkin: so that's pretty much the end of the line as far as my support of all that goes
20:40 imirkin: not that i was providing anything particularly tremendous...
20:41 imirkin: iirc the issue you were having was way beyond my understanding, and can pretty much only be debugged by you (or someone with the hw)
20:41 endrift: If I knew where to start I may try that
20:41 endrift: been busy with other stuff though
20:42 imirkin: by now i don't even remember what the issue was ... something about DMA being fubar?
20:42 endrift: I think so
20:42 endrift: I was getting green/magenta checkerboards and a hardlock
20:42 imirkin: yeah. i don't have any genius suggestions =/
20:42 endrift: oh well :(
20:42 imirkin: i suspect asking someone more familiar with the platform may help
20:42 endrift: Thanks anyway
20:46 endrift: I did notice a few days ago that some code I wrote was working fine on my TX1 jetson nano with nvgpu but had rounding issues with mesa+nouveau on my switch. Anything I could do to try to debug that, or should I just not worry about it?
20:46 endrift: I ended up rewriting the shaders to use integer textures so I didn't have to worry about rounding, but it was the first place I'd seen this
20:46 endrift: works fine on Intel + AMD too afaik
20:50 imirkin: well, given the info available, could be a generic nouveau bug, a maxwell-related bug, or something TX1-related
20:50 imirkin: you could also be doing something dodgy which happens to work out ok on other platforms
20:50 imirkin: but that we're not super-careful about
20:50 imirkin: what do you mean by "rounding issues"?
20:51 imirkin: like you wanted a circle but got a square? :p
20:52 endrift: it's probably the dodgy thing
20:52 endrift: I was trying to store integers in floats because I thought integer texture color attachments for draw framebuffers didn't work
20:53 endrift: so I had an integer in the range 0-63 that I was dividing on write and multiplying on read and I got completely different results sometimes
20:54 karolherbst: endrift: are you using mesa master?
20:54 karolherbst: or some newish version?
20:54 karolherbst: we had a bug on maxwell in the integer div code which could explain such result
20:55 karolherbst: uhm behaviour
20:55 endrift: so for i915 I'm using latest release and it works, for nouveau I'm using...whatever is on the switch
20:55 karolherbst: I am sure there is nothing on the switch itself :p
20:55 endrift: which appears to be 19.0.0
20:56 endrift: you know what I mean :P
20:56 karolherbst: okay, I think 19.0.0 should contain the fix
20:56 karolherbst: maybe you are hitting undefined behaviour or something though
20:57 endrift: dunno. I should have just been using integer textures/samplers in the first place
20:57 endrift: and I am now
20:57 endrift: I seem to be having some trouble still but I think it's with clearing integer textures
20:57 endrift: I don't think glClear works but glClearBuffer does
20:57 endrift: *glClearBufferiv
21:00 imirkin: endrift: hm, well integer division being broken would be a bit of a bummer
21:00 imirkin: it's happened before though
21:00 endrift: so I was converting to float before doing the division so I wouldn't lose precision
21:00 endrift: (hopefully)
21:01 endrift: and since the number was only 6 bits it should fit fine in highp
21:01 imirkin: well, keep in mind integer division rounds down
21:01 endrift: and since I was dividing by a power of two, it should just futz with the exponent
21:01 imirkin: (for positive numbers)
21:01 endrift: yes that's why I wasn't doing integer division
21:01 imirkin: also for floats, denorms get flushed
21:01 imirkin: although i doubt you were hitting that
21:01 endrift: I might have been for all I know
21:02 imirkin: that's for super-small floats
21:02 endrift: but with highp I shouldn't have been
21:02 endrift: yeah
21:02 imirkin: i.e. less than 2^-126 or whatever
21:02 endrift: I'm familiar with denormals
21:02 imirkin: nouveau ignores highp/etc
21:02 imirkin: it's all 32-bit
21:02 endrift: thaaaaaat might have been my problem
21:02 endrift: oh
21:02 endrift: nevermind
21:02 endrift: everything I was doing should fit in 32 bit
21:03 imirkin: only the mobile chips and super-modern chips benefit from caring about mediump/etc, so it's hard to care
21:03 endrift: (I don't remember how big a single-float mantissa is but it's bigger than 6 bits by a lot)
21:03 imirkin: 23 bits
21:03 endrift: (I think it's like 23
21:03 endrift: yeah that
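The reasoning above (a 6-bit integer divided by a power of two only changes the exponent, and fits easily in a 23-bit mantissa) can be checked on the CPU. This sketch models plain IEEE-754 single precision only; the scale of 64 is an assumption based on the 0-63 range, and it says nothing about what the GPU or the texture storage format actually did.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def roundtrip(i, scale=64.0):
    stored = to_f32(i / scale)      # "dividing on write"
    return to_f32(stored * scale)   # "multiplying on read"

# Collect every value in 0..63 that does not survive the round trip.
mismatches = [i for i in range(64) if roundtrip(i) != float(i)]
print(mismatches)
```

If the list comes back empty, 32-bit float arithmetic alone would not explain the differing results, which points at something else in the pipeline.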
21:03 imirkin: could be an opt-gone-wrong
21:03 endrift: I was messing around with floating point representations at my job a few months ago so I should know that one
21:04 endrift: granted I was mostly using half float and less
21:04 imirkin: less than half-float?
21:04 imirkin: quarter-float? :)
21:05 endrift: basically
21:05 endrift: 1/5/2
21:05 imirkin: wow
21:05 imirkin: that's not a lot of bits.
21:05 endrift: We needed bandwidth more than precision for this project
21:06 imirkin: well, clearly you weren't going to get precision with 8-bit floats...
21:06 endrift: so I was experimenting to see how low I could get the precision while still having it work
21:06 endrift: indeed
21:06 endrift: 2 mantissa bits (+1 implied) kinda sucked though
21:06 endrift: especially handling denormals
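For reference, a 1/5/2 float can be decoded like this. The layout is an assumption: the log only gives the bit split, so the sketch assumes IEEE-style conventions (exponent bias 15, exponent 0 for zero and denormals, exponent 31 for inf/NaN), which makes it the top byte of a half float. `decode_152` is a hypothetical name.

```python
def decode_152(byte):
    """Decode an 8-bit float: 1 sign bit, 5 exponent bits, 2 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    man = byte & 0x3
    if exp == 0x1F:
        # All-ones exponent: infinity or NaN, as in IEEE-754
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:
        # Denormal: no implied leading 1, fixed exponent of -14
        return sign * (man / 4.0) * 2.0 ** -14
    # Normal: implied leading 1 plus two explicit mantissa bits
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - 15)
```

With only two explicit mantissa bits, each power-of-two interval holds just four representable values, which matches the "kinda sucked" verdict.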
21:06 imirkin: heh
21:07 imirkin: well, if you want to create a repro, or an apitrace or whatever
21:07 imirkin: then i can have a look
21:07 imirkin: otherwise there's not a lot we can say
21:07 imirkin: other than "yes, bugs may exist in nouveau" :)
21:07 imirkin: which is a safe bet all-around
21:08 endrift: I don't think I can generate an apitrace on the switch
21:08 endrift: maybe I can
21:08 imirkin: i'll bbl - rebooting, potentially multiple times. i check the logs though if you want to post something. or file a bugzilla issue.
21:08 imirkin: well, you said you tested on intel
21:08 imirkin: and it worked
21:08 endrift: ideally I'd install nouveau on the jetson nano
21:08 endrift: oh are those apitraces portable?
21:08 imirkin: apitraces are generally portable
21:08 endrift: I guess they should be
21:08 imirkin: don't use bindless
21:08 imirkin: and don't use coherent buffers
21:09 imirkin: those are the two big no-no's of portable apitraces
21:09 imirkin: (and don't use features which are only available on some subset of gpu's ... like if you rely on INTEL_foo for it, that kinda reduces the portability...)
21:09 endrift: If I just install a stock AArch64 Linux distro on my Jetson Nano and install xorg-video-nouveau, should I expect that to work?
21:09 imirkin: no
21:09 endrift: dang
21:10 imirkin: wayland is more likely to work
21:10 endrift: I can try wayland
21:10 imirkin: the display is handled by tegra
21:10 endrift: ah
21:10 imirkin: while the accel is handled by nouveau
21:10 endrift: I'll look around to see if I can find a guide
21:10 imirkin: this is not extremely-well-handled in xorg ddx's
21:11 imirkin: and fwiw, a stock linux distro on a Jetson TK1 i have causes it to hang pretty fast (no nouveau involved)
21:11 imirkin: i might have somehow fried it or something, dunno. but i've never gotten that stuff to work reliably.
21:11 endrift: hm, ok
21:12 endrift: I'll see about generating an apitrace once I fix this clear bug
21:12 imirkin: (basically upon substantial network activity, it seems to die. which is unfortunate since i nfs-root it...)
21:13 imirkin: alright, well i'm out for a bit. file a bug if you get anything concrete, we can investigate.
21:13 imirkin: good luck
21:22 endrift: for future reference: the afterimage/clearing issue was actually alpha blending
21:23 endrift: turns out I was decreasing the alpha in places I hadn't thought about and alpha blending was disabled on my i915, but not on nouveau
21:23 endrift: my bug
21:23 endrift: I'll capture an apitrace for the other issue now
21:30 endrift: https://endrift.com/files/mgba-qt.28.trace can anyone on a maxwell replay this for me and take a screenshot of it on one of the last frames?
21:53 karolherbst: my qapitrace just crashes :O
21:57 karolherbst: heh...
21:57 karolherbst: endrift: which frame? 108?
21:57 imirkin: skeggsb: so i tried to hook up fp16 formats -- no go. do i have to disable dithering or disable lut or something else maybe?
21:58 endrift: doesn't matter exactly, just one near the end
21:58 endrift: the last few should be relatively identical
21:58 karolherbst: endrift: https://i.imgur.com/gteM0Iw.png
21:58 karolherbst: this is a gp107 though
21:59 karolherbst: but it's more or less identical to maxwell
21:59 endrift: well that looks right
21:59 endrift: yoshi is behind everything when rendered on the Switch
21:59 endrift: can't actually see him
21:59 karolherbst: uff
21:59 karolherbst: weird
22:00 imirkin: this is on GK208: https://i.imgur.com/iUZUhc5.png
22:01 imirkin: seems pretty similar to karol's :)
22:01 endrift: also looks correct
22:01 endrift: wonder if there's any chance it could be a bug in the Switch version of libdrm_nouveau
22:01 karolherbst: endrift: try to run with NV50_PROG_OPTIMIZE=0
22:02 karolherbst: although I don't know if that's exposed in a release build
22:02 endrift: I'm not sure how to do that on the switch
22:02 karolherbst: setenv or something? dunno
22:03 imirkin: endrift: are you using ASTC anywhere?
22:03 karolherbst: ahh, it's debug only
22:03 karolherbst: mhhhh
22:03 endrift: imirkin: I don't know what that is
22:03 karolherbst: imirkin: why is NV50_PROG_OPTIMIZE debug only?
22:03 imirkin: karolherbst: dunno
22:03 imirkin: karolherbst: all the debug stuff is debug-only :)
22:03 karolherbst: well
22:03 imirkin: endrift: texture compression format
22:03 karolherbst: but telling user to just try it would be cool
22:03 endrift: oh, no I'm not
22:04 imirkin: that's the only thing i know of that would be legitimately different on the TK1/TX1
22:04 imirkin: they have native ASTC support, while for desktop it's emulated on cpu
22:04 imirkin: [and nouveau exposes that native support, ideally, but in practice not super-duper-tested]
22:04 endrift: I wonder if something in how I set up the EGL context on the Switch broke it
22:05 imirkin: endrift: how does yoshi make it on there?
22:05 imirkin: could it be a depth testing/etc snafu?
22:05 endrift: the depth is stored in that integer-in-float texture I was talking about, I render five buffers individually and then composite them manually in a shader
22:05 imirkin: you also appear to have a GL_RGBA4 texture. this has been the source of some annoyance for nouveau.
22:06 imirkin: (e.g. some people expect to be able to render to a GL_RGBA4)
22:06 endrift: (This is because the system I'm emulating does weird not-really-friendly things with layer ordering)
22:06 endrift: I'm reading from it, not rendering to it
22:06 endrift: if that texture didn't work the graphics would be completely corrupted
22:07 imirkin: endrift: also you appear to be making use of GL_BLEND_ADVANCED_COHERENT? that's not supported on nouveau
22:07 endrift: the problem appears to either be with the window texture (texture 0) or the way it's compositing the flag textures
22:07 endrift: I am?
22:07 imirkin: well at least you glDisable it a lot :)
22:08 endrift: That's probably Qt's wrapper doing it
22:08 imirkin: which leads to warnings about an unknown enum
22:08 endrift: I've never even heard of it
22:08 karolherbst: endrift: maybe the best way to test would be to boot linux
22:08 karolherbst: and test there
22:08 karolherbst: also much less painful to debug
22:08 endrift: it's also Qt's fault that there are a mess of glViewport(0, 0, -1, -1)
22:09 imirkin: uh huh. always blame someone else... :p
22:09 endrift: Does the Switch Linux use nouveau?
22:09 karolherbst: what else?
22:09 endrift: L4T ships with the binary drivers, I thought they might somehow have been using that
22:09 karolherbst: mhhh
22:09 endrift: but yeah that's a good point
22:09 karolherbst: you wouldn't be able to ship it
22:09 karolherbst: or allowed to
22:09 endrift: oh. right
22:09 endrift: GPL
22:09 karolherbst: but maybe they don't give a damn
22:09 karolherbst: mhhh
22:09 karolherbst: not only GPL
22:09 karolherbst: but also nvidias license
22:10 endrift: Then how did I get Jetpack with official drivers...?
22:10 karolherbst: I think it's okay as long as you repackage it for you know, packages
22:10 endrift: anyway
22:10 karolherbst: but I don't think you can include the driver in a big blob
22:10 endrift: I can try that
22:10 karolherbst: yeah
22:10 endrift: I'm a little disinclined to try it since, uh, with integer textures it just works
22:10 karolherbst: that would be probably the best way to do it
22:11 karolherbst: mhhh
22:11 endrift: and I have to do less munging of things
22:11 karolherbst: as long as it's fast enough
22:11 endrift: are integer textures slower?
22:11 karolherbst: no idea
22:11 endrift: the biggest speed issue I have is with VRAM uploads
22:11 imirkin: int textures are fine
22:11 endrift: which I do more often than I'd like
22:11 karolherbst: yeah...
22:11 karolherbst: emulators
22:11 endrift: that specific scene I linked uploads the palette a bunch
22:11 endrift: because it changes one entry
22:12 endrift: most scanlines
22:12 imirkin: endrift: note that you can use something like a buffer texture / image, make it a persistent buffer, etc
22:12 endrift: and that causes it to stall
22:12 karolherbst: why not just add another indirection and do partial uploads?
22:12 karolherbst: or some crap like that
22:12 imirkin: talk to HdkR about how dolphin streams data in like that
22:12 endrift: I tried PBOs and that didn't work
22:12 endrift: well
22:12 endrift: it did but it wasn't faster
22:12 karolherbst: ohhhh
22:12 karolherbst: I know
22:12 endrift: I'm going to try doing indirection tricks next
22:12 karolherbst: just do the fancy shit
22:13 endrift: I already do some indirection tricks for a bunch of stuff
22:13 karolherbst: you don't have dedicated memory, right?
22:13 endrift: you mean dedicated VRAM?
22:13 imirkin: switch doesn't have dedicated vram
22:13 imirkin: which is why i suggest the persistent buffer
22:13 karolherbst: yeah
22:13 karolherbst: GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT
22:13 endrift: how would I do that?
22:13 imirkin: not coherent - only persistent.
22:13 karolherbst: with glBufferStorage
22:13 karolherbst: ohh,
22:13 imirkin: coherent breaks apitrace ;)
22:13 karolherbst: right :D
22:14 endrift: yeah I don't think I need it to be coherent
22:14 imirkin: and doesn't really provide THAT much advantage
22:14 endrift: so this renderer is being used not only on the switch, so I'll need to make sure it works on desktop too
22:14 endrift: but I'll look into that
22:14 imirkin: this will work on desktop too
22:15 karolherbst: you still use a PBO
22:15 karolherbst: but instead of using os memory
22:15 karolherbst: you map GPU memory into your process
22:15 karolherbst: and just write into that
22:15 karolherbst: more or less
22:16 karolherbst: and then you texture with data = NULL
22:16 endrift: that may help a lot
22:16 endrift: and this is supported in ES 3.1?
22:16 karolherbst: uhm.. good question
22:16 karolherbst: that's a GL feature, no idea if that's part of GLES
22:17 imirkin: it's there in ES
22:17 imirkin: iirc it's not in core ES 3.1
22:17 imirkin: but it's there in AEP, and in ES 3.2
22:17 endrift: glBufferStorage isn't even in 3.1
22:17 imirkin: hrm. nope, not in AEP
22:17 karolherbst: was it GL_OES_texture_buffer?
22:18 imirkin: OES_buffer_storage / EXT_buffer_storage
22:18 endrift: oof, I need GL 4.4 for this
22:18 karolherbst: ahhh
22:18 imirkin: endrift: not really
22:18 karolherbst: do we support OES_buffer_storage inside mesa?
22:18 imirkin: the feature's exposed in a lot of mesa impls
22:18 imirkin: karolherbst: EXT_buffer_storage. looks like there's no OES version.
22:19 endrift: I can swing that on the Switch but I might have trouble on desktop
22:19 endrift: so I might fall back to the PBO version
22:19 endrift: we'll see
22:19 karolherbst: we indeed support it :)
22:19 endrift: On desktop I'm trying to work with 3.0, or at least 3.2 Core
22:19 karolherbst: endrift: it can be optional for more perf :p
22:19 karolherbst: or lower GPU/CPU usage or whatever
22:19 endrift: indeed
22:20 endrift: I'm gonna see what happens if I try using it
22:20 karolherbst: it is usually faster than keeping a CPU side buffer and uploading it with subTex
22:20 karolherbst: uhm... TexSub
22:20 imirkin: depends how much it's used by the gpu. if it's a desktop gpu, persistent usually means it's stored in cpu-side memory
22:20 imirkin: so if the gpu will read that data lots of times, it'll be slower
22:21 karolherbst: ohh yeah, could be
22:21 imirkin: if it'll read that data once, it should be better (since no extra upload)
22:21 imirkin: but without separate vram, there are no such considerations
22:25 endrift: hmm I'm gonna need to use a lot of fences
22:26 karolherbst: mhhh, true, synchronisation might become an issue
22:29 karolherbst: but hey, you have to stall with your current approach anyway
22:31 endrift: true
22:32 endrift: actually
22:32 karolherbst: but maybe you could have like a ring buffer or something and just stall if the CPU side is too fast
22:32 endrift: I can use glMemoryBarrier instead
22:32 endrift: I think?
22:32 karolherbst: endrift: the issue is rather that the GPU might still read from that buffer when you are modifying it
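The ring-buffer suggestion can be modelled without any GL at all. In the toy sketch below, `UploadRing`, `acquire` and `gpu_done` are hypothetical names, and `gpu_done` stands in for whatever fence mechanism signals that the GPU has finished reading a slot (in real GL code that would be something along the lines of `glFenceSync` / `glClientWaitSync`).

```python
class UploadRing:
    """Ring of buffer slots; a slot is reusable only once the GPU is done with it."""

    def __init__(self, slots):
        self.busy = [False] * slots   # True while the GPU may still read the slot
        self.next = 0                 # slot the CPU will try to write next

    def acquire(self):
        """Return a free slot index, or None if the CPU would have to stall."""
        slot = self.next
        if self.busy[slot]:
            return None               # CPU is too far ahead of the GPU
        self.busy[slot] = True
        self.next = (slot + 1) % len(self.busy)
        return slot

    def gpu_done(self, slot):
        """Stand-in for a fence signalling that the GPU finished with the slot."""
        self.busy[slot] = False
```

With enough slots the CPU rarely sees `None`, so it only stalls when it really has outrun the GPU rather than on every upload.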
22:32 karolherbst: but
22:33 karolherbst: let me ask one thing, the palette is a texture?
22:33 karolherbst: why?
22:33 endrift: ...what should it be?
22:33 karolherbst: how many entries does the palette have? 256?
22:33 endrift: 512
22:34 karolherbst: why not use an UBO?
22:34 endrift: Not available in GL 3.2
22:34 karolherbst: then.. an uniform
22:34 karolherbst: which is just a big array
22:34 karolherbst: or 2d array
22:34 endrift: wait they are in 3.2
22:34 karolherbst: then you don't have to synchronize
22:35 endrift: they're not in 3.0
22:35 karolherbst: as UBOs are essentially copied to on chip memory
22:35 karolherbst: or uniforms
22:35 karolherbst: yeah.. a 2d uniform array should be enough, would just require array_of_array
22:35 endrift: I can get away with making it 256 entries since each shader would only be using up to 256 of them at a time
22:35 karolherbst: or is the palette 1D?
22:35 endrift: and I know which in advance
22:35 endrift: palette is 2D, 16x32
22:35 karolherbst: okay
22:36 endrift: but I can just do the multiplication to make that 1D
22:36 karolherbst: arrays_of_arrays is 3.1 in ES and 4.3 in GL :D
22:36 karolherbst: yeah..
22:36 karolherbst: endrift: are the dimensions constant?
22:36 endrift: yes
22:36 karolherbst: I mean, the access
22:36 karolherbst: k
22:36 karolherbst: then it doesn't even matter
22:37 karolherbst: or can you have a variable access to the palette?
22:37 endrift: what do you mean?
22:37 karolherbst: is it like palette[x][y] or palette[5][7]
22:37 endrift: in terms of access or definition?
22:37 endrift: in definition, the latter
22:37 karolherbst: constant expressions or can the accessors be runtime variable?
22:38 endrift: accesses are runtime variable
22:38 karolherbst: okay
22:38 karolherbst: and each palette entry is a vec3?
22:38 karolherbst: or vec4?
22:38 endrift: oh, you can't do variable accesses into uniforms in GLSL 1.30 can you
22:38 karolherbst: dunno
22:38 endrift: vec4
22:39 endrift: technically I don't care about .a but it's still there
22:39 karolherbst: 8kb for the palette
22:39 karolherbst: I think that's small enough to fit in uniforms
22:40 karolherbst: mhhh
22:40 endrift: oh, the palette format is 1_5_5_5_REV
22:40 endrift: (or 5_6_5 in GL ES land)
22:40 endrift: it's 2 bytes per entry
22:40 endrift: only 1kb
22:40 endrift: *1kB
22:40 endrift: **1kiB :P
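The two size estimates above come out of the same entry count. The arithmetic below assumes 512 entries (16x32, as stated earlier) and 32-bit float components for the vec4 case.

```python
ENTRIES = 16 * 32             # palette entries, per the discussion

vec4_bytes = ENTRIES * 4 * 4  # four 32-bit floats per entry
packed_bytes = ENTRIES * 2    # one 16-bit 1_5_5_5 / 5_6_5 value per entry

print(vec4_bytes, packed_bytes)
```

So the "8kb" figure corresponds to expanded vec4 data and the "1kiB" figure to the packed 16-bit format.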
22:41 karolherbst: ohhhh
22:41 karolherbst: ehh
22:41 karolherbst: uhm
22:41 imirkin: endrift: indirect access into uniforms should work just fine
22:41 karolherbst: I still think having an uniform/ubo might be faster... dunno
22:41 imirkin: er hrm....
22:41 imirkin: maybe need ubo for that
22:41 karolherbst: having to deal with the format at runtime sounds annoying
22:42 imirkin: actually no - should definitely work ok =/
22:42 imirkin: [iirc]
22:42 endrift: I can deal with the format on the CPU side
22:42 endrift: I already do on the GL ES version
22:42 karolherbst: right, but then you need more space :p
22:42 endrift: true
22:42 karolherbst: ahh
22:42 endrift: the ES version converts it to 5_6_5
22:42 karolherbst: but anyway, it would be interesting to see if that is actually aster
22:42 endrift: still the same number of bytes though
22:42 karolherbst: *faster
22:42 endrift: it may be
22:42 endrift: especially if I use a UBO
22:43 karolherbst: uniform == UBO
22:43 karolherbst: on the hardware it's the same
22:43 karolherbst: imirkin: do you know if using images vs texture ops makes any difference for read only data?
22:44 karolherbst: like could reading from an image be faster than reading from a sampler?
22:44 endrift: fwiw I'm using texelFetch everywhere
22:44 imirkin: karolherbst: yeah
22:44 imirkin: textures go via a texture cache
22:44 imirkin: so it's definitely better to go through that
22:44 karolherbst: ohhh, because textures are assumed to be read only
22:44 imirkin: images go through L2, but textures have a bigger cache
22:45 karolherbst: mhhh
22:45 karolherbst: I am really wondering how much uniforms are actually helping.. but access to uniforms is equal to reading from a gpu register
22:45 karolherbst: but the invocation overhead might be bigger
22:45 karolherbst: ?
22:46 karolherbst: dunno
22:46 endrift: you can't specify the size of a uniform can you
22:46 imirkin: uniforms are super-cached too, obviously
22:46 imirkin: endrift: huh?
22:46 karolherbst: endrift: it's defined by the shader
22:46 imirkin: you can't not specify the size of a uniform...
22:46 endrift: No I mean
22:46 karolherbst: uniform palette vec4[256];
22:46 karolherbst: uhm...
22:46 endrift: you can't say "this ivec3 has 8 bits per component"
22:46 imirkin: heh
22:46 endrift: I mean I guess you can say lowp
22:47 karolherbst: uniform ivec4 palette[256];
22:47 imirkin: there are exts
22:47 imirkin: but they don't do anything extremely helpful
22:47 karolherbst: ohh int8...
22:47 imirkin: i8vec3 or whatever
22:47 endrift: I only need 5 bits per component
22:47 endrift: since each palette entry is 15 bits
22:47 imirkin: thing is ... it's hard to take advantage of that
22:47 imirkin: you can pack it by hand
22:47 karolherbst: yeah... we don't have int8/int16 support in nouveau
22:47 karolherbst: or gl at all?
22:47 endrift: I just want to make sure I don't use too many uniforms
22:47 imirkin: and then use bitfieldExtract to get the right bits out
22:48 karolherbst: const buffers are 64k, right?
22:48 endrift: wait what's bitfieldExtract
22:48 imirkin: yes.
22:48 imirkin: http://docs.gl/sl4/bitfieldExtract
22:48 karolherbst: endrift: glsl 4.0 feature :p
22:48 endrift: too new
22:48 imirkin: there's a mesa ext to get it earlier
22:48 karolherbst: well, it's just && and <<
22:48 endrift: what I can do though
22:48 imirkin: also if you do it "by hand", nouveau's optimization will pick up on that.
22:48 karolherbst: uhm && and >>
22:49 karolherbst: eh...
22:49 karolherbst: I should sleep
22:49 karolherbst: it's getting embarrassing
22:49 karolherbst: & and >> :p
22:50 endrift: upload e.g. 0x7FFF, then do (ivec3(0x1F, 0x3E0, 0x7C00) & x) >> (ivec3(0, 5, 10))
22:50 endrift: and that'll get what I want
22:50 imirkin: right
22:50 endrift: that's not too bad
22:50 endrift: I'll try that
22:51 endrift: thanks
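The mask-and-shift expression above does the same thing on the CPU. `unpack_555` is a hypothetical helper name, and the mapping of the three fields to colour channels is left open since the log does not state it.

```python
def unpack_555(x):
    """Split a 15-bit packed value into three 5-bit components (low bits first)."""
    masks = (0x1F, 0x3E0, 0x7C00)
    shifts = (0, 5, 10)
    # Same as (ivec3(0x1F, 0x3E0, 0x7C00) & x) >> ivec3(0, 5, 10), per component
    return tuple((m & x) >> s for m, s in zip(masks, shifts))
```

For instance, `unpack_555(0x7FFF)` gives 31 in every component, and any 16th bit is simply ignored by the masks.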
22:52 karolherbst: can we do partial updates to uniforms?
22:52 karolherbst: or do we have to reupload the entire thing?
22:52 imirkin: for user uniforms, we reupload the whole thing
22:52 imirkin: we don't get an indication from st/mesa about what's updated
22:52 karolherbst: k, thought as much
22:52 karolherbst: yeah.. I was actually wondering if you can actually do that through the gl API
22:53 imirkin: GL api knows
22:53 karolherbst: ohh, really?
22:53 imirkin: you run glUniform on very specific locations
22:53 karolherbst: how do you partially update an array?
22:53 karolherbst: ohhh, interesting
22:53 imirkin: you can't really bulk-update an array
22:53 imirkin: you have to do it one elem at a time iirc
22:53 karolherbst: but.. how?
22:53 karolherbst: ufff
22:53 karolherbst: really?
22:53 karolherbst: that sounds annoying
22:53 imirkin: yeah, UBO's are nice :)
22:53 karolherbst: :D
22:54 imirkin: actually i guess you can update multiple ones
22:54 imirkin: use something as a base, and then e.g. glUniform4fv()
22:54 karolherbst: yeah.. sounds about right
22:55 karolherbst: ubos can be used with persistent buffers as well, right?
22:56 imirkin: ya
22:56 imirkin: everywhere
22:57 skeggsb: imirkin: i'm actually not sure HW supports it, until (possibly) turing
22:57 skeggsb: despite it being in the class headers..
22:57 imirkin: skeggsb: gah
22:57 skeggsb: what error code does evo throw?
22:57 imirkin: no error
22:57 karolherbst: "All NVIDIA GPUs from the 900 and 1000 series support HDR display output"
22:58 imirkin: just ... all black :)
22:58 karolherbst: at least that's what nvidia is saying
22:58 imirkin: karolherbst: HDR != fp16
22:58 karolherbst: ohhh, true
22:58 imirkin: skeggsb: i'm currently suspecting LUT
22:58 imirkin: also i'm testing on the G84 :)
22:59 imirkin: largely coz ... convenience. maybe i should give it a shot on the GK208
23:16 endrift: wHOA
23:16 endrift: That gave me a >50% speedup in one of my pathological tests
23:17 karolherbst: :D
23:17 endrift: thanks :D
23:17 karolherbst: the uniform stuff?
23:17 endrift: yeah
23:17 karolherbst: nice
23:17 endrift: it was running at 43-48 fps before, now it's running >60
23:18 karolherbst: I assume you mostly get more shaders running in parallel
23:18 karolherbst: as you have lower register pressure
23:18 karolherbst: wasn't thinking of that before
23:18 endrift: I guess uniforms have their own block on the GPU?
23:18 karolherbst: yeah
23:18 endrift: ah
23:18 karolherbst: reading from them is as fast as reading a register
23:18 endrift: That means I have another thing I should move to uniforms
23:19 endrift: (affine params)
23:19 endrift: actually I should use UBOs for this because it's structured data
23:19 karolherbst: uhm
23:19 karolherbst: uniform block?
23:19 endrift: yeah
23:19 endrift: that
23:19 karolherbst: you can have a struct for a uniform
23:19 karolherbst: right
23:19 endrift: I misspoke
23:19 karolherbst: k
23:19 endrift: I'll do that after I commit this
23:20 karolherbst: are you working on upstream vbam or just your own fork?
23:20 karolherbst: or different emulator?
23:21 endrift: neither?
23:21 endrift: mGBA
23:21 endrift: it's from scratch*
23:21 karolherbst: ahh
23:21 karolherbst: I see
23:21 endrift: *vaguely based on another emulator I wrote in 2012 that was from scratch
23:26 endrift: maybe I'll actually do this tomorrow
23:26 endrift: I've been hacking at this for a while
23:28 imirkin: endrift: all the stuff that's the same across invocations should be in a uniform
23:29 endrift: invocations of the shader, or the draw?
23:29 imirkin: indirect indexing into uniforms is fairly cheap, usually
23:29 imirkin: invocations of the shader
23:29 endrift: so e.g. once per fragment
23:29 endrift: (for a fragshader
23:29 endrift: )
23:29 karolherbst: the entire stage
23:29 endrift: makes sense
23:29 karolherbst: well
23:29 karolherbst: _all_ stages actually
23:29 endrift: I have a bunch of things that are the same per y, but vary per x and it's really annoying
23:29 imirkin: if you need format conversion, texturing is a cheap way to get it "for free"
23:30 endrift: not sure how to manage that
23:30 imirkin: although i dunno how it compares to uniforms. depends on what exactly you need.
23:30 imirkin: the other nice thing about uniforms is that on nvidia hw they don't cause stalls between successive draws
23:30 imirkin: i.e. it's optimized for draw; update uniform; draw; update uniform; draw
23:31 imirkin: (the updates are staged to a special place, and written out in sequence afterwards)
23:32 imirkin: but the draw can run before all that happens, using the special place to look up the correct uniform values
23:34 endrift: seems fast on i915 too
23:34 endrift: no clue about AMD
23:35 karolherbst: ubos are usually much faster than texture operations
23:36 karolherbst: or uniforms in general
23:36 endrift: I was worried about upload
23:36 karolherbst: it's one page?
23:36 endrift: .........good point
23:37 karolherbst: I think we even just push those values inside the command buffer
23:37 karolherbst: dunno if we do that for user uniforms...
23:37 karolherbst: or when we do it?
23:38 endrift: technically I'm only uploading 1/4 of a page :P
23:39 endrift: but I do it up to...(128+4)*160 times per frame
23:39 endrift: so I was worried
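To put rough numbers on that worry, the worst case can be worked out from the figures endrift gives. This is a back-of-the-envelope sketch under stated assumptions: a 4 KiB page (so "1/4 of a page" is 1 KiB) and 60 frames per second are both assumptions, not numbers from the log, and the update count is the stated upper bound rather than a typical value.

```python
PAGE_BYTES = 4096                    # assumption: 4 KiB page
UPLOAD_BYTES = PAGE_BYTES // 4       # "1/4 of a page" per uniform update
UPDATES_PER_FRAME = (128 + 4) * 160  # upper bound quoted in the log
FPS = 60                             # assumption: target frame rate

# Worst-case traffic if every update re-uploads the full quarter page.
bytes_per_frame = UPLOAD_BYTES * UPDATES_PER_FRAME
bytes_per_second = bytes_per_frame * FPS

print(UPDATES_PER_FRAME)             # uniform updates per frame
print(bytes_per_frame / 2**20)       # MiB per frame
print(bytes_per_second / 2**30)      # GiB per second
```

Under these assumptions the upper bound is 21120 updates and about 20.6 MiB of uniform data per frame, which is why reducing the number of volatile scanline parameters matters more than the size of a single upload.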
23:40 karolherbst: do you always update the entire array?
23:40 karolherbst: although that's probably the easiest anyway
23:40 karolherbst: and fastest
23:42 endrift: it's 160 if volatile scanline parameters are changed that many times per frame
23:43 endrift: I'm looking at ways to make more scanline params non-volatile
23:43 endrift: ideally it's 1
23:43 endrift: and in many cases it is
23:44 imirkin: a texture buffer can be many MBs
23:44 imirkin: you lose some of the benefits of uniforms, but still get a good cache
23:44 karolherbst: imirkin: the question is rather, how much sense does it make to use a texture if you have only roughly 1kb of data anyway
23:45 karolherbst: and it's really just a lookup table
23:45 imirkin: 128 * 160 = more data
23:45 imirkin: my point is that if it doesn't fit into a uniform
23:45 imirkin: or across a couple uniform buffers
23:45 endrift: no I mean
23:45 endrift: that's how many times I reassign the value of the uniform
23:46 imirkin: could you precompute it once up-front
23:46 endrift: well...yes, but then I'd need that many uniforms
23:46 endrift: since it can be different per y
23:46 imirkin: right
23:46 endrift: (y is the number of scanlines tall it is)
23:46 imirkin: right
23:46 endrift: *160
23:46 imirkin: so just do that?
23:47 karolherbst: but that doesn't fit inside one uniform anymore though
23:47 imirkin: hence my other comments
23:47 karolherbst: ahh
23:47 imirkin: hopefully it's all coming together now :)
23:48 endrift: so that's what my other approach was going to be
23:48 karolherbst: endrift: where do you split your draws?
23:49 endrift: I split draws on volatile scanline parameters--i.e. things that would make me need to change the values in my uniforms to get it right before and after the change
23:49 karolherbst: mhhh
23:49 karolherbst: so in theory you could do one draw per frame?
23:49 endrift: no
23:49 endrift: well
23:49 endrift: in most cases yes
23:49 endrift: in some cases no
23:50 endrift: there's one parameter that can change that would change which shaders I use
23:50 endrift: and that can change midframe
23:50 karolherbst: ahhh
23:50 endrift: but I would like to reduce it to JUST that being the thing that splits if possible
23:50 karolherbst: how different are those shaders?
23:50 endrift: I also need to convert one thing I'm doing with a really hacky approach to a vertex shader
23:50 endrift: very
23:54 endrift: actually there's another thing
23:54 endrift: writing to emulated VRAM
23:54 endrift: that really only happens in bulk during scene transitions so it's tolerable
23:55 endrift: but I've already optimized it a lot, it's like 3x faster than it was when I started
23:55 karolherbst: how big is vram?
23:55 imirkin: and 3x slower than it will be when you're done optimizing :)
23:55 endrift: imirkin: I hope so!
23:55 endrift: karolherbst: 0x1C000 bytes
23:55 endrift: I forget what that is in decimal
23:55 endrift: I think it's 192kiB?
23:56 imirkin: just under 128kb
23:56 karolherbst: 112kb
23:56 imirkin: 0x10000 is 64k
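The hex-to-decimal conversion being debated here is easy to verify. A minimal sketch, where the helper name `to_kib` is made up for illustration:

```python
KIB = 1024


def to_kib(n_bytes):
    """Convert a byte count to KiB (as a float)."""
    return n_bytes / KIB


VRAM_BYTES = 0x1C000  # size quoted by endrift

print(VRAM_BYTES)            # size in bytes
print(to_kib(VRAM_BYTES))    # size in KiB
print(to_kib(0x10000))       # imirkin's reference point
```

This gives 114688 bytes, i.e. 112 KiB, which matches karolherbst's figure and imirkin's "just under 128kb"; 0x10000 is 65536 bytes, or 64 KiB.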
23:56 karolherbst: mhhh
23:56 karolherbst: endrift: when can you write to vram?
23:56 endrift: a lot of the time
23:56 karolherbst: and how much of it do you have to be able to access inside the shaders?
23:56 endrift: ^ that's how I optimized it
23:56 endrift: only upload it when it's needed or between frames
23:56 karolherbst: mhhhh
23:58 karolherbst: do you have to write to it inside your shaders?
23:58 endrift: no
23:58 endrift: never
23:58 karolherbst: but access is variable I guess
23:58 endrift: yes
23:59 karolherbst: I am wondering...
23:59 karolherbst: as it totally fits into 2 ubos
23:59 karolherbst: indirectly accessing ubos might actually be faster
23:59 karolherbst: because we can do that actually
23:59 karolherbst: ohhhhhhhhh
23:59 karolherbst: mhhh
23:59 endrift: how big can a ubo be?
23:59 karolherbst: 64k
23:59 karolherbst: but it's not limited