17:29imirkin: skeggsb: i'm planning on looking into more HDR-related bits for nouveau. can i assume this is an area you have not really looked at / are not interested in looking at yourself?
17:29imirkin: [beyond making the LUT stuff work properly a while back]
17:52karolherbst: imirkin: it's on the things we want to do at least
17:52imirkin: ok. well, i don't want to collide with other peoples' work
17:52imirkin: esp since my work is going to proceed at a highly unpredictable rate
17:53karolherbst: yeah.. let me check
17:53karolherbst: doesn't seem to be part of what we want to do for Nouveau (we as RH graphics team)
17:56karolherbst: and as far as I can tell, I doubt skeggsb was working on it.. but to be sure you can always wait until he responds
17:57imirkin: nah, i mean i'll start doing little things
17:57imirkin: if it turns out he already has a series, so be it
19:27karolherbst: imirkin: ohh, btw, I was posting a series for optimizing constants into the driver const buffer, and I was wondering if you have any opinion on that: https://lists.freedesktop.org/archives/mesa-dev/2019-March/216041.html
19:27airlied: imirkin: we did buy skeggsb a HDR monitor :)
19:28karolherbst: airlied: I also have an eGPU case :p
19:28karolherbst: I still haven't gotten around to writing any meaningful patch yet
19:28karolherbst: (although I kind of understand the brokenness of nouveau when it comes to just removing the device)
19:29karolherbst: but yeah.. I kind of suspected that skeggsb would like to look into it
19:30airlied:could get an egpu case but I've got no thunderbolt machines :-P
19:31karolherbst: usually they are quite nice as you can change your GPU without rebooting, but yeah, without a TB port you are kind of screwed
19:31karolherbst: get a better motherboard :p
19:33karolherbst:trying to fix this silly HMM memory double free bug
19:33imirkin: airlied: heh finally :)
19:37imirkin: karolherbst: not a huge fan of the fact that you reupload imms on every validation
19:37karolherbst: I don't like that either
19:37karolherbst: I was mostly writing the code to see what the benefit would be
19:37imirkin: calim had previously either done or thought of doing something like that
19:37karolherbst: maybe we should collect the 1000 most used immediates and have a static table? dunno
19:37imirkin: i eventually removed it since it was only like 1% there in the code
19:38imirkin: the c14_imms thing or whatever
19:38imirkin: it's a lot more beneficial for nv50 actually
19:38imirkin: since it's a LOT more restrictive in terms of where imms can go
19:38karolherbst: mhh, interesting
19:38karolherbst: but maybe with a static table we could have a good enough benefit? dunno what the widely used constants are
19:39karolherbst: would save us the reupload
19:39imirkin: i like the general approach
19:39imirkin: but if you could keep better track of which shaders' imms you had most recently uploaded, i'd feel a little better
19:39karolherbst: reworking insnCanLoad was fun
19:40karolherbst: imirkin: I was also thinking about having one table for the entire runtime and just append at the end and deduplicate, etc...
19:41karolherbst: I think I actually added deduplication already
19:41karolherbst: and if we run out of space, we run out of space
19:41imirkin: could make sense
19:41imirkin: otoh i dunno what the benefit is
19:42karolherbst: less reuploads
19:42karolherbst: or none
19:42imirkin: should be the same amount...
19:42karolherbst: with the current approach, we have to reupload the entire imm buffer
19:42imirkin: and as you link diff shaders together, could be trouble
19:42karolherbst: but if we have one buffer for _all_ shaders?
19:42imirkin: oh, you mean GLOBALLY?
19:43karolherbst: so we just upload new imms
19:43karolherbst: well, if the buffer gets bigger, we just upload more data
19:43karolherbst: otherwise we just keep whatever is in there
19:43imirkin: and do some kind of clever thing about keeping stuff around? dunno.
19:43imirkin: also messes bigtime with the shader cache
19:43karolherbst: on maxwell we have 18 buffers.. but still only 8 for compute shaders :/
19:43karolherbst: mhhh, do we actually have a shader cache so far?
19:44imirkin: i thought we did?
19:44karolherbst: we only have the TGSI one, right?
19:44imirkin: maybe i thought wrong
19:44karolherbst: yeah.. because if we did, we would be able to serialize and deserialize our shader
19:44karolherbst: and I am sure we don't support that
19:44karolherbst: (also we had this discussion already about caching native shaders)
19:45imirkin: maybe not =/
19:45karolherbst: but yeah... that would be a concern indeed
19:45imirkin: we should though
19:45imirkin: there's no reason we can't
19:45karolherbst: we need it for CL anyway (kind of)
19:45imirkin: should be trivial to implement. just have to serialize the code, relocs, and fixups
19:45karolherbst: I have some WIP stuff somewhere
19:51karolherbst: imirkin: I kind of hoped you would have some very clever idea on how to make it suck less..
19:52imirkin: actually i dunno that what you have is all bad... dunno.
19:52karolherbst: maybe I should just go ahead and benchmark it somehow... two shaders with tons of constants, switching between them or something and see how much it impacts draws/sec or so
19:53imirkin: or just something like unigine
19:54karolherbst: I think I actually did and it didn't have any significant impact
19:54karolherbst: mhh, maybe we should have some constbuffer_size/draw counter? That would be a fun thing to have
19:57karolherbst: or maybe check out how nvidia is doing that, as they do this optimization as well afaik
20:36endrift: Is there a list of maximum supported versions of GLES in nouveau for different cards?
20:36endrift: I'm curious what the maximum supported ES version of the TX1 is
20:36endrift: I can make a 3.1 context but I haven't tried 3.2
20:37imirkin: ES 3.2 should be supported on all fermi+, i believe
20:37imirkin: at least nominally
20:38endrift: cool thanks
20:38endrift: Not sure what was added in 3.2 but it's good to know
20:38imirkin: hasn't been updated in a while, but there's also https://people.freedesktop.org/~imirkin/glxinfo/
20:38endrift: I might be writing Switch homebrew :)
20:38imirkin: ES3.2 = ES 3.1 + AEP, basically
20:38endrift: I see AEP and I think Freenode user aep in #qt <_<
20:38imirkin: ANDROID_extension_pack_es31a or whatever
20:39endrift: btw should I try my Nvidia cards in my G5 again? It's been a long time and I heard some stuff might have gotten fixed
20:39imirkin: you were the guy with a NV40 AGP, right?
20:39endrift: There's a radeon in there right now but it only supports GL 2.0
20:39endrift: not a guy but yes
20:39imirkin: the individual, sorry :)
20:40endrift: yes, that was me :)
20:40imirkin: so unfortunately in the meanwhile my own G5 died
20:40endrift: oh no :(
20:40imirkin: so that's pretty much the end of the line as far as my support of all that goes
20:40imirkin: not that i was providing anything particularly tremendous...
20:41imirkin: iirc the issue you were having was way beyond my understanding, and can pretty much only be debugged by you (or someone with the hw)
20:41endrift: If I knew where to start I may try that
20:41endrift: been busy with other stuff though
20:42imirkin: by now i don't even remember what the issue was ... something about DMA being fubar?
20:42endrift: I think so
20:42endrift: I was getting green/magenta checkerboards and a hardlock
20:42imirkin: yeah. i don't have any genius suggestions =/
20:42endrift: oh well :(
20:42imirkin: i suspect asking someone more familiar with the platform may help
20:42endrift: Thanks anyway
20:46endrift: I did notice a few days ago that some code I wrote was working fine on my TX1 jetson nano with nvgpu but had rounding issues with mesa+nouveau on my switch. Anything I could do to try to debug that, or should I just not worry about it?
20:46endrift: I ended up rewriting the shaders to use integer textures so I didn't have to worry about rounding, but it was the first place I'd seen this
20:46endrift: works fine on Intel + AMD too afaik
20:50imirkin: well, given the info available, could be a generic nouveau bug, a maxwell-related bug, or something TX1-related
20:50imirkin: you could also be doing something dodgy which happens to work out ok on other platforms
20:50imirkin: but that we're not super-careful about
20:50imirkin: what do you mean by "rounding issues"?
20:51imirkin: like you wanted a circle but got a square? :p
20:52endrift: it's probably the dodgy thing
20:52endrift: I was trying to store integers in floats because I thought integer texture color attachments for draw framebuffers didn't work
20:53endrift: so I had an integer in the range 0-63 that I was dividing on write and multiplying on read and I got completely different results sometimes
20:54karolherbst: endrift: are you using mesa master?
20:54karolherbst: or some newish version?
20:54karolherbst: we had a bug on maxwell in the integer div code which could explain such result
20:55karolherbst: uhm behaviour
20:55endrift: so for i915 I'm using latest release and it works, for nouveau I'm using...whatever is on the switch
20:55karolherbst: I am sure there is nothing on the switch itself :p
20:55endrift: which appears to be 19.0.0
20:56endrift: you know what I mean :P
20:56karolherbst: okay, I think 19.0.0 should contain the fix
20:56karolherbst: maybe you are hitting undefined behaviour or something though
20:57endrift: dunno. I should have just been using integer textures/samplers in the first place
20:57endrift: and I am now
20:57endrift: I seem to be having some trouble still but I think it's with clearing integer textures
20:57endrift: I don't think glClear works but glClearBuffer does
21:00imirkin: endrift: hm, well integer division being broken would be a bit of a bummer
21:00imirkin: it's happened before though
21:00endrift: so I was converting to float before doing the division so I wouldn't lose precision
21:01endrift: and since the number was only 6 bits it should fit fine in highp
21:01imirkin: well, keep in mind integer division rounds down
21:01endrift: and since I was dividing by a power of two, it should just futz with the exponent
21:01imirkin: (for positive numbers)
21:01endrift: yes that's why I wasn't doing integer division
21:01imirkin: also for floats, denorms get flushed
21:01imirkin: although i doubt you were hitting that
21:01endrift: I might have been for all I know
21:02imirkin: that's for super-small floats
21:02endrift: but with highp I shouldn't have been
21:02imirkin: i.e. less than 2^-31 or whatever
21:02endrift: I'm familiar with denormals
21:02imirkin: nouveau ignores highp/etc
21:02imirkin: it's all 32-bit
21:02endrift: thaaaaaat might have been my problem
21:02endrift: everything I was doing should fit in 32 bit
21:03imirkin: only the mobile chips and super-modern chips benefit from caring about mediump/etc, so it's hard to care
21:03endrift: (I don't remember how big a single-float mantissa is but it's bigger than 6 bits by a lot)
21:03imirkin: 23 bits
21:03endrift: (I think it's like 23
21:03endrift: yeah that
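The claim that a 6-bit integer survives a divide-on-write / multiply-on-read round trip through a 32-bit float can be checked on the CPU. The sketch below assumes the divisor is 64 (the log only says "a power of two") and uses a hypothetical helper `to_f32` that rounds through IEEE binary32; it says nothing about what the GPU or the shader compiler actually did.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def roundtrip(n: int, scale: int = 64) -> int:
    """Store n as n/scale in a float32, then recover it by multiplying back."""
    stored = to_f32(n / scale)           # "divide on write"
    return int(to_f32(stored * scale))   # "multiply on read"

# Every value in 0..63 comes back unchanged: dividing by a power of two only
# adjusts the exponent, and 6 bits fit easily in a 23-bit mantissa.
all_exact = all(roundtrip(n) == n for n in range(64))

# Integer division, by contrast, discards the value entirely.
int_div_result = 63 // 64

# The smallest non-zero stored value (1/64) is far above the float32
# denormal range, which only starts below 2**-126.
SMALLEST_NORMAL_F32 = 2.0 ** -126
```

If a round trip like this is exact in IEEE arithmetic but wrong on hardware, that points at the driver or an optimization rather than at float precision, which is consistent with the "opt-gone-wrong" guess.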
21:03imirkin: could be an opt-gone-wrong
21:03endrift: I was messing around with floating point representations at my job a few months ago so I should know that one
21:04endrift: granted I was mostly using half float and less
21:04imirkin: less than half-float?
21:04imirkin: quarter-float? :)
21:05imirkin: that's not a lot of bits.
21:05endrift: We needed bandwidth more than precision for this project
21:06imirkin: well, clearly you weren't going to get precision with 8-bit floats...
21:06endrift: so I was experimenting to see how low I could get the precision while still having it work
21:06endrift: 2 mantissa bits (+1 implied) kinda sucked though
21:06endrift: especially handling denormals
21:07imirkin: well, if you want to create a repro, or an apitrace or whatever
21:07imirkin: then i can have a look
21:07imirkin: otherwise there's not a lot we can say
21:07imirkin: other than "yes, bugs may exist in nouveau" :)
21:07imirkin: which is a safe bet all-around
21:08endrift: I don't think I can generate an apitrace on the switch
21:08endrift: maybe I can
21:08imirkin: i'll bbl - rebooting, potentially multiple times. i'll check the logs though if you want to post something. or file a bugzilla issue.
21:08imirkin: well, you said you tested on intel
21:08imirkin: and it worked
21:08endrift: ideally I'd install nouveau on the jetson nano
21:08endrift: oh are those apitraces portable?
21:08imirkin: apitraces are generally portable
21:08endrift: I guess they should be
21:08imirkin: don't use bindless
21:08imirkin: and don't use coherent buffers
21:09imirkin: those are the two big no-no's of portable apitraces
21:09imirkin: (and don't use features which are only available on some subset of gpu's ... like if you rely on INTEL_foo for it, that kinda reduces the portability...)
21:09endrift: If I just install a stock AArch64 Linux distro on my Jetson Nano and install xorg-video-nouveau, should I expect that to work?
21:10imirkin: wayland is more likely to work
21:10endrift: I can try wayland
21:10imirkin: the display is handled by tegra
21:10imirkin: while the accel is handled by nouveau
21:10endrift: I'll look around to see if I can find a guide
21:10imirkin: this is not extremely-well-handled in xorg ddx's
21:11imirkin: and fwiw, a stock linux distro on a Jetson TK1 i have causes it to hang pretty fast (no nouveau involved)
21:11imirkin: i might have somehow fried it or something, dunno. but i've never gotten that stuff to work reliably.
21:11endrift: hm, ok
21:12endrift: I'll see about generating an apitrace once I fix this clear bug
21:12imirkin: (basically upon substantial network activity, it seems to die. which is unfortunate since i nfs-root it...)
21:13imirkin: alright, well i'm out for a bit. file a bug if you get anything concrete, we can investigate.
21:13imirkin: good luck
21:22endrift: for future reference: the afterimage/clearing issue was actually alpha blending
21:23endrift: turns out I was decreasing the alpha in places I hadn't thought about and alpha blending was disabled on my i915, but not on nouveau
21:23endrift: my bug
21:23endrift: I'll capture an apitrace for the other issue now
21:30endrift: https://endrift.com/files/mgba-qt.28.trace can anyone on a maxwell replay this for me and take a screenshot of it on one of the last frames?
21:53karolherbst: my qapitrace just crashes :O
21:57karolherbst: endrift: which frame? 108?
21:57imirkin: skeggsb: so i tried to hook up fp16 formats -- no go. do i have to disable dithering or disable lut or something else maybe?
21:58endrift: doesn't matter exactly, just one near the end
21:58endrift: the last few should be relatively identical
21:58karolherbst: endrift: https://i.imgur.com/gteM0Iw.png
21:58karolherbst: this is a gp107 though
21:59karolherbst: but it's more or less identical to maxwell
21:59endrift: well that looks right
21:59endrift: yoshi is behind everything when rendered on the Switch
21:59endrift: can't actually see him
22:00imirkin: this is on GK208: https://i.imgur.com/iUZUhc5.png
22:01imirkin: seems pretty similar to karol's :)
22:01endrift: also looks correct
22:01endrift: wonder if there's any chance it could be a bug in the Switch version of libdrm_nouveau
22:01karolherbst: endrift: try to run with NV50_PROG_OPTIMIZE=0
22:02karolherbst: although I don't know if that's exposed in a release build
22:02endrift: I'm not sure how to do that on the switch
22:02karolherbst: setenv or something? dunno
22:03imirkin: endrift: are you using ASTC anywhere?
22:03karolherbst: ahh, it's debug only
22:03endrift: imirkin: I don't know what that is
22:03karolherbst: imirkin: why is NV50_PROG_OPTIMIZE debug only?
22:03imirkin: karolherbst: dunno
22:03imirkin: karolherbst: all the debug stuff is debug-only :)
22:03imirkin: endrift: texture compression format
22:03karolherbst: but telling user to just try it would be cool
22:03endrift: oh, no I'm not
22:04imirkin: that's the only thing i know of that would be legitimately different on the TK1/TX1
22:04imirkin: they have native ASTC support, while for desktop it's emulated on cpu
22:04imirkin: [and nouveau exposes that native support, ideally, but in practice not super-duper-tested]
22:04endrift: I wonder if something in how I set up the EGL context on the Switch broke it
22:05imirkin: endrift: how does yoshi make it on there?
22:05imirkin: could it be a depth testing/etc snafu?
22:05endrift: the depth is stored in that integer-in-float texture I was talking about, I render five buffers individually and then composite them manually in a shader
22:05imirkin: you also appear to have a GL_RGBA4 texture. this has been the source of some annoyance for nouveau.
22:06imirkin: (e.g. some people expect to be able to render to a GL_RGBA4)
22:06endrift: (This is because the system I'm emulating does weird not-really-friendly things with layer ordering)
22:06endrift: I'm reading from it, not rendering to it
22:06endrift: if that texture didn't work the graphics would be completely corrupted
22:07imirkin: endrift: also you appear to be making use of GL_BLEND_ADVANCED_COHERENT? that's not supported on nouveau
22:07endrift: the problem appears to either be with the window texture (texture 0) or the way it's compositing the flag textures
22:07endrift: I am?
22:07imirkin: well at least you glDisable it a lot :)
22:08endrift: That's probably Qt's wrapper doing it
22:08imirkin: which leads to warnings about an unknown enum
22:08endrift: I've never even heard of it
22:08karolherbst: endrift: maybe the best way to test would be to boot linux
22:08karolherbst: and test there
22:08karolherbst: also much less painful to debug
22:08endrift: it's also Qt's fault that there are a mess of glViewport(0, 0, -1, -1)
22:09imirkin: uh huh. always blame someone else... :p
22:09endrift: Does the Switch Linux use nouveau?
22:09karolherbst: what else?
22:09endrift: L4T ships with the binary drivers, I thought they might somehow have been using that
22:09endrift: but yeah that's a good point
22:09karolherbst: you wouldn't be able to ship it
22:09karolherbst: or allowed to
22:09endrift: oh. right
22:09karolherbst: but maybe they don't give a damn
22:09karolherbst: not only GPL
22:09karolherbst: but also nvidias license
22:10endrift: Then how did I get Jetpack with official drivers...?
22:10karolherbst: I think it's okay as long as you repackage it for you know, packages
22:10karolherbst: but I don't think you can include the driver in a big blob
22:10endrift: I can try that
22:10endrift: I'm a little disinclined to try it since, uh, with integer textures it just works
22:10karolherbst: that would be probably the best way to do it
22:11endrift: and I have to do less munging of things
22:11karolherbst: as long as it's fast enough
22:11endrift: are integer textures slower?
22:11karolherbst: no idea
22:11endrift: the biggest speed issue I have is with VRAM uploads
22:11imirkin: int textures are fine
22:11endrift: which I do more often than I'd like
22:11endrift: that specific scene I linked uploads the palette a bunch
22:11endrift: because it changes one entry
22:12endrift: most scanlines
22:12imirkin: endrift: note that you can use something like a buffer texture / image, make it a persistent buffer, etc
22:12endrift: and that causes it to stall
22:12karolherbst: why not just add another indirection and do partial uploads?
22:12karolherbst: or some crap like that
22:12imirkin: talk to HdkR about how dolphin streams data in like that
22:12endrift: I tried PBOs and that didn't work
22:12endrift: it did but it wasn't faster
22:12karolherbst: I know
22:12endrift: I'm going to try doing indirection tricks next
22:12karolherbst: just do the fancy shit
22:13endrift: I already do some indirection tricks for a bunch of stuff
22:13karolherbst: you don't have dedicated memory, right?
22:13endrift: you mean dedicated VRAM?
22:13imirkin: switch doesn't have dedicated vram
22:13imirkin: which is why i suggest the persistent buffer
22:13karolherbst: GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT
22:13endrift: how would I do that?
22:13imirkin: not coherent - only persistent.
22:13karolherbst: with glBufferStorage
22:13imirkin: coherent breaks apitrace ;)
22:13karolherbst: right :D
22:14endrift: yeah I don't think I need it to be coherent
22:14imirkin: and doesn't really provide THAT much advantage
22:14endrift: so this renderer is being used not only on the switch, so I'll need to make sure it works on desktop too
22:14endrift: but I'll look into that
22:14imirkin: this will work on desktop too
22:15karolherbst: you still use a PBO
22:15karolherbst: but instead of using os memory
22:15karolherbst: you map GPU memory into your process
22:15karolherbst: and just write into that
22:15karolherbst: more or less
22:16karolherbst: and then you texture with data = NULL
22:16endrift: that may help a lot
22:16endrift: and this is supported in ES 3.1?
22:16karolherbst: uhm.. good question
22:16karolherbst: that's a GL feature, no idea if that's part of GLES
22:17imirkin: it's there in ES
22:17imirkin: iirc it's not in core ES 3.1
22:17imirkin: but it's there in AEP, and in ES 3.2
22:17endrift: glBufferStorage isn't even in 3.1
22:17imirkin: hrm. nope, not in AEP
22:17karolherbst: was it GL_OES_texture_buffer?
22:18imirkin: OES_buffer_storage / EXT_buffer_storage
22:18endrift: oof, I need GL 4.4 for this
22:18imirkin: endrift: not really
22:18karolherbst: do we support OES_buffer_storage inside mesa?
22:18imirkin: the feature's exposed in a lot of mesa impls
22:18imirkin: karolherbst: EXT_buffer_storage. looks like there's no OES version.
22:19endrift: I can swing that on the Switch but I might have trouble on desktop
22:19endrift: so I might fall back to the PBO version
22:19endrift: we'll see
22:19karolherbst: we indeed support it :)
22:19endrift: On desktop I'm trying to work with 3.0, or at least 3.2 Core
22:19karolherbst: endrift: it can be optional for more perf :p
22:19karolherbst: or lower GPU/CPU usage or whatever
22:20endrift: I'm gonna see what happens if I try using it
22:20karolherbst: it is usually faster than keeping a CPU side buffer and uploading it with subTex
22:20karolherbst: uhm... TexSub
22:20imirkin: depends how much it's used by the gpu. if it's a desktop gpu, persistent usually means it's stored in cpu-side memory
22:20imirkin: so if the gpu will read that data lots of times, it'll be slower
22:21karolherbst: ohh yeah, could be
22:21imirkin: if it'll read that data once, it should be better (since no extra upload)
22:21imirkin: but without separate vram, there are no such considerations
22:25endrift: hmm I'm gonna need to use a lot of fences
22:26karolherbst: mhhh, true, synchronisation might become an issue
22:29karolherbst: but hey, you have to stall with your current approach anyway
22:32karolherbst: but maybe you could have like a ring buffer or something and just stall if the CPU side is too fast
22:32endrift: I can use glMemoryBarrier instead
22:32endrift: I think?
22:32karolherbst: endrift: the issue is rather that the GPU might still read from that buffer when you are modifying it
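The ring-buffer idea can be modelled without any GL at all. The following is a pure simulation under stated assumptions: `FakeFence` and `PaletteRing` are hypothetical names, and the "wait" is faked by flipping a flag. In a real renderer each slot would be a region of the persistently mapped buffer, and the fence would be a sync object created with `glFenceSync` after the draw and waited on with `glClientWaitSync`.

```python
class FakeFence:
    """Stand-in for a GPU sync object; signaled once the GPU is done reading."""
    def __init__(self):
        self.signaled = False

class PaletteRing:
    """Hands out buffer slots round-robin; stalls only if a slot is still in use."""
    def __init__(self, slots: int):
        self.fences = [None] * slots
        self.next_slot = 0
        self.stalls = 0

    def acquire(self) -> int:
        """Return a slot the CPU may write to, waiting on its fence if needed."""
        slot = self.next_slot
        fence = self.fences[slot]
        if fence is not None and not fence.signaled:
            self.stalls += 1        # real code would block on the fence here
            fence.signaled = True   # simulation only: pretend the wait finished
        self.next_slot = (slot + 1) % len(self.fences)
        return slot

    def submit(self, slot: int) -> FakeFence:
        """Record that the GPU now reads from this slot; returns its fence."""
        fence = FakeFence()
        self.fences[slot] = fence
        return fence

# The CPU runs ahead of a GPU that never finishes: the first three writes
# find free slots, the fourth wraps around and has to stall.
ring = PaletteRing(3)
for _ in range(4):
    ring.submit(ring.acquire())
stalls_when_gpu_is_slow = ring.stalls
```

With enough slots the CPU only waits when it gets a full ring ahead of the GPU, which is the "stall if the CPU side is too fast" behaviour described above.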
22:33karolherbst: let me ask one thing, the palette is a texture?
22:33endrift: ...what should it be?
22:33karolherbst: how many entries does the palette have? 256?
22:34karolherbst: why not use an UBO?
22:34endrift: Not available in GL 3.2
22:34karolherbst: then.. an uniform
22:34karolherbst: which is just a big array
22:34karolherbst: or 2d array
22:34endrift: wait they are in 3.2
22:34karolherbst: then you don't have to synchronize
22:35endrift: they're not in 3.0
22:35karolherbst: as UBOs are essentially copied to on chip memory
22:35karolherbst: or uniforms
22:35karolherbst: yeah.. a 2d uniform array should be enough, would just require array_of_array
22:35endrift: I can get away with making it 256 entries since each shader would only be using up to 256 of them at a time
22:35karolherbst: or is the palette 1D?
22:35endrift: and I know which in advance
22:35endrift: palette is 2D, 16x32
22:36endrift: but I can just do the multiplication to make that 1D
22:36karolherbst: arrays_of_arrays is 3.1 in ES and 4.3 in GL :D
22:36karolherbst: endrift: are the dimensions constant?
22:36karolherbst: I mean, the access
22:36karolherbst: then it doesn't even matter
22:37karolherbst: or can you have a variable access to the palette?
22:37endrift: what do you mean?
22:37karolherbst: is it like palette[x][y] or palette
22:37endrift: in terms of access or definition?
22:37endrift: in definition, the latter
22:37karolherbst: constant expressions or can the accessors be runtime variable?
22:38endrift: accesses are runtime variable
22:38karolherbst: and each palette entry is a vec3?
22:38karolherbst: or vec4?
22:38endrift: oh, you can't do variable accesses into uniforms in GLSL 1.30 can you
22:39endrift: technically I don't care about .a but it's still there
22:39karolherbst: 8kb for the palette
22:39karolherbst: I think that's small enough to fit in uniforms
22:40endrift: oh, the palette format is 1_5_5_5_REV
22:40endrift: (or 5_6_5 in GL ES land)
22:40endrift: it's 2 bytes per entry
22:40endrift: only 1kb
22:40endrift: **1kiB :P
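The size figures quoted here follow from the palette dimensions. A small sketch of the arithmetic and of the "do the multiplication to make that 1D" step, assuming 16 is the width and 32 the number of rows (the log only says "16x32"); `flat_index` is a hypothetical helper name.

```python
PALETTE_W, PALETTE_H = 16, 32     # assumed: 16 entries per row, 32 rows
BYTES_PER_PACKED_ENTRY = 2        # 1_5_5_5_REV or 5_6_5: 16 bits per entry
BYTES_PER_VEC4_ENTRY = 16         # four 32-bit components

entries = PALETTE_W * PALETTE_H                   # 512
packed_bytes = entries * BYTES_PER_PACKED_ENTRY   # 1024, the "1kiB" figure
vec4_bytes = entries * BYTES_PER_VEC4_ENTRY       # 8192, the "8kb" figure

def flat_index(x: int, y: int) -> int:
    """Map a 2D palette coordinate to an index into a flat 1D array."""
    return y * PALETTE_W + x
```

Both sizes are well under the 64k const buffer size mentioned later in the conversation, so either layout would fit.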
22:41imirkin: endrift: indirect access into uniforms should work just fine
22:41karolherbst: I still think having an uniform/ubo might be faster... dunno
22:41imirkin: er hrm....
22:41imirkin: maybe need ubo for that
22:41karolherbst: having to deal with the format at runtime sounds annoying
22:42imirkin: actually no - should definitely work ok =/
22:42endrift: I can deal with the format on the CPU side
22:42endrift: I already do on the GL ES version
22:42karolherbst: right, but then you need more space :p
22:42endrift: the ES version converts it to 5_6_5
22:42karolherbst: but anyway, it would be interesting to see if that is actually faster
22:42endrift: still the same number of bytes though
22:42endrift: it may be
22:42endrift: especially if I use a UBO
22:43karolherbst: uniform == UBO
22:43karolherbst: on the hardware it's the same
22:43karolherbst: imirkin: do you know if using images vs texture ops makes any difference for read only data?
22:44karolherbst: like could reading from an image be faster than reading from a sampler?
22:44endrift: fwiw I'm using texelFetch everywhere
22:44imirkin: karolherbst: yeah
22:44imirkin: textures go via a texture cache
22:44imirkin: so it's definitely better to go through that
22:44karolherbst: ohhh, because textures are assumed to be read only
22:44imirkin: images go through L2, but textures have a bigger cache
22:45karolherbst: I am really wondering how much uniforms are actually helping.. but access to uniforms is equal to reading from a gpu register
22:45karolherbst: but the invocation overhead might be bigger
22:46endrift: you can't specify the size of a uniform can you
22:46imirkin: uniforms are super-cached too, obviously
22:46imirkin: endrift: huh?
22:46karolherbst: endrift: it's defined by the shader
22:46imirkin: you can't not specify the size of a uniform...
22:46endrift: No I mean
22:46karolherbst: uniform palette vec4;
22:46endrift: you can't say "this ivec3 has 8 bits per component"
22:46endrift: I mean I guess you can say lowp
22:47karolherbst: uniform ivec4 palette;
22:47imirkin: there are exts
22:47imirkin: but they don't do anything extremely helpful
22:47karolherbst: ohh int8...
22:47imirkin: i8vec3 or whatever
22:47endrift: I only need 5 bits per component
22:47endrift: since each palette entry is 15 bits
22:47imirkin: thing is ... it's hard to take advantage of that
22:47imirkin: you can pack it by hand
22:47karolherbst: yeah... we don't have int8/int16 support in nouveau
22:47karolherbst: or gl at all?
22:47endrift: I just want to make sure I don't use too many uniforms
22:47imirkin: and then use bitfieldExtract to get the right bits out
22:48karolherbst: const buffers are 64k, right?
22:48endrift: wait what's bitfieldExtract
22:48karolherbst: endrift: glsl 4.0 feature :p
22:48endrift: too new
22:48imirkin: there's a mesa ext to get it earlier
22:48karolherbst: well, it's just && and <<
22:48endrift: what I can do though
22:48imirkin: also if you do it "by hand", nouveau's optimization will pick up on that.
22:48karolherbst: uhm && and >>
22:49karolherbst: I should sleep
22:49karolherbst: it's getting embarrassing
22:49karolherbst: & and >> :p
22:50endrift: upload e.g. 0x7FFF, then do (ivec3(0x1F, 0x3E0, 0x7C00) & x) >> (ivec3(0, 5, 10))
22:50endrift: and that'll get what I want
22:50endrift: that's not too bad
22:50endrift: I'll try that
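The mask-and-shift expression above can be tried on the CPU first. This sketch mirrors it component by component; `unpack_555` and `pack_555` are hypothetical helper names, and the components are left unnamed because the log does not say which channel sits in which bit range.

```python
MASKS = (0x001F, 0x03E0, 0x7C00)   # low, middle and high 5-bit fields
SHIFTS = (0, 5, 10)

def unpack_555(x: int) -> tuple:
    """Equivalent of (ivec3(0x1F, 0x3E0, 0x7C00) & x) >> ivec3(0, 5, 10)."""
    return tuple((x & m) >> s for m, s in zip(MASKS, SHIFTS))

def pack_555(c0: int, c1: int, c2: int) -> int:
    """Inverse operation: pack three 5-bit components into 15 bits."""
    return (c0 & 0x1F) | ((c1 & 0x1F) << 5) | ((c2 & 0x1F) << 10)

print(unpack_555(0x7FFF))   # (31, 31, 31)
```

Bit 15 is ignored by all three masks, so whatever the format stores there does not leak into the colour components.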
22:52karolherbst: can we do partial updates to uniforms?
22:52karolherbst: or do we have to reupload the entire thing?
22:52imirkin: for user uniforms, we reupload the whole thing
22:52imirkin: we don't get an indication from st/mesa about what's updated
22:52karolherbst: k, thought as much
22:52karolherbst: yeah.. I was actually wondering if you can actually do that through the gl API
22:53imirkin: GL api knows
22:53karolherbst: ohh, really?
22:53imirkin: you run glUniform on very specific locations
22:53karolherbst: how do you partially update an array?
22:53karolherbst: ohhh, interesting
22:53imirkin: you can't really bulk-update an array
22:53imirkin: you have to do it one elem at a time iirc
22:53karolherbst: but.. how?
22:53karolherbst: that sounds annoying
22:53imirkin: yeah, UBO's are nice :)
22:54imirkin: actually i guess you can update multiple ones
22:54imirkin: use something as a base, and then e.g. glUniform4fv()
22:54karolherbst: yeah.. sounds about right
22:55karolherbst: ubos can be used with persistent buffers as well, right?
22:57skeggsb: imirkin: i'm actually not sure HW supports it, until (possibly) turing
22:57skeggsb: despite it being in the class headers..
22:57imirkin: skeggsb: gah
22:57skeggsb: what error code does evo throw?
22:57imirkin: no error
22:57karolherbst: "All NVIDIA GPUs from the 900 and 1000 series support HDR display output"
22:58imirkin: just ... all black :)
22:58karolherbst: at least that's what nvidia is saying
22:58imirkin: karolherbst: HDR != fp16
22:58karolherbst: ohhh, true
22:58imirkin: skeggsb: i'm currently suspecting LUT
22:58imirkin: also i'm testing on the G84 :)
22:59imirkin: largely coz ... convenience. maybe i should give it a shot on the GK208
23:16endrift: That gave me a >50% speedup in one of my pathological tests
23:17endrift: thanks :D
23:17karolherbst: the uniform stuff?
23:17endrift: it was running at 43-48 fps before, now it's running >60
23:18karolherbst: I assume you mostly get more shaders running in parallel
23:18karolherbst: as you have lower register pressure
23:18karolherbst: wasn't thinking of that before
23:18endrift: I guess uniforms have their own block on the GPU?
23:18karolherbst: reading from them is as fast as reading a register
23:18endrift: That means I have another thing I should move to uniforms
23:19endrift: (affine params)
23:19endrift: actually I should use UBOs for this because it's structured data
23:19karolherbst: uniform block?
23:19karolherbst: you can have a struct for a uniform
23:19endrift: I misspoke
23:19endrift: I'll do that after I commit this
23:20karolherbst: are you working on upstream vbam or just your own fork?
23:20karolherbst: or different emulator?
23:21endrift: it's from scratch*
23:21karolherbst: I see
23:21endrift: *vaguely based on another emulator I wrote in 2012 that was from scratch
23:26endrift: maybe I'll actually do this tomorrow
23:26endrift: I've been hacking at this for a while
23:28imirkin: endrift: all the stuff that's the same across invocations should be in a uniform
23:29endrift: invocations of the shader, or the draw?
23:29imirkin: indirect indexing into uniforms is fairly cheap, usually
23:29imirkin: invocations of the shader
23:29endrift: so e.g. once per fragment
23:29endrift: (for a fragshader
23:29karolherbst: the entire stage
23:29endrift: makes sense
23:29karolherbst: _all_ stages actually
23:29endrift: I have a bunch of things that are the same per y, but vary per x and it's really annoying
23:29imirkin: if you need format conversion, texturing is a cheap way to get it "for free"
23:30endrift: not sure how to manage that
23:30imirkin: although i dunno how it compares to uniforms. depends on what exactly you need.
23:30imirkin: the other nice thing about uniforms is that on nvidia hw they don't cause stalls between successive draws
23:30imirkin: i.e. it's optimized for draw; update uniform; draw; update uniform; draw
23:31imirkin: (the updates are staged to a special place, and written out in sequence afterwards)
23:32imirkin: but the draw can run before all that happens, using the special place to look up the correct uniform values
23:34endrift: seems fast on i915 too
23:34endrift: no clue about AMD
23:35karolherbst: ubos are usually much faster than texture operations
23:36karolherbst: or uniforms in general
23:36endrift: I was worried about upload
23:36karolherbst: it's one page?
23:36endrift: .........good point
23:37karolherbst: I think we even just push those values inside the command buffer
23:37karolherbst: dunno if we do that for user uniforms...
23:37karolherbst: or when we do it?
23:38endrift: technically I'm only uploading 1/4 of a page :P
23:39endrift: but I do it up to...(128+4)*160 times per frame
23:39endrift: so I was worried
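For scale, the worst case endrift describes can be put into numbers. This is only a back-of-the-envelope sketch: it assumes a 4 KiB page (so a quarter page is 1024 bytes) and a 60 fps target, neither of which is stated in the log, and all names are made up for illustration.

```python
# Rough worst-case uniform upload volume, using the figures from the chat.
PAGE_BYTES = 4096                      # assumption: 4 KiB page
UPLOAD_BYTES = PAGE_BYTES // 4         # "1/4 of a page" per upload
UPLOADS_PER_FRAME = (128 + 4) * 160    # "up to (128+4)*160 times per frame"
TARGET_FPS = 60                        # assumption: 60 fps target


def worst_case_bytes_per_frame() -> int:
    """Bytes uploaded per frame if every possible update actually happens."""
    return UPLOAD_BYTES * UPLOADS_PER_FRAME


def worst_case_bytes_per_second() -> int:
    """Same worst case, scaled to the assumed frame rate."""
    return worst_case_bytes_per_frame() * TARGET_FPS


if __name__ == "__main__":
    print(UPLOADS_PER_FRAME, "uploads per frame")
    print(worst_case_bytes_per_frame() / (1024 * 1024), "MiB per frame")
    print(worst_case_bytes_per_second() / (1024 * 1024), "MiB per second")
```

Under those assumptions the upper bound is about 20.6 MiB per frame, which is why reducing the number of updates matters more than the size of each one.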
23:40karolherbst: do you always update the entire array?
23:40karolherbst: although that's probably the easiest anyway
23:40karolherbst: and fastest
23:42endrift: it's 160 if volatile scanline parameters are changed that many times per frame
23:43endrift: I'm looking at ways to make more scanline params non-volatile
23:43endrift: ideally it's 1
23:43endrift: and in many cases it is
23:44imirkin: a texture buffer can be many MBs
23:44imirkin: you lose some of the benefits of uniforms, but still get a good cache
23:44karolherbst: imirkin: the question is rather, how much sense does it make to use a texture if you have only roughly 1kb of data anyway
23:45karolherbst: and it's really just a lookup table
23:45imirkin: 128 * 160 = more data
23:45imirkin: my point is that if it doesn't fit into a uniform
23:45imirkin: or across a couple uniform buffers
23:45endrift: no I mean
23:45endrift: that's how many times I reassign the value of the uniform
23:46imirkin: could you precompute it once up-front
23:46endrift: well...yes, but then I'd need that many uniforms
23:46endrift: since it can be different per y
23:46endrift: (y is the number of scanlines tall it is)
23:46imirkin: so just do that?
23:47karolherbst: but that doesn't fit inside one uniform anymore though
23:47imirkin: hence my other comments
23:47imirkin: hopefully it's all coming together now :)
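The "precompute it once up-front" idea can be sketched on the CPU side as a single table with one fixed-size record per scanline, indexed by y. This is a hypothetical illustration, not endrift's code: the record layout, the function names, and the 1024-byte record size (taken from the "roughly 1kb" estimate above) are all assumptions.

```python
# Build one buffer holding a parameter record for every scanline, so the
# shader could index it by y instead of having the uniform reassigned
# between draws. Layout and names are hypothetical.
SCANLINES = 160  # scanline count mentioned in the chat


def build_scanline_table(record_for_line, record_size: int) -> bytes:
    """Concatenate one fixed-size record per scanline into a single buffer."""
    table = bytearray(SCANLINES * record_size)
    for y in range(SCANLINES):
        record = record_for_line(y)
        if len(record) != record_size:
            raise ValueError(f"scanline {y}: expected {record_size} bytes")
        table[y * record_size:(y + 1) * record_size] = record
    return bytes(table)


def record_at(table: bytes, y: int, record_size: int) -> bytes:
    """Return the record for scanline y (what the shader-side lookup would read)."""
    start = y * record_size
    return table[start:start + record_size]


if __name__ == "__main__":
    # Dummy data: every byte of scanline y's record is set to y.
    table = build_scanline_table(lambda y: bytes([y]) * 1024, 1024)
    print(len(table), "bytes total")
```

With 1024-byte records the table comes to 160 KiB, which is more than a 64 KiB uniform buffer would hold. That is the case where a texture buffer, or several uniform buffers, comes in.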
23:48endrift: so that's what my other approach was going to be
23:48karolherbst: endrift: where do you split your draws?
23:49endrift: I split draws on volatile scanline parameters--i.e. things that would make me need to change the values in my uniforms to get it right before and after the change
23:49karolherbst: so in theory you could do one draw per frame?
23:49endrift: in most cases yes
23:49endrift: in some cases no
23:50endrift: there's one parameter that can change that would change which shaders I use
23:50endrift: and that can change midframe
23:50endrift: but I would like to reduce it to JUST that being the thing that splits if possible
23:50karolherbst: how different are those shaders?
23:50endrift: I also need to convert one thing I'm doing with a really hacky approach to a vertex shader
23:54endrift: actually there's another thing
23:54endrift: writing to emulated VRAM
23:54endrift: that really only happens in bulk during scene transitions so it's tolerable
23:55endrift: but I've already optimized it a lot, it's like 3x faster than it was when I started
23:55karolherbst: how big is vram?
23:55imirkin: and 3x slower than it will be when you're done optimizing :)
23:55endrift: imirkin: I hope so!
23:55endrift: karolherbst: 0x1C000 bytes
23:55endrift: I forget what that is in decimal
23:55endrift: I think it's 192kiB?
23:56imirkin: just under 128kb
23:56imirkin: 0x10000 is 64k
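The size conversion is quick to check. A minimal sketch, with names chosen here for illustration:

```python
# Convert the emulated VRAM size quoted in the chat from hex to KiB.
VRAM_BYTES = 0x1C000


def to_kib(num_bytes: int) -> float:
    """Convert a byte count to KiB (1 KiB = 1024 bytes)."""
    return num_bytes / 1024


if __name__ == "__main__":
    print(VRAM_BYTES, "bytes =", to_kib(VRAM_BYTES), "KiB")
    print(0x10000, "bytes =", to_kib(0x10000), "KiB")
```

So 0x1C000 is 114688 bytes, or 112 KiB: under 128 KiB, and not 192 KiB.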
23:56karolherbst: endrift: when can you write to vram?
23:56endrift: a lot of the time
23:56karolherbst: and how much of it do you have to be able to access inside the shaders?
23:56endrift: ^ that's how I optimized it
23:56endrift: only upload it when it's needed or between frames
23:58karolherbst: do you have to write to it inside your shaders?
23:58karolherbst: but access is variable I guess
23:59karolherbst: I am wondering...
23:59karolherbst: as it totally fits into 2 ubos
23:59karolherbst: indirectly accessing ubos might actually be faster
23:59karolherbst: because we can do that actually
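The "fits into 2 ubos" remark can be illustrated with the address arithmetic involved. This sketch assumes each uniform buffer holds 64 KiB (0x10000 bytes); that limit is an assumption here and the real maximum depends on the hardware and driver. The function names are hypothetical.

```python
# Split an emulated VRAM address into (buffer index, offset within buffer),
# assuming 64 KiB uniform buffers. The 64 KiB limit is an assumption.
UBO_BYTES = 0x10000
VRAM_BYTES = 0x1C000


def ubos_needed(total_bytes: int, ubo_bytes: int = UBO_BYTES) -> int:
    """Number of buffers needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // ubo_bytes)


def split_address(addr: int) -> tuple[int, int]:
    """Map a VRAM byte address to (ubo index, byte offset inside that ubo)."""
    if not 0 <= addr < VRAM_BYTES:
        raise ValueError(f"address {addr:#x} is outside emulated VRAM")
    return addr // UBO_BYTES, addr % UBO_BYTES


if __name__ == "__main__":
    print(ubos_needed(VRAM_BYTES), "buffers")
    print(split_address(0x12345))
```

The second buffer would only be three-quarters full, since 0x1C000 - 0x10000 = 0xC000 bytes (48 KiB).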
23:59endrift: how big can a ubo be?
23:59karolherbst: but it's not limited