08:07pmoreau: mupuf: Hum, I misunderstood you then. I thought you were saying Kepler+ would be easier to deal with.
08:09karolherbst: mupuf: clocking to lowest might also be a good idea (looking at tesla here)
08:09karolherbst: but only if the voltage actually drops?
08:17karolherbst: pmoreau: ever looked into vulkan compute?
08:18pmoreau: karolherbst: Kinda: I followed a simple example/tutorial (about 400 LoC :-D) and ported it to using vkcpp while implementing it, and then ran it.
08:19pmoreau: But other than that, no. Haven’t tried anything more with it, though I’m certainly planning to.
08:23karolherbst: pmoreau: I see
08:25karolherbst: pmoreau: well, I think the nir to nvir pass is kind of in a state where people could actually start implementing things against it. I am quite sure I got the CFG bits not completely wrong. Well, I don't know of any CFG-based tests which fail due to the CFG handling
08:26pmoreau: Nice! I might look into it during the weekend then, as we won’t submit the paper next Friday, as it’s not ready.
08:28karolherbst: pmoreau: :)
08:29karolherbst: I already moved on testing against glsl-1.30 (including older) tests
08:29karolherbst: but I should search for a few more CFG tests
08:29pmoreau: What’s the rate on glsl-1.10 nowadays?
08:29karolherbst: quite good
08:29karolherbst: well glsl-1.30 is this: [4611/4611] skip: 2, pass: 3716, fail: 890, crash: 3
08:30karolherbst: currently still fixing some trivial failures (missing alu ops)
08:30karolherbst: pmoreau: handling input/output write masks fixed a crap load of tests
08:31karolherbst: like you have a variable split into 4 components and have 4 declarations for this in nir
08:31karolherbst: although it is really just taking one slot in nouveau
08:31pmoreau: Regarding the CFG, I am not sure what later GLSL versions might have added that would complicate the CF.
08:31karolherbst: and then the shifting and address adjustment and so on
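
The write-mask handling described above can be sketched roughly as follows. This is only an illustration in Python, not the actual nir to nvir code: the tuple layout `(slot, first_component, num_components)`, the function names, and the assumed 16-byte vec4 slot with 4-byte components are all hypothetical.

```python
from collections import defaultdict

def merge_components(decls):
    """Fold per-component declarations into one write mask per slot.

    decls: iterable of (slot, first_component, num_components) tuples
    (a hypothetical layout chosen for this sketch).
    """
    masks = defaultdict(int)
    for slot, first, count in decls:
        for component in range(first, first + count):
            masks[slot] |= 1 << component  # bit 0 = x, bit 1 = y, ...
    return dict(masks)

def component_address(slot, component, slot_size=16, comp_size=4):
    """Byte address of one component, assuming 16-byte vec4 slots."""
    return slot * slot_size + component * comp_size

# Four scalar declarations that all live in slot 3.
decls = [(3, 0, 1), (3, 1, 1), (3, 2, 1), (3, 3, 1)]
masks = merge_components(decls)
```

With these inputs the four declarations collapse into a single slot with all four mask bits set, and the per-component address is the slot base shifted by the component offset.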
08:31karolherbst: pmoreau: tests
08:31karolherbst: pmoreau: the first test using continue is in glsl-1.30 ;)
08:32pmoreau: What about “break”? glsl-4.00? ;-p
08:32karolherbst: nah, the other stuff was all in 1.10 already
08:33mupuf: pmoreau: sure, maxwell+ refuses to POST without the power supply :D
08:34pmoreau: The hardware does that automatically? :-D
08:34karolherbst: I think ATI hardware did this since ever
08:34mupuf: karolherbst: reclocking is possibly not enough. Disabling accel would be best
08:34karolherbst: might be the firmware though
08:34karolherbst: mupuf: I meant doing both
08:34mupuf: Still better than the blob
08:34mupuf: That's overkill :D
08:34karolherbst: reclocking can reduce the power consumption by a significant amount
08:35mupuf: Power gating would be ideal
08:35karolherbst: yes, but less important
08:35karolherbst: reclocking has a bigger impact
08:35karolherbst: well depending on the default state
08:35karolherbst: but usually it is true
08:35mupuf: ?? No, it doesnt
08:35mupuf: Power, not clock gating
08:35karolherbst: ahh, I see
08:36mupuf: Aka, the thing we don't know how to do ;)
08:36mupuf: But it cuts the power entirely ;)
08:36karolherbst: but does it really matter? I mean even if nvidia does it, I think we set up the GPU in a state where we already do a lot of those things
08:36karolherbst: through devinit scripts
08:37karolherbst: never saw a significant gap between nvidia and nouveau regarding power consumption
08:37karolherbst: like on my kepler with all the clock gating stuff it is somewhere around 5%
08:37mupuf: Because they never pg pgraph? :D
08:37karolherbst: okay, that makes sense
08:37mupuf: It's like powergating 90% of the gpu :D
08:38karolherbst: I can assume there is no real need for this... usually
08:38mupuf: Indeed :D
08:38pmoreau: But maybe the last 10% are responsible for 90% of the total consumption? ;-p
08:38karolherbst: mupuf: but even on full idle?
08:39mupuf: pmoreau: if by the vbios, you mean the hardware, then yes :)
08:40mupuf: karolherbst: when is it ever full idle on windows?
08:40karolherbst: mupuf: laptops
08:40karolherbst: for a certain amount of time
08:40karolherbst: until they power it off :p
08:40pmoreau: mupuf: I meant the hardware, or some firmware, similar to dvfs. But I was wondering whether it would be in the VBIOS or not.
08:41mupuf: Yeah, it is part of the vbios
08:41mupuf: It displays a message
08:41karolherbst: mupuf: but I meant like on linux, where we can have the GPU being full idle for sure
08:41mupuf: The best they could do in HW would be to force clock or power gating pgraph
08:42mupuf: karolherbst: yeah, but no one optimises for imirkin's setup ;)
08:42mupuf: Anyway, my plane is about to take off
08:42mupuf: I plugged the right gpus for fan dev
08:43karolherbst: mupuf: laptops again
08:43mupuf: So, i'll hopefully have time for it now that i am on vacation!
08:43karolherbst: mupuf: they don't turn the GPU off there ;)
08:44mupuf: See you, guys :)
08:44karolherbst: pmoreau: after adding some missing opcodes: [4611/4611] skip: 2, pass: 3812, fail: 794, crash: 3 :) getting there
08:50pmoreau: mupuf: Have a nice flight and enjoy your vacations!
08:51pmoreau: karolherbst: Nice! You can have the GPU being full idle when Nouveau fails at/does not support suspending/resuming a card.
08:52pmoreau: For example, on all Mac laptops with dual card, the NVIDIA one is never suspended, because they don’t advertise Optimus capabilities or similar.
08:53karolherbst: pmoreau: now I have only fround_even and isign missing
08:53karolherbst: isign should be trivial, but round_even?
08:54karolherbst: CVT is used for rounding
08:54pmoreau: You probably have a flag for that on the hardware instruction. Not sure how it is expressed in NVIR, possibly a flag?
08:54karolherbst: with explicit rounding mode
08:54karolherbst: pmoreau: round_even is the opcode
08:54pmoreau: s/possibly a flag/possibly a modifier
08:54karolherbst: pmoreau: no
08:54karolherbst: pmoreau: in nir, you always have those things explicit as it seems
08:55karolherbst: there is even a imax and umax
08:55pmoreau: I’m talking about NVIR, not NIR ;-)
08:55karolherbst: there we have those flags
08:55karolherbst: ->rnd = ...
08:55karolherbst: but I will take care of that when arriving in the office, I should kind of go :p
08:56karolherbst: or at least get out of bed :D :p
08:58pmoreau: Anyway, bbiab. Need to watch a student competition for the CG course.
12:11dupondje: Hi Guys! Some quick question. The Quadro M1200 Mobile is perfectly supported on Nouveau?
12:25pmoreau: dupondje: No card is perfectly supported by Nouveau. So, which criteria are important to you?
13:10dupondje: pmoreau: powermanagement :) as its a card inside my new laptop
13:12pmoreau: Nouveau should be able to power off the card when it is not used. Other than that, the card should boot using its lowest power level, but Nouveau does not implement power- and clock-gating.
13:14pmoreau: dupondje: As long as you are not planning to use the card, it should sleep happily and consume very little.
13:15pmoreau: If you use it, it won’t be as bad as it could be, but you won’t get much performance out of it.
13:15dupondje: When does it get started? Only for applications that you start with PRIME ?
13:15pmoreau: Commands like `lspci` will wake it, but it should go back to sleep after 5 seconds of inactivity (IIRC).
13:16dupondje: ah ok
13:16dupondje: also seems like I hit https://bugs.freedesktop.org/show_bug.cgi?id=100423
13:16dupondje: on shutdown that was
13:18pmoreau: A MMIO read FAULT? Or MMIO write FAULT? (both are found in that bug report)
13:21pmoreau: On shutdown? That is interesting. Can you give a link to your dmesg please?
13:22dupondje: it was on the initial install. When it's done it asks for a reboot, then got that message
13:22dupondje: kern.log:Dec 15 14:07:52 lt-jeanlouis kernel: [ 1.310748] nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 022554 [ IBUS ]
13:22dupondje: kern.log:Dec 15 14:07:52 lt-jeanlouis kernel: [ 1.312207] nouveau 0000:01:00.0: bus: MMIO read of 00000000 FAULT at 10ac08 [ IBUS ]
13:23pmoreau: OK, so when it’s booting.
13:23pmoreau: (and initialising the card)
13:24dupondje: those are from booting indeed. But on the initial reboot (shutdown stage), I had the error also
13:25dupondje: but thats not logged I guess, as system locked :(
13:25pmoreau: Which chipset is your card, if you run `lspci -d 10de:`?
13:25pmoreau: Oh, system locks up on shutdown? Every time?
13:27dupondje: 01:00.0 3D controller: NVIDIA Corporation GM107GLM [Quadro M1200 Mobile] (rev a2)
13:27pmoreau: You might be able to retrieve the logs (or at least bits of them) from journalctl (if you are using systemd), by running `journalctl -b -X` where `X` is the n-th previous boot.
13:27dupondje: not everytime it seems :)
13:27pmoreau: Good for you, but might be harder to debug
13:28dupondje: Specifying boot ID or boot offset has no effect, no persistent journal was found.
13:28dupondje: hehe :D
13:28dupondje: anyway ill play around further first
13:28dupondje: my wifi hangs also :(
13:28pmoreau: Feel free to add yourself to the CC list on that bug report you linked.
13:30dupondje: I'll do
13:31dupondje: do the nvidia drivers have other advantages except power & performance?
13:31dupondje: seems like they don't support wayland yet :)
13:31pmoreau: Nouveau is having issues with Wayland it seems, so, almost the same as not supporting it
13:33dupondje: works fine atm :D
13:34pmoreau: With the NVIDIA driver, you will also get a Vulkan driver, OpenCL support (up to 1.2), CUDA, OpenGL (up to 4.6, whereas Nouveau only does up to 4.3?, maybe 4.5?), and less buggy overall.
13:34pmoreau: Isn’t Wayland running on the Intel integrated GPU though?
13:47RSpliet: pmoreau: think nouveau does 4.5 but doesn't advertise it due to CTS failures?
13:49pmoreau: Possibly, and because we haven’t submitted a driver conformance application for Nouveau (not that I’m aware of)
13:49RSpliet: Am successfully using nouveau w/ wayland since half a year, apart from a few more random hangs than previously nothing majorly shocking
13:49RSpliet: No use submitting a conformance application if you fail tests right? ;-)
13:50RSpliet: We get native DX9 acceleration on Wine though, that's something the official driver doesn't do!
13:50RSpliet: Not sure if it buys us much with the perf disadvantage we have
13:50RSpliet: but it's something
13:51RSpliet: karolherbst: you've been with your nose in the compiler recently. How difficult do you think it is to implement loop unrolling? Do we still have the necessary metadata in TGSI or NIR to do this easily (loop invariant vs. loop body etc)?
13:51pmoreau: I should try Wayland on my computers.
13:51RSpliet: pmoreau: same question for SPIR-V ;-)
13:51karolherbst: RSpliet: we should have it in TGSI
13:51karolherbst: spir-v is pain
13:52karolherbst: RSpliet: it should be even easier with nir though
13:52karolherbst: nir gives us pretty much everything we want to know
13:52karolherbst: TGSI... we have to remember stuff we want to know
13:52karolherbst: and with spir-v everything is screwed up
13:53karolherbst: because unstructured control flow
13:53RSpliet: I know too little about SPIR-V unfortunately. What do you mean with unstructured?
13:53karolherbst: but I think we have some information there as well
13:53pmoreau: karolherbst: That is only true for OpenCL SPIR-V: GLSL/Vulkan SPIR-V is structured.
13:53karolherbst: RSpliet: random bras to anywhere
13:53karolherbst: pmoreau: well right, but you know
13:53RSpliet: no control flow graph and BBs? Meh
13:53karolherbst: RSpliet: well, there are labels
13:54karolherbst: RSpliet: anyway, I wouldn't do it based on spir-v, just convert it to nir and use that
13:54karolherbst: RSpliet: but the main issue is, how do we do opts based on nvir?
13:54RSpliet: karolherbst: the tricky bit is understanding what your loop invariant is and transform that for an unrolled version of your loop.
13:55karolherbst: RSpliet: nir has a loop unrolling pass, so you could just check how they do it
13:55karolherbst: thing is, how would we do it in nvir ;
13:56RSpliet: Ah that's useful. It's more that I've observed NVIDIA seems to find this an important optimisation for certain OpenCL kernels. Probably because it gives them more scope for early DRAM request issuing and slightly lower branching overhead
13:56karolherbst: lower branching overhead is always nice
13:57RSpliet: Yeah, the other half is "where do you stop", which is RA-driven, which is influenced by the instruction scheduling pass, which you have to do after loop unrolling (to move the loads up), which now gives you a circular dependency ;-)
13:57karolherbst: RSpliet: but to be honest, is there any difference if we implement loop unrolling in nir or in codegen? I see a lot of good reasons to have a lot of optimisations inside codegen, but loop unrolling?
13:58RSpliet: karolherbst: unrolling is unrolling afaik. But if unrolling leads to GPR usage beyond 32, you're cutting yourself in the fingers on kepler.
13:58RSpliet: there's similar thresholds for other gens ;-)
13:58karolherbst: you can specify the max depth of unrolling
13:59RSpliet: Yeah, but that depends on how big your loop is and the livesets going into it
13:59karolherbst: do you really increase your GPR usage that much with unrolling?
13:59RSpliet: Ideally yes
13:59karolherbst: why ideally?
14:00RSpliet: if you create a lot of dependencies within your unrolled loop you'll limit the benefit of unrolling
14:00karolherbst: well but if you really just unroll
14:00karolherbst: without reusing any values
14:00karolherbst: just unrolling
14:01karolherbst: so you really just kill the bras away
14:02karolherbst: I mean the basics of loop unrolling are, that you pretty much just cut and paste the loop body, remove the abort condition and the CFG ops and just paste the amount of blocks you need with an adjusted end block, right?
14:03karolherbst: and then you can still do your optimisations
14:03RSpliet: Well, yes and no. The loop counter is likely to index into an array (or two), you'd want to rewrite those loads to take counter+fixed immediate rather than increment counter, load
14:04karolherbst: okay so you reduce the gpr usage
14:04karolherbst: because that index value is gone
14:04RSpliet: That's the simplest of cases, but there's a lot of cases like that that could reduce your opportunities to reschedule ops
14:05karolherbst: well sure
14:05RSpliet: At the other end of the loop there's aggregate ops and stores. If your loop calculates some value through a series of ops, you ideally do the aggregate addition you need to do in the end rather than between every copied body (iteration)
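
To make the two points above concrete (loads rewritten as counter plus a fixed immediate, and the aggregate addition moved to the end), here is a small sketch in plain Python. It only illustrates the shape of the transformation on a summation loop; it is not how NIR or codegen implement unrolling, and the unroll factor of 4 and the function names are arbitrary choices for this example.

```python
def sum_rolled(data):
    """Original loop: one load, one add and one branch per element."""
    total = 0
    i = 0
    while i < len(data):
        total += data[i]
        i += 1
    return total

def sum_unrolled4(data):
    """Unrolled by 4: loads use base + fixed offset, partial sums are independent."""
    n = len(data)
    acc0 = acc1 = acc2 = acc3 = 0
    i = 0
    while i + 4 <= n:
        # Four loads at counter + immediate; none depends on another,
        # so a scheduler would be free to hoist and interleave them.
        acc0 += data[i + 0]
        acc1 += data[i + 1]
        acc2 += data[i + 2]
        acc3 += data[i + 3]
        i += 4
    # The aggregate addition happens once, after the unrolled body.
    total = acc0 + acc1 + acc2 + acc3
    # Remainder loop for trip counts that are not a multiple of 4.
    while i < n:
        total += data[i]
        i += 1
    return total
```

Note the cost that comes up in the discussion: the unrolled version keeps four accumulators live instead of one, which is exactly the kind of extra register pressure that has to be weighed against the scheduling freedom.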
14:06karolherbst: as I said: we should ignore optimizing the unrolled code
14:06karolherbst: because this just adds issues we don't need right now
14:06karolherbst: of course if we do optimisations we could increase GPR usage
14:06karolherbst: but this is true for other opts as well
14:06karolherbst: and other opts also can reduce rescheduling optimisations
14:07karolherbst: and we still do those, right?
14:07RSpliet: the optimisation consists of removing precisely the dependencies that make loop unrolling worthwhile - namely those that allow to reschedule ops such that they interleave
14:07karolherbst: it still comes down of being a trade off
14:08karolherbst: those post unroll optimisations
14:08RSpliet: I suspect this has a much bigger impact than just the branching overhead, because branch overhead is quite easy to mask with other warps :-)
14:08karolherbst: we don't have to do all of those, not always at least
14:08karolherbst: but if we just unroll and get a small benefit from this without all the post opts, then we can take stuff like GPR usage into account when doing those opts
14:10karolherbst: and that was my suggestion: why not handle the unrolling in a higher level IR, without doing any opts which need information only codegen has, and move everything away where there is no point in checking the conditions again or something like that
14:10karolherbst: like you probably won't be able to do more loop unrolling after doing all the opts in codegen
14:10karolherbst: because everything you have to know, you know already before translating to nvir
14:11karolherbst: just an idea
14:12RSpliet: Pragmatically that might not be a bad approach for starters; in a perfect world it's not great. If you unroll before you have GPR information you risk losing performance by unrolling too little OR by unrolling too much (and losing the ability to kill the dependencies that you want for resched because the opt has become too expensive)
14:13karolherbst: that is, if we really have to know that, but that is kind of my plan
14:13karolherbst: we don't need to do something in codegen if there is no benefit in doing it there
14:13karolherbst: I mean sure, with infinite time and devs we can do everything in codegen
14:13karolherbst: but... reality ;)
15:03RSpliet: Well, in this case there is a benefit for doing things in codegen (better GPR usage estimates), but if we keep this use-case in mind when improving NVIR it's not a bad thing to go for the NIR pass as a first step. There's a million areas for nouveau perf improvements, so any step would do. Also, I wonder how much of this unrolling is useful for non-compute workloads, but I bet someone who knows GL will have a more informed opinion
15:10karolherbst: wuhu pmoreau glxgears runs now as well, yay
15:54pmoreau: karolherbst: Yes! Now we can run benchmarks between going through NIR compared to TGSI!
15:55karolherbst: that interpolation stuff is messy....
15:56karolherbst: because all layers have different naming schemes (mesa/gallium/nouveau/nouveau2..) and then nothing seems to fit anyway...
15:57karolherbst: especially, because I get stuff like "INTERP_MODE_NONE" from nir and I am the one ending up interpreting it
15:57karolherbst: and have to choose the correct mode
15:57karolherbst: TGSI is more explicit here, like it uses "DCL IN, POSITION, LINEAR" for the exact same input
15:58RSpliet: glxgears... uses shaders? Or is that nouveau default shaders for T&L?
15:58pmoreau: Is that how it’s supposed to be in NIR, or is it a bug?
15:58karolherbst: RSpliet: it actually uses shaders
15:58karolherbst: RSpliet: for fun reasons, I got glxspheres working before glxgears...
15:58karolherbst: pmoreau: not quite sure, but I think this is how it is supposed to be
15:58karolherbst: pmoreau: I will just check what the other drivers are doing
15:58pmoreau: RSpliet: I’m sure it uses 4.6 and passes the shaders as SPIR-V 8-)
15:59karolherbst: maybe we should do glxgears 4.5
15:59karolherbst: and use fancy 4.5 features
16:00karolherbst: without changing the result of course
16:00karolherbst: it should still look the same
16:00pmoreau: Let’s do that! And then compare it against vkgears and see which one wins!
16:01karolherbst: wasn't there a different gears implementation anyway?
16:01karolherbst: some fancy one with a cube and a glxgears on all sides?
16:19karolherbst: pmoreau: here is how that stuff maps: https://gist.githubusercontent.com/karolherbst/1e5da72195f88a50af66ba076e3210fa/raw/34534d4499b42d165259e2a19228e36390d36033/gistfile1.txt
16:19karolherbst: ohh, I could add the nvir print names :D
16:21karolherbst: for maximum confusion: https://gist.githubusercontent.com/karolherbst/1e5da72195f88a50af66ba076e3210fa/raw/78c5ddd0d3e8feb49e3f1e694d5b413a45587be9/gistfile1.txt
16:21karolherbst: imirkin: can we tidy this up pls? :D
16:32RSpliet: using tessellation shaders to make rusty gears?
17:12g4570n: Hi, I have Devuan Jessie installed on a desktop pc, I can put it in suspend mode but when I wake it up the XFCE Desktop is frozen, I cannot use it, I have to restart. Can this be a Nouveau problem or can it be a kernel problem? What do you think?
17:12g4570n: The mouse cursor, at times, moves and freezes again. The keyboard does not respond directly.
17:21karolherbst: g4570n: check dmesg through SSH if possible
17:25imirkin_: karolherbst: i highly recommend you read up on how interpolation works in glsl
17:25imirkin_: there are a LOT of various cases
17:25imirkin_: and yes - it's all extremely confusing
17:25imirkin_: esp when you start taking legacy GL into account
17:25imirkin_: i.e. wtf shade model flat does, etc
17:26imirkin_: (not to be confused with interpolation mode flat, of course)
17:26imirkin_: the short version is that only COLOR inputs are affected by the shade model flat/smooth setting
17:27imirkin_: for other things, there's "regular" perspective interpolation, noperspective, and flat interpolation
17:27g4570n: karolherbst: to find the fault it throws?
17:27imirkin_: additionally it can be interpolated at the center, sample, or centroid
17:28imirkin_: (which are meaningless for flat interp, but meaningful for the other 2)
17:28imirkin_: this information goes into a combination of shader program header, and the interp instructions
17:28imirkin_: (program header configures the interpolation unit, so it knows what to do when the interp instruction calls it)
17:30imirkin_: on nv50 it's somewhat different since there is no shader program header
17:30imirkin_: anyways, i'd recommend not focusing on that too much -- get the CFG working first.
17:30imirkin_: these are the little details that don't matter and can be figured out later.
17:31imirkin_: RSpliet: mesa generates shaders for ff setups
17:31imirkin_: [optionally, of course... iirc it's a setting in ctx->Const or something]
17:31imirkin_: i think that early versions of gallium had some fixed function support, but the current gallium api is shader-only
17:52karolherbst: imirkin_: yeah... I checked what also glsl -> tgsi is doing and how nouveau interprets those tgsi bits
17:52karolherbst: imirkin_: I think I somehow got it quite right, missing some exceptions though
17:53karolherbst: imirkin_: I just got confused because nir gave me a different interpolation mode than TGSI does
17:53karolherbst: but it seemed to be fine in the end
17:53imirkin_: names are a little different
17:53imirkin_: look at how st_glsl_to_tgsi maps them
17:53karolherbst: it wasn't just the name
17:59karolherbst: imirkin_: INTERP_MODE_NONE vs INTERP_MODE_NOPERSPECTIVE aka TGSI_INTERPOLATE_LINEAR
18:00imirkin_: NOPERSPECTIVE == LINEAR
18:00imirkin_: MODE_NONE means "no interp specified, aka use perspective, unless it's gl_Color in which case use what's in the shade model"
18:00karolherbst: yeah I know
18:01karolherbst: but tgsi gave me noperspective, nir gives me none
18:01imirkin_: i.e. TGSI_INTERPOLATE_COLOR :)
18:01imirkin_: mmm ... that's surprising. i guess there's a higher default then? dunno.
18:01karolherbst: yeah, maybe dunno
18:01imirkin_: NONE should be quite distinct from noperspective
18:01karolherbst: anyway, that confused me and I tried to find my mistake
18:01karolherbst: the shader ends up doing a conditional discard
18:02karolherbst: ohh mhh
18:02karolherbst: right, the shader is also different
18:02karolherbst: maybe some tgsi opt?
18:03karolherbst: because with the nir version I end up doing linterp pass f32 $r0 a[0x7c] + rcp + pinterp mul f32 $r0 a[0x70] $r0
18:03karolherbst: the tgsi just does linterp pass f32 $r0 a[0x70]
18:03imirkin_: well, one's perspective and the other is flat.
18:03karolherbst: and then for both "set ftz u8 $p0 lt f32 $r0 10.000000" + $p0 discard
18:04imirkin_: for ... what's at 0x70... gl_FragCoord? that sounds wrong...
18:04imirkin_: 0x7c is the w used for barycentric interp
18:05imirkin_: 0x78 would logically be gl_FragDepth
18:05imirkin_: and i think 0x70 and 0x74 are gl_FragCoord.xy which would be flat-interpolated iirc
18:05imirkin_: (since they're kinda special)
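
For readers following the linterp + rcp + pinterp sequence quoted above, this is a sketch of the textbook perspective-correct interpolation it resembles, written in Python along a single edge between two vertices. It is the standard technique, not a description of what the hardware interpolation unit does internally; the function names and the one-dimensional setup are assumptions made for the example.

```python
def lerp(a, b, t):
    """Plain linear interpolation between a and b."""
    return a + (b - a) * t

def interp_noperspective(a0, a1, t):
    """Screen-space linear interpolation (noperspective / LINEAR)."""
    return lerp(a0, a1, t)

def interp_perspective(a0, w0, a1, w1, t):
    """Perspective-correct interpolation of an attribute.

    Interpolate 1/w linearly, take its reciprocal (the 'rcp' step),
    then multiply by the linearly interpolated attribute/w.
    """
    inv_w = lerp(1.0 / w0, 1.0 / w1, t)
    attr_over_w = lerp(a0 / w0, a1 / w1, t)
    return attr_over_w * (1.0 / inv_w)

# Halfway along an edge whose far vertex is three times as deep.
linear = interp_noperspective(0.0, 1.0, 0.5)
persp = interp_perspective(0.0, 1.0, 1.0, 3.0, 0.5)
```

When both vertices have the same w the two functions agree, which is one reason a wrong interpolation mode can go unnoticed in simple tests.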
18:05karolherbst: 7c is POSITION:3, right?
18:06imirkin_: sounds right
18:06imirkin_: i don't remember these offhand
18:06imirkin_: it obviously varies between nv50 and nvc0
18:06karolherbst: well, nir has this: "decl_var shader_in INTERP_MODE_NONE vec4 gl_FragCoord (VARYING_SLOT_POS, 0, 0)"
18:06karolherbst: which should be the 0x70 part
18:07imirkin_: so i think INTERP_MODE_NONE == flat
18:08karolherbst: TGSI_INTERPOLATE_COLOR or TGSI_INTERPOLATE_PERSPECTIVE
18:08imirkin_: but with VARYING_SLOT_POS you should have special handling
18:08imirkin_: those only make sense for generic varyings
18:31karolherbst: wondering why gl_FragDepth doesn't work, then I found this: "TGSI uses TGSI_SEMANTIC_POSITION.z for the depth output, while NIR uses a single float FRAG_RESULT_DEPTH."
18:31imirkin_: yeah. good times.
18:32karolherbst: I think I handle this in the nir pass and don't add ugly code in the driver
18:32imirkin_: not sure why the driver would care.
18:33karolherbst: well, the driver uses TGSI stuff here
18:33karolherbst: for the slot assigning
18:34karolherbst: or kind of depends on TGSI semantics
18:35imirkin_: yeah, so you have to work with the given interfaces.
18:36karolherbst: which makes it kind of ugly, because I depend on TGSI stuff in the nir pass as well
18:36karolherbst: but well, should be fine for now
18:48imirkin_: yeah, i was never a fan of that leakage
18:48imirkin_: but it seemed a LOT easier to do it that way
18:48imirkin_: instead of defining yet-another enum <-> enum mapping
18:48imirkin_: for zero gain
19:05pmoreau: Wow, an email from NVIDIA about new documentation, about Pascal and Volta? (Just the EVO stuff it looks like, but still, I wasn’t expecting that.)
19:05pmoreau: Guess I’ll have to merge those in envytools
19:20karolherbst: ... now I found that TGSI_SEMANTIC_CLIPDIST magic thing
19:29pmoreau: Hum, they changed the class names and versions logic with Volta: the class name before was DISP0ABX, new name is NVD_20, as for class number it jumped from 9870 to C370. No big deal, but interesting nonetheless.
21:29karolherbst: interesting, I usually get much more signed expression in nir than in TGSI
21:30imirkin_: for a lot of ops it doesn't matter
21:32karolherbst: yeah, just wondering
21:33karolherbst: well I would have expected that TGSI and NIR kind of care about the signedness the same amount, but maybe NIR is just more explicit/stricter here
21:33imirkin_: i'm sure they're both correct.
21:33imirkin_: but in some cases, the signedness doesn't matter, and they may have gone with different defaults
21:33imirkin_: e.g. integer compare ... UCMP vs ieq
21:33karolherbst: I see
21:34imirkin_: same diff. but nir uses "ieq" which signifies signed, while tgsi has UCMP which signifies unsigned
21:34imirkin_: in practice ... doesn't matter.
21:34imirkin_: or iadd vs uadd
21:34imirkin_: same thing
21:34imirkin_: tgsi tends to use U*, and only I* when it really matters
21:34imirkin_: i think nir tends to just use i* always, and u* when it really matters
21:34karolherbst: something like this
21:34karolherbst: but I am actually looking at what I made out of it
21:35imirkin_: not one that's right or wrong. just what it is.
21:35karolherbst: I see
21:35imirkin_: [and due to the history of TGSI, the no-prefix stuff is float]
21:35imirkin_: i.e. CMP = float, UCMP = int
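
The claim that signedness does not matter for ops like integer equality and addition follows from two's complement arithmetic. The sketch below models 32-bit registers as Python integers; the helper names (`uadd`, `iadd`, `ult`, `ilt`, and so on) are made up for this example and only loosely mirror the opcode names mentioned in the chat.

```python
MASK = 0xFFFFFFFF  # 32-bit register width

def to_signed(bits):
    """Reinterpret a 32-bit pattern as a two's complement signed value."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def uadd(a, b):
    """Unsigned add, wrapped to 32 bits."""
    return (a + b) & MASK

def iadd(a, b):
    """Signed add, wrapped to 32 bits and returned as a bit pattern."""
    return (to_signed(a) + to_signed(b)) & MASK

def ueq(a, b):
    return a == b

def ieq(a, b):
    return to_signed(a) == to_signed(b)

def ult(a, b):
    return a < b

def ilt(a, b):
    return to_signed(a) < to_signed(b)

a, b = 0xFFFFFFFF, 0x00000001  # a is 4294967295 unsigned, -1 signed
```

Addition and equality give the same bit pattern or answer under either interpretation, whereas an ordered comparison such as less-than is a case where the prefix really matters.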
21:35karolherbst: how important is that rz rounding mode in cvt?
21:36karolherbst: I hope my test passes after fixing it
21:36imirkin_: it's the distinction between trunc(), round()
21:36imirkin_: i mean - it totally doesn't matter if you don't need correct results
21:36karolherbst: mhh, no, I was more talking about the F2I case
21:37karolherbst: I guess doing rz here is kind of required?
21:37karolherbst: but looking at the TGSI... where did it come from
21:38karolherbst: well, at the tgsi to nvir pass
21:39karolherbst: okay, found it
21:43karolherbst: that RZ fixed it indeed
21:46karolherbst: imirkin_: by the way, I was able to implement the CFG stuff without having to track any state except nir_block -> BasicBlock mappings
21:51imirkin_: yeah, that sounds right
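
A block mapping of the kind mentioned above could look roughly like this. It is a hypothetical Python sketch of a get-or-create map, not the actual converter code: `BasicBlock`, `BlockMap`, and `convert_edges` are illustrative names, and NIR blocks are represented by plain integer ids.

```python
class BasicBlock:
    """Stand-in for a codegen basic block (illustrative only)."""
    def __init__(self, index):
        self.index = index
        self.successors = []

class BlockMap:
    """Get-or-create mapping from a source block id to a BasicBlock."""
    def __init__(self):
        self._blocks = {}

    def get(self, nir_block_id):
        # Create the target block on first use, so a forward branch can
        # refer to a block that has not been converted yet.
        if nir_block_id not in self._blocks:
            self._blocks[nir_block_id] = BasicBlock(nir_block_id)
        return self._blocks[nir_block_id]

def convert_edges(edges):
    """Build successor links from (source_id, destination_id) pairs."""
    block_map = BlockMap()
    for src, dst in edges:
        block_map.get(src).successors.append(block_map.get(dst))
    return block_map

# A diamond: block 0 branches to 1 or 2, both rejoin at 3.
block_map = convert_edges([(0, 1), (0, 2), (1, 3), (2, 3)])
```

Because lookups always return the same object for the same id, both arms of the diamond end up pointing at one shared join block without any extra bookkeeping.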
21:52imirkin_: in case it's not obvious...
21:52imirkin_: RZ = round zero
21:52imirkin_: RM = round minus
21:52imirkin_: RN = round nearest
21:52imirkin_: RI = round infinity
21:52imirkin_: (actually RM = round minus infinity)
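
As a rough illustration of what these modes mean for a float to integer conversion, here is a Python sketch. The mode names are the ones used in the chat; mapping RN to round-half-to-even and RI to rounding toward plus infinity are assumptions for this example, and `cvt_f2i` is a made-up helper, not the hardware CVT instruction.

```python
import math

def cvt_f2i(value, mode):
    """Convert a float to an int under the given rounding mode."""
    if mode == "RZ":   # round toward zero, i.e. truncate
        return math.trunc(value)
    if mode == "RM":   # round toward minus infinity, i.e. floor
        return math.floor(value)
    if mode == "RN":   # round to nearest, ties to even (assumed)
        return round(value)
    if mode == "RI":   # round toward plus infinity, i.e. ceil (assumed)
        return math.ceil(value)
    raise ValueError(f"unknown rounding mode: {mode}")

results = {mode: cvt_f2i(-2.5, mode) for mode in ("RZ", "RM", "RN", "RI")}
```

For negative inputs RZ and RM disagree, which is why a GLSL int() cast, which truncates, needs RZ rather than a floor-style mode.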
21:52karolherbst: there are comments in the nv_ir.h header ;)
21:52karolherbst: also the I variants
21:53imirkin_: actually the non-integer variants are very confusing :)
21:53imirkin_: i.e. wtf are you rounding
21:53imirkin_: but ... it eventually makes sense if you really think hard
21:53imirkin_: i suggest avoiding that
21:53karolherbst: related to numbers being not representable?
21:54karolherbst: or what does the rounding do for floating points?
21:54imirkin_: well, like e.g. fma
21:54imirkin_: has a rounding situation it has to worry about for float
21:54imirkin_: or even mul, you end up with a result that's between 2 floats that are representable
21:54imirkin_: do you round to one or the other
21:59karolherbst: makes sense
21:59imirkin_: (or even an add
22:00imirkin_: but it's not like glsl has that specified
22:00imirkin_: or cares
22:00imirkin_: but the hw has to do *something*
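
The "result between 2 representable floats" situation can be reproduced with a small Python sketch. It relies on Python floats being doubles, which hold the product of two float32 values exactly, and on `struct` packing to float32 with round-to-nearest; the specific operand is chosen only to keep the arithmetic easy to follow.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# a is exactly representable as a float32: 1 + 2^-23.
a = 1.0 + 2.0 ** -23

# The exact product 1 + 2^-22 + 2^-46 fits in a double but not in a float32.
exact = a * a

# The two float32 neighbours that bracket the exact product.
lower = 1.0 + 2.0 ** -22
upper = lower + 2.0 ** -23

# Round to nearest picks the closer neighbour, here the lower one.
rounded = to_f32(exact)
```

A mode that rounds toward plus infinity would have to pick `upper` instead, so the same multiply can legitimately produce two different float32 results depending on the rounding mode.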
22:00karolherbst: mhh, there are three things missing for the glsl-1.10 tests: 1. c15 stuff, where I am not really sure what that's all about (more interpolation stuff?) 2. texturing 3. arrays
22:00karolherbst: mhh, right
22:00imirkin_: c15 is the driver constbuf
22:01karolherbst: yeah, I am aware
22:01imirkin_: where various useful information is stored
22:01karolherbst: but there is a reason why it is used I mean
22:01imirkin_: yeah... among other things, the bottom bits of it are for texture handles
22:01karolherbst: it's the case where it is accessed at the end to "adjust" fp outputs with a MUL
22:01imirkin_: oh yeah
22:01imirkin_: ignore that
22:01karolherbst: and so on
22:01imirkin_: that's like a corner case of a corner case hidden inside a corner case that's a corner case itself
22:02karolherbst: what needs this?
22:02karolherbst: yeah... sounds important
22:02imirkin_: ok, so THAT is something else
22:02imirkin_: this is for vp
22:02imirkin_: you need to take UCP's into account
22:02imirkin_: and you need to use the gl_ClipVertex instead of the gl_Position for clipping
22:02imirkin_: so for each UCP
22:03imirkin_: you need to do gl_ClipVertex dot UCP[i] and store that into clip distance[i]
22:03karolherbst: I see
22:03imirkin_: but for fp
22:03karolherbst: so that is what clip distance is about
22:03imirkin_: the way that gl_SampleMaskIn is computed
22:03imirkin_: is ... annoying in certain annoying cases
22:04imirkin_: (i think)
22:04imirkin_: and the UCP is stored in c15
22:04imirkin_: (UCP = user clip plane)
22:04karolherbst: well, I already get the output configured in the slots and everything, so I really just need to do those additional muls at the end of the shader and write into clip distance?
22:04imirkin_: i.e. what's passed in via glClipPlane
22:04imirkin_: well ... dot
22:04imirkin_: i.e. mul + add
22:04imirkin_: per-component mul and then sum the results
22:04karolherbst: ohh right
22:04karolherbst: yeah, I see it now
22:05imirkin_: aka mul + mad + mad + mad
22:05karolherbst: having a working pass makes this all a lot easier :)
22:05imirkin_: although perhaps mul + mul + mad + mad + add would be faster? dunno
22:05karolherbst: do we care?
22:06imirkin_: definitely not.
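
The per-plane computation described above (gl_ClipVertex dot UCP[i], written as mul + mad + mad + mad) can be sketched in Python like this. The function names and the example planes are made up for illustration; in the driver the planes would come from the driver constbuf, which is not modelled here.

```python
def dot4_mul_mad(v, plane):
    """4-component dot product as one mul followed by three mads."""
    acc = v[0] * plane[0]           # mul
    acc = v[1] * plane[1] + acc     # mad
    acc = v[2] * plane[2] + acc     # mad
    acc = v[3] * plane[3] + acc     # mad
    return acc

def clip_distances(clip_vertex, planes):
    """One clip distance per user clip plane."""
    return [dot4_mul_mad(clip_vertex, plane) for plane in planes]

clip_vertex = (1.0, 2.0, 3.0, 1.0)
planes = [
    (0.0, 1.0, 0.0, -1.0),  # keeps y >= 1
    (1.0, 0.0, 0.0, -2.0),  # keeps x >= 2
]
distances = clip_distances(clip_vertex, planes)
```

A negative distance means the vertex is on the clipped side of that plane, so in this example the vertex passes the first plane and fails the second.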
22:06karolherbst: oh wow, that nir input I get gives codegen a real challenge
22:06imirkin_: coz of all the immediates?
22:06karolherbst: even c0 isnt moved into a mul/mad
22:07imirkin_: you must be doing something wrong
22:07imirkin_: can you pastebin the *input* into codegen
22:07imirkin_: i.e. before it does anything
22:07karolherbst: you know the issue already
22:07imirkin_: mov vs load?
22:07karolherbst: the other issue
22:07imirkin_: all the immediates are at the top?
22:08karolherbst: it is in the nir
22:08karolherbst: I am sure I could run some nir opts to clean that shit up
22:08imirkin_: nouveau should be able to resolve that
22:08imirkin_: i'm surprised that it doesn't
22:09karolherbst: we don't loop the opts
22:09karolherbst: I think it has to do a lot of things
22:09karolherbst: first resolve the constant
22:09karolherbst: move the reg out of the load...
22:09karolherbst: and then it doesn't move the buffer access in anymore
22:10karolherbst: by the way: this also looks quite fun: ld u32 %r95 c0[0x0000000000000000+0xc] (0) :)
22:10karolherbst: and then I get this: 2: ld u32 $r8 c0[$r255+0x0] (8)
22:10karolherbst: something like that
22:11karolherbst: just with a 0xc
22:13karolherbst: imirkin_: that "Converter::buildDot" thing is for the clip stuff?
22:14karolherbst: ohh, it is also used somewhere else
22:14karolherbst: for DP*
22:14imirkin_: oh right
22:14karolherbst: well, only there
22:14imirkin_: yeah, TGSI has like 20 different DP* variants
22:14imirkin_: a bunch have gotten removed
22:14karolherbst: they kind of look the same
22:14karolherbst: except for the dimension
22:15karolherbst: ahh, this sounds like the stuff I search "Converter::handleUserClipPlanes"
22:15karolherbst: looks like I could just reuse that method
22:16karolherbst: more or less
22:18karolherbst: imirkin_: the painful part is, in nir we can have multiple exit points of a function. Like nir does the exporting code prior each ret instruction
22:19imirkin_: that works in nvir
22:19karolherbst: I know
22:19karolherbst: you see the issue
22:19imirkin_: that's why there's an "end" block
22:19imirkin_: and if you look at how this is done in the tgsi thing
22:19karolherbst: I know
22:19karolherbst: nir explicitly exports fp outputs
22:20imirkin_: as does tgsi...
22:20karolherbst: so if you have a ret somewhere, you also have those store_outputs there
22:20karolherbst: not really
22:20imirkin_: it has OUT
22:20imirkin_: and you write to OUT
22:20karolherbst: well right
22:20imirkin_: which is an explicit store of those outputs
22:20imirkin_: same deal.
22:20karolherbst: but not in the code, right?
22:20karolherbst: in nir it is integrated in the cfg
22:20imirkin_: well, perhaps if you see the converter doing something dumb
22:20karolherbst: and if you have 4 rets, you get 4 blocks with exports
22:20imirkin_: then that dumb thing may not be so dumb after-all
22:21karolherbst: I mean, I can't have a common exit block without trying to merge all those exit blocks in the nir->nvir pass
22:21karolherbst: and I would rather not try to write such merging code
22:23karolherbst: like this: https://gist.githubusercontent.com/karolherbst/ef131abc8c06ae8517129a4b3f3e3fe3/raw/fd2d538ef3338faa016c4a0de260bbb05557d467/gistfile1.txt
22:24karolherbst: so in the end it means, that I would have to insert that clipping stuff for every return instruction as well
22:24karolherbst: and at the "real" end of the function
22:24karolherbst: or I move that into a common end block
22:24karolherbst: and just jump to that
22:24karolherbst: right, that should work
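
The common end block idea from the last few lines can be sketched as a small rewrite over a toy CFG. This is a hypothetical Python model, not nvir: blocks are lists of instruction strings, `ret` and `bra` are placeholder mnemonics, and `add_common_exit` is a name invented for the example.

```python
def add_common_exit(blocks, epilogue, exit_name="exit"):
    """Route every return through one shared exit block.

    blocks: dict mapping block name to a list of instruction strings,
    where a returning block ends with "ret".
    epilogue: instructions to emit once in the shared exit block
    (for example the clip distance writes).
    """
    if exit_name in blocks:
        raise ValueError(f"block name {exit_name!r} is already in use")
    rewritten = {}
    for name, insns in blocks.items():
        if insns and insns[-1] == "ret":
            # Replace the return with a jump to the shared exit block.
            rewritten[name] = insns[:-1] + [f"bra {exit_name}"]
        else:
            rewritten[name] = list(insns)
    rewritten[exit_name] = list(epilogue) + ["ret"]
    return rewritten

blocks = {
    "bb0": ["store_output", "ret"],
    "bb1": ["store_output", "ret"],
    "bb2": ["bra bb1"],
}
result = add_common_exit(blocks, ["write_clip_distances"])
```

The per-return output stores stay where they are; only the code that must run exactly once before leaving the function moves into the shared block, so no merging of the original exit blocks is needed.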