00:32imirkin: JayFoxRox: depends on the format - if you pick YUYV or VYUY, then UVPLANE isn't used
00:32imirkin: JayFoxRox: if you pick NV12 or NV21, then UVPLANE is used
00:33imirkin: JayFoxRox: have a look at how i drive the overlay in nouveau -- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/dispnv04/overlay.c?h=v4.19-rc2
00:34imirkin: JayFoxRox: should work with the drm driver as an overlay plane, not sure what your target platform is
00:35imirkin: (and yes, confirmed that there are two sets of regs for double-buffering... afaik you can't even enable both at once)
00:35imirkin: let me know if you still have questions after reading over the nouveau impl
00:36JayFoxRox: imirkin: writing python code to poke registers on an xbox [our open-source compiler doesn't even have a C stdlib, so running nouveau won't be possible]
00:36JayFoxRox: the existing code is very helpful tho! I will definitely play around with it some more
00:38imirkin: ok cool
00:38imirkin: there are funny stride requirements, so be careful
00:39imirkin: that took a bit to work out
00:39imirkin: also the brightness/saturation stuff had to be fit within the parameters of the kernel - i.e. no floats, but had to compute sin & cos...
00:48imirkin: er, make that hue/saturation
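[editor's note: one way to get sin & cos with no floats, as the kernel constraint above requires, is integer fixed-point. nouveau's overlay.c may compute it differently — this sketch uses Bhaskara I's rational approximation, scaled so sin(90°) == 4096 (12-bit fixed point), purely as an illustration of the technique:]

```c
#include <stdint.h>

/* Integer-only sine, 12-bit fixed point: isin(90) == 4096.
 * Bhaskara I: sin(x°) ~= 4x(180-x) / (40500 - x(180-x)) for x in [0,180];
 * the scale of 4096 keeps the hue/saturation matrix in kernel-safe ints. */
static int32_t isin(int deg)            /* deg in [0, 360) */
{
    int neg = 0;
    if (deg >= 180) { deg -= 180; neg = 1; }   /* odd half-wave symmetry */
    int32_t p = deg * (180 - deg);             /* peaks at 8100 for deg == 90 */
    int32_t s = (int32_t)((4 * (int64_t)p * 4096) / (40500 - p));
    return neg ? -s : s;
}

static int32_t icos(int deg)
{
    return isin((deg + 90) % 360);             /* cos is sin shifted 90° */
}
```

The approximation is accurate to about 0.2% over the full range, which is plenty for a hue rotation matrix.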
03:52rhyskidd: imirkin: that nvbios disp ver 0x22 fix for GP108 also works on Volta, so thanks for that
03:52imirkin: rhyskidd: cool.
03:52imirkin: i didn't check very carefully if anything new existed
03:53imirkin: i just wanted to see the display scripts :)
06:22pmoreau: RSpliet: If you are already over 32 and the coalescing does not bring you over 64, I would expect the driver to do the coalescing.
06:25pmoreau: RSpliet: FWIW, here are a few primitives I used to force the driver to coalesce loads/stores: https://hastebin.com/eguwiwugov.cpp You’ll probably need to change a few things as it’s in CUDA, but it should still be doable in OpenCL.
12:10RSpliet: pmoreau: Thanks man. I ended up performing some manual labour
12:10RSpliet: pmoreau: https://hastebin.com/emukowelis.nginx
12:13RSpliet: Didn't end up with a fantastic shader, uses 46GPRs... but at least I didn't fail at doing what I wanted to :-P
12:16karolherbst: RSpliet: and is the shader performing better or worse or the same?
12:17RSpliet: I'll test that in a bit. First trying to do an apples to apples comparison between my optimised shader and the so-called NVIDIA optimised one :-P
12:18RSpliet: (ergo: doing the same thing with float2)
12:22RSpliet: Inconclusive... the time it takes to execute the shader is too short :')
12:29RSpliet: It is like 40% smaller though in number of insns
12:31RSpliet: ~740->473 insns
12:34imirkin: skeggsb: let me know if there's anything else those hdmi2 patches need for merging
12:37karolherbst: RSpliet: that must be a shader with quite a lot of load/stores
12:38karolherbst: or rather was one
12:44RSpliet: It did 5*(vec length) loads, and 2*(vec length) stores. Additionally, there's a loop that contains four individual loads that could be coalesced into a single vec4 load. For register pressure reasons I went for float2 vectors, so I reduced 18 ldst ops to 8.
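[editor's note: the float2 rewrite RSpliet describes looks roughly like this — a hypothetical before/after sketch in plain C, with a struct standing in for OpenCL's float2; the actual kernel is on the hastebin link above:]

```c
/* stand-in for OpenCL C's float2 */
typedef struct { float x, y; } float2;

/* before: four scalar loads per element group */
float sum_scalar(const float *buf, int i)
{
    return buf[4*i] + buf[4*i+1] + buf[4*i+2] + buf[4*i+3];
}

/* after: two float2 loads -- on the GPU each becomes one 64-bit memory
 * transaction instead of two 32-bit ones (vload2 in OpenCL C).  The cast
 * assumes buf is 8-byte aligned, which is the coalescing constraint
 * mentioned later in this log. */
float sum_vec2(const float *buf, int i)
{
    const float2 *v = (const float2 *)buf;
    float2 a = v[2*i], b = v[2*i + 1];
    return a.x + a.y + b.x + b.y;
}
```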
12:46RSpliet: I... oh. Yeah, they also unrolled a loop manually. Duh, that'll lead to a lot of extra code
12:49karolherbst: extra code won't matter anyway ;)
12:49karolherbst: just that less instructions are executed
12:49RSpliet: Yeah, I can play that game too
12:49RSpliet: with #pragma unroll
12:54RSpliet: With sufficient parallelism you're right btw, but for large GPR, a high instruction count unfortunately has a negative effect on the icache. Sad, because high-GPR programs tend to have more instructions :-P
12:56imirkin: simpler programs run faster than more complex ones...
12:58karolherbst: RSpliet: well, but just unrolling loops doesn't mean you also have to increase the GPR count, it isn't like you need more space to store the state, you basically just copy some instructions around and skip the CFG ones ;)
12:59karolherbst: of course further optimizations could lead to higher GPR counts or lower, but unrolling itself shouldn't have a significant impact here
13:00RSpliet: Oh no blind unrolling shouldn't affect GPR at all. But that's also pointless in most cases, branching overhead is well maskable by other threads.
13:01karolherbst: you still execute more instructions in total ;)
13:03karolherbst: no idea how much nv hw does branch prediction or if they do it at all, but with unrolling you also need less of that (and less headache with the issues around branch prediction in general anyway)
13:03RSpliet: The way kepler is set up that doesn't have to make a difference though.
13:03RSpliet: one warp scheduler dealing with control flow still leaves three dual-issuing warp schedulers to saturate your FPUs if you needed to
13:04karolherbst: okay sure, but kepler is special here anyway.
13:09RSpliet: There's limited bandwidth to the register file, there's the probability of the instruction being an SFU instruction leading you to block. There's plenty of reasons why branching overhead can be negligible on more modern GPUs. Not saying it's never there, but I'd expect the impact of that visible in power consumption more than in perf for most moderately sized loops :-P
13:28RSpliet: Heh, it doesn't seem to want to issue B128 reads from constmem.
13:29RSpliet: nvdisasm implies this 128b read exists. Could it be wrong? :-D
13:29karolherbst: RSpliet: ld vs mov I think
13:29karolherbst: you can mov from 32 but not 128? something like that?
13:30HdkR: Ooo, load coalescing? +1 :)
13:30RSpliet: Nah, it's "LDC" according to nvdisasm
13:31RSpliet: HdkR: it comes with alignment constraints that make it impractical in the general case
13:41imirkin: and requires waiting on a barrier for it to complete on maxwell+
13:42imirkin: RSpliet: and yeah, LDC with 128-bit stopped existing on kepler
13:42imirkin: nvdisasm claims it's a thing, but it's not
14:11RSpliet: imirkin: That confirms my observation, thanks
14:31RSpliet: So much creativity in public benchmark kernels... "col = (ei+1) / d_Nr + 1 - 1;" is an absolute gem from rodinia
14:48karolherbst: RSpliet: well somehow you have to benchmark compilers :p
14:53RSpliet: I guess you can't complain about code quality for benchmarks. At least they present a nice case for floating point atomics.
14:53RSpliet: They just don't know it...
14:54karolherbst: well, benchmarks devs don't do questionable "optimizations" because their goal is to produce crappy code in the first place :p
14:54karolherbst: or at least so I hope
14:54RSpliet: You're absolutely right. And they are absolutely wrong :-D
14:55RSpliet: They're supposed to create code representative of the real world.... with benchmarks like these they are insulting real-world devs
14:55karolherbst: or telling them they should rather produce readable code, because compilers are good enough :p
14:55karolherbst: there is a lot of crappy C code out there just because a dev tried to be "smart"
14:56RSpliet: The kernels I'm looking at are crappy OpenCL C code because a dev tried to be "smart"
14:56RSpliet: So... I guess this benchmark is representative of the real world ;-)
14:56karolherbst: I guess it works both ways
15:00karolherbst: pendingchaos: there is one thing regarding xmad, which I didn't really check. If we have a mul(a & 0xffff, b & 0xffff) where a and b are ints, we could translate that into one XMAD instruction, right?
15:01karolherbst: I am wondering how common something like that is, as this could be a potential optimization you could do inside the shader
15:01karolherbst: (adding & 0xffff to hint the compiler about valid values)
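[editor's note: the shader-side hint karolherbst is proposing is the pattern below — masking both operands to 16 bits so the product provably fits in 32 bits, which *could* let the backend emit a single XMAD (16x16+32) on Maxwell/Pascal instead of the usual multi-instruction 32-bit imul; whether it's worth it is exactly what pendingchaos's survey below addresses:]

```c
#include <stdint.h>

/* Masking to 16 bits guarantees the full product fits in 32 bits,
 * making the multiply a candidate for one XMAD instruction. */
uint32_t mul16(uint32_t a, uint32_t b)
{
    return (a & 0xffff) * (b & 0xffff);
}
```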
15:02RSpliet: or casting to half-float?
15:02karolherbst: in GL?
15:02RSpliet: half int :-P
15:02karolherbst: also, integer
15:02karolherbst: we don't have shorts in GL, do we?
15:03karolherbst: I know vulkan has it, or spirv at least
15:03karolherbst: or so I think
15:03RSpliet: Mmm, apparently not indeed.
15:04karolherbst: HdkR: do you know if games are using it?
15:04HdkR: NV_gpu_shader5 as well
15:04karolherbst: HdkR: is this interesting for dolphin?
15:04karolherbst: could improve integer muls on maxwell/pascal
15:04HdkR: Nah, Dolphin needs 24bit integers
15:05karolherbst: sure, but for everything?
15:05pendingchaos: karolherbst: not promising: https://hastebin.com/kipesizose.txt
15:05HdkR: The main bit that is heavy is the fragment TEV stages which all operate at 24bit
15:05karolherbst: I see
15:05karolherbst: HdkR: I was more thinking about shader inputs though
15:06HdkR: Could be a neat idea in the future to use uint16_t directly on the vertex side to remove some CPU overhead on vertex processing I guess
15:06karolherbst: pendingchaos: yeah.. sad. I guess it isn't really worth the effort
15:06pendingchaos: disappears for a few minutes or so
15:07karolherbst: HdkR: yeah, might be
15:07RSpliet: That 16-bit extension is quite new it seems
15:08RSpliet: August 2017
15:21l2y: Is it okay that when I try to 'X -configure', X exits with error (drm failed to open device, first section of Troubleshooting), but still generates a config? X starts just fine through startx
15:22l2y: And early kms is enabled as per archwiki, and it's not blacklisted of course
15:29l2y: Never mind, with sudo the error doesn't happen, and the exit code 2 is the same
15:35karolherbst: l2y: or don't use a X config at all if you don't configure it yourself anyway
15:35karolherbst: in most cases you don't need one
15:36l2y: Well, I am going to configure it myself. I want to use two separate X Screens. Nvidia blob attempts were mostly unsuccessful, so I want to see if nouveau can do better, karolherbst
15:37karolherbst: ohh I see
15:37l2y: Nouveau has a Zaphod option for it, currently merging configs from blob and free and going to try it out
15:38l2y: With blob one can generate separate X Screens config using a GUI (nvidia-settings)
15:38karolherbst: okay so is it about dual GPU setups, or do you really just want to have two separate things on two displays without being able to move things between them?
15:39pendingchaos: mwk: I think doing scheduling in a preprocessor is working out to be better than doing it in envyas
15:39pendingchaos: it feels simpler overall
15:39l2y: I have one GPU and two monitors, and I want these monitors to have different DPI, and since this is only possible by using a Screen per display...
15:41karolherbst: l2y: mhhhhh, right
15:41karolherbst: l2y: I thought under wayland or that maybe gnome were able to handle something like that? Or maybe it is still some WIP work?
15:42karolherbst: just thinking... I thought there is something supporting this, but I don't know what it was for sure
15:44karolherbst: pendingchaos: will you also add support for pushing envydis output with sched opcodes through it? I am thinking about envydis orig.bin | envysched | envyas and then to diff both envydis outputs (original and what envysched ended up inserting)
15:45pendingchaos: that can be done
15:45pendingchaos: perhaps also a --schedule_all (or maybe just -a) argument to envydis?
15:46pendingchaos: so you don't have to manually add .beginsched/.endsched
15:46karolherbst: maybe we could then get tons of nvidia generated shaders and have it as a test to see if we aren't breaking things
15:46karolherbst: pendingchaos: I ignore details until I see patches :p
15:46karolherbst: or at least some mockup about how the input should look like
15:47l2y: karolherbst: I read that nvidia cards' performance is very poor under Wayland due to refusal of EGL support from open-source projects, or whatever, so I haven't even tried it
15:47karolherbst: but yeah, maybe having a parameter to control that might make sense
15:48karolherbst: l2y: yeah, but I think compositors started to support it? Dunno, but at least with nouveau that might work out
15:48karolherbst: never tried it as I never had this situation
15:48l2y: Okay, thanks, will try
16:45RSpliet: In OpenCL 1.2, is the literal 1.0 defined to be of type double? That seems to be how my version of the blob interprets them... but you'd expect the default to be float...
16:48pmoreau: I would expect the rules to be similar to C/C++, which default to double.
16:49RSpliet: It's undocumented in the 1.2 spec, but that's indeed what I'm observing
16:49RSpliet: Which is another cock-up in Rodinia SRAD costing valuable cycles :')
16:51pmoreau: I had that bite me once, with a PI constant defined without the 'f' suffix... :-/
16:52pmoreau: Since then, I always turn on the compiler warnings about double usage in kernels (I don’t remember if it’s an option to nvcc or to ptxas).
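[editor's note: the literal-type rule under discussion can be checked directly with C11's _Generic — OpenCL C inherits C99's rule that an unsuffixed floating literal is a double, so mixing one into a float expression silently promotes the whole expression:]

```c
#include <string.h>

/* C99 rule (inherited by OpenCL C): 1.0 is a double, 1.0f is a float. */
#define TYPE_NAME(e) _Generic((e), float: "float", double: "double")

/* returns 1 if `x / 1.0` is evaluated in double precision */
static int unsuffixed_literal_promotes(float x)
{
    return strcmp(TYPE_NAME(x / 1.0), "double") == 0;  /* x gets promoted */
}
```

On a GPU without fast double support that promotion is exactly the kind of hidden cost the f-suffix avoids.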
16:52RSpliet: I'm unsure, but there's another kernel whose code size I reduced by 20-25% just by sprinkling f's around
16:53RSpliet: I can't believe how many f's I'm giving!
16:54pmoreau: You’re paying respect too
16:57pmoreau: But wow, 20-25% code reduction is not too shabby!
16:59RSpliet: Well, it's because they're doing divisions with those literals. For some reason NVIDIA imports a whole routine used for lowering that is unnecessary with single precision FP numbers (because really what the parboil bench is doing is a reciprocal)
16:59RSpliet: pardon, Rodinia
16:59RSpliet: Parboil has other issues :-P
17:00pmoreau: From the OpenCL C 2.0 specification: “The OpenCL C programming language (also referred to as OpenCL C) is based on the ISO/IEC 9899:1999 C language specification (a.k.a. C99 specification) with specific extensions and restrictions. Please refer to the ISO/IEC 9899:1999 specification for a detailed description of the language grammar.”
17:01RSpliet: Heh, yeah, well... the dialect of choice is still OpenCL 1.2. NVIDIA doesn't go beyond that I don't think... do they?
17:01pmoreau: (And I think it is the same for earlier versions as well)
17:02pmoreau: It does support nowadays some functions from 2.0, but not that many I think; Phoronix had an article on that about a year ago IIRC, and I think Michael talked a bit about it in his latest OpenCL benchmarking article.
17:03pmoreau: I just checked the OpenCL 1.2 specification (which includes the spec for OpenCL C as well), and it has the same sentence as the 2.0 one)
17:03pmoreau: Even the OpenCL C 1.0 specification had it
17:05RSpliet: That clarifies. Thanks
17:05pmoreau: (Looking at the examples in the spec, >99% of the time they define a float literal, they do put the 'f' at the end.)
17:06RSpliet: That's probably because unlike application developers, the spec writers are competent :-P
17:06RSpliet: Sorry, I'm ranting a bit.... people did put serious effort into this stuff and I should be more grateful O:-)
17:08pmoreau: Ah ah ah :-D Well, you still find mistakes in specs too, like the spec for OpenCL 1.0, 1.1, 1.2, 2.0, 2.2 say something about action "foo" in context "bar", but the OpenCL 2.1 spec says nothing.
17:17RSpliet: This one I'll forgive them: don't open-code a clamp, you'll waste 7 instructions :-P
17:18HdkR: clamp is a scary function to call. I'll just code my own </s>
17:21RSpliet: Reminds me of the time mwk proudly claimed there's an NVIDIA ISA that lets you encode "clamsex porn"
17:21RSpliet: (clamped, sign-extended, predicated or-not?)
17:27mwk: that was clampsex
17:27mwk: it was the motion vector processor for VP2/VP3/VP4
17:28mwk: or not, the instruction is called "clamps" in envydis
17:28mwk: porn is definitely encodable though
17:29mwk: though it's for *writing* to the predicate file, not for predicating an instruction
17:30RSpliet: does the "n" encode "not"? In that case I guess the boolean operation is most often referred to as NOR... but don't want to be the one spoiling the fun :-D
17:31mwk: NOR is ~(a | b)
17:31mwk: ORN is a | ~b
17:32mwk: also known as ORC (or complement) in some ISAs... powerpc IIRC
17:34mwk: porn basically does $pX |= ~(instruction predicate output)
17:57karolherbst: RSpliet, pmoreau: Nvidia actually implements CL 2.0 afaik although they don't advertise it
17:57karolherbst: but the compiler supported the CL 2.0 language for quite a long time already
17:58linkmauve: Some user has a gt710 and GNOME is apparently very sluggish, is this expected or could there be some known issue?
17:58linkmauve: On Debian testing.
17:59karolherbst: linkmauve: uhm, depends on how many effects are enabled and the resolution
17:59karolherbst: the gt710 is quite slow
17:59karolherbst: linkmauve: reclocking is worth a shot
17:59karolherbst: as it should work on those
18:00linkmauve: It’s a fully vanilla Debian apparently.
18:00linkmauve: karolherbst, any easy instructions for that?
18:01karolherbst: linkmauve: "echo 0xf | sudo tee /sys/kernel/debug/dri/0/pstate"
18:02karolherbst: but maybe it isn't as stable on his GPU, but then we could look into why not
18:02linkmauve: 0xf is auto reclocking?
18:03karolherbst: no, just the highest one
18:03karolherbst: most/all keplers come with 0x7 (lowest) and 0xf (highest)
18:03karolherbst: 0xa, 0xd and 0xe are also seen sometimes
18:03karolherbst: the file can be read out for all available perf levels
18:04karolherbst: last line being power supply: clocks...
18:04karolherbst: kind of like a current state line
18:07karolherbst: linkmauve: "nouveau.config=NvClkMode=15" can be set for setting it on boot automatically
18:57pendingchaos: karolherbst, imirkin: about Maxwell ISA: do write barriers always signal after read barriers?
18:57pendingchaos: so: https://pastebin.com/raw/LiS8J7rY
18:58karolherbst: imirkin: currently working on implementing GetGraphicsResetStatusARB and now I am thinking about how to implement it on the kernel side. Not sure if new nvif ioctl or just adding another NOUVEAU_GETPARAM_ variant
18:58karolherbst: but the param stuff doesn't really fit as we can't set the reset status
18:59karolherbst: pendingchaos: I think so, yes
18:59karolherbst: pendingchaos: the other way around would be weird and wouldn't matter anyway
18:59karolherbst: reading regs after writing the result?
19:01karolherbst: pendingchaos: ohh, do you plan to implement reuse as well? currently I don't think we actually use it though
19:01pendingchaos: codegen uses it
19:01pendingchaos: I'm not sure if I'll implement it
19:01karolherbst: but not in our hand written stuff, or do we?
19:01karolherbst: yeah... sounds quite messy to implement it
19:01pendingchaos: we don't in gm107.asm
19:02karolherbst: anyway, something which we might want to keep in mind
19:02pendingchaos: IIRC, it didn't seem to give much benefit when I was experimenting with replacing some imuls with xmads
19:02pendingchaos: but maybe that's just a Pascal thing
19:04karolherbst: well, the benefit isn't that big, but I think it reduces the latency a bit of the instruction
19:04karolherbst: or maybe lower stall count needed?
19:06karolherbst: "// Reuse a register from the second blocking registers" inside maxas
19:07karolherbst: ohh that is quite smart
19:07pendingchaos: I don't think it lowers the stall count needed
19:07karolherbst: no, it caches the content of the register
19:07pendingchaos: after looking at some random nvidia-generated code
19:08karolherbst: if you have a write barrier on insn0, you can wait on that instead of having to setup a new read barrier on insn1
19:08karolherbst: in cases you need both
19:09pendingchaos: MaxAs assumes write barriers always signal after read barriers?
19:09karolherbst: like opcode $r0 $r1 $r2 (write barrier) opcode2 $r2 $r1.reuse $r2 (no barriers) opcode3 $r1 $r0 (wait on barrier from opcode)
19:10karolherbst: so normally you would have to wait on the second instruction to signal the read barrier, but maybe with reuse you don't have to as the content is already fetched
19:11karolherbst: maxas also writes something about bank conflicts: https://github.com/NervanaSystems/maxas/wiki/SGEMM#calculating-c-register-banks-and-reuse
19:12pmoreau: Reducing register bank conflicts: that sounds like a topic for RSpliet! :-)
19:14karolherbst: seems like .reuse is just a hint to cache the register
19:14karolherbst: and the GPU sometimes does it automatically
19:15karolherbst: pmoreau: I seriously don't want to know what those maxas guys are up to, so that it is a valid business concern to optimize the builtin math library by 1%
19:18karolherbst: imagine how much hardware you actually have to have to let people work on something with a questionable outcome in the first place, so that even if that fails the risk would be worth it instead of just buying more GPUs
19:23karolherbst: interesting, those are actually intel guys
19:28pmoreau: Interesting indeed. Were they getting some inspiration for their future discrete GPU? :-D
21:20RSpliet: .reuse is probably for "register bypass" logic. Quite common in in-order CPUs... a bit more complex for superscalar
21:22RSpliet: pmoreau, karolherbst: ^ I'm a little surprised it's implemented (but not too...), as it mainly reduces issue latency. Wouldn't help with sufficient warp-parallelism
21:22RSpliet: Although... well, okay, there's register bank throughput to consider, and I guess power consumption too :-P
21:27karolherbst: RSpliet: I assume it might be able to skip some barriers as well
21:37l2y: There is nothing else I can do to prevent tearing except setting GLXVBlank, right?
21:39l2y: Got the Zaphod working, finally. And setting DisplaySize more or less correctly calculates DPI. Tearing is the only thing that remains
21:42l2y: Also, not sure what is going on, but without setting DisplaySize Xorg more or less correctly reported my monitor size in mm (through xdpyinfo), but when I set these numbers in config, it has reported bigger numbers in mm -- incorrect ones. I expected it to adjust the DPI from 96x96 to 101x101, instead it has adjusted the display size in mm to match 96x96 DPI :D What the...
21:46l2y: I must say the above happens only for one display, while the other one gets the more or less correct DPI of 182x182, which is the expected behaviour
21:47l2y: Should I open a bug somewhere, with all the additional info attached?
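[editor's note: the DPI arithmetic l2y is describing is just pixels × 25.4 / millimetres; the example values below (a 1920-pixel-wide panel at 508 mm and 483 mm) are hypothetical, chosen to reproduce the 96 and ~101 DPI figures mentioned:]

```c
/* DPI as X derives it from the mode width and DisplaySize (mm):
 * 25.4 mm per inch. */
double dpi(int pixels, double mm)
{
    return pixels * 25.4 / mm;
}
```

Setting DisplaySize should make X report this value; the log above suggests it instead back-computed the mm from a forced 96 DPI for one screen.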
22:05RSpliet: karolherbst: what kind of barriers are these?
22:05RSpliet: I haven't looked into post-Kepler ISA much unfortunately...
22:05karolherbst: RSpliet: read/write barriers on register read/writes
22:06karolherbst: RSpliet: like you have to create a read barrier if you read from registers, same for writing if the instruction has a variable execution length
22:06RSpliet: All reg writes? Or just ld/st/tex?
22:06karolherbst: and instructions consuming those registers have to wait on those barriers
22:06RSpliet: Ahh ok!
22:06karolherbst: like imul has to do it
22:07RSpliet: They really stripped down HW scheduling to the bone... wow
22:07karolherbst: so if you have imul $r2 $r0 $r1; iadd $r0 $r3 $r4, imul has to set a read barrier on $r0 and iadd has to wait on it
22:08karolherbst: because the iadd could actually execute faster and overwrite $r0 while imul is reading it ;)
22:08karolherbst: I had to fix a carry bit issue caused by something like that
22:08RSpliet: Makes sense. In case there was any doubt, that also completely rules out them doing register renaming :-P
22:08RSpliet: (but I suspect there was no such doubt :-D)
22:09karolherbst: RSpliet: https://cgit.freedesktop.org/mesa/mesa/commit/src/gallium/drivers/nouveau/codegen/lib?id=e4f675dc42887734b43b549784955e81d284b202
22:09karolherbst: "With significant big work groups"
22:09karolherbst: which was quite fun as smaller work groups didn't trigger that issue
22:10karolherbst: "rd 0x1" read barrier 2 enabled
22:10karolherbst: wt 0x2 wait on read barrier 2 (bitmask)
22:10karolherbst: so you can wait on multiple barriers at once ;)
22:11karolherbst: RSpliet: also, you don't want to track down bugs like those :D
22:11RSpliet: karolherbst: heh, robclark must be able to help you with that ;-)
22:12RSpliet: But that takes some serious codegen work...
22:13RSpliet: You *could* wait for a read two/three instructions back... but equally the distance could guarantee you don't need a barrier. And in the end there's only a handful of them, so how do you handle running out? :-D
22:14karolherbst: but we have quite a few of them actually
22:15karolherbst: 8 or something?
22:16karolherbst: RSpliet: or you can just wait on the oldest one and reuse the barrier slot
22:16RSpliet: Luck... with a capital F :-P
22:17karolherbst: anyway, with good scheduling you don't run into those problems anyway
22:17RSpliet: Yeah, I'm sure there's good strategies, but... a lot of decision logic. And good analysis. No need to enable a barrier if there's no WAR hazard
22:17karolherbst: and nvidia seems to use two nops + full stall to flush any barriers
22:18karolherbst: RSpliet: enabling them is for free basically
22:18karolherbst: minimum stall counts just goes up by 1
22:18karolherbst: from 2 to 3 or something
22:18karolherbst: or +1 generally
22:18karolherbst: but again, only the minimum stall count
22:18karolherbst: if you need to stall more anyway, there is no penalty
22:19RSpliet: Near-free in HW, but if your analysis somehow assumes that barrier cannot be re-used. Which is why solid analysis could help reduce the need for barrier flushes :-)
22:19karolherbst: yeah dunno
22:19karolherbst: the alternative is worse anyway :p
22:20karolherbst: you kind of get a 50% perf penalty if you wait on all barriers, max stall count, set all barriers every instruction
22:20RSpliet: For graphics... nobody'll notice, right? :-D
22:20karolherbst: soo even if you need to reuse some barrier slots from time to time ;)
22:21RSpliet: Oh no it's not that you need to re-use them... but your algorithm needs to know which one it can re-use :-)
22:22RSpliet: *the algorithm. Sorry, not trying to give you even more work :-D
22:22karolherbst: you can always wait on them and enable it
22:22karolherbst: even if you always choose the worst one
22:22karolherbst: doesn't matter
22:23karolherbst: not setting them is dangerous, or not waiting on them ;)
22:26RSpliet: I *guess* if you have guaranteed in-order issue, then you could also just *shift* the barrier set forward, and the barrier wait backwards
22:26RSpliet: Insn 4 depends on Insn 1, Insn 3 depends on Insn 2. If 3 waits on 2, 4 has implicitly waited on 1...