00:02 karolherbst: RSpliet: well, but we have the TRAP
00:02 karolherbst: I could try to ask nvidia about that though
00:02 karolherbst: but...
00:03 karolherbst: RSpliet: also, I'd guess ShaderLocalMemoryCrsSize just takes from the tls area?
00:05 karolherbst: imirkin: I think we have a few ideas about madsp ;) but I think we are also not entirely sure that what nvdisasm reports is actually correct
00:07 HdkR: TRAP harder
00:10 karolherbst: HdkR: well, I would like to know how to reliably install a Trap handler ;)
00:11 karolherbst: I have some ideas, but.. somehow I wasn't really able to do much inside the handler
00:11 karolherbst: like pausing execution or something
00:12 karolherbst: but I guess it would be enough to just dump the state into some memory and just read it out later
00:19 imirkin: karolherbst: i just need to get some mmt's to see what's up :)
00:20 karolherbst: imirkin: did you have any success on your kepler2?
00:21 imirkin: i got 3d to work with tiling =]
00:21 imirkin: but i'm still getting faults in some cases, and i want to look harder
00:23 karolherbst: ahh
00:23 karolherbst: and what kind of magic are you using for it?
00:32 HdkR: ooo, tiling
00:37 imirkin: karolherbst: nothing really. it mostly Just Works (tm)
00:37 karolherbst: ohh, interesting
00:38 imirkin: the key bit was
00:38 imirkin: info[7] = mt->layout_3d ? 1 : 0;
00:38 imirkin: but ... 1 is clearly wrong
00:38 imirkin: but i need help in understanding what it should be.
00:38 karolherbst: ohhh, I see
00:38 karolherbst: yeah, I never tried to change that value
00:38 imirkin: i also played some games with getting a bound 2d layer of a 3d texture to work
00:38 imirkin: but that somehow fails
00:38 imirkin: even though the shader is basically the same
00:43 RSpliet: karolherbst: yeah the trap is no coincidence of course. I've not had a chance to re-think everything (lack of time) sadly
00:44 RSpliet: I haven't heard of the "TLS" area. What's that for?
00:45 RSpliet: My comment on allocating memory was generic. I don't know much about the mechanisms behind VRAM backed stack space... but it has to be clever to map multiple WARPS/SMs into one buffer.
00:45 karolherbst: RSpliet: l[] memory
00:46 karolherbst: also used for spilling
00:46 karolherbst: we just pre allocate tons of it based on the amount of maximum threads being able to run and stuff
00:46 karolherbst: so I guess we could just allocate 1kB more per thread or so
00:52 RSpliet: I suspect you might want to cook up some experiment where you use PTX (or in-line PTX) to trace the blob running a CUDA/OpenCL program with an increasing number of stack elements (idk, issue 1 to 100 prebrks in powers of two or something), see how the blob handles ShaderLocalMemoryCrsSize... and with a bit of luck it spills some details on where it's allocated too.
00:55 karolherbst: I think I just set that value and see if it changes anything :p
00:55 RSpliet: That is, if our oracle named mwk doesn't already know :-P
00:55 karolherbst: yeah, probably
00:56 karolherbst: but I guess it's just part of the tls area and probably it's taking stuff from the end
00:56 karolherbst: the tls area is magically split up anyway
00:56 karolherbst: per thread
01:10 RSpliet: Ok, but the control stack is per-warp
01:10 RSpliet: I don't rule out even more magic...
01:11 RSpliet: But 32 bits of mask, some bits of "entry type", and a few bits for the per-warp PC. I think envytools says each stack entry is 64-bits, which sounds plausible.
01:12 karolherbst: RSpliet: I think the issue with that shader is simply that too many threads diverged or something
01:12 karolherbst: I mean, it's kind of odd that doing the non taken branch predicated actually helps
01:13 RSpliet: karolherbst: It shouldn't matter... I didn't see a way in which there are more than I think 5 stack changes.
01:13 RSpliet: pardon, 5 stack entries
01:13 karolherbst: right, but the one shader works (predicated branch), the other causes the traps (predicated branch)
01:13 karolherbst: uhm
01:13 karolherbst: ...
01:14 karolherbst: the other causing traps has a predicated bra instruction
01:14 karolherbst: I just want to understand why that matters that much
01:14 RSpliet: First thing to rule out is that we aren't limited to 4 stack entries per warp
01:15 karolherbst: I am super sure that we aren't
01:15 karolherbst: because... 4 isn't much
01:16 karolherbst: would have caused a lot of other issues already
01:16 RSpliet: Yep, but 8*4*8*#SMs can quickly become several megabytes.
01:16 RSpliet: (8 bytes per entry, 8 warps/SM in-flight)
01:17 RSpliet: I assume the limit is beyond that indeed, but doesn't hurt being systematic about ruling out issues :-)
01:17 karolherbst: RSpliet: okay, but even then, why does the rather trivial change in the generated binary help?
01:18 RSpliet: karolherbst: because it removes the need for a 5th entry on the stack
01:18 karolherbst: what's so evil about a $p0 bra
01:18 RSpliet: $p0 bra pushes
01:19 karolherbst: and a $p0 break is essentially for free even if the threads diverge?
01:19 karolherbst: although it pops the pre break
01:19 RSpliet: Yep, $p0 break just manipulates the break mask.
01:19 karolherbst: but only for a few threads
01:19 RSpliet: Not quite
01:19 karolherbst: mhhh, but yeah
01:19 karolherbst: maybe 4 entries and that's it
01:20 RSpliet: $p0 break manipulates the break mask (imagine the active thread mask to be the result of ANDing four masks)
01:20 RSpliet: whenever the active thread mask is all-zero, it pops an entry off the stack. The type of the entry determines which of the four masks is updated with the mask on the stack
01:20 karolherbst: but then again, I am wondering why we didn't have issues with it before :/
01:21 karolherbst: or we just got lucky?
01:21 RSpliet: so a $p0 break could lead to a "pop", but that's not necessarily the prebrk mask. Could also be the mask from a $p0 bra nested inside a loop that can be broken out of
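A toy model of the mechanism RSpliet describes may help here: the active thread mask is the AND of four masks, and whenever it becomes all-zero an entry is popped and its type decides which mask gets restored. This is only a sketch under stated assumptions. The class name `Warp`, the four mask names, and the `push`/`disable` helpers are hypothetical, and the real hardware layout is not known from this discussion.

```python
FULL = 0xFFFFFFFF  # one bit per thread of a 32-thread warp


class Warp:
    """Hypothetical model of the per-warp mask stack (not real hardware)."""

    def __init__(self):
        # Four masks; the names are made up for illustration.
        self.masks = {"branch": FULL, "break": FULL, "cont": FULL, "ret": FULL}
        self.stack = []  # entries are (kind, saved_mask, resume_pc)
        self.pc = 0

    def active(self):
        # The active thread mask is the AND of all four masks.
        result = FULL
        for mask in self.masks.values():
            result &= mask
        return result

    def push(self, kind, saved_mask, resume_pc):
        # e.g. a prebrk, ssy or divergent "$p0 bra" pushing an entry
        self.stack.append((kind, saved_mask, resume_pc))

    def disable(self, kind, threads):
        # e.g. "$p0 break" clearing the breaking threads from the break mask
        self.masks[kind] &= ~threads & FULL
        # Pop only once no thread is active any more; the entry type
        # selects which of the four masks is restored from the stack.
        while self.active() == 0 and self.stack:
            entry_kind, saved_mask, resume_pc = self.stack.pop()
            self.masks[entry_kind] = saved_mask
            self.pc = resume_pc


warp = Warp()
warp.push("break", FULL, 0x3B8)      # prebrk 0x3b8
warp.disable("break", 0x0000FFFF)    # half the warp breaks: no pop yet
depth_after_partial_break = len(warp.stack)
warp.disable("break", 0xFFFF0000)    # the rest breaks: entry is popped
```

In this model a partial break costs no stack entry, while a divergent branch would need a `push`, which matches the point that `$p0 bra` pushes and `$p0 break` only edits the break mask.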
01:22 karolherbst: mhh, let me check
01:22 karolherbst: RSpliet: well, we had the $p0 from the previous loop iteration
01:22 karolherbst: "$p0 bra"
01:22 RSpliet: karolherbst: sadly, although I get the mechanism behind branching, I don't know much about the HW details. I'd say a 4 entry stack is incredibly small... but they could vary the hw stack size per-generation
01:23 karolherbst: RSpliet: how does it work if you loop over the same BBs?
01:23 karolherbst: like you have that BB:20 -> BB:15 ... BB:20 -> BB:15 loop
01:23 karolherbst: BB:17 has that "$p0 break BB:16" to break the loop
01:24 imirkin: ok, now i'm officially mad
01:24 karolherbst: BB:18 has that "not $p0 bra BB:20"
01:24 imirkin: IMADSP.U16H0.U16H0.U16H0 R3, R2, c[0x7][0x4bc], R1;
01:24 karolherbst: BB:20 does "bra BB:15"
01:24 imirkin: one might imagine that having high bits set in the const value shouldn't matter
01:24 RSpliet: karolherbst: do you still have the link to gist? I find it easier to reason about the final asm
01:24 imirkin: and yet ... and'ing it with 0xffff first helps
01:24 karolherbst: RSpliet: https://gist.github.com/karolherbst/389a7cfd703a3419a641324e0dc266f5
01:24 imirkin: so ... wtf!
01:24 karolherbst: imirkin: I am sure nvdisasm is not 100% correct here
01:24 imirkin: that doesn't make me feel better.
01:25 karolherbst: there is something fishy about the sizes stuff
01:25 karolherbst: and the U16 might actually be a U24 or something
01:25 imirkin: sigh.
01:25 karolherbst: pmoreau and I looked into it in the past and there was _something_ wrong
01:25 RSpliet: karolherbst: sorry, I mean after branch targets were resolved, like its been through envydis ;-)
01:26 karolherbst: imirkin: I have a mesa commit somewhere
01:26 karolherbst: maybe it explains what you see
01:26 karolherbst: (hopefully)
01:26 imirkin: well, i don't mind sticking a bitmask in
01:27 imirkin: what i mind is thinking it does one thing when it does another
01:27 RSpliet: karolherbst: got it. https://gist.githubusercontent.com/karolherbst/d24a9a5ac387cd2bdd221efce407c143/raw/8ce582d34d2dd50dbb8b7773eb380b19842cd055/zsd.envy
01:29 imirkin: Passed: 159/159 (100.0%)
01:29 imirkin: for dEQP-GLES31.functional.image_load_store.3d.*
01:29 karolherbst: imirkin: https://github.com/karolherbst/mesa/commit/3fca678c2e905bdbc8ea9b2d2332912792778414
01:29 imirkin: checking i didn't break anything
01:30 imirkin: and i still have no clue why mt->layout_3d ? 1 : 0 works
01:30 karolherbst: imirkin: what mask did that IMADSP.U16H0.U16H0.U16H0 R3, R2, c[0x7][0x4bc], R1; have?
01:30 imirkin: the MADSP must be doing some bitshift
01:30 imirkin: SUBOP_MADSP(4,2,8)
01:30 imirkin: Passed: 747/747 (100.0%)
01:31 imirkin: for dEQP-GLES31.functional.image_load_store.*
01:31 karolherbst: imirkin: mhh, yeah, that should be okay then
01:32 imirkin: what does it actually do though
01:32 karolherbst: imirkin: I kind of liked the rework I've done in that patch though
01:32 imirkin: is it like (x * y) << 16 + z ?
01:33 karolherbst: well, only one way to find out
01:33 RSpliet: Ok, so superficially. Push at 0x148. Push at 0x158. Push at 0x198. 0x1d0 is a potential pop (break out of loop, no push). Push at 0x1e8. 0x1f8 is another potential pop (break out of loop). 0x290 is a push, 0x348 triggers a pop. 0x388 and 0x3d0 are pops. 0x398 could be a pop...
01:34 karolherbst: RSpliet: well, what I've noticed is, the more loops the shader does
01:34 karolherbst: the more likely it traps
01:34 karolherbst: if the loop count is overall small, it never traps
01:35 karolherbst: like around 14*14 inner loops was okayish
01:35 RSpliet: Ok, that's bad. Can you somehow manipulate it to be 1*128 or 1*256?
01:35 karolherbst: it's a 2n^2 loop
01:36 karolherbst: but yeah
01:36 karolherbst: I can play around with it a little tomorrow
01:36 karolherbst: RSpliet: that shader: https://gist.github.com/karolherbst/d24a9a5ac387cd2bdd221efce407c143
01:36 karolherbst: "int steps = int(ceil(_thickness * float(_samples) * c_ss));"
01:36 karolherbst: when I forced steps to be 7 all was okay
01:36 RSpliet: Perhaps it's worth pairing up pops and pushes. Another option is that in the corner cases it tries to pop more than it pushes...
01:36 karolherbst: if steps was something like 14 or so, a few invocations trapped
01:36 karolherbst: like 10 or something
01:37 karolherbst: with super high values it was like 25% of all invocations
01:37 karolherbst: super high meaning >100
01:38 RSpliet: Like, ssy 3d8 followed by the $not p0 bra 0x368 implies that the sync at 0x388 is meant to pop the first ssy 3d8 element off the stack.
01:38 karolherbst: RSpliet: so it's kind of likely the total number of iterations is kind of relevant here, or maybe it's just a coincidence
01:40 karolherbst: imirkin: IMADSP: "Integer Extract Multiply Add"
01:40 karolherbst: whatever that "extract" means? dunno
01:40 karolherbst: ohh "Extracted Integer Multiply And Add."
01:40 karolherbst: maybe a better wording
01:41 RSpliet: The next instruction is a prebrk 0x3b8... which implies that when everyone leaves the outer loop, the code at 3b8 is executed... ending with a sync. That'll either pop the BRA 0x368 off the stack, or if it doesn't exist the ssy 3d8. So that seems alright as well.
01:44 karolherbst: imirkin: well, we have to assume that the madsp instruction is faster than doing that stuff manually, right?
01:47 imirkin: i just don't know what it does =]
01:47 imirkin: i think there's a shift involved
01:47 karolherbst: yeah, but we can figure that out
01:48 imirkin: gonna double-check cts and then send it out
01:48 imirkin: will need to be fixed differently on different gens
01:49 RSpliet: karolherbst: There's a different kind of weird going on with that shader
01:49 karolherbst: RSpliet: well yeah, and I want to find out what :p
01:49 RSpliet: You see, the outer and inner loop both have their own prebrk.
01:50 RSpliet: But when the inner loop is broken out of (0x348), that thread must also break out of the outer loop. It's a "return" in the shader.
01:50 RSpliet: It isn't immediately obvious that the check at 0x390 doesn't throw threads back into active mode in the next iteration of the outer loop.
01:52 RSpliet: Unless that isetp at 0x390 will test precisely which threads didn't meet the condition at 0x288
01:52 karolherbst: imirkin: I don't even get nvidia to generate an IMAD :/
01:52 karolherbst: ahh, now
01:53 karolherbst: okay, now that's a good starting point
01:53 karolherbst: imirkin: any idea what that "extracted" could mean?
01:56 RSpliet: what does that isetp at 0x390 even mean?
01:57 karolherbst: RSpliet: what do you mean?
01:57 RSpliet: karolherbst: 00000390: 0137ff07 5b6a0380 C isetp ne u32 and $p0 0x1 0x0 $r19 0x1
01:58 karolherbst: $p0 = $r19 != 0x0 I think
01:58 RSpliet: it sets the bits in $p0 according to a test that does a non-equal and an AND... how do I interpret those operands
01:58 karolherbst: or 0x1
01:58 karolherbst: not 100% sure about the source order
01:58 karolherbst: RSpliet: AND with the input predicate
01:58 karolherbst: 0x1 = true predicate apparently
01:59 karolherbst: so true && $r19 != 0x0
01:59 karolherbst: yeah.. anything else wouldnt make sense
01:59 karolherbst: the first 0x1 is the output true predicate
02:01 RSpliet: Ah ok, and because $r19 is either 0 or ffffffff, that condition at 0x390 does indeed capture all threads that executed the code at 0x298 onwards.
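Spelled out per thread, karolherbst's reading of the `isetp` at 0x390 is `$p0 = true && ($r19 != 0x0)`. The sketch below assumes that reading, including the source order he was not fully sure about; the function names are hypothetical.

```python
def isetp_ne_and(src0, src1, pred_in=True):
    """Assumed semantics: compare as u32 for not-equal, then AND with the
    input predicate (0x1 in the disassembly, read as 'always true')."""
    return pred_in and ((src0 & 0xFFFFFFFF) != (src1 & 0xFFFFFFFF))


def predicate_mask(r19_per_thread):
    """Build the $p0 bitmask for a warp, one bit per thread (lane)."""
    mask = 0
    for lane, r19 in enumerate(r19_per_thread):
        if isetp_ne_and(r19, 0x0):
            mask |= 1 << lane
    return mask


# $r19 is either 0 or 0xffffffff, so $p0 marks exactly the threads
# that set it, i.e. those that ran the code from 0x298 onwards.
example_mask = predicate_mask([0, 0xFFFFFFFF, 0xFFFFFFFF, 0])
```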
02:02 RSpliet: This is some wild code we're looking at... I'm amazed at the compiler.
02:02 karolherbst: RSpliet: well, the glsl doesn't look all that wild to me
02:03 karolherbst: well, nested loops are always stupid
02:03 karolherbst: especially with 2n^2 complexity
02:04 RSpliet: Yeah, which is why I'm stunned at what the compiler produces. It's more clever than what I'd try by hand
02:04 karolherbst: that's usually the case with compilers ;)
02:04 karolherbst: and still, people always are surprised :p
02:05 RSpliet: Oh but this is not your average control flow mechanism
02:06 imirkin: holy crap. vk-gl-cts takes > 2.5GB to build
02:07 airlied:made the mistake of doing llvm debug symbols, ran out of space on all my filesystems :-P
02:07 imirkin: i think this might be a build with debug symbols too
02:07 imirkin: iirc i was debugging something at one point
02:07 RSpliet: Anyway, I've given it a second look... and despite discovering more on the control flow, I still don't spot a code path that could lead to unbound stack growth...
02:07 imirkin: and who knows how to operate cmake...
02:08 imirkin: karolherbst: with the more recent fixes, a CTS submission is likely going to be possible with kepler
02:08 imirkin: also fixing the issue on maxwell should be straightforward
02:08 imirkin: just need to add the bound "z" value, and treat all 2d accesses as if they were 3d
02:09 karolherbst: airlied: :)
02:10 imirkin:hates cmake
02:10 airlied: cmake is bad, and then llvm uses cmake in even more inventive ways
02:10 imirkin: how do i convince it not to cd to the original dir :(
02:10 karolherbst: imirkin: you mean replace the tex.2d with tex.3d inside the shaders?
02:10 karolherbst: or what do you mean?
02:11 imirkin: karolherbst: yes
02:11 karolherbst: mhhh
02:11 HdkR: I don't hate cmake, but it could be better :P
02:11 imirkin: and feed it a z coord
02:11 karolherbst: imirkin: I thought that causes issues somewhere else
02:11 imirkin: should be fine
02:11 karolherbst: I actually tried it out at some point
02:11 karolherbst: and it wasn't
02:11 karolherbst: or maybe I messed up
02:12 RSpliet: HdkR: the day I discovered ccmake was the day I started liking CMake more than make and autogen scripts
02:12 karolherbst: imirkin: I kind of get the feeling there is some bigger downside of doing a tex.3d inside the shader... but... well
02:13 karolherbst: maybe it's fine?
02:13 karolherbst: dunno
02:14 HdkR: RSpliet: =O I didn't realize there was a curses GUI. I thought there was only the QT GUI. That's pretty nice
02:14 karolherbst: HdkR: we are currently puzzled about the madsp instruction and, what's even more puzzling, that nobody else ever wondered about that :/
02:14 karolherbst: I mean, what it really does
02:14 karolherbst: I don't even get nvidia to emit it
02:15 HdkR: It's just a really mad stack pointer
02:15 karolherbst: yeah.... no
02:15 RSpliet: Further, I think CMake-yellow is a nice enough colour for my bike shed, although I'm sure there's other nice shades out there ;-)
02:15 RSpliet: multiply accumulate single precision?
02:15 karolherbst: RSpliet: extracted imad
02:15 karolherbst: whatever extracted means
02:16 karolherbst: you can do 16x16+16 bit stuff or something
02:16 karolherbst: imirkin thinks there are shifts involved
02:16 karolherbst: maybe there are
02:16 RSpliet: Sooo....SIMD packed?
02:16 karolherbst: dunno
02:16 karolherbst: why would it?
02:16 RSpliet: neural nets
02:16 karolherbst: it's like there since tesla or something
02:17 RSpliet: Ok, no neural nets
02:17 karolherbst: it can also do 24x24+32 stuff
02:17 karolherbst: I think
02:17 HdkR: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mad24.html
02:17 karolherbst: not 100% on all combinations
02:17 karolherbst: HdkR: yeah... no
02:17 karolherbst: that's not gonna be used for mad24
02:19 HdkR: Are you sure?
02:19 RSpliet: karolherbst: so it doesn't fall in the "Video instructions" section of the PTX ISA documentation?
02:19 karolherbst: HdkR: yes
02:20 karolherbst: 100%
02:20 karolherbst: HdkR: "mad24.lo.u32 %r1, %r2, %r3, %r4;" in PTX -> 2xBFE + IMAD
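The `2xBFE + IMAD` lowering karolherbst mentions for `mad24.lo.u32` on SM30 can be sketched as two 24-bit bitfield extracts feeding a 32-bit wrapping multiply-add. This is a minimal model, not the actual SASS semantics of every modifier; the helper names `bfe_u32`, `imad_u32` and `mad24_lo_u32` are made up for illustration.

```python
def bfe_u32(value, offset, length):
    """Unsigned bitfield extract: `length` bits starting at bit `offset`."""
    return (value >> offset) & ((1 << length) - 1)


def imad_u32(a, b, c):
    """32-bit integer multiply-add, keeping only the low 32 bits."""
    return (a * b + c) & 0xFFFFFFFF


def mad24_lo_u32(a, b, c):
    """mad24.lo.u32 modeled as 2x BFE + IMAD: multiply the low 24 bits of
    a and b, add the full 32-bit c, keep the low 32 bits of the result."""
    return imad_u32(bfe_u32(a, 0, 24), bfe_u32(b, 0, 24), c)


# The top byte of each multiplicand is ignored.
small_example = mad24_lo_u32(0xFF000003, 0xFF000005, 7)
```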
02:20 karolherbst: RSpliet: it's not even inside PTX
02:20 HdkR: Alright, if you say so
02:20 RSpliet: Not this thing: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scalar-video-instructions-vmad ?
02:21 karolherbst: certainly not
02:21 karolherbst: that's vmad :p
02:21 karolherbst: not madsp
02:21 karolherbst: we use it with some texture operations
02:21 RSpliet: I don't know how PTX translates to proper assembly of course
02:22 karolherbst: HdkR: the thing is, either this instruction is just not that fast or it only covers some super odd corner case
02:22 HdkR: karolherbst: Wait, it actually uses imad instead of xmad?
02:23 karolherbst: HdkR: SM30
02:23 HdkR: oh
02:23 HdkR: pfft
02:23 karolherbst: ;)
02:23 karolherbst: on SM50 it does 2x BFE + 3x XMAD :p
02:23 HdkR: I mean, I'm not going to answer your speculation statement though :P
02:23 karolherbst: of course you will
02:27 HdkR: lol
02:28 HdkR: karolherbst: Obviously you should just spin a couple million values through it and store the results in to an ssbo to figure out what it does
02:29 karolherbst: of course, but that would be too easy, wouldn't it?
02:29 karolherbst: actually.. I should just do that with CL :D
02:37 karolherbst: HdkR: well, we don't emit that stuff with maxwell yet though :D
02:38 HdkR: Which stuff?
02:38 karolherbst: madsp
02:39 HdkR: You emit it but don't know what it does? :P
02:39 karolherbst: I think that's how it works right now for fermi/kepler
02:40 HdkR: heh
02:47 imirkin: karolherbst: why would it matter? it just does some address calcs
02:47 imirkin: karolherbst: i'm sure it's a tad slower
02:47 imirkin: but the image access/store is even slower than that
02:47 karolherbst: yeah... probably
02:47 imirkin: we could add a TODO to add specializations
02:47 imirkin: and/or binary fixups
02:47 imirkin: i can't be bothered to think about it now
02:48 karolherbst: sooo
02:48 karolherbst: for madsp 0,0,0 the result is 0
02:48 karolherbst: that's at least something
02:48 karolherbst: heh
02:48 imirkin: ;)
02:49 karolherbst: IMADSP.U32.U24.U32 1 1 1
02:49 karolherbst: returns....
02:49 karolherbst: 1
02:49 karolherbst: and..
02:49 karolherbst: IMADSP.U32.U24.U32 2 2 2 returns... 2
02:50 imirkin: Passed: 51/51 (100.0%)
02:50 imirkin: KHR-GL45.shader_image_load_store.*
02:52 imirkin: karolherbst: what about U16L, U16L, U16L with 1 1 1 ?
02:53 imirkin: i think that'll be 0x10001
02:53 karolherbst: let me first get the correct subop value :)
02:53 imirkin: 4,2,8
02:53 karolherbst: yeah.., and what do I have to emit ;)
02:53 karolherbst: there is no emitter code for maxwell/pascal yet
02:54 imirkin: oh yeah, sorry, dunno :)
02:57 imirkin: also there's a funky IMADSP.SD mode which does ... who knows
02:58 karolherbst: "IMADSP.U16H0.U16H0.U16H0"
02:58 imirkin: yes
03:01 karolherbst: something is weird
03:01 karolherbst: returns 1
03:01 imirkin: hmmmmm
03:01 imirkin: can i see the shader?
03:01 karolherbst: let me check what IMAD would return
03:02 karolherbst: yeah... something is funky
03:03 HdkR: funktastic :)
03:04 imirkin: sounds like you're not waiting for the result
03:04 imirkin: and are just taking one of the input args
03:06 karolherbst: ahhh
03:06 karolherbst: no
03:06 karolherbst: output is written into the first source :)
03:06 karolherbst: soooo
03:06 karolherbst: I think madsp does nothing on maxwell
03:06 karolherbst: I remember that somebody came to that conclusion at some point
03:06 imirkin: could be :)
03:06 imirkin: does IMAD work?
03:06 imirkin: er, XMAD
03:10 imirkin: karolherbst: anyways, check out the patch i mailed
03:11 imirkin: i think we'll still have to play the de-tiling game to make things work on fermi
03:11 imirkin: i have a temp patch which makes that possible
03:11 imirkin: but i've stashed it away for now
03:12 karolherbst: ohhh sched opcodes
03:12 karolherbst: meh
03:12 HdkR: lol
03:12 HdkR: oops, scheduling, how sad :P
03:12 imirkin: isn't that what i said?
03:12 karolherbst: ahhh, now it looks better
03:13 imirkin: <imirkin> sounds like you're not waiting for the result
03:13 imirkin: <imirkin> and are just taking one of the input args
03:13 karolherbst: yeah...
03:13 imirkin: :p
03:13 karolherbst: the thought just didn't cross my mind
03:16 karolherbst: imirkin: 1 1 1 = 2
03:16 imirkin: for IMADSP?
03:16 karolherbst: yeah
03:16 karolherbst: ohhh, now I have something weird
03:17 karolherbst: so, nvdisasm prints IMADSP.U16H0.U16H0.U16H0 R2, R2, c[0x0][0xc], R3;
03:17 karolherbst: 0x10001 0x10001 0x10001 = 0x10002
03:17 karolherbst: and....
03:17 karolherbst: I think I know what it does
03:17 karolherbst: :)
03:18 imirkin: my situation, though, i think, is that 1 1 1 -> not 2
03:18 imirkin: coz 2 wouldn't make sense =/
03:18 karolherbst: think about it
03:18 imirkin: U16H0 means "first half"
03:18 karolherbst: it "extracts" the values
03:18 karolherbst: computes the result
03:18 karolherbst: and "inserts" the value back in
03:18 karolherbst: no idea what happens with the high bits though
03:18 imirkin: oh hm
03:18 imirkin: can you try
03:18 karolherbst: but that's what I'm gonna figure out
03:18 imirkin: 0x10001 * 0x20002 + 0x30003 ?
03:19 imirkin: er actually do
03:19 imirkin: 2 3 5
03:19 imirkin: that way you can tell exactly (they're all prime)
03:19 karolherbst: 0x20002 * 0x30003 + 0x50005 = 0x6000b
03:20 imirkin: right.
03:20 imirkin: one of those U16H0's is a lie.
03:20 karolherbst: yeah
03:20 imirkin: how about 0x20003 * 0x50007 + 0x9000b
03:20 karolherbst: 0xf0020
03:21 imirkin: ok, so that's 0x3 * 0x50007 + 0xb ?
03:21 karolherbst: yeah
03:22 karolherbst: but we already assumed that nvdisasm doesn't know the correct subops :p
03:22 karolherbst: HdkR: let it get fixed
03:22 karolherbst: :p
03:22 HdkR: lol
03:23 karolherbst: HdkR: we can also not report bugs :p
03:25 HdkR: This assumes you're right that nvdisasm is getting the subops incorrect ;)
03:26 imirkin: HdkR: well, in a battle between nvdisasm and the hardware ... i think the hardware wins.
03:26 HdkR: Definitely
03:26 imirkin: there's more of it, and it's harder to change :)
03:27 karolherbst: at some point I knew the correct subops
03:28 karolherbst: imirkin: src1, right?
03:28 imirkin: ?
03:28 karolherbst: src1 was u24 instead of u16
03:28 imirkin: or 32
03:28 karolherbst: so I guess we should emit u16.u24.u16 instead :)
03:29 karolherbst: imirkin: src1 doesn't know u32
03:29 imirkin: ideally i want u16 though
03:29 imirkin: i had to throw a &0xffff in
03:29 imirkin: this doesn't make sense though
03:29 imirkin: i'm doing z * 1 + y
03:29 imirkin: wtf good is that
03:29 imirkin: how can that possibly work
03:30 imirkin: could be that SUCLAMP is mangling the value somehow?
03:30 imirkin: bld.mkOp3(OP_MADSP, TYPE_U32, off, src[2], v, src[1])
03:30 imirkin: ->subOp = NV50_IR_SUBOP_MADSP(4,2,8); // u16l u16l u16l
03:30 imirkin: v = 1 in this case
03:31 karolherbst: mhhh
03:31 karolherbst: now I've got 589856
03:31 karolherbst: uhm
03:31 karolherbst: 0x90020
03:32 imirkin: for what inputs?
03:32 karolherbst: same
03:32 karolherbst: nvdisasm reports IMADSP.U16H0.U24.U32
03:32 karolherbst: which... doesn't make much sense
03:32 karolherbst: but okay
03:32 karolherbst: because that looks like u16.u16.u24/u32 to me
03:32 imirkin: looks like U24 and U16H0 are flipped
03:33 karolherbst: yeah
03:33 karolherbst: I think I remember that was my conclusion back then
03:33 imirkin: let's try it
03:33 HdkR: :)
03:34 karolherbst: 0x20 :)
03:35 karolherbst: IMADSP.U16H0.U24.U16H0
03:35 karolherbst: so yeah...
03:35 karolherbst: 0x4a94018000370202
03:35 imirkin: yep, flipping 2 -> 4 fixes it
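The three measurements above are consistent with a plain "mask each source, multiply, add" model in which nvdisasm prints the U24 and U16H0 suffixes swapped. The sketch below is a hypothetical model fitted only to those data points, not a description of the hardware: `madsp` and `MASKS` are made-up names, the inputs cannot tell a 24-bit source from a 32-bit one because every test value fits in 24 bits, and signed or high-half modes are left out.

```python
MASKS = {"u16": 0xFFFF, "u24": 0xFFFFFF, "u32": 0xFFFFFFFF}


def madsp(a, b, c, widths):
    """Hypothetical MADSP model: extract each source with its own width,
    then compute a * b + c truncated to 32 bits."""
    width_a, width_b, width_c = widths
    product = (a & MASKS[width_a]) * (b & MASKS[width_b])
    return (product + (c & MASKS[width_c])) & 0xFFFFFFFF


a, b, c = 0x20003, 0x50007, 0x9000B

# Printed by nvdisasm as U16H0.U16H0.U16H0, behaves like u16.u24.u16
first = madsp(a, b, c, ("u16", "u24", "u16"))    # 0x3 * 0x50007 + 0xb
# Printed as U16H0.U24.U32, behaves like u16.u16.u24/u32
second = madsp(a, b, c, ("u16", "u16", "u32"))   # 0x3 * 0x7 + 0x9000b
# Printed as U16H0.U24.U16H0, behaves like u16.u16.u16
third = madsp(a, b, c, ("u16", "u16", "u16"))    # 0x3 * 0x7 + 0xb

print(hex(first), hex(second), hex(third))
```

The same model also reproduces the earlier `0x20002 * 0x30003 + 0x50005 = 0x6000b` result when the middle source is treated as wider than 16 bits.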
03:35 karolherbst: :)
03:36 karolherbst: anyway, should probably sleep now :D
03:36 imirkin: i _still_ don't see why (y + z) * pitch is an interesting quantity
03:37 imirkin: and yet it somehow makes everything work
03:37 imirkin: (y * height + z) * pitch -- that makes more sense.
03:37 imirkin: er, (z * height + y) * pitch
03:37 imirkin: but no - it wants 1.
03:47 karolherbst: I am mainly wondering why nvidia doesn't think it's a good idea to emit madsp :/
03:48 karolherbst: imirkin: I think I will clean my madsp patch up and then using the subops is way more convenient than it is now
03:48 karolherbst: (and fixed)
03:48 imirkin: it emits MADSP on kepler for image ops
03:48 karolherbst: main issue: are the current uses correct and just written down wrongly
03:48 karolherbst: or are those wrong and we have to fix those?
03:48 imirkin: i think our current ops were from observing the various patterns there
03:48 karolherbst: yeah
03:49 imirkin: well, perhaps it really wants a 24-bit value
03:49 imirkin: i don't know how it's set on the blob :)
03:49 karolherbst: yeah, dunno
03:49 karolherbst: wasn't able to get ptxas to produce those
03:49 HdkR:looks at mad24
03:49 karolherbst: even with funny casts and masks
03:50 karolherbst: imirkin: with masks you are able to get nvidia to optimize XMAD a bit more
03:50 imirkin: suld.b.3d.v4.b32.zero {r1, r2, r3, r4}, [sf, {r32i0, r32i1, r32i2, r32i3}];
03:50 karolherbst: so... at least I thought that's gonna do it for madsp
03:50 karolherbst: maybe it needs more masks
03:50 karolherbst: dunno
03:51 karolherbst: anyway, gonna sleep now. night
03:56 imirkin: i really don't understand why this code works
04:26 imirkin: karolherbst: on top of the patch i mailed to the list: https://github.com/imirkin/mesa/commit/a41268ef34654ce6987b227955b3b8ef95eb4013
04:26 imirkin: i bet that should fix it for maxwell+
04:28 imirkin: if not that, then something similar
04:43 imirkin: doing a full deqp-gles31 run again
04:52 imirkin: skeggsb: ping
04:57 imirkin: skeggsb: want to see if you have any suggestions for debugging why a test (dEQP-GLES31.functional.geometry_shading.query.primitives_generated_instanced) makes an insta-CTXSW_TIMEOUT on my GK208B (NV106)
04:57 imirkin: all the other geometry stuff checks out afaik
05:00 imirkin: this test does what it says on the tin -- looks at the primitives generated query in the presence of an instanced gs, where each instance emits 3 vertices except 1, which emits 6
05:06 imirkin: i'm thinking something that's the equivalent of "gr/gf100: select a stream master to fixup tfb offset queries" here
05:07 imirkin: TFB_UNFUCKUP_INSTANCED_PRIMITIVES somewhere :)
05:07 imirkin: er, more like query i guess
14:27 imirkin: for dEQP-GLES31, with 2 tests removed (one crashes mesa, one kills ctxsw):
14:27 imirkin: Test run totals:
14:27 imirkin: Passed: 36317/37755 (96.2%)
14:27 imirkin: Failed: 0/37755 (0.0%)
15:58 imirkin: 4 fail, 1 warn on cts (and need to recheck the copy_image stuff, which is currently crashing)
15:59 imirkin: KHR-GL45.pipeline_statistics_query_tests_ARB.functional_compute_shader_invocations,Fail
15:59 imirkin: KHR-GL45.tessellation_shader.single.xfb_captures_data_from_correct_stage,Fail
15:59 imirkin: KHR-GL45.direct_state_access.framebuffers_check_status,CompatibilityWarning
15:59 imirkin: KHR-GL45.conditional_render_inverted.functional,Fail
15:59 imirkin: KHR-GL45.limits.max_fragment_input_components,Fail
15:59 imirkin: the xfb one is the only one i'm surprised by
19:43 cosurgi: ouch, one of my xservers died :/
19:43 cosurgi: [1105519.025] nouveau_exa_upload_to_screen:380 - falling back to memcpy ignores tiling
19:44 cosurgi: [1105519.036] (EE) Backtrace:
19:44 cosurgi: [1105519.042] (EE) 0: /usr/lib/xorg/Xorg (xorg_backtrace+0x4a) [0x55604be3666a]
19:44 cosurgi: imirkin: I have a copy of Xorg.1.log from this crash, do you want to see it?
19:45 Lyude: skeggsb: could you review https://patchwork.freedesktop.org/patch/282577/ btw?
19:56 imirkin_: cosurgi: won't hurt
19:57 imirkin_: cosurgi: do you have dmesg as well?
20:01 cosurgi: imirkin_: sure
20:01 cosurgi: imirkin_: where to put it, pastebin?
20:01 imirkin_: ya
20:04 cosurgi: ouch, pastebin gives error "maximum paste size of 512 kilobytes", but this worked: https://paste.ubuntu.com/p/dFqnFmhnTk/
20:05 cosurgi: imirkin_: dmesg is here https://paste.ubuntu.com/p/K5Ws7b2JGD/ , Xorg.1.log is here: https://paste.ubuntu.com/p/dFqnFmhnTk/
20:06 imirkin_: huh.
20:06 imirkin_: no errors in dmesg.
20:06 imirkin_: wtf!
20:06 cosurgi: only this: [1105626.099233] nouveau 0000:04:00.0: gr: intr 00000040
20:06 cosurgi: [1105707.607515] nouveau 0000:04:00.0: gr: intr 00000040
20:06 imirkin_: -2 = ENOENT
20:06 cosurgi: but this does not count?
20:07 cosurgi: lol, about an hour earlier I was fixing segfault in git-cal.bin ( https://gitlab.com/cosurgi/zsh-git-cal-status-cpp ) and it spammed dmesg! :)
20:07 imirkin_: https://cgit.freedesktop.org/nouveau/xf86-video-nouveau/tree/src/nv_accel_common.c#n116
20:07 imirkin_: i think it's this
20:07 imirkin_: which means that it ran out of vram? that seems VERY surprising.
20:07 cosurgi: ok. Do you want me to recompile xserver with some changes in this line?
20:08 cosurgi: vram is the memory in video card?
20:08 imirkin_: yes, video ram
20:08 cosurgi: afaik I have 6GB..
20:08 imirkin_: which is about 6GB more than you need...
20:08 imirkin_: although i guess 3x 4K screens do take up some room
20:09 cosurgi: currently I have 4 or 5 xservers running simultaneously. But of course only one at a time is active.
20:09 imirkin_: but not an appreciable fraction
20:10 imirkin_: each X server will have about 200MB's worth of data for the screen pixmaps alone
20:10 cosurgi: I was viewing a .pdf file in okular. Generally I had some problems with okular taking up all xclient resources. That's why I prefer to use katarakt rather than okular. But I was using okular, and scrolling up a large .pdf file when it crashed.
20:10 imirkin_: (coz you have rotation, it all gets doubled)
20:11 cosurgi: btw, I still do not have mesa compiled. So we are on ,,the safe'' side of OpenGL things, AFAIK.
20:12 imirkin_: the *only* thing i can think of is that something using Present got a bit trigger-happy
20:12 cosurgi: what is Present?
20:12 imirkin_: i'd have to go in and review the reasons why nouveau_bo_new would fail with ENOENT though
20:13 imirkin_: x11 protocol for drawing things
20:13 cosurgi: ok.
20:13 cosurgi: If you want to put more debug messages in there, especially since there's nothing in dmesg, I am happy to recompile xserver ;)
20:13 imirkin_: and i fixed something related to present + rotation
20:13 imirkin_: hold on
20:14 imirkin_: cosurgi: can you try updating to xf86-video-nouveau 1.0.16?
20:14 imirkin_: (recently released)
20:16 cosurgi: it is not yet in debian. Can you give me a link to something similar to xserver-xorg-video-nouveau_1.0.15.orig.tar.gz ?
20:16 cosurgi: I will recompile from it
20:17 cosurgi: oh. wait.
20:17 imirkin_: https://lists.freedesktop.org/archives/nouveau/2019-January/032053.html
20:17 cosurgi: it is in debian sid.
20:17 karolherbst: imirkin_: btw, will probably run a CTS run over the week with all the fixes... but we still need to figure out robust context stuff
20:17 karolherbst: like channel recovery _or_ reporting the dead context :)
20:17 imirkin_: karolherbst: you saw the failures i got, right? that + copy image which crashed.
20:17 karolherbst: don't think so
20:18 cosurgi: imirkin_: I found this in debian sid. Will this one be good? https://packages.debian.org/sid/xserver-xorg-video-nouveau ?
20:18 imirkin_: karolherbst: check logs
20:18 karolherbst: after I got home
20:18 imirkin_: cosurgi: should be fine
20:18 cosurgi: imirkin_: ok. I'm on it!
20:20 imirkin_: cosurgi: the thing i'm specifically looking for is a fix for present + rotated crtc's
20:20 cosurgi: ok.
20:20 imirkin_: it came out for me as DRI3 rarely updating. you don't have DRI3 enabled though, but other things could also use present
20:20 cosurgi: I can apply any patch (with extra debug info, or whatever) if you want.
20:20 imirkin_: easier to just go to 1.0.16
20:21 cosurgi: yeah, I thought about patch 1.0.16 :)
20:21 imirkin_: although improving that ErrorF wouldn't hurt
20:21 cosurgi: in 1.0.16
20:21 cosurgi: :-)
20:21 imirkin_: just the return code is a wee bit spartan
20:32 cosurgi: whoa it did compile pretty fast.
20:33 cosurgi: OK, I will restart this one xserver now. It's the one where I work most these days. So it's the first one to crash. I prefer to not restart the others.
20:43 cosurgi: [1109162.349] (II) Module nouveau: vendor="X.Org Foundation"
20:43 cosurgi: [1109162.349] compiled for 1.19.2, module version = 1.0.16
20:43 cosurgi: imirkin_: is that correct?
20:43 imirkin_: seems right
20:43 cosurgi: good. I did: grep -E "1\.0\.16" Xorg.* -l
20:43 imirkin_: hopefully not an insta-crash :)
20:43 cosurgi: to find the only Xorg.*log with new version. :)
20:43 cosurgi: insta-crash? I don't think so.
20:44 imirkin_: good
20:44 imirkin_: always a concern with a new version
20:44 imirkin_: esp one that changes a lot of the drmmode_display.c logic
20:44 cosurgi: I worked pretty heavily since we talked last week. xserver was fairly loaded. It seemed stable.
20:44 cosurgi: ah
20:44 cosurgi: due to new version?
20:44 cosurgi: well, then we will see soon ;)
20:44 imirkin_: due to wanting to support DP-MST hotplug
20:44 imirkin_: new connectors appearing and disappearing
20:45 imirkin_: not relevant to you, but a bunch of code changes nonetheless
20:47 cosurgi: okay. I will report problems. If in the meantime you want to improve the ErrorF with its wee bit spartan return code or whatever, just let me know ;)
20:48 imirkin_: yeah. can't really write a patch now
20:48 imirkin_: but feel free to do something locally
20:48 imirkin_: basically would be interesting to know the circumstances leading up to the alloc failure
20:48 cosurgi: the thing that I am most happy about is that this crash did not freeze my whole computer.
20:48 imirkin_: =]
20:49 imirkin_: the system works!
20:49 cosurgi: yeah!
20:51 cosurgi: not sure if I can manage something locally ;) I would prefer a patch from you ;p We are doing a release before debian freezes buster to stable.
20:51 imirkin_: k
20:52 cosurgi: I need to fix some bugs in https://gitlab.com/yade-dev/trunk/network/master before that freeze ;)
20:52 imirkin_: good luck
20:52 cosurgi: :)
21:18 cosurgi: imirkin_: oh. I remembered something: I was using compton at the time when it crashed. I needed transparency to see more windows. Normally I don't use compton.
21:18 cosurgi: but compton would consume much more memory, because all invisible windows become pixmaps too, right?
21:21 imirkin_: cosurgi: aaah, compton almost definitely would use present
21:22 imirkin_: present enables some stuff like flipping and so on
21:22 imirkin_: which won't work with rotated crtc's
21:22 imirkin_: the specific way in which this was broken would lead to various weirdness
21:22 imirkin_: i think 1.0.16 should help with that
21:24 cosurgi: ok. Usually compton is annoying. But sometimes I (using sawfish) set a window to be on an upper layer (like 'always on top', except sawfish has multiple layers available), and then I need transparency to see if a window pops up below that window.
21:24 imirkin_: i played around with that a long time ago
21:24 imirkin_: with gaim being omnipresent and 50% transparent
21:25 imirkin_: but aim is gone now
21:25 imirkin_: (and it was never a good enough idea)
21:26 cosurgi: yeah. I see what you mean. Good thing is that with some keyboard shortcuts in sawfish it can become useful: Ctrl-Shift-Pause 'toggle-copton' (kills it or starts it), Meta-mouse-scroller 'change transparency a little', Ctrl-Alt-i 'toggle inverse colors'
21:27 imirkin_: i never went that far
21:27 cosurgi: I was sawfish maintainer about 10 years ago. Some of this stuff stayed with me ;) Then I switched more into physics stuff.
21:31 AndrewR: nice error after I tried to use google maps a bit too much: https://pastebin.com/UQgv8G3C
21:32 imirkin_: the DMA_PUSHER stuff is the nv50-era flakiness =/
21:32 imirkin_: i really want to spend some time fixing it up
21:32 imirkin_: just need to get a good chunk of it + motivation
21:33 imirkin_: i think it's because we don't freeze the pfifo when processing sw interrupts
21:33 imirkin_: (that's not the reason for the WARN though - i think that's something else)
21:34 imirkin_: multiple instances of buffer 88 on validation list -- that usually points to the application doing GL multithreading
21:35 AndrewR: imirkin_, I think I can test patches, if/when you write some (even for debugging). But in general it works well - I use seamonkey + opengl acceleration inside it daily, and it only crashes occasionally ...
21:35 imirkin_: right
21:35 imirkin_: well i have a G84 plugged in now
21:35 imirkin_: just need to actually use it :)
21:35 imirkin_: GK208 is primary for now
21:36 AndrewR: imirkin_, it works via PRIME?
21:36 imirkin_: what does?
21:36 imirkin_: the G84? sure
21:36 imirkin_: but it won't hit those errors
21:37 imirkin_: the errors come from sw channel commands triggering interrupts
21:37 imirkin_: which would only happen if i display things off the chip directly and attempt to e.g. use vsync
21:37 imirkin_: i have a monitor plugged into it too, just not using it at all
21:38 AndrewR: imirkin_, I tested more up-to-date mesa on nv43 recently, it was working as before for me - simple GL apps via wine work, dx ones (3dmark) crash. But thanks for spending your time on those too ....
21:39 imirkin_: AndrewR: i improved something on there...
21:39 imirkin_: i forget what
21:39 imirkin_: oh yeah - 3d textures got fubar'd in some recent changes
21:39 imirkin_: so i fixed that up
21:39 imirkin_: and i also fixed some texture layout problems on nv3x, which wouldn't affect you
21:40 imirkin_: no substantial improvements
21:40 imirkin_: oh, and some cross-context management issues that steam and others were hitting
21:41 imirkin_: now the steam client loads
21:41 imirkin_: and xonotic mostly runs
21:41 imirkin_: although i got corruption with some frequency
21:41 imirkin_: which didn't reproduce on each retrace
21:42 AndrewR: also, sorry to hear your ppc machine died .... (as far as I remember from emails).
21:44 AndrewR: imirkin_, because you are nearly the solo long-standing programmer for nouveau now - I don't think overloading you is a good idea ....
21:45 imirkin_: heh
21:45 imirkin_: well, ben and karol probably spend more time on nouveau than i do
21:45 imirkin_: anyways, yes, the ppc death was sad
21:45 imirkin_: i was going to do some format-related stuff in nouveau kms for big-endian
21:46 imirkin_: but ... that's unlikely now
21:46 AndrewR: imirkin_, there was some effort for passing video cards into qemu-system-ppc, but videocards are tricky in this aspect ....
21:47 imirkin_: just a bit.
21:48 AndrewR: https://mail.coreboot.org/pipermail/openbios/2018-May/thread.html
21:50 imirkin_: oh, well they're trying to get the OF blob to work
21:50 imirkin_: that's even crazier
21:50 AndrewR: :}
21:51 imirkin_: anyways, i still have a NV34 plugged in as well, maybe i'll go back to it at some point
21:53 AndrewR: imirkin_, offtopic, but few days ago I found old article describing ATI card + ISA tv tuner setup ..image was moving via separate bus in this case.. i wondered if anyone saved any documentation on how it was programmed ...
21:54 imirkin_: dunno... my first tuner card was a radeon 7000 vivo
21:54 imirkin_: that was already AGP
21:56 AndrewR: imirkin_, well, and it was only able to record small image (384x288) due to agp traffic from card to sysmem and back? Or it was smarter, at least under windows?
21:56 imirkin_: well, it was an NTSC image, so, not exactly high-quality
21:56 imirkin_: YUV 4:2:0 i'm sure
21:57 airlied: if it was ISA there is no smarter
21:58 AndrewR: airlied, hi! no, in my last sentence I was talking about the AGP card imirkin described. One moment, I will link to the isa article
21:59 AndrewR: https://www.ixbt.com/video2/ati-isa.shtml - in russian but with images :}
21:59 AndrewR: i was thinking about on-card path from tuner to 2d core / memory controller, so it was like hw version of tee.
22:02 AndrewR: it seems video-in's basically disappeared from videocards (at least consumer ones), while pci-e should be more symmetrical than agp was ....
22:17 AndrewR: I think I'll go to bed..have good time, and thanks again.
22:18 HdkR: Video cards found their niche, they don't need to mix with video capture or NICs anymore
22:18 HdkR: :)
22:18 HdkR: https://www.engadget.com/2010/12/01/visiontek-killer-hd-5770-combo-nic-gpu-hikes-frame-rates-lowe/
22:18 HdkR: Such a dumb thing
22:20 airlied: HdkR: lols, needs more gold plating
22:20 airlied:likes the ones with SSDs on them
22:21 airlied: https://www.amd.com/en/products/professional-graphics/radeon-pro-ssg ftw
22:21 HdkR: Nothing like using a huge SSD as a cache
22:23 karolherbst: HdkR: too slow
22:23 HdkR: karolherbst: What's too slow in this instance?
22:27 HdkR: karolherbst: Talking with DL people, one of the slowest bits is loading data in to system ram and/or vram. If they can just have it dumped on to an ssd cache then I could see that lessening their burden
22:28 HdkR: Would probably help more if they weren't using python but I can't control that
22:30 karolherbst: HdkR: I don't think that those ssg GPUs have an actual fast path to the SSD
22:31 karolherbst: probably just DMA?
22:31 HdkR: Dunno. I've never played with them :P
22:33 karolherbst: HdkR: mhh AMD claims lower latencies
22:35 HdkR: karolherbst: Yea, that was the point which is why I assumed it had some sort of direct access
22:36 karolherbst: I guess you get one if you are on the same PCIe bus
22:36 karolherbst: or well, more direct
22:37 karolherbst: pmoreau: also, you wrote that patch: https://github.com/karolherbst/SPIRV-LLVM-Translator/commit/9e0b0bbee5edb6e1629e10c335fffeb42122fa74
22:37 karolherbst: I have no idea what is the best for sys vals
22:37 karolherbst: but constant is not good enough
22:37 karolherbst: uhm...
22:37 karolherbst: wrong channel
23:06 imirkin_: AndrewR: video in would have to be hdmi now, which is a lot more work than ntsc
23:07 imirkin_: and they used to come with tv tuners too, but an ATSC/DVB decoder is way too much to stick onto an already filled up gpu board
23:13 HdkR: imirkin_: Ingesting 8k HDR at 120Hz should really be on a dedicated card these days :P
23:16 imirkin_: HdkR: even if it's 4:2:0? :)
23:16 HdkR: Who needs 420 when you have DSC? :P
23:19 imirkin_: is that a thing on HDMI?
23:19 imirkin_: i thought it was for DP
23:20 HdkR: It's new in HDMI 2.1
23:20 imirkin_: oh ok
23:20 imirkin_: they're converging
23:20 imirkin_: DP-over-HDMI-over-DP-over-USBC-over-HDMI :)
23:22 HdkR: Although supposedly the next generation of DP was pushed off when HDMI 2.1 was announced in order to completely smash it in BW
23:22 HdkR: so eeh?
23:34 karolherbst: RSpliet: btw, I am back home and can do whatever experiment
23:41 imirkin_: karolherbst: if you feel like testing that image thing, that'd be nice
23:41 karolherbst: imirkin_: right.. currently wanted to look into that off chip stack thing
23:41 karolherbst: will check the image stuff tomorrow then
23:41 imirkin_: sure
23:42 karolherbst: heh
23:42 karolherbst: it works
23:42 imirkin_: ?
23:42 karolherbst: like literally
23:42 karolherbst: just works
23:42 karolherbst: I set the off chip cache to 0x800
23:42 karolherbst: that frame causing the traps? not causing traps anymore
23:42 karolherbst: or well.. call
23:42 imirkin_: cool
23:42 HdkR: Such a tiny cache
23:42 karolherbst: HdkR: per warp
23:42 imirkin_: HdkR: i think that's per-thread
23:42 imirkin_: er yeah
23:43 karolherbst: The SPH field ShaderLocalMemoryCrsSize sets the additional (off chip) call/return stack size (CRS_SZ). Units are in Bytes/Warp. Minimum value 0, maximum 1 megabyte. Must be multiples of 512 bytes.
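The constraints quoted above for ShaderLocalMemoryCrsSize (bytes per warp, minimum 0, maximum 1 megabyte, multiples of 512 bytes) can be expressed as a small check. This is a minimal sketch of those stated rules only; the constant and function names are hypothetical and not taken from nouveau or any NVIDIA header, and "1 megabyte" is assumed to mean 1 MiB.

```python
CRS_ALIGN = 0x200      # 512 bytes, the granularity quoted above
CRS_MAX = 1 << 20      # assumed: "1 megabyte" read as 1 MiB


def is_valid_crs_size(size):
    """Return True if size satisfies the quoted limits for CRS_SZ."""
    return 0 <= size <= CRS_MAX and size % CRS_ALIGN == 0


def align_crs_size(size):
    """Round a requested per-warp stack size up to the next 512-byte multiple."""
    if size < 0:
        raise ValueError("CRS size cannot be negative")
    aligned = (size + CRS_ALIGN - 1) & ~(CRS_ALIGN - 1)
    if aligned > CRS_MAX:
        raise ValueError("CRS size exceeds the 1 MiB maximum")
    return aligned


# The values tried in this conversation are valid multiples of 512 bytes.
print(is_valid_crs_size(0x800), is_valid_crs_size(0x200), is_valid_crs_size(0x300))
print(hex(align_crs_size(0x201)))
```

Passing this check only means a value is well-formed; it says nothing about whether the stack is large enough for a given shader.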
23:43 karolherbst: still getting traps
23:43 karolherbst: but...
23:43 HdkR: makes sense
23:43 karolherbst: I think there is no downside to have it
23:43 karolherbst: just more tls space required
23:43 imirkin_: i think the on-chip doesn't get used
23:43 karolherbst: mhhh
23:43 imirkin_: dunno
23:43 imirkin_: should check
23:43 karolherbst: I had worse rendering output with only 0x200
23:43 karolherbst: sadly.. frameretrace doesn't reload libGL :/
23:44 imirkin_: :)
23:44 imirkin_: esp not mid-frame
23:46 karolherbst: yeah... I am sure it will allocate at the end of l[] for the stack
23:46 karolherbst: otherwise it would be weird
23:46 karolherbst: mhhhhhh
23:46 karolherbst: imirkin_: what happens if we go out of bounds in l[]?
23:46 karolherbst: and is it considered out of bounds if we have our off chip stack there?
23:47 karolherbst: would we be able to dump whatever gets written there?
23:47 imirkin_: pretty sure we declare the size of l[]
23:47 karolherbst: sure
23:47 karolherbst: but without the off chip stack
23:47 imirkin_: there's also some weird positive and negative size to it
23:47 imirkin_: no clue what that's about
23:47 karolherbst: mhhh
23:53 karolherbst: at least if I go below 0x200 the shader just traps :)
23:56 karolherbst: and the on chip stack seems to be bigger than 0x200, but smaller than 0x400
23:56 imirkin_: has to be aligned to 0x200 iirc
23:57 karolherbst: imirkin_: and you are right, the on chip stack isn't used when I use the off chip one... otherwise why would every thread trap with 0x200
23:58 karolherbst: mhh
23:58 karolherbst: 0x200 doesn't really work at all