IRC Logs of #radeon on irc.freenode.net for 2023-10-04

01:42 soreau: mareko: http://ix.io/4HI5
01:43 soreau: though it seems instead of row_size, it needs to compute packers in convert_tiling_flags_to_modifier()?
01:50 soreau: or set packers to 0
05:22 mareko: no, packers are only relevant on gfx10.3 and gfx11.x
05:23 soreau: mareko: what else might the patch need?
05:24 mareko: add_gfx6_modifiers needs to list the modifiers supported by each chip, almost every chip is different
05:25 mareko: and the display code should use the modifier instead of tiling flags
05:27 soreau: mareko: where can I find the list of modifiers supported by each chip?
05:27 mareko: and somebody needs to analyze the differences between gfx6, gfx7, and gfx8 and find out if there are any differences
05:28 mareko: soreau: essentially nowhere, they could be extracted from addrlib by writing a test that iterates through all chips and records the tiling modes that addrlib selects
05:30 soreau: mareko: I noticed there wasn't a gfx6/7/8 dir in addrlib..
05:30 soreau: where could I put such a test?
05:30 soreau: and how would I run it
05:46 soreau: <mareko> and the display code should use the modifier instead of tiling flags -> http://ix.io/4HI5
07:25 mareko: it sets the modifier, but I don't see DC using the modifier
16:30 hakzsam: agd5f: running VKCTS with deqp-runner on RDNA3 seems to explode in AMDGPU https://pastebin.com/raw/sH1zV4kJ This is with amd-staging-drm-next (393101d6ad53182c69b1947eeb87b0ccf7b5e9f8) but also reproduced with stock 6.5.5 from archlinux. RDAN2 seems to be fine. Is this a known issue?
16:39 mceier: hakzsam: this looks similar to https://gitlab.freedesktop.org/mesa/mesa/-/issues/9915 mareko (fixes 9915) and leo (fixes 9916) have some patches
16:40 hakzsam: I don't think so
17:05 Venemo: with regards to the DISCARDABLE flag, can the discard happen within the same submission?
17:06 Venemo: in other words, does it happen between IBs from the same submission, or between different submissions?
17:11 agd5f: hakzsam, not seen that
17:21 soreau: mareko: DC using the modifier: http://ix.io/4HI5 and at the bottom of convert_tiling_flags_to_modifier() it already does `afb->base.modifier = modifier;`
17:22 soreau: but so far as listing the modifiers supported by each chip in add_gfx6_modifiers(), not sure how to extract this information from addrlib
17:24 mareko: Venemo: buffer contents can be discarded only when a buffer is completely idle
17:24 mareko: Venemo: it might be discarded between IBs of the same submission
17:25 Venemo: isn't that a contradiction?
17:25 mareko: well
17:25 Venemo: how is a buffer completely idle when it has further IBs to execute?
17:26 mareko: I think the kernel currently treats the whole submission as a single job
17:26 mareko: so it can't happen, but I don't know what the kernel does
17:27 Venemo: hmm okay
17:27 Venemo: the reason I'm asking, is because I wonder if I can afford to mark the buffer used for SDMA workarounds as discardable
17:28 mareko: the side effect is DISCARDABLE is that it guarantees VRAM placement, the intended uses are for the attribute ring, tessellation rings, and scratch
17:29 mareko: it's not that useful if you don't require VRAM placement
17:29 Venemo: well
17:30 Venemo: some image copies require two steps. first copying into a temporary buffer, then to the actual target (this is what they call a "scanline copy"). I think this would be incredibly slow if the temporary buffer is not in VRAM
17:30 mareko: slower, yes, but slow? SDMA isn't that fast
17:31 Venemo: I would prefer to avoid a roundtrip to system ram
17:32 Venemo: however, I'm uncertain if I can always ensure that all of the smaller copies will fit into the same IB, hence the above question
17:32 bnieuwenhuizen: TBH I'd expect most SDMA use to be system->vram or vram->system, so it wouldn't be the worst of things
17:34 Venemo: I think it would be a pretty bad roundtrip
17:35 mareko: depending on the GPU size, caches reuse, etc. a compute shader is probably 10x faster for VRAM->VRAM copies than SDMA
17:38 Venemo: CS is up to 4x faster in the benchmark results that I've seen
17:38 Venemo: but this is kind of irrelevant to the question
17:40 Venemo: who could answer exactly how discardable works? should I ask Christian?
17:44 mareko: yes
17:44 mareko: well
17:45 mareko: if you have preemption, an IB can be suspended in the middle and all memory can be evicted/discarded
17:45 Venemo: why is that not a problem for the attribute ring?
17:46 mareko: because the same draw writes and reads the data, and preemption waits for idle
17:46 Venemo: I see
17:46 mareko: even if it didn't wait for idle, it wouldn't suspend before the draw is finished
17:47 mareko: because it has to save pipeline stats, etc.
17:47 Venemo: got it
17:47 Venemo: so discardable is only useful when the buffer is used within a single command?
17:48 mareko: you could say that was the intended use
17:48 Venemo: makes sense
17:49 Venemo: thank you
17:49 mareko: the reality is we added it for the attribute ring because it has a massive bandwidth requirement
17:50 Venemo: I remember that, but I didn't take the time back then to fully understand it
17:54 Venemo: mareko: can you clarify what kind of metadata the GFX10+ SDMA supports? according to PAL it supports DCC and DS metadata (which I assume is the same as HTILE)
17:54 Venemo: I am trying to figure out what surface flags to set.
17:55 Venemo: I assume that on GFX9 and older, I need RADEON_SURF_DISABLE_DCC and RADEON_SURF_NO_HTILE
17:55 Venemo: not sure about FMASK
17:57 Venemo: in fact I couldn't find any info on what FMASK actually is
17:57 mareko: I think it can only do DCC and not the others.
17:58 mareko: SDMA doesn't support compression and uses GL2 for loads and stores just like shaders.
17:59 Venemo: pretty sure it at least supports HTILE: https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx10/gfx10DmaCmdBuffer.cpp#L1408
18:00 mareko: Venemo: if it supports HTILE, it only supports loads just like shaders
18:01 Venemo: the above code at least doesn't indicate that
18:02 mareko: SDMA uses GL2 for compression, and GL2 can't do compressed Z/S stores
18:03 Venemo: what is GL2?
18:03 mareko: graphics L2 cache
18:04 Venemo: I see
18:05 Venemo: PAL doesn't seem to bother to validate that only image reads may have HTILE and writes don't
18:05 Venemo: maybe it disables it elsewhere
18:06 mareko: the only thing that enabled DCC in SDMA was that SDMA was moved from uncached to cached by GL2, which has the codecs
18:06 Venemo: I'm not arguing with you, just trying to understand how PAL handles these things
18:06 mareko: I guess it doesn't use SDMA to copy Z/S
18:07 Venemo: it seems to use it, at least it sets up the copy packets with the DS metadata VA
18:10 Venemo: I wish there was an easy way to test it
18:12 mareko: FMASK is documented next to lower_image_to_fragment_mask_load
18:13 Venemo: does SDMA support it? I guess not?
18:13 mareko: no
18:14 mareko: you could copy it as 16bpp or 32bpp in theory
18:15 mareko: but you would need to set up CMASK somehow as well, which isn't possible AFAIK
18:15 Venemo: I am content to just disable it
18:16 Venemo: I will also disable HTILE for now until I know for sure what the PAL authors wanted with it
18:20 agd5f: mareko, Venemo I just check the SDMA spec: SDMA supports depth/stencil decompression/compression. GL2 doesn’t support depth/stencil compression, but supports writing to compressed depth/stencil surface. In depth/stencil “compression” case, GL2 will read from destination surface, decompress it, merge with new destination data, and write to the destination surface in uncompressed format.
18:21 Venemo: agd5f: thank you. I assume this is only for GFX10 and newer?
18:21 agd5f: Venemo, yes
18:21 Venemo: awesome, thanks
18:23 Venemo: so essentially the result will be uncompressed, but will still work?
18:24 agd5f: yeah, I think so
18:24 Venemo: cool, thank you for looking into it
18:24 agd5f: also, Depth/Stencil decompression only supports SW_64KB_Z_X. Stencil support 8bpp, and Depth support 16bpp/32bpp
18:25 Venemo: can I rely on addrlib selecting the correct swizzle mode?
18:26 Venemo: or do I need to somehow force the surface to use a supported mode?
18:26 agd5f: looks like GFX11 also supports SW_256KB_Z_X
18:27 agd5f: not sure off hand
18:27 Venemo: okay
18:27 Venemo: maybe best to keep HTILE disabled for now, and we can play with it later if everything goes well
18:53 Venemo: agd5f, since you already have the spec in front of you, can you please clarify this comment? https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx10/gfx10DmaCmdBuffer.cpp#L56 - it seems that this was last updated 4 years ago and they didn't bother to update it for GFX10.3 or GFX11
18:54 Venemo: "Some copies (which?) do require manual SW barriers which we don't do currently." -> would be nice to know
18:56 Venemo: AFAIU they settled for adding the barriers always
19:30 agd5f: Venemo, checking
19:55 agd5f: according to the spec, for gfx8 and older you need to insert a nop between ops which have dependencies. gfx9 and newer should automatically handle dependencies. RAW dependency check is supported for Linear Copy (non-Broadcast), Linear Sub-Window Copy, Tiled Copy (non-Broadcast/non-Frame2Field), Tiled Sub-Window Copy and Tiled to Tiled Sub-Window Copy. All other ops should insert a nop if there are dependencies.
19:55 agd5f: Venemo, ^
19:56 Venemo: the comment from PAL indicates that there is a bug with this RAW dependency check for certain image types
19:56 Venemo: it just doesn't say which
19:57 Venemo: is there any info about that?
19:58 agd5f: not in the spec, but there may be something in the errata. Might need to ask the SDMA hardware team.
19:58 Venemo: I'd appreciate that
19:58 Venemo: but then, we can also just do the same thing as PAL, and simply always insert a NOP
19:58 agd5f: probably won't get a response until next week since it's golden week in China
19:59 Venemo: it's not urgent
19:59 agd5f: ok
19:59 Venemo: PAL has lived with this workaround for 3 years, we should be fine doing the same too
21:04 Venemo: one more question, what is CMASK, and is it relevant to the SDMA at all?
21:30 agd5f: Venemo, metadata for the CB compression. Maybe pre-dates DCC? I don't remember off hand
21:31 Venemo: it seems to go hand in hand with FMASK but I can't find a good explanation for it
21:33 agd5f: yeah, they are related, but I haven't paged them into my head in a while so I can't recall the details
21:33 bnieuwenhuizen: Venemo: CMASK is to FMASK as DCC is to the main color surface, kinda
21:34 bnieuwenhuizen: + CMASK without FMASK is also a possibility on pre-DCC HW (gfx6/7 etc.)
21:34 Venemo: bnieuwenhuizen: sorry but can you elaborate on that?
21:34 bnieuwenhuizen: not exactly sure what is in there, but it can be used for e.g. fast clears
21:34 bnieuwenhuizen: Venemo: just see it as "generic" compression metadata for the fmask
21:35 Venemo: on GFX9 we use cmask without fmask
21:35 bnieuwenhuizen: yeah
21:35 Venemo: PAL seems to only do that on GFX6-8, so that is already confusing
21:36 bnieuwenhuizen: actually on all HW we can use it for fast clears when DCC isn't used
21:36 bnieuwenhuizen: though with MSAA there is still a case for the full gamut of main surface + DCC + FMASK + CMASK (at least until gfx11 which removed fmask)
21:37 Venemo: so I my question is, should my transfer queue interact with the cmask at all? or can I just ignore it?
21:37 bnieuwenhuizen: interact probably, since it has fast clear metadata
21:37 agd5f: right. IIRC, it's a metadata buffer which stores the "clear" value for a target buffer. If the value is set the clear value, the clear value is read from a register rather than going out to memory. so you "clear" the buffer, but nothing happens until you actually use it for something
21:37 Venemo: ah, interesting
21:37 bnieuwenhuizen: well, I think it can do more compression wrt fmask
21:38 bnieuwenhuizen: but yeah
21:38 Venemo: I don't think SDMA can use it, so that sounds like a problem when transitioning to and from the transfer queue
21:38 bnieuwenhuizen: I think we only use CMASK in COLOR_ATTACHMENT_OPTIMAL, because you generally can't sample CMASK as the clear colors are in a CB register and not in the CMASK
21:39 bnieuwenhuizen: so it might be very cheapo to just say that any image layout involving the transfer queue just disables/decompresses CMASK
21:39 bnieuwenhuizen: and that'd probably do just fine
21:40 Venemo: that is what radv_fast_clear_flush_image_inplace does, right?
21:40 bnieuwenhuizen: yeah
21:41 Venemo: sounds like I got that for free then, it is already executed when transitioning to a layout that doesn't support fast clears
21:42 Venemo: same goes for DCC decompression
21:42 Venemo: so looks like all I really need to be able to do is write the metadata on the transfer queue, which seems easy enough
21:43 bnieuwenhuizen: DCC is still enabled even when not supporting fast clears in that layout though
21:44 bnieuwenhuizen: see radv_layout_dcc_compressed
21:44 Venemo: I already added the consideration for the SDMA's capabilities to radv_layout_dcc_compressed
21:44 bnieuwenhuizen: ok
21:45 Venemo: this means that: on GFX10+ there is nothing to do since SDMA supports DCC, and on GFX9 the source queue will do the decompression for me
21:53 Venemo: bnieuwenhuizen: what should I do about the emulated ETC2?
21:53 bnieuwenhuizen: add more copies?
21:54 bnieuwenhuizen: I mean in the end we don't have to interpret the stuff in the compressed blocks no
21:54 bnieuwenhuizen: so we can just copy the "native" surface as is and then add copies for the extra decompressed planes
21:54 Venemo: I'm not sure how it works currently, looks like there is a compute shader that decodes it after each copy, but I don't see how the copy itself could work
21:55 bnieuwenhuizen: oh right, bleh
21:55 bnieuwenhuizen: the copy itself is just a copy of the equivalent rgba32_32_32_32
21:56 bnieuwenhuizen: the decode shader is annyoing I guess ....
21:56 Venemo: then what is there to decode?
21:57 bnieuwenhuizen: actually I'm thinking with format reinterpretation this could get more messy
21:57 bnieuwenhuizen: and with buffer->image you definitely need the decode
21:57 bnieuwenhuizen: with ETC2->ETC2 image copies you could probably skip the decode if you also just copy the extra planes
21:57 Venemo: I don't follow
21:58 Venemo: you just said that it copies an uncompressed rgba32_32_32_32
21:58 Venemo: so I don't even see what is there do decode
21:58 Venemo: or did I misunderstand what you said?
21:58 bnieuwenhuizen: you can see an ETC2 texture as RGBA32_32_32_32 using format reinterpretation
21:58 bnieuwenhuizen: but is still has the ETC2 data of course
21:59 bnieuwenhuizen: so if we want to sample as ETC2 ...
22:00 Venemo: so how does it work at all?
22:00 bnieuwenhuizen: we have plane 0 is the ETC2 data and then we decompress it into something we can actually sample from in plane 1
22:01 bnieuwenhuizen: and then on every change of the ETC2 data we decompress again
22:01 bnieuwenhuizen: it just so happens the only way to modify ETC2 images is by copies
22:02 Venemo: sounds like we are not going to be able to support that from the transfer queue then
22:03 bnieuwenhuizen: yeh, looks messy
22:04 Venemo: it would require a gang submit to execute the CS that can decode the thing. which is doable but too much effort for this initial work that I'm doing now
22:04 Venemo: same issue with sparse resources on GFX6-8