00:47pmoreau: Is there an easy way to run a specified GPU binary on Nouveau? For example, I’d like to test which combinations of an instruction are allowed by the hardware, by providing some input data, submitting some handwritten GPU binary, and retrieving the output data.
00:47pmoreau: Maybe hakzsam ? -^
00:50imirkin_: not easily
00:50imirkin_: which is unfortunate - i've wanted that in the past
00:56pmoreau: What would be the current way to do it? Change the emit function of the function we want to try and hardcode the emission result for it?
00:58pmoreau: I just realised, for operations that allow mem files as argument (but they have to be 4-byte aligned), if they support it, one could use .B1/B2/B3 to access the other bytes, rather than completely preventing the load folding.
00:59pmoreau: No clue whether it makes sense performance-wise.
01:38pmoreau: Interesting: I had a kernel with the following signature, “char a, char b, global char* res” and I was getting (among others) “gr: M2MF 80000002 [PUSH_NOT_ENOUGH_DATA]”.
01:41pmoreau: I was assuming the following representation in memory: [0x0] = a, [0x1] = b, [0x8] = res (both on the CPU side and in my GPU kernel). But apparently the hardware was assuming [0x0] = a, [0x4] = b, [0x8] = res, as changing to that representation got rid of the error message.
02:12pmoreau: Never mind, that wasn’t the issue: if I had “char a, char b, global char* res” then everything would be fine, as implicit padding would take place since “res” is 8-byte aligned, so the input array would have a size of 16 bytes, i.e. 4 32-bit words (we divide the size by 4 when uploading the user input).
02:14pmoreau: Now, if I had the opposite “global char* res, char a, char b”, I end up with a structure which is no longer padded and has a size of 10 bytes, but due to the integer division by 4, we ended up saying it was only 2 32-bit words, when it should have been 3.
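[Editor's note] The arithmetic pmoreau describes can be sketched as follows. This is only an illustration of truncating versus rounding-up division, assuming the sizes quoted in the log (16 and 10 bytes); the helper names are hypothetical and are not taken from the actual nouveau or Mesa upload code.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not the driver's real code. */

/* Plain integer division: drops any trailing partial 32-bit word. */
static size_t words_truncating(size_t bytes)
{
    return bytes / 4;
}

/* Round up to the next whole 32-bit word so no trailing bytes are lost. */
static size_t words_rounded_up(size_t bytes)
{
    return (bytes + 3) / 4;
}
```

With the 16-byte padded layout both helpers give 4 words. With the 10-byte unpadded layout the truncating version gives 2 words, while rounding up gives the 3 words mentioned above.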
02:59bazzy: I noticed my kernel uses SWIOTLB for iommu, with a 64MB buffer. I was thinking of increasing this buffer for fun to 128MB, but I am aware that the driver needs to willfully use the increased space. Nvidia apparently has (had?) this setting http://us.download.nvidia.com/XFree86/Linux-x86_64/173.14.12/README/chapter-10.html Does nouveau have similar settings or what is its behavior regarding swiotlb?
03:00bazzy: related dmesg output:
03:00bazzy: [ 0.560017] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
03:00bazzy: [ 0.560023] software IO TLB [mem 0xa85c8000-0xac5c8000] (64MB) mapped at [ffff8800a85c8000-ffff8800ac5c7fff]
03:04bazzy: in other words, I don't want to allocate a bigger TLB and have nouveau not even use it