AMD RX 5700 XT will not validate

Jord
Volunteer moderator
Help desk expert
Joined: 24 Jun 20
Posts: 85
Credit: 207,156
RAC: 0
Message 82 - Posted: 28 Jun 2020, 4:17:33 UTC

All tasks on my AMD RX 5700 XT so far have not validated, not even one paired against an AMD RX 580. But seeing how I am so far the only RX 5700 XT around, it's difficult to put a finger on it.
As I mentioned in the News thread, everyone else's tasks appear to use CPU plus GPU, whereas mine run only on the GPU and produce no output about seeds, or anything else, in stderr.txt.

Running BOINC 7.16.7
Windows 10 x64
Radeon drivers 20.5.1 -> was 20.3.1 but with those the same thing.
ID: 82
Hy
Project administrator
Project developer
Joined: 15 Jun 20
Posts: 74
Credit: 19,537,761
RAC: 0
Message 89 - Posted: 28 Jun 2020, 13:20:08 UTC - in response to Message 82.  

Which AMD drivers are you on? The trouble with this problem is that it is extremely hard to reproduce reliably.

I have seen other RX 5700 XT machines produce both tasks with normal output (as coded into the binary) and tasks with empty output, just like yours. The machine in question had the same kaktwoos-cl version and the same task length for both the good and the bad outputs.

AMD has officially confirmed that there was a printf bug in their OpenCL drivers that may have been patched in 20.5.1. We print to stderr, and that same code works flawlessly on some AMD GPUs and all Nvidia machines.

https://stackoverflow.com/questions/62545440/rootcausing-or-working-around-possible-amd-compiler-bug
https://community.amd.com/thread/244452

In the meantime, I had someone whose RX 5700 XT machine behaved unreliably (successes followed by empty outputs in a row within a period of a few hours) run a modified binary, and it has since reported 8 tasks in a row that can be validated. That compares with earlier runs of only two invalid or two validated tasks in a row.

I have not personally seen any RX 580s or RX Vegas hit these failures with 1.12 yet, but I will keep going over the logs.
ID: 89
Jord
Volunteer moderator
Help desk expert
Joined: 24 Jun 20
Posts: 85
Credit: 207,156
RAC: 0
Message 90 - Posted: 28 Jun 2020, 14:52:17 UTC - in response to Message 89.  

Hy wrote:
Which AMD drivers are you on?

Jord wrote:
Radeon drivers 20.5.1 -> was 20.3.1 but with those the same thing.
ID: 90
Jord
Volunteer moderator
Help desk expert
Joined: 24 Jun 20
Posts: 85
Credit: 207,156
RAC: 0
Message 93 - Posted: 28 Jun 2020, 15:27:48 UTC

Apropos, something I touched on in the News thread but that has probably been snowed under:
One thing I see, though: the kaktwoos application runs at Low priority in Windows Task Manager, even in the wrapper. The wrapper itself runs at Normal priority. Since we run on the GPU, you should try setting the priority to below_normal (case 2, Low priority on the WrapperApp page).

You now have it set to lowest, case 1. This may cause problems for GPU apps, as anything else will take cycles away. On some systems this may even be recognized as idle, and the computer will go to sleep while the GPU is doing work, resulting in crashed apps.
Commands: https://boinc.berkeley.edu/trac/wiki/WrapperApp

(And I asked why run the OpenCL app in a wrapper? Is it in such an obscure language that it can't run natively?)
ID: 93
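
For readers following the "case 1" and "case 2" references in Message 93, here is a minimal sketch of how the wrapper's numeric priority values are understood to map to Windows priority classes. The mapping is an assumption based on a reading of the WrapperApp page linked above, and `windows_priority_class` is a hypothetical helper for illustration, not part of BOINC.

```python
# Assumed mapping of the wrapper's numeric priority values to Windows
# priority classes; check the WrapperApp page before relying on it.
WRAPPER_PRIORITY_TO_WINDOWS_CLASS = {
    1: "IDLE_PRIORITY_CLASS",          # lowest; Task Manager shows this as "Low"
    2: "BELOW_NORMAL_PRIORITY_CLASS",  # low; Task Manager shows "Below normal"
    3: "NORMAL_PRIORITY_CLASS",
    4: "ABOVE_NORMAL_PRIORITY_CLASS",
    5: "HIGH_PRIORITY_CLASS",
}


def windows_priority_class(wrapper_priority: int) -> str:
    """Hypothetical helper: name the Windows class for a wrapper priority value."""
    return WRAPPER_PRIORITY_TO_WINDOWS_CLASS.get(wrapper_priority, "unspecified")


if __name__ == "__main__":
    for value in sorted(WRAPPER_PRIORITY_TO_WINDOWS_CLASS):
        print(value, "->", windows_priority_class(value))
```

Under this reading, a process that Task Manager lists as "Low" is at value 1, and the suggestion in the post amounts to moving it up one step to value 2.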
chip
Project administrator
Joined: 14 Jun 20
Posts: 78
Credit: 1,321,619
RAC: 0
Message 98 - Posted: 29 Jun 2020, 0:20:34 UTC - in response to Message 93.  

Jord wrote:
(And I asked why run the OpenCL app in a wrapper? Is it in such an obscure language that it can't run natively?)

We initially used the wrapper because none of our developers had BOINC experience; it flattened the learning curve and made it easier to get started.

Now, of course, we realise the wrapper is more hassle than it's worth for these apps and seriously limits us. So a coming change will likely remove it altogether in favour of integrating the BOINC API into the app.
ID: 98
Hy
Project administrator
Project developer
Joined: 15 Jun 20
Posts: 74
Credit: 19,537,761
RAC: 0
Message 100 - Posted: 29 Jun 2020, 4:56:58 UTC
Last modified: 29 Jun 2020, 4:59:20 UTC

Some really good news:

I managed to do a bunch of fixes to our kaktwoos-cl code to make it C++ compliant, and finally compiled it with boinc_api headers included and the calls required to interface with BOINC.

I tested some of its functionality, and it appears to behave like a BOINC application and still works in standalone mode. I already added all the code required to interface with the checkpointing system, and the progress indicator in BOINC will be updated using the old internal percentage of seeds worked through. All program output goes to the right place now without question, so that may finally fix the unusual AMD empty-output bugs.

I have also tested suspend/resume, and it appears to work as intended: the checkpoint auto-save plus the manual call from BOINC restores the proper progress % (calculated as the number of seeds processed so far divided by the number of seeds in the range), and the GPU resumes properly according to the logs.

More testing will need to be done of course, and we will need to plan the migration away from the wrapper we use to this new kaktwoos-cl-boinc.
Multi-GPU support is about 95% finished as well. I've replaced all the old calls with calls to BOINC's OpenCL device ID function instead.

I'll just need to verify it on a system where I know manual device selection was working anyway.
ID: 100
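
The progress and checkpoint behaviour described in Message 100 can be sketched roughly as follows. This is not the kaktwoos-cl code, which is C++ and presumably reports through BOINC API calls such as `boinc_fraction_done`, `boinc_time_to_checkpoint` and `boinc_checkpoint_completed`. The function names, the JSON layout and the `checkpoint.json` file name below are all assumptions made for illustration.

```python
import json
import os
import tempfile


def fraction_done(seeds_processed: int, seeds_in_range: int) -> float:
    """Progress as seeds processed divided by total seeds, clamped to [0, 1]."""
    if seeds_in_range <= 0:
        return 0.0
    return min(max(seeds_processed / seeds_in_range, 0.0), 1.0)


def save_checkpoint(path: str, seeds_processed: int) -> None:
    """Write the checkpoint to a temp file, then rename it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"seeds_processed": seeds_processed}, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> int:
    """Return the saved seed offset, or 0 when there is no checkpoint to load."""
    try:
        with open(path) as handle:
            return int(json.load(handle)["seeds_processed"])
    except FileNotFoundError:
        return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "checkpoint.json")  # hypothetical file name
        total = 1_000
        done = load_checkpoint(checkpoint)     # 0 on a fresh start
        done += 250                            # pretend a batch of seeds ran on the GPU
        save_checkpoint(checkpoint, done)      # a real app would do this when BOINC asks
        resumed = load_checkpoint(checkpoint)  # what a restart would see
        print(f"resumed at {resumed} seeds, {fraction_done(resumed, total):.1%} done")
```

Writing to a temporary file and renaming it means an interruption during the save leaves the previous checkpoint intact, which matters when BOINC can suspend or kill the task at any time.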
Jord
Volunteer moderator
Help desk expert
Joined: 24 Jun 20
Posts: 85
Credit: 207,156
RAC: 0
Message 140 - Posted: 1 Jul 2020, 23:37:32 UTC
Last modified: 1 Jul 2020, 23:38:08 UTC

Yay, finally. I have credit, as the new OpenCL 2.0 AMD application ran a task and it validated correctly against one done on an Nvidia GTX 960.
CPU time is still just tens of seconds, unlike the Nvidia app, where CPU time is almost as long as the GPU run time. Now I wonder whether it really runs on the GPU for Nvidia rather than on the CPU. But I can't check that, as I don't have an Nvidia GPU. ;-)

https://minecraftathome.com/minecrafthome/workunit.php?wuid=1262529 output:
<core_client_version>7.16.7</core_client_version>
<![CDATA[
<stderr_txt>
Received work unit: 265366080896324
Data: n1: 119, n2: 615, n3: 103, di: 2, ch: 12
boinc gpu 0 gpuindex: 0 
No checkpoint to load
Speed: 41.08m/s 
Done
Processed 100000000000 seeds in 2434.086000 seconds
Found seeds: 
01:31:45 (4404): called boinc_finish(0)

</stderr_txt>

(And checkpointing works great!)
ID: 140
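
As a quick sanity check on the stderr output in Message 140, the reported speed can be recomputed from the summary line. The sketch below assumes that "m/s" in the `Speed:` line means million seeds per second; `seeds_per_second` is a hypothetical helper, not part of the application.

```python
import re

# Line copied from the stderr output in Message 140.
SUMMARY_LINE = "Processed 100000000000 seeds in 2434.086000 seconds"


def seeds_per_second(summary_line: str) -> float:
    """Parse a 'Processed N seeds in T seconds' line and return N / T."""
    match = re.fullmatch(
        r"Processed (\d+) seeds in ([0-9.]+) seconds", summary_line.strip()
    )
    if match is None:
        raise ValueError(f"unrecognised summary line: {summary_line!r}")
    seeds, seconds = int(match.group(1)), float(match.group(2))
    return seeds / seconds


if __name__ == "__main__":
    rate = seeds_per_second(SUMMARY_LINE)
    print(f"{rate / 1e6:.2f} million seeds per second")
```

Under that assumption the result agrees with the `Speed: 41.08m/s` line, so the task appears to have worked through its whole range at a steady rate.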
