Kaktwoos tasks always stall out and never finish if interrupted

Message boards : Number crunching : Kaktwoos tasks always stall out and never finish if interrupted

Keith Myers
Joined: 8 Mar 21
Posts: 53
Credit: 243,832,973
RAC: 0
Message 536 - Posted: 22 Apr 2021, 17:13:39 UTC

Another user has confirmed a problem with the kaktwoos tasks. If your PC shuts down or crashes while running a kaktwoos task, then upon restart the task continues to accumulate clock time but never finishes.

Pausing or suspending the task and then resuming it does not restore the stalled task to normal, as it does with other projects' tasks.

The only solution is to abort any previously running task upon restart.

So have the developers been made aware of this issue?
ID: 536
vaughan

Joined: 30 Jun 20
Posts: 3
Credit: 17,230,083
RAC: 0
Message 537 - Posted: 25 Apr 2021, 3:52:10 UTC

I see the same issue on Windows 10 Pro.
ID: 537
Keith Myers
Message 538 - Posted: 25 Apr 2021, 8:32:08 UTC - in response to Message 537.  

The developers have implemented a code fix that just needs to be peer reviewed and then merged into the project software.

We should get a new application that properly resumes from a checkpoint in the near future. Crossing fingers.

But I have no exact timeline for you.
ID: 538
Hy
Project administrator
Project developer
Joined: 15 Jun 20
Posts: 74
Credit: 19,537,761
RAC: 0
Message 540 - Posted: 25 Apr 2021, 21:55:11 UTC - in response to Message 538.  
Last modified: 25 Apr 2021, 22:01:48 UTC

Hello, we've pushed the code to our GitHub and are verifying it. We've had this issue for some time, but now that we're seeing it more often due to more users, we've decided to take some steps to solve it.

This should be pushed out to BOINC for all clients to update to in a day or two. I trust Neil's work, and the changes don't seem to be large.
ID: 540
Keith Myers
Message 542 - Posted: 26 Apr 2021, 16:26:08 UTC - in response to Message 540.  

Thanks for the update. I looked over the code and the changes all seem minimal. Looking forward to the new application soon.
ID: 542
Hy
Message 543 - Posted: 26 Apr 2021, 23:03:22 UTC
Last modified: 26 Apr 2021, 23:03:36 UTC

Additionally, we've found one of the situations that triggers this issue, potentially related to the checkpointing system (or to how BOINC resumes from a major fault).
Basically, one of our developers shut down his system hard while a task was nearly complete, and it appears our checkpoint file became corrupted or was saved incorrectly. The difficulty is that checkpointing is so fast that it is almost impossible to reproduce the exact timing or situation required to cause the infinite-kaktwoos problem.

For reference, our normal checkpoint files are around 12 KB in size, while this one was only 1 KB. But even if you kill Kaktwoos in various ways (as we tested before), the issue isn't reliably reproduced. We might add an alternative fix, such as a basic verifier for checkpoints as they are reloaded. Generally, though, this issue only arises when a computer is shut down before the task is properly suspended, or when a random GPU crash freezes kaktwoos-cl while it is writing a checkpoint *early on*, corrupting the most important data and leading it to check unknown parameters.
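
As a rough illustration of what a basic checkpoint verifier could look like (this is a sketch, not the Kaktwoos-cl source): the file layout below, with a magic tag, a payload length, and a CRC32, is an assumption, as are the names `MAGIC`, `write_checkpoint`, and `load_checkpoint`. The real checkpoint format is not described in this thread. The sketch also writes to a temporary file and renames it into place, so an interrupted write should leave the previous checkpoint untouched.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical layout: 4-byte magic, 4-byte payload length, 4-byte CRC32, then payload.
MAGIC = b"KCP1"
HEADER = struct.Struct("<4sII")


def write_checkpoint(path, payload):
    """Write payload to a temp file, flush it to disk, then atomically replace path."""
    header = HEADER.pack(MAGIC, len(payload), zlib.crc32(payload))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(header + payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The old checkpoint remains valid until this rename succeeds.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the payload, or None if the file is missing, truncated, or corrupt."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(data)
    payload = data[HEADER.size:]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

With a check like this, a 1 KB file where 12 KB was expected would be rejected on reload, and the task could restart from zero instead of running on unknown parameters.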
ID: 543
Keith Myers
Message 544 - Posted: 27 Apr 2021, 1:08:18 UTC - in response to Message 543.  

Can you state whether the application adheres to the standard BOINC checkpointing mechanism, whereby the Manager interface can override the default project-specified checkpoint interval?
ID: 544
Hy
Message 549 - Posted: 29 Apr 2021, 18:23:07 UTC - in response to Message 544.  
Last modified: 29 Apr 2021, 18:24:57 UTC

As for 'overriding' the default checkpoint interval: that is set to 30 seconds (BOINC-side), but both checkpoints requested on suspension and automatic checkpoints are part of Kaktwoos-cl. Basically, whenever you see your progress bar tick forward, a checkpoint has been made. Mind you, I've honestly never seen any request, feedback, or usage involving the custom *timer* for minimum checkpointing.

Coding-wise, this is because I have set it to checkpoint every 200 million seeds checked, regardless of whether it is run by BOINC. Most GPUs manage 25-50 Mseed/s (million seeds per second), so they checkpoint every 4-8 seconds while a task runs, though this can supposedly be overridden by the BOINC checkpoint timer function and by a manual checkpoint request sent to Kaktwoos-cl.
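
The arithmetic behind those intervals can be checked with a short sketch. The 200 million figure comes from the post; the constant and function names are made up for illustration and are not taken from the Kaktwoos-cl source.

```python
SEEDS_PER_CHECKPOINT = 200_000_000  # figure quoted in the post


def checkpoint_interval_seconds(mseeds_per_second):
    """Seconds between checkpoints for a GPU checking this many million seeds per second."""
    return SEEDS_PER_CHECKPOINT / (mseeds_per_second * 1_000_000)


for rate in (25, 50):
    print(f"{rate} Mseed/s -> {checkpoint_interval_seconds(rate):.0f} s between checkpoints")
# prints:
# 25 Mseed/s -> 8 s between checkpoints
# 50 Mseed/s -> 4 s between checkpoints
```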

The BOINC documentation didn't appear to provide a standardized checkpoint format, only the API for knowing when to checkpoint. We use that API as a second trigger, for example if your GPU is very slow (like Intel graphics or AMD APUs) and takes more than 30 seconds to finish 200 million seeds.
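
To make the two triggers concrete, here is a minimal sketch of such a policy, assuming a checkpoint is due either after 200 million seeds or after 30 seconds, whichever comes first. The `CheckpointPolicy` class and its method names are hypothetical. In the real application the time-based signal would come from the BOINC API (for example `boinc_time_to_checkpoint()`) rather than from a local clock as it does here.

```python
import time

SEEDS_PER_CHECKPOINT = 200_000_000   # seed-count trigger described in the post
MIN_CHECKPOINT_SECONDS = 30.0        # time trigger; a stand-in for BOINC's timer


class CheckpointPolicy:
    """Decide when to checkpoint: after enough seeds, or after enough elapsed time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._seeds_since = 0
        self._last_time = clock()

    def record(self, seeds_checked):
        """Record progress and return True if a checkpoint is due now."""
        self._seeds_since += seeds_checked
        by_seeds = self._seeds_since >= SEEDS_PER_CHECKPOINT
        by_time = self._clock() - self._last_time >= MIN_CHECKPOINT_SECONDS
        return by_seeds or by_time

    def mark_written(self):
        """Reset both triggers after a checkpoint has been saved."""
        self._seeds_since = 0
        self._last_time = self._clock()
```

On a fast GPU the seed counter fires first; on a GPU slower than roughly 6.7 Mseed/s (200 million seeds in 30 seconds), the time trigger fires first.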
ID: 549
Keith Myers
Message 551 - Posted: 29 Apr 2021, 19:24:56 UTC - in response to Message 549.  

I only ask because I have in the past changed the default checkpoint timer in the Manager for other projects whose tasks ran for such short times that it was better to restart from zero than to checkpoint at 80% of a 60-second task.
ID: 551
