Kaktwoos tasks always stall out and never finish if interrupted
Author | Message |
---|---|
Joined: 8 Mar 21 Posts: 53 Credit: 245,502,973 RAC: 141 |
Got confirmation from another user of a problem with the kaktwoos tasks. If for some reason your PC quits or crashes while running a kaktwoos task, upon restart the task will continue to accumulate clock time but will never finish. Pausing or suspending the task and then resuming it does not restore the stalled task to normal, as it does with other projects' tasks. The only solution is to abort any previously running task upon restart. Have the developers been made aware of this issue? |
Joined: 30 Jun 20 Posts: 3 Credit: 17,950,083 RAC: 63 |
I see the same issue on Win 10 pro. |
Joined: 8 Mar 21 Posts: 53 Credit: 245,502,973 RAC: 141 |
The developers have implemented a code fix that just needs to be peer reviewed and then merged into the project software. We should get a new application that properly resumes from a checkpoint in the near future. Crossing fingers. But I have no exact timeline for you. |
Joined: 15 Jun 20 Posts: 74 Credit: 19,537,761 RAC: 0 |
Hello, we've pushed the code to our GitHub and are verifying it. We've had this issue for some time, but now that we're seeing it more often due to more users, we've decided to take some steps to solve it. This should be pushed out to BOINC for all clients to update to in a day or two. I trust Neil's work, and the changes don't seem to be large. |
Joined: 8 Mar 21 Posts: 53 Credit: 245,502,973 RAC: 141 |
Thanks for the update. I looked over the code and it all seems to be minimal changes. Looking forward to the new application soon. |
Joined: 15 Jun 20 Posts: 74 Credit: 19,537,761 RAC: 0 |
Additionally, we've found one of the situations that triggers this issue, potentially related to the checkpointing system (or to how BOINC resumes from a major fault). Basically, one of our developers hard shut down his system while a task was nearly complete, and it appears our checkpoint file ended up corrupted or incorrectly saved. The difficulty is that checkpointing is so fast that it is almost impossible to hit the exact timing or situation required to cause the infinite-kaktwoos problem. For reference, our normal checkpoint files are around 12 KB in size, while this one was only 1 KB. Even if you kill Kaktwoos in various ways (as we tested before), the issue isn't reliably reproduced. We might make an alternative fix, such as a basic verifier for the checkpoints that are being reloaded, but generally this issue only arises when a computer is shut down before the task is properly suspended, or when a random GPU crash freezes kaktwoos-cl while it is writing a checkpoint *early on*, corrupting the most important data and leading it to check unknown parameters. |
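The "basic verifier" idea mentioned above could look roughly like the sketch below. This is only an illustration in Python, not the Kaktwoos-cl code or its checkpoint format: the `CKPT` tag, the header layout, and the function names `write_checkpoint` and `load_checkpoint` are all hypothetical. The idea is to store the payload length and a CRC32 alongside the data, write through a temporary file that is swapped in atomically, and treat any file that fails the check as "no checkpoint" so the task restarts cleanly instead of resuming from garbage.

```python
import os
import struct
import tempfile
import zlib

MAGIC = b"CKPT"                  # hypothetical tag, not the real Kaktwoos format
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def write_checkpoint(path: str, payload: bytes) -> None:
    """Write to a temp file in the same directory, then atomically swap it in."""
    header = HEADER.pack(MAGIC, len(payload), zlib.crc32(payload))
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(header + payload)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes reach the disk
        os.replace(tmp_path, path)   # the old checkpoint stays intact until here
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str):
    """Return the payload, or None if the file is missing, truncated, or corrupt."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(data)
    payload = data[HEADER.size:]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None                  # caller should restart the task from zero
    return payload
```

With a scheme like this, a file cut short by a hard shutdown (for example 1 KB where about 12 KB was expected) fails the length check and is ignored rather than reloaded.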
Joined: 8 Mar 21 Posts: 53 Credit: 245,502,973 RAC: 141 |
Can you state whether the application adheres to the standard BOINC checkpointing mechanism, whereby the Manager interface can override the default project-specified checkpoint interval? |
Joined: 15 Jun 20 Posts: 74 Credit: 19,537,761 RAC: 0 |
As for 'overriding' the default checkpoint interval: that is set to 30 seconds (BOINC-side), but both checkpoint requests on suspension and automatic checkpoints are part of Kaktwoos-cl. Basically, whenever you see your progress bar tick forward, a checkpoint has been made. I've honestly never seen any request, feedback, or usage of the custom *timer* for minimum checkpointing, mind you. Coding-wise, this is because I have set it to checkpoint (regardless of whether it is run by BOINC or not) every 200 million seeds checked. Most GPUs run at 25-50 mseed/s, meaning most GPUs checkpoint every 4-8 seconds per task, though this can supposedly be overridden by the BOINC checkpoint timer function and the manual checkpointing request sent to Kaktwoos-cl. The BOINC documentation didn't appear to provide a standardized checkpoint format, just the API to know when to checkpoint, which we use as another reference for when to checkpoint, say if your GPU is very slow (like Intel graphics or AMD APUs) and takes more than 30 seconds to finish 200 million seeds. |
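The arithmetic in the post above can be checked with a few lines of Python. The 200 million seed threshold and the 30 second interval come from the post; the function names and the way the two triggers are combined (checkpoint when either one fires) are assumptions made for illustration, not the actual Kaktwoos-cl logic.

```python
SEEDS_PER_CHECKPOINT = 200_000_000  # from the post: checkpoint every 200 million seeds
BOINC_INTERVAL_S = 30.0             # from the post: BOINC-side checkpoint interval


def seconds_per_checkpoint(mseeds_per_second: float) -> float:
    """Time a GPU at the given rate needs to cover one checkpoint's worth of seeds."""
    return SEEDS_PER_CHECKPOINT / (mseeds_per_second * 1_000_000)


def should_checkpoint(seeds_since_last: int, seconds_since_last: float) -> bool:
    """Assumed rule: checkpoint when either the seed count or the timer is reached."""
    return (seeds_since_last >= SEEDS_PER_CHECKPOINT
            or seconds_since_last >= BOINC_INTERVAL_S)


for rate in (25, 50, 5):
    print(rate, "mseed/s ->", seconds_per_checkpoint(rate), "s per checkpoint")
```

At 25 and 50 mseed/s this gives 8 and 4 seconds. A GPU slower than about 6.7 mseed/s needs more than 30 seconds for 200 million seeds, which is the case where the timer-based trigger would fire first.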
Joined: 8 Mar 21 Posts: 53 Credit: 245,502,973 RAC: 141 |
I only ask because I have in the past changed the default checkpoint timer in the Manager for other projects' tasks that ran for such short intervals that it was better to restart from zero than to checkpoint at 80% of a 60-second task. |