Checkpoint(Pausing) might be causing Invalid Results

match123f

Joined: 11 Aug 24
Posts: 1
Credit: 152,500
RAC: 3,322
Message 992 - Posted: 19 Jan 2025, 8:13:26 UTC

In short:
My conclusion: interrupting the calculation (by pausing or rebooting) clears its result file (seeds.txt), which I think is what ultimately produces the invalid result.

In my opinion, the program could be improved by not clearing the file at startup and instead appending to the end of the file if it already exists. After all, BOINC automatically clears the task's slot once the task has finished.


Hey guys!
For project 1.20 Find seeds with zero villages within a radius v1.01 (cuda), I got my first task error, so I decided to take a look at other users' errors.
Having checked 3 users' tasks and the work units they belong to, I noticed that every task that had been paused at least once failed, and none of the tasks that succeeded had been paused.

Although I haven't done a complete investigation, I assume that interrupted calculations end up as invalid results. I monitored my BOINC data directory and found that after a pause or reboot, the seeds.txt file is cleared and a "checkpoint loaded..." line is appended to stderr.txt. I couldn't find the original seeds anywhere, so I guess the program simply uploads the seeds.txt file as the result.

Now I'm running a test on my laptop, with a scheduled job that backs up the seeds file every 5 minutes. I re-imported the first two seeds that had been cleared, and the task is now nearly complete. Let's see whether it comes out right this time...
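The backup step in this test could be reproduced with a plain file copy. Below is a minimal C sketch of that idea, assuming the copy is triggered every few minutes by an external scheduler (cron, Task Scheduler, or similar). The function name `backup_file` and both paths are hypothetical and not part of the project.

```c
#include <stdio.h>

/* Copy src to dst byte for byte. Returns 0 on success, -1 on any error.
 * Both paths are hypothetical, e.g. the slot's seeds.txt and a backup
 * location outside the BOINC data directory. */
int backup_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (!in)
        return -1;

    FILE *out = fopen(dst, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    size_t n;
    int status = 0;

    /* Read in chunks and write each chunk out in full. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;

    fclose(in);
    if (fclose(out) != 0)   /* fclose flushes; a failure here means data loss */
        status = -1;
    return status;
}
```

A scheduled job would call something like `backup_file("seeds.txt", "seeds_backup.txt")`. Writing each backup to a new file name keeps an already-cleared seeds.txt from overwriting the last good copy.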

[Jan 19, 3:49 p.m. UTC+8] Yes! I got it! Please look at https://minecraftathome.com/minecrafthome/workunit.php?wuid=5093723, where Task 10832714 is mine.
Task 10310763 was paused 2 times and came out wrong, while Task 10310764 was never paused. My task was paused 4 times, but it still succeeded because I had re-imported the 2 missing seeds that were cleared after the checkpoint restarts.

So this is my conclusion: interrupting the calculation (by pause, reboot, etc.) clears its result file (seeds.txt), which I think is what ultimately produces the invalid result. Surprisingly, a busy CPU doesn't trigger a checkpoint restart, probably because the program is suspended through signals rather than being killed and resumed from a checkpoint.

In my opinion, the program could be improved by not clearing the file at startup and instead appending to the end of the file if it already exists. After all, BOINC automatically clears the task's slot once the task has finished.
boysanic
Project administrator
Project developer

Joined: 15 Jun 20
Posts: 32
Credit: 101,415,555
RAC: 110,632
Message 993 - Posted: 19 Jan 2025, 23:16:23 UTC - in response to Message 992.  

Hi!

Thanks for the bug report! I think you're correct.
https://github.com/MinecraftAtHome/LoneliestSeed/blob/main/main.cu#L4867

Here's where we're opening seeds.txt. I believe this should be "a", not "w+", as w/w+ truncates an existing file when it is opened.
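The difference between the two modes can be reproduced outside the project. Below is a minimal C sketch, not the project's code: it opens a file, writes two lines, closes it, then reopens it with the same mode to mimic a relaunch after a pause or reboot and writes one more line. The function names, the file name, and the seed values are made up for illustration.

```c
#include <stdio.h>

/* Count newline-terminated lines in a file; returns -1 if it cannot be opened. */
static int count_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int lines = 0, c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/* Simulate one interrupted run: the output file is opened with `mode`
 * on the first launch and again with the same `mode` on the relaunch.
 * Returns the number of lines left in the file afterwards, or -1 on error. */
int lines_after_restart(const char *path, const char *mode)
{
    remove(path);                 /* start from a clean slate */

    FILE *f = fopen(path, mode);  /* first launch */
    if (!f)
        return -1;
    fputs("111\n222\n", f);       /* two seeds found before the interruption */
    fclose(f);

    f = fopen(path, mode);        /* relaunch after pause or reboot */
    if (!f)
        return -1;
    fputs("333\n", f);            /* one seed found after resuming */
    fclose(f);

    return count_lines(path);
}
```

Calling `lines_after_restart("seeds_demo.txt", "w+")` returns 1, because the second `fopen` truncates the file and the first two seeds are lost, while `lines_after_restart("seeds_demo.txt", "a")` returns 3. Mode "a" also creates the file when it does not exist yet, so no separate existence check is needed. One caveat, which this thread does not settle: with "a", anything written after the last checkpoint but before the interruption stays in the file, so depending on how the program checkpoints, those seeds might be written a second time after resuming.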

I'll update that and push it up to the server once it's finished. I appreciate the heads up about this, and apologies for any inconvenience caused by the validation errors.


Thank you!
AnandBhat

Joined: 9 Sep 24
Posts: 3
Credit: 25,487,500
RAC: 2,680
Message 995 - Posted: 1 Feb 2025, 4:47:14 UTC - in response to Message 993.  
Last modified: 1 Feb 2025, 4:47:24 UTC

I see a validation error in one of my tasks (running v1.07) that had to checkpoint. Will this be fixed in v1.08?

https://minecraftathome.com/minecrafthome/result.php?resultid=10340374
<core_client_version>7.18.1</core_client_version>
<![CDATA[
<stderr_txt>
boinc gpu 0 gpuindex: 0 
No checkpoint to load
boinc gpu 0 gpuindex: 0 
Checkpoint loaded, task time 1097 s, seed pos: 119
checked = 1073741824
time taken = 9809.283000
seeds per second: 124341.598919
07:08:31 (24849): called boinc_finish(0)
</stderr_txt>
]]>
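For checking many results at once, the "Checkpoint loaded" line in stderr is a convenient marker that a task was resumed. Below is a minimal C sketch that extracts the two numbers from such a line and counts resumes in a whole stderr dump. The line format is taken from the output above; whether other application versions print it in exactly this form is an assumption, and both function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Parse one stderr line of the form
 *   "Checkpoint loaded, task time 1097 s, seed pos: 119"
 * Returns 1 and fills both outputs on a match, 0 otherwise. */
int parse_checkpoint_line(const char *line, long *task_time_s, long *seed_pos)
{
    return sscanf(line, "Checkpoint loaded, task time %ld s, seed pos: %ld",
                  task_time_s, seed_pos) == 2;
}

/* Count how many times a stderr dump reports a resume from a checkpoint.
 * The match is case-sensitive, so "No checkpoint to load" is not counted. */
int count_checkpoint_loads(const char *stderr_text)
{
    const char *needle = "Checkpoint loaded";
    int count = 0;
    for (const char *p = strstr(stderr_text, needle); p != NULL;
         p = strstr(p + 1, needle))
        count++;
    return count;
}
```

For the output above, `count_checkpoint_loads` gives 1, and `parse_checkpoint_line` on the matching line yields a task time of 1097 s and a seed position of 119.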