Print stopped at 83% for seemingly no reason

dc42

I'm sorry, I think you have found a firmware bug. This line from your report:

Scheduled moves: 402064, completed moves: 401876

has a difference of 188, which is much greater than the size of the move queue. So it must be failing to count some completed moves. That has caused the code queue to become full with M106 commands; so the firmware is waiting for the completed move count to catch up.

I think your only hope of saving the print is to pause it, check that a resurrect.g file has been created, use Emergency Stop to reset the Duet, and use the resurrect feature (M916) to restart the print. ~~I think firmware 2.02RC3 inserts a G92 command in resurrect.g prior to calling resurrect-prologue.g, and if that is the case then you probably only need a M116 command in resurrect-prologue.g.~~

EDIT: I just checked the code. To avoid resetting the Duet (which will power the motors down, potentially losing position) you can pause the print, back up the resurrect.g file, then cancel the print. Then use a text editor to remove the G92 command from resurrect.g. Then use M916 to resume the print (as before you will need M116 in the resurrect-prologue.g file).

dc42

PS - I'm trying to track down how the move counts became so different. Did you pause the print a large number of times?

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

I'm sorry, I think you have found a firmware bug. This line from your report:

Scheduled moves: 402064, completed moves: 401876

has a difference of 188, which is much greater than the size of the move queue. So it must be failing to count some completed moves. That has caused the code queue to become full with M106 commands; so the firmware is waiting for the completed move count to catch up.

I think your only hope of saving the print is to pause it, check that a resurrect.g file has been created, use Emergency Stop to reset the Duet, and use the resurrect feature (M916) to restart the print. ~~I think firmware 2.02RC3 inserts a G92 command in resurrect.g prior to calling resurrect-prologue.g, and if that is the case then you probably only need a M116 command in resurrect-prologue.g.~~

EDIT: I just checked the code. To avoid resetting the Duet (which will power the motors down, potentially losing position) you can pause the print, back up the resurrect.g file, then cancel the print. Then use a text editor to remove the G92 command from resurrect.g. Then use M916 to resume the print (as before you will need M116 in the resurrect-prologue.g file).

@dc42 said in Print stopped at 83% for seemingly no reason:

PS - I'm trying to track down how the move counts became so different. Did you pause the print a large number of times?

Thanks for the feedback dc, much appreciated.

I did pause and resume the file a few times when I noticed it wasn't printing but not while the print was working. I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

dc42

@shogran said in Print stopped at 83% for seemingly no reason:

I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

Thanks, that may be where the problem is. What action do you have configured when a stall is detected?

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

@shogran said in Print stopped at 83% for seemingly no reason:

I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

Thanks, that may be where the problem is. What action do you have configured when a stall is detected?

This is my stall detection line:

M915 X Y S10 R3 F0 ; Stall detection, set X and Y to sensitivity 10, pause print, rehome, unfiltered

and this is the rehome.g file it's using:

G91                       ; relative mode
G1 Z5 F200 S2; lower bed
G1 S1 X-305 Y274.5 F3000   ; course home X or Y
G1 S1 X-305              ; course home X
G1 S1 Y274.5               ; course home Y
G1 X4 Y-4 F600             ; move away from the endstops
G1 S1 X-305                ; fine home X
G90
G1 Y274.5             ; fine home Y
G91
G1 Z-5 F200 S2; lower bed

dc42

Thanks.

It looks like it occured at layer 127 (Z height 17.66mm)

The GCode file says that layer 127 has Z height 12.700mm. So do you mean layer 127, or 12.700mm height? It's possible that DWC is displaying the wrong layer count.

I've reviewed the part of the firmware that handles an emergency pause (which is done when there is power loss or a stall is detected), and I found two issues:

When an emergency pause occurs, an off-by-one error meant that the number of scheduled moves ends up one higher than it should be;
After an emergency pause, the queue of deferred commands was not being resynced to allow for moves that were discarded from the queue.

What I think happened was that you had a large number of repeated stalls around the same point in the print, during or shortly before bridging moves. Each one caused the number of scheduled moves to increase by one, and some deferred M106 commands to be left in the queue instead of being purged. This explains the following part of the M122 report:

Code queue is not empty:
Queued 'M106 S204' for move 401967
Queued 'M106 S102' for move 401968
Queued 'M106 S204' for move 401970
Queued 'M106 S102' for move 401975
Queued 'M106 S204' for move 401978
Queued 'M106 S102' for move 401979
Queued 'M106 S204' for move 401981
Queued 'M106 S102' for move 401986
Queued 'M106 S204' for move 401989
Queued 'M106 S102' for move 401990
Queued 'M106 S204' for move 402016
Queued 'M106 S102' for move 402017
Queued 'M106 S204' for move 402019
Queued 'M106 S102' for move 402024
Queued 'M106 S204' for move 402027
Queued 'M106 S102' for move 402028

The move numbers are so close together that I think the multiple pairs M106 moves must repetitions of the same sequence of bridging moves.

Do you think it possible that the stall detection was triggered 150 or more times, with at least the last 8 being all around the same place? If you have logging enabled, then the stalls may be recorded in the log file.

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

Thanks.

It looks like it occured at layer 127 (Z height 17.66mm)

The GCode file says that layer 127 has Z height 12.700mm. So do you mean layer 127, or 12.700mm height? It's possible that DWC is displaying the wrong layer count.

Ah, that would make sense actually as it was printing with a 0.1mm layer height. DWC may have been displaying the wrong info in this instance.

I've reviewed the part of the firmware that handles an emergency pause (which is done when there is power loss or a stall is detected), and I found two issues:

When an emergency pause occurs, an off-by-one error meant that the number of scheduled moves ends up one higher than it should be;

After an emergency pause, the queue of deferred commands was not being resynced to allow for moves that were discarded from the queue.

What I think happened was that you had a large number of repeated stalls around the same point in the print, during or shortly before bridging moves. Each one caused the number of scheduled moves to increase by one, and some deferred M106 commands to be left in the queue instead of being purged. This explains the following part of the M122 report:

Code queue is not empty:
Queued 'M106 S204' for move 401967
Queued 'M106 S102' for move 401968
Queued 'M106 S204' for move 401970
Queued 'M106 S102' for move 401975
Queued 'M106 S204' for move 401978
Queued 'M106 S102' for move 401979
Queued 'M106 S204' for move 401981
Queued 'M106 S102' for move 401986
Queued 'M106 S204' for move 401989
Queued 'M106 S102' for move 401990
Queued 'M106 S204' for move 402016
Queued 'M106 S102' for move 402017
Queued 'M106 S204' for move 402019
Queued 'M106 S102' for move 402024
Queued 'M106 S204' for move 402027
Queued 'M106 S102' for move 402028

The move numbers are so close together that I think the multiple pairs M106 moves must repetitions of the same sequence of bridging moves.

Do you think it possible that the stall detection was triggered 150 or more times, with at least the last 8 being all around the same place? If you have logging enabled, then the stalls may be recorded in the log file.

Unfortunately I didn't have logging enabled, though I have now added it to my config.g for future use. I think it's possible that it was triggered that many times as I've only recently enabled it and am still tuning the sensitivity. I've definitely been in the room a few times when it has triggered a few times in the space of a few seconds on the same point before moving past it.

dc42

Note, the sensitivity to stalls increases with motor temperature. So a stall sensitivity in the M915 command that was OK when the motors were cold may be too sensitive when the motors are hot due to printing for some time.

shogran

@dc42 Thanks, I've restarted the print with a reduced sensitivity on the stall detection and logging enabled. I'll update here with the outcome.

shogran

@dc42 Printed without any issues after reducing the sensitivity and not having the stall detection trigger at all that I could see. Looks like that may have been the cause.

dc42

@shogran said in Print stopped at 83% for seemingly no reason:

@dc42 Printed without any issues after reducing the sensitivity and not having the stall detection trigger at all that I could see. Looks like that may have been the cause.

i'm glad it worked. I've put a fix for the off-by-one issue in the forthcoming 2.02RC6 release. Please accept my apologies for the failed print.