[SBC mode] Random errors in macros with 3.5.2
-
@crpalmer are you running in standalone or SBC mode?
-
@dc42 sbc for both machines.
-
@crpalmer I ran your macro from a print file 10 times in a row with v3.5.2 and it completed every time as expected. Can you please share a print file and files
/sys/homexy-if-needed.g
and/sys/retractprobe-forced.g
as well? -
@chrishamm / @dc42 The two machines I am running 3.5.2 configs are:
https://github.com/crpalmer/3d-printing/tree/master/veho
https://github.com/crpalmer/3d-printing/tree/master/ender5For whatever reason, it happens much more frequently on "ender5" (don't be confused by the name, all that is left of the original ender-5 is the frame). It doesn't happen all the time and I'm starting to wonder if it happens more frequently when it has been running for awhile (perhaps some kind of subtle memory corruption?). It did happen again yesterday:
9/10/2024, 6:27:28 AM Error: in file prepare-to-print.g line 32: unknown variable 'probing_temp' 9/10/2024, 6:27:28 AM Error: in file prepare-to-print.g line 34: unknown variable 'probing_temp' 9/10/2024, 6:27:12 AM Warning: Macro file has been started on channel File but none was requested
and like before after the print finished (including M2 in the end.gcode) the machine thought that the print was still running and it was impossible to cancel the print. I end up just doing an Emergency Stop when this happens.
-
@crpalmer Can you please upload a demo G-code job as well? Then I could prepare a bench setup and attempt to reproduce it.
-
@chrishamm Here's a gcode file that I was running at around the time I last saw the error. I can't be sure it's the one I was running but the exact object probably doesn't matter given that the error happens before the print itself starts.
-
@chrishamm I was just browsing the "random restarts on mini" thread and wondered if my problems could be related. However, both of my machines are running Duet 3 Mini 5+ (Ethernet not WiFi).
Both machines have CAN boards attached. The one that encounters this error more frequently is using a EXP3HC and the other a TOOL1LC.
-
@crpalmer I think I've got a bug fix ready which will be part of v3.5.3.
-
@chrishamm Thank you but unfortunately I think that's not the (entire?) fix. Was it this commit that you were hoping fixed the problem?
I find that the error occurs more frequently on one of my machines so I upgraded it to 3.5.3 to test. I just had the error occur when starting a print. FYI:
I have 3 machines that were running 3.5.2 and I'm seeing errors with very different frequencies on these machines. The one I upgraded to 3.5.3 has the error occur 10% of the time. That machine has a EXP3HC board that runs a Y motor an U motor and an extruder. The second machine runs a toolboard and I see the error here less frequently (I don't have a percentage estimate but definitely less frequently). The last machine has yet to report an error and it doesn't use any CAN expansion boards.
I don't see anything interesting in the differences in the SBC. All three machines are running Raspberry PI 4 boards (the two that have had the errors are Rev 1.5 and the one that hasn't is 1.4). The most machine that most commonly has errors and the one that has never had an error are running Raspbian 10 and the one other is Raspbian 12.
9/21/2024, 4:12:58 PM Error: in file prepare-to-print.g line 32: unknown variable 'probing_temp' 9/21/2024, 4:12:58 PM Error: in file prepare-to-print.g line 34: unknown variable 'probing_temp' 9/21/2024, 4:12:43 PM Warning: Macro file has been started on channel File but none was requested
-
@crpalmer That was only one of the commits for the full fix, there was at least one more in DSF, too. So far I've been unable to reproduce that issue again since I made those changes. If you can come up with a simpler and reliable way to reproduce this problem, I can have a look at it again. Unfortunately, I can't reproduce your full setup here and AFAIK we haven't got similar reports from other customers either.
-
@chrishamm This problem is most definitely not fixed. Yesterday I had the error occur twice. FWIW, random and explained behaviour by the software used to control industrial machines is something that I think should still be investigated -- even if it's difficult.
Whenever I get the random error message, I click "Emergency Stop" in the interface and restart the job. I do so just to be on the safe side and since I know I'll have to click emergency stop eventually anyway because the print never terminates, and the interface fails to let me cancel it.
Yesterday, the error occurred in 2 prints jobs in a row (with an emergency stop in between). Memory corruption, while still a possibility, sounds very unlikely.
I did try to find a reproducible example by having macros running macros running macros, but I was not able to reproduce anything that doesn't involve actually running print jobs. One difference in my tests and the actual print jobs is that some macro invocations are generated by M401 / M402 gcodes. I see in the RRF code that when system macros are run, it takes not of that (runningSystemMacro). Is there any difference there that could be part of the problem?
-
@crpalmer I suspect something closes one of your macro files prematurely - that might explain those error messages. I need to have a look at this again provided I can reproduce it somehow.
Meanwhile you could enable debug logging by changing the log level in
/opt/dsf/conf/config.json
frominfo
todebug
and share the corresponding journal when it happens again. If it is what I think it is, it might be possible to reduce the frequency of this issue by reducing the number of nested macros.I've also logged this issue here: https://github.com/Duet3D/DuetSoftwareFramework/issues/205 Note that I'm not available next week but I'm happy to have a look at this again when I'm back.
-
@chrishamm Thanks.
I changed the log level to debug and reboot the PI to be sure everything was restarted. I see debug messages from the DuetControlServer and DuetWebServer in /var/log/syslog. If the error happens again (I fear the debug logging is going to change the timing that causes this to happen) should I attach the output of running "grep -i duet /var/log/syslog" to the github bug report?
(Let me know next week when you're back)
-
@chrishamm I have a syslog that has the same gcode run twice in a row, once with no errors and then once with the errors. Hopefully the syslog is the "journal" that you were referring to. Either way, it's good to know that the error still occurs with the debug logging enabled.