1LC Multiple Disconnects

curieos

@curieos Looks like a "Hard fault"

M122 B21
Diagnostics for board 21:
Duet TOOL1LC rev 1.1 or later firmware version 3.4.6 (2023-07-21 14:17:33)
Bootloader ID: SAMC21 bootloader version 2.8 (2023-07-25)
All averaging filters OK
Never used RAM 2016, free system stack 62 words
Tasks: Move(notifyWait,0.0%,155) HEAT(notifyWait,0.2%,101) CanAsync(notifyWait,0.0%,57) CanRecv(notifyWait,0.0%,76) CanClock(notifyWait,0.0%,65) ACCEL(notifyWait,0.0%,61) TMC(delaying,3.0%,57) MAIN(running,91.9%,351) IDLE(ready,0.0%,26) AIN(delaying,4.9%,142), total 100.0%
Last reset 00:17:42 ago, cause: power up
Last software reset at 2024-01-30 16:35, reason: HardFault, available RAM 2016, slot 0
Software reset code 0x0060 ICSR 0x00000003 SP 0x20002ed8 Task MAIN Freestk 801 ok
Stack: 00000156 00000001 68104a2f 00000016 00000000 0001caf7 0000ceca 21000000 00000000 00000000 00000000 00000000 43ca6148 00000000 200048c8 00001401 00000000 11a81500 0000002c 030014d4 00000000 0000f01f a5a5a5a5 00248b5a 00248b5a 000001d6 000001f4
Driver 0: pos 0, 404.8 steps/mm,standstill, SG min 0, read errors 0, write errors 0, ifcnt 12, reads 7026, writes 12, timeouts 0, DMA errors 0, CC errors 0, steps req 0 done 0
Moves scheduled 0, completed 0, in progress 0, hiccups 0, step errors 0, maxPrep 0, maxOverdue 0, maxInc 0, mcErrs 0, gcmErrs 0
Peak sync jitter 0/4, peak Rx sync delay 213, resyncs 0/0, no step interrupt scheduled
VIN voltage: min 27.3, current 27.5, max 27.6
MCU temperature: min 51.2C, current 53.8C, max 53.9C
Last sensors broadcast 0x00018004 found 3 187 ticks ago, 0 ordering errs, loop time 0
CAN messages queued 21514, send timeouts 0, received 17783, lost 0, free buffers 37, min 37, error reg 0
dup 0, oos 0/0/0/0, bm 0, wbm 0, rxMotionDelay 0
Accelerometer: LIS3DH, status: 00
I2C bus errors 0, naks 3, other errors 0
=== Filament sensors ===
Interrupt 4 to 9us, poll 8 to 505us
Driver 0: pos 2160.00, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
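
For comparing reports between lockups, the two reset lines can be pulled out of the pasted text. This is a minimal sketch that assumes the report is available as plain text. The function name `parse_reset_info` and the dictionary keys are illustrative only and are not part of any Duet tooling.

```python
import re

# Two lines copied verbatim from the M122 B21 report above.
REPORT = """\
Last reset 00:17:42 ago, cause: power up
Last software reset at 2024-01-30 16:35, reason: HardFault, available RAM 2016, slot 0
"""

def parse_reset_info(report: str) -> dict:
    """Extract the last reset cause and the stored software-reset record."""
    info = {}
    m = re.search(r"Last reset (\d+):(\d+):(\d+) ago, cause: (.+)", report)
    if m:
        hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
        info["uptime_seconds"] = hours * 3600 + minutes * 60 + seconds
        info["reset_cause"] = m.group(4).strip()
    m = re.search(r"Last software reset at (\S+ \S+), reason: ([^,]+)", report)
    if m:
        info["software_reset_time"] = m.group(1)
        info["software_reset_reason"] = m.group(2).strip()
    return info

info = parse_reset_info(REPORT)
print(info)
```

Running the same function on each new report makes it easy to see whether the stored software-reset record changes from one lockup to the next.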

curieos

@curieos Anyone? Looking at other posts with this issue, this might be a firmware bug?

In case it's not, I tried some additional troubleshooting today. The only things that changed between the toolboard working fine and it having issues were a grounding wire run to the toolhead plate, and a new CAN cable with crimped terminations instead of the soldered pigtails. A genuine JST crimping tool for ZH terminals was used to create the terminations. I don't think it's the cable, as the last thing in the CAN loop is a second 6XD, not the 1LC. I believe that if the cable had issues, the second 6XD would have communication problems too. Also, the toolboard was locked up, not having CAN connection issues.

Something I noticed during the lockup issues was the last reported MCU temperature, it was always 54.7C. Today I tried recreating this temperature with both the old CAN cable and new CAN cable to see if I could recreate the issue with either cable. I thought it was either a temp issue or a time issue. Neither cable had a lockup event occur when the temperature reached/exceeded 54.7C, so I don't think it's a temperature issue.

I can continue testing variables in case it's a hardware issue, but for my sanity's sake I'd like to know if I'm barking up the wrong tree.

T3P3Tony

@curieos I have highlighted this to @dc42

dc42

@curieos looking at your M122 reports and other observations, I think the cause is that the tool board is losing power. Possible reasons for this include:

- A bad crimp connection in the JST VH connector that provides power to the tool board
- One of the two power wires has fractured internally, most likely just above the crimp connection to the JST VH connector
- Something that you have connected to the 5V rail of the tool board (most likely to the OUT0 connector) is drawing excessive current and causing the 5V regulator to go into thermal shutdown

curieos

@dc42 The VIN and 5V LEDs never cut out or dip in intensity. Do you still think it could be a power issue? I can check the connections, though I did already try disconnecting all connectors besides CAN and VIN. Admittedly I wasn't super thorough about it, as I was limited on time.

T3P3Tony

@curieos said in 1LC Multiple Disconnects:

> The red status light started out blinking brightly, but then gradually faded in intensity

Did the 5V power LED stay lit to the same intensity throughout?

curieos

@T3P3Tony From what I recall, yes.

curieos

@T3P3Tony @dc42 Checking now with a volt meter (the board is currently locked up), VIN is reading fine, 27V. 3.3V from the IO 1 connector reads 3.3V, and 5V from the IO 2 connector reads just shy of 5V (around 4.95), which I believe is also fine.

The only way I can get the board to display the symptoms is to start a print. So far any print I start causes the issue. Idling does not cause the issue. I don't have an exact amount of time it takes to lock up, the console logs don't report the error until functions assigned to that toolboard are used, but I'd estimate it takes roughly 15 minutes after a print is started. The board does not return to normal until the whole machine is restarted.

dc42

@curieos said in 1LC Multiple Disconnects:

> Checking now with a volt meter (the board is currently locked up), VIN is reading fine, 27V. 3.3V from the IO 1 connector reads 3.3V, and 5V from the IO 2 connector reads just shy of 5V (around 4.95), which I believe is also fine.

Yes those are fine; but that doesn't prove that they are always fine. If you have a bad crimp or a fractured wire then the power is likely to disconnect occasionally depending on the movement of the power cable, which likely depends on the position of the print head. If the disconnection is short then the capacitors in the power circuit may supply sufficient power for long enough until the power connection is restored.

The fact that you saw the red LED gradually dim is a sure sign that the 5V supply was gradually lost. Once power is lost, when it is restored the configuration parameters will be lost, so you will need to reboot the machine or at least run config.g to restore normal operation.

Have you provided any strain relief on the power cable, after it has exited the JST VH connector?

Are you powering anything from OUT0 ?

curieos

@dc42 The red status light, not the red 5V light. The red 5V light stays brightly lit the whole time. The board is locked up, I can not communicate with it in any form. When I send M122 B21 it does not respond. This isn't a momentary issue.

Yes, there is plenty of strain relief. A zip tie secures the power and CAN cables after they exit the chain for both toolheads. There's also cable management to make sure the cover doesn't squish any wires.

OUT0 is hooked up to two 50 watt Slice heaters. During this last print that caused this issue, those heaters were never activated. That tool was unused and idle the whole time.

curieos

@dc42 Here's some pictures of the toolboard, right now. Both voltage lights are lit up. The CAN activity and status lights are off. It's difficult to illustrate this without taking a video.

~~I'm trying to get some thermal shots of it to see if there's any hotspots I can't detect, but our thermal camera is refusing to cooperate.~~ edit: I got a thermal camera image. There's a bit of an offset due to how close I have the camera, but you can get an idea of where the hot spots are based on the ghost of the wires and mounting holes:

I believe the hotspot is U3, which is the 12V buck converter.

dc42

@curieos thanks for the photos. If the hotspot is that area indicated by the red circle and cross, that's diode D5 which feeds the output of the 12V regulator to the VOUT pin of the OUT1 and OUT2 connectors. It would be odd if that component was generating a lot of heat. What I suspect is that the surrounding area is at more or less the same temperature and the heat is coming from both the 12V buck regulator and the 5V linear regulator.

Can you confirm that you are not using the IO_0 connector to power anything? It's not connected in your photo.

It's not normal for the Status lights to be off and the power lights to be on. That condition should only occur briefly after power up when the crystal oscillator is starting and the board is booting up. So I suspect that the processor has gone into some sort of locked up state that even the watchdog timer can't recover from, or the oscillator has stopped. The only reasons I can think of for it locking up in this way are static discharge, a loss of power that wasn't quite enough to cause a full shutdown, or a faulty tool board. We know that static discharge is common on hot ends, but we haven't known it to affect a TOOL1LC in this way. Nevertheless, please ensure that the hot end metalwork is grounded. You can ground it to one of the mounting screws of the tool board.

What was the reason for putting a heatsink on top of the MCU? How did you attach it? The MCU generates very little heat and certainly doesn't need a heatsink. The components that generate heat on the tool board are the 5V linear regulator, the TMC2209 driver (depending on motor current), and the 12V buck regulator (depending on VIN voltage and the current draw from OUT1 and OUT2).

I'd suggest that we replace your tool board, but you said you have already swapped it for a different one.

curieos

@dc42 Correct, IO_0 is not connected. You said OUT0 earlier, which was confusing.

The hotend metalwork and toolboard are connected via the toolplate. The hotend has continuity with the plate, and the toolboard has a connection to the toolplate via the green wire in the upper right corner.

I put the heatsink on the MCU because I noticed the previous toolboard's MCU was very hot to the touch before I swapped it out. I put it on that toolboard, and then transferred it to the new one when I swapped it out. Originally I thought it was just an overheating issue. The heatsink has preattached thermal adhesive. Not shown is a thin ceramic heatsink on the rear of the toolboard, mounted with some 3M thermal transfer adhesive. That covers the MCU and stepper driver.

curieos

@dc42 Any other suggestions? I just replaced the power cable for the toolboard with brand new wire and this issue is still occurring.

curieos

@dc42 I tried updating that machine to 3.5 rc3 and the issue persists. What physically can cause a 1LC to lock up until a full power cycle?

Phaedrux

Are the mainboard and toolboard powered from a common PSU? If not, are the grounds of the PSU tied together?

curieos

@Phaedrux Yes, all of the 24V system is connected to one supply.

curieos

@dc42 I'm reporting in that this issue is still occurring. I have changed out the CAN data and power cable runs to this toolhead yet again; in fact, the run is a little different this time. The tool distribution board is now mounted to the gantry, so the power and data cables are shorter than before.

The same toolboard locks up at around the same point in any print I've tried, typically around the 1 hour mark. It's been replaced once already. The board does not recover until a full power cycle of the machine is performed.

I have an M122 output from this latest event:

=== Diagnostics ===
RepRapFirmware for Duet 3 MB6XD version 3.5.0-rc.3 (2024-01-24 17:59:29) running on Duet 3 MB6XD v1.01 or later (SBC mode)
Board ID: 0JD2M-999AL-D25SW-6JKD0-3S06R-95ZB1
Used output buffers: 1 of 40 (40 max)
=== RTOS ===
Static ram: 153600
Dynamic ram: 92392 of which 0 recycled
Never used RAM 96400, free system stack 122 words
Tasks: SBC(2,ready,14.4%,407) HEAT(3,nWait 6,0.2%,322) Move(4,nWait 6,8.7%,213) CanReceiv(6,nWait 1,28.2%,770) CanSender(5,nWait 7,0.1%,325) CanClock(7,delaying,0.1%,346) MAIN(2,running,47.7%,444) IDLE(0,ready,0.6%,30), total 100.0%
Owned mutexes: HTTP(MAIN)
=== Platform ===
Last reset 03:30:24 ago, cause: power up
Last software reset at 2024-03-07 09:55, reason: User, Gcodes spinning, available RAM 95784, slot 0
Software reset code 0x6003 HFSR 0x00000000 CFSR 0x00000000 ICSR 0x00400000 BFAR 0x00000000 SP 0x00000000 Task SBC Freestk 0 n/a
Error status: 0x04
Aux0 errors 0,0,0
MCU temperature: min 30.9, current 48.1, max 49.9
Supply voltage: min 25.8, current 26.0, max 26.1, under voltage events: 0, over voltage events: 0, power good: yes
12V rail voltage: min 12.1, current 12.1, max 12.2, under voltage events: 0
Heap OK, handles allocated/used 99/38, heap memory allocated/used/recyclable 2048/1388/712, gc cycles 4
Events: 3 queued, 3 completed
Driver 0: ok
Driver 1: ok
Driver 2: ok
Driver 3: ok
Driver 4: ok
Driver 5: ok
Date/time: 2024-03-14 14:58:02
Slowest loop: 1000.79ms; fastest: 0.05ms
=== Storage ===
Free file entries: 20
SD card 0 not detected, interface speed: 37.5MBytes/sec
SD card longest read time 0.0ms, write time 0.0ms, max retries 0
=== Move ===
DMs created 125, segments created 31, maxWait 3494446ms, bed compensation in use: none, height map offset 0.000, max steps late 1, ebfmin 0.00, ebfmax 0.00
no step interrupt scheduled
Moves shaped first try 9729, on retry 4260, too short 16445, wrong shape 17913, maybepossible 1481
=== DDARing 0 ===
Scheduled moves 59311, completed 59311, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 4], CDDA state -1
=== DDARing 1 ===
Scheduled moves 0, completed 0, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 0], CDDA state -1
=== Heat ===
Bed heaters 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1, chamber heaters -1 -1 -1 -1, ordering errs 0
=== GCodes ===
Movement locks held by null, null
HTTP* is doing "M122" in state(s) 0
Telnet is idle in state(s) 0
File* is idle in state(s) 0
USB is idle in state(s) 0
Aux is idle in state(s) 0
Trigger* is idle in state(s) 0
Queue* is idle in state(s) 0
LCD is idle in state(s) 0
SBC* is idle in state(s) 0
Daemon is idle in state(s) 0
Aux2 is idle in state(s) 0
Autopause* is idle in state(s) 0
File2* is idle in state(s) 0
Queue2 is idle in state(s) 0
Q0 segments left 0, axes/extruders owned 0x0000000
Code queue 0 is empty
Q1 segments left 0, axes/extruders owned 0x0000000
Code queue 1 is empty
=== Filament sensors ===
check 0 clear 0
Extruder 0: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
Extruder 1: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
=== CAN ===
Messages queued 189586, received 21212642, lost 0, errs 586337, boc 48
Longest wait 208ms for reply type 6033, peak Tx sync delay 1409, free buffers 50 (min 47), ts 63121/63096/0
Tx timeouts 0,0,0,0,0,0
=== SBC interface ===
Transfer state: 5, failed transfers: 0, checksum errors: 0
RX/TX seq numbers: 57670/57670
SPI underruns 0, overruns 0
State: 5, disconnects: 0, timeouts: 0 total, 0 by SBC, IAP RAM available 0x2600c
Buffer RX/TX: 0/0-0, open files: 0
=== Duet Control Server ===
Duet Control Server version 3.5.0-rc.3 (2024-01-26 12:34:19)
File+ProcessInternally:
> Busy
Failed to deserialize the following properties:
- Board -> BoardState from "timedOut"
Code buffer space: 4096
Configured SPI speed: 8000000Hz, TfrRdy pin glitches: 1
Full transfers per second: 39.34, max time between full transfers: 73.4ms, max pin wait times: 69.8ms/18.6ms
Codes per second: 13.12
Maximum length of RX/TX data transfers: 5088/3144
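
The CAN line in the report above stands out because of its non-zero `errs` and `boc` values. A minimal sketch for extracting those counters from a pasted line follows. It assumes `errs` and `boc` are CAN error and bus-off counters, which is an interpretation and not something stated in the report itself. The helper name `parse_counters` is illustrative only.

```python
import re

# Line copied verbatim from the "=== CAN ===" section of the report above.
CAN_LINE = "Messages queued 189586, received 21212642, lost 0, errs 586337, boc 48"

def parse_counters(line: str) -> dict:
    """Return {name: value} for every 'name <integer>' pair in the line."""
    return {name: int(value) for name, value in re.findall(r"(\w+) (\d+)", line)}

counters = parse_counters(CAN_LINE)

# Ratio of reported errors to received messages (assumed meaning of 'errs').
error_ratio = counters["errs"] / counters["received"]

print(counters)
print(f"errs per received message: {error_ratio:.4f}")
```

Tracking these counters across several prints would show whether they only climb after the tool board stops responding or also while it is still working.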

I'm going to try disconnecting all connections to the board except for the endstop required for homing, and then I'll run another print. Let me know if there's any additional steps I should try.

edit: Here's the M122 for the board that locked up after a restart

Diagnostics for board 21:
Duet TOOL1LC rev 1.1 or later firmware version 3.5.0-rc.3 (2024-01-24 17:55:14)
Bootloader ID: SAMC21 bootloader version 2.8 (2023-07-25)
All averaging filters OK
Never used RAM 2936, free system stack 104 words
Tasks: Move(3,nWait 7,0.0%,135) HEAT(2,nWait 6,0.2%,127) CanAsync(5,nWait 4,0.0%,51) CanRecv(3,nWait 1,0.0%,71) CanClock(5,nWait 1,0.0%,59) ACCEL(3,nWait 6,0.0%,53) TMC(2,delaying,3.4%,53) MAIN(1,running,91.7%,315) IDLE(0,ready,0.0%,27) AIN(2,delaying,4.6%,112), total 100.0%
Owned mutexes:
Last reset 00:02:17 ago, cause: power up
Last software reset at 2024-02-01 11:32, reason: HardFault, available RAM 2016, slot 0
Software reset code 0x0060 ICSR 0x00000003 SP 0x20002f08 Task MAIN Freestk 813 ok
Stack: 200048c8 00000000 00000001 4aa36799 00000000 0000f3cb 4aa36798 21000000 20001488 0000f00d a5a5a5a5 0080bdc1 00000089 000001f4 20001ce0 0001285f 00000000 00000000 20004370 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5
Driver 0: pos 0, 404.8 steps/mm, standstill, SG min 0, read errors 0, write errors 0, ifcnt 12, reads 2957, writes 12, timeouts 0, DMA errors 0, CC errors 0, steps req 0 done 0
Moves scheduled 0, completed 0, in progress 0, hiccups 0, segs 0, step errors 0, maxLate 0 maxPrep 0, maxOverdue 0, maxInc 0, mcErrs 0, gcmErrs 0, ebfmin 0.00 max 0.00
Peak sync jitter 0/4, peak Rx sync delay 207, resyncs 0/0, no timer interrupt scheduled
VIN voltage: min 26.7, current 26.8, max 26.8
MCU temperature: min 34.8C, current 41.4C, max 41.4C
Last sensors broadcast 0x00018004 found 3 35 ticks ago, 0 ordering errs, loop time 0
CAN messages queued 2530, send timeouts 0, received 2038, lost 0, errs 0, boc 0, free buffers 18, min 18, error reg 0
dup 0, oos 0/0/0/0, bm 0, wbm 0, rxMotionDelay 0
Accelerometer: LIS3DH, status: 00
I2C bus errors 0, naks 3, contentions 0, other errors 0
=== Filament sensors ===
Interrupt 5726621 to 0us, poll 6 to 542us
Driver 0: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
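
One detail worth checking in this report is how old the stored software-reset record is compared with the date of the lockup. The sketch below only does the date arithmetic on the two timestamps copied from the reports above. Whether a firmware crash during the lockup would have written a newer record is an assumption, not something confirmed here. The function name `record_age_days` is illustrative only.

```python
from datetime import datetime

def record_age_days(now_text: str, record_text: str) -> int:
    """Whole days between the stored software-reset record and 'now'."""
    now = datetime.strptime(now_text, "%Y-%m-%d %H:%M:%S")
    record = datetime.strptime(record_text, "%Y-%m-%d %H:%M")
    return (now - record).days

# "Date/time" from the mainboard report and "Last software reset at"
# from the tool board report, both copied verbatim.
age = record_age_days("2024-03-14 14:58:02", "2024-02-01 11:32")
print(f"Stored software reset record is {age} days old")
```

If the record date never changes across lockups, the HardFault entry predates them and may be unrelated to the current behaviour.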

dc42

@curieos said in 1LC Multiple Disconnects:

> The only things that changed between the toolboard working fine and it having issues were a grounding wire run to the toolhead plate, and a new CAN cable with crimped terminations instead of the soldered pigtails.

What is the grounding wire connected to? If it's connected to mains ground, is Duet ground also connected to mains ground?

Is the extruder motor body also grounded?

curieos

@dc42 The grounding wire is connected to earth ground, and it screws into an aluminum toolhead plate. The grounding wire connected to the 1LC via one of the mounting holes is also connected to the toolhead plate. I can check whether there's a low impedance connection between DC ground and earth ground when I get into work today. I assume this would be an issue.

The extruder motor body is physically touching said toolhead plate, and the mounting screws go through the toolhead plate. The toolhead plate is anodized, so it's not the most solid connection, but I've checked with a multimeter and there is continuity. The extruder is an LGX Pro, so the hotend mount is made of anodized aluminum, and the hotend metalwork is connected to the toolhead plate through the extruder mounting. I've checked with a multimeter and it detects continuity. If this isn't good enough I can run a discrete grounding wire from the toolhead plate to the cold side of the hotend.