1LC Multiple Disconnects

T3P3Tony

@curieos said in 1LC Multiple Disconnects:

The red status light started out blinking brightly, but then gradually faded in intensity

Did the 5V power LED stay lit to the same intensity throughout?

curieos

@T3P3Tony From what I recall, yes.

curieos

@T3P3Tony @dc42 Checking now with a volt meter (the board is currently locked up), VIN is reading fine, 27V. 3.3V from the IO 1 connector reads 3.3V, and 5V from the IO 2 connector reads just shy of 5V (around 4.95), which I believe is also fine.

The only way I can get the board to display the symptoms is to start a print. So far any print I start causes the issue. Idling does not cause the issue. I don't have an exact time to lock-up, because the console logs don't report the error until functions assigned to that toolboard are used, but I'd estimate it takes roughly 15 minutes after a print is started. The board does not return to normal until the whole machine is restarted.
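
Since the console only reports the error once the toolboard is actually used, one way to pin down the lock-up time would be to poll the board on a timer and note when it first stops answering. Below is a minimal Python sketch of that idea. `send_gcode` is a hypothetical callable you would have to supply yourself (it is not a Duet API): it sends one command and returns True if the board replied. The 30 second interval is an arbitrary assumption.

```python
import time


def time_to_first_failure(send_gcode, interval_s=30.0, max_polls=1000,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll the board and return seconds elapsed until the first failed poll.

    send_gcode is a hypothetical, caller-supplied function: it sends one
    command and returns True if the board answered, False (or raises) if not.
    Returns None if every poll succeeded.
    """
    start = clock()
    for _ in range(max_polls):
        try:
            ok = send_gcode("M122 B21")  # board address 21, as in this thread
        except Exception:
            ok = False  # treat a transport error the same as no reply
        if not ok:
            return clock() - start
        sleep(interval_s)
    return None


# Stand-in for a real connection: answers twice, then goes silent.
replies = iter([True, True, False])
elapsed = time_to_first_failure(lambda cmd: next(replies), interval_s=0.0)
print("first unanswered poll after", elapsed, "seconds")
```

Polling adds a little CAN traffic of its own, so a long interval is probably the safer choice if you try something like this during a print.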

dc42

@curieos said in 1LC Multiple Disconnects:

Checking now with a volt meter (the board is currently locked up), VIN is reading fine, 27V. 3.3V from the IO 1 connector reads 3.3V, and 5V from the IO 2 connector reads just shy of 5V (around 4.95), which I believe is also fine.

Yes those are fine; but that doesn't prove that they are always fine. If you have a bad crimp or a fractured wire then the power is likely to disconnect occasionally depending on the movement of the power cable, which likely depends on the position of the print head. If the disconnection is short then the capacitors in the power circuit may supply sufficient power for long enough until the power connection is restored.

The fact that you saw the red LED gradually dim is a sure sign that the 5V supply was gradually lost. If the board loses power, its configuration parameters will be gone when power is restored, so you will need to reboot the machine or at least run config.g to restore normal operation.

Have you provided any strain relief on the power cable, after it has exited the JST VH connector?

Are you powering anything from OUT0 ?

curieos

@dc42 The red status light, not the red 5V light. The red 5V light stays brightly lit the whole time. The board is locked up, and I cannot communicate with it in any form. When I send M122 B21 it does not respond. This isn't a momentary issue.

Yes, there is plenty of strain relief. A zip tie secures the power and CAN cables after they exit the chain for both toolheads. There's also cable management to make sure the cover doesn't squish any wires.

OUT0 is hooked up to two 50 watt Slice heaters. During this last print that caused this issue, those heaters were never activated. That tool was unused and idle the whole time.

curieos

@dc42 Here's some pictures of the toolboard, right now. Both voltage lights are lit up. The CAN activity and status lights are off. It's difficult to illustrate this without taking a video.

~~I'm trying to get some thermal shots of it to see if there's any hotspots I can't detect, but our thermal camera is refusing to cooperate.~~ edit: I got a thermal camera image. There's a bit of an offset due to how close I have the camera, but you can get an idea of where the hot spots are based on the ghost of the wires and mounting holes:

I believe the hotspot is U3, which is the 12V buck converter.

dc42

@curieos thanks for the photos. If the hotspot is that area indicated by the red circle and cross, that's diode D5 which feeds the output of the 12V regulator to the VOUT pin of the OUT1 and OUT2 connectors. It would be odd if that component was generating a lot of heat. What I suspect is that the surrounding area is at more or less the same temperature and the heat is coming from both the 12V buck regulator and the 5V linear regulator.

Can you confirm that you are not using the IO_0 connector to power anything? It's not connected in your photo.

It's not normal for the Status lights to be off and the power lights to be on. That condition should only occur briefly after power up when the crystal oscillator is starting and the board is booting up. So I suspect that the processor has gone into some sort of locked up state that even the watchdog timer can't recover from, or the oscillator has stopped. The only reasons I can think of for it locking up in this way are static discharge, a loss of power that wasn't quite enough to cause a full shutdown, or a faulty tool board. We know that static discharge is common on hot ends, but we haven't known of it affecting a TOOL1LC in this way. Nevertheless, please ensure that the hot end metalwork is grounded. You can ground it to one of the mounting screws of the tool board.

What was the reason for putting a heatsink on top of the MCU? How did you attach it? The MCU generates very little heat and certainly doesn't need a heatsink. The components that generate heat on the tool board are the 5V linear regulator, the TMC2209 driver (depending on motor current), and the 12V buck regulator (depending on VIN voltage and the current draw from OUT1 and OUT2).

I'd suggest that we replace your tool board, but you said you have already swapped it for a different one.

curieos

@dc42 Correct, IO_0 is not connected. You said OUT0 earlier, which was confusing.

The hotend metalwork and toolboard are connected via the toolplate. The hotend has continuity with the plate, and the toolboard has a connection to the toolplate via the green wire in the upper right corner.

I put the heatsink on the MCU because I noticed the previous toolboard's MCU was very hot to the touch before I swapped it out. I put it on that toolboard, and then transferred it to the new one when I swapped it out. Originally I thought it was just an overheating issue. The heatsink has preattached thermal adhesive. Not shown is a thin ceramic heatsink on the rear of the toolboard, mounted with some 3M thermal transfer adhesive. That covers the MCU and stepper driver.

curieos

@dc42 Any other suggestions? I just replaced the power cable for the toolboard with brand new wire and this issue is still occurring.

curieos

@dc42 I tried updating that machine to 3.5 rc3 and the issue persists. What physically can cause a 1LC to lock up until a full power cycle?

Phaedrux

Are the mainboard and toolboard powered from a common PSU? If not, are the grounds of the PSU tied together?

curieos

@Phaedrux Yes, all of the 24V system is connected to one supply.

curieos

@dc42 I'm reporting in that this issue is still occurring. I have changed out the CAN data and power cable runs to this toolhead yet again; in fact, the run is a little different this time. The tool distribution board is now mounted to the gantry, so the power and data cables are shorter than before.

The same toolboard locks up at around the same point in any print I've tried, typically around the 1 hour mark. It's been replaced once already. The board does not recover until a full power cycle of the machine is performed.

I have an M122 output from this latest event:

=== Diagnostics ===
RepRapFirmware for Duet 3 MB6XD version 3.5.0-rc.3 (2024-01-24 17:59:29) running on Duet 3 MB6XD v1.01 or later (SBC mode)
Board ID: 0JD2M-999AL-D25SW-6JKD0-3S06R-95ZB1
Used output buffers: 1 of 40 (40 max)
=== RTOS ===
Static ram: 153600
Dynamic ram: 92392 of which 0 recycled
Never used RAM 96400, free system stack 122 words
Tasks: SBC(2,ready,14.4%,407) HEAT(3,nWait 6,0.2%,322) Move(4,nWait 6,8.7%,213) CanReceiv(6,nWait 1,28.2%,770) CanSender(5,nWait 7,0.1%,325) CanClock(7,delaying,0.1%,346) MAIN(2,running,47.7%,444) IDLE(0,ready,0.6%,30), total 100.0%
Owned mutexes: HTTP(MAIN)
=== Platform ===
Last reset 03:30:24 ago, cause: power up
Last software reset at 2024-03-07 09:55, reason: User, Gcodes spinning, available RAM 95784, slot 0
Software reset code 0x6003 HFSR 0x00000000 CFSR 0x00000000 ICSR 0x00400000 BFAR 0x00000000 SP 0x00000000 Task SBC Freestk 0 n/a
Error status: 0x04
Aux0 errors 0,0,0
MCU temperature: min 30.9, current 48.1, max 49.9
Supply voltage: min 25.8, current 26.0, max 26.1, under voltage events: 0, over voltage events: 0, power good: yes
12V rail voltage: min 12.1, current 12.1, max 12.2, under voltage events: 0
Heap OK, handles allocated/used 99/38, heap memory allocated/used/recyclable 2048/1388/712, gc cycles 4
Events: 3 queued, 3 completed
Driver 0: ok
Driver 1: ok
Driver 2: ok
Driver 3: ok
Driver 4: ok
Driver 5: ok
Date/time: 2024-03-14 14:58:02
Slowest loop: 1000.79ms; fastest: 0.05ms
=== Storage ===
Free file entries: 20
SD card 0 not detected, interface speed: 37.5MBytes/sec
SD card longest read time 0.0ms, write time 0.0ms, max retries 0
=== Move ===
DMs created 125, segments created 31, maxWait 3494446ms, bed compensation in use: none, height map offset 0.000, max steps late 1, ebfmin 0.00, ebfmax 0.00
no step interrupt scheduled
Moves shaped first try 9729, on retry 4260, too short 16445, wrong shape 17913, maybepossible 1481
=== DDARing 0 ===
Scheduled moves 59311, completed 59311, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 4], CDDA state -1
=== DDARing 1 ===
Scheduled moves 0, completed 0, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 0], CDDA state -1
=== Heat ===
Bed heaters 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1, chamber heaters -1 -1 -1 -1, ordering errs 0
=== GCodes ===
Movement locks held by null, null
HTTP* is doing "M122" in state(s) 0
Telnet is idle in state(s) 0
File* is idle in state(s) 0
USB is idle in state(s) 0
Aux is idle in state(s) 0
Trigger* is idle in state(s) 0
Queue* is idle in state(s) 0
LCD is idle in state(s) 0
SBC* is idle in state(s) 0
Daemon is idle in state(s) 0
Aux2 is idle in state(s) 0
Autopause* is idle in state(s) 0
File2* is idle in state(s) 0
Queue2 is idle in state(s) 0
Q0 segments left 0, axes/extruders owned 0x0000000
Code queue 0 is empty
Q1 segments left 0, axes/extruders owned 0x0000000
Code queue 1 is empty
=== Filament sensors ===
check 0 clear 0
Extruder 0: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
Extruder 1: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
=== CAN ===
Messages queued 189586, received 21212642, lost 0, errs 586337, boc 48
Longest wait 208ms for reply type 6033, peak Tx sync delay 1409, free buffers 50 (min 47), ts 63121/63096/0
Tx timeouts 0,0,0,0,0,0
=== SBC interface ===
Transfer state: 5, failed transfers: 0, checksum errors: 0
RX/TX seq numbers: 57670/57670
SPI underruns 0, overruns 0
State: 5, disconnects: 0, timeouts: 0 total, 0 by SBC, IAP RAM available 0x2600c
Buffer RX/TX: 0/0-0, open files: 0
=== Duet Control Server ===
Duet Control Server version 3.5.0-rc.3 (2024-01-26 12:34:19)
File+ProcessInternally:
> Busy
Failed to deserialize the following properties:
- Board -> BoardState from "timedOut"
Code buffer space: 4096
Configured SPI speed: 8000000Hz, TfrRdy pin glitches: 1
Full transfers per second: 39.34, max time between full transfers: 73.4ms, max pin wait times: 69.8ms/18.6ms
Codes per second: 13.12
Maximum length of RX/TX data transfers: 5088/3144

I'm going to try disconnecting all connections to the board except for the endstop required for homing, and then I'll run another print. Let me know if there's any additional steps I should try.

edit: Here's the M122 for the board that locked up, taken after a restart:

Diagnostics for board 21:
Duet TOOL1LC rev 1.1 or later firmware version 3.5.0-rc.3 (2024-01-24 17:55:14)
Bootloader ID: SAMC21 bootloader version 2.8 (2023-07-25)
All averaging filters OK
Never used RAM 2936, free system stack 104 words
Tasks: Move(3,nWait 7,0.0%,135) HEAT(2,nWait 6,0.2%,127) CanAsync(5,nWait 4,0.0%,51) CanRecv(3,nWait 1,0.0%,71) CanClock(5,nWait 1,0.0%,59) ACCEL(3,nWait 6,0.0%,53) TMC(2,delaying,3.4%,53) MAIN(1,running,91.7%,315) IDLE(0,ready,0.0%,27) AIN(2,delaying,4.6%,112), total 100.0%
Owned mutexes:
Last reset 00:02:17 ago, cause: power up
Last software reset at 2024-02-01 11:32, reason: HardFault, available RAM 2016, slot 0
Software reset code 0x0060 ICSR 0x00000003 SP 0x20002f08 Task MAIN Freestk 813 ok
Stack: 200048c8 00000000 00000001 4aa36799 00000000 0000f3cb 4aa36798 21000000 20001488 0000f00d a5a5a5a5 0080bdc1 00000089 000001f4 20001ce0 0001285f 00000000 00000000 20004370 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5 a5a5a5a5
Driver 0: pos 0, 404.8 steps/mm, standstill, SG min 0, read errors 0, write errors 0, ifcnt 12, reads 2957, writes 12, timeouts 0, DMA errors 0, CC errors 0, steps req 0 done 0
Moves scheduled 0, completed 0, in progress 0, hiccups 0, segs 0, step errors 0, maxLate 0 maxPrep 0, maxOverdue 0, maxInc 0, mcErrs 0, gcmErrs 0, ebfmin 0.00 max 0.00
Peak sync jitter 0/4, peak Rx sync delay 207, resyncs 0/0, no timer interrupt scheduled
VIN voltage: min 26.7, current 26.8, max 26.8
MCU temperature: min 34.8C, current 41.4C, max 41.4C
Last sensors broadcast 0x00018004 found 3 35 ticks ago, 0 ordering errs, loop time 0
CAN messages queued 2530, send timeouts 0, received 2038, lost 0, errs 0, boc 0, free buffers 18, min 18, error reg 0
dup 0, oos 0/0/0/0, bm 0, wbm 0, rxMotionDelay 0
Accelerometer: LIS3DH, status: 00
I2C bus errors 0, naks 3, contentions 0, other errors 0
=== Filament sensors ===
Interrupt 5726621 to 0us, poll 6 to 542us
Driver 0: no data received, errs: frame 0 parity 0 ovrun 0 pol 0 ovdue 0
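
For anyone comparing the two reports, the CAN counters are the part that differs most: the main board shows errs 586337 and boc 48, while board 21 (freshly restarted) shows zero for both. Here is a minimal Python sketch that pulls those counters out of pasted M122 text so that runs can be compared. It assumes the counter names stay as they appear in these two reports, and it reads "boc" as a bus-off count, which is my interpretation rather than something stated in this thread.

```python
import re

# Counter names as they appear on the CAN lines of the two reports above.
_CAN_FIELDS = ("received", "lost", "errs", "boc")


def can_counters(m122_text):
    """Return the CAN counters from the first M122 line that reports them."""
    for line in m122_text.splitlines():
        # The filament sensor lines also contain "errs", so require "boc" too.
        if "errs" in line and "boc" in line:
            found = {}
            for name in _CAN_FIELDS:
                match = re.search(r"\b" + name + r" (\d+)", line)
                if match:
                    found[name] = int(match.group(1))
            return found
    return {}


def error_ratio(counters):
    """Errors per received message, or 0.0 if nothing was received."""
    received = counters.get("received", 0)
    return counters.get("errs", 0) / received if received else 0.0


main_board = "Messages queued 189586, received 21212642, lost 0, errs 586337, boc 48"
tool_board = ("CAN messages queued 2530, send timeouts 0, received 2038, lost 0, "
              "errs 0, boc 0, free buffers 18, min 18, error reg 0")

for label, text in (("main board", main_board), ("board 21", tool_board)):
    c = can_counters(text)
    print(f"{label}: errs={c['errs']} boc={c['boc']} ratio={error_ratio(c):.2%}")
```

The ratio is only a rough indicator, since the main board's counters cover hours of uptime including the lock-up, while the tool board's counters were reset by the power cycle.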

dc42

@curieos said in 1LC Multiple Disconnects:

The only things that changed between the toolboard working fine and it having issues were a grounding wire run to the toolhead plate, and a new CAN cable with crimped terminations instead of the soldered pigtails.

What is the grounding wire connected to? If it's connected to mains ground, is Duet ground also connected to mains ground?

Is the extruder motor body also grounded?

curieos

@dc42 The grounding wire is connected to earth ground, and it screws into an aluminum toolhead plate. The grounding wire connected to the 1LC via one of the mounting holes is also connected to the toolhead plate. I can check whether there's a low impedance connection between DC ground and earth ground when I get into work today; I assume that would be an issue.

The extruder motor body is physically touching said toolhead plate, and the mounting screws go through the toolhead plate. The toolhead plate is anodized, so it's not the most solid connection, but I've checked with a multimeter and there is continuity. The extruder is an LGX Pro, so the hotend mount is made of anodized aluminum, and the hotend metalwork is connected to the toolhead plate through the extruder mounting. I've checked with a multimeter and it detects continuity. If this isn't good enough I can run a discrete grounding wire from the toolhead plate to the coldside of the hotend.

curieos

@dc42 I may have fixed it.

Turns out on this unit, there was a zero impedance connection between DC ground and earth ground. I found the culprit: the cutout for the Ethernet extension plugged into the Pi was made upside down. This meant the shielding on the plug was in contact with the aluminum frame panel. The panel is powder coated, but it appears the coating wore away in this area. I flipped it upside down and now the grounds aren't shorted.

I ran a 6 hour print last night and there were no issues. The longest I've seen it go without locking up was 3-4 hours, so I think this means the problem is solved. I'm running a much longer print (22.5 hours) now, so that should be definitive.

dc42

@curieos I'm glad you appear to have found the cause! Ground loops can cause various odd behaviour, just like ESD can.

I'll mark this thread as solved but feel free to revert it if the problem recurs.