posted 11mo ago by Olin Lathrop‭

Article case-study
#1: Initial revision by Olin Lathrop · 2024-02-16T23:47:14Z (11 months ago)
Phone fix adventure case study
I've had a Nexus 5X smart phone since 2016, so now in Feb 2024 it's almost 8 years old.  I like the phone and have had no problems with it until three days ago.  I had it in the car plugged in and connected as usual when I noticed it rebooting.  I thought that was a bit odd, but figured maybe that was due to a software update or something.

Several minutes later when I got where I was going, I noticed it was still rebooting.  Then I realized it was actually stuck in a reboot loop.  The screen would show "Google", wait a few seconds, vibrate like it was shutting off, then repeat the process.  Pressing the button didn't make any difference.

This post is about my adventure to eventually get the phone working again.  It might be useful for anyone else with this problem, but I think there are also some broader lessons here that go beyond a particular phone model, or even phones.

I'll relate the adventure as it unfolded for me, including the dead ends, dumb things I did, and how I eventually got it working.

<h2>Maybe it just needs a full power-down reset</h2>

My first thought was this was some weird software problem that would be fixed if I could only do a full power-down reset.  I'd never taken a phone apart, so was frankly scared and intimidated by the idea of trying.  Instead of opening it up and removing the battery, I thought I'd let the battery run down all the way.

I left it in the reboot loop overnight, and the next morning it was completely dead.  OK, great, now all I have to do is plug it into the charger and it will wake up.  It did - but straight back into the reboot loop.

<h2>It will go away on its own if I'm patient enough</h2>

Let's see what happens if I just let it continually reboot for as long as it wants to.  I plugged the phone into the charger and let it do its thing while I got on with work.  After about an hour, I checked on the phone and found it had booted far enough so that it was asking me to enter my pattern for swiping across the nine dots.  I did that, and it continued to boot up all the way.  It's working!

Not for long.  While I was looking at the phone, it suddenly went back to rebooting even though I hadn't touched it.

<h2>Thermal clue</h2>

I noticed that a particular spot near the top middle of the screen got unusually warm.  Hmm, that seems like a clue but I don't know what it means.  What if I cooled it?

I put the phone in the fridge and came back maybe 20 minutes later.  It was cold to the touch, and was asking for the swipe pattern to be entered again.  Apparently cold somehow helps.  It booted all the way up and worked for a few minutes.

Then it went back into the reboot loop.  This time I wasn't surprised.  But still, the cold temperature made a difference.

<h2>Somewhat random</h2>

I put the phone aside, but noticed that it would occasionally boot up further, but would always eventually go back to the reboot loop.  Sometimes it got farther than other times.  There was no obvious pattern.

<h2>It must be the battery!</h2>

The battery is 8 years old, so it wouldn't be out of line for it to start failing.  Maybe it's starting to leak charge.  That might even explain the unusual hot spot.

The initial booting doesn't take as much power as full running.  The battery is able to allow the device to boot.  Then when it comes up further and suddenly demands more current, the battery chokes, and the device punts back to a cold-start boot.  With the battery partially failed, different temperatures can make it behave differently.  This explains all the symptoms!

I got a new replacement battery, opened up the phone, and swapped in the new one.

Same problem.  It wasn't the battery.

<h2>The internet has the right answer (and many wrong answers)</h2>

At this point it was clear the phone was truly broken.  It's time for a new one.  I hate shopping, but my wife likes to, so she set about finding what phone I should get.  In the process, she discovered that a reboot loop was a common problem for my phone.  There was even a lawsuit about this problem, and LG (the actual manufacturer) extended the warranty 30 months because of this issue.

My phone was well past the warranty either way, but there was a lot of talk about the reboot problem on the internet.  Surprisingly, there were also a lot of home-grown fixes.

One must always be skeptical about a supposed "fix" to a problem on the internet.  Just because someone's phone came back to life after they danced on one foot while howling at the moon doesn't necessarily mean that's actually a fix.  The internet is full of one-off anecdotes with no control case and the assumption that correlation is causation.  Usually, it's best not to wade into this cesspool.

This time I waded in anyway.  Any one claim is questionable, but maybe a trend can be discerned.  My wife had sent me a few links, with one of them supposedly about how the solder balls of a BGA (ball grid array) chip got cracked due to many thermal cycles.  Hmm, that sounds interesting.

<h3>Firmware fix</h3>

Trying to investigate the BGA problem actually turned up several web pages describing the step-by-step process of installing a firmware fix.  So it's not a hardware problem after all?

All these software fix web pages gave basically the same instructions, about booting the phone into a very low level mode, connecting it to a computer via USB, and eventually flashing new firmware into the phone that fixed the boot loop.  After looking at several carefully, I found they all pointed to the same firmware created by one person.  There were various pages in different places by different people with comments by users that said the procedure worked for them.  There seemed to be enough redundancy, variation, and elapsed time that the firmware update really did work for a number of independent users.

But, there were also comments from users that said it didn't work for them.  While I think the successes were genuine, the method didn't work in all cases.  But what does this firmware do?  Why does it work at all?

Apparently the firmware patch disables some processor cores in the main processing chip.  It's these cores that don't work right, which causes the boot to abort and eventually retry.  But if these cores are really bad, how did the phone function perfectly with the existing firmware for nearly 8 years?  Something doesn't add up.  Those for whom the fix worked ended up with less processing power.  They explained that this also meant less electrical power and therefore longer battery run-time, so they dismissed the loss and called it an advantage.  Basically, I think these people were happy with the result since the alternative was a completely dead phone.  They were better off than before.  They felt lucky and weren't going to ask a lot of questions.

<h3>Hardware fix</h3>

I still wanted to find out more about this BGA cracked solder issue.  There were fewer web pages about that out there, and they varied a lot more than the one firmware fix.  The explanation about lots of thermal cycles putting stress on the solder balls and eventually cracking some seemed plausible enough.  I've personally run into cracked solder joints from repeated mechanical stresses before.

While the theory behind the failure was plausible enough, some of the supposed "fixes" ranged from questionable to downright ridiculous.

One guy said he took the circuit board out of the phone and blasted it with a hair dryer on max temperature.  Hair shrivels up and melts well before solder does, so this couldn't have reflowed the solder.  If I remember right, he eventually admitted it didn't work.

Others said they baked the circuit board in a home oven for 30 minutes at 450&deg;F.  That's not going to melt solder either, but could seriously damage some components.  Some of these did claim a complete fix.  My guess is that the ones that worked used toaster ovens where the circuit board was directly irradiated by the heating elements.  The controller may have regulated the <i>air</i> temperature, but the board got much hotter for short periods when the heating elements were on.

One video showed heating a single chip using a hot air soldering station.  OK, that can reflow the solder, but he only did one chip out of 3 or so that were clearly visible in the video.  He gave no explanation of why the problem was supposed to be under that chip but not the others that were also clearly BGA.  His first attempt didn't work.  Then he replaced the memory chip with that from a different phone.  That did work, but of course his phone now had the data in it the other phone had.

<h2>Proper solder reflow</h2>

Despite being buried under a pile of voodoo science, the concept of reflowing cracked solder connections still made sense.  I had learned what I could from the internet, and it was time to apply some real science and engineering to the problem.

I'm an EE and have a hot air soldering station in my lab.  That's the right tool for reflowing solder.  I disassembled the phone to get down to the bare circuit board.  That included removing a snap-on shield over what looked like the computing core with several BGA chips.  I put a 7 mm round nozzle on the reflow station, and set it to 350 &deg;C (660 &deg;F) and the lowest possible air flow.  I put a watch with a second hand where I could see it, and held the nozzle over each BGA chip for 30 seconds.

I had previously tested wire solder in front of the nozzle and found that it melted in 5 seconds.  The extra time allowed for the heat to get thru the chip to the solder balls below it, while still being within a time and temperature range such chips should be able to handle.  For good measure, I slowly moved the nozzle over other parts of the board so that most solder joints should have gotten re-melted.
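As a quick sanity check on the numbers above, here is a minimal Python sketch of the unit conversion and the timing margin. The function and constant names are made up for illustration. The values are the ones quoted in the text, and the 660 &deg;F figure is the rounded form of the exact conversion.

```python
def celsius_to_fahrenheit(temp_c: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


# Values quoted in the text (names are illustrative only).
NOZZLE_SETPOINT_C = 350.0   # hot air station setting
BARE_SOLDER_MELT_S = 5.0    # time for wire solder to melt in front of the nozzle
DWELL_PER_CHIP_S = 30.0     # time the nozzle was held over each BGA chip

setpoint_f = celsius_to_fahrenheit(NOZZLE_SETPOINT_C)
dwell_ratio = DWELL_PER_CHIP_S / BARE_SOLDER_MELT_S

print(f"{NOZZLE_SETPOINT_C:.0f} C = {setpoint_f:.0f} F")
print(f"Dwell time is {dwell_ratio:.0f}x the bare-solder melt time")
```

The ratio says nothing about the actual temperature reached under the chip. It only shows how much extra time was allowed for heat to conduct through the package compared with solder exposed directly to the air stream.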

After giving the board a minute or so to cool, I plugged in the battery and tried to boot the phone.  Nothing.  I guess I killed it completely.

Then I noticed that parts of the board were still hot.  Maybe the processor shut down to protect itself.

I put the unit in the fridge for about 10 minutes.  It was cold to the touch when I took it out, definitely colder than ambient office temperature.  I plugged in the battery again and started the phone.  Still nothing.

Maybe the battery was dead.  I did let it run for quite a while in the reboot loop.  I plugged the phone with battery into a USB charger, waited a few minutes, and tried again.  Success!  The phone came up all the way and worked as it should.  The battery level showed at 1%, so the battery really was empty when I tried it before.

After fully reassembling and buttoning up the phone, I plugged it into a charger and let it sit for a couple of hours until the battery level showed 100%.  I tried the phone a few times along the way, and it worked.  Just after I noticed the battery at 100% I happened to get a phone call.  Everything worked great, including the speaker phone.  Yay!

<h2>What really happened</h2>

I think the cracked solder joint theory is correct.  The true fix is to reflow the solder joints.

The firmware "fix" is merely a bandaid that addresses the symptoms some of the time.  It's plausible that the same specific solder balls often cracked on different phones.  The thermal stresses may get concentrated in some areas.  Some but not all solder failures resulted in the extra CPU cores not working.  The particular result was common enough that firmware that disabled these cores appeared to fix the problem in a statistically significant fraction of cases.

The solder reflow fixes described on the internet were unreliable because of the greatly varying methods people dreamed up to reflow the solder.  Some worked because they did actually reflow the solder, although usually not for the right reasons.  Some flat out didn't work.  Others appeared to work, at least for a while, just due to relative motions caused by the large temperature change.  Solder is quite soft and malleable.  Moving two rigid surfaces with solder squished between can reshape the solder somewhat.  Sometimes that's enough to make a broken contact connect again, at least for a while.

<h2>The lessons</h2>

Given a large enough sample (like a phone that sold in the millions), various people will find fixes to problems that come up.  Especially for highly technical problems, most of these "fixes" are one-off anecdotes that have little meaning by themselves.  They are usually not grounded in real science nor executed with solid engineering.

However, looking at all the various fixes and wild theories still has some value in the aggregate.  If there are enough independent reports of a particular fix working, there is probably something to it, even if none of the explanations make sense.
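The intuition behind "enough independent reports" can be put into rough numbers. The sketch below assumes each success report has the same chance of being a pure coincidence and that the reports are truly independent. Both are simplifying assumptions, and the 0.5 value is invented for illustration, not measured from anything.

```python
def chance_all_coincidence(p_single: float, n_reports: int) -> float:
    """Probability that every one of n independent success reports is a fluke.

    p_single is the assumed chance that any one report is a coincidence.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_reports < 0:
        raise ValueError("n_reports must not be negative")
    return p_single ** n_reports


# Hypothetical value: suppose each lone anecdote is a coin flip.
P_SINGLE = 0.5

for n in (1, 5, 10):
    print(f"{n:2d} reports: {chance_all_coincidence(P_SINGLE, n):.4f}")
```

Even with very unreliable individual anecdotes, the chance that all of them are flukes drops quickly as the count grows. The catch is the independence assumption: ten pages that all copy the same original claim count as one report, not ten.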

Once you've learned what you can from the aggregate mess, attack the problem properly.  Come up with theories based on real science that explain <i>all</i> the observed symptoms, then act on them accordingly.  Either that fixes the problem, or you observe new symptoms that allow refining the theory.  Repeat until a solution is confirmed.
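The requirement that a theory explain <i>all</i> the symptoms can be treated as a simple elimination check. The sketch below is one possible reading of this story, not a general diagnostic tool. The symptom labels and the table of which theory accounts for which symptom are assumptions made for illustration.

```python
# Symptoms observed in the story (labels are illustrative).
OBSERVED = {
    "reboot loop",
    "survives full power-down",
    "cooling helps briefly",
    "survives battery swap",
    "worked for years first",
    "disabling cores helps only sometimes",
}

# ASSUMPTION: one reading of which symptoms each theory can account for.
EXPLAINS = {
    "software glitch": {"reboot loop", "worked for years first"},
    "failing battery": {
        "reboot loop",
        "survives full power-down",
        "cooling helps briefly",
        "worked for years first",
    },
    "defective CPU cores": {
        "reboot loop",
        "survives full power-down",
        "survives battery swap",
    },
    "cracked solder joints": set(OBSERVED),
}


def surviving_theories(observed, explains):
    """Return the theories that account for every observed symptom."""
    return [name for name, covered in explains.items() if observed <= covered]


def unexplained(observed, explains):
    """Map each theory to the observed symptoms it fails to account for."""
    return {name: sorted(observed - covered) for name, covered in explains.items()}


print(surviving_theories(OBSERVED, EXPLAINS))
for theory, gaps in unexplained(OBSERVED, EXPLAINS).items():
    print(f"{theory}: {gaps}")
```

A theory that leaves even one symptom unexplained is either wrong or incomplete. The battery theory, for example, fit everything seen up to the point where the new battery made no difference.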

In this case, a number of people had what in the end seemed to be the correct theory, which is that cyclic stresses cracked solder joints over time.  Amazingly, none of them then attacked the problem accordingly.  They either used methods that couldn't have reflowed the solder, reflowed it more by accident than design, or didn't apply the fix to all the parts that could be affected.