How Good of a Leader are you? Part 2
The agile method of building a lego sports car with your 3 and 5 year old, and coming back from a catastrophic failure along the way.
Welcome back to the agile method of building a lego sports car with your kids! And good news, we found success and we built it! Along with some challenges.
For part 1, read it here:
How good of a leader are you?
If you want to see how much (or little) patience you have I would recommend trying to build some legos with young kids and see how you do. In the ever failing quest to keep the kids away from tv, we’ve hit on a great, but challenging new …
Part two of building a lego sports car
The second half of our agile method proved both easy and challenging. As we got back into it, both kids were now used to the motion of gathering up the supplies and then building, so we had a great continuous motion going. Page by page, one kid gathered the materials and the other built, with my role moving to facilitating the gathering and coaching the kids to follow the excellent directions laid out in the lego guide:
Three year old finds the right blocks → 5 year old builds → 5 year old finds the right blocks → 3 year old builds; rinse and repeat.
Resources started to get busier when other household priorities got added to the mix. Our three year old saw mommy folding laundry and then true to her helpful nature, went over to help fold.
Oh no, we’re down a dev resource! Not to worry, with our agile process firmly in place, I pivoted to looking a page ahead in the directions and helped gather the pieces so that our 5 year old was queued up and had what he needed to continue on. I also made sure to coach him on looking at the instructions as he was placing the pieces, so that we didn’t introduce any “bugs” into the car and there was no motion waste or rework needed.
Once the three year old got back, we had a new challenge, boredom! A three year old can only hold attention for so long, so the project had to work with our junior engineer to help keep her motivated and on track. The three year old was coming and sitting in my lap, playing and putting together pieces that were needed in the main design and generally being silly.
In these cases, you need to work with the junior engineer on their level. So while helping the five year old stay on pace with his building, I had to embrace part of the silly with our three year old and then get her help with gathering pieces or building whenever she wanted to give it. Overall, the goal here was flow and the developer experience. If we wanted more lego cars in the future, we all needed to have a good time!
And finally, the last wheel was put on, success! We built the Koenigsegg!
It came out great and the kids were super happy. We come to the next part of our process, user testing. To make sure this product was good to go, we needed to take it for a test drive and do it with a previously made car, a Toyota Supra. In the initial testing, the Koenigsegg drove like a beaut side by side with the Supra and passed all the kids' stress testing until…a catastrophic failure.
Along the way, while we’ve been building these cars, which are fantastic toys, I’ve been helping the kids understand that while these are awesome cars, they are more fragile than their hot wheels and other toys. We can play with them, but we really need to be careful with how we do it, and smashing or crashing them will probably leave them unusable.
Catastrophic Failure
Cue the present day, when the 5 year old decided that the box the Koenigsegg came in would be a garage. Awesome imagination and reuse for the cardboard! But that garage became a flying one, and then the feature did not pass the biggest stress test of all, falling three feet to the floor. BOOM! The front bumper flew off in one direction, and the tail shattered into several pieces and went in another.
Now we’ve entered the space of catastrophic failure and recovery. We don’t have a script now for how the pieces go back together, for where all the pieces are, or for how we come together as a team so that we can recover and learn from this.
How do we recover?
First off, just like on my engineering teams, we have an on-call engineer who, once an issue is identified, moves into the role of issue lead. This engineer works to identify a root cause for the defect and then pilots the team to recovery. This role is also the focal point of communication in and out of the team. The goal here is to captain a fix through to recovery, not to do everything themselves.
This role also helps keep things calm in a high-pressure situation. It’s important to have someone in this seat during a fix, because they will have the bird’s-eye view of what’s going on and the knowledge of the issue to help direct other engineers towards a fix.
In this situation, I put my Issue Lead hat back on. Finding the root cause was easy here: the lego car dropped from three feet up. So the work moved to gathering up the team and starting to swarm for a fix.
The Swarm
Swarming during issue triage and fix resolution is a practice that I promote on my teams and that has taken hold across the industry, because it helps with a few things. First off, it gets all your engineers on the same page that there is an issue happening and alerts them that a critical defect is being worked on.
Secondly, as the fix is going into place, it allows them to check more of the code base and application for impact, which helps the team gain knowledge of the full scope of impacted systems.
Thirdly, it allows for better communication outside of the team to other teams and the business. Lastly, when everyone swarms on a critical defect, it allows everyone to gain knowledge of that defect to help identify and fix issues like this in the future and work towards hardening the application so this defect doesn't happen again.
In our case, we started to get our swarm on by calming everyone down and saying, “It’s ok, accidents happen and now we learned lego cars can’t survive a three foot drop. Let’s start gathering all the pieces up.” Given that we needed to swarm, the five year old went north of the crash site to gather up all the pieces there and the three year old went south. Once we were reasonably sure we had all the pieces we needed, we started to build the fix.
We found that the two impacted areas were the front bumper and the back tail fin. Working with the five year old, we went back to the design to make sure we had an understanding of where all the pieces should go even though the script for how to put them together was different.
The Fix
He started to put the bumper back together, but when it came time to reattach, we couldn’t. Luckily, just like in a resilient software architecture, legos easily come in and out of place so we were able to decouple a few more blocks and reattach the front bumper.
Next we examined the tail and started to reassemble it. Since it was simpler, I was able to work with the three year old on it and we had it reattached in no time…with one problem, a piece had been left out of the fix?!
We then went back and reviewed the design and found this was part of the back bumper, found the place it belonged and voila, we had our car fixed!
Testing
At this point in defect resolution, once a fix is in, the next step is testing the system to make sure that not only is the defect resolved, but just as importantly, that you did not introduce any regressions or new bugs into the platform.
This is where building automated testing is very helpful. End to end testing, with tools like Cypress or BrowserStack, will allow you to make sure there are no front end regressions. Building out a mix of integration and unit testing will also help speed up the testing process. The goal here is to test as quickly as possible to get a fix for a defect out to production.
For our testing, we had the following test plan, since it had to be a manual smoke test in our case. We made sure the car matched the spec. After we reviewed the fix against the spec, the five year old took it for a test drive and the three year old took the Supra again so they could race. Success, the Koenigsegg drove like it should! Lastly, we didn't drop it from any height (that usage is out of scope and unsupported for lego cars).
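If the car were software, the manual smoke test above could be turned into an automated regression check. Here is a minimal Python sketch of that idea. The part names in `SPEC` and the `missing_parts` and `smoke_test` helpers are made up for illustration; they are not taken from the lego guide or from any real testing tool.

```python
# Hypothetical spec: the parts the finished car must have attached.
SPEC = {"front bumper", "back bumper", "tail fin", "wheels"}


def missing_parts(car_parts):
    """Return the spec parts that are not on the car, sorted for stable output."""
    return sorted(SPEC - set(car_parts))


def smoke_test(car_parts, drives_straight):
    """Pass only if nothing from the spec is missing and the car still drives."""
    return not missing_parts(car_parts) and drives_straight


# First attempt at the fix: the tail went back on, but one piece was left out.
first_fix = ["front bumper", "tail fin", "wheels"]
print(missing_parts(first_fix))     # ['back bumper']
print(smoke_test(first_fix, True))  # False

# After reviewing the design and putting the leftover piece where it belongs.
final_fix = first_fix + ["back bumper"]
print(smoke_test(final_fix, True))  # True
```

The check against the spec is what catches the leftover piece. The test drive is what catches a regression, meaning something that used to work and was broken by the fix.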
Issue Resolution & Communication
The last parts of issue resolution are communicating out to stakeholders that the defect is resolved and the fix is in production, so that they can all have peace of mind that there are no issues and it’s back to business as normal. Communication and regular updates are critical to issue resolution for a few reasons.
It helps prevent panic from the rest of the business, because they are receiving regular updates and knowledge of what the issue is.
The more you communicate in these situations the more it builds trust that when these issues come up, they will be handled in an expedited manner.
It also builds trust in the application and the engineering team, because the business can see that these issues get surfaced when they happen. Looking over the history of them will hopefully show that they don’t happen frequently, so the business teams shouldn’t be nervous that there will be constant outages. For this reason, tracking DORA metrics, specifically Change Failure Rate (CFR) and Mean Time to Recovery (MTTR), is critical in building trust and confidence in your platform.
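For anyone who hasn't worked with the two metrics above, here is a rough Python sketch of the arithmetic behind them. It assumes CFR is failed deployments divided by total deployments, and MTTR is the average time from detecting an incident to resolving it. The function names and all the numbers are made up for illustration, and real tooling may define the inputs differently.

```python
from datetime import datetime, timedelta


def change_failure_rate(total_deploys, failed_deploys):
    """Share of deployments that caused a failure in production."""
    if total_deploys == 0:
        return 0.0
    return failed_deploys / total_deploys


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)


# Made-up history: 20 deploys, 2 of which caused an incident.
incidents = [
    (datetime(2024, 1, 6, 10, 0), datetime(2024, 1, 6, 10, 30)),  # 30 minutes
    (datetime(2024, 2, 3, 15, 0), datetime(2024, 2, 3, 16, 30)),  # 90 minutes
]
print(change_failure_rate(20, len(incidents)))  # 0.1
print(mean_time_to_recovery(incidents))         # 1:00:00
```

A low CFR tells the business that changes rarely break things, and a short MTTR tells them that when something does break, it gets fixed quickly.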
Post Mortem
The final step in the process is an after action report called a post mortem. A post mortem in software engineering is a retrospective analysis conducted after the occurrence of an issue or incident. It involves gathering relevant stakeholders, including developers, testers, and project managers, to review the problem, its impact, and the actions taken to resolve it. The post-mortem aims to identify the root causes of the issue, assess the effectiveness of the response, and extract valuable lessons for future improvement.
By analyzing the incident in a structured manner, documenting findings, and implementing corrective measures, the post-mortem facilitates continuous learning, enhances system reliability, and helps prevent similar issues from recurring in the future.
In our case, we realized that dropping a lego car from almost any height will cause a catastrophic failure and we will not have an amazing car to play with. We also extended this unsupported use case to cover trying to roll it down the stairs.
Thanks for learning with us today! We would love to see any cars you build in the future!