Why I hate the “fix it in software” mentality, and why reducing risk is everyone’s job.
I want to tell you a story about my freshman year robotics experience.
I was working on autonomous code with another freshman. It wasn’t behaving as we expected. Not knowing any better, we played around with variables without considering the risk. Eventually, we got a barely functional system. We then tried a different path, and it stopped working again. We ended up going to competition without a working autonomous. By some miracle, we qualified for Championships at our first competition. Great! Now we could work on the autonomous! But the robot went back into the bag, and we worked on our secondary robot.
We did not touch the robot again before Championships, but naive me was determined to get a working autonomous. At Championships, we realized that we had assigned CAN IDs to the wrong motors. We also had other wiring issues. Our robot was even more screwed up than before. After about an hour on the verge of tears during the lunch break, the code started to work again. We went on the field and didn’t perform terribly. However, our post-mortem revealed the depth of our problems and fundamentally changed my view of project development forever.
Table of contents
- The Post-Mortem
- Every Decision has Risks and Consequences
- Take Steps to Mitigate Risks
- “Fix It In Software”
- The Impact of Risk Mitigation Failures
The Post-Mortem
Our robot looked beautiful, but it was a facade around a deeply flawed machine. Every subteam made poor decisions that piled up into a disaster of a machine.
We had too many degrees of freedom. Our left and right master motor controllers were controlling the opposite sides’ motors. There was insane backlash that we had not realized existed. Our sensors were unreliably wired, and some of our limit switches never triggered. These are just some of the problems we experienced.
Our code was a hundred times worse. In an effort to compensate for all of these failures, we ended up controlling the wrong motor controller with the wrong feedback. This was partially due to the wiring issues, and it was a miracle that anything worked at all. I won’t go into extreme detail, but our robot was the definition of janky.
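Failures like swapped CAN IDs are cheap to catch if the intended wiring is written down in one place and checked before the robot is ever enabled. Here is a minimal sketch of that idea; the motor names and IDs are hypothetical, not my team’s actual configuration:

```python
# Hypothetical wiring declaration: one source of truth for which
# CAN ID each motor controller is supposed to have.
MOTOR_CAN_IDS = {
    "left_master": 1,
    "left_follower": 2,
    "right_master": 3,
    "right_follower": 4,
}

def check_can_ids(mapping):
    """Fail loudly if two motors are assigned the same CAN ID."""
    seen = {}
    for name, can_id in mapping.items():
        if can_id in seen:
            raise ValueError(
                f"CAN ID {can_id} assigned to both {seen[can_id]} and {name}"
            )
        seen[can_id] = name
    return True

check_can_ids(MOTOR_CAN_IDS)
```

This doesn’t prove the physical wiring matches the declaration, but it guarantees the code’s own picture of the robot is at least internally consistent, which is more than we could say in 2018.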
Every Decision has Risks and Consequences
The problem with our robot was that many small decisions had negative consequences. Every stage of our robot’s development suffered from inattention and short-term decision-making. Each of these decisions came back to bite us.
Robot development is a waterfall. The impacts of a decision are often not felt by the people who made it. Rather, the people later in the process have to compensate. They often make other poor decisions unintentionally, and slap together a patch to make the robot “work.”
The problem with patches is that they don’t fix the problem; they only make the symptoms disappear. The reality is that the last subteams in the chain have to deal with the patches by adding more patches. As a result, reliability suffers.
Take Steps to Mitigate Risks
There will always be problems with any machine. Some are unexpected; others are easy to plan for. Every time you find a problem, you need to take steps to reduce problems in the future.
Every team and subteam needs to figure out what they can do to mitigate risks. There may be a fair amount of bureaucracy involved in this process. This bureaucracy is not necessarily a bad thing. The software subteam is by far the most bureaucratic on my team. All of that bureaucracy came from a position of mitigating risks. Below are some of the steps that we took to mitigate risks.
Document Every Failure
I have a spreadsheet of every medium-to-major failure that we have made in the last two years. Each one has a category assigned, a summary, steps to mitigate in the future, and an approximate time impact that this failure had.
It may be annoying to document every medium-to-major failure, but I promise you that the approximately 43 seconds it takes to fill out the form is nothing compared to the weeks that some of these issues caused. When you have documentation, it is easier to take steps to mitigate problems in the future.
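The spreadsheet columns described above (category, summary, mitigation steps, time impact) map naturally onto a small record type. This is a sketch of that shape, not my team’s actual tooling; the field names and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    category: str             # e.g. "wiring", "software", "process"
    summary: str              # what actually happened
    mitigation: str           # steps to avoid a repeat
    time_impact_hours: float  # rough cost of the failure

def total_time_lost(records):
    """Sum the time cost of a list of failures."""
    return sum(r.time_impact_hours for r in records)

log = [
    FailureRecord("wiring", "CAN IDs on wrong motors", "label every wire", 6.0),
    FailureRecord("software", "untested autonomous path", "require bench test", 2.0),
]
```

Even a structure this simple lets you ask useful questions at a post-mortem, like which category is costing the team the most hours.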
Have Trusted People Involved
This year, we let our rookies contribute to the robot project, but not in the main repository. They were responsible for a fair amount of the code, but it was only added to the production repository by people who were deeply familiar with the existing codebase. This meant that rookies could contribute, but much of the risk of incompatible code was funneled into one review process (a large reason for this was that many of our rookies weren’t familiar with our existing structure). We then had implementation days, when veterans would either assist rookies in adding their code to the production repository or add it themselves.
I think it is good for rookies to have experience in the production repo, but especially because of time constraints, that wasn’t feasible. However, by keeping their code in a separate repo, we mitigated the risks of a patchwork code layout. It also allowed us to quickly restructure our entire project (which we had to do a couple of times), because everything was in the same format.
It is certainly good for rookies to work on code, and we still did that. But there is an undeniable reality that rookies with very little training (one of our rookies joined in Week 2 of Build Season) are an added risk. What works for your team may not be what our team did, but I encourage you to put structures in place so that critical projects keep working.
Design, Plan, Implement, Test
I mentioned this in a previous post, but you need to design, plan, and test every aspect of your robot, especially your code. BSing your way through will add extra oversights and increase the risk of failure. In fact, here are some blog posts of mine that talk about reliability in code, many of which focus on these aspects:
Create Procedures and Checklists
Procedures and checklists are huge helps in mitigating risks, especially unintentional oversights. I bring a binder with me to every competition that has everything I could need in terms of resources, including several checklists (lunch checklist, in-pit and on-field pre-match checklists, debugging checklist, packlist, etc.), and we physically check off each box every time we run a checklist. Once we started doing this, we saw a drastic reduction in the number of on-field issues.
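The key property of a checklist is that nothing gets silently skipped. A minimal sketch of that discipline in code (the item names are made-up examples, not our real pre-match list):

```python
def run_checklist(name, items, is_done):
    """Walk a checklist and return every item still unchecked.

    `is_done` reports whether an item has been physically checked off.
    An empty return value means the checklist passed.
    """
    missed = [item for item in items if not is_done(item)]
    if missed:
        print(f"{name}: {len(missed)} item(s) not checked: {missed}")
    else:
        print(f"{name}: all {len(items)} items checked")
    return missed

# Hypothetical pre-match checklist items, for illustration only.
PRE_MATCH = [
    "battery secured",
    "bumpers attached",
    "code deployed",
    "radio connected",
]
```

The paper binder works the same way: the value isn’t the list itself, it’s the rule that a missed box blocks you from going on the field.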
Proactive Identification and Mitigation
All of these systems help prevent past failures, but there will always be new ones. Your team needs a culture of proactive identification and mitigation of risks to help curb catastrophes during the season. That means every decision includes risk evaluation, and when a risk slips through the cracks, your post-mortem not only traces what caused it, but asks why it wasn’t identified beforehand. Proactive mitigation can be combined with the other solutions to help ensure that you minimize the risks to your robot.
“Fix It In Software”
I have heard this countless times. It is often said as a joke, but it is a philosophy that many teams, mine included, embody. Fixing something in software is not fixing it; it is patching it. When you say “fix it in software,” you are automatically creating more risk. Mechanical solutions tend to be more reliable because they play by the laws of physics, while software has only limited control over the physical robot.
There are times to fix it in software. Our 2019 robot is a perfect example.
We had conflicting subsystems (check out the photo at the link above) and had to solve that in software. The other option was a complete robot redesign at the end of Week 4 of Build Season. While we did make it work, it added about three weeks to our development time and forced a complete restart of our robot project (we literally deleted the repo and started over from scratch). It is very easy to just say “fix it in software,” but unless you have a representative from the Software subteam there to give an honest analysis, I would highly caution against it. Our codebase is fairly rigid, so we don’t want to make huge changes, and if there is a simple mechanical or electrical solution, it may be worth it.
My team embodies a student-run, student-built philosophy. Our mentors have started promoting risk mitigation, especially some of our design mentors, but in my opinion they are not doing enough overall. I have worked on promoting it in the Software subteam, but I am worried that it is not sustainable. I have been working with some of the other people on Software to help them understand why it matters, but I think all of these changes will roll back once I am gone.
Mentor support is critical, especially over the long term. Some of these practices, especially the problem record, aren’t sustainable in a purely student-run world. For example, some infrastructure needs to be put in place, and as a student, it can be intimidating to set up infrastructure without mentor endorsement. It also takes cultural development over time, which means it needs support from adults to sustain.
So I would suggest that students get excited and support risk mitigation efforts, but nothing will be successful without the support of mentors. Even though we are going through a pandemic right now, I think it would be worthwhile for teams to start talking through a plan for mitigating risks in the future. I would take a look at some of the things I put in the “Take Steps to Mitigate Risks” section.
The Impact of Risk Mitigation Failures
I cannot say this enough: risks that aren’t mitigated early on affect other subteams. As someone on software, I deal with the impact firsthand. In my experience, it often takes far longer to effectively patch a problem downstream than to fix it at the source. Also, as you go down the waterfall, you have less direct control over the physical aspects of the robot.
Everyone Loses When Risks Aren’t Mitigated
Your team needs effective risk mitigation across the board in order to be successful. As I saw in 2018, when you leave holes for people downstream to patch, you end up with a colossal disaster of a robot that barely functions. At the end of the day, robots are designed to compete, and if you don’t mitigate risks and have honest discussions, your robot probably won’t win. So don’t fix it in software, fix it now.