The integration nightmare is familiar to many: deadlines are fast approaching, and a product is far behind schedule. You have a to-do list a mile long, and whenever you check one thing off, twenty more pop up. It feels like whack-a-mole: crises arrive from every direction, and your team does not have enough hammers. I have been in this exact situation. My team was three weeks into development, with only five weeks until a hard deadline, when the project specification changed entirely and demanded “rapid-fire” innovation. In other words, we were behind schedule and now had to reinvent the wheel.
We were full of fear: twenty other people were relying on our team of five to deliver a working, highly automated robot codebase with over 10,000 lines of code. None of us had any experience with automation. Despite all of the challenges, we finished ahead of schedule and with our best product ever. We found a secret ingredient that reinvented our organization and could change yours too.
However, before I share the secret, I want you to imagine the possibilities of product development. Your team is innovating at ten times the pace of other similarly-sized organizations. The tension in the office and on Microsoft Teams is gone, replaced by exuberance and creativity. Customers become loyal advocates, demanding and receiving excellence. They share your product with others, creating a marketing force that your competitors cannot match.
Unfortunately, that reality is not yours. However, it could be. Leaders often overlook the secret sauce: reliability. If your organization’s culture emphasizes reliability, your product development could shift from the first reality to the second. I know this because my team made the transition, and a new era began. Now, I will share our transformation with you to help your team achieve its full potential.
The 10x Rule
My advice comes from my experience in software development, specifically control systems. However, the principles apply to any product development team.
Our transformation began with a simple realization: for every minute we spent on software system conceptualization, we spent 10 minutes writing code and 100 minutes debugging. If we could reduce the 100 minutes of debugging, we could spend far more time on innovation in software design. Not only did we nearly eliminate the 100 minutes of debugging, but our new philosophy also halved implementation time.
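To make the ratio concrete, here is a rough back-of-the-envelope sketch. The 1 : 10 : 100 split comes from the paragraph above. The "after" figures (coding halved to 5 minutes, debugging treated as exactly 0) are an assumed idealization of "nearly eliminated," not measured data.

```python
def design_share(design: float, coding: float, debugging: float) -> float:
    """Fraction of total time that goes to system conceptualization."""
    return design / (design + coding + debugging)

# Minutes spent per minute of design, taken from the ratio described above.
before = design_share(design=1, coding=10, debugging=100)

# Assumed idealization: coding time halved, debugging rounded down to zero.
after = design_share(design=1, coding=5, debugging=0)

print(f"before: {before:.1%} of time on design")  # before: 0.9% of time on design
print(f"after:  {after:.1%} of time on design")   # after:  16.7% of time on design
```

Under these assumptions, the share of time available for design grows from under one percent to roughly one sixth.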
Thanks to the new framework, my team made more innovations in five weeks than in the previous ten years. The software went from the robot’s Achilles heel to its most powerful asset.
3 Principles of Reliability
Reliability consists of three principles:
- Start Early, Evolve Continuously
- Standardize Development, Mitigate Risks
- Do it Once, Do it Right
What do these mean? Let us look deeper into each to understand how they influence your product development cycle.
Start Early, Evolve Continuously
I cannot overemphasize the importance of getting started with reliability as soon as possible. The further along you are in the process, the more time a reliability effort takes and the more risk it incurs. Imagine taking a car from a junkyard and trying to make it the “most reliable vehicle in the world.” You could apply these principles to push that vehicle toward the epitome of reliability. Still, you would spend more time than it would take to make a new car reliable, and the value proposition diminishes.
On the other hand, take a look at an organization like NASA. I based many of my ideas on NASA’s history and culture. The spacecraft NASA builds today rest on 63 years of reliability programs. NASA’s ability to fly a helicopter on another planet is only possible thanks to its 63 years of development in a stability-oriented culture.
Standardize Development, Mitigate Risks
Humans are fallible creatures, and reliability issues often stem from human error. Standards, procedures, and best practices, if followed, significantly reduce the risk of mistakes. When a team does encounter a mistake, a standardized system makes it easier to fix across the board, because the same fix applies everywhere and is less likely to be forgotten in a particular place.
Do it Once, Do it Right
This phrase became my motto, and I even considered printing t-shirts with this phrase on them. I cannot count how many times I have said, “I’ll fix it later,” and three weeks later, I still would not have fixed the issue. Adopting the do it once, do it right approach radically cut down on our bugs.
Changes to existing components can create significant issues. In modern systems and codebases, all parts connect, and it is often unclear where a variable’s information ends up. For physical products, any change means that individual pieces could become incompatible with each other.
You want to avoid making significant changes to a structure. More importantly, you want to ensure that a class acts as you expect and that everyone else understands how it works. Implementation plans can help. They are documents that outline what a class needs to have and the expected behavior or value for each variable and method. A plan helps ensure that everything works consistently, that good documentation exists for all classes, and that the author implements all necessary features. Once a plan is written, programmers can implement it quickly, reducing mistakes and ensuring no functionality is lost.
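As an illustration, an implementation plan can live next to the code as a documented skeleton. The sketch below is hypothetical: the `DriveMotor` class, its field, and its limits are invented for this example and do not come from our codebase.

```python
class DriveMotor:
    """Implementation plan for a hypothetical drive motor wrapper.

    Variables:
        power: last commanded power, always within [-1.0, 1.0].

    Methods:
        set_power(value): store value, clamped to [-1.0, 1.0].
        stop(): set power to exactly 0.0.
    """

    MIN_POWER = -1.0
    MAX_POWER = 1.0

    def __init__(self) -> None:
        self.power = 0.0  # plan: a new motor starts stopped

    def set_power(self, value: float) -> None:
        # plan: out-of-range requests are clamped, never rejected
        self.power = max(self.MIN_POWER, min(self.MAX_POWER, value))

    def stop(self) -> None:
        self.power = 0.0
```

The docstring is the plan; the body is one possible implementation of it. Anyone reading the class can check the code against the stated expectations.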
Reliability Makes or Breaks Your Organization
Windows Vista stands out to me as a perfect example of why reliability matters. Vista has become infamous even among people who ordinarily do not know or care what operating system they use. Even a decade after its discontinuance, people still use it as a comparison for lousy software. Vista’s failure likely significantly contributed to the rise of Apple’s laptop and desktop market share.
Here is the thing: it is easy to argue that Vista was one of the best and most impactful pieces of software released in the 2000s. It introduced design paradigms that companies still use to this day. It introduced desktop widgets (which Vista called gadgets), a feature that Microsoft re-introduced in Windows 11. In many ways, Vista is almost identical to Windows 7, which many Microsoft fans consider the best version of Windows ever.
So, what separates one of Microsoft’s biggest flops from possibly its most notable success? The answer is reliability. Many Vista users could not experience Microsoft’s industry-leading design system due to rampant instability and blue screens. Even when Vista was usable, support for printers and input devices was hit-or-miss. In other words, instability overshadowed all of Vista’s unique features. Apple capitalized on Vista tropes with the “I’m a Mac” advertisement series. Google still relies on the same tropes for Chromebook ads today, even though those problems are not present in more modern versions of Windows.
Do you want your product to be seen as a disaster or as a leader? If you want to be a leader, a reliability-oriented culture is your solution. Do not let instability overshadow your features. Instead, make stability a core part of your brand. It could be your most significant advantage, as we will now explore.
To Americans, the Toyota Land Cruiser is an insignificant, now-defunct SUV. In other parts of the world, the Land Cruiser is the ultimate car. The fascinating thing about the Land Cruiser is that it is neither the prettiest nor the most technologically advanced car. Toyota sold the LC 70 from 1984 to 2020. Think about how far technology has come in that time: Apple announced the original Mac in 1984. Now we have quantum computers in data centers and powerful computers on our wrists.
So, why are people still buying a car from a bygone era? The answer is, once again, reliability. Just think about how remarkable it is that cars from 1984 still run on the road. That legendary reliability has earned the Land Cruiser unimaginably strong brand loyalty, especially in the Middle East. Ford, Land Rover, and many others have been unable to dethrone the Land Cruiser as the #1 seller in the UAE, Bahrain, and elsewhere. The Land Cruiser is the epitome of what reliability can do for a brand: carry so much weight that consumers will never trust competitors. Like NASA, Toyota has been developing its reliability program for years, and it has paid significant dividends.
As you can see, reliability issues hold teams back and prolong development. In contrast, strong reliability is an impenetrable moat for your brand. Thankfully, there is a simple process for introducing a reliability-focused culture to your organization.
Reliability is not a switch that you can turn on and off. Remember that NASA has spent over 60 years developing its reliability and continues to this day. So rather than deciding to have a reliable product tomorrow, it makes more sense to take the time to make a long-lasting impact on your organization. Even so, teams may see results fairly quickly. In my case, we went from constant catastrophes to no significant incidents in a couple of weeks. Think of it like climbing a mountain: you have your peak objective, but you cannot reach it instantly. Instead, you must keep climbing and finding the best way toward the top.
You can remember the steps with the mnemonic CLIMBING.
Collect Data about Failures
It is hard to take action when you have no data to evaluate. Whenever anything goes wrong, take note. Our rule was simple: if the compiler did not catch an error, it went in the spreadsheet. You only need to collect simple data:
- What symptoms showed up
- What steps resolved the issue
- How the issue can be mitigated in the future
- Any comments
- Categorization information
- Relevant logs, videos, or other files
At first, many of my team members were concerned about how long the process would take. As a result, I first tried it on my own issues and later rolled it out to the team. The results were astonishing: I spent, on average, 31 seconds filling out a bug report with enough information to take action. We collectively spent around an hour on data collection, saving weeks of future debugging. That is a good return on investment if I have ever seen one.
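The fields listed above map naturally onto one spreadsheet row per failure. Here is a minimal sketch of that idea, assuming a CSV file stands in for the spreadsheet. The `FailureReport` name, the column names, and the sample entry are all made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class FailureReport:
    symptoms: str
    resolution: str
    mitigation: str
    comments: str = ""
    category: str = ""
    attachments: str = ""  # paths or links to logs, videos, or other files

def append_report(sheet, report: FailureReport) -> None:
    """Write one report as a CSV row to an open text file or buffer."""
    writer = csv.DictWriter(sheet, fieldnames=[f.name for f in fields(FailureReport)])
    if sheet.tell() == 0:  # empty sheet: write the header row first
        writer.writeheader()
    writer.writerow(asdict(report))

# An in-memory buffer stands in for the shared spreadsheet.
sheet = io.StringIO()
append_report(sheet, FailureReport(
    symptoms="Arm overshoots target position",
    resolution="Lowered proportional gain",
    mitigation="Tune gains in simulation before deploying",
    category="control",
))
print(sheet.getvalue())
```

Only three fields are required, which keeps a report quick to fill out. The optional ones can be added when they are relevant.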
Look for Commonalities
Each of the three times I have run through this process, I have seen the same trend: many of our issues have a lot in common. In particular, most of our critical problems at any given point stem from either the same mistake made in multiple places or a single poorly built component.
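Once failures carry categorization information, spotting both patterns can be as simple as counting. The sketch below uses invented sample data: the categories and component names are hypothetical.

```python
from collections import Counter

# Hypothetical rows from the failure sheet: (category, component).
reports = [
    ("units mismatch", "arm"),
    ("units mismatch", "drivetrain"),
    ("null reference", "vision"),
    ("units mismatch", "intake"),
    ("timing", "vision"),
]

by_category = Counter(category for category, _ in reports)
by_component = Counter(component for _, component in reports)

# The most frequent entries are candidates for a shared fix.
print(by_category.most_common(1))   # [('units mismatch', 3)]
print(by_component.most_common(1))  # [('vision', 2)]
```

A category that recurs across components suggests the same mistake in multiple places. A component that recurs across categories suggests a poorly built part.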
Identify Mitigatable Root Causes
One technique for finding commonalities is the “five whys.” Ask yourself: why did error A happen? Then, take that answer and ask why it happened. After repeating the process several times, you arrive at a “root cause” that you can act on to mitigate. Failure analysis teams generally use five iterations to arrive at the answer.
Do any of your issues have similar whys throughout the chain? If so, those shared whys are where small, easy-to-implement changes can address a myriad of issues. I will go into more depth later, but finding these root causes and mitigating them is the secret to systems that are highly stable yet innovative.
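One way to look for similar whys is to record each chain and find the answers that appear under more than one issue. This is only a sketch: the issues and chains below are invented, and real chains rarely match word for word, so some manual grouping is assumed.

```python
from collections import defaultdict

# Hypothetical "why" chains, one per issue, ordered from symptom toward root cause.
why_chains = {
    "arm overshoot": ["gain too high", "gain copied from old robot", "no tuning procedure"],
    "drive drift": ["encoder scaled wrong", "constant copied from old robot", "no tuning procedure"],
    "camera timeout": ["cable loose", "no pre-run hardware check"],
}

def shared_whys(chains: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each 'why' that appears in more than one chain to the issues it explains."""
    issues_by_why = defaultdict(list)
    for issue, chain in chains.items():
        for why in set(chain):  # count each why once per issue
            issues_by_why[why].append(issue)
    return {why: issues for why, issues in issues_by_why.items() if len(issues) > 1}

print(shared_whys(why_chains))
# {'no tuning procedure': ['arm overshoot', 'drive drift']}
```

In this made-up data, a single missing procedure explains two otherwise unrelated symptoms, so one guideline would address both.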
Make Standards, Procedures, and Best Practices to Mitigate Your Root Causes
Once your team identifies the root causes that apply to a wide variety of issues, stop those root causes from happening! There are a million formats for guidelines to help your team build reliability into the future. I will cover some of my personal favorites later on. What matters is choosing something that works with your culture and that your team will follow. Ignored guidelines will not help your group and will only serve to give a false veneer of confidence.
Benefit from Embracing Continuous Improvement
If your product has no reliability issues after your first round of standards, something has gone wrong. I cannot say it enough: system stability is an ongoing process. I recommend a system like Agile sprints for handling reliability. Though my team did not adopt Agile paradigms for project management, we had “sprints” of reliability analysis. Every week, we would look over the new issues in our spreadsheet and either create new procedures or modify existing ones to mitigate our risks.
Every week is likely too fast for most teams; we were on a fast iteration cycle and a short development window, so that pace made sense for our system. I would recommend having your reliability sprints coincide with your project management sprints. If your team does not use sprints, pick a pace at which you make significant progress between reviews but can still feasibly go back and change things (only if necessary).
Incorporate All Teams, Employees, and Customers
Everybody has the responsibility to make sure that your product is reliable. Ensure that everyone within your organization has a voice to share concerns and ideas on mitigating risks. Hands-on employees often have a deeper understanding of the intricacies of a project than managers, so their unique insights will be highly constructive.
Make sure to create an environment where everybody feels comfortable sharing concerns. Admitting to stability failures, especially in one’s project, is incredibly difficult. When those difficulties compound with fear of repercussion, staying silent becomes easier than solving the issues. Be sure to thank team members when they speak up about concerns and consistently reinforce the idea that a technology failure is not a failure of themselves. Over time, people will become more willing to speak about their problems, increasing reliability in all components.
Communicating with customers is also instrumental to success. No matter how much internal testing happens, customers will find a way to use products in an unintended way. Encourage customers and technicians to report any issues they encounter to a tracking database. If you have a technician network, ask them to submit a root cause analysis to further aid development.
Normalize Testing and Automation
In-house testing is essential during the development period. While you should test on physical products, digital testing tools can also be part of your arsenal:
- Use simulation tools built into CAD tools for testing physical components.
- Write unit tests as part of your implementation specification for classes or methods. As developers work on the components, they can test against the unit tests, ensuring they pass.
- Use software like AutoHotkey or Selenium WebDriver to test user interfaces.
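For the unit-testing item above, here is a small sketch using Python's built-in `unittest` module. The `clamp` function and its expected behavior are hypothetical stand-ins for whatever your specification describes. The point is that the tests state the specification in executable form.

```python
import unittest

def clamp(value: float, low: float, high: float) -> float:
    """Return value limited to the range [low, high]."""
    return max(low, min(high, value))

class ClampSpec(unittest.TestCase):
    """Tests written alongside the specification, before or during implementation."""

    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(0.25, -1.0, 1.0), 0.25)

    def test_value_above_range_is_limited(self):
        self.assertEqual(clamp(7.0, -1.0, 1.0), 1.0)

    def test_value_below_range_is_limited(self):
        self.assertEqual(clamp(-7.0, -1.0, 1.0), -1.0)

def run_spec() -> bool:
    """Run the spec tests and report whether all of them passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampSpec)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()

print("spec passed:", run_spec())
```

A developer can rerun `run_spec()` after every change and stop as soon as it reports a failure.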
Automation can also superpower your testing infrastructure. I am not familiar with solutions for mechanical engineering, but I do know the software tools. Jenkins is a fantastic tool for automatically running scripts when someone pushes a commit to version control.
Why is automation so great? It forces employees to run tests, even when it is late and they want to go home. It also makes sure that employees do not forget to run a specific test or task. In general, automation increases standardization and reduces the risk of human error. Automate as many repetitive tasks as you can, and it will save you in the long run!
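As a rough sketch of what an automated job might run on every commit, the script below executes a list of commands in order and stops at the first failure. The `run_checks` helper and the stand-in commands are assumptions for illustration. A real job would list your project's own build and test commands, and the automation server would be configured separately to call the script.

```python
import subprocess
import sys

def run_checks(commands: list[list[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAILED: {' '.join(command)}", file=sys.stderr)
            return result.returncode
    return 0

# Stand-in commands so the sketch runs anywhere; replace them with the
# project's real build and test commands.
CHECKS = [
    [sys.executable, "-c", "print('build ok')"],
    [sys.executable, "-c", "print('tests ok')"],
]

exit_code = run_checks(CHECKS)
print("exit code:", exit_code)
```

Returning a non-zero exit code matters because that is how most automation tools decide a run has failed.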
Get Started Early
I have said it before, and I will repeat it: get started on reliability efforts as early as possible. As you go through more iterations, your product will become more reliable. Maximizing the number of iterations before your deadline will give you a much more reliable end product. You will, in the long run, appreciate the early start. However, be careful to deploy these efforts intentionally: they could fall apart if you roll them out too quickly. If you exercise good judgment, I am confident that it will be a success.
The Crisis Fractal
You may be asking, how many iterations does it take to have a reliable product? Unfortunately, reaching absolutely zero issues is nearly impossible. At the same time, each iteration significantly narrows the scope of your problems. Your problems become so small that they become unnoticeable at a certain point, and your customers overlook the remaining minor mistakes.
I think of it as a fractal: there is infinite (previously indistinguishable) detail as you continue to zoom in, but with little impact on the shape as a whole. Your goal is to continue to zoom in and fix your problems, continuously nearing a perfect product. By addressing the significant issues first, you can continue to whittle down your to-do list.
Now, to answer the original question: it is difficult to tell how long it will take, since there are countless variables in play. Complex and late-stage products with long iterative cycles will likely take longer. Similarly, large, bureaucratic, resistant, and decentralized teams will likely need more iterations.
Keeping that in mind, I was thrilled and surprised by how quickly our product improved. I first implemented my approach when we were behind schedule and on an astonishingly short timeline. I thought we would fail spectacularly. Yet, after only three or four cycles, the software was rock solid. You may not see results as quickly, but if you follow the process, it should not be long before reliability improves.
A Guide to Guidelines
There are many ways to mitigate risks, and what works well for your team will likely differ from what works well for other groups. I usually refer to mitigation efforts as guidelines since there are various distinct ways to solve problems. If you have different ideas of what may work well for your team, try them out! In this section, I will outline some of my successful techniques and some details I learned along the way.
Process-oriented thinking is not a specific technique but a mindset that combines with the other guideline methodologies. Standardized, easy-to-follow processes can reduce the risks associated with repetitive tasks. There are plenty of tasks that we can turn into standardized procedures:
- Creating classes
- Deploying software to production
- Auditing financials
I am a colossal checklist nerd. Checklists are such simple yet powerful tools. They are one of the most effective tools for improving reliability since they are simple to follow. If you have a repeated process, a good checklist could be the perfect solution. Some things to keep in mind about checklists:
- Keep them short: fewer items tend to be better. I try to aim for fewer than ten items.
- Keep them small: I try to keep checklists at A5 or half of Letter/A4. A small size forces shorter lists and makes checklists more convenient.
- Check off tasks: physically checking off an item makes it less likely to be skipped.
- Trust but verify: if it is a mission-critical task, have two people go through the checklist and verify that the other did not cut corners.
- Sign off: having people sign a checklist gives the list an extra sense of authority and, in my experience, makes people significantly more likely to follow through without cutting corners.
There are two main ways to write checklists: action-derived and objective-derived. Action-derived checklists tell a person to do something, whereas objective-derived checklists have a person make sure an objective is complete. Which one is better depends on the situation, although I have found that people are more likely to follow through when told to do something.
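If you ever digitize a checklist, the same habits (check off, sign off, trust but verify) can be enforced in code. The following is a minimal sketch, not a recommendation of any particular tool. The `Checklist` class and the pre-launch items are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Checklist:
    title: str
    items: list[str]
    checked: set[int] = field(default_factory=set)
    signatures: list[str] = field(default_factory=list)

    def check_off(self, index: int) -> None:
        if not 0 <= index < len(self.items):
            raise IndexError(f"no item {index}")
        self.checked.add(index)

    def is_complete(self) -> bool:
        return len(self.checked) == len(self.items)

    def sign_off(self, name: str) -> None:
        """Refuse signatures until every item has been checked off."""
        if not self.is_complete():
            raise ValueError("cannot sign an incomplete checklist")
        self.signatures.append(name)

    def is_verified(self) -> bool:
        """Mission-critical lists need two distinct signers."""
        return len(set(self.signatures)) >= 2

# Action-derived wording: each item tells the person what to do.
pre_launch = Checklist("Pre-launch", ["Charge battery", "Deploy latest build", "Run self-test"])
for i in range(len(pre_launch.items)):
    pre_launch.check_off(i)
pre_launch.sign_off("operator")
pre_launch.sign_off("reviewer")
print(pre_launch.is_verified())  # True
```

An objective-derived version of the same list would reword the items as states to confirm, for example "Battery is charged," without changing the structure.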
Do you have a complicated process? Flowcharts offer many of the benefits of checklists but scale better to longer, more complex processes. I love building flowcharts with tasks and then crossing out an action when I finish it. If you have a workflow automation system like Power Automate, digitizing these flowcharts can make life easier for your employees.
So why would you not just use flowcharts for everything? They carry more detail, which is an extra burden for shorter procedures. A5 checklists are also incredibly portable, so there is little chance they get left behind. Lastly, if your process has no diverging pathways, a list makes more sense. You may have a pre-launch checklist but a crisis response flowchart, since crisis response has many diverging paths depending on the context.
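To show why diverging paths call for a different structure than a list, here is a small sketch of a flowchart stored as a lookup table. The questions, node names, and actions are invented for illustration and are not our actual crisis procedure.

```python
# Hypothetical crisis-response flowchart: each node is either a question
# with yes/no branches (a tuple) or a final action (a string).
FLOWCHART = {
    "start": ("Is the robot still responding?", "check_logs", "power_cycle"),
    "check_logs": ("Do the logs show an error?", "file_report", "monitor"),
    "power_cycle": "Power cycle, then file a failure report",
    "file_report": "File a failure report with the logs attached",
    "monitor": "Keep monitoring and note the time",
}

def walk(flowchart: dict, answers: list[bool], node: str = "start") -> str:
    """Follow yes/no answers through the chart and return the final action."""
    remaining = iter(answers)
    while isinstance(flowchart[node], tuple):
        _question, if_yes, if_no = flowchart[node]
        node = if_yes if next(remaining) else if_no
    return flowchart[node]

print(walk(FLOWCHART, [True, False]))  # Keep monitoring and note the time
print(walk(FLOWCHART, [False]))        # Power cycle, then file a failure report
```

A checklist is the special case where every node has exactly one next step, which is why a plain list is enough when nothing diverges.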
Outside of procedures, it helps to have a well-understood and well-documented set of standards and conventions to follow. You can use an intranet tool like Confluence, Google Sites, or SharePoint. However, Git repositories and even shared drives can be effective. The most important thing is making conventions easy to find and understand. If it takes significant time to understand the expectations, people are less likely to follow them.
Make Your Own
While flowcharts and checklists are great, your organization may have creative solutions of its own. That is great, as the most critical component of reliability is a system that works for you and your team. My one piece of advice is to evaluate the effectiveness of your strategies. Just as you need to determine how reliability efforts, in general, improve the product, each of your guidelines should undergo similar scrutiny. If something does not work, look into whether it needs to be refined or replaced, and make changes accordingly.
A methodical approach to risk mitigation will make your product the most reliable it can be.