Building effective product support teams
Modern product teams are responsible for the entire software delivery life cycle: discovery, building, shipping, and keeping the application running for end users. Building and running a product instils a sense of ownership and responsibility in team members. It also drives the quality of the solutions and gives the team context to contribute when discovering ideas for improving future versions of the product. Building high-quality software products and getting them in front of users is one thing. Supporting the product and knowing what issues users are facing is another. This article shares some thoughts on principles for an effective support system for product teams.
Shared understanding of system architecture
Your team must have a clear understanding of where your services lie in the overall system context. What categories of users depend on your service? What user journeys does your service help fulfil? What other services does your service depend on? What are the failure modes of those services, and what impact do they have on the overall system?
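The questions above about dependencies and failure impact can be made concrete with a small dependency map. The following is a minimal sketch, assuming a hand-maintained mapping of services to the services they call; all service names are hypothetical, and a real system would more likely derive this from tracing or service-map data.

```python
from collections import deque

# Hypothetical service names; each key depends on the services in its list.
DEPENDS_ON = {
    "web-frontend": ["checkout-api", "catalog-api"],
    "checkout-api": ["payments-gateway", "catalog-api"],
    "catalog-api": ["search-index"],
    "payments-gateway": [],
    "search-index": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so we can ask "who calls this service?".
    dependents: dict[str, set[str]] = {name: set() for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk from the failed service up through its callers.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("search-index", DEPENDS_ON)))
```

Even a rough map like this gives the team a starting answer to "if this dependency fails, which of our user journeys are at risk?".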
Context diagrams (from the C4 model) are one diagram format that can help communicate this. Diagrams showing how requests travel from your clients to the compute running your services, such as sequence diagrams, also help. When you use them, make sure components like load balancers, firewalls, caches, and proxies are visible. Most of the time these work as expected, but when they go wrong, it is crucial to have a way to determine whether they are the source of the issue.
Most observability or application performance management (APM) tools (e.g., New Relic, Dynatrace, Datadog) can automatically generate a service map or a trace of how your service has been called over time. Use this where possible, as it provides near real-time (and real-life) data and is unlikely to go stale as the service changes. Be wary of sampling bias when deriving an understanding of the overall system from non-production environments, as these might only exercise the easy paths through your systems that humans usually find when running test scenarios.
Clear understanding of responsibility
Once your team has a reasonable understanding of the services they own and their dependencies, they also need to know the team or party responsible for those dependencies. Different organizations use various terms, including owners, stewards, caretakers, or custodians, to describe those tasked with ensuring a service runs as expected in each environment. Have a clear way to reach out to those responsible for different services, and a shared place to look up who is on-call for any given service. Tools like PagerDuty or Opsgenie help to manage such rotas.
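As an illustration of a shared place to look up responsibility, here is a minimal sketch of an ownership registry kept in code. The service names, team names, channels, and rota names are all hypothetical, and the registry only stores the name of the rota; who is currently on-call would still come from your on-call tool.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Owner:
    team: str
    chat_channel: str
    on_call_rota: str  # name of the rota in your on-call tool (assumed)


# Hypothetical registry; in practice this might live in a service catalogue.
OWNERS = {
    "checkout-api": Owner("payments-team", "#payments-support", "payments-primary"),
    "catalog-api": Owner("catalog-team", "#catalog-support", "catalog-primary"),
}


def who_owns(service: str, owners: Mapping[str, Owner] = OWNERS) -> Optional[Owner]:
    """Return the registered owner of a service, or None if nobody is listed."""
    return owners.get(service)


def unowned(services: Iterable[str], owners: Mapping[str, Owner] = OWNERS) -> list[str]:
    """List the services that have no registered owner, sorted by name."""
    return sorted(s for s in services if s not in owners)


if __name__ == "__main__":
    print(who_owns("checkout-api"))
    print(unowned(["checkout-api", "catalog-api", "search-index"]))
```

Running a check like `unowned` against the list of dependencies is one way to find gaps in ownership before an incident does.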
Tools and access
In addition to knowing who is responsible, ensure your team has the right access to the tooling that provides visibility into the services you support. This primarily includes access to logs from different services and to infrastructure health metrics such as CPU, memory, and thread usage. Your team should also have access to the pipelines that deploy changes across environments. The ability to accurately and easily determine what version of the application is running in a given environment allows engineers to reproduce and address any issue that might occur. The right access to the right tools makes it easier to find the root cause of a problem. Is it an error on the server side? Is it a client-side browser version issue? Could it be a misconfigured firewall? Or is it a network issue? If your service runs in containers, ensure the team is able to access the container images and run them locally when reproducing issues.
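One way to make the running version easy to determine is to have the service report its own build metadata. The sketch below assumes the deployment pipeline sets three environment variables; the names `APP_VERSION`, `GIT_COMMIT`, and `DEPLOY_ENV` are illustrative assumptions, not a standard.

```python
import json
import os
from typing import Mapping


def build_info(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect build metadata from (assumed) environment variables."""
    return {
        "version": env.get("APP_VERSION", "unknown"),
        "commit": env.get("GIT_COMMIT", "unknown"),
        "environment": env.get("DEPLOY_ENV", "unknown"),
    }


def build_info_json(env: Mapping[str, str] = os.environ) -> str:
    """Serialize the build metadata, e.g. for a version endpoint or a startup log line."""
    return json.dumps(build_info(env), sort_keys=True)


if __name__ == "__main__":
    print(build_info_json())
```

Logging this once at startup, or exposing it on an internal endpoint, lets an engineer match an incident to the exact build that was running.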
Build dashboards that provide easily digestible information about the current state of your services, with an option to look at historical trends. Leverage built-in configurations for anomaly detection and alerting, and refine them frequently enough that they remain a useful signal and not noise. Fine-tune alerts to remove false positives, and ensure your dashboards show at a glance if something is not working as expected.
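To decide which alerts need tuning, it can help to measure how often each one fires without requiring action. The following is a minimal sketch, assuming the team records for each fired alert whether it turned out to be actionable; the alert names and the thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def noisy_alerts(
    history: Iterable[tuple[str, bool]],
    max_false_positive_rate: float = 0.5,
    min_fired: int = 3,
) -> dict[str, float]:
    """Return alerts whose false positive rate exceeds the threshold.

    `history` holds (alert_name, was_actionable) pairs, one per time an alert fired.
    Alerts that fired fewer than `min_fired` times are ignored as too sparse to judge.
    """
    fired: dict[str, int] = defaultdict(int)
    not_actionable: dict[str, int] = defaultdict(int)
    for name, was_actionable in history:
        fired[name] += 1
        if not was_actionable:
            not_actionable[name] += 1

    result = {}
    for name, count in fired.items():
        rate = not_actionable[name] / count
        if count >= min_fired and rate > max_false_positive_rate:
            result[name] = rate
    return result


if __name__ == "__main__":
    sample = [
        ("high-cpu", False), ("high-cpu", False), ("high-cpu", False), ("high-cpu", True),
        ("error-rate", True), ("error-rate", True), ("error-rate", False),
        ("disk-full", False),
    ]
    print(noisy_alerts(sample))
```

Reviewing the output of a report like this on a regular cadence gives the team a concrete list of alerts to adjust or retire.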
Know the limits of your tools
Ensure you understand the limits of log retention for your service. How far back in time can you see your service logs? Longer retention periods are generally better but typically more expensive. Work with your team to find a good balance. Your log retention period should ideally span multiple deployments of your service, so that you can compare multiple application versions when investigating an issue.
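Whether a retention period spans multiple deployments can be checked with simple date arithmetic. This is a minimal sketch, assuming you can list each deployment as a version and a deployment time; the versions and dates in the example are made up.

```python
from datetime import datetime, timedelta


def versions_visible_in_logs(
    deployments: list[tuple[str, datetime]],
    retention_days: int,
    now: datetime,
) -> list[str]:
    """Return the versions that were running at some point inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    ordered = sorted(deployments, key=lambda d: d[1])
    visible = []
    for i, (version, deployed_at) in enumerate(ordered):
        if deployed_at > now:
            continue  # not deployed yet
        # A version runs until the next deployment replaces it (or until now).
        replaced_at = ordered[i + 1][1] if i + 1 < len(ordered) else now
        if replaced_at > cutoff:
            visible.append(version)
    return visible


if __name__ == "__main__":
    now = datetime(2024, 6, 30)
    deployments = [
        ("1.0", datetime(2024, 5, 1)),
        ("1.1", datetime(2024, 6, 10)),
        ("1.2", datetime(2024, 6, 25)),
    ]
    for days in (3, 14, 30):
        print(days, versions_visible_in_logs(deployments, days, now))
```

If the result contains only one version for your current retention setting, there is nothing to compare against when a new release misbehaves, which is a signal to revisit the balance between cost and retention.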
Invest in growing knowledge in the team
Set aside time in the team to ensure everyone knows how to navigate the tools required to support the system. Encourage updates and contributions that fine-tune your alerts and dashboards. Ensure every engineer in the team has the opportunity to be part of the on-call rotation and to learn from incidents when they occur.
In a world where software products are increasingly complex and integral to operations, building an effective software engineering support team is critical. Such a team not only understands the intricacies of the system architecture and has a clear comprehension of their responsibilities, but is also equipped with the right tools and knowledge to troubleshoot and resolve issues efficiently.
A good product team cultivates a culture of continuous learning and improvement. Each incident is an opportunity to learn and fortify the system against future issues. Doing so ensures the smooth running of operations and earns the trust of users, which is an invaluable asset in today's digital era.
This post was inspired by questions from Siva Subramanian - an insightful, inquisitive and perceptive technologist.