Data, Telemetry, and Metrics

  

Bug Data

The simplest form of quality data is the bug report. As issues are reported, they are logged to a bug database, and burn-down rates can then be tracked to project whether engineering is on track to release on schedule. There are best practices for what bug data to capture. Knowing what types of issues occurred, how they were discovered, and when in the SDLC they were found helps drive improvements to the engineering system.
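
As a rough illustration of a burn-down projection, the sketch below fits a straight line through the first and last daily open-bug counts and estimates when the backlog would reach zero. The function name, the linear model, and the sample numbers are assumptions made for this example; real projections usually account for incoming bug rates and severity as well.

```python
import math
from datetime import date, timedelta


def project_zero_bug_date(open_counts, start, release):
    """Estimate when the open-bug count reaches zero.

    open_counts: open-bug count per day, beginning on `start` (assumed input shape).
    Returns (projected_date, on_track). projected_date is None when the
    backlog is flat or growing, because no finish date can be projected.
    """
    if len(open_counts) < 2:
        raise ValueError("need at least two daily counts")
    days_elapsed = len(open_counts) - 1
    # Average net bugs closed per day over the observed window.
    burn_rate = (open_counts[0] - open_counts[-1]) / days_elapsed
    if burn_rate <= 0:
        return None, False
    days_to_zero = math.ceil(open_counts[-1] / burn_rate)
    projected = start + timedelta(days=days_elapsed + days_to_zero)
    return projected, projected <= release


# Illustrative data: the backlog drops by 10 bugs per day.
projected, on_track = project_zero_bug_date(
    [100, 90, 80, 70], start=date(2024, 1, 1), release=date(2024, 1, 15)
)
```

With these sample counts the backlog is projected to clear on 11 January, ahead of the assumed 15 January release date.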


Test Pass Data

Test automation, manual test passes, and bug bashes can all capture data on the coverage achieved (at the code or feature level), the configurations tested, and the outcomes: bugs found or successful results.
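
One possible shape for this data is sketched below: a single record type shared by all three sources, plus a pass rate per tested configuration. The record fields, the source labels, and the configuration names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PassRecord:
    """One test outcome; all field names are assumptions for this sketch."""
    test_name: str
    configuration: str  # e.g. an OS and architecture label
    source: str         # "automation", "manual", or "bug_bash"
    passed: bool


def pass_rate_by_configuration(records):
    """Return {configuration: fraction of records that passed}."""
    totals = defaultdict(lambda: [0, 0])  # configuration -> [passed, total]
    for record in records:
        totals[record.configuration][1] += 1
        if record.passed:
            totals[record.configuration][0] += 1
    return {cfg: passed / total for cfg, (passed, total) in totals.items()}


records = [
    PassRecord("login", "config-a", "automation", True),
    PassRecord("export", "config-a", "manual", False),
    PassRecord("login", "config-b", "bug_bash", True),
]
rates = pass_rate_by_configuration(records)
```

A configuration that never appears in the result is one that no test pass covered, which is itself useful coverage information.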


Pre-Release Telemetry and Flight Rings

Applications such as Microsoft Windows have made extensive use of telemetry, which enables them to deploy using flight rings. The first step is to add telemetry code to the application so that, as that code executes, it reports to a server-side service such as Application Insights. This data can log failure conditions, successful feature completions, and performance measures. The only limits on what can be measured are the developer's imagination and privacy laws and regulations. With good telemetry, flight rings can be used to gauge quality and unlock deployment gates.
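
The sketch below shows one way application code might be instrumented to report feature successes, failures, and durations. `TelemetryClient` is a hypothetical class written for this example, not the API of Application Insights or any other product, and the `send` callable stands in for the real upload to a server.

```python
import json
import time
from contextlib import contextmanager


class TelemetryClient:
    """Hypothetical client; `send` receives each event as a JSON string."""

    def __init__(self, send):
        self._send = send

    def track(self, name, **properties):
        event = {"name": name, "timestamp": time.time(), "properties": properties}
        self._send(json.dumps(event))

    @contextmanager
    def feature(self, name):
        """Report success or failure, with duration, for the wrapped code."""
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.track(name, outcome="failure", error=type(exc).__name__,
                       duration_ms=self._elapsed_ms(start))
            raise  # telemetry must not swallow the application's error
        else:
            self.track(name, outcome="success",
                       duration_ms=self._elapsed_ms(start))

    @staticmethod
    def _elapsed_ms(start):
        return round((time.perf_counter() - start) * 1000, 3)


sent = []  # in-memory stand-in for the server
client = TelemetryClient(send=sent.append)
with client.feature("export_report"):
    pass  # the feature's real work would run here
```

Only the exception type is recorded on failure, not its message, since messages can contain user data that privacy rules may not allow to be collected.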


Flight rings can work as follows. A small group (perhaps the local development and test organization) uses the new build. As they do, telemetry flows in, and analysis of the data establishes the build's quality: either the build is deemed worthy of deployment to the next ring of users (a larger population) or it is scrapped for low quality. The Windows Insider Program is a ring-based system of this kind, with fast-track and slow-track rings of adoption. The inner rings get changes fastest and live with more risk of quality issues. Outer rings get code changes more slowly, and those changes are deemed less risky because the code has cleared a quality bar before being released to them. The final ring is the general population; the number of rings and the complexity of the metrics can evolve over time. Occasionally, serious issues make it to the outer rings, mainly for two reasons. First, the issue may occur only in a narrow configuration or scenario that inner-ring users did not exercise. Second, the failure telemetry may have been missing or drowned out in a sea of other data.
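
A promotion gate between rings could be sketched as follows. The ring names, the crash-rate metric, and both thresholds are assumptions for illustration; an actual program would define its own rings and quality bars. The gate holds a build when too few sessions have reported, which guards against promoting on missing telemetry.

```python
# Hypothetical ring order, from the smallest population to the largest.
RINGS = ["team", "fast", "slow", "general"]


def ring_decision(current_ring, sessions, crashes,
                  min_sessions=1000, max_crash_rate=0.01):
    """Return ("hold" | "scrap" | "promote" | "done", next_ring_or_None)."""
    if sessions < min_sessions:
        # Not enough telemetry yet to judge quality either way.
        return "hold", None
    if crashes / sessions > max_crash_rate:
        return "scrap", None
    position = RINGS.index(current_ring)
    if position == len(RINGS) - 1:
        return "done", None  # already released to the final ring
    return "promote", RINGS[position + 1]


decision = ring_decision("team", sessions=2000, crashes=5)
```

Here the crash rate is 5 / 2000 = 0.25%, below the assumed 1% bar, so the build moves from the "team" ring to the "fast" ring.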


Post-Release Rapid Response

Even after release to the general population, telemetry can continue to report usage and issues, which allows for rapid, reactive servicing fixes. Certain critical software processes may be architected with a rollback or fallback code path that can be switched on by a back-end response to telemetry. This is similar to A/B testing but can go as far as ordering a rollback of a release for all users, or for the subset of users affected by a particularly bad issue.
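
A minimal sketch of such a switch is shown below. The back end turns per-segment failure counts into a rollback configuration, and the client consults that configuration to pick a code path. The function names, the configuration keys, the segment labels, and the 2% threshold are all hypothetical.

```python
def build_rollback_config(failures_by_segment, max_failure_rate=0.02):
    """Back end: list the user segments whose failure rate exceeds the bar.

    failures_by_segment maps segment -> (failures, total_runs).
    """
    bad_segments = sorted(
        segment
        for segment, (failures, total) in failures_by_segment.items()
        if total > 0 and failures / total > max_failure_rate
    )
    return {
        # Roll everyone back only when every reporting segment is failing.
        "all_users": bool(bad_segments)
        and len(bad_segments) == len(failures_by_segment),
        "segments": bad_segments,
    }


def choose_code_path(user_segment, config):
    """Client: return "fallback" if this user has been told to roll back."""
    if config.get("all_users") or user_segment in config.get("segments", []):
        return "fallback"
    return "new"


config = build_rollback_config({"segment-a": (50, 1000), "segment-b": (2, 1000)})
path_a = choose_code_path("segment-a", config)
path_b = choose_code_path("segment-b", config)
```

With these sample counts only "segment-a" exceeds the assumed failure bar, so its users are switched to the fallback path while "segment-b" stays on the new release.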


Summary

Data, telemetry, and metrics are strong indicators of quality and a mechanism for taking action. There must be a well-thought-out plan for what data is collected, how it is used, and how its collection stays within privacy laws.
