2 comments

  • nialse 6 minutes ago

    Versioning the cache is a poor band-aid and introduces failure modes not considered.

  • locknitpicker 2 hours ago

    I feel this is a poor article whose main premise is patently false, and which misrepresents the nature of a very crude mistake in cloud engineering: rolling out breaking changes.

    As the blogger failed to identify and understand the root cause of the problem, the proposed solution also makes no sense.

    The underlying issue has nothing to do with versioning the cache. It has everything to do with failing to understand what a breaking change is, that pushing a breaking change to a contract does create problems, and that the lack of any effective testing process will allow crude mistakes to slip into production.

    To be blunt, versioning the cache would not solve the problem. The blogger already stated that they failed to understand they were pushing a breaking change to the contract. If you do not think you are changing the contract, you are not going to go through the trouble of versioning your cache either. Therefore the failure mode is not addressed and the problem is still present.

    From the description, the problems were noticed when new instances failed to deserialize data saved by old instances. This points to an entirely different failure mode that the blogger failed even to understand: why is the system retaining cached data that caused it to throw errors? If those entries were purged, the failure would be mitigated and only transient, proportional to the rollout rate. Purging the whole cache would also completely fix the issue after the full rollout. Moreover, if the cache wasn't purged, rolling back the changes wouldn't get the system back to a consistent state.
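
    A minimal sketch of that purge-on-failure behaviour, assuming a JSON-encoded cache and a read-through lookup. The names `cache`, `decode_user`, `get_or_load` and the `display_name` field are hypothetical and not taken from the article; a plain dict stands in for the real cache service.

```python
import json
from typing import Callable, Dict

# Hypothetical in-memory stand-in for a shared cache service.
cache: Dict[str, bytes] = {}


def decode_user(raw: bytes) -> dict:
    """Hypothetical decoder for the new format, which requires 'display_name'."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "display_name" not in data:
        raise ValueError("cached entry does not match the current schema")
    return data


def get_or_load(key: str, load: Callable[[], dict]) -> dict:
    """Read-through lookup that treats an undecodable entry as a cache miss."""
    raw = cache.get(key)
    if raw is not None:
        try:
            return decode_user(raw)
        except ValueError:
            # json.JSONDecodeError is a ValueError subclass, so malformed
            # and schema-mismatched entries both land here. Purge the entry
            # so it cannot keep throwing on later reads.
            cache.pop(key, None)
    # Miss, or purged entry: reload from the source of truth and re-cache.
    value = load()
    cache[key] = json.dumps(value).encode("utf-8")
    return value
```

    With this shape, an entry written by an old instance costs one extra load instead of a persistent error, which is the "transient, proportional to the rollout rate" behaviour described above. It does not make the breaking change safe: old instances reading new entries would need the same handling.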