12 comments

  • AlbinoDrought 3 minutes ago

    Since this is about DO managed Postgres: if you're using it with replicas, they use async replication and RPO can be greater than 15 minutes. Since failover is triggered during upgrades, there ends up being a lot of periods where you can lose multiple minutes of committed data.

  • mmh0000 33 minutes ago

      > I chose managed services specifically to avoid ops emergencies
    
    You may not be spending enough time on HN reading all the horror stories =P

    The benefit of a managed service isn't that it doesn't go down; though it probably goes down less than something you self-manage, unless you're a full-time SRE with the experience to back it.

    The benefit of a managed service is you say: "It's not my problem, I opened a ticket, now I'm going to get lunch, hope it's back up soon."

  • ebiederm 22 minutes ago

    I don't know if this is realistic, but as a general rule, if I were contracting with someone so that my business would have higher reliability, I would ask for a service level agreement with an agreed-upon amount the vendor will pay for every unit of time their service is not up.

    At least then your pain is their pain, and they are incentivized to prevent problems and fix them quickly.

  • cadamsdotcom an hour ago

    100% uptime is impossible of course, a 100% reliable service would survive the next ice age.

    But reliability at the holy grails of 4 and 5 nines (99.99%, 99.999% uptime) means ever greater investment - geographically dispersing your service, distributed systems, dealing with clock drift, multi-master, eventual consistency, replication, sharding... it’s a long list.
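
    To put numbers on those nines, here is a rough sketch of the yearly downtime budget each level allows. It assumes a 365-day year and treats all downtime as equal; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} min/year")
```

    So 4 nines leaves roughly 52.6 minutes a year and 5 nines roughly 5.3 minutes, which is why a single multi-hour incident blows either budget for the year.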

    Questions to ask: could you do better yourself - with the resources you have? Is it worth the investment of a migration to get there? What's the payoff period for that extra sliver of uptime? Will it cost you in focus over the longer term? Is the extra uptime worth all those costs?

  • kevin_nisbet an hour ago

    > I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

    This happens with managed services and I understand the frustration, but vendors are just as fallible as the rest of us and are going to have wonky behaviour and outages, regardless of the stability they advertise. This is always part of build vs buy; buy doesn't always guarantee a friction-free result.

    It happens with the big cloud providers as well, I've spent hours with AWS chasing why some VMs are missing routing table entries inside the VPC, or on GCP we had to just ban a class of VMs because the packet processing was so bad we couldn't even get a file copy to complete between VMs.

  • solaris2007 an hour ago

    AWS designs and implements their foundational services holistically. I can understand that the services "higher up the stack" may not feel this way to AWS customers sometimes. However, the foundation of VPCs, EC2, EBS and S3 is very strong.

    If the word "production" is supposed to really mean something to you, move your workload to Google Cloud, or move it to AWS, or to https://cast.ai

    Disclaimer: I have no commercial affiliation with Cast AI.

  • cosmin800 2 hours ago

    Lower prices come with a cost. I am not a fan of AWS but they have higher reliability.

      delish an hour ago

      The font color implies this comment is downvoted, but I earnestly encourage readers to take very seriously the difference in SLOs and SLAs between high-cost vendors like AWS and GCP and low-cost vendors like DigitalOcean. Read their docs; do not assume DO is "the same, but lower cost."

        deathanatos 31 minutes ago

        … are the published SLAs worth more than use as toilet paper?

        I think it boils down to who offers the highest quality / $, and that's an impossible metric to really measure except via experience.

        But with a number of the "big" clouds, there's what the SLA says, and then the actual lived performance of the system. Half the time the SLA weasels out of the outage; e.g., "the API works" is not in SLA scope for a number of cloud services, only things like "the service is serving your data". E.g., your database is up? SLA. You can make API calls to modify it? Not so much. VMs are running? SLA. API calls to alloc/dealloc? No. Support responded to you? SLA. The response contains any meaningful content? Not so fast. Even if your outage is covered by SLA, getting that SLA to action often requires a mountain of work: I have to prove to the cloud vendor that they've strayed from their own SLA¹, and force them to issue a credit, and often then the benefit of the credit is outweighed by my time in salary. Oftentimes the exchanges in support town seem to reveal that the cloud provider has, apparently, no monitoring whatsoever to be able to see what actual perf I am experiencing. (E.g., I have had tickets with Azure where they seem blithely unaware their APIs are returning 500s …)

        So, published is one thing. On paper, IDK, maybe Azure & GCP probably look pretty on par. In practice, I would laugh at that idea.

        ¹AWS is particularly guilty of this; I could summarize their support as "request ID or GTFO".

  • calvinmorrison 14 minutes ago

    At my work we pay a boring, regional VPS host that is not fancy. In fact it's maybe a few levels above "your 2000s web host, with a LAMP stack, an FTP login and a bad admin panel". Just a bit above that.

    However, they ALWAYS pick up the phone on the 3rd ring with a capable, on-call Linux sysadmin with good general DB, services, networking, DNS, email knowledge.

  • sethops1 2 hours ago

    Obligatory: do you actually need Kubernetes? I struggle to imagine any tiny startup that does.

      osigurdson an hour ago

      Running Kubernetes in a managed environment like DO is no harder than using docker compose.