Interesting how they have kept their ops team the same but now run an entire datacentre.
Overworked teams? I just can’t see how this is possible.
I’m not defending cloud hosting costs, but you generally pay more for cloud so that you don’t have to deal with hardware maintenance and datacentre management. I didn’t see this addressed directly in their post, other than that they kept the Ops team the same size.
I’m running both physical hardware and cloud stuff for different customers. The problem with maintaining physical hardware is getting a team of people with the relevant skills together, not the actual work - the effort is small enough that you can’t justify hiring a dedicated network guy, for example, and the same applies to other specialities, so you need people capable of debugging and maintaining a wide variety of things.
Finding those people was always difficult - and (partially thanks to the cloud stuff) it has become even more difficult by now.
The actual overhead - even when you’re racking the stuff yourself - is minimal. “Put the server in the rack and cable it up” is not hard - my last rack was filled by a high school student in part of an afternoon, after I explained once how to cable and label everything. I didn’t need to correct anything - which is a better result than many highly paid people I’ve worked with…
So paying for remote hands in the DC, or - if you’re big enough - just ordering complete racks with pre-racked and pre-cabled servers, gets rid of the “put the hardware in” step.
Next step is firmware patching and bootstrapping - that happens automatically via network boot. After that it’s provisioning the containers/VMs to run on there - which at this stage isn’t different from how you’d provision it in the cloud.
You do have some minor overhead for hardware monitoring - but you hopefully have some monitoring solution anyway, so adding hardware to it, and maybe having the DC guys walk past and tell you about any red LEDs, isn’t much of an overhead. If hardware fails you can just fail over to a different system - the cost difference to cloud is so big that just having those spare systems is worth it.
I’m not at all surprised by those numbers - about two years ago somebody was considering moving our stuff into the cloud, and asked us to do some math. We’d have ended up paying roughly our yearly hardware budget (including the hours spent working with hardware, which we wouldn’t have in the cloud) to host just one of our largest servers in the cloud - and we’d have to pay that again every year, while with our own hardware and properly planned maintenance we can let old servers we paid for years ago slowly age out naturally.
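To make the “pay it every year again” point concrete, here is a rough sketch of that kind of math in Python. Every figure in it is a made-up placeholder, not a number from our budget or from any provider’s price list, and the function names are purely illustrative. It only compares a one-off purchase plus yearly upkeep against a recurring cloud bill.

```python
def cumulative_cost(upfront: float, yearly: float, years: int) -> float:
    """Total spend after `years` years: one-off purchase plus recurring cost."""
    return upfront + yearly * years


def break_even_year(own_upfront: float, own_yearly: float,
                    cloud_yearly: float, horizon: int = 15):
    """Return the first year in which owning is cheaper in total, or None."""
    for year in range(1, horizon + 1):
        own = cumulative_cost(own_upfront, own_yearly, year)
        cloud = cumulative_cost(0.0, cloud_yearly, year)
        if own < cloud:
            return year
    return None


# Placeholder figures (assumptions for illustration only).
OWN_UPFRONT = 30_000.0   # buying the server
OWN_YEARLY = 5_000.0     # colocation share, spare parts, hours spent on hardware
CLOUD_YEARLY = 20_000.0  # renting a comparable instance

print(break_even_year(OWN_UPFRONT, OWN_YEARLY, CLOUD_YEARLY))  # prints 3
```

With these placeholders the owned server is ahead from year three onwards, and the gap keeps widening for as long as the hardware stays in service. If the yearly upkeep were higher than the cloud bill, the function would return None.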
Thank you for the very detailed response!
They’re using a third party called Deft to manage the hardware, which is a reasonable middle ground between cloud and self-operated, the more I think about it.
I haven’t seen much info on what that management costs, though it’s likely to be far less than AWS/GCP.
It’s not just the hardware. “The cloud is expensive” is usually touted by people who don’t understand why managed services (like Aurora RDS and OpenSearch, as suggested in the article) “cost more than running it themselves”, because they don’t account for the management costs.
A database service needs management not only in hardware (e.g. replacing dead drives) but also in software (e.g. monitoring cluster performance, tweaking system settings to fit the usage pattern, managing cluster health, etc.). This management takes time from the ops team, often in multiple roles like SysAdmin, DBA, and Ops engineer. The fact that they claim to have moved to their own hardware without bringing new talent onto their ops team makes it questionable whether they actually understand the cost, and whether they’re overworking their existing ops team.
Or it could be that they haven’t run into problems yet. If you overbuild your hardware or your software is efficient enough, you don’t need as much tweaking.
It’s questionable, but I don’t think implausible.
“Yet” is the keyword there for sure. It’s not a matter of if, but a matter of when. Even if they’re flush with cash and grossly overprovision their systems, sooner or later a huge vulnerability will roll around and someone will need to set up or update the OS, make sure the cluster keeps quorum, fail over traffic during update windows, etc.
The stacks are getting so insurmountably huge, it’s not possible to just drop a new cluster at their described scale without significantly increasing the workload for an existing team.
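As a small illustration of the quorum part of that workload, here is a sketch in Python. It assumes a simple majority quorum (more than half of the members must stay up), which is an assumption on my part and not something the article says about their clusters. The function names are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def max_offline(members: int) -> int:
    """How many members can be down at once while a majority stays up."""
    return members - quorum_size(members)


def rolling_batches(nodes: list, batch_size: int = 1) -> list:
    """Split nodes into update batches that never break majority quorum."""
    limit = max_offline(len(nodes))
    if limit < 1:
        raise ValueError("cluster too small to update without losing quorum")
    # Never take down more than the quorum allows, and at least one node.
    size = max(1, min(batch_size, limit))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


nodes = ["db1", "db2", "db3", "db4", "db5"]
print(quorum_size(len(nodes)))               # prints 3
print(rolling_batches(nodes, batch_size=5))  # capped at 2 nodes per batch
```

Even this toy version shows why someone has to own the process: each batch still needs draining, patching, and a health check before the next one starts, and none of that is handled for you once you leave the managed service.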
Yup. By moving out, they already let go of a lot of security services that came with their cloud subscription like CASB, automated patching, DB maintenance, security/network monitoring, etc. You have to replace all of that with people and on-prem tools/systems.
“An entire data center” is 8 rented racks in two enterprise data centers (4 racks in each). They’re paying $60K/month for racks, cooling, and location.
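For scale, the arithmetic on those figures can be written as a quick Python sketch. It uses only the two numbers quoted above, and the variable names are just for illustration.

```python
MONTHLY_COLOCATION_USD = 60_000  # racks, cooling, and location, as quoted above
RACKS = 8                        # 4 racks in each of 2 data centers

per_rack_month = MONTHLY_COLOCATION_USD / RACKS
per_year = MONTHLY_COLOCATION_USD * 12

print(f"${per_rack_month:,.0f} per rack per month")  # prints $7,500 per rack per month
print(f"${per_year:,} per year")                     # prints $720,000 per year
```

Note that this covers only the rented space. It does not include the servers themselves or anyone’s time.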
Warning: this site claims you’ve been blocked and asks for your email to verify you. Do not provide it. I reloaded and it worked. Just be safe out there.
This didn’t happen for me
Nor for me
That isn’t happening for me, nor has it ever when I’ve visited DHH’s blog. It’s possible your browser is compromised.
I have strong privacy settings enabled. I believe it might be because they can’t fingerprint me or similar, so are checking for bot activity
That seems extremely unlikely, and almost unheard of. If I wget the page in a container, I get the same as in the browser, so that would suggest this isn’t the case.