Site Reliability Engineering (SRE)

As in traditional operations, SRE is about keeping important, revenue-critical systems up and running despite hardware failures, bandwidth outages, and configuration errors. Unlike traditional operations groups, we view software as the primary tool through which our systems are managed, maintained, and minded.



The Steamhaus take on SRE

Our SRE service combines 24×7 monitoring and reactive response with proactive work. This ensures that everything’s in place for your infrastructure to be managed as software, with continual improvements to reliability and loading times.

Our aim is for your site or application to be available 100% of the time and this is the result we regularly achieve for our customers. That’s why we include a 100% SLA on platform availability.
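To show what an availability target means in numbers, here is a minimal sketch of the arithmetic. It assumes availability is measured as the share of minutes in a billing period during which the platform was up; the page does not say how the SLA is actually measured, and the function names `availability_percent` and `meets_sla` are hypothetical.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Return availability over a period as a percentage (assumed definition)."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def meets_sla(period_minutes: float, downtime_minutes: float,
              target_percent: float = 100.0) -> bool:
    """Check measured availability against a target (100% by default)."""
    return availability_percent(period_minutes, downtime_minutes) >= target_percent


# A 30-day month has 30 * 24 * 60 = 43,200 minutes.
MONTH_MINUTES = 30 * 24 * 60
print(availability_percent(MONTH_MINUTES, 0))
print(meets_sla(MONTH_MINUTES, 1))
```

With a 100% target there is no downtime allowance at all, so under this assumed definition even a single minute of platform downtime in the month would miss the target.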

What’s included

We combine experienced people and smart tooling to give you the best possible service whilst remaining cost-efficient.

  1. 100% SLA on your platform with financial penalties if this isn’t achieved. There are obvious exclusions for underlying issues with your code that we’re not able to fix – we’ll make you aware of these when they come up.
  2. 24×7 monitoring of your infrastructure for reactive response to unexpected failures.
  3. 15-minute response to emergency support tickets from our team of UK-based AWS Certified site reliability engineers (SREs).
  4. Retainer time for proactive, incremental improvements to your platform and general advice (see below).
  5. Access to CloudCheckr which gives you a wealth of tools and reports on cost management, security and compliance, resource utilisation and self-healing automation on the AWS platform (with over 400 best practice checks performed against your account). This information can be used to proactively manage and improve all aspects of your account, either by your team, or by us if you have a monthly cloud management retainer with us.
  6. Access to ScaleFT, a cloud-native security platform for managing access to protected resources. No more credential sharing or issues gaining access to your AWS account or instances.
  7. AWS and DevOps workflow advice – access to AWS Certified Site Reliability Engineers (SREs) to assist you with all aspects of your AWS account and development lifecycle to ensure that the full principles of DevOps and infrastructure as a service are being applied to your platform.
  8. New Relic Application Performance Monitoring (APM) to give you instant end-to-end visibility across your customer experience, application performance, and dynamic infrastructure (charges apply for the premium version of this software).
  9. A technical account manager who understands both your technical needs and your business needs.

The onboarding process

If we’re doing a new build or migration to AWS for you, we’ll already know everything we need to know about your platform and will be able to commence our SRE service immediately.

Most customers take our SRE service after a consultancy project like this to get the full value from it, as our engineers will already know your platform and have a close working relationship with your team.

If you’re new to us, we’ll begin by performing a full audit of your platform. This will ensure that everything’s in place for us to be able to support the platform and apply the SLA to it.

Once that’s complete, we can commence the SRE service. During a kick-off meeting before the service starts, we’ll take detailed notes of your business priorities and what’s critical to you.

We’ll need to set up appropriate monitoring checks so that we catch issues before they affect your users, and so that we can carry out reactive repair if something unexpected happens. We’ll need your help with this, as you’ll know your application’s code better than we will.
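As a rough illustration of what a monitoring check might look like, the sketch below classifies a single probe of an application endpoint. A "warning" marks a slowdown worth investigating before it becomes an outage, and a "critical" marks a failure that needs a reactive response. The thresholds, the `CheckResult` type and the `evaluate_check` function are all hypothetical; real checks would be agreed with your team and built in the monitoring tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    status: str  # "ok", "warning" or "critical"
    reason: str


def evaluate_check(status_code: Optional[int], latency_ms: float,
                   warn_latency_ms: float = 500.0,
                   critical_latency_ms: float = 2000.0) -> CheckResult:
    """Classify one probe result; status_code is None if there was no response."""
    # No response, or a server error: something has already failed.
    if status_code is None or status_code >= 500:
        return CheckResult("critical", "endpoint down or returning server errors")
    # Responding, but far too slowly to be usable.
    if latency_ms >= critical_latency_ms:
        return CheckResult("critical", "response time above critical threshold")
    # Still serving, but slower than expected: an early warning sign.
    if latency_ms >= warn_latency_ms:
        return CheckResult("warning", "response time degrading")
    return CheckResult("ok", "healthy")


print(evaluate_check(200, 120.0).status)
print(evaluate_check(200, 800.0).status)
print(evaluate_check(503, 50.0).status)
```

The default thresholds here are placeholders. Sensible values depend on how your application normally behaves, which is why your team's input matters.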