Site Reliability Engineer with ref. RF-3

Job description

As a Site Reliability Engineer, you will help us achieve our goals by continuously improving our SaaS offering’s features and robustness. You will participate in designing, developing, deploying, monitoring, supporting, documenting, and troubleshooting our SaaS solution.
This is an exciting opportunity to collaborate closely with the Cloud Operations team, the wider organization, and external vendors and customers.
This is a hybrid role based in our Cambridge or London office, so you will ideally be comfortable coming into the office once or twice a week. If you’re interested in the role but require more flexibility, please speak to us!

Key Responsibilities:

Deploying, maintaining, monitoring, and upgrading production deployments of our SaaS solutions
Building software and systems to manage platform infrastructure and applications
Continually evaluating and improving our technology and processes to increase quality, decrease costs, and improve time-to-market
Periodically testing the service's resilience to both predictable and unpredictable failures
Providing 2nd-line operational support for our SaaS customers
Gathering data and generating reports on the service performance
Developing and documenting internal processes
Working with engineering/data science to drive and develop new capabilities
Providing out-of-hours support for critical service issues as part of our on-call engineer rota

Preferred Skills/Experience:
While not all are essential, ideally you will have experience with the following:

Administering cloud infrastructure or developing cloud applications (preferably in AWS)
Configuration management, including Infrastructure as Code
Linux, shell-scripting, and command-line tools
Programming in one or more high-level programming languages (e.g. Python)
Networking (e.g. DNS, routing, firewalls)
Source-control management (e.g. Git)
Continuous Integration / Continuous Deployment (CI/CD)
Monitoring, metrics, and alerting
Containerization (e.g. Docker)
Administering, developing applications for, or deploying applications to Kubernetes
Using or developing applications with service mesh (e.g. Istio)
Object-oriented programming and design
Operating production-grade services
Providing technical support
Building serverless or cloud-native applications
Writing technical documentation
Developing processes and procedures
Securing applications, services, and data (e.g. authentication, authorization, encryption, and TLS)
Any of the following tools: Terraform, SaltStack, MongoDB, Elasticsearch, Kafka, Prometheus, Grafana, HashiCorp Vault

We are looking for candidates who are passionate about technology, keen to continuously learn, and excited to contribute to a dynamic team environment. If you have the required skills and are looking for a challenging and rewarding role, we encourage you to apply.

Consultant

Ryan Fisher

Principal Recruitment Consultant
