Senior Site Reliability Engineer
- Worldwide
Gretel's mission is to automate privacy engineering. We enable developers, researchers, and scientists to quickly create safe versions of data that can be used for pre-production environments and machine learning workloads, and shared across teams and organizations.
As a Site Reliability Engineer (SRE) at Gretel, you will ensure the safety, security, and reliability of our cloud infrastructure. This includes our compute infrastructure, container orchestration platform, deployment pipelines, and observability stack.
- Build and maintain Gretel's observability stack; measure and monitor Gretel's availability, latency, and overall system health
- Scale systems sustainably through automation, and continuously improve and evolve them
- Manage and lead incident response, recovery, and blameless postmortems
- Partner with software engineers to troubleshoot production issues
- Build tools and frameworks that help Gretel engineers be more productive
- Ship complex ML/AI models in partnership with Gretel's applied science and engineering teams
- Experience with at least one cloud platform (we use AWS heavily)
- Experience with Docker and Kubernetes
- Ability to write software and tools in Python or Go
- Experience with monitoring, alerting, and operations
- Experience operating highly available distributed systems in the cloud
- Experience identifying, diagnosing, and responding to operational outages
- Experience with infrastructure as code (Terraform, CloudFormation, etc.)
- Experience with build systems such as Bazel
- Experience shipping applications with complex dependencies (PyTorch, TensorFlow)
- Software engineering skills beyond script writing (TDD, design patterns, etc.)
- Experience with DevOps or CI/CD pipelines
