Site Reliability Engineer
Role summary
Seeking a Site Reliability Engineer with 7-8 years of experience in SRE, infrastructure, or operations for large-scale systems. The role requires expertise in supporting IaaS platforms and infrastructure for GenAI applications. Candidates must possess strong programming/scripting skills in Python, Go, or Java, and experience with containerization (Docker) and orchestration (Kubernetes) tools. Proficiency in Infrastructure as Code (IaC) tools like Terraform, Helm, CloudFormation, and Ansible is essential. Knowledge of GPU/AI compute clusters and experience with monitoring/alerting tools (Prometheus, Grafana, ELK/EFK, Datadog) are also required. Strong networking and systems engineering knowledge, including TCP/IP, DNS, routing, load balancing, and distributed storage, is necessary.
7-8 years of experience in SRE, infrastructure, or operations for large-scale systems
Experience supporting IaaS platforms
Experience with infrastructure supporting GenAI applications
Strong programming/scripting skills (Python, Go, or Java)
Experience with containerization (Docker) and orchestration (Kubernetes, etc.) tools
Experience with IaC tools (Terraform, Helm, CloudFormation, Ansible, etc.)
Knowledge of GPU/AI compute clusters
Experience with monitoring/alerting tools (Prometheus, Grafana, ELK/EFK, Datadog, etc.)
Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)