Senior Site Reliability Engineer
### Role summary
We are seeking a Senior Site Reliability Engineer to architect and scale our Brokerage-as-a-Service platform and to strengthen its self-healing capabilities. This role focuses on reducing toil through engineering: designing and developing internal SRE platforms and automating complex workflows within a Kubernetes-based ecosystem. You will implement modern CI/CD, GitOps, and observability standards, manage infrastructure as code, and lead incident response. The ideal candidate has a strong background in Linux, networking, production Kubernetes, AWS, and programming in Python or Golang, with experience in regulated financial environments.
### Who you are
- Linux & Networking Mastery: Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting
- FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity
- Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns
- Cloud Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and AWS CLI for automation
- Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD)
- Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible
- Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain
- AI & Prompt Engineering: Familiarity with using LLMs, public MCP servers, or Amazon Bedrock AgentCore to enhance SRE workflows
- Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck
### What the job involves
As a Senior Site Reliability Engineer, you won’t just be "keeping the lights on." You will be an engineering force responsible for the architecture, scalability, and self-healing capabilities of our Brokerage-as-a-Service platform. This role is centered on reducing toil through engineering. You will design and develop internal SRE platforms, automate complex workflows, and ensure our Kubernetes-based ecosystem can handle the demands of global financial markets. While this role includes critical on-call responsibilities to support our 24/7 global operations, your primary mission is to build and modernize systems that make manual intervention obsolete.
- Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity
- Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD
- Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insights into system health. Define and manage SLIs, SLOs, and Error Budgets
- Platform Governance: Review software architecture and Kubernetes metrics to support high availability, capacity planning, and cost optimization across AWS regions
- Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture
- Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices
### Benefits
- Health & wellness packages: We’ve built our benefits offering as a holistic package. We provide multiple health insurance carrier options and plans (medical, dental, vision) with access to HSA and/or FSA tax savings tools. We also provide income protection including life insurance, AD&D, short- and long-term disability, as well as extended coverage resources such as fertility benefits and mental wellness resources
- Vacation & time off: To build a successful organization, employees need time away to rest and recharge, so we provide paid time off including paid holidays and unlimited vacation time. We also strongly believe in work-life balance and, therefore, offer fully remote and hybrid work positions based on role requirements
- Professional development: We believe one of the greatest contributions we can make is investing in our employees’ professional development. Therefore, we offer financial support toward continuing education courses, academic coursework, professional conferences, earning professional certifications, and membership fees to professional organizations