
SRE Principal Engineer
### Role summary
We are seeking a Principal Site Reliability Engineer to provide technical leadership for our Site Reliability Engineering (SRE) team, which is responsible for the 99.999% availability of critical NG911 call routing and handling systems hosted across public, private, and multi-cloud environments (AWS and Azure). You will set technical direction and architectural standards and implement reliability best practices. You will also mentor the team, oversee high-availability architecture design, develop observability strategies, lead incident response, and facilitate advanced reliability practices such as Failure Mode and Effects Analysis (FMEA) and Chaos Engineering. The ideal candidate has 8+ years of experience, including 5+ years in a technical leadership role for SRE, DevOps, or cloud infrastructure teams, with expertise in distributed systems, automation tools, and cloud platforms.
### Who you are
- Proven track record as a technical leader for SRE, DevOps, or cloud infrastructure teams in complex environments
- Experience with mission-critical systems, ideally in emergency call management (NG911) or public safety solutions
- Hands-on experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Expertise in public and multi-cloud platforms (AWS and Azure)
- Familiarity with geographically dispersed, cross-functional team collaboration
- Strong knowledge of site reliability engineering principles, including monitoring, alerting, and incident management
- Proficiency in automation tools and frameworks (e.g., Terraform, Ansible, Jenkins, GitHub Actions)
- Experience with distributed systems, predictive monitoring, self-healing mechanisms, and high-availability architectures
- Practical knowledge of technologies such as Java (preferred), .NET Core/C#, Angular, PostgreSQL, MS SQL Server, RabbitMQ/Kafka, and Redis (preferred)
- Excellent communication skills for technical and non-technical audiences
- Strong problem-solving mindset and a focus on continuous improvement
- Familiarity with public safety communication standards, such as NENA i3 standards for Next-Generation 911
- Knowledge of hybrid cloud architecture and advanced deployment techniques (e.g., canary releases, blue/green deployments, feature flags)
- Bachelor's degree
- 8+ years of experience, including 5+ years as a technical leader for SRE, DevOps, or cloud infrastructure teams in complex environments, and practical experience with automation tools and frameworks
### What the job involves
- We are seeking a skilled and motivated Technical Lead with a passion for technology and leadership to guide our Site Reliability Engineering (SRE) team in managing NG911 call routing and handling systems
- These life-critical systems are hosted in public, private, and multi-cloud environments (AWS and Azure) and must achieve and maintain 99.999% availability
- Provide comprehensive technical leadership for the entire SRE team
- Drive the technical direction, architectural standards, and implementation of reliability best practices
- Mentor and guide the team in advanced technical problem-solving and continuous technical improvement
- Oversee the design and implementation of high-availability (HA) architectures
- Ensure systems meet the target of 99.999% availability
- Develop and enforce strategies for observability, monitoring, and automated detection of health issues
- Lead incident response efforts, including triage, troubleshooting, and communication with stakeholders
- Maintain robust incident playbooks and ensure readiness for on-call support
- Facilitate Failure Mode and Effects Analysis (FMEA) and Chaos Engineering activities
- Provide technical oversight and direction for all SRE team projects, ensuring consistent architectural and reliability standards across them
- Act as the key technical liaison, clearly articulating the SRE team's technical strategy and reliability status to engineering teams, product management, and executive leadership
- Collaborate with development teams to drive technical alignment, promote best practices, and communicate system performance and technical achievements
- Drive automation initiatives to enhance system resilience and reduce manual intervention
- Track and report key metrics such as service level objectives (SLOs), error budgets, mean time to detect (MTTD), and mean time to recovery (MTTR)
- Stay informed about emerging technologies and best practices to continuously improve reliability processes
- Key success metrics:
  - Achievement of 99.999% system availability targets
  - Reduction in incident frequency and impact over time
  - Improved team efficiency through increased automation coverage
  - Positive feedback from internal stakeholders and engineering partners
- Join us in leading a team dedicated to ensuring the reliability and resilience of systems that support critical public safety operations
### Benefits
- Flexible work models
- Paid time off
- Paid parental and family leave
- Health care benefits
- Global wellness resources
- Employee assistance programs
- Rotation programs
- Mentor relationships
- Learning and development opportunities
- Retirement benefits
- Employee bonuses
- Stock grants & employee stock purchase plans