
Site Reliability Engineer
Role summary
We are seeking a Site Reliability Engineer local to the Cleveland, OH area for a hybrid role, requiring 3 days onsite per week. The successful candidate will develop complex solutions to enhance application service monitoring in a large-scale environment, suggest improvements to existing tools, and provide technical assistance for optimal production performance. Responsibilities include ensuring high availability, reliability, and performance through robust monitoring, alerting, and notification systems, designing and implementing new observability tools, and automating repetitive tasks. The role involves identifying key operational metrics, implementing dashboards, and ensuring infrastructure components meet performance and capacity standards. A Bachelor's degree and a minimum of 5 years of related experience are required.
**Must be local to the Cleveland, OH area**
**Hybrid role, onsite 3 days a week**
Responsibilities:
•Develops highly complex solutions, using the available tech stack, to improve the ability to monitor application services effectively in a large-scale, complex environment. Suggests improvements to existing tools and monitoring thresholds.
•Provides highly complex technical assistance and operational guidelines to business operations and application development so that applications run optimally in production, test, and development environments.
•Ensures that supported application services are highly available, reliable, and performant through monitoring, alerting, and notification. Designs, implements, and maintains new observability tools as necessary to ensure this coverage. Implements and maintains dashboards, bots, and other automation based on current operational needs and release changes, and evaluates improvements to them.
•Identifies repetitive, manual, and scalable tasks and automates them using scripting/programming languages or tools.
•Identifies key operational metrics and the data necessary to create them. Implements and maintains dashboards based on current operational needs. Tests and ensures that all infrastructure components meet proper performance and capacity standards.
Knowledge and Skill Areas:
•Advanced baseline knowledge of AWS Cloud Platform technologies, infrastructure, and practices in production environments, including CloudWatch, CloudTrail, EKS, Lambda, Canaries, DynamoDB, RDS, PostgreSQL, S3, API Gateway, Elastic Load Balancer, OpenSearch, Grafana, AWS X-Ray, SQS, and Fault Injection Service (AWS FIS).
•GitLab, CDK (preferred), Terraform, Grafana, OpenSearch, Docker, and CI/CD pipelines.
•Coding languages such as Python, TypeScript, Node.js, .NET, and Java; Infrastructure as Code, Configuration as Code, and Alerts and Monitoring as Code.
•Familiarity with deployment patterns and version control, the ITIL framework, resiliency concepts and disaster recovery, and chaos engineering.
Education and Experience:
Bachelor’s degree and a minimum of 5 years of related work experience
AWS Certifications
Equal Opportunity Employer. All qualified applicants will receive consideration for employment and will not be discriminated against based on race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status, disability, age, pregnancy, genetic information, or any other consideration prohibited by law or contract.
Must be legally authorized to work in the US without sponsorship for employment visa status now or in the future.
Please no third-party recruiting agencies.