Site Reliability Engineer

Austin, Texas, United States | Onsite | Full Time | $98,583–$138,016 /yr | Posted 2 months ago | Visa sponsorship available

Role summary

The Site Reliability Engineer II is responsible for supporting, enhancing, and maintaining Restaurant365’s cloud infrastructure and applications. This role requires growing expertise in site reliability practices, including incident response, system monitoring, automation, and performance troubleshooting. The engineer will collaborate with cross-functional teams to resolve issues, improve the reliability and scalability of the SaaS platform, and participate in on-call rotations. Key responsibilities include automating manual processes, enhancing monitoring tools, implementing cloud automation using tools like Terraform and Ansible, and researching/remediating vulnerabilities. The role also involves maintaining documentation and contributing to technical diagrams and runbooks.

### Who you are
- BS in Computer Science, Information Systems, or related field (or equivalent experience)
- 2–4 years of experience in site reliability engineering, DevOps, or cloud operations
- Experience with cloud platforms (Azure or AWS), including services such as AKS, ECS/EKS, Functions/Lambda, S3, and Blob storage
- Proficiency with infrastructure-as-code and automation (Terraform, Ansible, YAML, Python, Bash, PowerShell)
- Strong Linux engineering skills; working knowledge of Windows administration
- Experience supporting production environments and participating in on-call rotations
- Familiarity with web servers and middleware (Nginx, Apache Tomcat)
- Experience with CI/CD tools (GitLab, Git, or similar)
- Strong written, oral, and interpersonal communication skills
- Experience with monitoring tools (Prometheus, Grafana, ELK, Site24x7, Nagios)
- Knowledge of performance analysis and system vulnerability remediation
- Cloud certification (AWS or Azure) preferred
- Familiarity with restaurant industry SaaS platforms and customer-facing applications

### What the job involves
- The Site Reliability Engineer II will be responsible for supporting, enhancing, and maintaining Restaurant365’s cloud infrastructure and applications
- Qualified candidates will demonstrate growing expertise in site reliability practices, with skills in incident response, system monitoring, automation, and performance troubleshooting
- You will collaborate with DevOps, development, and infrastructure teams to resolve moderately complex issues, propose improvements, and strengthen the reliability, scalability, and security of our SaaS platform
- Respond to production incidents, perform triage and troubleshooting, and contribute to post-incident analysis
- Identify and automate manual processes to improve efficiency and reduce risk
- Enhance and evolve monitoring tools and platforms to improve observability
- Promote and apply best practices for reliability, scalability, and performance across engineering
- Implement and support cloud automation using Terraform, Ansible, or CloudFormation
- Work within change management protocols to provide maximum uptime for production systems
- Participate in on-call rotation, providing 24x7 support for incidents and contributing to root cause analysis
- Partner with developers, architects, vendors, and IT teams to ensure reliable system operations
- Research and remediate vulnerabilities in coordination with security teams
- Maintain documentation of infrastructure, monitoring, runbooks, and incident response procedures
- Apply company policies and procedures when handling operational tasks and incidents
- Suggest and implement improvements to operational processes and monitoring practices
- Contribute to technical diagrams, documentation, and runbooks for system reliability
- Expand expertise in cloud services (Azure, AWS, or GCP) and container platforms (EKS, ECS, AKS)
- Build proficiency with observability and monitoring tools (Prometheus, Grafana, ELK, Site24x7, Nagios)
- Develop scripting and automation skills using Python, Bash, PowerShell, or similar
- Participate in planning discussions by contributing technical input on system stability and reliability

### Benefits
- Fully Covered Health Insurance: R365 provides comprehensive medical, dental, and vision coverage—100% paid for full-time employees.
- Flexible Time Off: Recharge when you need, so you can stay energized and focused.
- 401k with employer match.
- Stock options to purchase equity.
- Health Savings and Flexible Spending Accounts.
- Employee discount programs for lifestyle vendors and services.
- Annual learning and development reimbursement.
- Access to LinkedIn Learning.
- Employee recognition and TIP program.
- Fertility guidance and support.
- Online legal solutions.
- Paid parental leave.
- Disability insurance.
- Pet and life insurance.
- Wellness programs.
- Health concierge and support.
- Meditation and mindfulness courses.
- Gym discounts and virtual fitness classes.
