IT Site Reliability Engineer

Irvine, California, United States | Onsite | Full Time | $120,000–$145,000/yr | Posted 1 month ago | Visa sponsorship available

*This position is located onsite in Irvine, CA*

Job Summary:

Willow Laboratories is a fast-growing and forward-thinking medical technology company focused on delivering innovative solutions that improve lives. With a strong foundation in software development and an expanding footprint in regulated medical environments, we are building the infrastructure and systems necessary to support our continued growth.

We are seeking an experienced Site Reliability Engineer to take ownership of operations for our cloud-based applications and infrastructure. As we scale our production systems to support growing user demand, this role will be instrumental in maturing our operational practices, strengthening our reliability posture, and establishing the monitoring and automation foundations critical for long-term success.

The ideal candidate will bring hands-on expertise with AWS infrastructure (EKS, DynamoDB, S3, and related services) and thrive in an environment where they can make an immediate impact. You'll lead incident response, implement comprehensive observability solutions, develop infrastructure as code, and build reliability into our mobile backend and microservices architecture from the ground up. This is an opportunity to define SRE practices and operational standards that will scale with the company. You'll work closely with our development team of 15+ engineers and report to the Vice President of Information Technology.

Duties & Responsibilities:

Reliability & Operations Management

Own the operational reliability and availability of production applications and infrastructure on AWS, with readiness to support future multi-cloud initiatives (Azure, Google Cloud, Akamai)
Respond to and lead incident management efforts, including root cause analysis, post-incident reviews, and implementation of preventive measures
Participate in the on-call rotation and provide timely incident response
Establish and track SLOs, SLIs, and error budgets to drive reliability improvements
Document operational procedures, runbooks, and architectural decisions in Confluence

Infrastructure & Cloud Management

Manage and optimize AWS services including EKS (Kubernetes), DynamoDB, AppSync, Amplify, S3, and managed streaming services
Develop and maintain infrastructure as code using tools like Terraform, CloudFormation, or similar technologies
Manage multi-zone and multi-region database deployments and ensure data integrity and availability
Manage application load balancing and WAF configurations through Cloudflare
Integrate and monitor external services including Firebase, HealthKit, Health Connect, Branch.io, Landbot, Twilio, and OpenAI
Optimize cloud resource utilization while maintaining performance standards

Monitoring & Observability

Design, implement, and maintain comprehensive monitoring and alerting solutions using tools such as Grafana, CloudWatch, and application performance monitoring platforms
Produce regular performance reports for stakeholders, highlighting optimization opportunities
Monitor cloud spending and identify optimization opportunities

DevOps & Automation

Implement and maintain CI/CD pipelines using Jenkins, Bitbucket, and related DevOps tooling

Minimum Qualifications and Experience:

Bachelor's degree in Computer Science, Information Technology, or related technical field, or equivalent practical experience
3+ years of experience in Site Reliability Engineering, DevOps, or Production Operations roles
Strong hands-on experience with AWS services (EC2, EKS, S3, RDS/DynamoDB, VPC, IAM, CloudWatch)
Proven experience with Kubernetes/EKS in production environments
Proficiency with at least one scripting/programming language (Python, Go, Bash, or similar)
Experience with monitoring and observability tools (Grafana, Prometheus, CloudWatch, or equivalent)
Solid understanding of networking concepts, DNS, load balancing, and web application security
Experience with CI/CD tools and practices (Jenkins, GitLab CI, GitHub Actions, or similar)
Knowledge of incident management processes and post-mortem analysis
Familiarity with Atlassian suite (Jira, Confluence, Bitbucket)
Understanding of database administration and optimization for both SQL and NoSQL systems
Experience managing mobile backend infrastructure
Strong problem-solving and troubleshooting skills
Excellent communication skills and ability to collaborate with cross-functional teams

Desired Qualifications:

Cloud & Infrastructure Expertise

AWS certifications (Solutions Architect, SysOps Administrator, or DevOps Engineer)
Experience with multi-cloud environments (Azure, Google Cloud Platform, Akamai)
Experience with infrastructure as code tools (Terraform, CloudFormation, Pulumi)
Understanding of serverless architectures and AWS Lambda
Experience with spot/preemptible instance management and cost optimization strategies

Architecture & Scalability

Knowledge of service mesh technologies and microservices architecture patterns
Background in capacity planning and performance engineering
Background in mobile application infrastructure and APIs
Experience with real-time data streaming platforms (Kafka, Kinesis)

Monitoring, Security & Reliability

Familiarity with APM tools and distributed tracing (New Relic, Datadog, Dynatrace)
Experience with container security and compliance frameworks
Experience with WAF configuration and DDoS mitigation strategies
Experience with chaos engineering and reliability testing practices

Domain-Specific Knowledge

Knowledge of healthcare or nutrition-related application compliance requirements

Willing to work extended hours and weekends when needed to meet critical deadlines

Compensation Range:

This salary range represents the full compensation band for this role. Most new hires are typically placed toward the middle of the range based on experience, skills, education, and job‑related qualifications. Compensation at the upper end of the range is reserved for candidates with exceptional experience or those who significantly exceed the role’s core requirements. Actual compensation within this range will be determined based on experience, skills, education, geographic location, and internal equity.

Physical requirements/Work Environment:

This is an on-site position located at our Irvine, CA office (121 Theory). The role primarily works in an office environment and requires frequent sitting, standing, and walking. Daily use of a computer and other computing and digital devices is required. The role may require standing for extended periods when facilitating meetings or walking in the facilities. Some local travel may be necessary; therefore, the ability to operate a motor vehicle and maintain a valid driver's license is required.

The physical demands of the position described herein are essential functions of the job and employees must be able to successfully perform these tasks for extended periods. Reasonable accommodations may be made for those individuals with real or perceived disabilities to perform the essential functions of the job described.
