Site Reliability Engineer

United States | Remote | Full Time | Posted 2 months ago | Visa sponsorship available

LOCATION | Hydrolix | USCAN, Remote

Job Description

At Hydrolix, we are revolutionizing the world of data management and analytics with our innovative cloud data platform, purpose-built for petabyte-scale datasets. Our mission is to help organizations drastically reduce data costs while increasing their data retention.

We are looking for a Site Reliability Engineer (SRE) to join our dynamic Services team. In this role, you will contribute to the reliability and scalability of our cutting-edge platform, ensuring exceptional solutions tailored to our customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.

Key Responsibilities

- Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
- Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
- CI/CD Management: Build and optimize CI/CD tools and processes to ensure efficient and reliable deployments.
- Monitoring and Incident Response: Develop and manage monitoring, alerting, and incident response strategies to minimize downtime and enable rapid recovery.
- Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.
- Automation and Efficiency: Automate repetitive tasks and optimize system performance to improve operational efficiency.
- On-Call Support: Participate in on-call coverage during weekday business hours and once-monthly weekend shifts.

Collaboration and Customer Engagement

- Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
- Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
- Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support.
- Customer Support: Interface with customers to address and resolve reported incidents, ensuring a seamless user experience.

Qualifications and Skills

- SRE Expertise: Proven experience as a Site Reliability Engineer or in a similar role, with a minimum of five years supporting complex distributed systems.
- Observability Tools: Experience with monitoring and debugging tools like Prometheus, Vector, Grafana, Superset, or Kibana.
- Cloud Platforms: Proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode).
- Database Knowledge: Experience with SQL databases; familiarity with PostgreSQL is a plus but not required.
- Programming Skills: Proficiency in programming languages such as Python, Go, or Rust.
- Linux Expertise: Strong experience with Linux systems, including performance tuning and system-level troubleshooting.
- Communication Skills: Excellent written and verbal communication skills, with the ability to convey technical concepts clearly to diverse audiences, including customers and cross-functional teams.
