CloudCruise
Member of Technical Staff – Infrastructure
San Francisco, California, United States · Onsite · Full Time · Staff · $150,000–$250,000 /yr · Posted 2 months ago · Hidden Gem · YC Startup
Role summary
CloudCruise is seeking a Member of Technical Staff – Infrastructure to build and scale the distributed systems powering its enterprise computer automation platform. The role involves orchestrating worker fleets on AWS, developing real-time coordination layers, and designing fault-tolerant systems. The ideal candidate has experience with large-scale distributed systems, production AWS infrastructure, and Redis for advanced use cases, along with a strong focus on observability and reliability. Experience with ML operations and security hardening is preferred. This is an onsite, full-time position in San Francisco focused on building robust, scalable infrastructure for a rapidly growing startup in the healthcare automation space.
**About CloudCruise**
CloudCruise is building the coding agent for enterprise computer automation. Our developer platform writes, tests, and maintains automation code on fully-managed infrastructure – cutting dev time by 90%. We're starting with healthcare, where legacy systems make reliable automation a genuinely hard problem. We just raised $5M and brought in angels like Zack Lipton (CTO Abridge) and David Singleton (fmr. CTO Stripe).
We're looking for 10x engineers who are comfortable learning across domains and diving deep into unfamiliar territory. High agency, ground-up builders who thrive with significant ownership from day one.
**The Role**
You'll own the distributed systems that let us run tens of thousands of browser automations daily — and scale to millions. This means orchestrating ephemeral worker fleets across AWS, building real-time coordination layers over Redis and WebSockets, and designing fault-tolerant systems that recover gracefully when things go wrong.
Reliability is everything. When a customer's automation fails, claims don't get submitted and patients don't get care. Your job is to make sure that doesn't happen.
**What You'll Work On**
* Dynamic EC2 provisioning with auto-scaling, multi-OS support (Linux/Windows), health monitoring, crash recovery, and priority-based dispatch across resource groups
* [Socket.io](http://Socket.io) with Redis adapter for horizontally scalable WebSockets, custom distributed job queues with leader election and credential locking, pub/sub messaging for cross-instance communication
* Evolve our single-leader dispatcher toward sharded or multi-leader architectures, implement dynamic worker provisioning based on queue depth, optimize connection pooling and caching layers
* Deploy and optimize inference for vision-language models powering our agents – low latency, high throughput, cost-efficient GPU utilization
* Expand our OpenTelemetry and Langfuse tracing into full metrics dashboards, alerting, and SLO tracking
* Lambda functions for event processing, EC2/SSM for remote execution, S3 for artifact storage, IAM and security hardening
**You Might Be a Fit If**
* You've built distributed systems that handle real scale – worker orchestration, job queues, leader election
* You're fluent in Redis as more than a cache: pub/sub, distributed locks, state management
* You've operated production AWS infrastructure (EC2, Lambda, SSM) and understand the cost/reliability tradeoffs
* You care about observability – you've built dashboards, set up alerting, and debugged production issues with traces
* You're the person who sees "custom job queue" and immediately thinks about failure modes
**Compensation**
Competitive salary and meaningful equity. We want you to have real ownership in what we're building.