Stealth AI Infrastructure Startup
Artificial Intelligence, Infrastructure, Software Development

Founding Systems Software Engineer - AI Infrastructure

San Francisco, California, United States · Onsite · Full Time · Posted 1 month ago

Read This First

If you are looking for a clean spec, clear boundaries, and incremental work, this role is not for you.

If you want to own critical infrastructure, operate in ambiguity, and be accountable for whether the system works in the real world, keep reading.

Company Context

Real-time AI inside a clinical setting should exist. It does not, not because the models are missing, but because the systems are. Clinics are not data centers. They do not have the power, cooling, or infrastructure required to run high-performance compute.

Because of that, AI runs outside the room. It supports documentation and back office work, but it is not involved when decisions are made.

We are building systems that run high-performance compute inside the physician’s office and keep it stable under real constraints. Power, thermals, noise, and uptime all matter. When this works, AI becomes part of the clinical interaction and informs decisions in real time. Our v1 platform is moving from prototype to production, and early engineers will define how it works.

The Mission

You will build and operate the system that makes GPU infrastructure work outside a data center. You are responsible for whether it works under constraint.

You will work directly with the founding systems lead. There is no buffer layer. No handoff. No safety net. If the system fails in a clinic, that is your problem.

Core Responsibilities

GPU Infrastructure

●    Run Kubernetes across local GPU nodes in constrained environments

●    Make NVIDIA GPUs stable and usable outside standard setups

●    Own drivers, device plugins, and runtime behavior under load

Monitoring and Debugging

●    Track GPU health, thermals, power, and memory in real time

●    Catch issues early, before they take down the system

●    Build enough visibility to understand what is failing and why

Deployment

●    Maintain GitOps pipelines for firmware and software updates

●    Push changes across nodes without breaking running systems

●    Enforce resource limits and isolation in tight environments

System Failures

●    Debug issues across containers, networking, drivers, and hardware

●    Work from partial signals and narrow problems quickly

●    Fix root causes and make sure they do not happen again

What This Role Requires

Experience

●    Typically 5+ years working on production infrastructure or systems software

Technical Requirements

●    Deep familiarity with Linux systems, not surface-level usage

●    Strong Kubernetes experience, including failure modes

●    Solid coding ability in Go or Python, and a willingness to work closer to the metal

●    Understanding of system resources: cgroups, namespaces, and scheduling

Operating Traits

●    You take ownership of outcomes, not tasks

●    You stay with problems until they are resolved

●    You act without waiting when the system is failing

What Will Break Average Candidates

●    Ambiguity. Requirements will be incomplete

●    Constraints. Power, cooling, and noise are real limits

●    Lack of separation. You will cross hardware, infra, and application layers

●    Accountability. The system either works or it does not

If you need clean interfaces and stable environments, this will feel chaotic.

You Are Not a Fit If

●    Your experience is primarily application-level engineering without infrastructure ownership

●    You have not operated systems in production where failure had real consequences

●    You rely on stable environments, managed platforms, or abstracted infrastructure

●    You have not debugged issues across multiple layers (hardware, OS, container, network)

Signals of Fit

●    You have debugged failures across GPU, OS, container, and networking layers

●    You have operated infrastructure that others depended on in production

●    You care about reliability as a personal standard, not a metric

●    You prefer hard problems with unclear paths over well-defined tasks

Compensation

Competitive with top engineering roles. Includes equity and full benefits package.

Final Note

This role will shape the system architecture and the company. If you are not comfortable being responsible for that, do not apply.
