NVIDIA Verified
Semiconductors, Artificial Intelligence, Computer Hardware, Software Development

Distinguished Engineer, Cloud Site Reliability Engineering

Santa Clara, California, United States · Onsite · Full Time · Distinguished / Architect · $320,000–$488,750 /yr · Posted 2 months ago · Visa sponsorship available

Role summary

NVIDIA is seeking a distinguished Cloud SRE Architect to join its Infrastructure, Planning and Process (IPP) Cloud Infrastructure Team. This role is critical in supporting NVIDIA's global operations, including Graphics Processors, Deep Learning, and Driverless Cars, by managing cloud services that handle nearly half a million automated jobs daily on thousands of servers. The architect will be responsible for evaluating, architecting, and implementing CI/CD solutions, optimizing AI development and testing systems for performance and cost, and leading projects to resolve complex software system issues. The position requires extensive experience in systems software development, cloud technologies, distributed systems, and containerization, with a strong emphasis on maintaining highly available production environments.

NVIDIA is looking for a Cloud SRE Architect to join the Cloud Infrastructure Team within IPP (Infrastructure, Planning and Process), a global organization within NVIDIA. IPP works with other NVIDIA groups, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to meet their infrastructure needs. Its cloud services run almost half a million automated jobs per day on thousands of servers, improving the efficiency of thousands of NVIDIA software engineers worldwide. The cloud hosts machines and devices running operating systems such as Windows, Linux, and Android, supports hardware platforms including NVIDIA GPUs and Tegra processors, and delivers unified CI/CD solutions and cloud-based software development. Are you passionate about distributed infrastructure and looking for sophisticated, critical problems to solve? Are you ready to build the next generation of cloud services, design creative solutions, and mine through data to uncover real problems and fix them?

What You'll Be Doing

  • Serve as an SRE Architect on the GPU Private Cloud team, whose platform is used by thousands of NVIDIANs globally for interactive development, centralized CI/CD, and QA testing.
  • Evaluate, identify, and develop software solutions that optimize critical software development workflows across organizations within NVIDIA.
  • Architect, implement, and support an end-to-end CI/CD system using open-source and NVIDIA proprietary software.
  • Onboard customers (NVIDIA internal development teams) to the private cloud infrastructure, with thorough discovery of each use case and the solutions available within the cloud.
  • Identify performance bottlenecks and optimize the speed and cost efficiency of AI development and testing systems.
  • Lead software development projects, technically direct a team of brilliant engineers, and guide them toward efficient and impactful solutions.
  • Find problems within software systems and resolve them.
  • Craft and implement critical metrics using various analytics methods and dashboards.

What We Need To See

  • BS in EE/CS or equivalent experience, with 18+ years of systems software development, including at least 1 year dedicated to developing or exploring AI.
  • Experience maintaining cloud infrastructure and highly available production environments.
  • Strong programming and software development skills in Java, Python, and shell scripting, along with a good understanding of distributed systems and REST APIs.
  • Experience working with SQL/NoSQL database systems such as MySQL, Cassandra, MongoDB, or Elasticsearch.
  • Excellent knowledge and working experience with Docker containers and Virtual Machines.
  • Good background in cloud technologies such as OpenStack, Docker, Kubernetes, Chef/Puppet, Hadoop/Ceph/SwiftStack, LXC, Git, Perforce, JFrog, and Kafka.
  • Ability to work across organizational boundaries effectively to improve alignment and productivity between teams in a multi-national, multi-time-zone corporate environment.

Ways To Stand Out From The Crowd

  • Depth in AI, Machine Learning and Deep Learning algorithms and techniques.
  • Strong collaborative and interpersonal skills, with a consistent record of guiding and influencing others in dynamic environments.
  • Experience developing large-scale software systems using modular architecture under real-time performance requirements.
  • Background in designing high-performance, scalable software systems with a strong focus on hardware cost optimization.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 320,000 USD - 488,750 USD.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until April 5, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
JR2015250

Sample NVIDIA interview questions

  1. Design a system for a rock paper scissors game (system design, medium)
  2. Implement a distributed data migration management platform (system design, medium)
  3. Develop a distributed tracing system for tracking and debugging (system design, medium)
  4. Design a distributed training system for a trillion-parameter language model (system design, medium)
  5. Design a system for an automation framework to generate a consent form using multiple agents (system design, average)
