
AI Production Engineer
Role summary
Meta is seeking an AI Production Engineer to build and scale production-grade AI systems for executive leadership. This role involves significant software and systems engineering, focusing on writing high-quality code, designing resilient systems, and creating automation and tooling for reliable AI operations. You will own AI infrastructure, including training, inference, data pipelines, and GPU fleet management, across major cloud platforms and Kubernetes. The position requires technical leadership, setting direction for AI infrastructure and reliability practices, and mentoring engineers. Responsibilities include designing and implementing AI/ML systems like LLMs and RAG, building CI/CD pipelines, and managing on-call rotations for critical incidents. The role emphasizes building resilience into systems from the start and requires a proven track record in leading complex technical initiatives and productionizing AI/ML systems.
Production Engineers (PEs) at Meta are specialized software engineers who develop the underlying infrastructure for all of Meta's products and services, forming the backbone of every major engineering effort that keeps our platforms running smoothly and scaling efficiently.
As an AI Production Engineer on our AI Transformation team, you will apply this discipline to build and scale production-grade AI systems that enhance the productivity and experience of our executive leadership. This is primarily a software and systems engineering role: you will spend the majority of your time writing high-quality code, designing resilient systems, building automation, and creating tooling that enables AI to run reliably and efficiently.
Working alongside some of the best engineers in the industry, you'll contribute to code and systems that go into production and directly impact how our executives work. As a technical leader, you will set the direction for our AI infrastructure and reliability practices, engineering away operational burden through robust design, automation, and self-healing systems.
AI Production Engineer Responsibilities:
- Design and implement production-grade AI/ML systems for executive productivity, including LLMs, RAG systems, agents, inference pipelines, and MLOps infrastructure
- Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on complex AI systems serving executive leadership
- Build automation, self-healing systems, and CI/CD pipelines to minimize manual intervention and operational toil
- Own AI infrastructure—training, inference, data pipelines, and GPU fleet management—across cloud platforms (AWS, Azure, GCP) and Kubernetes
- Set technical direction, lead design reviews, mentor engineers, and advise leadership on AI technology trends and trade-offs
- Share an on-call rotation (~1 week per quarter) and serve as an escalation contact for critical AI system incidents
- Champion reliability by design—building resilience into systems from the start with circuit breakers, fallbacks, and graceful degradation
- Travel globally up to 20% of the year to engage with executive partners and scale business opportunities
Minimum Qualifications:
- Proven track record of leading complex technical initiatives and mentoring other engineers
- Experience building and productionizing AI/ML systems, including LLMs, RAG architectures, inference optimization, and MLOps
- 7+ years of experience in Linux/Unix and network fundamentals
- Knowledge of common web technologies and Internet service architectures (CDN, load balancing, distributed systems)
- Experience with Internet service architecture, capacity planning, and responding to urgent needs for additional capacity
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- Experience configuring and running infrastructure-level applications such as Kubernetes, Terraform, and cloud platforms (AWS, Azure, GCP)
- 7+ years of coding experience in an industry-standard language (e.g., Python, Go, C++, Java, Rust)
Preferred Qualifications:
- Familiarity with observability tools (Prometheus, Grafana, Datadog) and database/caching technologies (MySQL, Redis, Memcached)
- Experience with GPU infrastructure, ML accelerators, and model serving at scale
- BS or MS in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- Background in Production Engineering, Platform Engineering, or Site Reliability Engineering (SRE)
- Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
- Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies
About Meta:
Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.
Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.
Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-ext@meta.com.
$184,000/year to $257,000/year + bonus + equity + benefits
Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.