Staff Forecasting Software Engineer

San Francisco, California, United States · Onsite · Full Time · Staff · $200,000–$230,000/yr · Posted 1 month ago · Visa sponsorship available


### Who you are
- We are looking for an engineer who can contribute immediately, has shouldered real production incidents, and brings strong judgment around building stable, observable, and scalable systems, including modern agentic and LLM-powered applications
- 10+ years of professional software engineering experience building and operating production-grade distributed systems
- A strong track record of hands-on ownership of business-critical services, including measurable improvements in latency, throughput, stability, or cost
- Deep expertise in systems design, including service boundaries, concurrency, data modeling, failure handling, and scalability tradeoffs
- Production experience supporting machine learning–driven systems (forecasting, recommendations, or similar), with emphasis on serving, pipelines, and infrastructure
- Expert-level experience with AWS, including designing, deploying, and operating large-scale cloud-native systems
- Strong hands-on experience with Kubernetes, containerized microservices, and modern CI/CD pipelines
- Experience operating software in both on-prem data center and AWS cloud environments
- Fluency with modern AI-assisted development tools (e.g., Cursor, GitHub Copilot) and comfort working in “vibe coding”–style workflows that favor fast iteration, tight feedback loops, and continuous refactoring
- Proficiency in one or more backend languages commonly used for large-scale systems (e.g., Python, Java, Go, Scala)
- Bachelor’s or master’s degree in computer science, mathematics, or a related field, or equivalent practical experience

### What the job involves
- We are hiring a hands-on Staff Software Engineer to provide technical leadership for our Forecasting and Recommendations platforms, with a strong focus on production-grade AI and agentic systems
- This role centers on designing, building, and operating high-throughput, low-latency distributed systems that power forecasting, recommendations, and AI-driven decisioning at scale
- You will work deeply in backend systems, infrastructure, and AI application architecture, while remaining accountable for reliability, observability, and operational excellence
- Design, build, and operate systems supporting forecasting, recommendations, and agentic AI workflows in production
- Write production-quality code daily; own services end-to-end from design through on-call and incident resolution
- Architect low-latency, high-throughput SaaS services, including APIs, data pipelines, model inference, and agent orchestration
- Build and maintain production-grade agentic applications, including tool-using agents, workflow orchestration, and guardrails
- Work fluently with foundation LLMs (e.g., GPT, Claude, Gemini Pro), selecting appropriate models and deployment patterns based on latency, cost, and reliability tradeoffs
- Use frameworks and tooling such as LangChain, voice-agent tooling, and related ecosystems to accelerate development while enforcing production discipline
- Embrace AI-assisted development workflows (e.g., Cursor, GitHub Copilot, vibe coding paradigms) to move quickly without sacrificing quality
- Champion observability and reliability: metrics, logging, tracing, alerting, and post-incident analysis
- Lead and participate in production incident response, retrospectives, and systemic fixes
- Identify architectural risks early and make design decisions that prevent outages and scalability issues
- Reduce complexity across services, infrastructure, and processes to improve stability and team velocity
- Provide technical guidance across teams and participate in architectural reviews beyond your immediate domain
- Be a prolific, hands-on contributor who raises the bar through example
- Own and deeply understand large portions of the codebase and system architecture
- Simplify complex systems to increase team velocity and reliability
- Balance technical, analytical, and product constraints to deliver pragmatic solutions
- Set short- to medium-term (6–12 month) technical direction for your domain
- Influence standards and architectural decisions across teams through credibility and collaboration
- Ship multiple large services, shared libraries, or major infrastructure improvements

Ready to apply?

You'll be redirected to Zeta Global's application page.