Pilotcrew AI
Artificial Intelligence, Software, Business Automation, SaaS

Machine Learning Engineer

San Francisco, California, United States | Hybrid | Full Time | Posted 2 months ago

Role summary

Pilotcrew AI is seeking an Applied Research Engineer to develop production-grade systems for evaluating Large Language Models (LLMs) and AI agents. This role involves translating cutting-edge AI research into scalable evaluation pipelines, benchmarking methodologies, and improved model performance. You will work closely with engineering and product teams to build systems for measuring, debugging, and enhancing AI agents. The position requires strong fundamentals in machine learning, Python programming, and deep learning frameworks like PyTorch or TensorFlow, with an emphasis on implementing research insights into practical, production-ready solutions in a fast-paced startup environment.

Machine Learning Engineer - Applied Research

Location: Hybrid

Company: Pilotcrew AI

Type: Full-Time

Experience: 3-5 Years

About Pilotcrew AI

Pilotcrew AI builds infrastructure for AI Agent Evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing.

Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview

We are hiring an Applied Research Engineer to bridge cutting-edge AI research with production-grade systems for evaluating LLMs and AI agents.

In this role, you will read, interpret, and implement ideas from the latest research across large language models, multimodal systems, and agent architectures. You will translate these insights into scalable evaluation pipelines, new benchmarking methodologies, and improved model performance.

You will work closely with engineering and product teams to turn research concepts into real-world systems used for measuring, debugging, and improving AI agents.

This is a research-driven, execution-heavy role requiring strong fundamentals, curiosity, and the ability to operate in a fast-paced startup environment.

Key Responsibilities

• Read and synthesize research papers in LLMs, multimodal AI, and agent systems

• Implement and adapt state-of-the-art methods into production-ready systems

• Design and improve evaluation methodologies (benchmarking, grading, scoring)

• Build experimental pipelines to test model behavior, robustness, and generalization

• Analyze model performance, failure modes, and edge cases

• Develop novel metrics for reliability, reasoning quality, and tool usage

• Contribute to adversarial testing and stress-testing frameworks

• Work on multimodal systems (text, vision, tool interactions) where relevant

• Collaborate with engineering teams to productionize research ideas

• Document findings and communicate insights clearly to technical stakeholders

Required Skills

• Strong Python programming skills

• Solid foundation in machine learning and deep learning

• Hands-on experience with PyTorch or TensorFlow

• Experience working with LLMs, transformers, or multimodal models

• Ability to read and understand research papers and implement them effectively

• Strong analytical thinking and experimentation skills

• Experience designing experiments and interpreting results

• Familiarity with evaluation metrics and benchmarking methodologies

Preferred Skills

• Experience with LLM evaluation, benchmarking, or alignment

• Familiarity with agent architectures (ReAct, tool-calling, planning systems)

• Experience with multimodal models (vision-language systems, CLIP, etc.)

• Knowledge of RLHF, reward modeling, or preference learning

• Experience with retrieval systems, search, or re-ranking

• Exposure to distributed systems or large-scale experimentation pipelines

• Background in applied ML research (industry or academia)

What We Value

• Strong curiosity and research mindset

• Ability to translate theory into practical systems

• Ownership and bias toward execution

• Comfort working with ambiguity and evolving problem spaces

• Clear and structured technical communication

• Ability to thrive in a fast-paced startup environment with high ownership

Why Join Pilotcrew AI

• Work on cutting-edge problems in AI evaluation and reliability

• Bridge research and real-world AI systems

• High ownership and autonomy in a fast-moving team

• Opportunity to shape how AI agents are evaluated at scale

• Exposure to both research-driven innovation and production systems