Machine Learning Engineer
Role summary
Pilotcrew AI is seeking an Applied Research Engineer to develop production-grade systems for evaluating Large Language Models (LLMs) and AI agents. This role involves translating cutting-edge AI research into scalable evaluation pipelines, benchmarking methodologies, and improved model performance. You will work closely with engineering and product teams to build systems for measuring, debugging, and enhancing AI agents. The position requires strong fundamentals in machine learning, Python programming, and deep learning frameworks like PyTorch or TensorFlow, with an emphasis on implementing research insights into practical, production-ready solutions in a fast-paced startup environment.
Machine Learning Engineer - Applied Research
Location: Hybrid
Company: Pilotcrew AI
Type: Full-Time
Experience: 3-5 Years
About Pilotcrew AI
Pilotcrew AI builds infrastructure for AI Agent Evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing.
Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.
Role Overview
We are hiring an Applied Research Engineer to bridge cutting-edge AI research with production-grade systems for evaluating LLMs and AI agents.
In this role, you will read, interpret, and implement ideas from the latest research across large language models, multimodal systems, and agent architectures. You will translate these insights into scalable evaluation pipelines, new benchmarking methodologies, and improved model performance.
You will work closely with engineering and product teams to turn research concepts into real-world systems used for measuring, debugging, and improving AI agents.
This is a research-driven, execution-heavy role requiring strong fundamentals, curiosity, and the ability to operate in a fast-paced startup environment.
Key Responsibilities
• Read and synthesize research papers in LLMs, multimodal AI, and agent systems
• Implement and adapt state-of-the-art methods into production-ready systems
• Design and improve evaluation methodologies (benchmarking, grading, scoring)
• Build experimental pipelines to test model behavior, robustness, and generalization
• Analyze model performance, failure modes, and edge cases
• Develop novel metrics for reliability, reasoning quality, and tool usage
• Contribute to adversarial testing and stress-testing frameworks
• Work on multimodal systems (text, vision, tool interactions) where relevant
• Collaborate with engineering teams to productionize research ideas
• Document findings and communicate insights clearly to technical stakeholders
Required Skills
• Strong Python programming skills
• Solid foundation in machine learning and deep learning
• Hands-on experience with PyTorch or TensorFlow
• Experience working with LLMs, transformers, or multimodal models
• Ability to read and understand research papers and implement them effectively
• Strong analytical thinking and experimentation skills
• Experience designing experiments and interpreting results
• Familiarity with evaluation metrics and benchmarking methodologies
Preferred Skills
• Experience with LLM evaluation, benchmarking, or alignment
• Familiarity with agent architectures (ReAct, tool-calling, planning systems)
• Experience with multimodal models (vision-language systems, CLIP, etc.)
• Knowledge of RLHF, reward modeling, or preference learning
• Experience with retrieval systems, search, or re-ranking
• Exposure to distributed systems or large-scale experimentation pipelines
• Background in applied ML research (industry or academia)
What We Value
• Strong curiosity and research mindset
• Ability to translate theory into practical systems
• Ownership and bias toward execution
• Comfort working with ambiguity and evolving problem spaces
• Clear and structured technical communication
• Ability to thrive in a fast-paced startup environment with high ownership
Why Join Pilotcrew AI
• Work on cutting-edge problems in AI evaluation and reliability
• Bridge research and real-world AI systems
• High ownership and autonomy in a fast-moving team
• Opportunity to shape how AI agents are evaluated at scale
• Exposure to both research-driven innovation and production systems