Crustdata Verified
Software, Artificial Intelligence, Data Management
ML Engineer Intern (Summer 2026)
San Francisco, California, United States · Onsite · Internship · Junior / Entry-level · $8,000–$14,000/mo · Posted 2 months ago · Hidden Gem · YC Startup
## Role summary
Seeking an ML Engineer Intern for a 12-week summer program (June–August 2026) focused on building the AI agent gateway. The intern will work directly with the founding team on research and engineering for the core intelligence layer, which indexes and enriches web-scale data. Responsibilities include researching, training, and deploying ML models from concept to production, on problems such as multilingual search, entity resolution, org chart inference, technology detection, and job title mapping. The role requires strong fundamentals in NLP, information retrieval, or entity resolution; familiarity with transformer architectures; and experience with Python and PyTorch.
## About the role
Skills: Python, PyTorch, NLP, LLMs, Information Retrieval, Entity Resolution, Text Classification
We're building the gateway to the internet for AI agents. Our APIs already power hundreds of customers — and we went from 0 to $7M ARR in our first 12 months. Now we need someone who can push the boundaries of what our ML systems can do.
We're hiring an ML Engineer Intern to work directly with our founding team on the research and engineering behind our core intelligence layer. Our platform indexes hundreds of millions of professional profiles and company records from across the web. Making that data searchable, matchable, and enriched is an ML problem at its core.
This is a 12-week summer internship (June–August 2026). You will not be fetching coffee or watching from the sidelines. You will be researching, training, and shipping models — from paper to prototype to production. Previous interns' work has shipped to customers within weeks.
## Who you are
* Currently pursuing a Master's or PhD in Computer Science, Machine Learning, NLP, or a related field
* Strong fundamentals in NLP, information retrieval, or entity resolution — through coursework, research, or side projects
* Familiar with transformer architectures — you've trained or fine-tuned encoder models, not just called APIs
* Experience building retrieval systems, classifiers, or embedding models (in academic or personal projects)
* Exposure to contrastive learning, metric learning, or representation learning
* Have used LLMs for structured extraction, classification, or data generation
* Strong Python and PyTorch
* A true grinder — we work very hard
* Founder mentality — someone who wants to build a company someday
## What you'll be doing
You'll own real ML problems that turn messy, multilingual, web-scale data into structured intelligence. Some example problems:
* A customer searches for "RevOps professionals" — you need to return people titled "Head of Revenue Department," "Revenue Operations Manager," and "VP Sales Operations," across English, French, and German
* Three different data sources list what looks like three different companies — but it's actually one. You figure out how to resolve that automatically across millions of records
* Given raw people data, infer the org chart — who reports to whom, what the team structure looks like, how the engineering org differs from sales
* Detect what technologies a company uses from unstructured signals scattered across the web
* Classify whether a job change was a promotion, lateral move, demotion, or just a title edit — and do it for millions of transitions
* Map raw job titles to canonical titles, seniority levels, and job functions — across dozens of languages and naming conventions
## Nice to haves
* Published research or conference papers (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
* Experience with entity resolution or record linkage at scale
* Built taxonomy or ontology systems over messy real-world data
* Background in multilingual NLP or cross-lingual transfer
* Open-source contributions in NLP/IR
* Experience with distributed training on GPU clusters
## Compensation & perks
* **$8,000–$14,000/month** (above market rate for SF internships)
* **Housing stipend** for those relocating to SF
* Direct mentorship from the founding team — no layers between you and the CEO
* Your work ships to production and reaches real customers