Confidential Company
Accounting

Data Engineer, Ingestion

San Francisco, California, United States | Onsite | Full Time | $150,000–$200,000/yr | Posted 2 months ago

Role summary

We are seeking a Data Engineer, Ingestion to build systems that transform raw scientific data into clean, analysis-ready datasets. You will develop AI-powered ingestion pipelines for diverse sources like Excel/CSV and lab instruments, design schema mapping and normalization logic, and create systems for metadata standardization and data quality enforcement. This role combines classical data engineering with LLM approaches to structure semi-structured data, ensuring downstream systems operate on trustworthy datasets. You will have a foundational impact on the quality and reliability of the data and AI platform, working with messy data and building resilient, scalable transformation systems.

ABOUT THE ROLE

We are hiring a Data Engineer, Ingestion to build the systems that transform messy, real-world scientific data into clean, structured, and analysis-ready datasets. You will partner closely with data scientists, bioinformatics specialists, and product teams to turn diverse data inputs into reliable, standardized assets that power the broader platform.

Your work will include building AI-powered ingestion pipelines for heterogeneous data sources (e.g., Excel/CSV uploads, lab instrument outputs, and internal pipelines), designing robust schema mapping and normalization logic, and developing systems that standardize metadata, resolve inconsistencies, and enforce data quality at the point of entry. You will combine classical data engineering with LLM-driven approaches to structure semi-structured data and ensure that downstream systems always operate on canonical, trustworthy datasets.

This role sits at the critical entry point of the data lifecycle, bridging raw data ingestion and high-performance analytics and AI systems. If you enjoy working with messy data, building resilient pipelines, and designing scalable data transformation systems, this is an opportunity to have a foundational impact on the quality and reliability of an entire data and AI platform.

WHAT YOU WILL DO

  • Build and own an AI-powered ingestion and normalization pipeline to import data from a wide variety of sources — including unprocessed Excel/CSV uploads, lab and instrument exports, and processed data from internal pipelines
  • Develop robust schema mapping, coercion, and conversion logic (e.g., units normalization, metadata standardization, variable-name harmonization, handling vendor-specific formats, reference updates, and batch-effect correction)
  • Use a combination of LLM-driven and classical data engineering techniques to structure semi-structured or messy tabular data — including metadata extraction, column type inference, header normalization, inconsistency resolution, and dataset preparation
  • Ensure that one-time transformations (normalization, coercion, batch correction) are executed during ingestion so downstream systems and AI applications operate on clean, canonical data
  • Build validation, verification, and quality-control layers to detect ambiguous, inconsistent, or corrupt data before it enters the platform
  • Collaborate with product teams, data scientists, bioinformatics specialists, and infrastructure engineers to define and enforce data standards and ensure seamless integration with downstream systems

WHAT YOU BRING

Must-have

  • 5+ years of experience in data engineering or data wrangling with real-world tabular or semi-structured data
  • Strong proficiency in Python and modern data processing tools (e.g., Pandas, Polars, PyArrow, or similar)
  • Extensive experience working with messy spreadsheet-style data (Excel/CSV), including inconsistent headers, multi-sheet formats, mixed data types, and free-text fields
  • Experience designing and maintaining robust ETL/ELT pipelines, ideally involving scientific or lab-derived data
  • Ability to combine traditional data engineering approaches with LLM-powered data normalization, metadata extraction, and cleaning
  • Strong ownership mindset with the ability to design and manage ingestion and normalization systems end-to-end, with attention to maintainability, reproducibility, and scalability
  • Strong communication skills and ability to collaborate across cross-functional teams and translate real-world data challenges into reliable engineering solutions

Nice-to-have

  • Familiarity with scientific data types and modalities (e.g., plate readers, genomics metadata, time-series data, batch information, instrumentation outputs)
  • Experience with workflow orchestration tools (e.g., Nextflow, Prefect, Airflow, Dagster) or building pipeline abstractions
  • Experience with cloud infrastructure and data storage systems (e.g., object storage, data lakes/warehouses, database schema design) supporting multi-tenant environments
  • Exposure to LLM-based data transformation or cleansing systems
  • Background in computational biology, bioinformatics, or lab data systems (not required)

WHAT YOU WILL LOVE

  • Mission-driven impact: Play a critical role in ensuring that all incoming scientific data is clean, consistent, and analysis-ready, directly influencing the reliability of downstream analytics and AI systems
  • High ownership and autonomy: Own the ingestion and normalization layer end-to-end, from raw data input to clean datasets, and shape how data flows through the platform
  • Team: Work alongside a highly skilled, collaborative group of engineers, scientists, and builders
  • Culture: A focus on clarity, consistency, and solving challenging problems through disciplined execution
  • Speed: Fast iteration cycles with continuous improvement driven by real user feedback
  • Environment: In-person, high-energy, collaborative office setting

Benefits: Comprehensive health coverage and retirement benefits
