Tech - Staff Data Scientist, Clinical Data + AI

Irving, Texas, United States | Remote | Full Time | Staff | $185,000–$225,000 /yr | Posted 1 month ago

Category: Professional

Employment Type: Full Time

Location: US-Remote

Department: Technology

Reports To: Director of Engineering

Additional Info: Visa / work permit sponsorship is not available for this position

## Position Overview

We are seeking a Staff Data Scientist to own the design, normalization, and optimization of advanced data representation, retrieval, and knowledge systems that power AI and machine learning initiatives. This role is centered on vector databases, knowledge graphs, and retrieval-augmented generation (RAG) solutions, with deep responsibility for statistical rigor, mathematical correctness, and semantic normalization across disparate data sources. The role also owns clinical AI quality measurement, including rubrics, golden sets, and regression suites.

This individual serves as the architect of the data substrate, ensuring that heterogeneous data, originating from different systems, formats, scales, and semantics, can be reliably compared, linked, and reasoned over by downstream analytics and AI models. The role works in close partnership with the Staff AI Engineer to ensure data representations support robust, explainable, and scalable AI systems.

## Key Responsibilities

Mathematical & Statistical Foundations

Apply a strong mathematical foundation—including linear algebra, probability theory, statistics, and optimization—to data representation and analysis problems.
Use statistical methods to evaluate distributions, variance, correlation, uncertainty, and signal-to-noise characteristics across datasets.
Design normalization and transformation strategies grounded in mathematical correctness to preserve meaning while enabling comparison.
Ensure statistical assumptions are explicit, tested, and documented.

Data Flywheel from MAA Decisions

Standardize MAA edits/overrides into structured labels (change, reason, severity).
Build semi-automated labeling using rules, heuristics, and human QA.
Track inter-rater reliability to ensure label integrity.

Data Normalization & Cross-Source Comparability

Lead efforts to normalize data from disparate sources so that values, features, and representations can be meaningfully compared and combined.
Address differences in scale, units, and measurement conventions; schema and structural variation; semantic meaning and context; and temporal granularity and resolution.
Define canonical representations, reference models, and normalization pipelines for structured, semi-structured, and unstructured data.
Implement statistical and embedding-based techniques to align heterogeneous data into shared latent or semantic spaces.

PHI-Safe Data Operations

Build PHI-aware data pipelines with de-identification, access controls, retention, and audit trails.
Ensure all evaluation and training datasets are traceable and compliant.

Data Foundations & Quality Engineering

Perform advanced exploratory data analysis (EDA) to identify bias, missingness, anomalies, and structural inconsistencies.
Define, measure, and monitor data quality dimensions including completeness, consistency, accuracy, timeliness, and lineage.
Establish validation checks and statistical controls to ensure normalized and linked data remains trustworthy over time.
Document data assumptions, limitations, and known failure modes.

Vector Databases & Embedding Systems

Design and operate vector database architectures for semantic retrieval, similarity search, and AI grounding.
Define embedding strategies that allow heterogeneous data types (text, structured records, metadata) to coexist in comparable vector spaces.
Optimize retrieval performance, relevance, and statistical fidelity at scale.
Partner with ML scientists to ensure embeddings and retrieval align with model behavior and evaluation criteria.

Knowledge Graphs & Structured Knowledge Systems

Design and maintain knowledge graphs that encode normalized entities, relationships, and domain semantics.
Define ontologies and schemas that enable semantic consistency across data sources.
Support inference, traversal, and explainability by linking structured knowledge with unstructured content.
Maintain alignment between graph representations and vector-based retrieval systems.

RAG & Hybrid Retrieval Solutions

Architect retrieval-augmented generation (RAG) pipelines combining vector similarity search, knowledge graph traversal, SQL-based relational queries, and document and object storage.
Evaluate hybrid retrieval strategies to improve grounding, reduce hallucinations, and increase contextual relevance.
Instrument retrieval systems with statistical metrics to measure coverage, relevance, and failure patterns.

Hybrid Data Architecture & Semantic Linking

Design hybrid data architectures that link relational (SQL) databases, vector databases, knowledge graphs, and document stores and large text corpora.
Implement entity resolution, canonical identifiers, and cross-system joins.
Enable bidirectional traceability between raw data, normalized forms, embeddings, and knowledge representations.

Production Monitoring and Drift Detection

Define and continuously monitor model and agent quality metrics segmented by tenant, EHR system, workflow, and cohort.
Build targeted drift alerting systems that focus on clinically meaningful failure modes, avoiding reliance on generic embedding drift signals.
Maintain an evolving error taxonomy and operate a weekly review loop to identify and address top failure modes, ensuring rapid feedback and continuous improvement.

Day-to-Day Tools & Practices

A successful Staff Data Scientist in this role routinely applies:

Rubrics and acceptance criteria for each AI agent workflow
Golden data sets and scenario suites
Statistical analysis and mathematical modeling
Advanced SQL and relational data design
Python-based data analysis (pandas, NumPy, SciPy)
Vector and graph databases and embedding pipelines
Data validation, transformation, and lineage tooling
Large-scale text and document processing workflows

## Required Qualifications

Advanced degree in Data Science, Statistics, Applied Mathematics, Computer Science, or a related field (PhD welcome, but demonstrated impact from shipped work carries more weight).
Strong mathematical background, including statistics, probability, and linear algebra.
Extensive experience normalizing and integrating data from multiple heterogeneous sources.
Hands-on expertise with vector databases and semantic retrieval, knowledge graphs and graph data models, and SQL and relational database systems.
Proven ability to collaborate with ML scientists and engineers in production environments.
Proficiency in Python and data-centric tooling.

## Success Metrics

Data from disparate sources can be reliably normalized, compared, and linked.
Retrieval and knowledge systems measurably improve AI accuracy, grounding, and explainability.
Clear statistical and mathematical rigor is evident in data pipelines and documentation.
Strong alignment and trust between data science, ML, and engineering teams.
Reduced friction and faster iteration across AI research and production.

## Physical Demands

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. While performing the duties of this job, the employee is regularly required to speak, hear, read, and type. This is largely a sedentary role; however, some shipping may be required. This position requires the ability to occasionally lift office products and supplies up to 40 pounds.

## Work Environment

To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed above are representative of the knowledge, skill and/or ability required.

Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties, or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change at any time, with or without notice.

Ready to apply?

You'll be redirected to Onpoint Healthcare Partners Inc's application page.