Onpoint Healthcare Partners Inc
Healthcare Consulting, Management Consulting

Tech - Staff Data Scientist, Clinical Data + AI

Irving, Texas, United States · Remote · Full Time · Staff · $185,000–$225,000 /yr · Posted 1 month ago

Category: Professional

Employment Type: Full Time

Location: US-Remote

Department: Technology

Reports To: Director of Engineering

Additional Info: Visa / work permit sponsorship is not available for this position

## Position Overview

We are seeking a Staff Data Scientist to own the design, normalization, and optimization of advanced data representation, retrieval, and knowledge systems that power AI and machine learning initiatives. This role is centered on vector databases, knowledge graphs, and retrieval-augmented generation (RAG) solutions, with deep responsibility for statistical rigor, mathematical correctness, and semantic normalization across disparate data sources. The role also owns clinical AI quality measurement, including rubrics, golden sets, and regression suites.

This individual serves as the architect of the data substrate, ensuring that heterogeneous data, originating from different systems, formats, scales, and semantics, can be reliably compared, linked, and reasoned over by downstream analytics and AI models. The role works in close partnership with the Staff AI Engineer to ensure data representations support robust, explainable, and scalable AI systems.

## Key Responsibilities

Mathematical & Statistical Foundations

  • Apply a strong mathematical foundation—including linear algebra, probability theory, statistics, and optimization—to data representation and analysis problems.
  • Use statistical methods to evaluate distributions, variance, correlation, uncertainty, and signal-to-noise characteristics across datasets.
  • Design normalization and transformation strategies grounded in mathematical correctness to preserve meaning while enabling comparison.
  • Ensure statistical assumptions are explicit, tested, and documented.

Data Flywheel from MAA Decisions

  • Standardize MAA edits/overrides into structured labels (change, reason, severity).
  • Build semi-automated labeling using rules, heuristics, and human QA.
  • Track inter-rater reliability to ensure label integrity.
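As an illustration of what tracking inter-rater reliability can look like in practice, the sketch below computes Cohen's kappa for two reviewers who labeled the same items. This is one common agreement statistic, chosen here as an assumption; the posting does not specify a metric. The function name `cohen_kappa` and the severity values `"minor"` and `"major"` are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical severity labels assigned by two reviewers to the same four edits.
reviewer_1 = ["minor", "major", "minor", "minor"]
reviewer_2 = ["minor", "major", "major", "minor"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 suggests the label definitions may need tightening before the labels are used for training or evaluation.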

Data Normalization & Cross-Source Comparability

  • Lead efforts to normalize data from disparate sources so that values, features, and representations can be meaningfully compared and combined.
  • Address differences in:
      • Scale, units, and measurement conventions
      • Schema and structural variation
      • Semantic meaning and context
      • Temporal granularity and resolution
  • Define canonical representations, reference models, and normalization pipelines for structured, semi-structured, and unstructured data.
  • Implement statistical and embedding-based techniques to align heterogeneous data into shared latent or semantic spaces.
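As a small illustration of handling scale and unit differences, the sketch below converts weights reported in different units to a canonical unit and then standardizes each source with z-scores. Per-source z-scoring is only one possible approach and is an assumption here, as are the helper names `to_kg` and `z_scores` and the sample values.

```python
from statistics import mean, pstdev

LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms


def to_kg(value, unit):
    """Convert a weight to kilograms (hypothetical helper; supports 'kg' and 'lb')."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"Unsupported unit: {unit}")


def z_scores(values):
    """Standardize values to zero mean and unit variance (population std dev)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # No spread in this source; every value sits at the mean.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical readings: source A reports kilograms, source B reports pounds.
source_a = [to_kg(v, "kg") for v in [60.0, 70.0, 80.0]]
source_b = [to_kg(v, "lb") for v in [130.0, 150.0, 170.0]]

# After unit conversion the raw values are directly comparable; z-scores
# additionally remove per-source offsets and spread.
normalized_a = z_scores(source_a)
normalized_b = z_scores(source_b)
```

Whether to standardize per source or against a shared reference population depends on whether differences between sources are noise or signal, which is the kind of assumption the role is expected to make explicit and document.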

PHI-Safe Data Operations

  • Build PHI-aware data pipelines with de-identification, access controls, retention, and audit trails.
  • Ensure all evaluation and training datasets are traceable and compliant.

Data Foundations & Quality Engineering

  • Perform advanced exploratory data analysis (EDA) to identify bias, missingness, anomalies, and structural inconsistencies.
  • Define, measure, and monitor data quality dimensions including completeness, consistency, accuracy, timeliness, and lineage.
  • Establish validation checks and statistical controls to ensure normalized and linked data remains trustworthy over time.
  • Document data assumptions, limitations, and known failure modes.

Vector Databases & Embedding Systems

  • Design and operate vector database architectures for semantic retrieval, similarity search, and AI grounding.
  • Define embedding strategies that allow heterogeneous data types (text, structured records, metadata) to coexist in comparable vector spaces.
  • Optimize retrieval performance, relevance, and statistical fidelity at scale.
  • Partner with ML scientists to ensure embeddings and retrieval align with model behavior and evaluation criteria.

Knowledge Graphs & Structured Knowledge Systems

  • Design and maintain knowledge graphs that encode normalized entities, relationships, and domain semantics.
  • Define ontologies and schemas that enable semantic consistency across data sources.
  • Support inference, traversal, and explainability by linking structured knowledge with unstructured content.
  • Maintain alignment between graph representations and vector-based retrieval systems.

RAG & Hybrid Retrieval Solutions

  • Architect retrieval-augmented generation (RAG) pipelines combining:
      • Vector similarity search
      • Knowledge graph traversal
      • SQL-based relational queries
      • Document and object storage
  • Evaluate hybrid retrieval strategies to improve grounding, reduce hallucinations, and increase contextual relevance.
  • Instrument retrieval systems with statistical metrics to measure coverage, relevance, and failure patterns.

Hybrid Data Architecture & Semantic Linking

  • Design hybrid data architectures that link:
      • Relational (SQL) databases
      • Vector databases
      • Knowledge graphs
      • Document stores and large text corpora
  • Implement entity resolution, canonical identifiers, and cross-system joins.
  • Enable bidirectional traceability between raw data, normalized forms, embeddings, and knowledge representations.

Production Monitoring and Drift Detection

  • Define and continuously monitor model and agent quality metrics segmented by tenant, EHR system, workflow, and cohort.
  • Build targeted drift alerting systems that focus on clinically meaningful failure modes, avoiding reliance on generic embedding drift signals.
  • Maintain an evolving error taxonomy and operate a weekly review loop to identify and address top failure modes, ensuring rapid feedback and continuous improvement.

Day-to-Day Tools & Practices

A successful Staff Data Scientist in this role routinely applies:

  • Rubrics and acceptance criteria for each AI agent workflow
  • Golden data sets and scenario suites
  • Statistical analysis and mathematical modeling
  • Advanced SQL and relational data design
  • Python-based data analysis (pandas, NumPy, SciPy)
  • Vector and graph databases and embedding pipelines
  • Data validation, transformation, and lineage tooling
  • Large-scale text and document processing workflows

## Required Qualifications

  • Advanced degree (PhD welcome; demonstrated shipping impact wins) in Data Science, Statistics, Applied Mathematics, Computer Science, or a related field.
  • Strong mathematical background, including statistics, probability, and linear algebra.
  • Extensive experience normalizing and integrating data from multiple heterogeneous sources.
  • Hands-on expertise with:
      • Vector databases and semantic retrieval
      • Knowledge graphs and graph data models
      • SQL and relational database systems
  • Proven ability to collaborate with ML scientists and engineers in production environments.
  • Proficiency in Python and data-centric tooling.

## Success Metrics

  • Data from disparate sources can be reliably normalized, compared, and linked.
  • Retrieval and knowledge systems measurably improve AI accuracy, grounding, and explainability.
  • Clear statistical and mathematical rigor is evident in data pipelines and documentation.
  • Strong alignment and trust between data science, ML, and engineering teams.
  • Reduced friction and faster iteration across AI research and production.

## Physical Demands

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. While performing the duties of this job, the employee is regularly required to speak, hear, read, and type. This is largely a sedentary role; however, some shipping may be required. This position requires the ability to occasionally lift office products and supplies up to 40 pounds.

## Work Environment

To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed above are representative of the knowledge, skill and/or ability required.

Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties, or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change at any time, with or without notice.
