PySpark Data Engineer with Databricks

New York, New York, United States | Hybrid | Full Time | $90,000–$100,000 /yr | Posted 2 months ago | Visa sponsorship available

Role summary

We are seeking a mid-to-senior level PySpark Data Engineer with Databricks expertise to design, build, and own production-grade data pipelines and platform components. The role requires strong skills in Python/PySpark, Databricks, and Snowflake, focusing on creating scalable, cost-efficient, and reliable data systems for analytics and machine learning. Key responsibilities include developing ETL/ELT pipelines, optimizing Spark jobs, implementing data quality and monitoring frameworks, managing ML lifecycles with MLflow, and building data ingestion and modeling solutions. The position also involves hands-on CI/CD implementation for data and ML pipelines.

Position Title: PySpark Data Engineer with Databricks

Location: New York, NY (Onsite/Hybrid)

Experience: 8+ Years

Employee Type: Full Time with Benefits

Note: Candidates must be able to attend an in-person interview at the New York location.

Job Description

We are looking for a hands-on mid-to-senior level PySpark Data Engineer with Databricks expertise who can design, build, and own production-grade data pipelines and platform components. This role requires strong expertise in Python/PySpark, Databricks, and Snowflake, with a focus on building scalable, cost-efficient, and reliable data systems that support both analytics and machine learning use cases.

Key Responsibilities

- Design, develop, and maintain end-to-end ETL/ELT pipelines using Python and PySpark on Databricks.
- Optimize Spark jobs for performance, scalability, and cost-efficiency in production environments.
- Implement data quality frameworks including validation, reconciliation, and anomaly detection.
- Build and manage orchestration workflows (Airflow / Databricks Workflows / equivalent).
- Implement pipeline monitoring, logging, alerting, and observability for reliable operations.
- Develop and operationalize ML workflows using MLflow (experiment tracking, model registry, packaging, deployment).
- Build scalable data ingestion and data modeling solutions for analytics and ML use cases.
- Collaborate with data scientists, platform teams, engineering stakeholders, and business partners.

Required Skills & Qualifications

- 8+ years of experience in data engineering with strong hands-on work in PySpark and Python.
- Deep experience with Databricks, Spark optimization, cluster tuning, and performance troubleshooting.
- Strong experience working with Snowflake or similar cloud data warehouses.
- Practical knowledge of workflow orchestration tools and dependency management.
- Solid understanding of data modeling, ingestion frameworks, and distributed systems architecture.
- Hands-on experience implementing CI/CD for data and ML pipelines.
- Strong experience with MLflow for managing the ML lifecycle.
- Excellent communication skills with the ability to work across engineering and business teams.

Nice-to-Have Skills

- Exposure to AI/LLM use cases, vector search, or RAG pipelines.
- Familiarity with Java-based services or microservices architecture.
- Knowledge of data governance, cataloging, and security practices.

Sample Capgemini interview questions

1. Design a system for managing a distributed feature flag system. (system design, medium)
2. Design a system for real-time processing of customer feedback. (system design, medium)
3. Develop a data processing engine for real-time analytics. (system design, medium)
4. Diameter of a Binary Tree: Find the diameter of a binary tree. Input: root = [1,2]. Output: 1. Explanation: The longest path is simply the single edge connecting the root node to its only child. (coding, medium)
5. Aggressive Cows: Maximize the minimum distance between aggressive cows in stalls. Input: stalls = [0,4,3,7,10,9], cows = 3. Output: 4. Explanation: Placing the cows at positions 0, 4, and 10 yields a maximum possible minimum distance of 4 between any two cows. (coding, medium)
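
For the two coding questions above, here is a minimal Python sketch of one common approach to each. It is not an official or expected answer. The names `TreeNode`, `diameter_of_binary_tree`, and `max_min_distance` are illustrative and do not come from the listing. The sketch assumes that `root = [1,2]` is a level-order encoding (node 1 with left child 2), that the diameter is counted in edges, and that 2 <= cows <= number of stalls.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TreeNode:
    # Hypothetical node type; the listing only gives a level-order array.
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def diameter_of_binary_tree(root: Optional[TreeNode]) -> int:
    """Return the longest path between any two nodes, counted in edges."""
    best = 0

    def height(node: Optional[TreeNode]) -> int:
        # Height in nodes of the subtree rooted at `node` (0 for an empty subtree).
        nonlocal best
        if node is None:
            return 0
        left = height(node.left)
        right = height(node.right)
        # The longest path passing through `node` uses left + right edges.
        best = max(best, left + right)
        return 1 + max(left, right)

    height(root)
    return best


def max_min_distance(stalls: List[int], cows: int) -> int:
    """Return the largest gap d such that `cows` cows fit at least d apart."""
    positions = sorted(stalls)

    def can_place(gap: int) -> bool:
        # Greedy check: put each cow in the first stall that is far enough
        # from the previously placed cow.
        placed, last = 1, positions[0]
        for p in positions[1:]:
            if p - last >= gap:
                placed += 1
                last = p
        return placed >= cows

    # Binary search on the answer: feasibility is monotone in the gap.
    lo, hi = 0, positions[-1] - positions[0]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper mid so the loop always makes progress
        if can_place(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# The two examples quoted in the listing.
print(diameter_of_binary_tree(TreeNode(1, TreeNode(2))))
print(max_min_distance([0, 4, 3, 7, 10, 9], 3))
```

The diameter function visits each node once. The recursive form may hit Python's recursion limit on very deep trees, where an iterative post-order traversal would be safer. The cows function sorts once and then runs a linear greedy check for each binary-search step.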
