
PySpark Data Engineer with Databricks
Role summary
We are seeking a mid-to-senior level PySpark Data Engineer with Databricks expertise to design, build, and own production-grade data pipelines and platform components. The role requires strong skills in Python/PySpark, Databricks, and Snowflake, focusing on creating scalable, cost-efficient, and reliable data systems for analytics and machine learning. Key responsibilities include developing ETL/ELT pipelines, optimizing Spark jobs, implementing data quality and monitoring frameworks, managing ML lifecycles with MLflow, and building data ingestion and modeling solutions. The position also involves hands-on CI/CD implementation for data and ML pipelines.
Position Title: PySpark Data Engineer with Databricks
Location: New York, NY (Onsite/Hybrid)
Experience: 8+ years
Employment Type: Full-time with benefits
Note: Candidates must be able to attend an in-person interview at the New York location.
Job Description
We are looking for a hands-on mid-to-senior level PySpark Data Engineer with Databricks who can design, build, and own production-grade data pipelines and platform components. This role requires strong expertise in Python/PySpark, Databricks, and Snowflake, with a focus on building scalable, cost-efficient, and reliable data systems that support both analytics and machine learning use cases.
Key Responsibilities
- Design, develop, and maintain end-to-end ETL/ELT pipelines using Python and PySpark on Databricks.
- Optimize Spark jobs for performance, scalability, and cost-efficiency in production environments.
- Implement data quality frameworks, including validation, reconciliation, and anomaly detection.
- Build and manage orchestration workflows (Airflow / Databricks Workflows / equivalent).
- Implement pipeline monitoring, logging, alerting, and observability for reliable operations.
- Develop and operationalize ML workflows using MLflow (experiment tracking, model registry, packaging, deployment).
- Build scalable data ingestion and data modeling solutions for analytics and ML use cases.
- Collaborate with data scientists, platform teams, engineering stakeholders, and business partners.
Required Skills & Qualifications
- 8+ years of experience in data engineering with strong hands-on work in PySpark and Python.
- Deep experience with Databricks, Spark optimization, cluster tuning, and performance troubleshooting.
- Strong experience working with Snowflake or similar cloud data warehouses.
- Practical knowledge of workflow orchestration tools and dependency management.
- Solid understanding of data modeling, ingestion frameworks, and distributed systems architecture.
- Hands-on experience implementing CI/CD for data and ML pipelines.
- Strong experience with MLflow for managing the ML lifecycle.
- Excellent communication skills with the ability to work across engineering and business teams.
Nice-to-Have Skills
- Exposure to AI/LLM use cases, vector search, or RAG pipelines.
- Familiarity with Java-based services or microservices architecture.
- Knowledge of data governance, cataloging, and security practices.
Sample Capgemini interview questions
- 1. Design a system for managing distributed feature flags. (system design, medium)
- 2. Design a system for real-time processing of customer feedback. (system design, medium)
- 3. Develop a data processing engine for real-time analytics. (system design, medium)
- 4. Diameter of a Binary Tree: Find the diameter of a binary tree. Input: root = [1,2]. Output: 1. Explanation: The longest path is simply the single edge connecting the root node to its only child. (coding, medium)
- 5. Aggressive Cows: Maximize the minimum distance between aggressive cows in stalls. Input: stalls = [0,4,3,7,10,9], cows = 3. Output: 4. Explanation: Placing the cows at positions 0, 4, and 10 yields a maximum possible minimum distance of 4 between any two cows. (coding, medium)
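For the two coding questions above, a minimal Python sketch of one common approach to each is given here; it is not an official or expected answer. The names `TreeNode`, `build_tree`, `diameter_of_binary_tree`, and `aggressive_cows` are illustrative choices, and the sketch assumes the tree input is a level-order list (with `None` for a missing child) and that the diameter is counted in edges, as the sample output suggests.

```python
from collections import deque
from typing import List, Optional


class TreeNode:
    """Plain binary tree node (hypothetical helper for this sketch)."""

    def __init__(self, val: int, left: "Optional[TreeNode]" = None,
                 right: "Optional[TreeNode]" = None) -> None:
        self.val = val
        self.left = left
        self.right = right


def build_tree(values: List[Optional[int]]) -> Optional[TreeNode]:
    """Build a tree from a level-order list; None marks a missing child."""
    if not values or values[0] is None:
        return None
    root = TreeNode(values[0])
    queue = deque([root])
    i = 1
    while queue and i < len(values):
        node = queue.popleft()
        if values[i] is not None:
            node.left = TreeNode(values[i])
            queue.append(node.left)
        i += 1
        if i < len(values) and values[i] is not None:
            node.right = TreeNode(values[i])
            queue.append(node.right)
        i += 1
    return root


def diameter_of_binary_tree(root: Optional[TreeNode]) -> int:
    """Return the number of edges on the longest path between two nodes."""
    best = 0

    def height(node: Optional[TreeNode]) -> int:
        nonlocal best
        if node is None:
            return 0
        left = height(node.left)
        right = height(node.right)
        # The longest path passing through this node uses both subtrees.
        best = max(best, left + right)
        return 1 + max(left, right)

    height(root)
    return best


def aggressive_cows(stalls: List[int], cows: int) -> int:
    """Binary-search the largest gap that still lets all cows be placed."""
    positions = sorted(stalls)

    def can_place(gap: int) -> bool:
        # Greedy check: put each cow in the first stall at least `gap` away.
        placed, last = 1, positions[0]
        for p in positions[1:]:
            if p - last >= gap:
                placed += 1
                last = p
        return placed >= cows

    lo, hi = 0, positions[-1] - positions[0]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always shrinks
        if can_place(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


print(diameter_of_binary_tree(build_tree([1, 2])))
print(aggressive_cows([0, 4, 3, 7, 10, 9], 3))
```

The diameter function visits each node once. The cows function sorts the stalls and then runs a greedy feasibility check inside a binary search over the candidate gap, which works because any gap smaller than a feasible gap is also feasible.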