
Performance Test Data Engineer
Role summary
We are seeking a Data Platform Engineer with a strong focus on Quality Assurance and Data Storage. This role involves designing, developing, and executing data validation and QA test strategies for large-scale data platforms, including ETL/ELT pipelines and data lakes. Responsibilities include performing end-to-end data validation, validating large datasets using SQL and Python, and ensuring data quality, accuracy, and performance across distributed environments. The ideal candidate will have hands-on experience with data lakes (S3, ADLS, HDFS), various data formats (Parquet, ORC, Delta Lake), and automated testing frameworks. Experience with cloud platforms like AWS or Azure is required.
Job Description: Data Platform Engineer (QA + Storage Focus)
Role Overview
We are looking for a Data Platform Engineer with strong QA and Data Validation experience to support large-scale data platforms. The ideal candidate will have hands-on experience in testing data pipelines, validating data lakes/storage systems, and ensuring data quality, accuracy, and performance across distributed environments.
Key Responsibilities
- Design, develop, and execute data validation and QA test strategies for ETL/ELT pipelines
- Perform end-to-end data validation between source systems and target data platforms (Data Lake / Data Warehouse)
- Validate large-scale datasets (millions/billions of records) using SQL, Python, and PySpark
- Perform file-level and storage validation across data lakes (S3 / ADLS / HDFS):
  - File count validation
  - Schema validation
  - Partition validation
  - Data completeness checks
- Test and validate data ingestion pipelines (batch & streaming)
- Validate data across Bronze / Silver / Gold layers (Medallion architecture)
- Perform data reconciliation and consistency checks across multiple systems
- Develop and maintain automated data validation frameworks using Python (PyTest or similar)
- Implement and monitor data quality checks (nulls, duplicates, referential integrity)
- Validate data formats such as Parquet, ORC, Delta Lake
- Conduct performance testing of data pipelines and queries (Spark / SQL)
- Analyze and validate data processing performance, latency, and throughput
- Collaborate with Data Engineers to identify and fix data issues and optimize pipelines
Required Skills
Data QA / Testing
- Strong experience in ETL/ELT testing and data validation
- Expertise in SQL for data validation and reconciliation
- Experience with test case design, execution, and defect tracking
- Knowledge of data quality frameworks and validation techniques
Data Engineering Knowledge
- Understanding of data pipelines (ADF / Airflow / Glue / Databricks)
- Experience with PySpark / Apache Spark (basic to intermediate)
- Familiarity with data modeling and transformations
Storage / Data Lake Validation (MANDATORY)
- Hands-on experience with Data Lakes (AWS S3 / Azure ADLS / HDFS)
- Strong knowledge of:
  - File-based validation
  - Partitioning strategies
  - Schema evolution
- Experience validating Parquet / ORC / Delta Lake datasets
Programming & Tools
- Python (for automation/testing)
- SQL (strong)
- Experience with PyTest / automation frameworks
- Git / CI-CD basics
Cloud Platforms (Any One)
- AWS (S3, Glue, Athena) OR
- Azure (ADLS, ADF, Databricks)
Nice to Have
- Experience with Great Expectations / Deequ (data quality tools)
- Knowledge of Kafka / streaming validation
- Experience with Delta Lake features (time travel, versioning)
- Exposure to data governance tools (Glue Catalog, Unity Catalog)
Ideal Candidate Profile
- Strong Data Engineer with QA/testing experience
- Hands-on with data validation + storage systems
- Comfortable working with large-scale distributed data platforms
- Detail-oriented with a focus on data accuracy, quality, and performance