Responsibilities:
1. Design, develop, and maintain large-scale data pipelines and infrastructure.
2. Implement ETL (Extract, Transform, Load) processes to extract data from various sources, transform it into a structured format, and load it into our data warehouse.
3. Work with Apache technologies such as Spark, Flink, Kafka, and Lucene to build scalable data processing systems.
4. Develop data processing applications using Java and/or Python.
5. Collaborate with data scientists and analysts to understand data requirements and develop data pipelines to meet those needs.
6. Work with large-scale infrastructure and ensure data pipelines are scalable, efficient, and reliable.
7. Ensure data quality and integrity by implementing validation, cleansing, and normalization techniques.
8. Deploy and operate data pipelines on cloud infrastructure, primarily AWS.
9. Stay up to date with new big data technologies and methodologies, and apply that knowledge to improve our data infrastructure.