I'm a Data Engineer turned Data Scientist who loves building Machine Learning and Deep Learning models and working with data pipelines. My latest project enables people to migrate data from a data warehouse to data lakes and later analyze it on the cloud. I'm currently pursuing an MS in Data Science to deepen my expertise in the field.
Skills:
• Programming Languages: Java (Object Oriented Programming), Python, R
• Platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP)
• Database: Oracle SQL/PLSQL, PostgreSQL, MySQL, Cassandra, Amazon Redshift, Google BigQuery
• Big Data Tools: Spark SQL, AWS EMR (Elastic Map Reduce)
• Software: Informatica PowerCenter 10.x, Tableau, TensorFlow, Apache Airflow
• Data Science: Machine Learning, Exploratory Data Analysis, Data Mining, Statistical Modelling & Analysis (ANOVA & Hypothesis Testing), Natural Language Processing (NLP) using RNNs and CNNs
-
Experience
• Data Scientist (S.O.S. Challenge Participant), TECHPOINT
Jun 2020 - Jul 2020
- Performed Exploratory Data Analysis on COVID-19 case data and determined the key contributing factors using Spark SQL.
- Built a time-series model using an LSTM to forecast county-wise COVID-19 cases, deaths, and tests required; visualized the results with statistical charts and Plotly geo maps for Indiana counties.
- Retrieved COVID-19-related news via the Google News API.
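As an illustration, a minimal plain-Python sketch of the sliding-window preparation an LSTM forecaster like the one above needs; the window size, horizon, and case counts here are illustrative assumptions, not the project's actual values:

```python
def make_windows(series, window=7, horizon=1):
    """Turn a 1-D series of daily counts into (input, target) pairs
    suitable for training a sequence model such as an LSTM."""
    pairs = []
    for i in range(len(series) - window - horizon + 1):
        x = series[i:i + window]                      # past `window` days
        y = series[i + window:i + window + horizon]   # next `horizon` days
        pairs.append((x, y))
    return pairs

# Example: daily case counts for one hypothetical county
cases = [1, 2, 4, 7, 11, 16, 22, 29, 37]
pairs = make_windows(cases, window=3, horizon=1)
# first pair: ([1, 2, 4], [7])
```

Each pair would then be fed to the LSTM as (history, next-day target).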
• Data Engineer, KPIT Technologies Ltd.
Jul 2016 - Feb 2019
- Developed and maintained ETL (Extract, Transform, Load) data pipelines, a data warehouse, and Business Intelligence (BI) dashboards accessed by 20,000+ end users worldwide, following Agile methodology.
- Extracted data from several source systems, then consolidated and organized the transformed data using data modeling techniques.
- Single-handedly delivered a value-add that auto-alerts users on system failure by generating a Root Cause Analysis (RCA) of inconsistent BI reports, reducing manual inspection by 97%.
- Optimized, automated, and revamped pain points in the data infrastructure for efficient retrieval from ~54M records, using partitioning, indexing, query optimization, and stored-procedure tuning.
- Mastered the project architecture and data model within one year; trained several new team members as the subject-matter expert (SME).
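A simplified sketch of the kind of consistency check an auto-alerting RCA job might run; report names, the count-comparison heuristic, and the tolerance parameter are all illustrative assumptions, since the real pipeline compared warehouse-level metrics:

```python
def find_inconsistent_reports(report_counts, source_counts, tolerance=0.0):
    """Flag BI reports whose row counts diverge from their source tables.
    Returns a list of (report, reason) alerts for downstream notification."""
    alerts = []
    for report, n_report in report_counts.items():
        n_source = source_counts.get(report)
        if n_source is None:
            alerts.append((report, "missing source count"))
        elif abs(n_report - n_source) > tolerance * max(n_source, 1):
            alerts.append((report, f"report={n_report}, source={n_source}"))
    return alerts

alerts = find_inconsistent_reports({"sales": 100, "hr": 50},
                                   {"sales": 100, "hr": 48})
# -> [("hr", "report=50, source=48")]
```

Wiring a check like this into a scheduler replaces the manual inspection the bullet above mentions.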
-
Projects
• Music Data Lake on AWS
- Provisioned an AWS EMR cluster with the supplied AWS account credentials using Infrastructure-as-Code (IaC) and PySpark.
- Extracted data from JSON files, cleaned it, and modeled it into analytical data views.
- Analyzed the data using AWS Athena (serverless query service), with separate S3 buckets for source and output storage.
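A small stdlib-only sketch of the JSON extraction-and-cleaning step; the real project used PySpark over S3, and the field names (`song_id`, `duration`) are hypothetical stand-ins for the dataset's schema:

```python
import json

def load_song_records(json_lines):
    """Parse newline-delimited JSON (as in the source files) and keep
    only records that pass basic cleaning rules."""
    records = []
    for line in json_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                      # drop malformed lines
        if rec.get("song_id") and rec.get("duration", 0) > 0:
            records.append(rec)           # keep only complete, valid records
    return records

raw = ['{"song_id": "S1", "duration": 215.5}', 'not json', '{"song_id": null}']
records = load_song_records(raw)
# only the first record survives cleaning
```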
• Data-warehousing using Amazon Redshift
- Built an ETL pipeline that extracts data from S3 buckets, stages it in Redshift, and transforms it into a set of dimensional tables in the PostgreSQL-compatible database.
- Achieved optimal performance using a star schema along with appropriate distribution/partition strategies.
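To illustrate the star-schema idea, a minimal sketch using SQLite as a stand-in for Redshift/PostgreSQL; the table and column names are illustrative, not the project's actual schema:

```python
import sqlite3

# One fact table surrounded by dimension tables: the star-schema shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_song (song_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE fact_songplay (
        play_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES dim_user(user_id),
        song_id TEXT REFERENCES dim_song(song_id)
    );
    INSERT INTO dim_user VALUES (1, 'Ada');
    INSERT INTO dim_song VALUES ('S1', 'Blue');
    INSERT INTO fact_songplay VALUES (10, 1, 'S1');
""")

# Analytics queries join the fact table to its dimensions.
row = conn.execute("""
    SELECT u.name, s.title
    FROM fact_songplay f
    JOIN dim_user u ON f.user_id = u.user_id
    JOIN dim_song s ON f.song_id = s.song_id
""").fetchone()
# row == ('Ada', 'Blue')
```

In Redshift the same layout is paired with distribution and sort keys to keep these joins fast.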
• Data Pipeline Orchestration using Airflow
- Developed custom operators to stage data into Amazon Redshift, load fact and dimension tables, and validate results through data quality checks.
- Transformed data from various sources into a star schema optimized for the analytics team's use cases.
- Incorporated a workflow-scheduling mechanism that pauses execution until the source data is available.
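The core validation logic inside such a data-quality operator might look like the sketch below; the SQL strings and the `fetch_count` callable are hypothetical stand-ins for a real Redshift connection, and raising an exception is what fails the Airflow task:

```python
def run_quality_checks(fetch_count, checks):
    """Run each (sql, expectation) pair: fetch_count executes the SQL and
    returns a number; a failed expectation raises, failing the task."""
    for sql, expect in checks:
        result = fetch_count(sql)
        if not expect(result):
            raise ValueError(f"Data quality check failed for: {sql}")
    return True

# Stubbed counts in place of a live warehouse connection
counts = {"SELECT COUNT(*) FROM songs": 42,
          "SELECT COUNT(*) FROM users": 0}
checks = [("SELECT COUNT(*) FROM songs", lambda n: n > 0)]
run_quality_checks(counts.get, checks)   # passes; empty tables would raise
```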
• Exploratory Data Analysis of a Spotify Artist
- Identified the correlation between two audio attributes, Energy and Loudness; performed statistical analysis (confidence intervals, two-sample hypothesis tests, and ANOVA) to determine whether all albums share a similar mean.
- Fitted a linear regression model between the two attributes using R's lm function and then checked the linear-regression assumptions.
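For reference, the closed-form computation behind a simple `lm(y ~ x)` fit, sketched in plain Python; the energy/loudness values below are made up for illustration:

```python
def fit_simple_lm(x, y):
    """Closed-form simple linear regression (the analogue of R's
    lm(y ~ x)): returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical energy (x) vs. loudness in dB (y) values
intercept, slope = fit_simple_lm([0.2, 0.4, 0.6, 0.8],
                                 [-12.0, -9.0, -6.0, -3.0])
# intercept == -15.0, slope == 15.0 for this perfectly linear toy data
```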
• EDA of Malware Infected Devices using SparkSQL
- Inspected a large input file (~5 GB) using Microsoft Excel and Python, processing and cleaning the data for robust analysis.
- Drew conclusions by tracing the various factors contributing to malware infection.
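One way to process a file that size without loading it whole is to stream it row by row; a stdlib sketch, where the `HasDetections` column name is an assumption about the dataset's schema:

```python
import csv
import io

def count_infected(csv_file):
    """Stream a CSV row by row (constant memory, so a ~5 GB file is fine)
    and tally infected vs. total machines."""
    reader = csv.DictReader(csv_file)
    infected = total = 0
    for row in reader:
        total += 1
        infected += row.get("HasDetections") == "1"
    return infected, total

sample = io.StringIO("MachineId,HasDetections\nA,1\nB,0\nC,1\n")
result = count_infected(sample)
# result == (2, 3)
```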
• Melanoma classification using InceptionResNet
- Performed data-leakage checks, data augmentation, and minority-class oversampling using TensorFlow 2.0 on a TPU accelerator, with transfer learning and Convolutional Neural Networks (CNNs).
- Achieved 87% accuracy training on a highly imbalanced dataset of 108 GB.
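A plain-Python sketch of the minority-class oversampling idea (the real pipeline did this with tf.data on the TPU; the label key and class ratio here are illustrative):

```python
import random

def oversample_minority(records, label_key="target", seed=0):
    """Duplicate minority-class records at random until every class
    has as many records as the largest one."""
    random.seed(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    n_max = max(len(recs) for recs in by_label.values())
    balanced = []
    for recs in by_label.values():
        balanced.extend(recs)
        balanced.extend(random.choices(recs, k=n_max - len(recs)))
    return balanced

# 9:1 imbalance, mimicking benign vs. melanoma images
data = [{"target": 0}] * 9 + [{"target": 1}]
balanced = oversample_minority(data)
# both classes now have 9 records each
```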
• New York City Taxi Fare Prediction
- Performed Exploratory Data Analysis to determine the key factors in predicting the fare.
- Achieved a Mean Squared Error of about 0.12 using XGBoost with Randomized Search cross-validation for hyperparameter tuning.
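A minimal sketch of the idea behind randomized hyperparameter search (the project used sklearn's RandomizedSearchCV over XGBoost; the parameter space and toy objective below are invented for illustration):

```python
import random

def randomized_search(objective, param_space, n_iter=20, seed=0):
    """Sample random parameter settings and keep the one with the
    lowest score (here, a stand-in for cross-validated MSE)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for XGBoost cross-validation error
space = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}
objective = lambda p: abs(p["max_depth"] - 5) + abs(p["learning_rate"] - 0.1)
best, score = randomized_search(objective, space)
```

Unlike grid search, this samples a fixed budget of settings, which scales much better as the parameter space grows.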
• ArviZ: A library for Exploratory Analysis of Bayesian Models
- Contributed to the open-source arviz_example_data repository by converting several PyMC3 example models to the ArviZ format.
- Stored the models in the netCDF file format for future use by ArviZ library users.