-
Experience
Owl Rock Capital Partners LLC. | Data Analyst Co-op | Boston Sep. 2019-Dec. 2019
• Extracted dead/passed deal data from multiple sources (the original database, emails, Excel files, and text files)
• Designed a database in SQL Server and transformed the data (cleansing, validation, etc.) with SQL queries
• Loaded the data into the CRM, resulting in an 80% time reduction for the company, and performed ad hoc analysis
• Trained an unsupervised machine learning model (K-means clustering) on the dead-deal dataset and extracted insights to support the investment team's decisions
-
Projects
Bank Customer Churn Prediction | Python, Google Colab
• Preprocessed the data through cleaning and categorical feature transformation (one-hot encoding and label encoding)
• Trained supervised machine learning models, including Logistic Regression, K-Nearest Neighbors, Random Forest, and SVM, and applied regularization with tuned parameters to prevent overfitting
• Identified Random Forest as the best-performing model, with an accuracy of 0.86 and an AUC of 0.84
• Evaluated feature importance and identified the top factors that influence customer retention
Movie Recommendation System | Apache Spark, Python, SQL
• Conducted exploratory data analysis with Spark SQL on a movie dataset containing 9,000 movies and 100,000 ratings
• Trained an Alternating Least Squares (ALS) model, tuned its hyperparameters by grid search, and evaluated it via cross-validation, yielding an optimal model with RMSE = 0.89, a 70% decrease compared to the baseline model
• Deployed the recommender to recommend 5 movies to each user and find 10 similar movies for each movie
CMS Medicare Data Warehouse | SQL Server, SSIS, SSAS, Tableau
• Designed a snowflake schema in Toad Data Modeler for four massive datasets extracted from the CMS database
• Built an ETL (Extract, Transform, Load) pipeline with Microsoft SSIS and populated the data warehouse's fact tables and dimensions using SSIS and SSMS
• Created OLAP cubes using SSAS to calculate KPIs such as the total number of beneficiaries for each drug
• Built interactive data visualization dashboards in Tableau and drew insights from them
San Francisco Crime Analysis and Prediction | Apache Spark, Python, SQL
• Built a data processing pipeline on Spark RDDs and Spark SQL for big-data OLAP over a 15-year dataset
• Explored and visualized the spatial and temporal distribution of crimes with PySpark, then derived suggestions to help the police fight crime efficiently and to help travelers avoid criminal activity
• Performed K-means clustering for spatial analysis using Spark ML
• Trained an ARIMA model to predict the monthly number of criminal incidents