Ujjawal Madan

MS in Analytics Candidate
Georgia Tech

Musician to Data Scientist...

Hi! My name is Ujjawal and I am currently an on-campus student in the MS in Analytics Program at Georgia Tech. As the title suggests, I have a pretty unconventional background. As a student, my two biggest interests were math and music, a not-so-unusual combination. After being accepted to Juilliard for my undergraduate studies, I decided to fully commit myself to becoming a professional musician, specifically an orchestral player. However, during an internship in arts administration, I found myself working extensively with the analytics team on trending, modelling and forecasting. Enthralled by the capability and potential of analytics, as well as its evolving role across industries, I found myself naturally transitioning into data science. Currently, my priorities are to expand my data science skill set and toolkit, seek opportunities to apply what I know, and find new ways to solve high-impact problems.

  • ML
  • Python
  • Numpy
  • Pandas
  • Keras
  • TensorFlow
  • R
  • dplyr
  • ggplot
  • Caret
  • HTML/CSS/JS
  • d3.js
  • Descriptive Analytics
  • Predictive Analytics
  • Reporting and Visualization
  • Power BI
  • Web Applications
  • NLP
Good ol' Resume (pdf) 

Technical Skills

Python
85%
R
85%
SQL
80%
Power BI
78%
d3.js
70%
Web Scraping (Scrapy, Splash, Selenium)
68%
HTML/CSS/JS
65%
TensorFlow
33%
Arena Simulation Software
26%

Professional Skills

  • Problem Solving
  • Analytical Thinking
  • Communication
  • Creativity

Portfolio

  • All Categories
  • Personal Projects
  • GaTech

Identifying Dental Shortage Areas in the United States

Project Description

Three other MSA students and I completed this project as part of the Georgia Tech CSE 6242 course requirements, under the supervision of Dr. Nicoletta Serban of the ISyE department. Our goal was to identify dental shortage areas in the United States and produce both an analysis and a comprehensive visual analytics tool that policy makers can use to investigate such areas. The tool is scheduled to go into production sometime during 2021; my own role was focused solely on the web development side, single-handedly building the state-level visualization (Figure 2) using d3.js. In addition to receiving 100 percent on the project, we got a lot of positive feedback from our research advisor, and I am personally really excited to see this go into production!

  • Python
  • HTML/CSS/JS
  • d3.js
  • Flask
Live Demo   Poster Presentation  

Poster Presentation

Figure 1 - Country Level View (Data is continually being added, hence the missing states)

Figure 2 - State Level View

Duolingo - Maximizing Resubscription Rate

Description

As part of one of the technical rounds for a data scientist interview at Duolingo, I was asked to complete this project within a 48-hour period. The goal was to investigate the factors that drive the resubscription rate among users who are offered a trial, and then to give recommendations from a business perspective on what should be done to maximize that rate.

  • Python
  • Pandas
  • Matplotlib
  • Plotly
Jupyter Notebook   Final Report  

"Full-Stack" Multi-Class Sentiment Analysis with Deep Learning

Description

In this project, I build a sentiment analysis tool trained on the comments and ratings from ratemyprofessors.com, a website that allows students to rate schools and professors and submit comments. I further develop a web application which returns a rating prediction based on a comment submitted by the user.

Motivation

The primary motivation for this project was that I wanted to build a "full-stack" or "end-to-end" project, whereby I go through the entire data science lifecycle from data acquisition to deployment. Given that perfectly curated datasets are practically non-existent in the real world, and wanting to create an interface for users to interact with the models I built, a sentiment analysis tool based on ratemyprofessors.com data seemed ideal. Moreover, there did not seem to be any other projects like this on the web, and very little analysis has been conducted on the site's data, given that currently the only way to obtain it is to scrape it.

Business Value

Sentiment analysis provides value in a wide variety of contexts and scenarios and is very commonly used in social media, marketing and other facets of analytics. Multi-class labelling only makes the analyses more refined, allowing organizations, for example, to prioritize analyses from the lowest-rated comments to the highest-rated ones. Specific to this project, a website such as ratemyprofessors.com could predict in real time what rating a user will give based on the comment they entered and automatically select that prediction as the default rating (making the process more convenient for the user). An extension of this project could be to build a keyword detection model, which would allow for more detailed analyses and yield even more valuable information.

I think an interesting application of sentiment analysis, which I have yet to see, would be in the surveys requested after a call with a customer representative. Rather than filling out a tedious and cumbersome survey on the phone, customers could simply be asked to leave a message about their experience (which would take 10 to 20 seconds); a speech-to-text conversion could turn the message into text, and both sentiment analysis and keyword detection could then be performed on it. The feedback could even be stratified by customer representative, and a keyword search, for example, could provide each representative with more specific customer feedback. Having recently phoned American Airlines to modify an existing reservation, where the representative asked me to complete a survey of the form "press 1 if your experience was satisfactory, else press 0", I believe this approach would provide much more information to the company at little to no additional expense to the customer.

Process
  1. Obtaining the Data through Web Scraping - Using Scrapy and Selenium
  2. Preprocessing - Using Python, Numpy and Pandas
  3. Analysis - Exploratory Analysis with data from World University Rankings
  4. Building Models - Built and Fine-Tuned over 10 different NLP Models
  5. Evaluating Our Results
  6. Deployment - Using Heroku
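As a much simpler illustration of the modelling step than the deep models described above, a classical multi-class sentiment baseline can be sketched in a few lines. The comments, ratings and pipeline below are purely hypothetical, not taken from the project:

```python
# Illustrative multi-class sentiment baseline: TF-IDF + logistic regression.
# The comments and ratings below are made up; the actual project scraped
# ratemyprofessors.com and trained deep models in TensorFlow/Keras.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "Amazing lectures, super clear and helpful",
    "Terrible class, confusing and unfair grading",
    "Decent course, some lectures dragged on",
    "Best professor I have ever had",
    "Avoid at all costs, learned nothing",
    "Okay overall, average difficulty",
]
ratings = [5, 1, 3, 5, 1, 3]  # star ratings serve as multi-class labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(comments, ratings)

print(model.predict(["clear and helpful lectures"])[0])
```

A baseline like this is also a useful sanity check before training deep models: if TF-IDF plus a linear model can't beat chance, the labels may not reflect the text at all.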
Challenges

Overall, this has been the most time-consuming project I have undertaken, but also the most satisfying. Perhaps the experience gained from previous projects significantly helped, but from start to finish, I did not come across any significant or insurmountable obstacles.

I did, however, find myself feeling a bit apprehensive, given that the data I was working with was not tried and tested and that very few analyses had been done on it. Did I scrape it incorrectly? What if the comments don't reflect the ratings? Ultimately, I think this experience was a bit closer to a real-life project, where you may be the first person analyzing a dataset and the models trained on it may not yield the results one has come to expect from tried-and-tested data.


  • Scrapy
  • Splash
  • Selenium
  • Python
  • TensorFlow
  • Keras
  • HTML/CSS
  • Flask
Live Demo   Python Notebook   Github Repo

Sample Screenshot of Deployed Application on Heroku

IMDb Ratings Visualization - d3.js

Description

This visualization allows one to pick a show of their liking and view its episode ratings in chronological order. Hovering over an episode shows the episode information along with a screenshot from the episode; clicking on an episode redirects directly to its IMDb page. I recommend you explore the live demo (link at the bottom of the page) and see the ratings of your favourite shows!

Motivation

The true motivation for this project actually came from watching The Simpsons. As some of you may know, Simpsons episodes can sometimes be hit or miss, and I wanted to be able to go through the entire 20+ seasons without watching the low-rated episodes. This visualization allows me to do just that! I could even prioritize which episodes to watch, going from highest-rated to lowest-rated. Knowing in advance that Georgia Tech has a rigorous Data and Visual Analytics course which requires students to learn d3.js, I decided to learn d3 and implement a project in it while also accomplishing my goal of visualizing show ratings. As they say, two birds with one stone.

Business Value

The visualization itself can be utilized by avid TV viewers to monitor the ratings of their favourite shows, or even be used in a context like the one I described above. The scraped data itself has a lot of value and can be used in regression-based prediction tasks or more complex NLP analyses (using plot and review information). The data can also be aggregated with other sources (such as Rotten Tomatoes data), with which one could perform even more complex tasks and derive more impactful insights.

Process
  1. Web Scraping - Using Scrapy and Splash
  2. Preprocessing - Using R
  3. Visualization - Using HTML/CSS/JS and d3.js
  4. Deployment - Using Heroku
Challenges

One of the initial challenges was simply scraping the data while staying undetected. I used a proxy server and set manual delays so that my access to IMDb would not be revoked. Either IMDb was quite lenient or my safeguards were enough to slip past them. The biggest challenge, no doubt, was learning and programming this visualization in d3.js. As those who have used d3.js may agree, it is not very intuitive and requires manually coding many of the functions that we take for granted in visualization packages such as Matplotlib and ggplot. There were many lookups on Stack Overflow and many online tutorials. After several face-palms and long nights, however, I managed to complete it.
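For readers curious about the rate-limiting side of scraping, here is a hedged sketch of what polite Scrapy settings can look like. The specific values are illustrative assumptions, not the ones used in this project:

```python
# Hypothetical Scrapy settings for polite scraping. The delay values here
# are illustrative; tune them to the target site's tolerance.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,              # base delay (seconds) between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,   # jitter the delay to look less robotic
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "AUTOTHROTTLE_ENABLED": True,       # back off automatically if the server slows
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "ROBOTSTXT_OBEY": True,
}
```

These would typically live in a project's `settings.py` or be passed as `custom_settings` on a spider.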


  • Scrapy
  • Splash
  • R
  • HTML/CSS/JS
  • d3.js
Live Demo   Github Repo

The Simpsons Ratings

Sample Screenshot of my favourite Breaking Bad episode on the Deployed Application using Heroku

Orchestral Performance Trends

Description

This is a descriptive analytics project which is focused on uncovering trends and insights from classical music live-performance data. The data was aggregated from multiple sources, namely the New York Philharmonic and the Boston Symphony Orchestra.

Motivation

Having worked in the music industry for several years, and being a regular classical-music concertgoer, I was curious to uncover trends in artistic programming, composers, conductors and other live performance information.

Were some of the composers we celebrate today appreciated in their own time? Was there always an unsaid rule that concerts are to begin at 8pm? Are there works once fashionable that are not in style anymore, and on the flip side, are there works that are time-agnostic? These were just some of the questions I was itching to find out more about.

Business Context

In my opinion, data analysis is not used nearly as much by arts organizations as it could and should be. I can't tell you how many times I have gotten into discussions with peers, professors and classmates over how arts organizations should appropriately react to declining audiences, where even basic facts of the business are in debate. Data-based decision making building on this analysis (along with some predictive modelling) can help answer questions such as:

  1. Artistic Programming - What repertoire should we perform, and when?
  2. Determining Demand for Conductors and Soloists - What price are we willing to pay for conductors and soloists?
  3. Scheduling of Performances - What day and time are optimal, controlling for audience type and repertoire?
Data science can also be greatly utilized in both marketing and development (fundraising). Too often, I think, we in the music world over-rely on experience and make decisions heuristically (often based on our own assumptions). I think decision-makers could benefit from data-based evidence in conjunction with past experience and instinct.

Process
  1. Obtaining the Data - Webscraping using Scrapy
  2. Preprocessing
  3. Analysis - Pandas and Matplotlib
  4. Conclusion
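As a small illustration of the analysis step, the kind of pandas aggregation involved might look like the sketch below. The performance rows are invented, not actual New York Philharmonic or Boston Symphony data:

```python
# Illustrative descriptive analysis: count performances by day of week and
# by work. The data below is made up for demonstration purposes.
import pandas as pd

performances = pd.DataFrame({
    "date": pd.to_datetime([
        "1950-01-05", "1950-01-07", "1950-01-12",
        "1950-01-14", "1950-01-19", "1950-01-21",
    ]),
    "work": ["Beethoven 5", "Brahms 1", "Beethoven 5",
             "Mahler 2", "Brahms 1", "Beethoven 5"],
})

by_day = performances["date"].dt.day_name().value_counts()   # popular days
top_works = performances["work"].value_counts()              # popular works
print(by_day.head())
print(top_works.head())
```

The real analysis works the same way at scale: aggregate by season, weekday, hall or composer, then plot the resulting counts.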
Challenges

This project did not present any serious technical challenges or barriers. However, I do think it tested my critical thinking skills and pushed me to think of creative ways value could be extracted from this data. More of my time was devoted to asking myself how this data could be useful than to looking something up on Stack Overflow or Medium. Although I did this project mostly out of curiosity, and it was interesting just to discover facts about the history of live orchestral music, I ultimately believe that the purpose of this kind of analytics is to provide value to an organization. Building on the kind of thinking I used in this project, and expanding my perspective to ask even more pertinent questions, will, I think, make me a much better and more valuable data scientist.


  • Scrapy
  • Splash
  • Python
  • Pandas
  • Matplotlib
Python Notebook   Github Repo

Number of Concerts Per Season

Which day of the week was most popular?

Most Popular Symphonic Works

Hall Popularity by Month

Analytics Modeling - ISYE 6501

Description

Analytics Modeling - ISYE 6501 offered at Georgia Tech is an introduction to descriptive, predictive and prescriptive analytical modeling. There are a wide variety of topics covered including classification, clustering, change detection, data smoothing, validation, prediction, optimization, experimentation, decision making and others.

What I Learned

Given a business problem, I feel confident in my ability to translate it into an analytics problem, select an appropriate analytics-based solution, and build a preliminary solution in R. I also feel quite comfortable evaluating someone else's analytics-based solution and determining whether the conclusion drawn was reasonable.

Perhaps most important for any data practitioner is the ability to think critically and problem-solve. This course lays the groundwork for what tools are available to us as analytics professionals and how to use each appropriately and effectively.

Sample Topics Covered in Assignments
  1. Classification
  2. Unsupervised Learning
  3. Data Preparation
  4. Change Detection
  5. Time Series Models
  6. Regression
  7. Decision Making
  8. Variable Selection
  9. PCA
  10. Design of Experiments
  11. Probability-Based Models
  12. Optimization
  13. Simulation
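As one example of these topics, change detection with CUSUM (shown in the plot below) can be sketched in a few lines. The series and parameter values here are illustrative:

```python
# Minimal one-sided CUSUM change-detection sketch.
# mu is the baseline mean; C dampens noise; T is the detection threshold.
def cusum_detect(xs, mu, C=0.5, T=3.0):
    """Return the index where the cumulative sum first exceeds T, else None."""
    s = 0.0
    for i, x in enumerate(xs):
        s = max(0.0, s + (x - mu - C))  # accumulate only upward deviations
        if s >= T:
            return i
    return None

series = [10.1, 9.8, 10.0, 10.2, 12.5, 12.8, 13.1, 12.9]
print(cusum_detect(series, mu=10.0))  # flags shortly after the shift upward
```

Raising C makes the detector more tolerant of noise; raising T delays detection but reduces false alarms.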

  • R
  • Python
  • Arena
  • caret
  • ggplot
  • PuLP
Github Repo

Change Detection Plot - CUSUM

Principal Component Analysis - Sample Scree Plot and Cumulative Variance Plot

Arena Software Simulation of Airport Security

Adult Census Income Classification

Description and Motivation

This was the second of two Capstone Projects as part of the Harvard Data Science Professional Certification. The objective was to predict, based on certain factors, whether a person's income is less than or greater than $50,000. After having worked on a regression-based capstone project, I thought it would be interesting and quite useful to explore solutions to a classification-based problem.

Business Value

A classification model, the type of predictive model used to solve this analytical problem, is a cornerstone of supervised learning and is therefore extremely valuable in any business context. While no business problem was assigned to this dataset, a sample use case could be that the federal government is looking to contact households whose income is above a certain threshold (for tax reasons, for example). Given that false positives and false negatives each carry a cost, we can optimize our algorithms to minimize total cost.
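The cost-minimization idea can be sketched as a simple threshold search. The probabilities, labels and error costs below are hypothetical, not from the project:

```python
# Choose the classification threshold that minimizes total cost,
# given per-error costs. All inputs here are illustrative.
def best_threshold(probs, labels, fp_cost, fn_cost):
    """Scan candidate thresholds; return (threshold, minimum total cost)."""
    best_t, best_cost = 0.5, float("inf")
    for t in [i / 100 for i in range(1, 100)]:
        cost = 0
        for p, y in zip(probs, labels):
            pred = 1 if p >= t else 0
            if pred == 1 and y == 0:
                cost += fp_cost   # false positive
            elif pred == 0 and y == 1:
                cost += fn_cost   # false negative
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

probs  = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0,   0,   1,    1,   1,    1]
print(best_threshold(probs, labels, fp_cost=1, fn_cost=5))
```

Because false negatives cost five times as much here, the optimal threshold drops well below 0.5, accepting an extra false positive to avoid missing positives.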

Process
  1. Exploratory Data Analysis/Data Visualization
  2. Methods and Analysis - Feature Selection using RFE, Training 7 different classification-based models (with hyperparameter tuning)
  3. Evaluation of Models - Testing and Validation
  4. Conclusion
Challenges

The primary challenge for me, just as in the MovieLens project, was not to be overwhelmed by the sheer number of approaches and methods one could employ for each component of the methods and analysis section. There are a wide variety of ways, for example, to do feature selection (greedy algorithms, PCA, etc.), and there are even different versions and packages for the same machine learning algorithms (e.g. logistic regression). What method do I then use to tune hyperparameters (Bayesian optimization, random search, caret's tuneGrid, etc.)? How do I know which approach is best for this particular problem? Once again, I tried not to get dragged down by the insignificant minutiae of each model and approach while still remaining detail-oriented.


  • R
  • dplyr
  • ggplot2
  • caret
PDF Report   Github Repo

Sample Visualization of feature variables

Model Validation Performance Measures

MovieLens Rating Prediction

Description and Motivation

This was the first of two Capstone Projects as part of the Harvard Data Science Professional Certification. The objective: given various input features, create a model that predicts a movie's rating.

Business Value

A regression model, which is the type of predictive model used to solve this analytical problem, is a cornerstone of supervised learning and is therefore extremely valuable in any business context. The extension to this would be to build a recommender system. Accurate predictions of how much a user would like a particular movie would allow for more informed recommendations and thus more user engagement.

Process
  1. Exploratory Data Analysis/Data Visualization
  2. Methods and Analysis - Building and Training 12 Different Models
  3. Testing and Validation
  4. Evaluating Results and Conclusion
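As an illustration of the kind of baseline such a project might start from, here is a sketch of a global-mean-plus-effects model. The ratings are made up, and this is not necessarily one of the twelve models actually built:

```python
# Baseline rating predictor: global mean plus per-movie and per-user effects.
# The (user, movie, rating) triples below are invented for illustration.
from collections import defaultdict

ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u2", "m2", 2),
    ("u3", "m1", 4),
]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

movie_dev = defaultdict(list)
for _, m, r in ratings:
    movie_dev[m].append(r - mu)
b_i = {m: sum(v) / len(v) for m, v in movie_dev.items()}  # movie effect

user_dev = defaultdict(list)
for u, m, r in ratings:
    user_dev[u].append(r - mu - b_i[m])
b_u = {u: sum(v) / len(v) for u, v in user_dev.items()}   # user effect

def predict(user, movie):
    # Unknown users/movies fall back to the global mean.
    return mu + b_i.get(movie, 0.0) + b_u.get(user, 0.0)

print(round(predict("u1", "m1"), 2))
```

More sophisticated models (regularized effects, matrix factorization) then compete against this baseline on RMSE.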
Challenges

The primary challenge for me was actually not to get overwhelmed by the sheer number of algorithms and approaches one could employ for multivariate regression. Feeling very much like I was wading into an ocean, I had to remind myself not to get bogged down by each and every detail while still remaining methodical and focused.


  • R
  • dplyr
  • ggplot2
  • caret
PDF Report   Github Repo

Sample ggplot Visualization

Final Results