Hi! My name is Ujjawal and I am currently an on-campus student in the MS in Analytics Program at Georgia Tech. As the title suggests, I have a pretty unconventional background. As a student, my two biggest interests were math and music, a not-so-unusual combination. After being accepted to Juilliard for my undergraduate studies, I decided to fully commit myself to becoming a professional musician, specifically an orchestral player. However, during an internship in arts administration, I found myself working extensively with the analytics team on trending, modelling and forecasting. Enthralled by the capability and potential of analytics, as well as its evolving role across all industries, I found myself naturally transitioning into data science. Currently, my priorities are to expand my data science skill set and toolkit, seek opportunities to apply what I know and find new ways to solve high-impact problems.
Three other MSA students and I completed this project as part of the Georgia Tech CSE 6242 course requirements, under the supervision of Dr. Nicoletta Serban of the ISyE department. Our goal was to identify dental shortage areas in the United States and produce both an analysis and a comprehensive visual analytics tool that policymakers can use to investigate such areas. The tool is going into production sometime during 2021, and my own role was focused solely on the web development side, single-handedly building the state-level visualization (Figure 2) using D3.js. In addition to receiving 100 percent on the project, we got a lot of positive feedback from our research advisor, and I am personally really excited to see this go into production!
Poster Presentation
Figure 1 - Country Level View (Data is continually being added, hence the missing states)
Figure 2 - State Level View
As part of one of the technical rounds for a data scientist interview at Duolingo, I was asked to complete this project within a 48-hour period. The goal was to investigate the factors that drive the resubscription rate among users who are offered a trial, and then to give recommendations from a business perspective on what should be done to maximize that rate.
In this project, I build a sentiment analysis tool trained on the comments and ratings from ratemyprofessors.com, a website that allows students to rate schools and professors and submit comments. I further develop a web application that returns a rating prediction based on a comment submitted by the user.
The primary motivation for this project was that I wanted to build a "full-stack" or "end-to-end" project, whereby I go through the entire data science lifecycle from data acquisition to deployment. Given that perfectly curated datasets are practically non-existent in the real world, and wanting to create an interface for users to interact with the models I built, a sentiment analysis tool based on RateMyProfessors data seemed ideal. Moreover, there did not seem to be any other projects like this on the web, and very little analysis has been conducted on RateMyProfessors data, given that currently the only way to obtain it is to scrape it.
Sentiment Analysis provides value in a wide variety of contexts and scenarios and is very commonly used in social media, marketing and other facets of analytics.
Multiclass labelling only makes the analyses more refined, allowing organizations, for example, to prioritize comments from lowest-rated to highest-rated.
Specific to this project, a website such as ratemyprofessors.com could predict in real time what rating the user will give based on the comment they entered and automatically select that prediction as the default rating, making the process more convenient for the user.
An extension of this project could be to build a keyword detection model which would allow for more detailed analyses and which would yield even more valuable information.
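As a rough illustration of the rating-prediction idea, here is a minimal sketch that scores a comment by averaging per-word mean ratings learned from labelled comments. The function names and tiny sample data are invented for illustration; this is not the model behind the deployed app.

```python
from collections import defaultdict

def train_word_ratings(comments, ratings):
    """Map each word to the mean rating of the comments it appears in."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, rating in zip(comments, ratings):
        for word in set(text.lower().split()):
            totals[word] += rating
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

def predict_rating(model, comment, default=3.0):
    """Average the learned per-word scores and round to the nearest star."""
    scores = [model[w] for w in comment.lower().split() if w in model]
    return round(sum(scores) / len(scores)) if scores else round(default)

comments = ["great professor clear lectures", "terrible grading unclear lectures",
            "great grading", "unclear terrible professor"]
ratings = [5, 1, 5, 1]
model = train_word_ratings(comments, ratings)
print(predict_rating(model, "clear lectures great grading"))  # → 4
```

A real model would of course use a proper classifier over a much larger vocabulary, but the word-to-rating association is the core intuition.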
I think an interesting application of sentiment analysis, which I have yet to see, would be in the surveys requested after you finish a call with a customer service representative.
Rather than asking customers to fill out a tedious phone survey, companies could simply ask them to leave a short message about their experience (taking perhaps 10 to 20 seconds); a speech-to-text conversion could turn the message into text, and both sentiment analysis and keyword detection could then be performed on it. The feedback could even be stratified by customer representative, and a keyword search, for example, could provide each representative with more specific customer feedback.
I recently phoned American Airlines to modify an existing reservation, and at the end I was asked to complete a survey that consisted of a single prompt: press 1 if your experience was satisfactory, else press 0. I believe the approach above would provide much more information to the company at little to no additional expense to the customer.
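The keyword-detection piece of that imagined pipeline could start from plain TF-IDF scoring over each representative's feedback. This is just a sketch with invented sample messages, not a production approach:

```python
import math
from collections import Counter

STOPWORDS = {"the", "was", "and", "but", "my", "a", "to"}

def tokens(doc):
    return [w for w in doc.lower().split() if w not in STOPWORDS]

def top_keywords(docs, k=3):
    """Rank each document's words by TF-IDF against the small corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(tokens(doc)))
    n = len(docs)
    out = []
    for doc in docs:
        tf = Counter(tokens(doc))
        scored = {w: tf[w] * math.log(n / df[w]) for w in tf}
        ranked = [w for w, s in sorted(scored.items(), key=lambda x: -x[1]) if s > 0]
        out.append(ranked[:k])
    return out

feedback = ["the agent was rude and slow",
            "the agent resolved my billing issue fast",
            "billing was slow but the agent was helpful"]
print(top_keywords(feedback)[0])  # → ['rude', 'slow']
```

Words common to every message (like "agent") score zero and drop out, which is exactly why the distinctive complaint words surface.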
Overall, this has been the most time-consuming project I have undertaken, but also the most satisfying.
Perhaps the experience gained from previous projects significantly helped,
but from start to finish, I did not come across any significant or insurmountable obstacles.
I did, however, find myself feeling a bit apprehensive, given that the data I was working with was not tried and tested and that very few analyses had been done on it. Did I scrape it incorrectly? What if the comments don't reflect the ratings?
Ultimately, I think this experience was a bit closer to a real-life project, where you may be the first person analyzing a dataset, and where models trained on that data may not yield the results one has come to expect from tried-and-tested data.
Sample Screenshot of Deployed Application on Heroku
This visualization allows one to pick a show of their liking and view its episode ratings in chronological order. Hovering over an episode shows the episode information along with a screenshot from that episode, and clicking on an episode redirects you directly to its IMDb page. I recommend you explore the live demo (link at the bottom of the page) and see the ratings of your favourite shows!
The true motivation for this project actually came from watching The Simpsons. As some of you may know, Simpsons episodes can sometimes be hit or miss, and I wanted to be able to go through the entire 20+ seasons without watching the low-rated episodes. This visualization allows me to do just that! I could even prioritize which episodes to watch, going from highest-rated to lowest-rated. Given that I knew in advance that Georgia Tech has a rigorous Data and Visual Analytics course which requires students to learn D3.js, I decided to learn and implement a project in D3 while also accomplishing my goal of visualizing show ratings. As they say, two birds with one stone.
The visualization itself can be utilized by avid TV viewers to monitor the ratings of their favourite shows, or even be used in a context like the one I described above. The scraped data itself has a lot of value and can be used in regression-based prediction tasks or in more complex NLP analyses (using the plot and review information). The data can also be aggregated with other sources (such as Rotten Tomatoes data), with which one could perform even more complex tasks and derive more impactful insights.
One of the initial challenges was simply scraping the data while staying undetected. I used a proxy server and set manual delays so that my access to IMDb would not be revoked; either IMDb was quite lenient, or my safeguards were enough to slip past them. The biggest challenge, no doubt, was learning and programming this visualization in D3.js. As those who have used D3.js may agree, it is not very intuitive and requires manually coding many of the functions that we take for granted in visualization packages such as Matplotlib and ggplot. There were many lookups on Stack Overflow and many online tutorials, but after several face-palms and long nights, I managed to complete it.
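The delay-and-retry safeguard can be sketched generically. Here the `fetch` callable is a stand-in for whatever proxied HTTP client one actually uses; the names and parameters are illustrative, not my actual scraper:

```python
import random
import time

def polite_fetch(urls, fetch, min_delay=1.0, max_delay=3.0, retries=3):
    """Call fetch(url) for each URL, pausing a random interval between
    requests and backing off exponentially when a request fails."""
    results = {}
    for url in urls:
        for attempt in range(retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                time.sleep(2 ** attempt)  # back off before retrying
        time.sleep(random.uniform(min_delay, max_delay))  # randomized politeness delay
    return results

# Demo with a fake fetcher and zero delays; a real run would pass a
# proxied HTTP GET and keep the 1-3 second pauses.
pages = polite_fetch(["ep1", "ep2"], lambda u: u.upper(), min_delay=0, max_delay=0)
print(pages)  # → {'ep1': 'EP1', 'ep2': 'EP2'}
```

Randomizing the pause (rather than sleeping a fixed interval) makes the request pattern look less mechanical to rate limiters.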
The Simpsons Ratings
Sample Screenshot of my own favourite episode from Breaking Bad on Deployed Application using Heroku.
This is a descriptive analytics project which is focused on uncovering trends and insights from classical music live-performance data. The data was aggregated from multiple sources, namely the New York Philharmonic and the Boston Symphony Orchestra.
Having worked in the music industry for several years, and being a regular classical-music concertgoer,
I was curious to uncover trends in artistic programming, composers, conductors and other live performance information.
Were some of the composers we celebrate today appreciated in their own time?
Was there always an unspoken rule that concerts are to begin at 8pm?
Are there works once fashionable that are not in style anymore, and on the flip side, are there works that are time-agnostic?
These were just some of the questions I was itching to find out more about.
In my opinion, data analysis is not being used nearly as much by arts organizations as it could and should be. I can't tell you how many times I have gotten into discussions with peers, professors and classmates over how arts organizations should be reacting to declining audiences, where even basic facts of the business are in debate. Data-based decision making building on this analysis (along with some predictive modelling) can help answer questions such as the ones explored below.
This project did not present any serious technical challenges or barriers. However, I do think it tested my critical thinking skills and pushed me to think of creative ways value could be extracted from this data. More of my time was devoted to asking myself how this data could be useful than to looking something up on Stack Overflow or Medium. Although I did this project more out of curiosity than anything else, and it was interesting just to discover facts about the history of live orchestral music, I ultimately believe that the purpose of this kind of analytics is to provide value to an organization. Building on the kind of thinking I used in this project, and expanding my perspective to ask myself even more pertinent questions, will, I think, make me a much better and more valuable data scientist.
Number of Concerts Per Season
Which day of the week was most popular?
Most Popular Symphonic Works
Hall Popularity by Month
Analytics Modeling - ISyE 6501, offered at Georgia Tech, is an introduction to descriptive, predictive and prescriptive analytical modeling. A wide variety of topics are covered, including classification, clustering, change detection, data smoothing, validation, prediction, optimization, experimentation, decision making and others.
Given a business problem, I feel confident in my ability to translate it into an analytics problem, select an appropriate analytics-based solution and build a preliminary solution in R.
I also feel quite comfortable evaluating someone else's analytics-based solution and determining whether the conclusions drawn were reasonable.
Perhaps most important for any data practitioner is the ability to think critically and problem-solve.
This course lays the groundwork for what kinds of tools are available to us as analytics professionals and how to use each appropriately and effectively.
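Of the topics listed above, CUSUM change detection is compact enough to sketch directly. This is the standard one-sided form for detecting an increase in the mean; the data and parameter values are made up for illustration:

```python
def cusum_detect(xs, mu, C, T):
    """One-sided CUSUM for detecting an increase in the mean:
    S_t = max(0, S_{t-1} + (x_t - mu - C)); flag when S_t > T."""
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + (x - mu - C))
        if s > T:
            return t  # index of the first detection
    return None  # no change detected

# Ten in-control points around 0, then a sustained shift upward:
data = [0.0] * 10 + [5.0] * 5
print(cusum_detect(data, mu=0.0, C=1.0, T=8.0))  # → 12
```

The slack parameter C absorbs ordinary noise, and the threshold T trades off detection speed against false alarms, which is exactly the tuning discussion the course walks through.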
Change Detection Plot - CUSUM
Principal Component Analysis - Sample Scree Plot and Cumulative Variance Plot
Arena Software Simulation of Airport Security
This was the second of two Capstone Projects as part of the Harvard Data Science Professional Certification. The objective was to predict, based on certain factors, whether a person's income is less than or greater than $50,000. After having worked on a regression-based capstone project, I thought it would be interesting and quite useful to explore solutions to a classification-based problem.
A classification model, the type of predictive model used to solve this analytical problem, is a cornerstone of supervised learning and is therefore extremely valuable in any business context. While there was no assigned business problem for this dataset, a sample use case could be that the federal government is looking to contact households whose income is above a certain threshold (for tax reasons, as an example). Given that there will be a cost to false positives and false negatives, we can optimize our algorithms to minimize total cost.
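That cost trade-off can be made concrete with a simple threshold scan over a classifier's predicted probabilities. The costs and scores below are invented for illustration, and this is a sketch of the idea rather than the capstone's actual tuning code:

```python
def best_threshold(probs, labels, cost_fp, cost_fn):
    """Pick the classification cutoff that minimizes total
    misclassification cost (false positives vs false negatives)."""
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(probs)) + [1.1]:  # candidate cutoffs, incl. "predict all negative"
        cost = sum(cost_fp if (p >= t and y == 0) else
                   cost_fn if (p < t and y == 1) else 0
                   for p, y in zip(probs, labels))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy example: a missed high-income household (false negative) costs
# five times as much as a wasted contact (false positive).
probs = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]
print(best_threshold(probs, labels, cost_fp=1, cost_fn=5))  # → (0.6, 0)
```

With asymmetric costs the optimal cutoff generally moves away from the default 0.5, which is why optimizing for accuracy alone can be the wrong objective.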
The primary challenge for me, just as it was in the MovieLens project, was not to be overwhelmed by the sheer number of approaches and methods one could employ for each component of the methods and analysis section. There are a wide variety of ways, for example, to do feature selection (greedy algorithms, PCA, etc.), and there are even different versions and packages for the same machine learning algorithms (e.g. logistic regression). What method do I then use to tune hyperparameters (Bayesian optimization, random search, grid search, etc.)? How do I know which approach is best for this particular problem? Once again, I tried not to get dragged down by the insignificant minutiae of each model and approach while still remaining detail-oriented.
Sample Visualization of feature variables
Model Validation Performance Measures
This was the first of two Capstone Projects as part of the Harvard Data Science Professional Certification. The task: given various input features, build a model that predicts the movie rating.
A regression model, which is the type of predictive model used to solve this analytical problem, is a cornerstone of supervised learning and is therefore extremely valuable in any business context. The extension to this would be to build a recommender system. Accurate predictions of how much a user would like a particular movie would allow for more informed recommendations and thus more user engagement.
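A common baseline for this kind of rating prediction is a global mean plus additive movie and user effects. The sketch below, with toy data and invented names, illustrates that idea rather than reproducing my actual capstone code:

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Fit r_hat = mu + b_movie + b_user from (user, movie, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    m_sum, m_n = defaultdict(float), defaultdict(int)
    for _, movie, r in ratings:
        m_sum[movie] += r - mu
        m_n[movie] += 1
    b_m = {m: m_sum[m] / m_n[m] for m in m_sum}  # movie effect
    u_sum, u_n = defaultdict(float), defaultdict(int)
    for user, movie, r in ratings:
        u_sum[user] += r - mu - b_m[movie]
        u_n[user] += 1
    b_u = {u: u_sum[u] / u_n[u] for u in u_sum}  # user effect
    return mu, b_m, b_u

def predict(mu, b_m, b_u, user, movie):
    """Unseen users or movies fall back to an effect of zero."""
    return mu + b_m.get(movie, 0.0) + b_u.get(user, 0.0)

ratings = [("ann", "m1", 5), ("ann", "m2", 3), ("bob", "m1", 4), ("bob", "m2", 2)]
mu, b_m, b_u = fit_baseline(ratings)
print(predict(mu, b_m, b_u, "ann", "m1"))  # → 5.0
```

A recommender system would build on exactly these residuals, for example by regularizing the effects or adding a matrix-factorization term.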
The primary challenge for me was actually not to get overwhelmed by the sheer number of algorithms and approaches one could employ for multivariate regression. Feeling very much like I was wading into an ocean, I had to remind myself not to get bogged down by each and every detail while still remaining methodical and focused.
Sample ggplot Visualization
Final Results