For our third project, McNulty, my group chose to work with a bank marketing campaign dataset (one of three pre-selected options we had to choose from) to predict who would say yes to a term deposit. To do this, we first built a MySQL database on a DigitalOcean cloud server to store the data.
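As a rough sketch of that staging step, here's how a CSV like this could be loaded into MySQL. The schema, column names, and connection details below are simplified placeholders, not our actual setup:

```python
# Assumed, simplified schema for the bank marketing data; the real dataset
# has more columns (this is a placeholder sketch, not our actual DDL).
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS bank_campaign (
    id INT AUTO_INCREMENT PRIMARY KEY,
    age INT,
    job VARCHAR(32),
    marital VARCHAR(16),
    education VARCHAR(32),
    subscribed VARCHAR(3)
);
"""

COLS = ["age", "job", "marital", "education", "subscribed"]

def row_to_insert(row):
    """Build a parameterized INSERT (pymysql-style %s placeholders) for one CSV row."""
    sql = "INSERT INTO bank_campaign ({}) VALUES ({})".format(
        ", ".join(COLS), ", ".join(["%s"] * len(COLS)))
    return sql, [row[c] for c in COLS]

# Against a live DigitalOcean droplet, this would run via something like:
# import pymysql
# conn = pymysql.connect(host="<droplet-ip>", user="...", password="...", db="bank")
# with conn.cursor() as cur:
#     cur.execute(CREATE_TABLE)
#     sql, params = row_to_insert(some_row)
#     cur.execute(sql, params)
```

Parameterized inserts (rather than string formatting the values in) keep the load safe against stray quotes in the text columns.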

Next, we trained and tested multiple types of supervised learning classifiers (Logistic Regression, Decision Trees, KNN, SVM, Random Forest) to predict whether a person would subscribe to a term deposit. Please see my blog posts for more details on the specific algorithms. In our case, logistic regression performed the best.
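The comparison loop looked roughly like this, sketched here with scikit-learn on synthetic data (via `make_classification`, since the real features aren't reproduced here) and default model settings rather than our tuned ones:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the bank features/target (the real data had demographic and
# campaign columns plus a yes/no term-deposit label).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# Mean 5-fold cross-validation accuracy for each model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Cross-validation scores rather than a single train/test split keep the comparison from rewarding one lucky split.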

Finally, each person in our group had to come up with a different d3 visualization of the data. I chose to look at the group of people who said yes to the term deposit, broken down by various demographics.

You can click on each of the groups to zoom in and see the count for each individual section. For example, if you click on the Marital Status bubble and hover over the word Single, you'll see that out of the 5,289 people who said yes to a term deposit, 1,912 of them were single.

Thanks to d3 guru Mike Bostock for the code.



For project Luther, we focused on prediction with linear regression models using movie data. We each had to come up with an initial idea, scrape the data, and perform an analysis using Box Office Mojo and any additional sources we wanted. To scrape the data we used the Python Requests and Beautiful Soup libraries. For the analysis and graphing I used some of the following Python packages: Pandas, NumPy, StatsModels, and Matplotlib. My code for the project can be found here.
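The Requests + Beautiful Soup pattern is simple enough to sketch in a few lines. The HTML below is invented for illustration; it is not Box Office Mojo's actual markup:

```python
from bs4 import BeautifulSoup
# import requests  # on a live page: html = requests.get(url).text

# Invented stand-in for a table of movie links on a scraped page.
html = """
<table>
  <tr><td><a href="/movies/?id=birdman.htm">Birdman</a></td></tr>
  <tr><td><a href="/movies/?id=boyhood.htm">Boyhood</a></td></tr>
</table>
"""

# Parse once, then pull out every (title, link) pair.
soup = BeautifulSoup(html, "html.parser")
movies = [(a.text, a["href"]) for a in soup.find_all("a")]
print(movies)
```

In practice each page fetch goes through `requests.get` as in the commented line, with the response text handed to the same parsing code.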

For my project, I came up with the idea that the CSA (Casting Society of America) was my client and needed data science assistance to work with a small-budget studio whose goal was to produce an Oscar Best Picture nominee. However, the studio had a limited budget, so they wanted to know whether it was better to spend on a top actor, director, producer, etc. Then, just for fun, the studio wanted me to attempt to predict the number of Oscar wins for this year's Best Picture nominees.

To start off, I found a really helpful page on Box Office Mojo that lists every Best Picture nominee by year. However, if you click on any of the movie links, you'll notice that it goes to just the Oscar-specific data for the movie. So I ended up making two sets of new URLs from the initial list: one to get the nominations and wins from that Oscar-specific page, and another that went to the movie's main page to grab the actors, director, producer, writers, and distributors. I did this for every Best Picture nominee since 2000. I chose 2000 as a cutoff because specific movie interests change over time; for example, a 1985 Best Picture nominee might not be a nominee in 2015. I just wanted to keep the sample relevant.
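Building the two URL sets from the one list of links amounts to a little string surgery per link. The path scheme below is a simplified placeholder, not Box Office Mojo's real URL structure:

```python
def build_url_pairs(oscar_links, base="http://www.boxofficemojo.com"):
    """For each Oscar-page link, keep one URL for the nominations/wins page
    and derive a second URL for the movie's main page.
    The "/oscar/movies/" vs "/movies/" split is a placeholder scheme."""
    pairs = []
    for link in oscar_links:
        oscar_url = base + link                                        # Oscar-specific page
        main_url = base + link.replace("/oscar/movies/", "/movies/")   # main movie page
        pairs.append((oscar_url, main_url))
    return pairs
```

Each pair then gets fetched and parsed separately: nominations and wins from the first URL, cast and crew from the second.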

After I scraped all of the data, I had six different data sets: one with the nominations and wins for each movie, and five others matching the actors, directors, and so on to each movie (all of this is clear and available in the code). For each of those five data sets I then grouped by person (actor, director, etc.) to see who appeared in the most Best Picture nominees from the last 15 years. I looked at each list and cut it off around what seemed like the top 5% for each category. I then made a list of all the movies that had what I deemed a top actor (and likewise for the other categories). Finally, I plotted nominations vs. wins for each of those new lists to see which R-squared was highest. Writers surprisingly ended up #1, but I'm pretty sure that's only because one of the writers in that group wrote The Lord of the Rings: The Return of the King, which won 11 Oscars(!!). So if you took that out, I believe distributors and actors would be the best bet in terms of an investment.
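The group-count-cutoff-fit steps above can be sketched with Pandas and NumPy. The names and numbers here are made up for illustration; the real code ran on the scraped tables:

```python
import numpy as np
import pandas as pd

# Toy stand-in for one scraped table: (person, movie) appearance pairs.
appearances = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B", "C", "D", "E"],
    "movie": ["m1", "m2", "m3", "m1", "m4", "m5", "m6", "m7"],
})

# Who appeared in the most Best Picture nominees?
counts = appearances.groupby("actor")["movie"].count().sort_values(ascending=False)

# Keep roughly the top of the list (the project eyeballed ~top 5% per category).
top = counts[counts >= counts.quantile(0.95)].index

# Movies featuring a "top" actor, then nominations-vs-wins fit quality.
top_movies = appearances[appearances["actor"].isin(top)]["movie"].unique()
noms_wins = pd.DataFrame(
    {"nominations": [9, 8, 6, 5], "wins": [4, 3, 1, 1]},  # made-up counts
    index=["m1", "m2", "m3", "m4"],
)
r = np.corrcoef(noms_wins["nominations"], noms_wins["wins"])[0, 1]
r_squared = r ** 2
```

The real project used StatsModels for the fits; squaring the correlation coefficient as above gives the same R-squared for a simple one-variable regression.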

To predict this year's results, I checked whether each nominee had a top actor, etc., in their film and re-ran the model. So my not-very-confident-at-all predictions of the total number of Oscar wins this year are:

2015 Oscar wins prediction (range)

  • Birdman: 3 to 5

  • The Grand Budapest Hotel: 2 to 4

  • The Imitation Game: 2 to 3

  • Boyhood: 1 to 2

  • American Sniper: 1 to 2

  • Whiplash: 1 to 2

  • The Theory of Everything: 1 to 2

  • Selma: 0 to 1

Final thoughts: I now know that my initial question was way too specific because the sample size is only 100 movies. It's also very hard to weigh what goes into an Oscar win because there are so many variables involved. However, I did find that the majority of Best Picture nominees are dramas released in the fall or summer. So keep that in mind when your next favorite movie comes out, to see if it's on the right track!


For our first project, Benson, we worked in groups and had to use MTA turnstile data to estimate the volume of people on the street, so that (theoretical) non-profits and companies could deploy street teams efficiently. Below is our proposal to our company as well as our final findings. (Please keep in mind that we only had one week for this project and that the goal was focused on exploratory data analysis and the iterative design process. If the PDFs aren't loading below, the links are here: Proposal and Presentation.)
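For the turnstile counts themselves, the key wrinkle is that the MTA entry fields are cumulative counters, so per-period volume comes from differencing consecutive readings. A toy sketch in Pandas, with the column names simplified from the real MTA schema:

```python
import pandas as pd

# Toy cumulative entry counts for one turnstile (the real MTA data is keyed
# on C/A, UNIT, SCP, and STATION, with readings roughly every four hours).
df = pd.DataFrame({
    "turnstile": ["T1"] * 4,
    "time": pd.to_datetime(["2015-01-05 00:00", "2015-01-05 04:00",
                            "2015-01-05 08:00", "2015-01-05 12:00"]),
    "entries": [1000, 1040, 1150, 1400],   # cumulative counter, not per-period
})

# Per-period volume = difference of consecutive cumulative readings,
# computed within each turnstile so counters never mix across devices.
df["volume"] = df.groupby("turnstile")["entries"].diff()
print(df[["time", "volume"]])
```

Grouping before differencing matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.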