For Project Luther, we focused on making predictions with linear regression models using movie data. We had to come up with an initial idea, scrape data from Box Office Mojo and any additional sources we wanted, and then perform an analysis. To scrape the data we used the Python packages Requests and Beautiful Soup. For the analysis and graphing I used some of the following Python packages: Pandas, NumPy, StatsModels, and Matplotlib. My code for the project can be found here.
For my project, I imagined that the CSA (Casting Society of America) was my client and needed data science assistance to work with a small studio whose goal was to produce an Oscar Best Picture nominee. Because the studio had a limited budget, they wanted to know whether it was better to spend on a top actor, director, producer, etc. Just for fun, the studio also wanted me to attempt to predict the number of Oscar wins for this year’s Best Picture nominees.
To start off, I found this really helpful page on Box Office Mojo that lists all of the Best Picture nominees by year. However, if you click on any of the movie links, you’ll notice that it goes to just the Oscar-specific data for the movie. So I ended up making two sets of new URLs from the initial list: one to get the nominations and wins from that Oscar-specific page, and another that went to the movie’s main page to grab the actors, director, producer, writers, and distributors. I did this for every Best Picture nominee since 2000. I chose 2000 as a cutoff because tastes in movies change over time; for example, a film nominated for Best Picture in 1985 might not be nominated in 2015. I just wanted to keep the sample relevant.
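Here is a minimal sketch of that two-URL-set step. It is not the project code: the link patterns (`/oscar/movies/?id=...` and `/movies/?id=...`), the sample HTML, and the names `LinkCollector` and `build_url_sets` are all assumptions made for illustration. It also uses the standard library's `html.parser` instead of Requests and Beautiful Soup so that it runs without a network connection or extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

BASE = "http://www.boxofficemojo.com"

# Hypothetical snippet standing in for the nominee list page.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/oscar/movies/?id=birdman.htm">Birdman</a></td></tr>
  <tr><td><a href="/oscar/movies/?id=boyhood.htm">Boyhood</a></td></tr>
  <tr><td><a href="/about/">About</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_url_sets(html, base=BASE):
    """Return (oscar_urls, movie_urls) built from the list page's links."""
    parser = LinkCollector()
    parser.feed(html)
    oscar_urls, movie_urls = [], []
    for href in parser.hrefs:
        parsed = urlparse(href)
        movie_id = parse_qs(parsed.query).get("id")
        # Skip links that are not Oscar-specific movie pages (assumed pattern).
        if not parsed.path.startswith("/oscar/movies") or not movie_id:
            continue
        oscar_urls.append(urljoin(base, href))
        # Assumed pattern for the movie's main page: same id, different path.
        movie_urls.append(urljoin(base, "/movies/?id=" + movie_id[0]))
    return oscar_urls, movie_urls


oscar_urls, movie_urls = build_url_sets(SAMPLE_HTML)
```

In a real run, the HTML would come from something like `requests.get(list_page_url).text`, and each URL in the two lists would be fetched and parsed in turn.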
After I scraped all of the data, I had six different data sets: one that had the nominations and wins for each of the movies, and five others that had the actors, directors, and so on for each of the movies (all of this is clear and available in the code). For each of those five data sets, I grouped by person (actor, director, etc.) to see who appeared in the most Best Picture nominees from the last 15 years. I looked at each list and cut it off at roughly the top 5% for each category. I then made a list, per category, of all the movies that had what I deemed a top actor, top director, and so on. Finally, I plotted nominations against wins for each of those new lists to see which R-squared was highest. Writers surprisingly ended up #1, but I’m pretty sure that’s only because one of the writers in that group wrote the Lord of the Rings film that won 11 Oscars (!!). So if you took that out, I believe distributors and actors would be the best bet in terms of an investment.
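The grouping, cutoff, and R-squared steps can be sketched roughly as follows. Everything here is illustrative: the credits and Oscar counts are made up, the function names are hypothetical, and the cutoff is passed in as a plain appearance count because I picked it by eye. The project itself used Pandas and StatsModels; this version sticks to plain Python so that it is self-contained, and for a one-variable linear fit it computes R-squared the standard way, as one minus the ratio of residual to total variation.

```python
from collections import Counter

# Made-up (movie, person) credits; the real data came from the scrape.
credits = [
    ("Movie A", "Actor X"), ("Movie B", "Actor X"), ("Movie C", "Actor X"),
    ("Movie A", "Actor Y"), ("Movie D", "Actor Y"),
    ("Movie E", "Actor Z"),
]

# Made-up (nominations, wins) per movie.
oscars = {
    "Movie A": (8, 4), "Movie B": (5, 2), "Movie C": (10, 6),
    "Movie D": (6, 2), "Movie E": (4, 0),
}


def top_people(credits, min_appearances):
    """People credited on at least `min_appearances` nominated movies."""
    counts = Counter(person for _, person in credits)
    return {person for person, n in counts.items() if n >= min_appearances}


def movies_with(credits, people):
    """Sorted list of movies that credit at least one of `people`."""
    return sorted({movie for movie, person in credits if person in people})


def r_squared(xs, ys):
    """R-squared of the least-squares line ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


top_actors = top_people(credits, min_appearances=3)
top_actor_movies = movies_with(credits, top_actors)
nominations = [oscars[m][0] for m in top_actor_movies]
wins = [oscars[m][1] for m in top_actor_movies]
r2 = r_squared(nominations, wins)
```

Repeating the last five lines for directors, producers, writers, and distributors gives one R-squared per category to compare. With a sample this small, a single outlier can dominate the fit, which is what happened with the writers group.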
To predict this year’s results, I checked whether each nominee had a top actor, director, etc. in the film and re-ran the model. So my not-very-confident-at-all predictions for the total number of Oscar wins this year are:
2015 Oscar wins predictions:
Birdman: between 3 and 5
The Grand Budapest Hotel: between 2 and 4
The Imitation Game: 2 or 3
Boyhood: 1 or 2
American Sniper: 1 or 2
Whiplash: 1 or 2
The Theory of Everything: 1 or 2
Selma: 0 or 1
Final thoughts: I now know that my initial question was way too specific because the sample size is only 100 movies. It’s also very hard to pin down what leads to an Oscar win because there are so many variables involved. However, I did find that the majority of Best Picture nominees are dramas and are released in the fall or summer. So keep that in mind when your next favorite movie comes out to see if it’s on the right track!