We're back!

Hello, world! It’s been a while! I’ve recently brought my website back to life and updated my online presence. I have moved my portfolio to GitHub (see the link above). Currently there are two people analytics related projects and a few older analysis projects from grad school and the bootcamp. I hope to continue adding new content more regularly going forward!

Machine Learning: Supervised Learning pt. 2

Welcome back!

In the last post, we went over the basics of supervised machine learning. Next, we'll go over some of the algorithms you can use for classification: Logistic Regression, KNN, SVM, Decision Trees, and Random Forest.

Logistic Regression

First of all, logistic regression isn't named super well. When you think of regression, you'd probably assume it involves doing some type of regression, right? Wrong. Logistic Regression is a classification algorithm, like all of the algorithms below: it predicts what class or category a new data point belongs to using past data points. The main thing to remember for logistic regression is that it has to do with probability. When we talk about probability, we're talking about an estimate between 0 and 1, where 1 is a success and 0 is a failure. Logistic regression is used to predict the odds of something being a success, and by odds I mean the probability of it being a success divided by the probability of it being a failure, i.e., odds = p / (1 - p). To turn a predicted probability into a class, you pick a decision threshold: say your threshold is 95%; if the model estimates 96% for a new data point, you'd count it as a success, and as a failure if it came in below the threshold.

You use logistic regression when your Y, aka dependent variable, is binary (or dichotomous if you want to get all fancy), meaning the outcome is one of two options. For example, yes or no to the success of a medical treatment, or voting for or against a new law; you get the picture.
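To make this concrete, here's a minimal sketch of logistic regression with scikit-learn. The numbers are invented for illustration; pretend the single feature is the dosage a patient received and the label is whether the treatment succeeded:

```python
# A minimal, made-up logistic regression example using scikit-learn.
from sklearn.linear_model import LogisticRegression

X = [[1.2], [2.3], [3.1], [4.8], [5.0], [6.7]]  # dosage per patient (invented)
y = [0, 0, 0, 1, 1, 1]                          # 0 = treatment failed, 1 = succeeded

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(failure), P(success)] for each input;
# compare P(success) to whatever threshold you've chosen.
new_patient = [[4.0]]
print(model.predict_proba(new_patient))
print(model.predict(new_patient))  # predict() uses a 0.5 cutoff by default
```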

KNN

Let's go back and use the mango and apple example to explain how KNN works. Imagine a scatter plot with a bunch of data points, each categorized as either mango or apple. Now imagine our new mystery fruit appears at a random point in the scatter plot; how would we know if it's a mango or an apple? Well, we could use KNN, of course! KNN stands for K-Nearest Neighbors. It works by looking at the k nearest points to an unlabeled point and seeing which class occurs most among them in order to classify it. Back to our fruit example: let's say k = 3, so we'd look at the three closest data points to our mystery fruit. If 2 of those data points are mangos and 1 is an apple, then since mangos appear more frequently among those 3 points, we'd classify the mystery fruit as a mango! Logically, you might wonder, "That seems nice and all, but how do you know what k to choose?" Luckily, the Python library scikit-learn helps a lot with this. You can simply write a for loop that changes the value of k and keeps track of the accuracy of each, then choose the k that performs best, as in the sketch below (sidenote: it's good to keep k odd so there aren't any "ties" when you're looking at the closest data points).
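Here's roughly what that loop might look like. The data is a synthetic stand-in for our fruit measurements (generated with scikit-learn's make_classification), so treat it as a sketch rather than an exact recipe:

```python
# Sketch: pick k by trying odd values and keeping the most accurate one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for mango/apple measurements.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_k, best_score = None, 0.0
for k in range(1, 20, 2):  # odd k only, to avoid ties between the two classes
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)  # accuracy on held-out data
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, accuracy = {best_score:.2f}")
```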

SVM

Full disclaimer: I stole all of these images from here, as it's wonderful and simple and great.

Let's continue using our beloved mango and apple data set. Let's pretend it looks something like this:

Where the blue dots are mangos and the red dots are apples. We want to classify these into different groups. We could start by just trying to fit a line between the two groups like so:

But, as you can see, there are a number of different lines we could draw here that could technically make this work. Here's how SVM (support vector machines) comes to the rescue:

SVM works by choosing the line with the largest margin, i.e., the biggest gap between the line and the closest points on each side, like above. However, what happens when our points aren't nicely separated like so?

I'm glad you asked, because SVM can handle that as well. There's a pretty neat trick that SVM uses called kernels. A kernel transforms your data so that it becomes linearly separable, aka separable by a line like above. This works because data that isn't linearly separable in its original space often becomes separable once it's mapped into a higher-dimensional space. Don't freak out. It looks like this:

That green object-looking thing is called a hyperplane: it's the higher-dimensional version of the separating line. Now your data is nicely classified and separated. Voila, y'all:
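If you want to see the kernel trick in action without the pictures, here's a small sketch using scikit-learn's SVC on make_circles, a built-in toy dataset that no straight line can separate in two dimensions:

```python
# Sketch: a linear kernel vs. the RBF kernel on data a line can't separate.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # the kernel does the higher-dimensional mapping

print("linear kernel accuracy:", linear.score(X, y))  # struggles, near chance
print("RBF kernel accuracy:", rbf.score(X, y))        # close to perfect
```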

This is much longer than I anticipated, so I'm just going to make a part 3.

Machine Learning: Supervised Learning pt. 1

Hola, internet.

The last two weeks we've spent a lot of time going over a few different classification algorithms in machine learning, specifically within supervised learning. So far we've covered: Logistic Regression, Decision Trees, KNN, SVM, and Random Forest. I know all of these may sound really intimidating if you've never studied statistics or data science before, but I'm going to do my best to explain them all.

I figure it might be helpful to start from the beginning and give a brief intro to machine learning. So let's imagine you have two groups of fruits: mangos and apples. (Aren't mangos just the best?) Imagine one day you go to the grocery store and you see a fruit that you've never seen before; how would you know if it's a mango or an apple? (Imagine for the sake of this example that you know it has to be one of these two.) To start, you'd probably notice a few things about this new mystery fruit, for example: color, size, texture, and shape. After noticing these things, you'd then be able to say that this mystery fruit is an apple! Easy, right?

This basic example is the idea behind supervised machine learning: we're training a computer to make these distinctions. It's called supervised because we give the computer a big set of data that is already labeled correctly. For example, 1,000 measurements of mangos and apples with the various features (color, size, texture, and shape), so that when our new mystery fruit appears, the computer is able to predict it's an apple based on the dataset it has already seen. Unsupervised learning, by contrast, would be giving the computer 1,000 mystery fruits with no labels and asking it to group the fruits into categories automatically; see the difference? In the next post I'll go over some of the different classification algorithms that a computer can use to classify new items in order to make a prediction.
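To make the contrast concrete, here's a tiny sketch in scikit-learn. The fruit measurements are made up, and I've arbitrarily picked KNN for the supervised side and k-means for the unsupervised side:

```python
# Supervised: labeled examples in, a predicted label out.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Each row: [redness, diameter_cm] -- invented measurements.
X = [[0.9, 8.0], [0.8, 7.5], [0.2, 11.0], [0.3, 10.5]]
y = ["apple", "apple", "mango", "mango"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.85, 7.8]]))  # -> ['apple']

# Unsupervised: the same measurements with no labels; the model
# groups the points on its own and returns cluster ids, not fruit names.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g., [0 0 1 1]
```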


Two Week Recap!

Hello, there! I’m a little behind in the blogging department, so bear with me as I give a quick update on what’s been going down during these first two weeks of the bootcamp. Hopefully from now on I'll have at least one blog post per week.

Week 1

  • Learned about what data scientists do on a daily basis and about the iterative design process, which is the crux of how data scientists solve problems:
    • Figure out what the problem is you’re trying to solve (ask a lot of questions to get specific)
    • Take the problem and brainstorm, as a group, any and all possible solutions
    • Rank top options and prototype an idea through very basic sketches for the first three or so
    • Take your top few ideas and go back to your client and show them what you’re thinking of doing for their problem
    • Take the client feedback and repeat steps 2-5 until the problem is solved and the client is satisfied
  • We finished the first day learning about supervised machine learning while playing a game called Spot the Hipster. In groups of 3 we had to build a model (with no computer input) that would classify pictures of men as “hipsters” or “non hipsters”. My group did pretty well and came in second in the class, as we predicted 13 of the 15 new pictures accurately. Go team.      
  • Went over Git and GitHub and learned all about version control. Think of GitHub as a Dropbox/Google Drive type of site for coders, and Git as the tool you use to upload your code to the site and retrieve it again. The combo allows multiple users to collaborate on projects without anyone’s code getting overwritten. Cool stuff.
  • IPython notebooks are pretty great. They allow you to work on code in the browser and run it in nice little chunks instead of having to run your entire Python file via the terminal.
  • Finally, we completed Project Benson!

Week 2

  • Reviewed best practices for python coding #pythonic
  • Learned about web scraping using BeautifulSoup and Selenium (see the sketch after this list)
  • Started brainstorming and coming up with ideas for our next project, Luther
  • Introduced to some of the top Python statistical analysis packages: pandas, NumPy, scikit-learn, and statsmodels
  • Reviewed Bayesian probability and linear regression
  • The majority of our week was spent individually working on Project Luther. For this project we have to scrape Box Office Mojo (and whatever other sites we’d like), come up with a movie-related question, and answer it using movie data and linear regression. My idea is to predict how many total Oscars a movie will win in a given year, given that it’s a Best Picture nominee. I’ll update my findings in the portfolio section after I’m done next week!
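For the curious, here's the rough shape of a scraping workflow with requests and BeautifulSoup. The URL is just the Box Office Mojo homepage, and the generic tr/td selectors are placeholders; you'd inspect the actual pages to find the right tags and classes for the data you're after:

```python
# Sketch of the basic scraping loop; selectors are generic placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://www.boxofficemojo.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Walk every table row and print the text of its cells.
for row in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```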

General Feelings

  • It's incredible that two weeks have already passed; every day goes by very quickly, as there is always something to learn and do.
  • Our instructors weren't kidding when they said that Googling is a real skill. You can't find help if you don't know what to ask for!
  • WeWork is awesome because they give us free food.


"Trust that you are here for a reason."

This was one of the main points our instructors preached throughout our first day, and after going through introductions, I can certainly see why. Physicists, chemists, 20+ years of professional experience, Master’s and Doctorate degrees galore; the list of accolades and accomplishments goes on and on. There were even multiple people who had never lived in the US before but relocated specifically to be a part of this program. Ah-may-zing.

It would be easy to look around the room and feel very intimidated. In fact, I did. But the key thing to keep in mind, and the reason that quote stuck with me, is that our differences and diverse backgrounds only make us stronger. As a group learning with and from one another, and as individuals striving towards a common goal, everyone has something positive to contribute. It's a great and empowering feeling to be surrounded by so many brilliant people who share the same drive and goals. I can’t wait to see what the rest of Week 1 has in store.

First Post!

Hi!

Welcome to the blog. Exciting times are taking place! Here is where I'll be updating the interwebs on my progress through the bootcamp, as well as sharing other interesting tidbits I come across.

Stay tuned.