Machine Learning: Supervised Learning pt. 2

Welcome back!

In the last post, we went over the basics of supervised machine learning. Next, we'll go over some of the algorithms that you can use for classification, including: Logistic Regression, KNN, SVM, Decision Trees, and Random Forest.

Logistic Regression

First of all, logistic regression isn't named super well. When you think of regression, you would probably assume that it would involve doing some type of regression, right? Wrong. Logistic Regression is a classification algorithm, like all of the algorithms below: it predicts what class or category a new data point belongs to using past data points. The main thing to remember for logistic regression is that it has to do with probability. When we talk about probability, we're talking about an estimate in the range of 0 to 1, where 1 means a certain success and 0 means a certain failure. Logistic regression estimates the probability of something being a success by modeling the odds, and when I say odds I mean the probability of something being a success divided by the probability of it being a failure. To turn that probability into a class, you pick a cutoff (also called a threshold). So say your cutoff is 95%: if the new data point's predicted probability comes out to 96%, you would count it as a success, and if it came out below 95%, you would count it as a failure. (A cutoff of 50% is the more common default.)

You use logistic regression when your Y, aka dependent variable, is binary (or dichotomous if you want to get all fancy), meaning the outcome is one of two options. For example, yes or no to the success of a medical treatment, or voting for or against a new law. You get the picture.
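To make the probability, odds, and cutoff ideas a bit more concrete, here's a minimal sketch in plain Python. The weight and bias are hypothetical numbers I picked by hand (a real model would learn them from data), and the function names are just for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """One-feature logistic regression: probability that x is a 'success'."""
    return sigmoid(weight * x + bias)

def odds(p):
    """Odds = probability of success divided by probability of failure."""
    return p / (1.0 - p)

def classify(p, threshold=0.5):
    """Turn a probability into a 1 (success) or 0 (failure) using a cutoff."""
    return 1 if p >= threshold else 0

# Hypothetical, hand-picked coefficients (assumption: not learned from data).
WEIGHT, BIAS = 1.5, -3.0

p = predict_probability(4.0, WEIGHT, BIAS)  # z = 1.5 * 4.0 - 3.0 = 3.0
print(round(p, 3))                    # predicted probability of success
print(round(odds(p), 2))              # odds of success
print(classify(p, threshold=0.95))    # counted as a success at a 95% cutoff
print(classify(p, threshold=0.96))    # counted as a failure at a 96% cutoff
```

Notice that the same data point can flip from success to failure just by moving the cutoff, which is why choosing the cutoff is a decision about your problem and not something the algorithm hands you.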

KNN

Let's go back and use the mango and apple example to explain how KNN works. Imagine a scatter plot with a bunch of data points, where each point is categorized as either mango or apple. Now imagine our new mystery fruit appears at a random point in the scatter plot. How would we know if it's a mango or an apple? Well, we could use KNN, of course! KNN stands for K-Nearest Neighbors. It works by looking at the k nearest points to an unlabeled point and then seeing which label occurs the most among them in order to classify the unlabeled point. Back to our fruit example, let's say k = 3, then we would look at the three closest data points to our mystery fruit. Let's say that 2 of those data points are mangos and 1 is an apple. Since mangos appear more frequently within those 3 data points, we'd classify the mystery fruit as a mango!

Logically, you might wonder, "That seems nice and all, but how do you know what k to choose?" Luckily, the Python library scikit-learn helps a lot with this. You can simply make a for loop that changes the value of k and keeps track of the accuracy of each, then you can choose the k that performs best on data the model hasn't seen. (Sidenote: when there are two classes, it's good to keep k as an odd number so there aren't any "ties" when you're looking at the closest data points.)
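Here's a rough sketch of both ideas: the majority vote among the k closest points, and the for loop over different values of k. I wrote it in plain Python so it runs without installing anything (scikit-learn's `KNeighborsClassifier` does the same job for real projects). The fruit measurements are made up for illustration, and `knn_predict` and `accuracy` are hypothetical helper names.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Label new_point with the most common label among its k nearest neighbors."""
    # Sort the labeled points by straight-line (Euclidean) distance to the new point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of held-out test points that get the correct label."""
    hits = sum(knn_predict(train, point, k) == label for point, label in test)
    return hits / len(test)

# Made-up measurements: (weight in grams, length in cm) with a label.
train = [
    ((350, 12.0), "mango"), ((400, 13.5), "mango"),
    ((380, 12.5), "mango"), ((420, 14.0), "mango"),
    ((150, 7.0), "apple"), ((170, 7.5), "apple"),
    ((160, 7.2), "apple"), ((180, 8.0), "apple"),
]
# Held-out points used only for scoring each k.
test = [((390, 13.0), "mango"), ((165, 7.4), "apple"), ((300, 11.0), "mango")]

# Classify one mystery fruit with k = 3.
print(knn_predict(train, (360, 12.2), k=3))

# Try a few odd values of k and keep track of the accuracy of each.
for k in (1, 3, 5, 7):
    print(k, accuracy(train, test, k))
```

One caveat with this toy version: weight is measured in much bigger numbers than length, so it dominates the distance. In practice you'd usually scale your features first.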

SVM

Full disclaimer: stole all of these images from here as it's wonderful and simple and great.

Let's continue using our beloved mango and apple data set. Let's pretend it looks something like this:

Where the blue dots are mangos and the red dots are apples. We want to classify these into different groups. We could start by just trying to fit a line between the two groups like so:

But, as you can see, there are a number of different lines we could draw here that could technically make this work. Here's how SVM (support vector machines) comes to the rescue:

SVM works by putting the line where the gap on each side is the largest, like above. However, what happens when our points aren't nicely separated like so?

I'm glad you asked, because SVM can handle that as well. There's this pretty neat trick that SVM uses called kernels. A kernel is able to transform your data so that it's linearly separable, aka separable by a line like above. This is possible because data that can't be split by a line in its original space can often be split once you map it into a higher dimension. Don't freak out. It looks like this:

That green object-looking thing is called a hyperplane, which is the higher-dimensional version of the line from before. Now your data is nicely classified and separated. Voila, y'all:

This is much longer than I anticipated so I'm just going to make a part 3.
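If the pictures don't do it for you, here's a tiny sketch of the "higher dimensions" idea in plain Python. Fair warning: this lifts the data by hand with a made-up `lift` function, which is an assumption for illustration. A real SVM kernel gets the same effect without ever building the new coordinates explicitly (in scikit-learn, that's the `kernel` argument of `sklearn.svm.SVC`). The data points are invented too.

```python
def lift(x):
    """Map a 1-D point into 2-D by adding x squared as a second coordinate."""
    return (x, x * x)

def separable_by_cutoff(group_a, group_b):
    """True if a single cutoff puts all of group A on one side and group B on the other."""
    return max(group_a) < min(group_b) or max(group_b) < min(group_a)

# Made-up 1-D data: apples sit in the middle, mangos sit on both ends.
apples = [-1.0, 0.0, 1.0]
mangos = [-3.0, -2.5, 2.5, 3.0]

# On the original number line, no single cutoff splits the two groups.
print(separable_by_cutoff(apples, mangos))  # False

# After lifting, look at the new second coordinate (the "height").
apple_heights = [lift(x)[1] for x in apples]
mango_heights = [lift(x)[1] for x in mangos]
print(separable_by_cutoff(apple_heights, mango_heights))  # True

# Put the dividing line in the middle of the gap, so the gap on each side is the largest.
cutoff = (max(apple_heights) + min(mango_heights)) / 2
print(cutoff)  # 3.625
```

The horizontal line at that height plays the role of the hyperplane: everything below it is an apple, and everything above it is a mango.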

Machine Learning: Supervised Learning pt. 1

Hola, internet.

The last two weeks we've spent a lot of time going over a few different classification algorithms in machine learning, specifically within supervised learning. So far we've covered: Logistic Regression, Decision Trees, KNN, SVM, and Random Forest. I know all of these may sound really intimidating if you've never studied statistics/data science before, but I'm going to do my best to explain all of them.

I figure it might be helpful to start from the beginning and give a brief intro to machine learning. So let's imagine you have two groups of fruits: mangos and apples. (Aren't mangos just the best?) Imagine one day you go to the grocery store and you see a fruit that you've never seen before. How would you know if it's a mango or an apple? (Imagine for the sake of this example that you know it has to be one of these two.) To start, you'd probably notice a few things about this new mystery fruit, for example: color, size, texture, and shape. After noticing these things, you'd then be able to say that this mystery fruit is an apple! Easy, right?

This basic example is the idea behind supervised machine learning - we're training a computer to make these distinctions. It's called supervised because we are giving the computer a big set of data that is already labeled correctly. For example, 1,000 measurements of mangos and apples with the various features (color, size, texture, and shape), so that when our new mystery fruit appears, the computer is able to predict it's an apple based off of the dataset it has already seen. Unsupervised learning, on the other hand, would be handing the computer 1,000 mystery fruits with no labels and telling it to sort them into groups automatically. See the difference? In the next post I'll go over some of the different classification algorithms that the computer uses to classify new items in order to make a prediction.
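In code, the difference between the two kinds of data looks something like the sketch below. Everything here is made up for illustration: the weights, the `train` and `predict` helper names, and the super simple "learning" rule (compare a new fruit to the average weight of each label), which is a stand-in for the real algorithms and not one of them.

```python
def train(labeled):
    """'Learn' the average weight for each label from labeled examples."""
    totals = {}
    for weight, label in labeled:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + weight, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}

def predict(model, weight):
    """Pick the label whose average weight is closest to the new fruit's weight."""
    return min(model, key=lambda label: abs(model[label] - weight))

# Supervised: every measurement (weight in grams) comes with the right answer.
labeled = [
    (350, "mango"), (400, "mango"), (380, "mango"),
    (150, "apple"), (170, "apple"), (160, "apple"),
]

# Unsupervised: same kind of measurements, but nobody tells us what they are.
unlabeled = [350, 400, 150, 170]

model = train(labeled)
print(model)
print(predict(model, 165))  # our mystery fruit
```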

 

Two Week Recap!

Hello, there! I’m a little behind in the blogging department, so bear with me as I give a quick update on what’s been going down during these first two weeks of the bootcamp. Hopefully from now on I'll have at least one blog post per week.

Week 1

  • Learned about what data scientists do on a daily basis and about the iterative design process, which is the crux of how data scientists solve problems:
    • Figure out what the problem is you’re trying to solve (ask a lot of questions to get specific)
    • Take the problem and brainstorm, as a group, any and all possible solutions
    • Rank top options and prototype an idea through very basic sketches for the first three or so
    • Take your top few ideas and go back to your client and show them what you’re thinking of doing for their problem
    • Take the client feedback and repeat steps 2-5 until the problem is solved and the client is satisfied
  • We finished the first day learning about supervised machine learning while playing a game called Spot the Hipster. In groups of 3 we had to build a model (with no computer input) that would classify pictures of men as “hipsters” or “non hipsters”. My group did pretty well and came in second in the class, as we predicted 13 of the 15 new pictures accurately. Go team.      
  • Went over Git and GitHub and learned all about version control. Think of GitHub as a Dropbox/Google Drive type of site for coders and Git as the process of how you upload and retrieve your code from the site. The combo allows multiple users to collaborate on projects without anyone’s code getting overwritten. Cool stuff.
  • IPython notebooks are pretty great. They allow you to work on code in a browser-like environment and run your code in nice little chunks instead of having to run your entire Python file via the terminal.
  • Finally, we completed Project Benson!

Week 2

  • Reviewed best practices for Python coding #pythonic
  • Learned about web scraping using BeautifulSoup and Selenium
  • Started brainstorming and coming up with ideas for our next project, Luther
  • Introduced to some of the top Python statistical analysis packages, including pandas, NumPy, scikit-learn, and StatsModels
  • Reviewed Bayesian probability and linear regression
  • The majority of our week was spent individually working on Project Luther. For this project we have to scrape Box Office Mojo (and whatever other sites we’d like) to come up with a movie-related question and solve it using movie data and linear regression. My idea is to predict total Oscar awards won in a year given that a movie is a nominee for best picture. I’ll update my findings in the portfolio section after I’m done next week!

General Feelings

  • It's incredible that two weeks have already passed. Every day goes by very quickly as there is always something to learn and do.
  • Our instructors weren't kidding when they said that Googling is a real skill. You can't find help if you don't know what to ask for!
  • WeWork is awesome because they give us free food.

 

"Trust that you are here for a reason."

This was one of the main points that our instructors preached throughout our first day, and after going through introductions, I can certainly see why. Physicist, chemist, 20+ years of professional experience, Masters and Doctorate degrees galore, the list of accolades and accomplishments goes on and on. There were even multiple people who have never even lived in the US before, but relocated specifically to be a part of this program. Ah-may-zing. 

It would be easy to look around the room and feel very intimidated. In fact, I did. But the key thing to keep in mind, and the reason why that quote stuck with me, is that our differences and diverse backgrounds only make us stronger, both as a group learning with and from one another and as individuals striving towards a common goal. Everyone has something positive to contribute. It's a great and empowering feeling to be surrounded by so many brilliant people who share the same drive and goals. I can’t wait to see what the rest of Week 1 has in store.

First Post!

Hi!

Welcome to the blog. Exciting times are taking place! Here is where I'll be updating the interwebs on my progress through the bootcamp as well as sharing other interesting tidbits I come across.

Stay tuned.