
Applied Machine Learning, Module 1: A simple classification task


You are currently looking at version 1.0 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.

Import required modules and load data file

In [2]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
fruits = pd.read_table('readonly/fruit_data_with_colors.txt')
In [3]:
fruits.head()
Out[3]:
   fruit_label  fruit_name  fruit_subtype  mass  width  height  color_score
0            1       apple   granny_smith   192    8.4     7.3         0.55
1            1       apple   granny_smith   180    8.0     6.8         0.59
2            1       apple   granny_smith   176    7.4     7.2         0.60
3            2    mandarin       mandarin    86    6.2     4.7         0.80
4            2    mandarin       mandarin    84    6.0     4.6         0.79
In [4]:
# create a mapping from fruit label value to fruit name to make results easier to interpret
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))   
lookup_fruit_name
Out[4]:
{1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'}
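The mapping above relies on both unique() calls returning values in the same order of first appearance, which holds as long as each label always pairs with the same name. A minimal sketch of building the same mapping row by row, which avoids that assumption; the rows list is invented for illustration and stands in for the fruit_label and fruit_name columns:

```python
# Invented (fruit_label, fruit_name) pairs standing in for the two DataFrame columns
rows = [(1, 'apple'), (1, 'apple'), (2, 'mandarin'),
        (3, 'orange'), (4, 'lemon'), (3, 'orange')]

lookup_fruit_name = {}
for label, name in rows:
    # keep the first name seen for each label
    lookup_fruit_name.setdefault(label, name)

print(lookup_fruit_name)  # {1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'}
```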
The file contains the mass, width, height, and color score of a selection of apples, mandarins, oranges, and lemons. Heights were measured along the core of the fruit. Widths are the widest extent perpendicular to the height.

Examining the data

In [5]:
# plotting a scatter matrix
X = fruits[['height', 'width', 'mass', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# plt.get_cmap works across Matplotlib versions; cm.get_cmap is gone from recent releases
cmap = plt.get_cmap('gnuplot')
# scatter_matrix lives in pandas.plotting; the top-level pd.scatter_matrix was removed
scatter = pd.plotting.scatter_matrix(X_train, c=y_train, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)
Figure 1
In [6]:
# plotting a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X_train['width'], X_train['height'], X_train['color_score'], c = y_train, marker = 'o', s=100)
ax.set_xlabel('width')
ax.set_ylabel('height')
ax.set_zlabel('color_score')
plt.show()
Figure 2

Create train-test split

In [7]:
# For this example, we use the mass, width, and height features of each fruit instance
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']
# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
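A rough sketch of what a shuffled 75% / 25% split does, written in plain Python rather than with scikit-learn. The helper name simple_train_test_split and the toy rows are hypothetical, rounding the test size up is a choice made for this sketch, and none of this is scikit-learn's actual implementation:

```python
import math
import random

def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # a fixed seed makes the shuffle repeatable, like random_state=0 above
    random.Random(seed).shuffle(shuffled)
    # assumption for this sketch: round the test set size up
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                  # 20 toy samples
train, test = simple_train_test_split(rows)
print(len(train), len(test))            # 15 5
```

Every sample ends up in exactly one of the two lists, so the test set contains only data the classifier never saw during training.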

Create classifier object

In [8]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

Train the classifier (fit the estimator) using the training data

In [9]:
knn.fit(X_train, y_train)
Out[9]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
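To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of k-nearest-neighbors prediction: measure the Euclidean distance from the query to every training point, keep the k closest, and take a majority vote. The function knn_predict is hypothetical and is not how scikit-learn implements it (no tree-based search, simplistic tie-breaking). The training rows are the five shown in the fruits.head() output above; the query point is invented.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # pair each training label with its Euclidean distance to the query
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # majority vote over the k nearest labels
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# (mass, width, height) for the five rows shown by fruits.head()
train_X = [(192, 8.4, 7.3), (180, 8.0, 6.8), (176, 7.4, 7.2),
           (86, 6.2, 4.7), (84, 6.0, 4.6)]
train_y = ['apple', 'apple', 'apple', 'mandarin', 'mandarin']

# invented query: a fruit of about 90 g
prediction = knn_predict(train_X, train_y, (90, 6.1, 4.8), k=3)
print(prediction)  # mandarin
```

Because the features are not rescaled, mass (in grams) dominates the distance in this example.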

Estimate the accuracy of the classifier on future data, using the test data

In [11]:
knn.score(X_test, y_test)
Out[11]:
0.53333333333333333
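The value returned by knn.score is the classification accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch with invented labels (the accuracy helper is hypothetical, written here only to show the arithmetic):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# invented labels for illustration: 6 of the 8 predictions are correct
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 1]
acc = accuracy(y_true, y_pred)
print(acc)  # 0.75
```

By the same arithmetic, a score of 0.5333 is what you would get from, for example, 8 correct predictions out of 15.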

Use the trained k-NN classifier model to classify new, previously unseen objects

In [12]:
# first example: a small fruit with mass 20g, width 4.3 cm, height 5.5 cm
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]
Out[12]:
'mandarin'
In [13]:
# second example: a larger, elongated fruit with mass 100g, width 6.3 cm, height 8.5 cm
fruit_prediction = knn.predict([[100, 6.3, 8.5]])
lookup_fruit_name[fruit_prediction[0]]
Out[13]:
'lemon'
In [16]:
# third example: a small, very elongated fruit with mass 20g, width 2 cm, height 12 cm
fruit_prediction = knn.predict([[20, 2, 12]])
lookup_fruit_name[fruit_prediction[0]]
Out[16]:
'mandarin'

Plot the decision boundaries of the k-NN classifier

In [17]:
from adspy_shared_utilities import plot_fruit_knn
plot_fruit_knn(X_train, y_train, 5, 'uniform')   # we choose 5 nearest neighbors
Figure 3

How sensitive is k-NN classification accuracy to the choice of the 'k' parameter?

In [18]:
k_range = range(1,20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20]);
Figure 4
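One reason accuracy changes with k is that small values of k react strongly to individual noisy training points, while larger values average them out. The sketch below illustrates this on invented 2-D data with one deliberately mislabeled point. The helpers knn_predict and accuracy_for_k are hypothetical plain-Python stand-ins, not scikit-learn code, and only odd k values are tried to avoid tied votes.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_for_k(train_X, train_y, test_X, test_y, k):
    """Fraction of test points classified correctly with k neighbors."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# invented data: class 0 near (0, 0), class 1 near (5, 5),
# plus one mislabeled point at (0.4, 0.6) acting as noise
train_X = [(0, 0), (0, 1), (1, 0), (1, 1),
           (5, 5), (5, 6), (6, 5), (6, 6), (0.4, 0.6)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
test_X = [(0.5, 0.5), (5.5, 5.5), (2, 2), (4, 4)]
test_y = [0, 1, 0, 1]

scores = {k: accuracy_for_k(train_X, train_y, test_X, test_y, k)
          for k in range(1, 8, 2)}
print(scores)  # {1: 0.75, 3: 1.0, 5: 1.0, 7: 1.0}
```

With k = 1 the mislabeled point decides the prediction for its closest test point. In practice k is usually chosen with a separate validation set or cross-validation rather than on the test set itself.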

How sensitive is k-NN classification accuracy to the train/test split proportion?

In [20]:
t = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
knn = KNeighborsClassifier(n_neighbors = 5)
plt.figure()
for s in t:
    scores = []
    for i in range(1,1000):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1-s)
        knn.fit(X_train, y_train)
        scores.append(knn.score(X_test, y_test))
    plt.plot(s, np.mean(scores), 'bo')
plt.xlabel('Training set proportion')
plt.ylabel('accuracy');
Figure 5