Skip to main content

1.1.1 Problems Machine Learning Can Solve

1.1.1 Problems Machine Learning Can Solve 


The most successful kinds of machine learning algorithms are those thatautomate decision-making processes by generalizing from known examples.In this setting, which is known as supervised learning, the userprovides the algorithm with pairs of inputs and desired outputs, and thealgorithm finds a way to produce the desired output given an input. Inparticular, the algorithm is able to create an output for an input ithas never seen before without any help from a human. Going back to ourexample of spam classification, using machine learning, the userprovides the algorithm with a large number of emails (which are theinput), together with information about whether any of these emails arespam (which is the desired output). Given a new email, the algorithmwill then produce a prediction as to whether the new email is spam.Machine learning algorithms that learn from input/output pairs arecalled supervised learning algorithms because a “teacher” providessupervision to the algorithms in the form of the desired outputs foreach example that they learn from. While creating a dataset of inputsand outputs is often a laborious manual process, supervised learningalgorithms are well understood and their performance is easy to measure.If your application can be formulated as a supervised learning problem,and you are able to create a dataset that includes the desired outcome,machine learning will likely be able to solve your problem.Examples of supervised machine learning tasks include:Identifying the zip code from handwritten digits on an envelopeHerethe input is a scan of the handwriting, and the desired output is theactual digits in the zip code. To create a dataset for building amachine learning model, you need to collect many envelopes. Then you canread the zip codes yourself and store the digits as your desiredoutcomes.Determining whether a tumor is benign based on a medical imageHerethe input is the image, and the output is whether the tumor is benign.To create a dataset for building a model, you need a database of medicalimages. You also need an expert opinion, so a doctor needs to look atall of the images and decide which tumors are benign and which are not.It might even be necessary to do additional diagnosis beyond the contentof the image to determine whether the tumor in the image is cancerous ornot.Detecting fraudulent activity in credit card transactionsHere theinput is a record of the credit card transaction, and the output iswhether it is likely to be fraudulent or not. Assuming that you are theentity distributing the credit cards, collecting a dataset means storingall transactions and recording if a user reports any transaction asfraudulent.An interesting thing to note about these examples is that although theinputs and outputs look fairly straightforward, the data collectionprocess for these three tasks is vastly different. While readingenvelopes is laborious, it is easy and cheap. Obtaining medical imagingand diagnoses, on the other hand, requires not only expensive machinerybut also rare and expensive expert knowledge, not to mention the ethicalconcerns and privacy issues. In the example of detecting credit cardfraud, data collection is much simpler. Your customers will provide youwith the desired output, as they will report fraud. All you have to doto obtain the input/output pairs of fraudulent and nonfraudulentactivity is wait.Unsupervised algorithms are the other type of algorithm that we willcover in this book. In unsupervised learning, only the input data isknown, and no known output data is given to the algorithm. While thereare many successful applications of these methods, they are usuallyharder to understand and evaluate. Examples of unsupervised learning include:Identifying topics in a set of blog postsIf you have a largecollection of text data, you might want to summarize it and findprevalent themes in it. You might not know beforehand what these topicsare, or how many topics there might be. Therefore, there are no knownoutputs.Segmenting customers into groups with similar preferencesGiven a setof customer records, you might want to identify which customers aresimilar, and whether there are groups of customers with similarpreferences. For a shopping site, these might be “parents,” “bookworms,”or “gamers.” Because you don’t know in advance what these groups mightbe, or even how many there are, you have no known outputs.Detecting abnormal access patterns to a websiteTo identify abuse orbugs, it is often helpful to find access patterns that are differentfrom the norm. Each abnormal pattern might be very different, and youmight not have any recorded instances of abnormal behavior. Because inthis example you only observe traffic, and you don’t know whatconstitutes normal and abnormal behavior, this is an unsupervisedproblem.For both supervised and unsupervised learning tasks, it is important tohave a representation of your input data that a computer can understand.Often it is helpful to think of your data as a table. Each data pointthat you want to reason about (each email, each customer, eachtransaction) is a row, and each property that describes that data point(say, the age of a customer or the amount or location of a transaction)is a column. You might describe users by their age, their gender, whenthey created an account, and how often they have bought from your onlineshop. You might describe the image of a tumor by the grayscale values ofeach pixel, or maybe by using the size, shape, and color of the tumor.Each entity or row here is known as a sample (or data point) inmachine learning, while the columns—the properties that describe theseentities—are called features.Later in this book we will go into more detail on the topic of buildinga good representation of your data, which is called feature extractionor feature engineering. You should keep in mind, however, that nomachine learning algorithm will be able to make a prediction on data forwhich it has no information. For example, if the only feature that youhave for a patient is their last name, no algorithm will be able topredict their gender. This information is simply not contained in yourdata. If you add another feature that contains the patient’s first name,you will have much better luck, as it is often possible to tell thegender by a person’s first name.

Comments

Popular posts from this blog

How to Read .CSV file in Pandas

import pandas as pd df = pd . read_csv ( 'downloads/adeshbhai.csv' ) df . head () Out[1]: Region Country Item Type Sales Channel Order Priority Order Date Order ID Ship Date Units Sold Unit Price Unit Cost Total Revenue Total Cost Total Profit 0 Australia and Oceania Tuvalu Baby Food Offline H 5/28/2010 669165933 6/27/2010 9925 255.28 159.42 2533654.00 1582243.50 951410.50 1 Central America and the Caribbean Grenada Cereal Online C 8/22/2012 963881480 9/15/2012 2804 205.70 117.11 576782.80 328376.44 248406.36 2 Europe Russia Office Supplies Offline L 5/2/2014 341417157 5/8/2014 1779 651.21 524.96 1158502.59 933903.84 224598.75 3 Sub-Saharan Africa Sao Tome and Principe Fruits Online C 6/20/2014 514321792 7/5/2014 8102 9.33 6.92 75591.66 56065.84 19525.82 4 Sub-Saharan Africa Rwanda Office Supplies Offline L 2/1/2013 115456712 2/6/2013 5062 651.21 524.96 3296425.02 2657347.52 639077.50 In [2]: df . tail () Out[2]: Reg...

Regression Graded Quiz week 2 quiz (ibm) Coursera

Congratulations! You passed! TO PASS   80% or higher Keep Learning GRADE 80% Regression LATEST SUBMISSION GRADE 80% 1. Question 1 Based on the reading, which of the following best describes the real added value of the author's research on residential real estate properties? Quantifying the magnitude of relationships between housing prices and different determinants. Quantifying people's preferences of different transport services. The research revealed findings that opposed basic perceptions that people hold about the real estate properties. The research determined that there was no correlation between proximity to shopping centres and housing prices. Correct Correct. The research confirmed many perceptions that people have about real estate properties but it major contribution is quantifying the magnitude of the relationships between the housing prices and different deter...

Assignment 4 - Understanding and Predicting Property Maintenance Fines

You are currently looking at  version 1.0  of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the  Jupyter Notebook FAQ  course resource. Assignment 4 - Understanding and Predicting Property Maintenance Fines This assignment is based on a data challenge from the Michigan Data Science Team ( MDST ). The Michigan Data Science Team ( MDST ) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ( MSSISS ) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight.  Blight violations  are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city...