1.1.1 Problems Machine Learning Can Solve
The most successful kinds of machine learning algorithms are those that automate decision-making processes by generalizing from known examples. In this setting, which is known as supervised learning, the user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input. In particular, the algorithm is able to create an output for an input it has never seen before without any help from a human. Going back to our example of spam classification, using machine learning, the user provides the algorithm with a large number of emails (which are the input), together with information about whether any of these emails are spam (which is the desired output). Given a new email, the algorithm will then produce a prediction as to whether the new email is spam.

Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms because a “teacher” provides supervision to the algorithms in the form of the desired outputs for each example that they learn from. While creating a dataset of inputs and outputs is often a laborious manual process, supervised learning algorithms are well understood and their performance is easy to measure. If your application can be formulated as a supervised learning problem, and you are able to create a dataset that includes the desired outcome, machine learning will likely be able to solve your problem.

Examples of supervised machine learning tasks include:

Identifying the zip code from handwritten digits on an envelope
Here the input is a scan of the handwriting, and the desired output is the actual digits in the zip code. To create a dataset for building a machine learning model, you need to collect many envelopes.
Then you can read the zip codes yourself and store the digits as your desired outcomes.

Determining whether a tumor is benign based on a medical image
Here the input is the image, and the output is whether the tumor is benign. To create a dataset for building a model, you need a database of medical images. You also need an expert opinion, so a doctor needs to look at all of the images and decide which tumors are benign and which are not. It might even be necessary to do additional diagnosis beyond the content of the image to determine whether the tumor in the image is cancerous or not.

Detecting fraudulent activity in credit card transactions
Here the input is a record of the credit card transaction, and the output is whether it is likely to be fraudulent or not. Assuming that you are the entity distributing the credit cards, collecting a dataset means storing all transactions and recording if a user reports any transaction as fraudulent.

An interesting thing to note about these examples is that although the inputs and outputs look fairly straightforward, the data collection process for these three tasks is vastly different. While reading envelopes is laborious, it is easy and cheap. Obtaining medical imaging and diagnoses, on the other hand, requires not only expensive machinery but also rare and expensive expert knowledge, not to mention the ethical concerns and privacy issues. In the example of detecting credit card fraud, data collection is much simpler. Your customers will provide you with the desired output, as they will report fraud. All you have to do to obtain the input/output pairs of fraudulent and nonfraudulent activity is wait.

Unsupervised algorithms are the other type of algorithm that we will cover in this book. In unsupervised learning, only the input data is known, and no known output data is given to the algorithm. While there are many successful applications of these methods, they are usually harder to understand and evaluate.

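
To make the difference between the two settings concrete, here is a minimal sketch in plain Python. It is not a method from this book: the emails, the labels, and the word-overlap rule in `predict` are all invented for illustration. It only shows the shape of the data in each setting, and how a supervised algorithm can produce an output for an input it has never seen.

```python
# Toy illustration only: the emails, labels, and similarity rule are assumptions.

def tokenize(text):
    """Lowercase an email and split it into a set of words."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard similarity: number of shared words divided by all distinct words."""
    return len(a & b) / len(a | b)

def predict(training_pairs, new_email):
    """Return the desired output of the training email most similar to new_email."""
    new_words = tokenize(new_email)
    _, best_label = max(
        training_pairs,
        key=lambda pair: similarity(tokenize(pair[0]), new_words),
    )
    return best_label

# Supervised setting: every input (email text) is paired with a desired output.
training_pairs = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("meeting moved to monday morning", "not spam"),
    ("please review the attached report", "not spam"),
]

# Unsupervised setting: the same inputs, but no desired outputs are given.
unlabeled_inputs = [text for text, _ in training_pairs]

# Predictions for two emails that do not appear in the training data.
prediction_a = predict(training_pairs, "free prize waiting for you")
prediction_b = predict(training_pairs, "report for the monday meeting")
print(prediction_a)
print(prediction_b)
```

The rule used here (copy the label of the most similar known email) is deliberately simple. What matters is that `training_pairs` carries the "teacher" information, while `unlabeled_inputs` does not.
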
Examples of unsupervised learning include:

Identifying topics in a set of blog posts
If you have a large collection of text data, you might want to summarize it and find prevalent themes in it. You might not know beforehand what these topics are, or how many topics there might be. Therefore, there are no known outputs.

Segmenting customers into groups with similar preferences
Given a set of customer records, you might want to identify which customers are similar, and whether there are groups of customers with similar preferences. For a shopping site, these might be “parents,” “bookworms,” or “gamers.” Because you don’t know in advance what these groups might be, or even how many there are, you have no known outputs.

Detecting abnormal access patterns to a website
To identify abuse or bugs, it is often helpful to find access patterns that are different from the norm. Each abnormal pattern might be very different, and you might not have any recorded instances of abnormal behavior. Because in this example you only observe traffic, and you don’t know what constitutes normal and abnormal behavior, this is an unsupervised problem.

For both supervised and unsupervised learning tasks, it is important to have a representation of your input data that a computer can understand. Often it is helpful to think of your data as a table. Each data point that you want to reason about (each email, each customer, each transaction) is a row, and each property that describes that data point (say, the age of a customer or the amount or location of a transaction) is a column. You might describe users by their age, their gender, when they created an account, and how often they have bought from your online shop.
You might describe the image of a tumor by the grayscale values of each pixel, or maybe by using the size, shape, and color of the tumor. Each entity or row here is known as a sample (or data point) in machine learning, while the columns (the properties that describe these entities) are called features.

Later in this book we will go into more detail on the topic of building a good representation of your data, which is called feature extraction or feature engineering. You should keep in mind, however, that no machine learning algorithm will be able to make a prediction on data for which it has no information. For example, if the only feature that you have for a patient is their last name, no algorithm will be able to predict their gender. This information is simply not contained in your data. If you add another feature that contains the patient’s first name, you will have much better luck, as it is often possible to tell the gender by a person’s first name.
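
The table view of data and the last-name example can be sketched in a few lines of plain Python. Everything below is an assumption made for illustration: the patient names, the column names, and the helper functions `column` and `outputs_per_value` are hypothetical and come from no library or dataset.

```python
# Hypothetical patient table: each row is a sample, each column is a property.
# All names and values are invented for illustration.
columns = ["first_name", "last_name", "gender"]
rows = [
    ["Anna", "Smith", "female"],
    ["John", "Smith", "male"],
    ["Maria", "Lee", "female"],
    ["David", "Lee", "male"],
]

n_samples = len(rows)            # one sample (data point) per row
n_features = len(columns) - 1    # "gender" is the desired output, not a feature

def column(name):
    """Return all values of one column, in row order."""
    index = columns.index(name)
    return [row[index] for row in rows]

def outputs_per_value(feature, target):
    """Map each value of `feature` to the set of `target` values seen with it."""
    seen = {}
    for value, output in zip(column(feature), column(target)):
        seen.setdefault(value, set()).add(output)
    return seen

# The same last name occurs with both outputs, so it cannot separate them.
by_last_name = outputs_per_value("last_name", "gender")
# In this toy table, each first name occurs with exactly one output.
by_first_name = outputs_per_value("first_name", "gender")

print(n_samples, n_features)
print(by_last_name)
print(by_first_name)
```

In this toy table every last name appears with both outputs, so no rule based on the last name alone can tell them apart. The first names pass the check partly because each one is unique here; in real data the relationship holds only often, not always.
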