
1.4.2 NumPy, 1.4.3 SciPy, 1.4.4 matplotlib, 1.4.5 pandas

1.4.2 NumPy


NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.
In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array. The core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All elements of the array must be of the same type. A NumPy array looks like this:

In[1]:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
Out[1]:
x:
[[1 2 3]
 [4 5 6]]
We will be using NumPy a lot in this book, and we will refer to objects of the NumPy ndarray class as “NumPy arrays” or just “arrays.”
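The rule that all elements share one type can be checked directly. The following is a minimal sketch, not taken from the original text: it inspects the ndim, shape, and dtype attributes of the array above, and builds a second array (the name mixed is only illustrative) to show that NumPy upcasts mixed integers and floats to a single floating-point type.

```python
import numpy as np

# The same 2D array as above: 2 rows, 3 columns
x = np.array([[1, 2, 3], [4, 5, 6]])
print("dimensions:", x.ndim)
print("shape:", x.shape)
print("element type:", x.dtype)  # an integer type; the exact width depends on the platform

# Mixing integers and floats forces one common type for every element
mixed = np.array([1, 2.5, 3])
print("mixed element type:", mixed.dtype)
```

Because the integers in mixed are stored as floats, printing it would show 1. and 3. rather than 1 and 3.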

1.4.3 SciPy


SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy’s collection of functions for implementing its algorithms. The most important part of SciPy for us is scipy.sparse: this provides sparse matrices, which are another representation that is used for data in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros:


In[2]:
from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n", eye)
Out[2]:
NumPy array:
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

In[3]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n", sparse_matrix)
Out[3]:
SciPy sparse CSR matrix:
  (0, 0) 1.0
  (1, 1) 1.0
  (2, 2) 1.0
  (3, 3) 1.0
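To make "only the nonzero entries are stored" more concrete, here is a rough sketch that compares the memory used by a dense identity matrix with the memory used by its CSR counterpart. The size n = 1000 is an arbitrary choice for illustration, and the byte counts depend on the index type SciPy picks, so no exact figures are claimed.

```python
import numpy as np
from scipy import sparse

n = 1000  # arbitrary size chosen for illustration
dense = np.eye(n)  # stores all n * n values, almost all of them zero
csr = sparse.csr_matrix(dense)

# CSR keeps three 1D arrays: the nonzero values, their column indices,
# and one pointer per row marking where that row's entries start
print("values:", csr.data[:4])
print("column indices:", csr.indices[:4])
print("row pointers:", csr.indptr[:5])
print("stored nonzeros:", csr.nnz)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print("dense bytes:", dense_bytes)
print("sparse bytes:", sparse_bytes)
```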

Usually it is not possible to create dense representations of sparse data (as they would not fit into memory), so we need to create sparse representations directly. Here is a way to create the same sparse matrix as before, using the COO format:


In[4]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n", eye_coo)
Out[4]:
COO representation:
  (0, 0) 1.0
  (1, 1) 1.0
  (2, 2) 1.0
  (3, 3) 1.0
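One way to confirm that the COO construction describes the same matrix is to convert it and compare against the dense identity. This is a small sketch under the assumption that the matrix is small enough to densify; the explicit shape argument is an addition here and is optional, since SciPy can infer the shape from the largest indices.

```python
import numpy as np
from scipy import sparse

data = np.ones(4)
idx = np.arange(4)  # row and column indices coincide on the diagonal
eye_coo = sparse.coo_matrix((data, (idx, idx)), shape=(4, 4))

# COO is convenient for construction; CSR is better suited to arithmetic and row slicing
eye_csr = eye_coo.tocsr()

# toarray() builds a dense NumPy array, so only use it on small matrices
dense = eye_csr.toarray()
print("matches np.eye(4):", np.array_equal(dense, np.eye(4)))
```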

1.4.4 matplotlib

matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on. Visualizing your data and different aspects of your analysis can give you important insights, and we will be using matplotlib for all our visualizations. When working inside the Jupyter Notebook, you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib inline commands. We recommend using %matplotlib notebook, which provides an interactive environment (though we are using %matplotlib inline to produce this book). For example, the following code produces a line plot of the sine function:
In[5]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
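The %matplotlib commands only work inside Jupyter or IPython. As a hedged sketch of how the same plot might be produced from a plain Python script, the code below selects the non-interactive Agg backend and writes the figure to a file. The file name sine_plot.png and the use of the system temporary directory are assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, opens no window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, marker="x")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

# Hypothetical output location chosen for this example
out_path = os.path.join(tempfile.gettempdir(), "sine_plot.png")
fig.savefig(out_path)
plt.close(fig)  # release the figure's memory
print("saved to", out_path)
```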



1.4.5 pandas

pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame that is modeled after the R DataFrame. Simply put, a pandas DataFrame is a table, similar to an Excel spreadsheet. pandas provides a great range of methods to modify and operate on this table; in particular, it allows SQL-like queries and joins of tables. In contrast to NumPy, which requires that all entries in an array be of the same type, pandas allows each column to have a separate type (for example, integers, dates, floating-point numbers, and strings). Another valuable tool provided by pandas is its ability to ingest from a great variety of file formats and databases, like SQL, Excel files, and comma-separated values (CSV) files. Here is a small example of creating a DataFrame from a dictionary:
In[6]:
import pandas as pd
from IPython.display import display

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)

This produces the following output:



   Age  Location   Name
0   24  New York   John
1   13     Paris   Anna
2   53    Berlin  Peter
3   33    London  Linda
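The points about per-column types and file ingestion can be illustrated together. The sketch below reads a small CSV from an in-memory string instead of a real file (the csv_text content is made up for this example) and prints the type pandas infers for each column.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file; the rows are illustrative only
csv_text = """Name,Location,Age
John,New York,24
Anna,Paris,13
"""

df = pd.read_csv(io.StringIO(csv_text))

# Each column carries its own type: text for Name and Location, integers for Age
print(df.dtypes)
```

Passing a file path to pd.read_csv instead of the StringIO object works the same way.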


In[7]:
# Select all rows that have an age column greater than 30
display(data_pandas[data_pandas.Age > 30])
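The SQL-like queries and joins mentioned earlier can be sketched in the same style. In this example the countries table and its Country column are hypothetical additions; the boolean mask plays the role of a WHERE clause and merge plays the role of an INNER JOIN.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Age": [24, 13, 53, 33],
})

# Hypothetical lookup table, added only to demonstrate a join
countries = pd.DataFrame({
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Country": ["USA", "France", "Germany", "UK"],
})

# Roughly: SELECT * FROM people WHERE Age > 30 AND Location != 'London'
older = people[(people["Age"] > 30) & (people["Location"] != "London")]
print(older)

# Roughly: SELECT * FROM people INNER JOIN countries USING (Location)
joined = people.merge(countries, on="Location", how="inner")
print(joined[["Name", "Country"]])
```

Each condition in a combined mask needs its own parentheses, because & binds more tightly than the comparison operators in Python.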

