Estimated read time: 6 min.

Overview

This will be a series of notes recording what I think and learn on my Machine Learning journey. I hope it serves both as a record of my learning progress and as a reference for others just entering this field.

Numpy Basics

Numpy is a powerful Python module. It covers many of the basics of linear algebra and is well suited to scientific and mathematical calculations, which is why it is a ‘go-to’ solution for Machine Learning. Other ‘higher-level’, more specialized modules such as Pandas and seaborn use Numpy or are even built on it, which shows how capable Numpy is in another way. Although it mostly works at the foundation level, Numpy should be treated as an equal to the more ‘advanced’ modules: use whichever one best fits the problem at hand. Without further ado, let’s jump right into some code.

In [1]:
import numpy as np   # importing Numpy as 'np' is a common coding convention
print('Numpy version: ',np.__version__)
Numpy version:  1.13.3

Now that it’s imported, let’s use it to do some basic things.

In [2]:
x = np.random.rand(5,3)
x
Out[2]:
array([[ 0.06584995,  0.58936577,  0.02054667],
       [ 0.90132301,  0.62575129,  0.76350867],
       [ 0.70707616,  0.19462487,  0.35358817],
       [ 0.44329537,  0.57977487,  0.78594087],
       [ 0.76109628,  0.86614311,  0.51521077]])

Manipulating arrays is what Numpy does best. Here we generated an array of random numbers with 5 rows and 3 columns; np.random.rand draws each value uniformly from 0 (inclusive) to 1 (exclusive).

In [3]:
print(x.shape)
print(x.dtype)
(5, 3)
float64

We can look at the shape of an array and its data type. Unsurprisingly it’s a float (float64), since np.random.rand generates floating-point numbers.

In [4]:
y = np.random.rand(3,4)
z = np.dot(x,y)
z
Out[4]:
array([[ 0.60055916,  0.43108417,  0.35430406,  0.32056793],
       [ 2.10608162,  0.91944706,  0.73772388,  1.22819697],
       [ 1.1554381 ,  0.48784172,  0.3088563 ,  0.63719203],
       [ 1.66918268,  0.69113499,  0.68490871,  1.04397122],
       [ 1.95288349,  0.98819028,  0.76639304,  1.09896047]])

Doing a good old dot product. For 2-D arrays like these, np.dot performs matrix multiplication.

In [6]:
z = x @ y
z
Out[6]:
array([[ 0.60055916,  0.43108417,  0.35430406,  0.32056793],
       [ 2.10608162,  0.91944706,  0.73772388,  1.22819697],
       [ 1.1554381 ,  0.48784172,  0.3088563 ,  0.63719203],
       [ 1.66918268,  0.69113499,  0.68490871,  1.04397122],
       [ 1.95288349,  0.98819028,  0.76639304,  1.09896047]])

Or, more intuitively, use the ‘@’ operator.
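The shape rule behind the two cells above can be checked directly: an (n, k) array times a (k, m) array gives an (n, m) array, and for 2-D arrays np.dot and ‘@’ compute the same matrix product. Here is a minimal sketch; the seed and the names a, b, c_dot and c_at are mine, chosen only for illustration.

```python
import numpy as np

np.random.seed(0)            # arbitrary seed so the run is repeatable
a = np.random.rand(5, 3)     # 5 rows, 3 columns
b = np.random.rand(3, 4)     # 3 rows, 4 columns

c_dot = np.dot(a, b)
c_at = a @ b

# The inner dimensions (3 and 3) must match; the result takes the outer ones.
print(c_dot.shape)               # (5, 4)
print(np.allclose(c_dot, c_at))  # True
```

Swapping the operands (b @ a) would raise a ValueError here, because a (3, 4) array cannot be multiplied by a (5, 3) array.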

How to index a Numpy array

First, create a sample array using the np.array function.

In [7]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1
Out[7]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

array[row, col] selects the element at row ‘row’ and column ‘col’. Note that indices start from 0.

In [8]:
x1[1,1]   # row 1, col 1; indices start from 0, so this is the 2nd row and 2nd column
Out[8]:
5
In [9]:
x1[:,2]   # ':' means all rows, so this selects every element in the 3rd column
Out[9]:
array([3, 6, 9])
In [10]:
x1[:,1]>3  # comparing an array with a value produces an array of boolean values
Out[10]:
array([False,  True,  True], dtype=bool)
In [11]:
x1[ x1[:,1]>3 ] # select the rows whose 2nd column is greater than 3
Out[11]:
array([[4, 5, 6],
       [7, 8, 9]])
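Boolean masks like the one above can also be combined. This is a small sketch using the same x1; the second condition and the name mask are my own additions for illustration.

```python
import numpy as np

x1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# Combine two conditions with & (element-wise "and").
# Each condition needs its own parentheses.
mask = (x1[:, 1] > 3) & (x1[:, 2] < 9)
print(mask)          # [False  True False]
print(x1[mask])      # [[4 5 6]]

# A mask and a column index can be used together:
# the first column of the matching rows.
print(x1[mask, 0])   # [4]
```

Use | for "or" and ~ for "not"; the plain Python keywords and/or do not work element-wise on arrays.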

Shape Manipulations

In [12]:
x1.reshape(9)   # reshape x1 into a flat 1-D array of 9 elements
Out[12]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [13]:
x1.reshape(3,3)   #reshape back to 3x3
Out[13]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [14]:
x1.reshape(9,1)   #reshape into 9x1
Out[14]:
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
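When only one dimension is known, -1 can stand in for the other and Numpy works it out from the total number of elements. A short sketch, rebuilding the same x1 with np.arange (the names flat and col are just for illustration):

```python
import numpy as np

x1 = np.arange(1, 10).reshape(3, 3)   # same values as x1 above

flat = x1.reshape(-1)       # -1 lets Numpy infer the length: 9
col = x1.reshape(-1, 1)     # inferred as 9 rows, 1 column
print(flat.shape, col.shape)   # (9,) (9, 1)

# reshape returns a new array object; x1 itself keeps its shape
print(x1.shape)                # (3, 3)
```

The total number of elements has to stay the same, so something like x1.reshape(2, 5) raises a ValueError.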

Dot and Multiplication

In [15]:
x2 = np.arange(9).reshape(3,3)
x2
Out[15]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [16]:
multi = x1*x2
dot = np.dot(x1,x2)
print('x1 * x2 =', multi)
print('x1 dot x2 =', dot)
x1 * x2 = [[ 0  2  6]
 [12 20 30]
 [42 56 72]]
x1 dot x2 = [[ 24  30  36]
 [ 51  66  81]
 [ 78 102 126]]
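The two results differ because ‘*’ multiplies element by element, while np.dot multiplies rows by columns and sums. To make that concrete, here is one entry of each result worked out by hand, using the same x1 and x2 as above:

```python
import numpy as np

x1 = np.arange(1, 10).reshape(3, 3)
x2 = np.arange(9).reshape(3, 3)

# Element-wise: entry [1, 2] is just x1[1, 2] * x2[1, 2] = 6 * 5
print((x1 * x2)[1, 2])               # 30

# Matrix product: entry [1, 2] is row 1 of x1 times column 2 of x2, summed:
# 4*2 + 5*5 + 6*8 = 8 + 25 + 48
print(np.sum(x1[1, :] * x2[:, 2]))   # 81
print(np.dot(x1, x2)[1, 2])          # 81
```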

Pandas Basics

In [17]:
# set some basic data
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,1],
                 [48,2300,1],
                 [34,0,   2],
                 [30,100, 5]])
data
Out[17]:
array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1],
       [  34,    0,    2],
       [  30,  100,    5]])
In [18]:
# explore the data using numpy; it works, but it's clumsy
data2 = data[data[:,1]>1500]
data2
Out[18]:
array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1]])
In [19]:
# now let's try using Pandas
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df
Out[19]:
   temperature  time  day
0           64  2100    1
1           50  2200    1
2           48  2300    1
3           34     0    2
4           30   100    5

A Pandas DataFrame puts all the data into a much nicer form, with neat labels.

In [20]:
df[df.time>1500] # the same data exploration again, this time using a Pandas DataFrame
Out[20]:
   temperature  time  day
0           64  2100    1
1           50  2200    1
2           48  2300    1

Much nicer!
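The same labelled style extends to more than one condition and to picking columns. A minimal sketch on the same data; the second condition (temperature below 60) and the name late_and_cool are mine, added only to show the pattern.

```python
import numpy as np
import pandas as pd

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 1],
                 [48, 2300, 1],
                 [34, 0,    2],
                 [30, 100,  5]])
df = pd.DataFrame(data, columns=col_names)

# Rows where time > 1500 and temperature < 60.
# As with Numpy masks, each condition needs its own parentheses.
late_and_cool = df[(df['time'] > 1500) & (df['temperature'] < 60)]
print(late_and_cool)

# .loc takes a row condition and a list of column labels together
print(df.loc[df['time'] > 1500, ['temperature', 'day']])
```

Compared with data[data[:,1]>1500], the column names make it much easier to see what each condition refers to.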

In [22]:
# now let's get some basic info of the DataFrame
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null int64
dtypes: int64(3)
memory usage: 200.0 bytes
Out[22]:
       temperature        time       day
count     5.000000     5.00000  5.000000
mean     45.200000  1340.00000  2.000000
std      13.608821  1180.25421  1.732051
min      30.000000     0.00000  1.000000
25%      34.000000   100.00000  1.000000
50%      48.000000  2100.00000  1.000000
75%      50.000000  2200.00000  2.000000
max      64.000000  2300.00000  5.000000
In [23]:
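# you can change elements in a DataFrame like so (.loc avoids chained assignment):
df['day'] = df['day'].astype(object)   # let the column hold strings next to numbers
df.loc[df.day == 1, 'day'] = 'Mon'
df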
Out[23]:
   temperature  time  day
0           64  2100  Mon
1           50  2200  Mon
2           48  2300  Mon
3           34     0    2
4           30   100    5
In [24]:
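# or do it the Pandas way: map the day numbers 0-6 to names in one call
# (assigning the result back is more reliable than inplace=True on df.day)
df['day'] = df['day'].replace(to_replace=list(range(7)),
                              value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df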
Out[24]:
   temperature  time   day
0           64  2100   Mon
1           50  2200   Mon
2           48  2300   Mon
3           34     0  Tues
4           30   100   Fri
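Another way to do the same renaming, if you prefer an explicit lookup table, is Series.map with a dictionary. This is only a sketch on a standalone Series with the original day numbers; the names days, day_names and named are hypothetical.

```python
import pandas as pd

days = pd.Series([1, 1, 1, 2, 5], name='day')
day_names = {0: 'Su', 1: 'Mon', 2: 'Tues', 3: 'Wed',
             4: 'Th', 5: 'Fri', 6: 'Sat'}

# map looks each value up in the dict;
# values missing from the dict would become NaN
named = days.map(day_names)
print(named.tolist())   # ['Mon', 'Mon', 'Mon', 'Tues', 'Fri']
```

One difference to keep in mind: replace leaves unmatched values as they are, while map turns them into NaN.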
In [25]:
# one hot encoding example
pd.get_dummies(df.day)
Out[25]:
   Fri  Mon  Tues
0    0    1     0
1    0    1     0
2    0    1     0
3    0    0     1
4    1    0     0
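In practice you usually want the encoded columns next to the rest of the data. One way, sketched here on a small stand-in DataFrame (the name encoded is mine), is to pass the whole DataFrame and name the column to encode:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'day': ['Mon', 'Mon', 'Mon', 'Tues', 'Fri']})

# Encode 'day' and replace it with indicator columns in one call;
# the new columns are prefixed with the original column name.
encoded = pd.get_dummies(df, columns=['day'])
print(encoded.columns.tolist())
# ['temperature', 'day_Fri', 'day_Mon', 'day_Tues']
```

Depending on the Pandas version, the indicator columns hold 0/1 integers or True/False booleans; both work as one-hot features.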

Michael Li is the creator and lead developer of this site.
