I am a PhD student at Max Planck Institute - Intelligent Systems, working with Manuel Gomez-Rodriguez and Bernhard Schölkopf. I am interested in Machine Learning and Social Netwroks.
# what is this line all about?!? Answer in lecture 4
%pylabinline
Populating the interactive namespace from numpy and matplotlib
Introduction
The numpy package (module) is used in almost all numerical computation using Python. It is a package that provide high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran so when calculations are vectorized (formulated with vectors and matrices), performance is very good.
To use numpy need to import the module it using of example:
1
fromnumpyimport*
In the numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array.
Creating numpy arrays
There are a number of ways to initialize new numpy arrays, for example from
a Python list or tuples
using functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
reading data from files
From lists
For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.
1
2
3
4
# a vector: the argument to the array function is a Python list
v=array([1,2,3,4])v
array([1, 2, 3, 4])
1
2
3
4
# a matrix: the argument to the array function is a nested Python list
M=array([[1,2],[3,4]])M
array([[1, 2],
[3, 4]])
The v and M objects are both of the type ndarray that the numpy module provides.
1
type(v),type(M)
(numpy.ndarray, numpy.ndarray)
The difference between the v and M arrays is only their shapes. We can get information about the shape of an array by using the ndarray.shape property.
1
v.shape
(4,)
1
M.shape
(2, 2)
The number of elements in the array is available through the ndarray.size property:
1
M.size
4
Equivalently, we could use the function numpy.shape and numpy.size
1
shape(M)
(2, 2)
1
size(M)
4
So far the numpy.ndarray looks awefully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type?
There are several reasons:
Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementating such functions for Python lists would not be very efficient because of the dynamic typing.
Numpy arrays are statically typed and homogeneous. The type of the elements is determined when array is created.
Numpy arrays are memory efficient.
Because of the static typing, fast implementation of mathematical functions such as multiplication and addition of numpy arrays can be implemented in a compiled language (C and Fortran is used).
Using the dtype (data type) property of an ndarray, we can see what type the data of an array has:
1
M.dtype
dtype('int64')
We get an error if we try to assign a value of the wrong type to an element in a numpy array:
1
M[0,0]="hello"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a09d72434238> in <module>()
----> 1 M[0,0] = "hello"
ValueError: invalid literal for long() with base 10: 'hello'
If we want, we can explicitly define the type of the array data when we create it, using the dtype keyword argument:
1
2
3
M=array([[1,2],[3,4]],dtype=complex)M
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]])
Common type that can be used with dtype are: int, float, complex, bool, object, etc.
We can also explicitly define the bit size of the data types, for example: int64, int16, float128, complex128.
Using array-generating functions
For larger arrays it is impractical to initialize the data manually, using explicit python lists. Instead we can use one of the many functions in numpy that generates arrays of different forms. Some of the more common are:
arange
1
2
3
4
5
# create a range
x=arange(0,10,1)# arguments: start, stop, step
x
A very common file format for data files are the comma-separated values (CSV), or related format such as TSV (tab-separated values). To read data from such file into Numpy arrays we can use the numpy.genfromtxt function. For example,
fig,ax=subplots(figsize=(14,4))ax.plot(data[:,0]+data[:,1]/12.0+data[:,2]/365,data[:,5])ax.axis('tight')ax.set_title('tempeatures in Stockholm')ax.set_xlabel('year')ax.set_ylabel('temperature (C)');
Using numpy.savetxt we can store a Numpy array to a file in CSV format:
col_indices=[1,2,-1]# remember, index -1 means the last element
A[row_indices,col_indices]
array([11, 22, 34])
We can also index masks: If the index mask is an Numpy array of with data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position each element:
Vectorizing code is the key to writing efficient numerical calculation with Python/Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.
Scalar-array operations
We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.
What about matrix mutiplication? There are two ways. We can either use the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:
Alternatively, we can cast the array objects to the type matrix. This changes the behavior of the standard arithmetic operators +, -, * to use matrix algebra.
1
2
M=matrix(A)v=matrix(v1).T# make it a column vector
We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).
For example, let’s go back to the temperature dataset:
The dataformat is: year, month, day, daily average temperature, low, high, location.
If we are interested in the average temperature only in a particular month, say February, then we can create a index mask and use the select out only the data for that month using:
1
unique(data[:,1])# the month column takes values from 1 to 12
# the temperature data is in column 3
mean(data[mask_feb,3])
-3.2121095707365961
With these tools we have very powerful data processing capabilities at our disposal. For example, to extract the average monthly average temperatures for each month of the year only takes a few lines of code:
When functions such as min, max, etc., is applied to a multidimensional arrays, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the axis argument we can specify how these functions should behave:
With newaxis, we can insert new dimensions in an array, for example converting a vector to a column or row matrix:
1
v=array([1,2,3])
1
shape(v)
(3,)
1
2
# make a column matrix of the vector v
v[:,newaxis]
array([[1],
[2],
[3]])
1
2
# column matrix
v[:,newaxis].shape
(3, 1)
1
2
# row matrix
v[newaxis,:].shape
(1, 3)
Stacking and repeating arrays
Using function repeat, tile, vstack, hstack, and concatenate we can create larger vectors and matrices from smaller ones:
tile and repeat
1
a=array([[1,2],[3,4]])
1
2
# repeat each element 3 times
repeat(a,3)
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
1
2
# tile the matrix 3 times
tile(a,3)
array([[1, 2, 1, 2, 1, 2],
[3, 4, 3, 4, 3, 4]])
concatenate
1
b=array([[5,6]])
1
concatenate((a,b),axis=0)
array([[1, 2],
[3, 4],
[5, 6]])
1
concatenate((a,b.T),axis=1)
array([[1, 2, 5],
[3, 4, 6]])
hstack and vstack
1
vstack((a,b))
array([[1, 2],
[3, 4],
[5, 6]])
1
hstack((a,b.T))
array([[1, 2, 5],
[3, 4, 6]])
Copy and “deep copy”
To achieve high performance, assignments in Python usually do not copy the underlaying objects. This is important for example when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (techincal term: pass by reference).
1
2
3
A=array([[1,2],[3,4]])A
array([[1, 2],
[3, 4]])
1
2
# now B is referring to the same array data as A
B=A
1
2
3
4
# changing B affects A
B[0,0]=10B
array([[10, 2],
[ 3, 4]])
1
A
array([[10, 2],
[ 3, 4]])
If we want to avoid this behavior, so that when we get a new completely independent object B copied from A, then we need to do a so-called “deep copy” using the function copy:
1
B=copy(A)
1
2
3
4
# now, if we modify B, A is not affected
B[0,0]=-5B
array([[-5, 2],
[ 3, 4]])
1
A
array([[10, 2],
[ 3, 4]])
Iterating over array elements
Generally, we want to avoid iterating over the elements of arrays whenever we can (at all costs). The reason is that in a interpreted language like Python (or MATLAB), iterations are really slow compared to vectorized operations.
However, sometimes iterations are unavoidable. For such cases, the Python for loop is the most convenient way to iterate over an array:
When we need to iterate over each element of an array and modify its elements, it is convenient to use the enumerate function to obtain both the element and its index in the for loop:
1
2
3
4
5
6
7
8
forrow_idx,rowinenumerate(M):print("row_idx",row_idx,"row",row)forcol_idx,elementinenumerate(row):print("col_idx",col_idx,"element",element)# update the matrix M: square each element
M[row_idx,col_idx]=element**2
As mentioned several times by now, to get good performance we should try to avoid looping over elements in our vectors and matrices, and instead use vectorized algorithms. The first step in converting a scalar algorithm to a vectorized algorithm is to make sure that the functions we write work with vector inputs.
1
2
3
4
5
6
7
8
defTheta(x):"""
Scalar implemenation of the Heaviside step function.
"""ifx>=0:return1else:return0
1
Theta(array([-3,-2,-1,0,1,2,3]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-166-6658efdd2f22> in <module>()
----> 1 Theta(array([-3,-2,-1,0,1,2,3]))
<ipython-input-165-9a0cb13d93d4> in Theta(x)
3 Scalar implemenation of the Heaviside step function.
4 """
----> 5 if x >= 0:
6 return 1
7 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
OK, that didn’t work because we didn’t write the Theta function so that it can handle with vector input…
To get a vectorized version of Theta we can use the Numpy function vectorize. In many cases it can automatically vectorize a function:
1
Theta_vec=vectorize(Theta)
1
Theta_vec(array([-3,-2,-1,0,1,2,3]))
array([0, 0, 0, 1, 1, 1, 1])
We can also implement the function to accept vector input from the beginning (requires more effort but might give better performance):
1
2
3
4
5
defTheta(x):"""
Vector-aware implemenation of the Heaviside step function.
"""return1*(x>=0)
1
Theta(array([-3,-2,-1,0,1,2,3]))
array([0, 0, 0, 1, 1, 1, 1])
1
2
# still works for scalars as well
Theta(-1.2),Theta(2.6)
(0, 1)
Using arrays in conditions
When using arrays in conditions in for example if statements and other boolean expressions, one need to use one of any or all, which requires that any or all elements in the array evalutes to True:
1
M
array([[ 1, 4],
[ 9, 16]])
1
2
3
4
if(M>5).any():print("at least one element in M is larger than 5")else:print("no element in M is larger than 5")
at least one element in M is larger than 5
1
2
3
4
if(M>5).all():print("all elements in M are larger than 5")else:print("all elements in M are not larger than 5")
all elements in M are not larger than 5
Type casting
Since Numpy arrays are statically typed, the type of an array does not change once created. But we can explicitly cast an array of some type to another using the astype functions (see also the similar asarray function). This always create a new array of new type:
1
M.dtype
dtype('int64')
1
2
3
M2=M.astype(float)M2
array([[ 1., 4.],
[ 9., 16.]])
1
M2.dtype
dtype('float64')
1
2
3
M3=M.astype(bool)M3
array([[ True, True],
[ True, True]], dtype=bool)
Further reading
http://numpy.scipy.org
http://scipy.org/Tentative_NumPy_Tutorial
http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.
Internet is very dear to me! That may sound strange but I come from a generation where I understand the difference between Internet and no Internet. I can st...