Monday, February 17, 2014

Post 21: Understanding SciPy and scikit-learn

SciPy is a community (scipy.org). The SciPy stack contains: Python, NumPy, SciPy, Matplotlib, SymPy, IPython, and pandas.

We will now use SciPy packages to do some statistical computing. We will discuss the concepts of mean, median, mode, quartiles, etc. in a separate post. In this post we will assume knowledge of those basic statistical concepts and see how we can use SciPy/NumPy to compute those values for a data set.

We will first import the New York weather data that we downloaded in an earlier post, as shown below.

import numpy as np

dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
data = np.loadtxt('weather/NYNEWYOR.txt', dtype=dt)
print data[:5]


Next, we find the total number of records in the data and create an array, year2012, that extracts and stores just the year 2012 NY weather data:

total_days = len(data)
print total_days


year2012 = data[data['Year'] == 2012]


Now we will plot the year 2012 data in a histogram, as below:

from scipy import stats
import scipy as sp
%pylab inline
import matplotlib.pyplot as plt
plt.hist(year2012['Temp'], bins=20);




Now we will define a helper function, describe_data, which prints the number of data points, the min/max, mean, variance, skew, kurtosis, and median of the data, using scipy.stats.describe:

def describe_data(x):
    n, min_max, mean, var, skew, kurt = sp.stats.describe(x)
    print 'number of points: ', n
    print 'min/max: ', min_max
    print 'mean: ', mean
    print 'variance: ', var
    print 'skew: ', skew
    print 'kurtosis: ', kurt
    print 'median: ', sp.median(x)


We will now call the function on the 2012 data set using:

describe_data(year2012['Temp'])



SciPy also provides the powerful mquantiles function, which allows us to quickly divide our data into quartiles by requesting points at 0.25, 0.5, 0.75, and 1.0.

sp.stats.mstats.mquantiles(year2012['Temp'], [0.25, 0.5, 0.75, 1.0])

The 0.25 value corresponds to the first value that is greater than 25% of the data entries. Similarly, 0.5 is the median, 0.75 is greater than 75% of the entries, and 1.0 corresponds to the greatest entry. We can pass any fractions between 0 and 1 to mquantiles; for example, if we are interested in the temperatures that are hotter than 90%, 95%, and 100% of all the other days, we can request those as well:

sp.stats.mstats.mquantiles(year2012['Temp'], [0.90, 0.95, 1.0])
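
As a quick cross-check (a minimal sketch, assuming year2012 is still in scope), we can compare these against NumPy's percentile function, which takes percentages from 0 to 100 instead of fractions; the values may differ slightly because mquantiles uses a different plotting-position convention by default:

# compare mquantiles output with np.percentile for the same cut points
for q in [25, 50, 75, 100]:
    print q, np.percentile(year2012['Temp'], q)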


Fit and Interpolate Data

Now we will fit and interpolate data using the statistical package in SciPy.

%pylab inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

 

Now let's seed the NumPy random number generator with the value 1. Then we will take 150 random samples from a normal distribution centered at 25, with a standard deviation of 1.

np.random.seed(1)
rnd = stats.norm.rvs(loc=25, scale=1, size=150)

 

Let’s go ahead and plot this sample.

plt.hist(rnd)



Now let’s see how to estimate a probability distribution function from the data that we have just sampled.

To calculate the mean and standard deviation in SciPy:

param = stats.norm.fit(rnd)
print "mean: ", param[0], "stddev: ", param[1]

Now let's see how the fitted distribution compares to the original samples. Here, we evaluate the fitted density function at 100 points between 20 and 30.

x = np.linspace(20, 30, 100)
pdf_fitted = stats.norm.pdf(x, loc=param[0], scale=param[1])

Now, we are going to plot a normalized histogram against the probability distribution function that we just fitted.

plt.title('Normal Distribution')
plt.plot(x, pdf_fitted, 'r-')
plt.hist(rnd, bins=20, normed=1, alpha=0.3);



Now we are going to use slightly more advanced techniques on our weather data. A common way to estimate an unknown probability density function (pdf) from a collection of samples of a random variable, like the temperature in our weather data, is Gaussian KDE, where KDE stands for kernel density estimation. Let's run two fits, one against all 18 years of New York City temperature data, and another against just the 2012 data. Then we will plot both pdfs against a histogram of the 2012 temperatures.

pdf_all = stats.gaussian_kde(data['Temp'])
year2012 = data[data['Year'] == 2012]
year2012t = year2012['Temp']
pdf_2012 = stats.gaussian_kde(year2012t)


We collect a sample of points between the temp values 10 and 100 and plot the pdfs over those points.

x = np.arange(10, 100)
plt.plot(x, pdf_all(x), x, pdf_2012(x));
plt.legend(['all data', '2012 data'])
plt.hist(year2012t, normed=True, alpha=0.3);
 




As we can see, the data still looks bi-modal.

One more thing we can do is analyze the data from all of the weather stations, across all 18 years, and see if it looks a little more like a Gaussian.

We will import os.path and the glob module. The glob module allows us to query a large number of files on the file system using the standard wildcard (*) syntax.

import os.path
import glob

files = glob.glob('weather/*.txt')
names = [os.path.splitext(os.path.basename(f))[0] for f in files]
print files
  
Now we will grab the temperature data from all of these weather stations and put it in one big array. We will have to do this dynamically: first, we load each data set independently and append its temperatures to a list called total_temps.

dt = np.dtype([('Month', 'int8'),
               ('Day', 'int8'),
               ('Year', 'int16'),
               ('Temp', 'float64')])

total_temps = []
files = glob.glob('weather/*.txt')
for f in files:
    t_data = np.loadtxt(f, dtype=dt)
    total_temps.append(t_data['Temp'])

 


After creating that list, we call NumPy's vstack function. vstack stacks the per-station temperature arrays on top of each other, which works because they all have the same length. After we call flatten(), we get one long one-dimensional array.

total_temps = np.vstack(total_temps).flatten()
print total_temps.shape

Let’s try to fit and then plot the Gaussian KDE against that data.

pdf = stats.gaussian_kde(total_temps)
x = np.arange(20, 100)
plt.hist(total_temps, bins = 20, normed = True)
plt.plot(x, pdf(x), 'r-');



Now, let's try to fit a normal Gaussian to this distribution. We call the norm.fit method, which returns a mean and a standard deviation. Then we can create a probability density function from that mean and standard deviation, and plot it against our histogram of the data:

mean, stddev = stats.norm.fit(total_temps)
pdf = stats.norm.pdf(x, loc=mean, scale=stddev)
plt.hist(total_temps, bins=20, normed=True)
plt.plot(x, pdf);

We can see that the function is not a good fit, as those -99 values are causing a problem. Let’s see what happens when we get rid of them.

First, let’s use the describe_data function that we have defined earlier.

describe_data(total_temps)

If this were a regular Gaussian distribution, we would not expect the kurtosis and skew to be so high.



Here is a straightforward way to remove those problematic -99 values from our data. We save the result into a new variable called fixed_temps and run describe_data on the new values.

fixed_temps = total_temps[total_temps != -99]
describe_data(fixed_temps)

Now the skew and kurtosis are sharply reduced.




Now let’s see how the fixed fit compares against our original fit.

x = np.arange(10, 100)
mean, stddev = stats.norm.fit(total_temps)
pdf = stats.norm.pdf(x, loc=mean, scale=stddev)
plt.hist(total_temps, bins=20, normed=True)
plt.plot(x, pdf);

mean1, stddev1 = stats.norm.fit(fixed_temps)
fixed_pdf = stats.norm.pdf(x, loc=mean1, scale=stddev1)

plt.hist(total_temps, bins=20, normed=True)
plt.plot(x, pdf, x, fixed_pdf);

 

But since we have deleted entries, our data set now has gaps and is misaligned with the original data set. Let's talk about how to fix this using interpolation.

First, let's do a simple interpolation example. Let's take a coarsely sampled smooth function, cos(-x^2/8). We will do both linear and cubic interpolations: f below is the linear interpolator, and f2 is the cubic one.


from scipy.interpolate import interp1d

x = np.linspace(0, 10, 10)
y = np.cos(-x**2/8.0)
f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')

Now we will re-sample the data at a finer resolution. We will plot the original data points, the higher-resolution function itself, and then our two interpolators.

xnew = np.linspace(0, 10, 40)
plt.plot(x, y, 'x',
         xnew, np.cos(-xnew**2/8.0), '-',
         xnew, f(xnew), '-',
         xnew, f2(xnew), '-')
plt.legend(['data', 'actual', 'linear', 'cubic'], loc='best')
plt.show()



We can see the original data as blue x points. The linear interpolator draws straight lines between them, while the cubic interpolator tries to fit smoothly, assuming the derivative changes slowly. The cubic fit does a much better job of following the cosine function until there is no longer enough data to tell where the curve is going. We need to choose an interpolator that matches our underlying function: if we expect our data not to be very smooth, we are better off using a linear interpolator; if our data varies smoothly, we may consider a higher-order interpolator, although those are computationally more expensive.
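
To make that trade-off concrete, here is a minimal sketch (reusing x, y, f, f2, and xnew from above) that compares the maximum error of each interpolator against the true function over the resampled points:

true_vals = np.cos(-xnew**2/8.0)
print 'max linear error: ', np.max(np.abs(f(xnew) - true_vals))
print 'max cubic error:  ', np.max(np.abs(f2(xnew) - true_vals))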

So let's get back to our original data. Here, we reload the NY data using the same dtype as before. We grab two different sets of indices: one for the valid values, and another for the missing data (na_* here stands for missing data).

dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
data = np.loadtxt('weather/NYNEWYOR.txt', dtype=dt)

v_idxs = np.where(data['Temp'] != -99)
print v_idxs[0].size

na_idxs = np.where(data['Temp'] == -99)
print na_idxs[0].size


So we have 6729 valid entries, and 20 missing entries. Let’s take a closer look at some of those missing entries.

data[['Day', 'Temp']][514:523]

Now we delete the entries at those missing indices:

for na in na_idxs:
        data = np.delete(data, na, 0)

If we look at the data set again, we see that those missing entries are no longer there.

data[514:523][[ 'Day', 'Temp']]



Now if we want to bring that data back to our data set, we need to use linear interpolation, as below:

We take our modified data set, where we have deleted all the missing entries, and we build an interpolator over that.

x = v_idxs[0]
linterp = interp1d(x, data['Temp'], kind='linear')

Then we will reload the data, plot it, bad entries and all, and then use our new interpolator to show how to fix those holes.

original_data = np.loadtxt('weather/NYNEWYOR.txt', dtype=dt)
total_days = len(original_data)
plt.plot(original_data['Temp'][510:530])
xnew = np.arange(total_days)
plt.plot(linterp(xnew)[510:530]);



Here we can see the original data (in blue) diving down to -99 every time an entry is missing. Using the interpolator, though, we smoothly interpolate across the missing dates, providing a continuous data set that is easier to analyze correctly.

Understanding scikit-learn

SciPy covers many of the canonical scientific computing applications. However, recent developments in statistical computing and data mining have spawned a new field called machine learning. scikit-learn is widely considered the standard ML package for Python. Like many numerical computing packages, scikit-learn leverages NumPy's architecture and memory model under the hood.

At a high level, we can separate ML into two broad categories, supervised learning and unsupervised learning, though there is a gradient between these two extremes and many techniques are semi-supervised. Generally, we speak of supervised learning when the data comes pre-labeled: these samples are good, these samples are bad, these samples are category 1, 2, and so on. Supervised learning is often useful for predicting the category or class a new sample should belong to. Unsupervised learning covers algorithms where the data is not labeled; often there are relationships within the data that we want to infer or predict.

You can refer to Andreas Mueller’s cheat sheet for Scikit Learn at:
http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html

where he describes four main techniques in ML.



Classification: When we are trying to categorize labeled data (e.g., detecting spam in email, determining what digit has been handwritten, or detecting whether a signature has been forged).

Clustering: When we are trying to categorize unlabeled data (e.g., determining categories for movies, segmenting our customers or clients, or trying to find patterns in natural phenomena or processes).


Regression: We use regression when we are trying to estimate or predict a continuous response (e.g., weather forecasting).


Dimensionality Reduction: This can be further subdivided into feature selection and feature extraction. Both techniques try to reduce the number of variables under consideration. In feature selection, we try to identify the subset of variables that are most important for our data. In feature extraction, we try to transform the data into a space with fewer dimensions using techniques such as Principal Component Analysis (a minimal sketch follows below).
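
For example, here is a minimal sketch of feature extraction (not from the cheat sheet itself, just an illustration using scikit-learn's bundled iris data set) that uses Principal Component Analysis to project four-dimensional measurements down to two dimensions:

from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
print iris.data.shape            # (150, 4) - four features per flower

pca = PCA(n_components=2)        # keep the two most informative directions
reduced = pca.fit_transform(iris.data)
print reduced.shape              # (150, 2)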

In all of these techniques in scikit-learn, there is a fit stage and a predict stage: first we build the model from the training data, and then we predict against the test data.

Let's look at a quick example from the scikit-learn tutorial.

%pylab inline
import matplotlib.pyplot as plt

from sklearn import datasets

Then we get the digits data set from scikit-learn. The digits data set contains a series of handwritten digits, scanned at low resolution.

digits = datasets.load_digits()
plt.imshow(digits.images[-1],
           cmap=plt.cm.gray_r,
           interpolation='nearest')
 

Here, we have a very vague outline of a digit. The associated target array tells us which digit it actually is.

digits.target[-1]



Now, we are going to build a simple support vector classifier.

from sklearn import svm

In the first part, we fit it to the data. We exclude the final sample, because we are going to use it to test the prediction.

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

After the classifier has been built, we can use it for prediction.

clf.predict(digits.data[-1])

As you can see, the classifier predicts the same digit as the target.




Sunday, February 9, 2014

Post 20: Advanced NumPy Array Constructs

In this post, we will look at some of the advanced features of NumPy arrays.

NumPy arrays have a shape (or dimension). So if we create an array with shape = (M, N), we get an M x N array, with M rows and N columns.

Please note that there can be fewer or more dimensions when creating NumPy arrays, as NumPy arrays are really N-dimensional.

- To create a 1-Dimensional array:
>> np.zeros(shape=5, dtype='float')

- To create a 3-Dimensional array of shape 3x3x3:
>> np.zeros(shape=(3,3,3), dtype='float')

Arrays have a specified data type.

- You can create arrays from python lists:
>> np.array([[1,2,3], [4,5,6]], dtype='float')

Creating NumPy arrays from Python lists is not very efficient, as native Python data types are slow. So we often read and write directly from files instead, or use other utilities like zeros(), diag(), ones(), and arange().

- Evenly spaced values on an interval:
>> np.arange([start,] stop[, step])

arange allows fractional and negative steps

- Values equidistant on a linear scale:
>> np.linspace(start, end, num)

- Values equidistant on a log scale:
>> np.logspace(start, end, num)

- Zero-initialized array of shape 2x2:
>> np.zeros((2,2))

- One-initialized array of shape 1x5:
>> np.ones((1, 5))

- Uninitialized empty array:
>> np.empty((1, 3))

- Constant diagonal value of 1 of size 3x3:
>> np.eye(3)

- Multiple diagonal values of 1, 2, 3, 4 of size 4x4:
>> np.diag([1,2,3,4])

Note that array slices never create copies; they are views into the original data.


Array Assignment:

- Assign all elements of an array to a scalar:
>> a[:] = x
This copying of the scalar, as needed, is an example of NumPy broadcasting.

- Row or column assignment to a scalar or a sequence:
>> a[i, :] = x
>> a[i, :] = [A, B, C, D, E, F]


Array Math
- Operations with scalars apply to all elements
>> a = np.arange(10)
>> a + 20

- Operations on other arrays are element-wise
>> a + a

These operations create new arrays for the result

- Vectorized conditionals
    * Conditional operations make Boolean arrays
      >> a = np.arange(5)
      >> a > 2

    * np.where selects from an array
      >> np.where(a > 2)
      >> np.where(a > 2, a, False) 

           - Passing False as the third argument fills the positions where the condition is false with 0s.



    * Predicates (see the sketch at the end of this section):
       >> a.any() - Returns True if at least one element of the Boolean array is True
       >> a.all() - Returns True only if all elements of the Boolean array are True

- Reductions: A few of the reduction functions in NumPy are -
a.mean(), a.argmin(), a.argmax(), a.trace(), a.cumsum(), a.cumprod()

- Manipulation: a.argsort(), a.transpose(), a.reshape(…), a.ravel(), a.fill(…), a.clip(…) 
 
- Complex Numbers: These attributes and methods give access to the real part, imaginary part, and complex conjugate of a complex array:
a.real, a.imag, a.conj()
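
Here is the short sketch mentioned above, a minimal example (assuming NumPy is imported as np) with the expected results shown as comments:

>> a = np.arange(5)                    # array([0, 1, 2, 3, 4])
>> a > 2                               # array([False, False, False,  True,  True])
>> np.where(a > 2)                     # (array([3, 4]),) - indices of the True entries
>> np.where(a > 2, a, False)           # array([0, 0, 0, 3, 4]) - False positions become 0
>> (a > 2).any(), (a > 2).all()        # (True, False)
>> a.mean(), a.argmax(), a.cumsum()    # (2.0, 4, array([ 0,  1,  3,  6, 10]))
>> c = np.array([1+2j, 3-4j])
>> c.real, c.imag                      # (array([ 1.,  3.]), array([ 2., -4.]))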

Reshape Arrays:

- Reshape will reshape an array to any matching size
- Takes tuples as arguments

- Use -1 for a wildcard dimension
>> a = np.arange(30)
>> b = a.reshape((3, -1, 2))

>> b
>> b.shape
 

- a.ravel() and a.flatten() reshape arrays to one dimension (ravel returns a view where possible, while flatten always copies)
- np.squeeze removes singular dimensions: np.squeeze(b).shape (see the sketch below)
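
A minimal sketch of these reshaping operations (assuming NumPy is imported as np), with the resulting shapes shown as comments:

>> a = np.arange(30)
>> b = a.reshape((3, -1, 2))    # the -1 is inferred, so b.shape is (3, 5, 2)
>> b.ravel().shape              # (30,)
>> b.flatten().shape            # (30,) - same shape, but flatten always returns a copy
>> c = np.ones((1, 5, 1))
>> np.squeeze(c).shape          # (5,) - the singular dimensions are removed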





Array Data Type


- NumPy array elements have a single data type
- The type of object is accessible through the .dtype attribute
- Most common attributes of dtype object:
  - dtype.byteorder – big or little endian
  - dtype.itemsize – element size of the dtype
  - dtype.name – a name of this dtype object
  - dtype.type – type object used to create scalars etc.
- Array dtypes are usually inferred automatically
- But they can be specified explicitly: a = np.array([1, 2, 3], dtype=np.float32) (see the sketch below)
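
A minimal sketch inspecting those dtype attributes (typical output on a little-endian machine is shown in the comments):

>> a = np.array([1, 2, 3], dtype=np.float32)
>> a.dtype.name        # 'float32'
>> a.dtype.itemsize    # 4 - bytes per element
>> a.dtype.byteorder   # '=' - native byte order
>> a.dtype.type        # <type 'numpy.float32'>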



np.datetime64 is a new addition in NumPy 1.7


NumPy Data Model
- Metadata: dtype, shape, strides
- Memory pointers to data
- When slicing arrays, NumPy creates new dtype/shape/stride information, but reuses the pointer to the original data
>>> a = np.arange(10)
>>> b = a[3:7]
>>> b
array([3, 4, 5, 6])
 

>>> b[:] = 0
>>> a
array([0, 1, 2, 0, 0, 0, 0, 7, 8, 9])


>>> b.flags.owndata
False



Array Memory Layout

When we extract values from NumPy arrays and return them to Python, what NumPy does under the covers is pull the raw values out of the C memory structures and wrap them in a small header object, and that is what is exposed to Python. These array scalars, as they are called, are very lightweight wrappers, and they let Python deal with values inside the NumPy array as transparently and seamlessly as if the array were just a Python list.
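
A minimal sketch showing that an extracted element is an array scalar rather than a plain Python number, while still behaving like one:

>> a = np.arange(5, dtype='int32')
>> type(a[0])          # <type 'numpy.int32'> - a lightweight array scalar
>> a[0] + 1            # 1 - behaves just like a regular Python integer
>> a[0] == 0           # True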






- Universal Functions
NumPy ufuncs are functions that operate element-wise on one or more arrays. Essentially, a ufunc is two functions in one: at the Python level it returns a new array object, while under the hood it dispatches to optimized C loops based on the array dtype.
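
A minimal sketch of a couple of ufuncs operating element-wise on an array (assuming NumPy is imported as np):

>> a = np.arange(4, dtype='float64')   # array([ 0.,  1.,  2.,  3.])
>> np.add(a, 10)                       # array([ 10.,  11.,  12.,  13.]) - same as a + 10
>> np.sqrt(a)                          # array([ 0.,  1.,  1.41421356,  1.73205081])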


 Broadcasting
- Broadcasting is a key feature of NumPy.
  - Arrays with different, but compatible shapes can be used as arguments to ufuncs

>> c = a + 10

Here the scalar 10 is broadcast to the shape of the array a

- In order for an operation to broadcast, the size of each trailing dimension of the two arrays must either be equal or be one.

A (1-d array):              3
B (2-d array):          2 x 3
Result:                 2 x 3

A (2-d array):          6 x 1
B (3-d array):      1 x 6 x 4
Result:             1 x 6 x 4

A (4-d array):  3 x 1 x 6 x 1
B (3-d array):      2 x 1 x 4
Result:         3 x 2 x 6 x 4

(A short sketch verifying the first two of these combinations follows below.)
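
Here is that sketch, a minimal check of the broadcast result shapes (the values are all ones; only the shapes matter here):

>> A = np.ones(3)           # shape (3,)
>> B = np.ones((2, 3))      # shape (2, 3)
>> (A + B).shape            # (2, 3) - A is broadcast across the rows of B
>> C = np.ones((6, 1))
>> D = np.ones((1, 6, 4))
>> (C + D).shape            # (1, 6, 4)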


NumPy.random Example:
- np.random.random((4, 5))
- np.random.normal(loc, scale, size)
- np.random.hypergeometric(20, 10, 5, (10,))
- np.random.permutation(20)


numpy.linalg Example:
- Transpose of a matrix - a.T
- Inverse of a matrix - np.linalg.inv(a)
- Dot product of Matrices - np.dot(a, x)
- Solve a linear system - np.linalg.solve(a, y) (a small worked sketch follows below)
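
The small worked sketch mentioned above: we solve a 2x2 linear system with plain ndarrays and check the answer (expected values in comments):

>> a = np.array([[3.0, 1.0], [1.0, 2.0]])
>> y = np.array([9.0, 8.0])
>> x = np.linalg.solve(a, y)   # array([ 2.,  3.])
>> np.dot(a, x)                # array([ 9.,  8.]) - recovers y, so the solution checks out
>> np.linalg.inv(a)            # the explicit inverse (usually avoided for large systems)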





Array Subclasses:
- numpy.matrix – Matrix Operators
- numpy.recarray – Record Arrays
- numpy.ma – Masked Arrays
- numpy.memmap – Memory-mapped Arrays



NumPy.matrix Example:

NumPy matrix objects can be created using MATLAB-like syntax:
>> from numpy import matrix
>> A = matrix('1.0, 2.0; 3.0, 4.0')
>> Y = matrix('5.0; 7.0')
>> from numpy.linalg import solve
>> x = solve(A, Y)
>> A * x
>> A.I





Saturday, February 8, 2014

Post 19: Working with Real-Life Data using NumPy - Data Plotting

In this post, we will take the data that we loaded in Post 18 and start plotting and modifying it for our analysis.

For plotting the data, first import the pyplot as:
>> %pylab inline
>> import matplotlib.pyplot as plt

Then we can plot the temperature data using the command:
>> plt.plot(w_data['Temp']);

NumPy has vectorized conditionals that allow us to ask questions and get back Boolean arrays of the same size as the array the query was run over. For example, to ask which data entries have year = 1995:
>> w_data['Year'] == 1995

These Boolean arrays can be used to index the original array, selecting all the rows where the conditional is True. For example, to get the records where the year is 1995:
>> w_data[w_data['Year'] == 1995]




If we want to know what the hottest day of the year 1995 was, argmax gives us the index of the maximum temperature, and we can then look up that record:
>> year1995 = w_data[w_data['Year'] == 1995]
>> np.argmax(year1995['Temp'])
>> year1995[248]

If we need to know the hottest/coldest temperature, we can use the max/min functions:
>> year1995['Temp'].max()
>> year1995['Temp'].min()

If we need to know the mean and standard deviation of the data, we can use the mean/std function:
Average Temp:
>> year1995['Temp'].mean()

Standard Deviation:
>> year1995['Temp'].std()




To plot 2012 and 1995 data in a single plot:
>> %pylab inline
>> year2012 = w_data[w_data['Year'] == 2012]
>> plt.plot(year2012['Temp'], label='2012')
>> plt.plot(year1995['Temp'], label='1995')

>> plt.legend();


If we need to see how the temperature has changed from the day before, we can take two slices of the data (first slice: every element of the array except the last; second slice: every element except the first):
>> arr1 = year1995['Temp'][:-1]
>> arr2 = year1995['Temp'][1:]

Then, subtracting the first array from the second, we get the temperature change from the previous day:
>> deltaT = arr2 - arr1

We then plot these differences as a scatter plot:
>> plt.plot(deltaT, '.');
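
As an aside (a minimal sketch, not part of the original workflow), NumPy's diff function computes the same day-to-day change in a single call:

>> deltaT = np.diff(year1995['Temp'])
>> plt.plot(deltaT, '.');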



Now if we look at the first plot above, we see a few outliers in the data, where the temperature values show up as -99.

Let's do some clean up of those outliers:
1) Let's find the outliers in the data using the min function to see which anomalous records show such negative values:
>> w_data['Temp'].min()


2) Let’s replace that outlier with the min temp of 1995
a. To find all the places where the data is set to the outlier value, we use the np.where command:
>> np.where(w_data['Temp'] == -99.0)

b. Store the indices of all those values in an array as:
>> idxs = np.where(w_data['Temp'] == -99.0)

c. To find the min temp for 1995, we use the command
>> year1995['Temp'].min()



d. Using the indices above, fix the bad data with the command below:
>> w_data['Temp'][idxs]=51.2


e. Plot the data after correction:
>> plt.plot(w_data['Temp'])





So as you can see, once the outliers are removed, the data distribution looks much more reasonable.

3) To save this analysis data, use the npz format, as below. npz is NumPy's archive format for saving one or more arrays to a single file (np.savez_compressed can additionally compress it).
>> np.savez('weather-data.npz', w_data)

4) If we need to use that data later, we can load it from the npz file with:
>> dataz = np.load('weather-data.npz')


5) The data is stored under the default key arr_0 and can be retrieved as shown below (a note on saving under a friendlier name follows):
>> dataz['arr_0']
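
As a small variation (a minimal sketch), passing the array as a keyword argument to np.savez stores it under that name instead of the default arr_0, which makes the archive easier to read back:

>> np.savez('weather-data.npz', temps=w_data)
>> dataz = np.load('weather-data.npz')
>> dataz['temps']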



We have seen some of the NumPy constructs in this post. In the next post, we will see some more advanced array operations in NumPy.

Post 18: Working with Real-Life Data using NumPy - Data Extraction and Querying

In this example, we will take a weather data set from the National Oceanic and Atmospheric Administration (NOAA) and analyze it using NumPy constructs.


To get the data set from figshare, you can use the code below:

import numpy as np
import zipfile
import os

import urllib2

dirname = 'weather'
if os.path.exists(dirname):
    print 'weather directory exists, skipping download/unzip'
else:
    os.mkdir(dirname)

    url = 'http://files.figshare.com/1378975/weather.zip'
    response = urllib2.urlopen(url)
    fname = 'weather.zip'
    with open(fname, 'wb') as f:
        f.write(response.read())
    zfile = zipfile.ZipFile('weather.zip')

    for name in zfile.namelist()[1:]:
        print name
        with open(name, 'w') as f:
            f.write(zfile.read(name))
 

Now we can open any of the text files that we have downloaded. Each text file contains the data for one city, as indicated in the file name. We can open the California - Los Angeles file using:

>> with open('weather/CALOSANG.txt', 'r') as f:
       print '\n'.join(f.readlines()[:3])


To load that data into NumPy, use the following command:

>> w_data = np.loadtxt('weather/CALOSANG.txt')
>> w_data[0]





As shown above, that command loads everything with the default data type. We can instead specify a user-defined data type as below:
>> dt=np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
>> w_data = np.loadtxt('weather/CALOSANG.txt', dtype=dt)
>> w_data[:5]


To see the first 2 rows:
>> w_data[0:2]

To slice in the middle of the array:
>> w_data[30:35]


To pick out every 7th day, starting from the first:
>> w_data[0::7]




To see the last 5 entries in the data set:
>> w_data[-5:]

To reverse the entire array:
>> w_data[::-1]

To look at column entries (this is where the custom data type helps), we can index by the field name. So if we want to see the Temp data:
>> w_data['Temp']

If we want to mix column and row indexing and select the first 5 temperatures, we can write:
>> w_data['Temp'][:5]




In the next post, we will take this data and do some plotting and modification to carry on with our analysis.

Friday, February 7, 2014

Post 17: Numerical Analysis with NumPy

In this blog post, we will take a look at the NumPy ecosystem and see how to compute with NumPy arrays.

NumPy is a Python package that provides multidimensional arrays, tables, and matrices for Python. Data is stored in contiguous or strided arrays, and the element types are always homogeneous. Elements can be integers, floating-point numbers, character data, records, or nested records.


Green boxes are python's core scientific packages, built on top of NumPy.

We will run a couple of commands involving NumPy arrays to get a feel for the operations that can be done on them.

a) NumPy Arrays:

Suppose we declare a list, sales, as:
>> sales = [1.0, 2.0, 3.5, 4.5]

Now if we want to double the individual entries inside the list and write a command like:
>> print sales*2

it won't double the individual values inside the list; rather, it doubles the length of the list itself, with the list values repeated, as shown below.


So to double the individual values in such a list, we would need to run a command like:
>> print [2*s for s in sales]

But if we convert the list to a NumPy array, these operations can be done very easily and intuitively as shown below:

- First we need to import NumPy as:
>> import numpy as np

- Then we convert the list, sales, into a NumPy array as
>> sales = np.array(sales)
>> print sales

- To double the individual values inside the array:
>> print sales * 2

- To add a number to all the elements in the array:
>> print sales + 1

- To square all the individual elements in the array:
>> print np.power(sales, 2)

All NumPy arrays have a single element data type, which can be inspected using the dtype attribute.

Here are some more operations with NumPy arrays:

- To create an array of size 5 of all zeros
>> np.zeros(5)

- To create an array of size 5 of all ones
>> np.ones(5)

- To multiply all elements of the ones array by 42
>> np.ones(5) * 42

- To create an array of 5 random values uniformly distributed between 0 and 1
>> np.random.random(5)

- To create a 4x4 diagonal matrix with ones on the diagonal
>> np.diag([1, 1, 1, 1])

- To create an array of the values 0 through 9:
>> np.arange(0, 10)


- To create an array of numbers from 10 up to (but not including) 30, with a step of 2
>> np.arange(10,30,2)

- If we want to create a linear sample of a space, we use linspace. The following command creates 100 evenly spaced sample points between 0 and 1:
>> np.linspace(0,1,100)





- To find the data type of the array
>> x.dtype

- To specify the data type while creating the array:
>> x = np.ones(5, dtype='int64')


- Define the shape of the array while we construct it:
>> np.ones(shape=(3,3), dtype='float64')


- To create a random, uniformly sampled array from 0 to 1, of size 5x5.
>> np.random.uniform(0,1,size=(5,5))


- To reshape a one-dimensional array into a multidimensional one (or vice versa), we use the reshape command:
>> data = np.arange(9)
>> data = data.reshape((3,3))

Note that the outer parentheses delimit the function call and the inner parentheses define the tuple, (3, 3) in this case.


- To access and change the first column:
>> data[0,0] = 12
>> data[1,0] = 13
>> data[2,0] = 14



- Selecting the complete first row
>> print data[0,:]

- Selecting the complete first column
>> print data[:,0]

- Assigning a whole column:
>> data[:,0] = 0



In the next post, we will work with real-life data to show how to use these constructs to analyze and modify a data set.