Parashar's Digital Notebook!: Post 19: Working with Real-Life Data using NumPy

In this post, we will take the data that we loaded in Post 18 and start plotting and modifying it for our analysis.

To plot the data, first enable inline plotting and import pyplot:
>> %pylab inline
>> import matplotlib.pyplot as plt

Then we can plot the temperature data using the command:
>> plt.plot(w_data['Temp']);

NumPy has vectorized conditionals: a comparison applied to an array returns a Boolean array of the same size as the array being queried, with one True/False entry per element. For example, to ask which data entries have the year 1995:
>> w_data['Year'] == 1995

These Boolean arrays can be used to index the original array, selecting all the rows where the condition is true. For example, to get the records where the year is 1995:
>> w_data[w_data['Year'] == 1995]
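
To make the two steps above concrete, here is a minimal sketch on a tiny made-up array. It assumes w_data is a NumPy structured array with 'Year' and 'Temp' fields, as the commands in this post suggest; the name sample and its values are only stand-ins.

```python
import numpy as np

# Hypothetical stand-in for w_data: a structured array with the
# 'Year' and 'Temp' fields used in this post (values are made up).
sample = np.array(
    [(1995, 60.1), (1995, 72.4), (2012, 65.0), (2012, 70.3)],
    dtype=[('Year', 'i4'), ('Temp', 'f8')],
)

mask = sample['Year'] == 1995   # Boolean array, one entry per record
rows_1995 = sample[mask]        # only the records where mask is True

print(mask)
print(rows_1995)
```

Indexing with a Boolean mask returns a copy of the selected rows, so changing rows_1995 afterwards does not change sample.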

To find the hottest day of 1995, we use argmax to get the index of the highest temperature (248 for this data) and then look up the record at that index:
>> year1995 = w_data[w_data['Year'] == 1995]
>> np.argmax(year1995['Temp'])
>> year1995[248]
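
Rather than typing the index by hand, the result of argmax can be stored and reused. The sketch below shows this on made-up data; the 'Day' field and the name sample are assumptions for illustration and may not match the real w_data layout.

```python
import numpy as np

# Hypothetical records: (Year, Day, Temp). The 'Day' field is an
# assumption used only to make the looked-up record readable.
sample = np.array(
    [(1995, 1, 60.1), (1995, 2, 72.4), (1995, 3, 65.0)],
    dtype=[('Year', 'i4'), ('Day', 'i4'), ('Temp', 'f8')],
)

hottest_idx = np.argmax(sample['Temp'])   # position of the largest Temp
hottest_record = sample[hottest_idx]      # the full record at that position

print(hottest_idx, hottest_record)
```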

If we only need the highest/lowest temperature itself, rather than the day it occurred on, we can use the max/min functions:
>> year1995['Temp'].max()
>> year1995['Temp'].min()

If we need to know the mean and standard deviation of the data, we can use the mean/std functions:
Average Temp:
>> year1995['Temp'].mean()

Standard Deviation:
>> year1995['Temp'].std()
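
One detail worth knowing: by default std divides by N (the population standard deviation). Passing ddof=1 divides by N - 1 instead, which gives the sample standard deviation. A small sketch on made-up temperatures:

```python
import numpy as np

temps = np.array([60.0, 62.0, 64.0, 66.0])   # made-up values

avg = temps.mean()
std_population = temps.std()         # default ddof=0, divides by N
std_sample = temps.std(ddof=1)       # divides by N - 1

print(avg, std_population, std_sample)
```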

To plot 2012 and 1995 data in a single plot:
>> %pylab inline
>> year2012 = w_data[w_data['Year'] == 2012]
>> plt.plot(year2012['Temp'], label='2012')
>> plt.plot(year1995['Temp'], label='1995')
>> plt.legend();

If we need to see how the temperature changed from the day before, we can take two slices of the data (1st slice: every element of the array except the last / 2nd slice: every element except the first):
>> arr1 = year1995['Temp'][:-1]
>> arr2 = year1995['Temp'][1:]

Subtracting the 1st array from the 2nd then gives the temperature change from the previous day:
>> deltaT = arr2 - arr1

We then plot these differences as a scatter plot:
>> plt.plot(deltaT, '.');
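
As a cross-check, NumPy's np.diff computes the same day-to-day differences in one call. The sketch below compares both approaches on made-up temperatures:

```python
import numpy as np

temps = np.array([60.0, 62.5, 61.0, 65.5])   # made-up values

# Two-slice approach from this post: each day minus the day before.
delta_slices = temps[1:] - temps[:-1]

# np.diff computes the same consecutive differences.
delta_diff = np.diff(temps)

print(delta_slices)
print(delta_diff)
```

Either way, the result has one element fewer than the input, because the first day has no previous day to compare against.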

Now, if we look at the first plot above, we see a few outliers in the data, where the temperature value is shown as -99.

Let's clean up those outliers:
1) Let's find the outlier value using the min function, which shows the anomalous negative value in those records:
>> w_data['Temp'].min()

2) Let's replace the outliers with the minimum temperature of 1995.
a. To find all the places where the data is set to the outlier value, we use the np.where command:
>> np.where(w_data['Temp'] == -99.0)

b. Store the indices of all those values in a variable:
>> idxs = np.where(w_data['Temp'] == -99.0)

c. To find the minimum temperature for 1995, we use the command:
>> year1995['Temp'].min()

d. Using the indices above, overwrite the bad data with that minimum (51.2 here):
>> w_data['Temp'][idxs]=51.2

e. Plot the data after correction:
>> plt.plot(w_data['Temp'])

As you can see, once the outliers are replaced, the data stays within a plausible range.

3) To save the cleaned data, use the npz format, as below. An npz file is an archive that can hold several arrays. Note that np.savez writes the archive uncompressed; use np.savez_compressed if you want a compressed file.
>> np.savez('weather-data.npz', w_data)

4) If we need to use that data later, we load it from the npz file with the command:
>> dataz = np.load('weather-data.npz')

5) Because we did not name the array when saving, it is stored under the key arr_0 and can be retrieved as:
>> dataz['arr_0']
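
The clean-up and save steps above can be sketched end to end on a small made-up array. This is only an illustration under assumptions: sample stands in for w_data, the file name and the key weather are hypothetical, and the replacement value is computed from the valid 1995 readings instead of being typed in by hand.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for w_data; -99.0 marks a missing reading.
sample = np.array(
    [(1995, 55.0), (1995, -99.0), (1995, 51.2), (2012, -99.0), (2012, 70.3)],
    dtype=[('Year', 'i4'), ('Temp', 'f8')],
)

bad = sample['Temp'] == -99.0        # Boolean mask of the outliers
in_1995 = sample['Year'] == 1995

# Minimum of the valid 1995 readings only (outliers excluded).
fill_value = sample['Temp'][in_1995 & ~bad].min()

# Field access returns a view, so this assignment edits sample in place.
sample['Temp'][bad] = fill_value

# Save under an explicit key (compressed), then load it back.
path = os.path.join(tempfile.mkdtemp(), 'weather-sample.npz')
np.savez_compressed(path, weather=sample)
with np.load(path) as archive:
    restored = archive['weather']

print(restored)
```

Passing the array as a keyword argument (weather=sample) means it can be retrieved by that name instead of the default arr_0.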

We have seen some of the NumPy constructs in this post. In the next post, we will look at some more advanced array operations in NumPy.


Saturday, February 8, 2014
