This exercise will show how to retrieve data from the web; load, access, and modify dictionaries of data; and analyze the data in Python dictionaries. We will also see how to use some less commonly known data structures in the Python standard library to simplify our analysis. We will finally visualize our results using matplotlib.
In this post we will talk about the process of retrieving the data from the web and storing it in a Python data structure. In posts 15 and 16, we will talk about querying and visualizing this data, respectively.
1) Retrieve the data from the web
We will be working with some web traffic data. We will programmatically acquire it from a remote web source, load it into a simple Python database, the Python dictionary, and then use the dictionary to perform basic queries and operations on our data. We will also do a couple of simple visualizations and talk about some other data structures available in the basic Python toolkit. We will work with some freely available US government web traffic data at:
http://developer.usa.gov/1usagov
What we see at the above link is live data, captured every time someone on the web clicks on a .mil or .gov URL using a URL shortener. The details about this data can be found at the URL:
http://usa.gov/About/developer-resources/1usagov.shtml
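Each line of the feed is a JSON object whose keys are short codes, for example 'a' for the user agent, 'c' for the country code, 'tz' for the time zone, and 'u' for the destination URL. An abbreviated, illustrative record (values invented here for illustration) might look like:

{ "a": "Mozilla/5.0 ...", "c": "US", "tz": "America/New_York", "u": "http://www.nasa.gov/..." }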
First we import urllib2 (needed to get the data from the web onto our computer):
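import urllib2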
Next, set the url variable to the link we will get the data from. I have saved a copy of the web traffic data in the file mentioned below and uploaded it to a figshare account:
url = 'http://files.figshare.com/' + \
'1374404/usagov_bitly_data2012_11_06_total.json'
As we are not doing any authentication etc., we are not using the http library routines.
Now we open a request object with urlopen and issue a read command:
request = urllib2.urlopen(url)
data = request.read()
This read operation could fail or time out on a less stable internet connection. To make the download slightly more robust, we can write the script as below:
import shutil
import os.path
import urllib2

url = 'http://files.figshare.com/' + \
      '1374404/usagov_bitly_data2012_11_06_total.json'
file = 'usagov_data.json'

if os.path.isfile(file):
    print 'file', file, 'already exists'
else:
    print 'downloading', file, 'from', url
    try:
        request = urllib2.urlopen(url)
        with open(file, 'wb') as f:
            shutil.copyfileobj(request, f)
    except urllib2.URLError as e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason:', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfil the request.'
            print 'Error code:', e.code
    print 'finished downloading file'
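As an optional sanity check (illustrative, not part of the original script), we can confirm that the file landed on disk and is non-empty:

# os.path is already imported above; file holds the local filename
print 'downloaded file size:', os.path.getsize(file), 'bytes'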
Note:
1) shutil is used for high-level file operations; shutil.copyfileobj streams the response to disk in chunks, so the whole file never has to sit in memory.
2) The file is opened for writing in the “with” block. The with block will automatically close the file when we exit it.
After we are done downloading the file, we use the “with” open statement again to read all the data in the file using the readlines method of the file object.
The data read in will be a list of strings. We will now take this list and transform it into Python dictionaries (key/value format). Since the underlying data is in JSON format, we will parse it using the Python json module. We import the module with:
import json
Since the copy of our data feed easily fits into memory, we can use map to transform our list of strings into a list of dictionaries:
records = map(json.loads, data)
If we are working with a much larger dataset, we might want to perform online analytics, that is, evaluating only one structure at a time. In Python, we can achieve this using the iterable map, imap, from the itertools module:
from itertools import imap
irecords = imap(json.loads, data)
imap returns a generator. It only provides data one item at a time, so to actually get a record out of an imap we have to use the next() method:
irecords.next()
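If we only want to peek at a few parsed records without materializing the whole list, we can combine imap with itertools.islice (a minimal sketch):

from itertools import imap, islice

# consume only the first three records from the lazy stream
for record in islice(imap(json.loads, data), 3):
    print record.keys()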
Putting it all together, the loading step looks like this (the Counter import is not used in this post):

import json
from collections import Counter

file = 'usagov_data.json'
with open(file, 'rb') as f:
    data = f.readlines()
records = map(json.loads, data)
Let's get the first record from the list of dictionaries and take a look at it.
a) Fetch the first record
>> record = records[0]
b) Print the record's keys and values (each record is stored in key/value format, since it is a dictionary):
>> print record.keys()
>> print record.values()
c) If you want a bit of formatted output, you can use json's dumps method
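For example:
>> print json.dumps(record, indent=4)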
d) As you can see, the keys are cryptic values, and you can change them to user-defined key names. So the key "a" can be changed to the user-defined name "User Agent" using the command below:
>> record['User Agent'] = record['a']
e) Once changed, you can delete the entry with the key 'a', as the 'User Agent' key now points to the same data.
>> del record['a']
f) If you want to do this for all the records in the stream, you can define a processing function as below:
>> def process_record(record):
       record['User Agent'] = record['a']
       del record['a']
       return record
g) Then you can call the function while populating the generator
>> irecords = imap(json.loads, data)
>> irecords = imap(process_record, irecords)

Equivalently, the two steps can be written as a single generator expression:

>> irecords = (process_record(json.loads(line)) for line in data)
The complete snippet:

def process_record(record):
    record['User Agent'] = record['a']
    del record['a']
    return record

from itertools import imap
irecords = imap(json.loads, data)
irecords = imap(process_record, irecords)

# or, equivalently, as a single generator expression:
irecords = (process_record(json.loads(line)) for line in data)

irecords.next()
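As a quick check (a minimal sketch; it assumes every record in the feed carries the 'a' key), we can pull the next record from the generator and confirm the rename took effect:

>> rec = irecords.next()
>> print rec['User Agent']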
So we have successfully loaded the data into Python dictionaries, and in the next post we will show how to query this data.