Tutorial:PracticalPython/2 Working with data
Working With Data
To write useful programs, you need to be able to work with data. This section introduces Python’s core data structures of tuples, lists, sets, and dictionaries and discusses common data handling idioms. The last part of this section dives a little deeper into Python’s underlying object model.
Datatypes and Data structures
This section introduces data structures in the form of tuples and dictionaries.
Primitive Datatypes
Python has a few primitive types of data:
- Integers
- Floating point numbers
- Strings (text)
We learned about these in the introduction.
None type
email_address = None
None is often used as a placeholder for an optional or missing value. It evaluates as False in conditionals.
if email_address: send_email(email_address, msg)
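As a small sketch of the idea above, the following shows None acting as a "no value" placeholder. The send_email() function here is a hypothetical stand-in (it is not defined anywhere in this course), and notify() is an illustrative helper name.

```python
def send_email(address, msg):
    # Hypothetical stand-in for a real mail function; it only builds a string.
    return f'to={address}: {msg}'

def notify(email_address, msg):
    # None evaluates as False, so the message is skipped when no address is set.
    # Note that an empty string '' would also be skipped by this test.
    if email_address:
        return send_email(email_address, msg)
    return None

sent = notify('dave@example.com', 'Hello')
skipped = notify(None, 'Hello')
print(sent)
print(skipped)
```

If you need to distinguish None from other false values such as '' or 0, the test `email_address is not None` is more precise.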
Data Structures
Real programs have more complex data. For example, information about a stock holding:
100 shares of GOOG at $490.10
This is an “object” with three parts:
- Name or symbol of the stock (“GOOG”, a string)
- Number of shares (100, an integer)
- Price (490.10, a float)
Tuples
A tuple is a collection of values grouped together.
Example:
s = ('GOOG', 100, 490.1)
Sometimes the () are omitted in the syntax.
s = 'GOOG', 100, 490.1
Special cases (0-tuple, 1-tuple).
t = ()            # An empty tuple
w = ('GOOG', )    # A 1-item tuple
Tuples are often used to represent simple records or structures. Typically, it is a single object of multiple parts. A good analogy: A tuple is like a single row in a database table.
Tuple contents are ordered (like an array).
s = ('GOOG', 100, 490.1)
name = s[0]                 # 'GOOG'
shares = s[1]               # 100
price = s[2]                # 490.1
However, the contents can’t be modified.
>>> s[1] = 75
TypeError: object does not support item assignment
You can, however, make a new tuple based on a current tuple.
s = (s[0], 75, s[2])
Tuple Packing
Tuples are more about packing related items together into a single entity.
s = ('GOOG', 100, 490.1)
The tuple is then easy to pass around to other parts of a program as a single object.
Tuple Unpacking
To use the tuple elsewhere, you can unpack its parts into variables.
name, shares, price = s
print('Cost', shares * price)
The number of variables on the left must match the tuple structure.
name, shares = s     # ERROR
Traceback (most recent call last):
...
ValueError: too many values to unpack
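To tie packing, unpacking, and the "rebuild instead of modify" idea together, here is a minimal sketch. The variable names updated and error are chosen only for illustration.

```python
# Pack three related values into one tuple.
s = ('GOOG', 100, 490.1)

# Unpack into separate variables and use them.
name, shares, price = s
cost = shares * price

# Tuples can't be changed in place, so build a new one with the new share count.
updated = (name, 75, price)

# Unpacking with the wrong number of variables raises ValueError.
try:
    name, shares = s
except ValueError as e:
    error = str(e)

print(updated)
print(error)
```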
Tuples vs. Lists
Tuples look like read-only lists. However, tuples are most often used for a single item consisting of multiple parts. Lists are usually a collection of distinct items, usually all of the same type.
record = ('GOOG', 100, 490.1)       # A tuple representing a record in a portfolio
symbols = [ 'GOOG', 'AAPL', 'IBM' ]  # A List representing three stock symbols
Dictionaries
A dictionary is a mapping of keys to values. It’s also sometimes called a hash table or associative array. The keys serve as indices for accessing values.
s = { 'name': 'GOOG', 'shares': 100, 'price': 490.1 }
Common operations
To get values from a dictionary use the key names.
>>> print(s['name'], s['shares'])
GOOG 100
>>> s['price']
490.1
>>>
To add or modify values assign using the key names.
>>> s['shares'] = 75
>>> s['date'] = '6/6/2007'
>>>
To delete a value use the del statement.
>>> del s['date']
>>>
Why dictionaries?
Dictionaries are useful when there are many different values and those values might be modified or manipulated. Dictionaries make your code more readable.
s['price'] # vs s[2]
Exercises
In the last few exercises, you wrote a program that read a datafile Data/portfolio.csv. Using the csv module, it is easy to read the file row-by-row.
>>> import csv
>>> f = open('Data/portfolio.csv')
>>> rows = csv.reader(f)
>>> next(rows)
['name', 'shares', 'price']
>>> row = next(rows)
>>> row
['AA', '100', '32.20']
>>>
Although reading the file is easy, you often want to do more with the data than read it. For instance, perhaps you want to store it and start performing some calculations on it. Unfortunately, a raw “row” of data doesn’t give you enough to work with. For example, even a simple math calculation doesn’t work:
>>> row = ['AA', '100', '32.20']
>>> cost = row[1] * row[2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'str'
>>>
To do more, you typically want to interpret the raw data in some way and turn it into a more useful kind of object so that you can work with it later. Two simple options are tuples or dictionaries.
Exercise 2.1: Tuples
At the interactive prompt, create the following tuple that represents the above row, but with the numeric columns converted to proper numbers:
>>> t = (row[0], int(row[1]), float(row[2]))
>>> t
('AA', 100, 32.2)
>>>
Using this, you can now calculate the total cost by multiplying the shares and the price:
>>> cost = t[1] * t[2]
>>> cost
3220.0000000000005
>>>
Is math broken in Python? What’s the deal with the answer of 3220.0000000000005?
This is an artifact of the floating point hardware on your computer only being able to accurately represent decimals in Base-2, not Base-10. For even simple calculations involving base-10 decimals, small errors are introduced. This is normal, although perhaps a bit surprising if you haven’t seen it before.
This happens in all programming languages that use floating point decimals, but it often gets hidden when printing. For example:
>>> print(f'{cost:0.2f}')
3220.00
>>>
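The following sketch reproduces the rounding artifact and shows two common ways of coping with it. The use of the standard decimal module is an addition for illustration and is not part of the original exercise; it is one option when exact base-10 arithmetic matters.

```python
from decimal import Decimal

shares = 100
price = 32.2

# Binary floating point: the product carries a tiny representation error.
cost = shares * price

# Option 1: keep the float and round only when displaying it.
formatted = f'{cost:0.2f}'

# Option 2 (an assumption, not from the course): exact decimal arithmetic.
# Build the Decimal from a string so that no binary error sneaks in.
exact_cost = shares * Decimal('32.20')

print(cost)
print(formatted)
print(exact_cost)
```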
Tuples are read-only. Verify this by trying to change the number of shares to 75.
>>> t[1] = 75
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>>
Although you can’t change tuple contents, you can always create a completely new tuple that replaces the old one.
>>> t = (t[0], 75, t[2])
>>> t
('AA', 75, 32.2)
>>>
Whenever you reassign an existing variable name like this, the old value is discarded. Although the above assignment might look like you are modifying the tuple, you are actually creating a new tuple and throwing the old one away.
Tuples are often used to pack and unpack values into variables. Try the following:
>>> name, shares, price = t
>>> name
'AA'
>>> shares
75
>>> price
32.2
>>>
Take the above variables and pack them back into a tuple:
>>> t = (name, 2*shares, price)
>>> t
('AA', 150, 32.2)
>>>
Exercise 2.2: Dictionaries as a data structure
An alternative to a tuple is to create a dictionary instead.
>>> d = { 'name' : row[0], 'shares' : int(row[1]), 'price' : float(row[2]) }
>>> d
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>>
Calculate the total cost of this holding:
>>> cost = d['shares'] * d['price']
>>> cost
3220.0000000000005
>>>
Compare this example with the same calculation involving tuples above. Change the number of shares to 75.
>>> d['shares'] = 75
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2}
>>>
Unlike tuples, dictionaries can be freely modified. Add some attributes:
>>> d['date'] = (6, 11, 2007)
>>> d['account'] = 12345
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2, 'date': (6, 11, 2007), 'account': 12345}
>>>
Exercise 2.3: Some additional dictionary operations
If you turn a dictionary into a list, you’ll get all of its keys:
>>> list(d)
['name', 'shares', 'price', 'date', 'account']
>>>
Similarly, if you use the for statement to iterate on a dictionary, you will get the keys:
>>> for k in d:
        print('k =', k)

k = name
k = shares
k = price
k = date
k = account
>>>
Try this variant that performs a lookup at the same time:
>>> for k in d:
        print(k, '=', d[k])

name = AA
shares = 75
price = 32.2
date = (6, 11, 2007)
account = 12345
>>>
You can also obtain all of the keys using the keys() method:
>>> keys = d.keys()
>>> keys
dict_keys(['name', 'shares', 'price', 'date', 'account'])
>>>
keys() is a bit unusual in that it returns a special dict_keys object.
This is an overlay on the original dictionary that always gives you the current keys, even if the dictionary changes. For example, try this:
>>> del d['account']
>>> keys
dict_keys(['name', 'shares', 'price', 'date'])
>>>
Carefully notice that 'account' disappeared from keys even though you didn’t call d.keys() again.
A more elegant way to work with keys and values together is to use the items() method. This gives you (key, value) tuples:
>>> items = d.items()
>>> items
dict_items([('name', 'AA'), ('shares', 75), ('price', 32.2), ('date', (6, 11, 2007))])
>>> for k, v in d.items():
        print(k, '=', v)

name = AA
shares = 75
price = 32.2
date = (6, 11, 2007)
>>>
If you have tuples such as items, you can create a dictionary using the dict() function. Try it:
>>> items
dict_items([('name', 'AA'), ('shares', 75), ('price', 32.2), ('date', (6, 11, 2007))])
>>> d = dict(items)
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2, 'date': (6, 11, 2007)}
>>>
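As a compact recap of this exercise, the sketch below contrasts a live view with a snapshot. The names snapshot and rebuilt are chosen just for this illustration.

```python
d = {'name': 'AA', 'shares': 75, 'price': 32.2}

keys = d.keys()             # A live view onto d, not a copy
snapshot = list(d.items())  # A real list of (key, value) tuples, frozen now

# Change the dictionary after the view and the snapshot were taken.
d['date'] = (6, 11, 2007)

keys_now = list(keys)       # The view sees the new key
rebuilt = dict(snapshot)    # The snapshot does not

print(keys_now)
print(rebuilt)
```

If you want a result that will not change later, convert the view with list() at the moment you need it.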
Containers
This section discusses lists, dictionaries, and sets.
Overview
Programs often have to work with many objects.
- A portfolio of stocks
- A table of stock prices
There are three main choices to use.
- Lists. Ordered data.
- Dictionaries. Key-value data, looked up by key rather than by position.
- Sets. Unordered collection of unique items.
Lists as a Container
Use a list when the order of the data matters. Remember that lists can hold any kind of object. For example, a list of tuples.
portfolio = [
    ('GOOG', 100, 490.1),
    ('IBM', 50, 91.3),
    ('CAT', 150, 83.44)
]

portfolio[0]            # ('GOOG', 100, 490.1)
portfolio[2]            # ('CAT', 150, 83.44)
List construction
Building a list from scratch.
records = []  # Initial empty list

# Use .append() to add more items
records.append(('GOOG', 100, 490.10))
records.append(('IBM', 50, 91.3))
...
An example when reading records from a file.
records = []  # Initial empty list

with open('Data/portfolio.csv', 'rt') as f:
    next(f)  # Skip the header line
    for line in f:
        row = line.split(',')
        records.append((row[0], int(row[1]), float(row[2])))
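Here is a runnable variation of the same pattern. To keep it self-contained, it assumes a small in-memory sample (via io.StringIO) in place of Data/portfolio.csv, and it uses the csv module so that the quotes around the names are removed.

```python
import csv
import io

# In-memory stand-in for Data/portfolio.csv (sample data for illustration only).
sample = io.StringIO(
    'name,shares,price\n'
    '"AA",100,32.20\n'
    '"IBM",50,91.10\n'
)

records = []                 # Initial empty list
rows = csv.reader(sample)
headers = next(rows)         # Skip the header row
for row in rows:
    # Convert the text columns into a (str, int, float) tuple.
    records.append((row[0], int(row[1]), float(row[2])))

print(records)
```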
Dicts as a Container
Dictionaries are useful if you want fast random lookups (by key name). For example, a dictionary of stock prices:
prices = {
   'GOOG': 513.25,
   'CAT': 87.22,
   'IBM': 93.37,
   'MSFT': 44.12
}
Here are some simple lookups:
>>> prices['IBM']
93.37
>>> prices['GOOG']
513.25
>>>
Dict Construction
Example of building a dict from scratch.
prices = {}  # Initial empty dict

# Insert new items
prices['GOOG'] = 513.25
prices['CAT'] = 87.22
prices['IBM'] = 93.37
An example populating the dict from the contents of a file.
prices = {}  # Initial empty dict

with open('Data/prices.csv', 'rt') as f:
    for line in f:
        row = line.split(',')
        prices[row[0]] = float(row[1])
Dictionary Lookups
You can test the existence of a key.
if key in d:
    # YES
else:
    # NO
You can look up a value that might not exist and provide a default value in case it doesn’t.
name = d.get(key, default)
An example:
>>> prices.get('IBM', 0.0)
93.37
>>> prices.get('SCOX', 0.0)
0.0
>>>
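One place where get() with a default is handy is tabulating totals, where a key may or may not have been seen yet. The following is a small sketch with made-up holdings; total_shares is an illustrative name.

```python
holdings = [('GOOG', 100), ('IBM', 50), ('GOOG', 25)]

total_shares = {}
for name, shares in holdings:
    # get() returns 0 the first time a name is seen, so no 'in' test is needed.
    total_shares[name] = total_shares.get(name, 0) + shares

print(total_shares)
```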
Composite keys
Almost any type of value can be used as a dictionary key in Python. A dictionary key must be of a type that is immutable. For example, tuples:
holidays = {
  (1, 1) : 'New Years',
  (3, 14) : 'Pi day',
  (9, 13) : "Programmer's day",
}
Then to access:
>>> holidays[3, 14]
'Pi day'
>>>
Neither a list, a set, nor another dictionary can serve as a dictionary key, because lists, sets, and dictionaries are mutable.
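A quick way to see the difference is to try both kinds of key. This is only a sketch; the exact wording of the error message can vary between Python versions.

```python
holidays = {(1, 1): 'New Years', (3, 14): 'Pi day'}

# A tuple key works. The parentheses are optional inside the [].
name = holidays[3, 14]

# A list key is rejected because lists are mutable (unhashable).
try:
    bad = {[3, 14]: 'Pi day'}
except TypeError as e:
    error = str(e)

print(name)
print(error)
```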
Sets
A set is an unordered collection of unique items.
tech_stocks = { 'IBM','AAPL','MSFT' }
# Alternative syntax
tech_stocks = set(['IBM', 'AAPL', 'MSFT'])
Sets are useful for membership tests.
>>> tech_stocks
{'AAPL', 'IBM', 'MSFT'}
>>> 'IBM' in tech_stocks
True
>>> 'FB' in tech_stocks
False
>>>
Sets are also useful for duplicate elimination.
names = ['IBM', 'AAPL', 'GOOG', 'IBM', 'GOOG', 'YHOO']

unique = set(names)
# unique = set(['IBM', 'AAPL','GOOG','YHOO'])
Additional set operations:
unique.add('CAT')        # Add an item
unique.remove('YHOO')    # Remove an item

s1 | s2                  # Set union
s1 & s2                  # Set intersection
s1 - s2                  # Set difference
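The operations above can be tried together in one short sketch. The set named tech and the result names are chosen here only for illustration.

```python
names = ['IBM', 'AAPL', 'GOOG', 'IBM', 'GOOG', 'YHOO']

unique = set(names)          # Duplicates are eliminated
unique.add('CAT')            # Add an item
unique.remove('YHOO')        # Remove an item

tech = {'IBM', 'AAPL', 'MSFT'}

in_both = unique & tech      # Items in both sets
in_either = unique | tech    # Items in at least one set
only_unique = unique - tech  # Items in unique but not in tech

# Sets are unordered, so sort them to get a predictable display.
print(sorted(in_both))
print(sorted(in_either))
print(sorted(only_unique))
```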
Exercises
In these exercises, you start building one of the major programs used for the rest of this course. Do your work in the file Work/report.py.
Exercise 2.4: A list of tuples
The file Data/portfolio.csv contains a list of stocks in a portfolio. In Exercise 1.30, you wrote a function portfolio_cost(filename) that read this file and performed a simple calculation.
Your code should have looked something like this:
# pcost.py

import csv

def portfolio_cost(filename):
    '''Computes the total cost (shares*price) of a portfolio file'''
    total_cost = 0.0

    with open(filename, 'rt') as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            nshares = int(row[1])
            price = float(row[2])
            total_cost += nshares * price
    return total_cost
Using this code as a rough guide, create a new file report.py. In that file, define a function read_portfolio(filename) that opens a given portfolio file and reads it into a list of tuples. To do this, you’re going to make a few minor modifications to the above code.
First, instead of defining total_cost = 0.0, you’ll make a variable that’s initially set to an empty list. For example:
portfolio = []
Next, instead of totaling up the cost, you’ll turn each row into a tuple exactly as you just did in the last exercise and append it to this list. For example:
for row in rows:
    holding = (row[0], int(row[1]), float(row[2]))
    portfolio.append(holding)
Finally, you’ll return the resulting portfolio list.
Experiment with your function interactively (just a reminder that in order to do this, you first have to run the report.py program in the interpreter):
Hint: Use the -i option when executing the file in the terminal (for example, python3 -i report.py).
>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> portfolio
[('AA', 100, 32.2), ('IBM', 50, 91.1), ('CAT', 150, 83.44), ('MSFT', 200, 51.23),
    ('GE', 95, 40.37), ('MSFT', 50, 65.1), ('IBM', 100, 70.44)]
>>>
>>> portfolio[0]
('AA', 100, 32.2)
>>> portfolio[1]
('IBM', 50, 91.1)
>>> portfolio[1][1]
50
>>> total = 0.0
>>> for s in portfolio:
        total += s[1] * s[2]

>>> print(total)
44671.15
>>>
This list of tuples that you have created is very similar to a 2-D array. For example, you can access a specific column and row using a lookup such as portfolio[row][column] where row and column are integers.
That said, you can also rewrite the last for-loop using a statement like this:
>>> total = 0.0
>>> for name, shares, price in portfolio:
        total += shares*price

>>> print(total)
44671.15
>>>
Exercise 2.5: List of Dictionaries
Take the function you wrote in Exercise 2.4 and modify it to represent each stock in the portfolio with a dictionary instead of a tuple. In this dictionary, use the field names “name”, “shares”, and “price” to represent the different columns in the input file.
Experiment with this new function in the same manner as you did in Exercise 2.4.
>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> portfolio
[{'name': 'AA', 'shares': 100, 'price': 32.2}, {'name': 'IBM', 'shares': 50, 'price': 91.1},
    {'name': 'CAT', 'shares': 150, 'price': 83.44}, {'name': 'MSFT', 'shares': 200, 'price': 51.23},
    {'name': 'GE', 'shares': 95, 'price': 40.37}, {'name': 'MSFT', 'shares': 50, 'price': 65.1},
    {'name': 'IBM', 'shares': 100, 'price': 70.44}]
>>> portfolio[0]
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>> portfolio[1]
{'name': 'IBM', 'shares': 50, 'price': 91.1}
>>> portfolio[1]['shares']
50
>>> total = 0.0
>>> for s in portfolio:
        total += s['shares']*s['price']

>>> print(total)
44671.15
>>>
Here, you will notice that the different fields for each entry are accessed by key names instead of numeric column numbers. This is often preferred because the resulting code is easier to read later.
Viewing large dictionaries and lists can be messy. To clean up the output for debugging, consider using the pprint function.
>>> from pprint import pprint
>>> pprint(portfolio)
[{'name': 'AA', 'price': 32.2, 'shares': 100},
    {'name': 'IBM', 'price': 91.1, 'shares': 50},
    {'name': 'CAT', 'price': 83.44, 'shares': 150},
    {'name': 'MSFT', 'price': 51.23, 'shares': 200},
    {'name': 'GE', 'price': 40.37, 'shares': 95},
    {'name': 'MSFT', 'price': 65.1, 'shares': 50},
    {'name': 'IBM', 'price': 70.44, 'shares': 100}]
>>>
Exercise 2.6: Dictionaries as a container
A dictionary is a useful way to keep track of items where you want to look up items using an index other than an integer. In the Python shell, try playing with a dictionary:
>>> prices = { }
>>> prices['IBM'] = 92.45
>>> prices['MSFT'] = 45.12
>>> prices
... look at the result ...
>>> prices['IBM']
92.45
>>> prices['AAPL']
... look at the result ...
>>> 'AAPL' in prices
False
>>>
The file Data/prices.csv contains a series of lines with stock prices. The file looks something like this:
"AA",9.22
"AXP",24.85
"BA",44.85
"BAC",11.27
"C",3.72
...
Write a function read_prices(filename) that reads a set of prices such as this into a dictionary where the keys of the dictionary are the stock names and the values in the dictionary are the stock prices.
To do this, start with an empty dictionary and start inserting values into it just as you did above. However, you are reading the values from a file now.
We’ll use this data structure to quickly look up the price of a given stock name.
A few little tips that you’ll need for this part. First, make sure you use the csv module just as you did before; there’s no need to reinvent the wheel here.
>>> import csv
>>> f = open('Data/prices.csv', 'r')
>>> rows = csv.reader(f)
>>> for row in rows:
        print(row)

['AA', '9.22']
['AXP', '24.85']
...
[]
>>>
The other little complication is that the Data/prices.csv file may have some blank lines in it. Notice how the last row of data above is an empty list, meaning no data was present on that line.
There’s a possibility that this could cause your program to die with an exception. Use the try and except statements to catch this as appropriate. Thought: would it be better to guard against bad data with an if-statement instead?
Once you have written your read_prices() function, test it interactively to make sure it works:
>>> prices = read_prices('Data/prices.csv')
>>> prices['IBM']
106.28
>>> prices['MSFT']
20.89
>>>
Exercise 2.7: Finding out if you can retire
Tie all of this work together by adding a few additional statements to your report.py program that compute gain/loss. These statements should take the list of stocks in Exercise 2.5 and the dictionary of prices in Exercise 2.6 and compute the current value of the portfolio along with the gain/loss.
Formatting
This section is a slight digression, but when you work with data, you often want to produce structured output (tables, etc.). For example:
      Name     Shares       Price
---------- ---------- -----------
        AA        100       32.20
       IBM         50       91.10
       CAT        150       83.44
      MSFT        200       51.23
        GE         95       40.37
      MSFT         50       65.10
       IBM        100       70.44
String Formatting
One way to format strings in Python 3.6+ is with f-strings.
>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> f'{name:>10s} {shares:>10d} {price:>10.2f}'
'       IBM        100      91.10'
>>>
The part {expression:format} is replaced with the formatted value of the expression.
It is commonly used with print.
print(f'{name:>10s} {shares:>10d} {price:>10.2f}')
Format codes
Format codes (after the : inside the {}) are similar to C printf(). Common codes include:
d       Decimal integer
b       Binary integer
x       Hexadecimal integer
f       Float as [-]m.dddddd
e       Float as [-]m.dddddde+-xx
g       Float, but selective use of E notation
s       String
c       Character (from integer)
Common modifiers adjust the field width and decimal precision. This is a partial list:
:>10d   Integer right aligned in 10-character field
:<10d   Integer left aligned in 10-character field
:^10d   Integer centered in 10-character field
:0.2f   Float with 2 digit precision
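To see a few of these codes and modifiers in action, here is a short sketch. The sample values 255 and 1234.5 are arbitrary choices for illustration.

```python
value = 255
price = 1234.5

as_decimal = f'{value:d}'     # Decimal integer
as_binary = f'{value:b}'      # Binary integer
as_hex = f'{value:x}'         # Hexadecimal integer
as_char = f'{65:c}'           # Character from an integer code
as_float = f'{price:0.2f}'    # Float with 2 digit precision
as_exp = f'{price:e}'         # Float in exponent notation

right = f'{value:>10d}'       # Right aligned in a 10-character field
left = f'{value:<10d}'        # Left aligned in a 10-character field
center = f'{value:^10d}'      # Centered in a 10-character field

for text in (as_decimal, as_binary, as_hex, as_char, as_float, as_exp):
    print(text)
for text in (right, left, center):
    print(repr(text))         # repr() makes the padding visible
```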
Dictionary Formatting
You can use the format_map() method to apply string formatting to a dictionary of values:
>>> s = { 'name': 'IBM', 'shares': 100, 'price': 91.1 }
>>> '{name:>10s} {shares:10d} {price:10.2f}'.format_map(s)
'       IBM        100      91.10'
>>>
It uses the same codes as f-strings but takes the values from the supplied dictionary.
format() method
There is a method format() that can apply formatting to arguments or keyword arguments.
>>> '{name:>10s} {shares:10d} {price:10.2f}'.format(name='IBM', shares=100, price=91.1)
'       IBM        100      91.10'
>>> '{:>10s} {:10d} {:10.2f}'.format('IBM', 100, 91.1)
'       IBM        100      91.10'
>>>
Frankly, format() is a bit verbose. I prefer f-strings.
C-Style Formatting
You can also use the formatting operator %.
>>> 'The value is %d' % 3
'The value is 3'
>>> '%5d %-5d %10d' % (3,4,5)
'    3 4              5'
>>> '%0.2f' % (3.1415926,)
'3.14'
This requires a single item or a tuple on the right. Format codes are modeled after the C printf() as well.
Note: This is the only formatting available on byte strings.
>>> b'%s has %d messages' % (b'Dave', 37)
b'Dave has 37 messages'
>>>
Exercises
Exercise 2.8: How to format numbers
A common problem with printing numbers is specifying the number of decimal places. One way to fix this is to use f-strings. Try these examples:
>>> value = 42863.1
>>> print(value)
42863.1
>>> print(f'{value:0.4f}')
42863.1000
>>> print(f'{value:>16.2f}')
        42863.10
>>> print(f'{value:<16.2f}')
42863.10
>>> print(f'{value:*>16,.2f}')
*******42,863.10
>>>
Full documentation on the formatting codes used with f-strings can be found in the Python documentation on the format specification mini-language. Formatting is also sometimes performed using the % operator of strings.
>>> print('%0.4f' % value)
42863.1000
>>> print('%16.2f' % value)
        42863.10
>>>
Documentation on the various codes used with % can be found in the Python documentation on printf-style string formatting.
Although it’s commonly used with print, string formatting is not tied to printing. If you want to save a formatted string, just assign it to a variable.
>>> f = '%0.4f' % value
>>> f
'42863.1000'
>>>
Exercise 2.9: Collecting Data
In Exercise 2.7, you wrote a program called report.py that computed the gain/loss of a stock portfolio. In this exercise, you’re going to start modifying it to produce a table like this:
      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
        GE         95      13.48     -26.89
      MSFT         50      20.89     -44.21
       IBM        100     106.28      35.84
In this report, “Price” is the current share price of the stock and “Change” is the change in the share price from the initial purchase price.
In order to generate the above report, you’ll first want to collect all of the data shown in the table. Write a function make_report() that takes a list of stocks and dictionary of prices as input and returns a list of tuples containing the rows of the above table.
Add this function to your report.py file. Here’s how it should work if you try it interactively:
>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> prices = read_prices('Data/prices.csv')
>>> report = make_report(portfolio, prices)
>>> for r in report:
        print(r)

('AA', 100, 9.22, -22.980000000000004)
('IBM', 50, 106.28, 15.180000000000007)
('CAT', 150, 35.46, -47.98)
('MSFT', 200, 20.89, -30.339999999999996)
('GE', 95, 13.48, -26.889999999999997)
...
>>>
Exercise 2.10: Printing a formatted table
Redo the for-loop in Exercise 2.9, but change the print statement to format the tuples.
>>> for r in report:
        print('%10s %10d %10.2f %10.2f' % r)

        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
...
>>>
You can also expand the values and use f-strings. For example:
>>> for name, shares, price, change in report:
        print(f'{name:>10s} {shares:>10d} {price:>10.2f} {change:>10.2f}')

        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
...
>>>
Take the above statements and add them to your report.py program. Have your program take the output of the make_report() function and print a nicely formatted table as shown.
Exercise 2.11: Adding some headers
Suppose you had a tuple of header names like this:
headers = ('Name', 'Shares', 'Price', 'Change')
Add code to your program that takes the above tuple of headers and creates a string where each header name is right-aligned in a 10-character wide field and each field is separated by a single space.
'      Name     Shares      Price     Change'
Write code that takes the headers and creates the separator string between the headers and data to follow. This string is just a bunch of “-” characters under each field name. For example:
'---------- ---------- ---------- ----------'
When you’re done, your program should produce the table shown at the top of this exercise.
      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
        GE         95      13.48     -26.89
      MSFT         50      20.89     -44.21
       IBM        100     106.28      35.84
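If you get stuck, here is one possible sketch of the header and separator strings. It relies on the string join() method and a generator expression, which is only one of several reasonable approaches; a plain for-loop that builds up the string works just as well.

```python
headers = ('Name', 'Shares', 'Price', 'Change')

# Right-align each header in a 10-character field, separated by single spaces.
header_line = ' '.join(f'{h:>10s}' for h in headers)

# One run of 10 dashes per header, separated the same way.
separator = ' '.join('-' * 10 for h in headers)

print(header_line)
print(separator)
```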
Exercise 2.12: Formatting Challenge
How would you modify your code so that the price includes the currency symbol ($) and the output looks like this:
      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100      $9.22     -22.98
       IBM         50    $106.28      15.18
       CAT        150     $35.46     -47.98
      MSFT        200     $20.89     -30.34
        GE         95     $13.48     -26.89
      MSFT         50     $20.89     -44.21
       IBM        100    $106.28      35.84
Sequences
Sequence Datatypes
Python has three sequence datatypes.
- String: 'Hello'. A string is a sequence of characters.
- List: [1, 4, 5].
- Tuple: ('GOOG', 100, 490.1).
All sequences are ordered, indexed by integers, and have a length.
a = 'Hello'               # String
b = [1, 4, 5]             # List
c = ('GOOG', 100, 490.1)  # Tuple

# Indexed order
a[0]                      # 'H'
b[-1]                     # 5
c[1]                      # 100

# Length of sequence
len(a)                    # 5
len(b)                    # 3
len(c)                    # 3
Sequences can be replicated: s * n.
>>> a = 'Hello'
>>> a * 3
'HelloHelloHello'
>>> b = [1, 2, 3]
>>> b * 2
[1, 2, 3, 1, 2, 3]
>>>
Sequences of the same type can be concatenated: s + t.
>>> a = (1, 2, 3)
>>> b = (4, 5)
>>> a + b
(1, 2, 3, 4, 5)
>>>
>>> c = [1, 5]
>>> a + c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate tuple (not "list") to tuple
Slicing
Slicing means to take a subsequence from a sequence. The syntax is s[start:end], where start and end are the indexes of the subsequence you want.
a = [0,1,2,3,4,5,6,7,8]

a[2:5]    # [2,3,4]
a[-5:]    # [4,5,6,7,8]
a[:3]     # [0,1,2]
- Indices start and end must be integers.
- Slices do not include the end value. It is like a half-open interval from math.
- If indices are omitted, they default to the beginning or end of the list.
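The half-open rule has a convenient consequence: splitting a sequence at any index and joining the two pieces gives back the original. A brief sketch (the names head, tail, and whole are illustrative):

```python
a = [0, 1, 2, 3, 4, 5, 6, 7, 8]

middle = a[2:5]         # Items at indices 2, 3, 4 (index 5 is excluded)
head = a[:3]            # Omitted start defaults to the beginning
tail = a[3:]            # Omitted end defaults to the end
whole = a[:]            # Both omitted: a shallow copy of the entire list

print(middle)
print(head + tail == a)
print(whole == a, whole is a)
```

The length of s[start:end] is simply end - start whenever both indices are within range.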
Slice re-assignment
On lists, slices can be reassigned and deleted.
# Reassignment
a = [0,1,2,3,4,5,6,7,8]
a[2:4] = [10,11,12]       # [0,1,10,11,12,4,5,6,7,8]
Note: The reassigned slice doesn’t need to have the same length.
# Deletion
a = [0,1,2,3,4,5,6,7,8]
del a[2:4]                # [0,1,4,5,6,7,8]
Sequence Reductions
There are some common functions to reduce a sequence to a single value.
>>> s = [1, 2, 3, 4]
>>> sum(s)
10
>>> min(s)
1
>>> max(s)
4
>>> t = ['Hello', 'World']
>>> max(t)
'World'
>>>
Iteration over a sequence
The for-loop iterates over the elements in a sequence.
>>> s = [1, 4, 9, 16]
>>> for i in s:
...     print(i)
...
1
4
9
16
>>>
On each iteration of the loop, you get a new item to work with. This new value is placed into the iteration variable. In this example, the iteration variable is x:
for x in s:         # `x` is an iteration variable
    ...statements
On each iteration, the previous value of the iteration variable is overwritten (if any). After the loop finishes, the variable retains the last value.
break statement
You can use the break statement to break out of a loop early.
for name in namelist:
    if name == 'Jake':
        break
    ...
    ...
statements
When the break statement executes, it exits the loop and moves on to the next statements. The break statement only applies to the inner-most loop. If this loop is within another loop, it will not break the outer loop.
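The inner-most rule is easiest to see with a nested loop. In this sketch (made-up data, illustrative names), the break stops the scan of the current row only, and the outer loop carries on with the next row.

```python
grid = [[1, 3, 2, 5], [7, 4, 9]]

visited = []
for row in grid:
    for value in row:
        if value % 2 == 0:
            break              # Leaves the inner loop only
        visited.append(value)
    # Execution continues here, then the outer loop moves to the next row

print(visited)
```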
continue statement
To skip one element and move to the next one, use the continue statement.
for line in lines:
    if line == '\n':    # Skip blank lines
        continue
    # More statements
    ...
This is useful when the current item is not of interest or needs to be ignored in the processing.
Looping over integers
If you need to count, use range().
for i in range(100):
    # i = 0,1,...,99
The syntax is range([start,] end [,step]).
for i in range(100):
    # i = 0,1,...,99
for j in range(10,20):
    # j = 10,11,..., 19
for k in range(10,50,2):
    # k = 10,12,...,48
    # Notice how it counts in steps of 2, not 1.
- The ending value is never included. It mirrors the behavior of slices.
- start is optional. Default 0.
- step is optional. Default 1.
- range() computes values as needed. It does not actually store a large range of numbers.
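A short sketch of these points follows. The last part creates a range with a trillion values to illustrate that nothing is stored up front; it assumes a typical 64-bit Python build.

```python
evens = range(10, 50, 2)

first = evens[0]            # Ranges support indexing
last = evens[-1]            # The end value 50 is never included
count = len(evens)

# A huge range is cheap to create because values are computed on demand.
big = range(10**12)
big_len = len(big)
has_value = 10**11 in big   # Membership is computed, not searched

print(first, last, count)
print(big_len, has_value)
```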
enumerate() function
The enumerate function adds an extra counter value to iteration.
names = ['Elwood', 'Jake', 'Curtis']
for i, name in enumerate(names):
    # Loops with i = 0, name = 'Elwood'
    #            i = 1, name = 'Jake'
    #            i = 2, name = 'Curtis'
The general form is enumerate(sequence [, start = 0]). start is optional. A good example of using enumerate() is tracking line numbers while reading a file:
with open(filename) as f:
    for lineno, line in enumerate(f, start=1):
        ...
In the end, enumerate is just a nice shortcut for:
i = 0
for x in s:
    statements
    i += 1
Using enumerate is less typing and runs slightly faster.
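Here is a runnable version of the line-number idea. It assumes an in-memory file made with io.StringIO so that no real file is needed; with a real file you would use open(filename) as shown above.

```python
import io

# Stand-in for an open text file (sample content for illustration only).
f = io.StringIO('alpha\n\ngamma\n\n')

blank_lines = []
for lineno, line in enumerate(f, start=1):
    if line == '\n':
        blank_lines.append(lineno)   # Record 1-based line numbers

print(blank_lines)
```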
For and tuples
You can iterate with multiple iteration variables.
points = [
  (1, 4),(10, 40),(23, 14),(5, 6),(7, 8)
]
for x, y in points:
    # Loops with x = 1, y = 4
    #            x = 10, y = 40
    #            x = 23, y = 14
    #            ...
When using multiple variables, each tuple is unpacked into a set of iteration variables. The number of variables must match the number of items in each tuple.
zip() function
The zip function takes multiple sequences and makes an iterator that combines them.
columns = ['name', 'shares', 'price']
values = ['GOOG', 100, 490.1 ]
pairs = zip(columns, values)
# ('name','GOOG'), ('shares',100), ('price',490.1)
To get the result you must iterate. You can use multiple variables to unpack the tuples as shown earlier.
for column, value in pairs:
    ...
A common use of zip is to create key/value pairs for constructing dictionaries.
d = dict(zip(columns, values))
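Two details of zip() are worth trying for yourself: the iterator it returns can only be consumed once, and it stops as soon as the shortest input runs out. A minimal sketch (variable names are illustrative):

```python
columns = ['name', 'shares', 'price']
values = ['GOOG', 100, 490.1]

pairs = zip(columns, values)
first_pass = list(pairs)      # Consumes the iterator
second_pass = list(pairs)     # Nothing left the second time

# Stops at the shortest input: 'price' has no partner here.
short = list(zip(columns, ['GOOG', 100]))

record = dict(zip(columns, values))

print(first_pass)
print(second_pass)
print(short)
print(record)
```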
Exercises
Exercise 2.13: Counting
Try some basic counting examples:
>>> for n in range(10):            # Count 0 ... 9
        print(n, end=' ')

0 1 2 3 4 5 6 7 8 9
>>> for n in range(10,0,-1):       # Count 10 ... 1
        print(n, end=' ')

10 9 8 7 6 5 4 3 2 1
>>> for n in range(0,10,2):        # Count 0, 2, ... 8
        print(n, end=' ')

0 2 4 6 8
>>>
Exercise 2.14: More sequence operations
Interactively experiment with some of the sequence reduction operations.
>>> data = [4, 9, 1, 25, 16, 100, 49]
>>> min(data)
1
>>> max(data)
100
>>> sum(data)
204
>>>
Try looping over the data.
>>> for x in data:
        print(x)

4
9
...
>>> for n, x in enumerate(data):
        print(n, x)

0 4
1 9
2 1
...
>>>
Sometimes the for statement, len(), and range() get used by novices in some kind of horrible code fragment that looks like it emerged from the depths of a rusty C program.
>>> for n in range(len(data)):
        print(data[n])

4
9
1
...
>>>
Don’t do that! Not only does reading it make everyone’s eyes bleed, it also tends to run slower because every item requires an extra index lookup. Just use a normal for loop if you want to iterate over data. Use enumerate() if you happen to need the index for some reason.
Exercise 2.15: A practical enumerate() example
Recall that the file Data/missing.csv contains data for a stock portfolio, but has some rows with missing data. Using enumerate(), modify your pcost.py program so that it prints a line number with the warning message when it encounters bad input.
>>> cost = portfolio_cost('Data/missing.csv')
Row 4: Bad row: ['MSFT', '', '51.23']
Row 7: Bad row: ['IBM', '', '70.44']
>>>
To do this, you’ll need to change a few parts of your code.
...
for rowno, row in enumerate(rows, start=1):
    try:
        ...
    except ValueError:
        print(f'Row {rowno}: Bad row: {row}')
Exercise 2.16: Using the zip() function
In the file Data/portfolio.csv, the first line contains column headers. In all previous code, we’ve been discarding them.
>>> f = open('Data/portfolio.csv')
>>> rows = csv.reader(f)
>>> headers = next(rows)
>>> headers
['name', 'shares', 'price']
>>>
However, what if you could use the headers for something useful? This is where the zip() function enters the picture. First try this to pair the file headers with a row of data:
>>> row = next(rows)
>>> row
['AA', '100', '32.20']
>>> list(zip(headers, row))
[ ('name', 'AA'), ('shares', '100'), ('price', '32.20') ]
>>>
Notice how zip()
paired the column headers with the column values. We’ve used list()
here to turn the result into a list so that you can see it. Normally, zip()
creates an iterator that is consumed lazily, typically by a for-loop.
This pairing is an intermediate step to building a dictionary. Now try this:
>>> record = dict(zip(headers, row)) >>> record {'name': 'AA', 'shares': '100', 'price': '32.20'} >>>
This transformation is one of the most useful tricks to know about when processing a lot of data files. For example, suppose you wanted to make the pcost.py
program work with various input files, but without regard for the actual column number where the name, shares, and price appear.
Modify the portfolio_cost()
function in pcost.py
so that it looks like this:
# pcost.py def portfolio_cost(filename): ... for rowno, row in enumerate(rows, start=1): record = dict(zip(headers, row)) try: nshares = int(record['shares']) price = float(record['price']) total_cost += nshares * price # This catches errors in int() and float() conversions above except ValueError: print(f'Row {rowno}: Bad row: {row}') ...
Now, try your function on a completely different data file Data/portfoliodate.csv
which looks like this:
name,date,time,shares,price "AA","6/11/2007","9:50am",100,32.20 "IBM","5/13/2007","4:20pm",50,91.10 "CAT","9/23/2006","1:30pm",150,83.44 "MSFT","5/17/2007","10:30am",200,51.23 "GE","2/1/2006","10:45am",95,40.37 "MSFT","10/31/2006","12:05pm",50,65.10 "IBM","7/9/2006","3:15pm",100,70.44
>>> portfolio_cost('Data/portfoliodate.csv') 44671.15 >>>
If you did it right, you’ll find that your program still works even though the data file has a completely different column format than before. That’s cool!
The change made here is subtle, but significant. Instead of portfolio_cost()
being hardcoded to read a single fixed file format, the new version reads any CSV file and picks the values of interest out of it. As long as the file has the required columns, the code will work.
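As a rough, self-contained sketch of this idea, the version below takes any iterable of CSV lines instead of a filename. That is an assumption made here so the example runs without the Data/ directory. The name portfolio_cost_from_lines and the sample text are hypothetical and are not part of pcost.py.

```python
import csv
import io

def portfolio_cost_from_lines(lines):
    """Total cost (shares * price) from CSV lines that start with a header row."""
    rows = csv.reader(lines)
    headers = next(rows)                      # column names from the first line
    total_cost = 0.0
    for rowno, row in enumerate(rows, start=1):
        record = dict(zip(headers, row))      # column name -> raw string value
        try:
            total_cost += int(record['shares']) * float(record['price'])
        except ValueError:
            # Raised by int() or float() when a field is empty or malformed
            print(f'Row {rowno}: Bad row: {row}')
    return total_cost

# Hypothetical sample: an extra 'date' column and one row with no share count
sample = io.StringIO(
    'name,date,shares,price\n'
    '"AA","6/11/2007",100,32.20\n'
    '"IBM","5/13/2007",,91.10\n'
    '"CAT","9/23/2006",150,83.44\n'
)
total = portfolio_cost_from_lines(sample)
print(total)
```

Only the 'shares' and 'price' columns are looked up by name, so extra columns are ignored. A file that lacks one of those columns would raise a KeyError, which this sketch deliberately does not catch.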
Modify the report.py
program you wrote in Section 2.3 so that it uses the same technique to pick out column headers.
Try running the report.py
program on the Data/portfoliodate.csv
file and see that it produces the same answer as before.
Exercise 2.17: Inverting a dictionary
A dictionary maps keys to values. For example, here is a dictionary of stock prices.
>>> prices = { 'GOOG' : 490.1, 'AA' : 23.45, 'IBM' : 91.1, 'MSFT' : 34.23 } >>>
If you use the items()
method, you can get (key,value)
pairs:
>>> prices.items() dict_items([('GOOG', 490.1), ('AA', 23.45), ('IBM', 91.1), ('MSFT', 34.23)]) >>>
However, what if you wanted to get a list of (value, key)
pairs instead? Hint: use zip()
.
>>> pricelist = list(zip(prices.values(),prices.keys())) >>> pricelist [(490.1, 'GOOG'), (23.45, 'AA'), (91.1, 'IBM'), (34.23, 'MSFT')] >>>
Why would you do this? For one, it allows you to perform certain kinds of data processing on the dictionary data.
>>> min(pricelist) (23.45, 'AA') >>> max(pricelist) (490.1, 'GOOG') >>> sorted(pricelist) [(23.45, 'AA'), (34.23, 'MSFT'), (91.1, 'IBM'), (490.1, 'GOOG')] >>>
This also illustrates an important feature of tuples. When used in comparisons, tuples are compared element-by-element starting with the first item, similar to how strings are compared character-by-character.
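A small sketch of that comparison rule, using made-up (price, name) pairs: the second item is only consulted when the first items are equal.

```python
pairs = [(91.1, 'IBM'), (23.45, 'AA'), (91.1, 'GE')]

# First items are compared first; the name only breaks the tie at 91.1
ordered = sorted(pairs)
print(ordered)                        # [(23.45, 'AA'), (91.1, 'GE'), (91.1, 'IBM')]

tie_break = (91.1, 'GE') < (91.1, 'IBM')
print(tie_break)                      # True, because 'GE' < 'IBM'
```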
zip()
is often used in situations like this where you need to pair up data from different places. For example, pairing up the column names with column values in order to make a dictionary of named values.
Note that zip()
is not limited to pairs. For example, you can use it with any number of input lists:
>>> a = [1, 2, 3, 4] >>> b = ['w', 'x', 'y', 'z'] >>> c = [0.2, 0.4, 0.6, 0.8] >>> list(zip(a, b, c)) [(1, 'w', 0.2), (2, 'x', 0.4), (3, 'y', 0.6), (4, 'z', 0.8)] >>>
Also, be aware that zip()
stops once the shortest input sequence is exhausted.
>>> a = [1, 2, 3, 4, 5, 6] >>> b = ['x', 'y', 'z'] >>> list(zip(a,b)) [(1, 'x'), (2, 'y'), (3, 'z')] >>>
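If dropping the unmatched items is not what you want, the standard library's itertools.zip_longest() is one alternative. It is not covered elsewhere in this section, so treat the following as a brief aside rather than part of the exercise.

```python
from itertools import zip_longest

a = [1, 2, 3, 4, 5, 6]
b = ['x', 'y', 'z']

short = list(zip(a, b))                            # stops after the 3 items of b
padded = list(zip_longest(a, b, fillvalue=None))   # keeps going, padding b with None

print(short)
print(padded)
```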
collections module
The collections
module provides a number of useful objects for data handling. This part briefly introduces some of these features.
Example: Counting Things
Let’s say you want to tabulate the total shares of each stock.
portfolio = [ ('GOOG', 100, 490.1), ('IBM', 50, 91.1), ('CAT', 150, 83.44), ('IBM', 100, 45.23), ('GOOG', 75, 572.45), ('AA', 50, 23.15) ]
There are two IBM
entries and two GOOG
entries in this list. The shares need to be combined together somehow.
Counters
Solution: Use a Counter
.
from collections import Counter total_shares = Counter() for name, shares, price in portfolio: total_shares[name] += shares total_shares['IBM'] # 150
Example: One-Many Mappings
Problem: You want to map a key to multiple values.
portfolio = [ ('GOOG', 100, 490.1), ('IBM', 50, 91.1), ('CAT', 150, 83.44), ('IBM', 100, 45.23), ('GOOG', 75, 572.45), ('AA', 50, 23.15) ]
As in the previous example, there are two IBM entries. This time, the key IBM
should map to both of its tuples instead of a combined total.
Solution: Use a defaultdict
.
from collections import defaultdict holdings = defaultdict(list) for name, shares, price in portfolio: holdings[name].append((shares, price)) holdings['IBM'] # [ (50, 91.1), (100, 45.23) ]
The defaultdict
ensures that every time you access a missing key you get a default value (here, a new empty list).
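The following sketch shows when the default is actually created. It uses a small stand-in for the holdings data above; the behavior shown is that of defaultdict itself.

```python
from collections import defaultdict

holdings = defaultdict(list)
holdings['IBM'].append((50, 91.1))

before = 'CAT' in holdings      # False: a membership test does not create an entry
value = holdings['CAT']         # indexing a missing key calls list() and stores the result
after = 'CAT' in holdings       # True: the lookup above added the key
missing = holdings.get('AA')    # None: get() does not trigger the default

print(before, value, after, missing)
```

Because plain indexing adds keys as a side effect, use in or get() when you only want to check whether a key is present.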
Example: Keeping a History
Problem: We want a history of the last N things. Solution: Use a deque
.
from collections import deque history = deque(maxlen=N) with open(filename) as f: for line in f: history.append(line) ...
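Here is a runnable version of the same idea. The value of N and the list of lines are placeholders chosen for illustration; a real program would read the lines from a file.

```python
from collections import deque

N = 3                                   # hypothetical history size
history = deque(maxlen=N)

lines = ['line 1', 'line 2', 'line 3', 'line 4', 'line 5']   # stand-in for a file
for line in lines:
    history.append(line)                # once full, the oldest item is discarded

print(list(history))                    # ['line 3', 'line 4', 'line 5']
```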
Exercises
The collections
module might be one of the most useful library modules for dealing with special purpose kinds of data handling problems such as tabulating and indexing.
In this exercise, we’ll look at a few simple examples. Start by running your report.py
program so that you have the portfolio of stocks loaded in the interactive mode.
bash % python3 -i report.py
Exercise 2.18: Tabulating with Counters
Suppose you wanted to tabulate the total number of shares of each stock. This is easy using Counter
objects. Try it:
>>> portfolio = read_portfolio('Data/portfolio.csv') >>> from collections import Counter >>> holdings = Counter() >>> for s in portfolio: holdings[s['name']] += s['shares'] >>> holdings Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95}) >>>
Carefully observe how the multiple entries for MSFT
and IBM
in portfolio
get combined into a single entry here.
You can use a Counter just like a dictionary to retrieve individual values:
>>> holdings['IBM'] 150 >>> holdings['MSFT'] 250 >>>
If you want to rank the values, do this:
>>> # Get three most held stocks >>> holdings.most_common(3) [('MSFT', 250), ('IBM', 150), ('CAT', 150)] >>>
Let’s grab another portfolio of stocks and make a new Counter:
>>> portfolio2 = read_portfolio('Data/portfolio2.csv') >>> holdings2 = Counter() >>> for s in portfolio2: holdings2[s['name']] += s['shares'] >>> holdings2 Counter({'HPQ': 250, 'GE': 125, 'AA': 50, 'MSFT': 25}) >>>
Finally, let’s combine all of the holdings doing one simple operation:
>>> holdings Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95}) >>> holdings2 Counter({'HPQ': 250, 'GE': 125, 'AA': 50, 'MSFT': 25}) >>> combined = holdings + holdings2 >>> combined Counter({'MSFT': 275, 'HPQ': 250, 'GE': 220, 'AA': 150, 'IBM': 150, 'CAT': 150}) >>>
This is only a small taste of what counters provide. However, if you ever find yourself needing to tabulate values, you should consider using one.
Commentary: collections module
The collections
module is one of the most useful library modules in all of Python. In fact, we could do an extended tutorial on just that. However, doing so now would also be a distraction. For now, put collections
on your list of bedtime reading for later.
List Comprehensions
A common task is processing items in a list. This section introduces list comprehensions, a powerful tool for doing just that.
Creating new lists
A list comprehension creates a new list by applying an operation to each element of a sequence.
>>> a = [1, 2, 3, 4, 5] >>> b = [2*x for x in a ] >>> b [2, 4, 6, 8, 10] >>>
Another example:
>>> names = ['Elwood', 'Jake'] >>> a = [name.lower() for name in names] >>> a ['elwood', 'jake'] >>>
The general syntax is: [ <expression> for <variable_name> in <sequence> ]
.
Filtering
You can also filter during the list comprehension.
>>> a = [1, -5, 4, 2, -2, 10] >>> b = [2*x for x in a if x > 0 ] >>> b [2, 8, 4, 20] >>>
Use cases
List comprehensions are hugely useful. For example, you can collect the values of a specific dictionary field:
stocknames = [s['name'] for s in stocks]
You can perform database-like queries on sequences.
a = [s for s in stocks if s['price'] > 100 and s['shares'] > 50 ]
You can also combine a list comprehension with a sequence reduction:
cost = sum([s['shares']*s['price'] for s in stocks])
General Syntax
[ <expression> for <variable_name> in <sequence> if <condition>]
What it means:
result = [] for variable_name in sequence: if condition: result.append(expression)
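To make the equivalence concrete, this short sketch runs both forms on the list from the filtering example and checks that they agree.

```python
a = [1, -5, 4, 2, -2, 10]

# Comprehension form
doubled = [2 * x for x in a if x > 0]

# Equivalent explicit loop
result = []
for x in a:
    if x > 0:
        result.append(2 * x)

print(doubled)              # [2, 8, 4, 20]
print(doubled == result)    # True
```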
Historical Digression
List comprehensions come from math (set-builder notation).
a = [ x * x for x in s if x > 0 ] # Python a = { x^2 | x ∈ s, x > 0 } # Math
It is also implemented in several other languages. Most coders probably aren’t thinking about their math class though. So, it’s fine to view it as a cool list shortcut.
Exercises
Start by running your report.py
program so that you have the portfolio of stocks loaded in the interactive mode.
bash % python3 -i report.py
Now, at the Python interactive prompt, type statements to perform the operations described below. These operations perform various kinds of data reductions, transforms, and queries on the portfolio data.
Exercise 2.19: List comprehensions
Try a few simple list comprehensions just to become familiar with the syntax.
>>> nums = [1,2,3,4] >>> squares = [ x * x for x in nums ] >>> squares [1, 4, 9, 16] >>> twice = [ 2 * x for x in nums if x > 2 ] >>> twice [6, 8] >>>
Notice how the list comprehensions are creating a new list with the data suitably transformed or filtered.
Exercise 2.20: Sequence Reductions
Compute the total cost of the portfolio using a single Python statement.
>>> portfolio = read_portfolio('Data/portfolio.csv') >>> cost = sum([ s['shares'] * s['price'] for s in portfolio ]) >>> cost 44671.15 >>>
After you have done that, show how you can compute the current value of the portfolio using a single statement.
>>> value = sum([ s['shares'] * prices[s['name']] for s in portfolio ]) >>> value 28686.1 >>>
Both of the above operations are examples of a map-reduction. The list comprehension is mapping an operation across the list.
>>> [ s['shares'] * s['price'] for s in portfolio ] [3220.0000000000005, 4555.0, 12516.0, 10246.0, 3835.1499999999996, 3254.9999999999995, 7044.0] >>>
The sum()
function is then performing a reduction across the result:
>>> sum(_) 44671.15 >>>
With this knowledge, you are now ready to go launch a big-data startup company.
Exercise 2.21: Data Queries
Try the following examples of various data queries.
First, a list of all portfolio holdings with more than 100 shares.
>>> more100 = [ s for s in portfolio if s['shares'] > 100 ] >>> more100 [{'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}] >>>
All portfolio holdings for MSFT and IBM stocks.
>>> msftibm = [ s for s in portfolio if s['name'] in {'MSFT','IBM'} ] >>> msftibm [{'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}, {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}] >>>
A list of all portfolio holdings that cost more than $10000.
>>> cost10k = [ s for s in portfolio if s['shares'] * s['price'] > 10000 ] >>> cost10k [{'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}] >>>
Exercise 2.22: Data Extraction
Show how you could build a list of tuples (name, shares)
where name
and shares
are taken from portfolio
.
>>> name_shares =[ (s['name'], s['shares']) for s in portfolio ] >>> name_shares [('AA', 100), ('IBM', 50), ('CAT', 150), ('MSFT', 200), ('GE', 95), ('MSFT', 50), ('IBM', 100)] >>>
If you change the square brackets ([
,]
) to curly braces ({
, }
), you get something known as a set comprehension. This gives you unique or distinct values.
For example, this determines the set of unique stock names that appear in portfolio
:
>>> names = { s['name'] for s in portfolio } >>> names { 'AA', 'GE', 'IBM', 'MSFT', 'CAT' } >>>
If you specify key:value
pairs, you can build a dictionary. For example, make a dictionary that maps the name of a stock to the total number of shares held.
>>> holdings = { name: 0 for name in names } >>> holdings {'AA': 0, 'GE': 0, 'IBM': 0, 'MSFT': 0, 'CAT': 0} >>>
This latter feature is known as a dictionary comprehension. Let’s tabulate:
>>> for s in portfolio: holdings[s['name']] += s['shares'] >>> holdings { 'AA': 100, 'GE': 95, 'IBM': 150, 'MSFT':250, 'CAT': 150 } >>>
Try this example that filters the prices
dictionary down to only those names that appear in the portfolio:
>>> portfolio_prices = { name: prices[name] for name in names } >>> portfolio_prices {'AA': 9.22, 'GE': 13.48, 'IBM': 106.28, 'MSFT': 20.89, 'CAT': 35.46} >>>
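Note that prices[name] raises a KeyError if a name has no entry in prices. A minimal sketch, using small made-up stand-ins rather than the course data files, adds an if clause to skip such names:

```python
# Hypothetical stand-ins for the `prices` dictionary and the `names` set
prices = {'AA': 9.22, 'IBM': 106.28, 'MSFT': 20.89, 'GOOG': 490.1}
names = {'AA', 'IBM', 'CAT'}            # 'CAT' has no price here

# Keep only the names that actually have a price
portfolio_prices = {name: prices[name] for name in names if name in prices}

print(portfolio_prices == {'AA': 9.22, 'IBM': 106.28})   # True
```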
Exercise 2.23: Extracting Data From CSV Files
Knowing how to use various combinations of list, set, and dictionary comprehensions can be useful in various forms of data processing. Here’s an example that shows how to extract selected columns from a CSV file.
First, read a row of header information from a CSV file:
>>> import csv >>> f = open('Data/portfoliodate.csv') >>> rows = csv.reader(f) >>> headers = next(rows) >>> headers ['name', 'date', 'time', 'shares', 'price'] >>>
Next, define a variable that lists the columns that you actually care about:
>>> select = ['name', 'shares', 'price'] >>>
Now, locate the indices of the above columns in the source CSV file:
>>> indices = [ headers.index(colname) for colname in select ] >>> indices [0, 3, 4] >>>
Finally, read a row of data and turn it into a dictionary using a dictionary comprehension:
>>> row = next(rows) >>> record = { colname: row[index] for colname, index in zip(select, indices) } # dict-comprehension >>> record {'name': 'AA', 'shares': '100', 'price': '32.20'} >>>
If you’re feeling comfortable with what just happened, read the rest of the file:
>>> portfolio = [ { colname: row[index] for colname, index in zip(select, indices) } for row in rows ] >>> portfolio [{'name': 'IBM', 'shares': '50', 'price': '91.10'}, {'name': 'CAT', 'shares': '150', 'price': '83.44'}, {'name': 'MSFT', 'shares': '200', 'price': '51.23'}, {'name': 'GE', 'shares': '95', 'price': '40.37'}, {'name': 'MSFT', 'shares': '50', 'price': '65.10'}, {'name': 'IBM', 'shares': '100', 'price': '70.44'}] >>>
Oh my, you just reduced much of the read_portfolio()
function to a single statement.
Commentary
List comprehensions are commonly used in Python as an efficient means for transforming, filtering, or collecting data. Due to the syntax, you don’t want to go overboard—try to keep each list comprehension as simple as possible. It’s okay to break things into multiple steps. For example, it’s not clear that you would want to spring that last example on your unsuspecting co-workers.
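One way to follow that advice is to split the nested comprehension from Exercise 2.23 into named steps inside a small function. The helper name select_columns and the in-memory sample are hypothetical; the sample just mimics the layout of Data/portfoliodate.csv.

```python
import csv
import io

def select_columns(lines, select):
    """Read CSV lines and return a list of dicts holding only the `select` columns."""
    rows = csv.reader(lines)
    headers = next(rows)
    # Step 1: find where each wanted column lives in this particular file
    indices = [headers.index(colname) for colname in select]
    # Step 2: build one dictionary per row, one readable step at a time
    records = []
    for row in rows:
        record = {colname: row[index] for colname, index in zip(select, indices)}
        records.append(record)
    return records

sample = io.StringIO(
    'name,date,time,shares,price\n'
    '"AA","6/11/2007","9:50am",100,32.20\n'
    '"IBM","5/13/2007","4:20pm",50,91.10\n'
)
portfolio = select_columns(sample, ['name', 'shares', 'price'])
print(portfolio)
```

The result is the same list of dictionaries, but each step has a name that a co-worker can read and test on its own.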
That said, knowing how to quickly manipulate data is a skill that’s incredibly useful. There are numerous situations where you might have to solve some kind of one-off problem involving data imports, exports, extraction, and so forth. Becoming a guru master of list comprehensions can substantially reduce the time spent devising a solution. Also, don’t forget about the collections
module.
Objects
This section introduces more details about Python’s internal object model and discusses some matters related to memory management, copying, and type checking.
Assignment
Many operations in Python are related to assigning or storing values.
a = value # Assignment to a variable s[n] = value # Assignment to a list s.append(value) # Appending to a list d['key'] = value # Adding to a dictionary
A caution: assignment operations never make a copy of the value being assigned. All assignments are merely reference copies (or pointer copies if you prefer).
Assignment example
Consider this code fragment.
a = [1,2,3] b = a c = [a,b]
In this example, there is only one list object [1,2,3]
, but there are four different references to it.
This means that modifying a value affects all references.
>>> a.append(999) >>> a [1,2,3,999] >>> b [1,2,3,999] >>> c [[1,2,3,999], [1,2,3,999]] >>>
Notice how a change in the original list shows up everywhere else (yikes!). This is because no copies were ever made. Everything is pointing to the same thing.
Reassigning values
Reassigning a value never overwrites the memory used by the previous value.
a = [1,2,3] b = a a = [4,5,6] print(a) # [4, 5, 6] print(b) # [1, 2, 3] Holds the original value
Remember: Variables are names, not memory locations.
Some Dangers
If you don’t know about this sharing, you will shoot yourself in the foot at some point. A typical scenario: you modify some data thinking that it’s your own private copy, and the change accidentally corrupts data used by some other part of the program.
Comment: This is one of the reasons why the primitive datatypes (int, float, string) are immutable (read-only).
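The scenario can be sketched with a function that modifies the list it receives. The function name and data are made up for illustration.

```python
def add_bonus_shares(holding):
    """Appends to the list it is given, so the caller's list changes too."""
    holding.append(999)
    return holding

original = [100, 50, 150]
updated = add_bonus_shares(original)
print(original)               # [100, 50, 150, 999], changed through the shared reference
print(updated is original)    # True

# Passing a copy keeps the caller's data intact
safe = [100, 50, 150]
result = add_bonus_shares(list(safe))
print(safe)                   # [100, 50, 150]
```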
Identity and References
Use the is
operator to check if two values are exactly the same object.
>>> a = [1,2,3] >>> b = a >>> a is b True >>>
is
compares the object identity (an integer). The identity can be obtained using id()
.
>>> id(a) 3588944 >>> id(b) 3588944 >>>
Note: It is almost always better to use ==
for checking objects. The behavior of is
is often unexpected:
>>> a = [1,2,3] >>> b = a >>> c = [1,2,3] >>> a is b True >>> a is c False >>> a == c True >>>
Shallow copies
Lists and dicts have methods for copying.
>>> a = [2,3,[100,101],4] >>> b = list(a) # Make a copy >>> a is b False
It’s a new list, but the list items are shared.
>>> a[2].append(102) >>> b[2] [100,101,102] >>> >>> a[2] is b[2] True >>>
For example, the inner list [100, 101, 102]
is being shared. This is known as a shallow copy.
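Besides list(a), a few other common ways to make a shallow copy are shown in this sketch. The sample dictionary is made up for illustration. In every case only the outer container is new.

```python
a = [2, 3, [100, 101], 4]
d = {'name': 'GOOG', 'lots': [100, 75]}

b1 = a.copy()        # list method
b2 = a[:]            # full slice
e = d.copy()         # dict method

print(b1 == a, b1 is a)          # True False: equal contents, different list
print(b2[2] is a[2])             # True: the inner list is shared
print(e['lots'] is d['lots'])    # True: the inner list is shared
```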
Deep copies
Sometimes you need to make a copy of an object and all the objects contained within it. You can use the copy
module for this:
>>> a = [2,3,[100,101],4] >>> import copy >>> b = copy.deepcopy(a) >>> a[2].append(102) >>> b[2] [100,101] >>> a[2] is b[2] False >>>
Names, Values, Types
A variable name does not have a type; it’s only a name. However, values do have an underlying type.
>>> a = 42 >>> b = 'Hello World' >>> type(a) <class 'int'> >>> type(b) <class 'str'>
type()
will tell you what it is. The type name is usually used as a function that creates or converts a value to that type.
Type Checking
How to tell if an object is a specific type.
if isinstance(a, list): print('a is a list')
Checking for one of many possible types.
if isinstance(a, (list,tuple)): print('a is a list or tuple')
Caution: Don’t go overboard with type checking. It can lead to excessive code complexity. Usually you’d only do it if doing so would prevent common mistakes made by others using your code.
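As one illustration of a check that prevents a common mistake, consider a function that expects a list of names. A single string is also iterable, so without a check it would be processed character by character. The function below is hypothetical and only sketches the idea.

```python
def upper_names(names):
    """Return upper-case versions of several names."""
    # Reject a lone string up front; otherwise 'goog' would be treated
    # as the four separate names 'g', 'o', 'o', 'g'.
    if isinstance(names, str):
        raise TypeError('expected a list or tuple of names, not a single string')
    return [name.upper() for name in names]

print(upper_names(['goog', 'ibm']))     # ['GOOG', 'IBM']

try:
    upper_names('goog')
except TypeError as err:
    print('Error:', err)
```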
Everything is an object
Numbers, strings, lists, functions, exceptions, classes, instances, etc. are all objects. It means that all objects that can be named can be passed around as data, placed in containers, etc., without any restrictions. There are no special kinds of objects. Sometimes it is said that all objects are “first-class”.
A simple example:
>>> import math >>> items = [abs, math, ValueError ] >>> items [<built-in function abs>, <module 'math' (built-in)>, <class 'ValueError'>] >>> items[0](-45) 45 >>> items[1].sqrt(2) 1.4142135623730951 >>> try: x = int('not a number') except items[2]: print('Failed!') Failed! >>>
Here, items
is a list containing a function, a module and an exception. You can directly use the items in the list in place of the original names:
items[0](-45) # abs items[1].sqrt(2) # math except items[2]: # ValueError
With great power comes responsibility. Just because you can do that doesn’t mean you should.
Exercises
In this set of exercises, we look at some of the power that comes from first-class objects.
Exercise 2.24: First-class Data
In the file Data/portfolio.csv
, we read data organized as columns that look like this:
name,shares,price "AA",100,32.20 "IBM",50,91.10 ...
In previous code, we used the csv
module to read the file, but still had to perform manual type conversions. For example:
for row in rows: name = row[0] shares = int(row[1]) price = float(row[2])
This kind of conversion can also be performed in a more clever manner using some basic list operations.
Make a Python list that contains the names of the conversion functions you would use to convert each column into the appropriate type:
>>> types = [str, int, float] >>>
The reason you can even create this list is that everything in Python is first-class. So, if you want to have a list of functions, that’s fine. The items in the list you created are functions for converting a value x
into a given type (e.g., str(x)
, int(x)
, float(x)
).
Now, read a row of data from the above file:
>>> import csv >>> f = open('Data/portfolio.csv') >>> rows = csv.reader(f) >>> headers = next(rows) >>> row = next(rows) >>> row ['AA', '100', '32.20'] >>>
As noted, this row isn’t enough to do calculations because the types are wrong. For example:
>>> row[1] * row[2] Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: can't multiply sequence by non-int of type 'str' >>>
However, maybe the data can be paired up with the types you specified in types
. For example:
>>> types[1] <class 'int'> >>> row[1] '100' >>>
Try converting one of the values:
>>> types[1](row[1]) # Same as int(row[1]) 100 >>>
Try converting a different value:
>>> types[2](row[2]) # Same as float(row[2]) 32.2 >>>
Try the calculation with converted values:
>>> types[1](row[1])*types[2](row[2]) 3220.0000000000005 >>>
Zip the column types with the fields and look at the result:
>>> r = list(zip(types, row)) >>> r [(<class 'str'>, 'AA'), (<class 'int'>, '100'), (<class 'float'>, '32.20')] >>>
You will notice that this has paired a type conversion with a value. For example, int
is paired with the value '100'
.
The zipped list is useful if you want to perform conversions on all of the values, one after the other. Try this:
>>> converted = [] >>> for func, val in zip(types, row): converted.append(func(val)) ... >>> converted ['AA', 100, 32.2] >>> converted[1] * converted[2] 3220.0000000000005 >>>
Make sure you understand what’s happening in the above code. In the loop, the func
variable is one of the type conversion functions (e.g., str
, int
, etc.) and the val
variable is one of the values like 'AA'
, '100'
. The expression func(val)
is converting a value (kind of like a type cast).
The above code can be compressed into a single list comprehension.
>>> converted = [func(val) for func, val in zip(types, row)] >>> converted ['AA', 100, 32.2] >>>
Exercise 2.25: Making dictionaries
Remember how the dict()
function can easily make a dictionary if you have a sequence of key names and values? Let’s make a dictionary from the column headers:
>>> headers ['name', 'shares', 'price'] >>> converted ['AA', 100, 32.2] >>> dict(zip(headers, converted)) {'name': 'AA', 'shares': 100, 'price': 32.2} >>>
Of course, if you’re up on your list-comprehension fu, you can do the whole conversion in a single step using a dict-comprehension:
>>> { name: func(val) for name, func, val in zip(headers, types, row) } {'name': 'AA', 'shares': 100, 'price': 32.2} >>>
Exercise 2.26: The Big Picture
Using the techniques in this exercise, you could write statements that easily convert fields from just about any column-oriented datafile into a Python dictionary.
Just to illustrate, suppose you read data from a different datafile like this:
>>> f = open('Data/dowstocks.csv') >>> rows = csv.reader(f) >>> headers = next(rows) >>> row = next(rows) >>> headers ['name', 'price', 'date', 'time', 'change', 'open', 'high', 'low', 'volume'] >>> row ['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '39.67', '39.69', '39.45', '181800'] >>>
Let’s convert the fields using a similar trick:
>>> types = [str, float, str, str, float, float, float, float, int] >>> converted = [func(val) for func, val in zip(types, row)] >>> record = dict(zip(headers, converted)) >>> record {'name': 'AA', 'price': 39.48, 'date': '6/11/2007', 'time': '9:36am', 'change': -0.18, 'open': 39.67, 'high': 39.69, 'low': 39.45, 'volume': 181800} >>> record['name'] 'AA' >>> record['price'] 39.48 >>>
Bonus: How would you modify this example to additionally parse the date
entry into a tuple such as (6, 11, 2007)
?
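One possible approach, offered only as a sketch: because any callable can sit in the types list, a small helper can take the place of str for the date column. The helper name parse_date is made up here, and the headers and row are copied from the example above so that the block runs on its own.

```python
def parse_date(text):
    """Turn a string like '6/11/2007' into a tuple of ints such as (6, 11, 2007)."""
    return tuple(int(part) for part in text.split('/'))

headers = ['name', 'price', 'date', 'time', 'change', 'open', 'high', 'low', 'volume']
row = ['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '39.67', '39.69', '39.45', '181800']

# parse_date replaces str in the position of the 'date' column
types = [str, float, parse_date, str, float, float, float, float, int]
record = {name: func(val) for name, func, val in zip(headers, types, row)}

print(record['date'])       # (6, 11, 2007)
```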
Spend some time to ponder what you’ve done in this exercise. We’ll revisit these ideas a little later.