Tutorial:PracticalPython/2 Working with data


Working With Data

To write useful programs, you need to be able to work with data. This section introduces Python’s core data structures of tuples, lists, sets, and dictionaries and discusses common data handling idioms. The last part of this section dives a little deeper into Python’s underlying object model.


Datatypes and Data structures

This section introduces data structures in the form of tuples and dictionaries.

Primitive Datatypes

Python has a few primitive types of data:

  • Integers
  • Floating point numbers
  • Strings (text)

We learned about these in the introduction.

None type

email_address = None

None is often used as a placeholder for an optional or missing value. It evaluates as False in conditionals.

if email_address:
    send_email(email_address, msg)
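
One caveat: an empty string, 0, and empty containers also evaluate as False, so a plain truth test cannot tell a missing value from an empty one. The sketch below shows an explicit `is None` test; `describe()` is a hypothetical helper made up for this illustration and is not part of the course code.

```python
def describe(email_address):
    # Hypothetical helper, for illustration only.
    if email_address is None:        # no value was ever supplied
        return 'missing'
    if not email_address:            # a value was supplied, but it is empty
        return 'empty'
    return 'present'

print(describe(None))                # missing
print(describe(''))                  # empty
print(describe('user@example.com'))  # present
```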

Data Structures

Real programs have more complex data. For example, information about a stock holding:

100 shares of GOOG at $490.10

This is an “object” with three parts:

  • Name or symbol of the stock (“GOOG”, a string)
  • Number of shares (100, an integer)
  • Price (490.10, a float)

Tuples

A tuple is a collection of values grouped together.

Example:

s = ('GOOG', 100, 490.1)

Sometimes the () are omitted in the syntax.

s = 'GOOG', 100, 490.1

Special cases (0-tuple, 1-tuple).

t = ()            # An empty tuple
w = ('GOOG', )    # A 1-item tuple

Tuples are often used to represent simple records or structures. Typically, it is a single object of multiple parts. A good analogy: A tuple is like a single row in a database table.

Tuple contents are ordered (like an array).

s = ('GOOG', 100, 490.1)
name = s[0]                 # 'GOOG'
shares = s[1]               # 100
price = s[2]                # 490.1

However, the contents can’t be modified.

>>> s[1] = 75
TypeError: 'tuple' object does not support item assignment

You can, however, make a new tuple based on a current tuple.

s = (s[0], 75, s[2])

Tuple Packing

Tuples are more about packing related items together into a single entity.

s = ('GOOG', 100, 490.1)

The tuple is then easy to pass around to other parts of a program as a single object.

Tuple Unpacking

To use the tuple elsewhere, you can unpack its parts into variables.

name, shares, price = s
print('Cost', shares * price)

The number of variables on the left must match the tuple structure.

name, shares = s     # ERROR
Traceback (most recent call last):
...
ValueError: too many values to unpack

Tuples vs. Lists

Tuples look like read-only lists. However, tuples are most often used for a single item consisting of multiple parts. Lists are usually a collection of distinct items, usually all of the same type.

record = ('GOOG', 100, 490.1)       # A tuple representing a record in a portfolio

symbols = [ 'GOOG', 'AAPL', 'IBM' ]  # A List representing three stock symbols

Dictionaries

A dictionary is a mapping of keys to values. It’s also sometimes called a hash table or associative array. The keys serve as indices for accessing values.

s = {
    'name': 'GOOG',
    'shares': 100,
    'price': 490.1
}

Common operations

To get values from a dictionary, use the key names.

>>> print(s['name'], s['shares'])
GOOG 100
>>> s['price']
490.1
>>>

To add or modify values, assign using the key names.

>>> s['shares'] = 75
>>> s['date'] = '6/6/2007'
>>>

To delete a value, use the del statement.

>>> del s['date']
>>>

Why dictionaries?

Dictionaries are useful when there are many different values and those values might be modified or manipulated. Dictionaries make your code more readable.

s['price']
# vs
s[2]

Exercises

In the last few exercises, you wrote a program that read a datafile Data/portfolio.csv. Using the csv module, it is easy to read the file row-by-row.

>>> import csv
>>> f = open('Data/portfolio.csv')
>>> rows = csv.reader(f)
>>> next(rows)
['name', 'shares', 'price']
>>> row = next(rows)
>>> row
['AA', '100', '32.20']
>>>

Although reading the file is easy, you often want to do more with the data than read it. For instance, perhaps you want to store it and start performing some calculations on it. Unfortunately, a raw “row” of data doesn’t give you enough to work with. For example, even a simple math calculation doesn’t work:

>>> row = ['AA', '100', '32.20']
>>> cost = row[1] * row[2]
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'str'
>>>

To do more, you typically want to interpret the raw data in some way and turn it into a more useful kind of object so that you can work with it later. Two simple options are tuples or dictionaries.

Exercise 2.1: Tuples

At the interactive prompt, create the following tuple that represents the above row, but with the numeric columns converted to proper numbers:

>>> t = (row[0], int(row[1]), float(row[2]))
>>> t
('AA', 100, 32.2)
>>>

Using this, you can now calculate the total cost by multiplying the shares and the price:

>>> cost = t[1] * t[2]
>>> cost
3220.0000000000005
>>>

Is math broken in Python? What’s the deal with the answer of 3220.0000000000005?

This is an artifact of the floating point hardware on your computer only being able to accurately represent decimals in Base-2, not Base-10. For even simple calculations involving base-10 decimals, small errors are introduced. This is normal, although perhaps a bit surprising if you haven’t seen it before.

This happens in all programming languages that use floating point decimals, but it often gets hidden when printing. For example:

>>> print(f'{cost:0.2f}')
3220.00
>>>
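
As an aside, when exact decimal arithmetic matters (as it often does with money), the standard library's decimal module is one option. This is a minimal sketch rather than part of the exercise, and it assumes the price is still available as the original string read from the file:

```python
from decimal import Decimal
import math

shares = 100
price = Decimal('32.20')          # build from the string, not from a float
exact_cost = shares * price
print(exact_cost)                 # 3220.00

# With plain floats, compare using a tolerance instead of ==
float_cost = 100 * 32.2
print(math.isclose(float_cost, 3220.0))   # True
```

Decimal arithmetic is slower than float arithmetic, so for the exercises in this section plain floats with formatted output are usually enough.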

Tuples are read-only. Verify this by trying to change the number of shares to 75.

>>> t[1] = 75
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>>

Although you can’t change tuple contents, you can always create a completely new tuple that replaces the old one.

>>> t = (t[0], 75, t[2])
>>> t
('AA', 75, 32.2)
>>>

Whenever you reassign an existing variable name like this, the old value is discarded. Although the above assignment might look like you are modifying the tuple, you are actually creating a new tuple and throwing the old one away.

Tuples are often used to pack and unpack values into variables. Try the following:

>>> name, shares, price = t
>>> name
'AA'
>>> shares
75
>>> price
32.2
>>>

Take the above variables and pack them back into a tuple:

>>> t = (name, 2*shares, price)
>>> t
('AA', 150, 32.2)
>>>

Exercise 2.2: Dictionaries as a data structure

An alternative to a tuple is to create a dictionary instead.

>>> d = {
        'name' : row[0],
        'shares' : int(row[1]),
        'price'  : float(row[2])
    }
>>> d
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>>

Calculate the total cost of this holding:

>>> cost = d['shares'] * d['price']
>>> cost
3220.0000000000005
>>>

Compare this example with the same calculation involving tuples above. Change the number of shares to 75.

>>> d['shares'] = 75
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2}
>>>

Unlike tuples, dictionaries can be freely modified. Add some attributes:

>>> d['date'] = (6, 11, 2007)
>>> d['account'] = 12345
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2, 'date': (6, 11, 2007), 'account': 12345}
>>>

Exercise 2.3: Some additional dictionary operations

If you turn a dictionary into a list, you’ll get all of its keys:

>>> list(d)
['name', 'shares', 'price', 'date', 'account']
>>>

Similarly, if you use the for statement to iterate on a dictionary, you will get the keys:

>>> for k in d:
        print('k =', k)

k = name
k = shares
k = price
k = date
k = account
>>>

Try this variant that performs a lookup at the same time:

>>> for k in d:
        print(k, '=', d[k])

name = AA
shares = 75
price = 32.2
date = (6, 11, 2007)
account = 12345
>>>

You can also obtain all of the keys using the keys() method:

>>> keys = d.keys()
>>> keys
dict_keys(['name', 'shares', 'price', 'date', 'account'])
>>>

keys() is a bit unusual in that it returns a special dict_keys object.

This is an overlay on the original dictionary that always gives you the current keys—even if the dictionary changes. For example, try this:

>>> del d['account']
>>> keys
dict_keys(['name', 'shares', 'price', 'date'])
>>>

Carefully notice that the 'account' disappeared from keys even though you didn’t call d.keys() again.

A more elegant way to work with keys and values together is to use the items() method. This gives you (key, value) tuples:

>>> items = d.items()
>>> items
dict_items([('name', 'AA'), ('shares', 75), ('price', 32.2), ('date', (6, 11, 2007))])
>>> for k, v in d.items():
        print(k, '=', v)

name = AA
shares = 75
price = 32.2
date = (6, 11, 2007)
>>>

If you have tuples such as items, you can create a dictionary using the dict() function. Try it:

>>> items
dict_items([('name', 'AA'), ('shares', 75), ('price', 32.2), ('date', (6, 11, 2007))])
>>> d = dict(items)
>>> d
{'name': 'AA', 'shares': 75, 'price': 32.2, 'date': (6, 11, 2007)}
>>>

Containers

This section discusses lists, dictionaries, and sets.

Overview

Programs often have to work with many objects.

  • A portfolio of stocks
  • A table of stock prices

There are three main choices to use.

  • Lists. Ordered data.
  • Dictionaries. Key-value data looked up by key rather than by position.
  • Sets. Unordered collection of unique items.

Lists as a Container

Use a list when the order of the data matters. Remember that lists can hold any kind of object. For example, a list of tuples.

portfolio = [
    ('GOOG', 100, 490.1),
    ('IBM', 50, 91.3),
    ('CAT', 150, 83.44)
]

portfolio[0]            # ('GOOG', 100, 490.1)
portfolio[2]            # ('CAT', 150, 83.44)

List construction

Building a list from scratch.

records = []  # Initial empty list

# Use .append() to add more items
records.append(('GOOG', 100, 490.10))
records.append(('IBM', 50, 91.3))
...

An example when reading records from a file.

records = []  # Initial empty list

with open('Data/portfolio.csv', 'rt') as f:
    for line in f:
        row = line.split(',')
        records.append((row[0], int(row[1]), float(row[2])))

Dicts as a Container

Dictionaries are useful if you want fast random lookups (by key name). For example, a dictionary of stock prices:

prices = {
   'GOOG': 513.25,
   'CAT': 87.22,
   'IBM': 93.37,
   'MSFT': 44.12
}

Here are some simple lookups:

>>> prices['IBM']
93.37
>>> prices['GOOG']
513.25
>>>

Dict Construction

Example of building a dict from scratch.

prices = {} # Initial empty dict

# Insert new items
prices['GOOG'] = 513.25
prices['CAT'] = 87.22
prices['IBM'] = 93.37

An example populating the dict from the contents of a file.

prices = {} # Initial empty dict

with open('Data/prices.csv', 'rt') as f:
    for line in f:
        row = line.split(',')
        prices[row[0]] = float(row[1])

Dictionary Lookups

You can test the existence of a key.

if key in d:
    # YES
else:
    # NO

You can look up a value that might not exist and provide a default value in case it doesn’t.

name = d.get(key, default)

An example:

>>> prices.get('IBM', 0.0)
93.37
>>> prices.get('SCOX', 0.0)
0.0
>>>

Composite keys

Almost any type of value can be used as a dictionary key in Python. A dictionary key must be of a type that is immutable. For example, tuples:

holidays = {
  (1, 1) : 'New Years',
  (3, 14) : 'Pi day',
  (9, 13) : "Programmer's day",
}

Then to access:

>>> holidays[3, 14]
'Pi day'
>>>

Neither a list, a set, nor another dictionary can serve as a dictionary key, because lists, sets, and dictionaries are mutable.
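
A minimal sketch of what happens when a mutable value is used as a key, reusing the holidays example. The exact wording of the error message varies between Python versions, so the code only captures it rather than relying on it:

```python
holidays = {(1, 1): 'New Years'}          # a tuple key works

try:
    holidays[[3, 14]] = 'Pi day'          # a list key is rejected
except TypeError as e:
    error_message = str(e)
    print('TypeError:', error_message)    # the message mentions an unhashable type

# Converting the list to a tuple first makes it usable as a key
holidays[tuple([3, 14])] = 'Pi day'
print(holidays[3, 14])                    # Pi day
```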

Sets

A set is an unordered collection of unique items.

tech_stocks = { 'IBM','AAPL','MSFT' }
# Alternative syntax
tech_stocks = set(['IBM', 'AAPL', 'MSFT'])

Sets are useful for membership tests.

>>> tech_stocks
{'AAPL', 'IBM', 'MSFT'}
>>> 'IBM' in tech_stocks
True
>>> 'FB' in tech_stocks
False
>>>

Sets are also useful for duplicate elimination.

names = ['IBM', 'AAPL', 'GOOG', 'IBM', 'GOOG', 'YHOO']

unique = set(names)
# unique = {'IBM', 'AAPL', 'GOOG', 'YHOO'}

Additional set operations:

unique.add('CAT')       # Add an item
unique.remove('YHOO')   # Remove an item

s1 | s2                 # Set union
s1 & s2                 # Set intersection
s1 - s2                 # Set difference
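
To make the three operators concrete, here is a small sketch with two made-up sets of symbols. The results are passed through sorted() only because sets have no defined order when printed:

```python
s1 = {'IBM', 'AAPL', 'MSFT'}
s2 = {'MSFT', 'GOOG'}

print(sorted(s1 | s2))   # ['AAPL', 'GOOG', 'IBM', 'MSFT']
print(sorted(s1 & s2))   # ['MSFT']
print(sorted(s1 - s2))   # ['AAPL', 'IBM']
```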

Exercises

In these exercises, you start building one of the major programs used for the rest of this course. Do your work in the file Work/report.py.

Exercise 2.4: A list of tuples

The file Data/portfolio.csv contains a list of stocks in a portfolio. In Exercise 1.30, you wrote a function portfolio_cost(filename) that read this file and performed a simple calculation.

Your code should have looked something like this:

# pcost.py

import csv

def portfolio_cost(filename):
    '''Computes the total cost (shares*price) of a portfolio file'''
    total_cost = 0.0

    with open(filename, 'rt') as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            nshares = int(row[1])
            price = float(row[2])
            total_cost += nshares * price
    return total_cost

Using this code as a rough guide, create a new file report.py. In that file, define a function read_portfolio(filename) that opens a given portfolio file and reads it into a list of tuples. To do this, you’re going to make a few minor modifications to the above code.

First, instead of defining total_cost = 0, you’ll make a variable that’s initially set to an empty list. For example:

portfolio = []

Next, instead of totaling up the cost, you’ll turn each row into a tuple exactly as you just did in the last exercise and append it to this list. For example:

for row in rows:
    holding = (row[0], int(row[1]), float(row[2]))
    portfolio.append(holding)

Finally, you’ll return the resulting portfolio list.

Experiment with your function interactively (just a reminder that in order to do this, you first have to run the report.py program in the interpreter):

Hint: Use python3 -i report.py to run the file and then stay at the interactive prompt.

>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> portfolio
[('AA', 100, 32.2), ('IBM', 50, 91.1), ('CAT', 150, 83.44), ('MSFT', 200, 51.23),
    ('GE', 95, 40.37), ('MSFT', 50, 65.1), ('IBM', 100, 70.44)]
>>>
>>> portfolio[0]
('AA', 100, 32.2)
>>> portfolio[1]
('IBM', 50, 91.1)
>>> portfolio[1][1]
50
>>> total = 0.0
>>> for s in portfolio:
        total += s[1] * s[2]

>>> print(total)
44671.15
>>>

This list of tuples that you have created is very similar to a 2-D array. For example, you can access a specific column and row using a lookup such as portfolio[row][column] where row and column are integers.

That said, you can also rewrite the last for-loop using a statement like this:

>>> total = 0.0
>>> for name, shares, price in portfolio:
            total += shares*price

>>> print(total)
44671.15
>>>

Exercise 2.5: List of Dictionaries

Take the function you wrote in Exercise 2.4 and modify it to represent each stock in the portfolio with a dictionary instead of a tuple. In this dictionary use the field names of “name”, “shares”, and “price” to represent the different columns in the input file.

Experiment with this new function in the same manner as you did in Exercise 2.4.

>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> portfolio
[{'name': 'AA', 'shares': 100, 'price': 32.2}, {'name': 'IBM', 'shares': 50, 'price': 91.1},
    {'name': 'CAT', 'shares': 150, 'price': 83.44}, {'name': 'MSFT', 'shares': 200, 'price': 51.23},
    {'name': 'GE', 'shares': 95, 'price': 40.37}, {'name': 'MSFT', 'shares': 50, 'price': 65.1},
    {'name': 'IBM', 'shares': 100, 'price': 70.44}]
>>> portfolio[0]
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>> portfolio[1]
{'name': 'IBM', 'shares': 50, 'price': 91.1}
>>> portfolio[1]['shares']
50
>>> total = 0.0
>>> for s in portfolio:
        total += s['shares']*s['price']

>>> print(total)
44671.15
>>>

Here, you will notice that the different fields for each entry are accessed by key names instead of numeric column numbers. This is often preferred because the resulting code is easier to read later.

Viewing large dictionaries and lists can be messy. To clean up the output for debugging, consider using the pprint function.

>>> from pprint import pprint
>>> pprint(portfolio)
[{'name': 'AA', 'price': 32.2, 'shares': 100},
    {'name': 'IBM', 'price': 91.1, 'shares': 50},
    {'name': 'CAT', 'price': 83.44, 'shares': 150},
    {'name': 'MSFT', 'price': 51.23, 'shares': 200},
    {'name': 'GE', 'price': 40.37, 'shares': 95},
    {'name': 'MSFT', 'price': 65.1, 'shares': 50},
    {'name': 'IBM', 'price': 70.44, 'shares': 100}]
>>>

Exercise 2.6: Dictionaries as a container

A dictionary is a useful way to keep track of items where you want to look up items using an index other than an integer. In the Python shell, try playing with a dictionary:

>>> prices = { }
>>> prices['IBM'] = 92.45
>>> prices['MSFT'] = 45.12
>>> prices
... look at the result ...
>>> prices['IBM']
92.45
>>> prices['AAPL']
... look at the result ...
>>> 'AAPL' in prices
False
>>>

The file Data/prices.csv contains a series of lines with stock prices. The file looks something like this:

"AA",9.22
"AXP",24.85
"BA",44.85
"BAC",11.27
"C",3.72
...

Write a function read_prices(filename) that reads a set of prices such as this into a dictionary where the keys of the dictionary are the stock names and the values in the dictionary are the stock prices.

To do this, start with an empty dictionary and start inserting values into it just as you did above. However, you are reading the values from a file now.

We’ll use this data structure to quickly look up the price of a given stock name.

A few little tips that you’ll need for this part. First, make sure you use the csv module just as you did before—there’s no need to reinvent the wheel here.

>>> import csv
>>> f = open('Data/prices.csv', 'r')
>>> rows = csv.reader(f)
>>> for row in rows:
        print(row)


['AA', '9.22']
['AXP', '24.85']
...
[]
>>>

The other little complication is that the Data/prices.csv file may have some blank lines in it. Notice how the last row of data above is an empty list—meaning no data was present on that line.

There’s a possibility that this could cause your program to die with an exception. Use the try and except statements to catch this as appropriate. Thought: would it be better to guard against bad data with an if-statement instead?

Once you have written your read_prices() function, test it interactively to make sure it works:

>>> prices = read_prices('Data/prices.csv')
>>> prices['IBM']
106.28
>>> prices['MSFT']
20.89
>>>

Exercise 2.7: Finding out if you can retire

Tie all of this work together by adding a few additional statements to your report.py program that compute gain/loss. These statements should take the list of stocks in Exercise 2.5 and the dictionary of prices in Exercise 2.6 and compute the current value of the portfolio along with the gain/loss.

Formatting

This section is a slight digression, but when you work with data, you often want to produce structured output (tables, etc.). For example:

      Name      Shares        Price
----------  ----------  -----------
        AA         100        32.20
       IBM          50        91.10
       CAT         150        83.44
      MSFT         200        51.23
        GE          95        40.37
      MSFT          50        65.10
       IBM         100        70.44

String Formatting

One way to format strings in Python 3.6+ is with f-strings.

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> f'{name:>10s} {shares:>10d} {price:>10.2f}'
'       IBM        100      91.10'
>>>

The part {expression:format} is replaced.

It is commonly used with print.

print(f'{name:>10s} {shares:>10d} {price:>10.2f}')

Format codes

Format codes (after the : inside the {}) are similar to C printf(). Common codes include:

d       Decimal integer
b       Binary integer
x       Hexadecimal integer
f       Float as [-]m.dddddd
e       Float as [-]m.dddddde+-xx
g       Float, but selective use of E notation
s       String
c       Character (from integer)

Common modifiers adjust the field width and decimal precision. This is a partial list:

:>10d   Integer right aligned in 10-character field
:<10d   Integer left aligned in 10-character field
:^10d   Integer centered in 10-character field
:0.2f   Float with 2 digit precision
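
The modifiers are easier to compare side by side. In this short sketch the square brackets are added only to make the field boundaries visible:

```python
shares = 100
price = 91.1

print(f'[{shares:>10d}]')    # [       100]
print(f'[{shares:<10d}]')    # [100       ]
print(f'[{shares:^10d}]')    # [   100    ]
print(f'[{price:0.2f}]')     # [91.10]
print(f'[{price:>10.2f}]')   # [     91.10]
```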

Dictionary Formatting

You can use the format_map() method to apply string formatting to a dictionary of values:

>>> s = {
    'name': 'IBM',
    'shares': 100,
    'price': 91.1
}
>>> '{name:>10s} {shares:10d} {price:10.2f}'.format_map(s)
'       IBM        100      91.10'
>>>

It uses the same codes as f-strings but takes the values from the supplied dictionary.

format() method

There is a method format() that can apply formatting to arguments or keyword arguments.

>>> '{name:>10s} {shares:10d} {price:10.2f}'.format(name='IBM', shares=100, price=91.1)
'       IBM        100      91.10'
>>> '{:>10s} {:10d} {:10.2f}'.format('IBM', 100, 91.1)
'       IBM        100      91.10'
>>>

Frankly, format() is a bit verbose. I prefer f-strings.

C-Style Formatting

You can also use the formatting operator %.

>>> 'The value is %d' % 3
'The value is 3'
>>> '%5d %-5d %10d' % (3,4,5)
'    3 4              5'
>>> '%0.2f' % (3.1415926,)
'3.14'

This requires a single item or a tuple on the right. Format codes are modeled after the C printf() as well.

Note: This is the only formatting available on byte strings.

>>> b'%s has %d messages' % (b'Dave', 37)
b'Dave has 37 messages'
>>>

Exercises

Exercise 2.8: How to format numbers

A common problem with printing numbers is specifying the number of decimal places. One way to fix this is to use f-strings. Try these examples:

>>> value = 42863.1
>>> print(value)
42863.1
>>> print(f'{value:0.4f}')
42863.1000
>>> print(f'{value:>16.2f}')
        42863.10
>>> print(f'{value:<16.2f}')
42863.10
>>> print(f'{value:*>16,.2f}')
*******42,863.10
>>>

Full documentation on the formatting codes used with f-strings can be found in the “Format Specification Mini-Language” section of the Python documentation. Formatting is also sometimes performed using the % operator of strings.

>>> print('%0.4f' % value)
42863.1000
>>> print('%16.2f' % value)
        42863.10
>>>

Documentation on the various codes used with % can be found in the “printf-style String Formatting” section of the Python documentation.

Although it’s commonly used with print, string formatting is not tied to printing. If you want to save a formatted string, just assign it to a variable.

>>> f = '%0.4f' % value
>>> f
'42863.1000'
>>>

Exercise 2.9: Collecting Data

In Exercise 2.7, you wrote a program called report.py that computed the gain/loss of a stock portfolio. In this exercise, you’re going to start modifying it to produce a table like this:

      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
        GE         95      13.48     -26.89
      MSFT         50      20.89     -44.21
       IBM        100     106.28      35.84

In this report, “Price” is the current share price of the stock and “Change” is the change in the share price from the initial purchase price.

In order to generate the above report, you’ll first want to collect all of the data shown in the table. Write a function make_report() that takes a list of stocks and dictionary of prices as input and returns a list of tuples containing the rows of the above table.

Add this function to your report.py file. Here’s how it should work if you try it interactively:

>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> prices = read_prices('Data/prices.csv')
>>> report = make_report(portfolio, prices)
>>> for r in report:
        print(r)

('AA', 100, 9.22, -22.980000000000004)
('IBM', 50, 106.28, 15.180000000000007)
('CAT', 150, 35.46, -47.98)
('MSFT', 200, 20.89, -30.339999999999996)
('GE', 95, 13.48, -26.889999999999997)
...
>>>

Exercise 2.10: Printing a formatted table

Redo the for-loop in Exercise 2.9, but change the print statement to format the tuples.

>>> for r in report:
        print('%10s %10d %10.2f %10.2f' % r)

        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
...
>>>

You can also expand the values and use f-strings. For example:

>>> for name, shares, price, change in report:
        print(f'{name:>10s} {shares:>10d} {price:>10.2f} {change:>10.2f}')

        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
...
>>>

Take the above statements and add them to your report.py program. Have your program take the output of the make_report() function and print a nicely formatted table as shown.

Exercise 2.11: Adding some headers

Suppose you had a tuple of header names like this:

headers = ('Name', 'Shares', 'Price', 'Change')

Add code to your program that takes the above tuple of headers and creates a string where each header name is right-aligned in a 10-character wide field and each field is separated by a single space.

'      Name     Shares      Price     Change'

Write code that takes the headers and creates the separator string between the headers and data to follow. This string is just a bunch of “-” characters under each field name. For example:

'---------- ---------- ---------- ----------'

When you’re done, your program should produce the table shown at the top of Exercise 2.9.

      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100       9.22     -22.98
       IBM         50     106.28      15.18
       CAT        150      35.46     -47.98
      MSFT        200      20.89     -30.34
        GE         95      13.48     -26.89
      MSFT         50      20.89     -44.21
       IBM        100     106.28      35.84

Exercise 2.12: Formatting Challenge

How would you modify your code so that the price includes the currency symbol ($) and the output looks like this:

      Name     Shares      Price     Change
---------- ---------- ---------- ----------
        AA        100      $9.22     -22.98
       IBM         50    $106.28      15.18
       CAT        150     $35.46     -47.98
      MSFT        200     $20.89     -30.34
        GE         95     $13.48     -26.89
      MSFT         50     $20.89     -44.21
       IBM        100    $106.28      35.84

Sequences

Sequence Datatypes

Python has three sequence datatypes.

  • String: 'Hello'. A string is a sequence of characters.
  • List: [1, 4, 5].
  • Tuple: ('GOOG', 100, 490.1).

All sequences are ordered, indexed by integers, and have a length.

a = 'Hello'               # String
b = [1, 4, 5]             # List
c = ('GOOG', 100, 490.1)  # Tuple

# Indexed order
a[0]                      # 'H'
b[-1]                     # 5
c[1]                      # 100

# Length of sequence
len(a)                    # 5
len(b)                    # 3
len(c)                    # 3

Sequences can be replicated: s * n.

>>> a = 'Hello'
>>> a * 3
'HelloHelloHello'
>>> b = [1, 2, 3]
>>> b * 2
[1, 2, 3, 1, 2, 3]
>>>

Sequences of the same type can be concatenated: s + t.

>>> a = (1, 2, 3)
>>> b = (4, 5)
>>> a + b
(1, 2, 3, 4, 5)
>>>
>>> c = [1, 5]
>>> a + c
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate tuple (not "list") to tuple

Slicing

Slicing means to take a subsequence from a sequence. The syntax is s[start:end], where start and end are the indexes of the subsequence you want.

a = [0,1,2,3,4,5,6,7,8]

a[2:5]    # [2,3,4]
a[-5:]    # [4,5,6,7,8]
a[:3]     # [0,1,2]

  • Indices start and end must be integers.
  • Slices do not include the end value. It is like a half-open interval from math.
  • If indices are omitted, they default to the beginning or end of the list.

Slice re-assignment

On lists, slices can be reassigned and deleted.

# Reassignment
a = [0,1,2,3,4,5,6,7,8]
a[2:4] = [10,11,12]       # [0,1,10,11,12,4,5,6,7,8]

Note: The reassigned slice doesn’t need to have the same length.

# Deletion
a = [0,1,2,3,4,5,6,7,8]
del a[2:4]                # [0,1,4,5,6,7,8]

Sequence Reductions

There are some common functions to reduce a sequence to a single value.

>>> s = [1, 2, 3, 4]
>>> sum(s)
10
>>> min(s)
1
>>> max(s)
4
>>> t = ['Hello', 'World']
>>> max(t)
'World'
>>>

Iteration over a sequence

The for-loop iterates over the elements in a sequence.

>>> s = [1, 4, 9, 16]
>>> for i in s:
...     print(i)
...
1
4
9
16
>>>

On each iteration of the loop, you get a new item to work with. This new value is placed into the iteration variable. In the following example, the iteration variable is x:

for x in s:         # `x` is an iteration variable
    ...statements

On each iteration, the previous value of the iteration variable is overwritten (if any). After the loop finishes, the variable retains the last value.

break statement

You can use the break statement to break out of a loop early.

for name in namelist:
    if name == 'Jake':
        break
    ...
    ...
statements

When the break statement executes, it exits the loop and moves on to the next statements. The break statement only applies to the inner-most loop. If this loop is within another loop, it will not break the outer loop.
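
The inner-most rule is easiest to see with nested loops. In this sketch the groups list is made-up data; break ends the scan of the current group, while the outer loop carries on with the next one:

```python
groups = [['Elwood', 'Jake', 'Curtis'], ['Ray', 'Jake']]   # made-up sample data

found = []
for group in groups:
    for name in group:
        if name == 'Jake':
            break              # leaves only the inner loop
        found.append(name)
    # execution resumes here, and the outer loop moves to the next group

print(found)    # ['Elwood', 'Ray']
```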

continue statement

To skip one element and move to the next one, use the continue statement.

for line in lines:
    if line == '\n':    # Skip blank lines
        continue
    # More statements
    ...

This is useful when the current item is not of interest or needs to be ignored in the processing.

Looping over integers

If you need to count, use range().

for i in range(100):
    # i = 0,1,...,99

The syntax is range([start,] end [,step])

for i in range(100):
    # i = 0,1,...,99
for j in range(10,20):
    # j = 10,11,..., 19
for k in range(10,50,2):
    # k = 10,12,...,48
    # Notice how it counts in steps of 2, not 1.

  • The ending value is never included. It mirrors the behavior of slices.
  • start is optional. Default 0.
  • step is optional. Default 1.
  • range() computes values as needed. It does not actually store a large range of numbers.
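
A brief sketch of the last point: a range object knows its start, end, and step, and answers questions about itself without building a list of its values.

```python
r = range(0, 1_000_000_000, 2)   # no list of half a billion numbers is built
print(len(r))                    # 500000000
print(r[10])                     # 20
print(999_999_998 in r)          # True

# Convert to a list only when you really need the values stored
print(list(range(10, 20, 3)))    # [10, 13, 16, 19]
```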

enumerate() function

The enumerate function adds an extra counter value to iteration.

names = ['Elwood', 'Jake', 'Curtis']
for i, name in enumerate(names):
    # Loops with i = 0, name = 'Elwood'
    # i = 1, name = 'Jake'
    # i = 2, name = 'Curtis'

The general form is enumerate(sequence [, start = 0]). start is optional. A good example of using enumerate() is tracking line numbers while reading a file:

with open(filename) as f:
    for lineno, line in enumerate(f, start=1):
        ...

In the end, enumerate is just a nice shortcut for:

i = 0
for x in s:
    statements
    i += 1

Using enumerate is less typing and runs slightly faster.

For and tuples

You can iterate with multiple iteration variables.

points = [
  (1, 4),(10, 40),(23, 14),(5, 6),(7, 8)
]
for x, y in points:
    # Loops with x = 1, y = 4
    #            x = 10, y = 40
    #            x = 23, y = 14
    #            ...

When using multiple variables, each tuple is unpacked into a set of iteration variables. The number of variables must match the number of items in each tuple.

zip() function

The zip function takes multiple sequences and makes an iterator that combines them.

columns = ['name', 'shares', 'price']
values = ['GOOG', 100, 490.1 ]
pairs = zip(columns, values)
# ('name','GOOG'), ('shares',100), ('price',490.1)

To get the result you must iterate. You can use multiple variables to unpack the tuples as shown earlier.

for column, value in pairs:
    ...

A common use of zip is to create key/value pairs for constructing dictionaries.

d = dict(zip(columns, values))
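
Two details of zip() are worth seeing in action: the iterator it returns can only be consumed once, and it stops as soon as the shortest input runs out. A short sketch using the same columns and values:

```python
columns = ['name', 'shares', 'price']
values = ['GOOG', 100, 490.1]

pairs = zip(columns, values)
first_pass = list(pairs)     # [('name', 'GOOG'), ('shares', 100), ('price', 490.1)]
second_pass = list(pairs)    # [] because the iterator is already used up

d = dict(zip(columns, values))
print(d)                     # {'name': 'GOOG', 'shares': 100, 'price': 490.1}

# zip stops at the shortest input, so 'price' is silently dropped here
short = list(zip(columns, ['GOOG', 100]))
print(short)                 # [('name', 'GOOG'), ('shares', 100)]
```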

Exercises

Exercise 2.13: Counting

Try some basic counting examples:

>>> for n in range(10):            # Count 0 ... 9
        print(n, end=' ')

0 1 2 3 4 5 6 7 8 9
>>> for n in range(10,0,-1):       # Count 10 ... 1
        print(n, end=' ')

10 9 8 7 6 5 4 3 2 1
>>> for n in range(0,10,2):        # Count 0, 2, ... 8
        print(n, end=' ')

0 2 4 6 8
>>>

Exercise 2.14: More sequence operations

Interactively experiment with some of the sequence reduction operations.

>>> data = [4, 9, 1, 25, 16, 100, 49]
>>> min(data)
1
>>> max(data)
100
>>> sum(data)
204
>>>

Try looping over the data.

>>> for x in data:
        print(x)

4
9
...
>>> for n, x in enumerate(data):
        print(n, x)

0 4
1 9
2 1
...
>>>

Sometimes the for statement, len(), and range() get used by novices in some kind of horrible code fragment that looks like it emerged from the depths of a rusty C program.

>>> for n in range(len(data)):
        print(data[n])

4
9
1
...
>>>

Don’t do that! Not only does reading it make everyone’s eyes bleed, it also runs slower because every item requires an extra index lookup. Just use a normal for loop if you want to iterate over data. Use enumerate() if you happen to need the index for some reason.

Exercise 2.15: A practical enumerate() example

Recall that the file Data/missing.csv contains data for a stock portfolio, but has some rows with missing data. Using enumerate(), modify your pcost.py program so that it prints a line number with the warning message when it encounters bad input.

>>> cost = portfolio_cost('Data/missing.csv')
Row 4: Couldn't convert: ['MSFT', '', '51.23']
Row 7: Couldn't convert: ['IBM', '', '70.44']
>>>

To do this, you’ll need to change a few parts of your code.

...
for rowno, row in enumerate(rows, start=1):
    try:
        ...
    except ValueError:
        print(f"Row {rowno}: Couldn't convert: {row}")

Exercise 2.16: Using the zip() function

In the file Data/portfolio.csv, the first line contains column headers. In all previous code, we’ve been discarding them.

>>> f = open('Data/portfolio.csv')
>>> rows = csv.reader(f)
>>> headers = next(rows)
>>> headers
['name', 'shares', 'price']
>>>

However, what if you could use the headers for something useful? This is where the zip() function enters the picture. First try this to pair the file headers with a row of data:

>>> row = next(rows)
>>> row
['AA', '100', '32.20']
>>> list(zip(headers, row))
[ ('name', 'AA'), ('shares', '100'), ('price', '32.20') ]
>>>

Notice how zip() paired the column headers with the column values. We’ve used list() here to turn the result into a list so that you can see it. Normally, zip() creates an iterator that must be consumed by a for-loop.

This pairing is an intermediate step to building a dictionary. Now try this:

>>> record = dict(zip(headers, row))
>>> record
{'name': 'AA', 'shares': '100', 'price': '32.20'}
>>>

This transformation is one of the most useful tricks to know about when processing a lot of data files. For example, suppose you wanted to make the pcost.py program work with various input files, but without regard for the actual column number where the name, shares, and price appear.

Modify the portfolio_cost() function in pcost.py so that it looks like this:

# pcost.py

def portfolio_cost(filename):
    ...
        for rowno, row in enumerate(rows, start=1):
            record = dict(zip(headers, row))
            try:
                nshares = int(record['shares'])
                price = float(record['price'])
                total_cost += nshares * price
            # This catches errors in int() and float() conversions above
            except ValueError:
                print(f"Row {rowno}: Couldn't convert: {row}")
        ...

Now, try your function on a completely different data file Data/portfoliodate.csv which looks like this:

name,date,time,shares,price
"AA","6/11/2007","9:50am",100,32.20
"IBM","5/13/2007","4:20pm",50,91.10
"CAT","9/23/2006","1:30pm",150,83.44
"MSFT","5/17/2007","10:30am",200,51.23
"GE","2/1/2006","10:45am",95,40.37
"MSFT","10/31/2006","12:05pm",50,65.10
"IBM","7/9/2006","3:15pm",100,70.44

>>> portfolio_cost('Data/portfoliodate.csv')
44671.15
>>>

If you did it right, you’ll find that your program still works even though the data file has a completely different column format than before. That’s cool!

The change made here is subtle, but significant. Instead of portfolio_cost() being hardcoded to read a single fixed file format, the new version reads any CSV file and picks the values of interest out of it. As long as the file has the required columns, the code will work.

Modify the report.py program you wrote in Section 2.3 so that it uses the same technique to pick out column headers.

Try running the report.py program on the Data/portfoliodate.csv file and see that it produces the same answer as before.

Exercise 2.17: Inverting a dictionary

A dictionary maps keys to values. For example, a dictionary of stock prices.

>>> prices = {
        'GOOG' : 490.1,
        'AA' : 23.45,
        'IBM' : 91.1,
        'MSFT' : 34.23
    }
>>>

If you use the items() method, you can get (key,value) pairs:

>>> prices.items()
dict_items([('GOOG', 490.1), ('AA', 23.45), ('IBM', 91.1), ('MSFT', 34.23)])
>>>

However, what if you wanted to get a list of (value, key) pairs instead? Hint: use zip().

>>> pricelist = list(zip(prices.values(),prices.keys()))
>>> pricelist
[(490.1, 'GOOG'), (23.45, 'AA'), (91.1, 'IBM'), (34.23, 'MSFT')]
>>>

Why would you do this? For one, it allows you to perform certain kinds of data processing on the dictionary data.

>>> min(pricelist)
(23.45, 'AA')
>>> max(pricelist)
(490.1, 'GOOG')
>>> sorted(pricelist)
[(23.45, 'AA'), (34.23, 'MSFT'), (91.1, 'IBM'), (490.1, 'GOOG')]
>>>

This also illustrates an important feature of tuples. When used in comparisons, tuples are compared element-by-element starting with the first item. This is similar to how strings are compared character-by-character.
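A brief sketch of that comparison rule, using made-up values where the first items tie:

```python
# The first items differ, so they decide the result.
first_differs = (23.45, 'AA') < (490.1, 'GOOG')
print(first_differs)      # True

# The first items tie, so the comparison moves on to the second item.
tie_broken = (91.1, 'IBM') < (91.1, 'MSFT')
print(tie_broken)         # True, because 'IBM' < 'MSFT'

# Strings behave the same way, one character at a time.
string_compare = 'IBM' < 'IBX'
print(string_compare)     # True, decided by 'M' < 'X'
```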

zip() is often used in situations like this where you need to pair up data from different places. For example, pairing up the column names with column values in order to make a dictionary of named values.

Note that zip() is not limited to pairs. For example, you can use it with any number of input lists:

>>> a = [1, 2, 3, 4]
>>> b = ['w', 'x', 'y', 'z']
>>> c = [0.2, 0.4, 0.6, 0.8]
>>> list(zip(a, b, c))
[(1, 'w', 0.2), (2, 'x', 0.4), (3, 'y', 0.6), (4, 'z', 0.8)]
>>>

Also, be aware that zip() stops once the shortest input sequence is exhausted.

>>> a = [1, 2, 3, 4, 5, 6]
>>> b = ['x', 'y', 'z']
>>> list(zip(a,b))
[(1, 'x'), (2, 'y'), (3, 'z')]
>>>
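If dropping the extra items is not what you want, the standard library's itertools.zip_longest() is one alternative. This is a sketch of that option rather than something the exercises require:

```python
from itertools import zip_longest

a = [1, 2, 3, 4, 5, 6]
b = ['x', 'y', 'z']

# Missing positions are filled with fillvalue instead of being dropped.
padded = list(zip_longest(a, b, fillvalue=None))
print(padded)
# [(1, 'x'), (2, 'y'), (3, 'z'), (4, None), (5, None), (6, None)]
```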

collections module

The collections module provides a number of useful objects for data handling. This part briefly introduces some of these features.

Example: Counting Things

Let’s say you want to tabulate the total shares of each stock.

portfolio = [
    ('GOOG', 100, 490.1),
    ('IBM', 50, 91.1),
    ('CAT', 150, 83.44),
    ('IBM', 100, 45.23),
    ('GOOG', 75, 572.45),
    ('AA', 50, 23.15)
]

There are two IBM entries and two GOOG entries in this list. The shares need to be combined together somehow.

Counters

Solution: Use a Counter.

from collections import Counter
total_shares = Counter()
for name, shares, price in portfolio:
    total_shares[name] += shares

total_shares['IBM']     # 150
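To see what the Counter is saving you, here is a sketch that does the same tabulation with a plain dictionary. The total_plain name and the 'XYZ' key are made up for illustration.

```python
from collections import Counter

portfolio = [
    ('GOOG', 100, 490.1),
    ('IBM', 50, 91.1),
    ('CAT', 150, 83.44),
    ('IBM', 100, 45.23),
    ('GOOG', 75, 572.45),
    ('AA', 50, 23.15)
]

# Plain dictionary: the missing-key case has to be handled by hand.
total_plain = {}
for name, shares, price in portfolio:
    if name not in total_plain:
        total_plain[name] = 0
    total_plain[name] += shares

# Counter: a missing key is treated as zero automatically.
total_shares = Counter()
for name, shares, price in portfolio:
    total_shares[name] += shares

print(total_plain == dict(total_shares))   # True
print(total_shares['XYZ'])                 # 0, and 'XYZ' is not added
```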

Example: One-Many Mappings

Problem: You want to map a key to multiple values.

portfolio = [
    ('GOOG', 100, 490.1),
    ('IBM', 50, 91.1),
    ('CAT', 150, 83.44),
    ('IBM', 100, 45.23),
    ('GOOG', 75, 572.45),
    ('AA', 50, 23.15)
]

As in the previous example, IBM and GOOG each appear twice. This time, a key such as IBM should map to both of its tuples rather than to a single combined total.

Solution: Use a defaultdict.

from collections import defaultdict
holdings = defaultdict(list)
for name, shares, price in portfolio:
    holdings[name].append((shares, price))
holdings['IBM'] # [ (50, 91.1), (100, 45.23) ]

The defaultdict ensures that every time you access a missing key, you get a default value (here, a new empty list).
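The sketch below walks through that behavior step by step. One detail to be aware of is that merely looking up a missing key also creates it. The plain dictionary at the end is an illustrative alternative using setdefault().

```python
from collections import defaultdict

holdings = defaultdict(list)
holdings['IBM'].append((50, 91.1))     # 'IBM' is missing, so list() is called first
holdings['IBM'].append((100, 45.23))   # now the existing list is reused
print(holdings['IBM'])                 # [(50, 91.1), (100, 45.23)]

# Looking up a missing key inserts the default value.
before = 'CAT' in holdings             # False
holdings['CAT']
after = 'CAT' in holdings              # True
print(before, after, holdings['CAT'])  # False True []

# A plain dict can get a similar effect with setdefault().
plain = {}
plain.setdefault('IBM', []).append((50, 91.1))
print(plain)                           # {'IBM': [(50, 91.1)]}
```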

Example: Keeping a History

Problem: We want a history of the last N things. Solution: Use a deque.

from collections import deque

history = deque(maxlen=N)
with open(filename) as f:
    for line in f:
        history.append(line)
        ...
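A self-contained sketch of the same idea, with N set to 3 and a list of made-up strings standing in for the lines of a file:

```python
from collections import deque

N = 3                        # assumed history size for this sketch
history = deque(maxlen=N)

for line in ['line 1', 'line 2', 'line 3', 'line 4', 'line 5']:
    history.append(line)     # once full, the oldest item is discarded

print(list(history))         # ['line 3', 'line 4', 'line 5']
```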

Exercises

The collections module might be one of the most useful library modules for dealing with special purpose kinds of data handling problems such as tabulating and indexing.

In this exercise, we’ll look at a few simple examples. Start by running your report.py program so that you have the portfolio of stocks loaded in the interactive mode.

bash % python3 -i report.py

Exercise 2.18: Tabulating with Counters

Suppose you wanted to tabulate the total number of shares of each stock. This is easy using Counter objects. Try it:

>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> from collections import Counter
>>> holdings = Counter()
>>> for s in portfolio:
        holdings[s['name']] += s['shares']

>>> holdings
Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})
>>>

Carefully observe how the multiple entries for MSFT and IBM in portfolio get combined into a single entry here.

You can use a Counter just like a dictionary to retrieve individual values:

>>> holdings['IBM']
150
>>> holdings['MSFT']
250
>>>

If you want to rank the values, do this:

>>> # Get three most held stocks
>>> holdings.most_common(3)
[('MSFT', 250), ('IBM', 150), ('CAT', 150)]
>>>

Let’s grab another portfolio of stocks and make a new Counter:

>>> portfolio2 = read_portfolio('Data/portfolio2.csv')
>>> holdings2 = Counter()
>>> for s in portfolio2:
          holdings2[s['name']] += s['shares']

>>> holdings2
Counter({'HPQ': 250, 'GE': 125, 'AA': 50, 'MSFT': 25})
>>>

Finally, let’s combine all of the holdings doing one simple operation:

>>> holdings
Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})
>>> holdings2
Counter({'HPQ': 250, 'GE': 125, 'AA': 50, 'MSFT': 25})
>>> combined = holdings + holdings2
>>> combined
Counter({'MSFT': 275, 'HPQ': 250, 'GE': 220, 'AA': 150, 'IBM': 150, 'CAT': 150})
>>>

This is only a small taste of what counters provide. However, if you ever find yourself needing to tabulate values, you should consider using one.

Commentary: collections module

The collections module is one of the most useful library modules in all of Python. In fact, we could do an extended tutorial on just that. However, doing so now would also be a distraction. For now, put collections on your list of bedtime reading for later.

List Comprehensions

A common task is processing items in a list. This section introduces list comprehensions, a powerful tool for doing just that.

Creating new lists

A list comprehension creates a new list by applying an operation to each element of a sequence.

>>> a = [1, 2, 3, 4, 5]
>>> b = [2*x for x in a ]
>>> b
[2, 4, 6, 8, 10]
>>>

Another example:

>>> names = ['Elwood', 'Jake']
>>> a = [name.lower() for name in names]
>>> a
['elwood', 'jake']
>>>

The general syntax is: [ <expression> for <variable_name> in <sequence> ].

Filtering

You can also filter during the list comprehension.

>>> a = [1, -5, 4, 2, -2, 10]
>>> b = [2*x for x in a if x > 0 ]
>>> b
[2, 8, 4, 20]
>>>

Use cases

List comprehensions are hugely useful. For example, you can collect the values of a specific dictionary field:

stocknames = [s['name'] for s in stocks]

You can perform database-like queries on sequences.

a = [s for s in stocks if s['price'] > 100 and s['shares'] > 50 ]

You can also combine a list comprehension with a sequence reduction:

cost = sum([s['shares']*s['price'] for s in stocks])

General Syntax

[ <expression> for <variable_name> in <sequence> if <condition>]

What it means:

result = []
for variable_name in sequence:
    if condition:
        result.append(expression)
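As a concrete check of that equivalence, here is the filtering example from earlier written both ways. The names doubled and result are illustrative.

```python
a = [1, -5, 4, 2, -2, 10]

# Comprehension form
doubled = [2 * x for x in a if x > 0]

# Equivalent explicit loop
result = []
for x in a:
    if x > 0:
        result.append(2 * x)

print(doubled)             # [2, 8, 4, 20]
print(doubled == result)   # True
```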

Historical Digression

List comprehensions come from math (set-builder notation).

a = [ x * x for x in s if x > 0 ] # Python

a = { x^2 | x ∈ s, x > 0 }         # Math

It is also implemented in several other languages. Most coders probably aren’t thinking about their math class though. So, it’s fine to view it as a cool list shortcut.

Exercises

Start by running your report.py program so that you have the portfolio of stocks loaded in the interactive mode.

bash % python3 -i report.py

Now, at the Python interactive prompt, type statements to perform the operations described below. These operations perform various kinds of data reductions, transforms, and queries on the portfolio data.

Exercise 2.19: List comprehensions

Try a few simple list comprehensions just to become familiar with the syntax.

>>> nums = [1,2,3,4]
>>> squares = [ x * x for x in nums ]
>>> squares
[1, 4, 9, 16]
>>> twice = [ 2 * x for x in nums if x > 2 ]
>>> twice
[6, 8]
>>>

Notice how the list comprehensions are creating a new list with the data suitably transformed or filtered.

Exercise 2.20: Sequence Reductions

Compute the total cost of the portfolio using a single Python statement.

>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> cost = sum([ s['shares'] * s['price'] for s in portfolio ])
>>> cost
44671.15
>>>

After you have done that, show how you can compute the current value of the portfolio using a single statement.

>>> value = sum([ s['shares'] * prices[s['name']] for s in portfolio ])
>>> value
28686.1
>>>

Both of the above operations are an example of a map-reduction. The list comprehension is mapping an operation across the list.

>>> [ s['shares'] * s['price'] for s in portfolio ]
[3220.0000000000005, 4555.0, 12516.0, 10246.0, 3835.1499999999996, 3254.9999999999995, 7044.0]
>>>

The sum() function is then performing a reduction across the result:

>>> sum(_)
44671.15
>>>

With this knowledge, you are now ready to go launch a big-data startup company.

Exercise 2.21: Data Queries

Try the following examples of various data queries.

First, a list of all portfolio holdings with more than 100 shares.

>>> more100 = [ s for s in portfolio if s['shares'] > 100 ]
>>> more100
[{'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}]
>>>

All portfolio holdings for MSFT and IBM stocks.

>>> msftibm = [ s for s in portfolio if s['name'] in {'MSFT','IBM'} ]
>>> msftibm
[{'price': 91.1, 'name': 'IBM', 'shares': 50}, {'price': 51.23, 'name': 'MSFT', 'shares': 200},
  {'price': 65.1, 'name': 'MSFT', 'shares': 50}, {'price': 70.44, 'name': 'IBM', 'shares': 100}]
>>>

A list of all portfolio holdings that cost more than $10000.

>>> cost10k = [ s for s in portfolio if s['shares'] * s['price'] > 10000 ]
>>> cost10k
[{'price': 83.44, 'name': 'CAT', 'shares': 150}, {'price': 51.23, 'name': 'MSFT', 'shares': 200}]
>>>

Exercise 2.22: Data Extraction

Show how you could build a list of tuples (name, shares) where name and shares are taken from portfolio.

>>> name_shares =[ (s['name'], s['shares']) for s in portfolio ]
>>> name_shares
[('AA', 100), ('IBM', 50), ('CAT', 150), ('MSFT', 200), ('GE', 95), ('MSFT', 50), ('IBM', 100)]
>>>

If you change the square brackets ([,]) to curly braces ({, }), you get something known as a set comprehension. This gives you unique or distinct values.

For example, this determines the set of unique stock names that appear in portfolio:

>>> names = { s['name'] for s in portfolio }
>>> names
{'AA', 'GE', 'IBM', 'MSFT', 'CAT'}
>>>

If you specify key:value pairs, you can build a dictionary. For example, make a dictionary that maps the name of a stock to the total number of shares held.

>>> holdings = { name: 0 for name in names }
>>> holdings
{'AA': 0, 'GE': 0, 'IBM': 0, 'MSFT': 0, 'CAT': 0}
>>>

This latter feature is known as a dictionary comprehension. Let’s tabulate:

>>> for s in portfolio:
        holdings[s['name']] += s['shares']

>>> holdings
{'AA': 100, 'GE': 95, 'IBM': 150, 'MSFT': 250, 'CAT': 150}
>>>

Try this example that filters the prices dictionary down to only those names that appear in the portfolio:

>>> portfolio_prices = { name: prices[name] for name in names }
>>> portfolio_prices
{'AA': 9.22, 'GE': 13.48, 'IBM': 106.28, 'MSFT': 20.89, 'CAT': 35.46}
>>>

Exercise 2.23: Extracting Data From CSV Files

Knowing how to use various combinations of list, set, and dictionary comprehensions can be useful in various forms of data processing. Here’s an example that shows how to extract selected columns from a CSV file.

First, read a row of header information from a CSV file:

>>> import csv
>>> f = open('Data/portfoliodate.csv')
>>> rows = csv.reader(f)
>>> headers = next(rows)
>>> headers
['name', 'date', 'time', 'shares', 'price']
>>>

Next, define a variable that lists the columns that you actually care about:

>>> select = ['name', 'shares', 'price']
>>>

Now, locate the indices of the above columns in the source CSV file:

>>> indices = [ headers.index(colname) for colname in select ]
>>> indices
[0, 3, 4]
>>>

Finally, read a row of data and turn it into a dictionary using a dictionary comprehension:

>>> row = next(rows)
>>> record = { colname: row[index] for colname, index in zip(select, indices) }   # dict-comprehension
>>> record
{'name': 'AA', 'shares': '100', 'price': '32.20'}
>>>

If you’re feeling comfortable with what just happened, read the rest of the file:

>>> portfolio = [ { colname: row[index] for colname, index in zip(select, indices) } for row in rows ]
>>> portfolio
[{'name': 'IBM', 'shares': '50', 'price': '91.10'}, {'name': 'CAT', 'shares': '150', 'price': '83.44'},
  {'name': 'MSFT', 'shares': '200', 'price': '51.23'}, {'name': 'GE', 'shares': '95', 'price': '40.37'},
  {'name': 'MSFT', 'shares': '50', 'price': '65.10'}, {'name': 'IBM', 'shares': '100', 'price': '70.44'}]
>>>

Oh my, you just reduced much of the read_portfolio() function to a single statement.

Commentary

List comprehensions are commonly used in Python as an efficient means for transforming, filtering, or collecting data. Due to the syntax, you don’t want to go overboard—try to keep each list comprehension as simple as possible. It’s okay to break things into multiple steps. For example, it’s not clear that you would want to spring that last example on your unsuspecting co-workers.

That said, knowing how to quickly manipulate data is a skill that’s incredibly useful. There are numerous situations where you might have to solve some kind of one-off problem involving data imports, exports, extraction, and so forth. Becoming a guru master of list comprehensions can substantially reduce the time spent devising a solution. Also, don’t forget about the collections module.

Objects

This section introduces more details about Python’s internal object model and discusses some matters related to memory management, copying, and type checking.

Assignment

Many operations in Python are related to assigning or storing values.

a = value         # Assignment to a variable
s[n] = value      # Assignment to a list
s.append(value)   # Appending to a list
d['key'] = value  # Adding to a dictionary

A caution: assignment operations never make a copy of the value being assigned. All assignments are merely reference copies (or pointer copies if you prefer).

Assignment example

Consider this code fragment.

a = [1,2,3]
b = a
c = [a,b]

Here is a picture of the underlying memory operations. In this example, there is only one list object [1,2,3], but there are four different references to it: a, b, c[0], and c[1].

[Figure: References]

This means that modifying a value affects all references.

>>> a.append(999)
>>> a
[1, 2, 3, 999]
>>> b
[1, 2, 3, 999]
>>> c
[[1, 2, 3, 999], [1, 2, 3, 999]]
>>>

Notice how a change in the original list shows up everywhere else (yikes!). This is because no copies were ever made. Everything is pointing to the same thing.

Reassigning values

Reassigning a value never overwrites the memory used by the previous value.

a = [1,2,3]
b = a
a = [4,5,6]

print(a)      # [4, 5, 6]
print(b)      # [1, 2, 3]    Holds the original value

Remember: Variables are names, not memory locations.

Some Dangers

If you don’t know about this sharing, you will shoot yourself in the foot at some point. The typical scenario: you modify some data thinking that it’s your own private copy, and it accidentally corrupts data in some other part of the program.

Comment: This is one of the reasons why the primitive datatypes (int, float, string) are immutable (read-only).
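Here is a minimal sketch of that foot-shooting scenario. Both functions and the 'BONUS' entry are hypothetical, invented only to show the effect.

```python
def add_bonus_shares(holdings):
    # Mutates the list it was given, which the caller still refers to.
    holdings.append(('BONUS', 10))
    return holdings

def add_bonus_shares_safely(holdings):
    # Works on a new list, so the caller's list is left alone.
    updated = list(holdings)
    updated.append(('BONUS', 10))
    return updated

mine = [('GOOG', 100)]
result = add_bonus_shares(mine)
print(mine)             # [('GOOG', 100), ('BONUS', 10)]  the caller's list changed
print(result is mine)   # True

mine2 = [('GOOG', 100)]
result2 = add_bonus_shares_safely(mine2)
print(mine2)            # [('GOOG', 100)]
print(result2)          # [('GOOG', 100), ('BONUS', 10)]
```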

Identity and References

Use the is operator to check if two values are exactly the same object.

>>> a = [1,2,3]
>>> b = a
>>> a is b
True
>>>

is compares the object identity (an integer). The identity can be obtained using id().

>>> id(a)
3588944
>>> id(b)
3588944
>>>

Note: It is almost always better to use == to compare values. The behavior of is is often unexpected:

>>> a = [1,2,3]
>>> b = a
>>> c = [1,2,3]
>>> a is b
True
>>> a is c
False
>>> a == c
True
>>>

Shallow copies

Lists and dicts have methods for copying.

>>> a = [2,3,[100,101],4]
>>> b = list(a) # Make a copy
>>> a is b
False

It’s a new list, but the list items are shared.

>>> a[2].append(102)
>>> b[2]
[100, 101, 102]
>>>
>>> a[2] is b[2]
True
>>>

For example, the inner list [100, 101, 102] is being shared. This is known as a shallow copy. Here is a picture.

[Figure: Shallow copy]
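Besides list(a), a few other common spellings also produce shallow copies. The sketch below uses made-up data to show that each one creates a new outer container that still shares its inner objects.

```python
a = [2, 3, [100, 101], 4]
b = a.copy()        # list method
c = a[:]            # full slice

d = {'name': 'GOOG', 'lots': [100, 75]}
e = d.copy()        # dict method

# Each copy is a new outer container...
print(b is a, c is a, e is d)                   # False False False

# ...but the inner objects are still shared.
print(b[2] is a[2], c[2] is a[2])               # True True
print(e['lots'] is d['lots'])                   # True
```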

Deep copies

Sometimes you need to make a copy of an object and all the objects contained within it. You can use the copy module for this:

>>> a = [2,3,[100,101],4]
>>> import copy
>>> b = copy.deepcopy(a)
>>> a[2].append(102)
>>> b[2]
[100, 101]
>>> a[2] is b[2]
False
>>>

Names, Values, Types

A variable name does not have a type; it’s only a name. However, values do have an underlying type.

>>> a = 42
>>> b = 'Hello World'
>>> type(a)
<class 'int'>
>>> type(b)
<class 'str'>

type() will tell you what it is. The type name is usually used as a function that creates or converts a value to that type.
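A few examples of type names being used as functions. This is a small sketch with illustrative variable names:

```python
a = int('42')          # convert a string to an integer
b = float('32.20')     # convert a string to a float
c = str(100)           # convert an integer to a string
d = list('abc')        # build a list from the characters of a string

print(a, b, c, d)      # 42 32.2 100 ['a', 'b', 'c']

# The object returned by type() is that same callable.
t = type(a)
print(t is int)        # True
print(t('7'))          # 7
```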

Type Checking

How to tell if an object is a specific type.

if isinstance(a, list):
    print('a is a list')

Checking for one of many possible types.

if isinstance(a, (list,tuple)):
    print('a is a list or tuple')

Caution: Don’t go overboard with type checking. It can lead to excessive code complexity. Usually you’d only do it if doing so would prevent common mistakes made by others using your code.
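As one sketch of a check that arguably earns its keep, consider a hypothetical function that expects a sequence of (name, shares) pairs. Passing a filename string by mistake would otherwise fail later with a less helpful unpacking error.

```python
def total_shares(rows):
    # Hypothetical helper: rows should be a sequence of (name, shares) pairs.
    # A string is also iterable, so catch that mistake with a clear message.
    if isinstance(rows, str):
        raise TypeError('expected (name, shares) pairs, not a string')
    total = 0
    for name, shares in rows:
        total += shares
    return total

print(total_shares([('GOOG', 100), ('IBM', 50)]))   # 150

try:
    total_shares('Data/portfolio.csv')
except TypeError as e:
    print('Rejected:', e)
```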

Everything is an object

Numbers, strings, lists, functions, exceptions, classes, instances, etc. are all objects. It means that all objects that can be named can be passed around as data, placed in containers, etc., without any restrictions. There are no special kinds of objects. Sometimes it is said that all objects are “first-class”.

A simple example:

>>> import math
>>> items = [abs, math, ValueError ]
>>> items
[<built-in function abs>,
  <module 'math' (built-in)>,
  <class 'ValueError'>]
>>> items[0](-45)
45
>>> items[1].sqrt(2)
1.4142135623730951
>>> try:
        x = int('not a number')
    except items[2]:
        print('Failed!')
Failed!
>>>

Here, items is a list containing a function, a module and an exception. You can directly use the items in the list in place of the original names:

items[0](-45)       # abs
items[1].sqrt(2)    # math
except items[2]:    # ValueError

With great power comes responsibility. Just because you can do that doesn’t mean you should.

Exercises

In this set of exercises, we look at some of the power that comes from first-class objects.

Exercise 2.24: First-class Data

In the file Data/portfolio.csv, we read data organized as columns that look like this:

name,shares,price
"AA",100,32.20
"IBM",50,91.10
...

In previous code, we used the csv module to read the file, but still had to perform manual type conversions. For example:

for row in rows:
    name   = row[0]
    shares = int(row[1])
    price  = float(row[2])

This kind of conversion can also be performed in a more clever manner using some basic list operations.

Make a Python list that contains the names of the conversion functions you would use to convert each column into the appropriate type:

>>> types = [str, int, float]
>>>

The reason you can even create this list is that everything in Python is first-class. So, if you want to have a list of functions, that’s fine. The items in the list you created are functions for converting a value x into a given type (e.g., str(x), int(x), float(x)).

Now, read a row of data from the above file:

>>> import csv
>>> f = open('Data/portfolio.csv')
>>> rows = csv.reader(f)
>>> headers = next(rows)
>>> row = next(rows)
>>> row
['AA', '100', '32.20']
>>>

As noted, this row isn’t enough to do calculations because the types are wrong. For example:

>>> row[1] * row[2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't multiply sequence by non-int of type 'str'
>>>

However, maybe the data can be paired up with the types you specified in types. For example:

>>> types[1]
<class 'int'>
>>> row[1]
'100'
>>>

Try converting one of the values:

>>> types[1](row[1])     # Same as int(row[1])
100
>>>

Try converting a different value:

>>> types[2](row[2])     # Same as float(row[2])
32.2
>>>

Try the calculation with converted values:

>>> types[1](row[1])*types[2](row[2])
3220.0000000000005
>>>

Zip the column types with the fields and look at the result:

>>> r = list(zip(types, row))
>>> r
[(<class 'str'>, 'AA'), (<class 'int'>, '100'), (<class 'float'>, '32.20')]
>>>

You will notice that this has paired a type conversion with a value. For example, int is paired with the value '100'.

The zipped list is useful if you want to perform conversions on all of the values, one after the other. Try this:

>>> converted = []
>>> for func, val in zip(types, row):
          converted.append(func(val))
...
>>> converted
['AA', 100, 32.2]
>>> converted[1] * converted[2]
3220.0000000000005
>>>

Make sure you understand what’s happening in the above code. In the loop, the func variable is one of the type conversion functions (e.g., str, int, etc.) and the val variable is one of the values like 'AA', '100'. The expression func(val) is converting a value (kind of like a type cast).

The above code can be compressed into a single list comprehension.

>>> converted = [func(val) for func, val in zip(types, row)]
>>> converted
['AA', 100, 32.2]
>>>

Exercise 2.25: Making dictionaries

Remember how the dict() function can easily make a dictionary if you have a sequence of key names and values? Let’s make a dictionary from the column headers:

>>> headers
['name', 'shares', 'price']
>>> converted
['AA', 100, 32.2]
>>> dict(zip(headers, converted))
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>>

Of course, if you’re up on your list-comprehension fu, you can do the whole conversion in a single step using a dict-comprehension:

>>> { name: func(val) for name, func, val in zip(headers, types, row) }
{'name': 'AA', 'shares': 100, 'price': 32.2}
>>>

Exercise 2.26: The Big Picture

Using the techniques in this exercise, you could write statements that easily convert fields from just about any column-oriented datafile into a Python dictionary.

Just to illustrate, suppose you read data from a different datafile like this:

>>> f = open('Data/dowstocks.csv')
>>> rows = csv.reader(f)
>>> headers = next(rows)
>>> row = next(rows)
>>> headers
['name', 'price', 'date', 'time', 'change', 'open', 'high', 'low', 'volume']
>>> row
['AA', '39.48', '6/11/2007', '9:36am', '-0.18', '39.67', '39.69', '39.45', '181800']
>>>

Let’s convert the fields using a similar trick:

>>> types = [str, float, str, str, float, float, float, float, int]
>>> converted = [func(val) for func, val in zip(types, row)]
>>> record = dict(zip(headers, converted))
>>> record
{'name': 'AA', 'price': 39.48, 'date': '6/11/2007', 'time': '9:36am',
'change': -0.18, 'open': 39.67, 'high': 39.69, 'low': 39.45,
'volume': 181800}
>>> record['name']
'AA'
>>> record['price']
39.48
>>>

Bonus: How would you modify this example to additionally parse the date entry into a tuple such as (6, 11, 2007)?

Spend some time to ponder what you’ve done in this exercise. We’ll revisit these ideas a little later.