6. Generators
Iteration (the `for`-loop) is one of the most common programming patterns in Python. Programs do a lot of iteration to process lists, read files, query databases, and more. One of the most powerful features of Python is the ability to customize and redefine iteration in the form of a so-called "generator function." This section introduces this topic. By the end, you'll write some programs that process real-time streaming data in an interesting way.
- 6.1 Iteration Protocol
- 6.2 Customizing Iteration with Generators
- 6.3 Producer/Consumer Problems and Workflows
- 6.4 Generator Expressions
Iteration Protocol
This section looks at the underlying process of iteration.
Iteration Everywhere
Many different objects support iteration.
```python
a = 'hello'
for c in a:                  # Loop over characters in a
    ...

b = {'name': 'Dave', 'password': 'foo'}
for k in b:                  # Loop over keys in dictionary
    ...

c = [1, 2, 3, 4]
for i in c:                  # Loop over items in a list/tuple
    ...

f = open('foo.txt')
for x in f:                  # Loop over lines in a file
    ...
```
Iteration: Protocol
Consider the `for`-statement.
```python
for x in obj:
    # statements
```
What happens under the hood?
```python
_iter = obj.__iter__()          # Get iterator object
while True:
    try:
        x = _iter.__next__()    # Get next item
    except StopIteration:       # No more items
        break
    # statements ...
```
All the objects that work with the `for`-loop implement this low-level iteration protocol.
Example: Manual iteration over a list.
```python
>>> x = [1,2,3]
>>> it = x.__iter__()
>>> it
<list_iterator object at 0x590b0>
>>> it.__next__()
1
>>> it.__next__()
2
>>> it.__next__()
3
>>> it.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
```
Supporting Iteration
Knowing about iteration is useful if you want to add it to your own objects. For example, making a custom container.
```python
class Portfolio:
    def __init__(self):
        self.holdings = []

    def __iter__(self):
        return self.holdings.__iter__()
    ...

port = Portfolio()
for s in port:
    ...
```
Exercises
Exercise 6.1: Iteration Illustrated
Create the following list:
```python
a = [1, 9, 4, 25, 16]
```
Manually iterate over this list. Call `__iter__()` to get an iterator and call the `__next__()` method to obtain successive elements.
```python
>>> i = a.__iter__()
>>> i
<list_iterator object at 0x64c10>
>>> i.__next__()
1
>>> i.__next__()
9
>>> i.__next__()
4
>>> i.__next__()
25
>>> i.__next__()
16
>>> i.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
```
The `next()` built-in function is a shortcut for calling the `__next__()` method of an iterator. Try using it on a file:
```python
>>> f = open('Data/portfolio.csv')
>>> f.__iter__()    # Note: This returns the file itself
<_io.TextIOWrapper name='Data/portfolio.csv' mode='r' encoding='UTF-8'>
>>> next(f)
'name,shares,price\n'
>>> next(f)
'"AA",100,32.20\n'
>>> next(f)
'"IBM",50,91.10\n'
>>>
```
Keep calling `next(f)` until you reach the end of the file. Watch what happens.
Exercise 6.2: Supporting Iteration
On occasion, you might want to make one of your own objects support iteration, especially if your object wraps around an existing list or other iterable. In a new file `portfolio.py`, define the following class:
```python
# portfolio.py

class Portfolio:
    def __init__(self, holdings):
        self._holdings = holdings

    @property
    def total_cost(self):
        return sum([s.cost for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
This class is meant to be a layer around a list, but with some extra methods such as the `total_cost` property. Modify the `read_portfolio()` function in `report.py` so that it creates a `Portfolio` instance like this:
```python
# report.py
...
import fileparse
from stock import Stock
from portfolio import Portfolio

def read_portfolio(filename):
    '''
    Read a stock portfolio file into a list of dictionaries with keys
    name, shares, and price.
    '''
    with open(filename) as file:
        portdicts = fileparse.parse_csv(file,
                                        select=['name','shares','price'],
                                        types=[str,int,float])

    portfolio = [ Stock(d['name'], d['shares'], d['price']) for d in portdicts ]
    return Portfolio(portfolio)
...
Try running the `report.py` program. You will find that it fails spectacularly because `Portfolio` instances aren't iterable.
```python
>>> import report
>>> report.portfolio_report('Data/portfolio.csv', 'Data/prices.csv')
... crashes ...
```
Fix this by modifying the `Portfolio` class to support iteration:
```python
class Portfolio:
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
After you've made this change, your `report.py` program should work again. While you're at it, fix up your `pcost.py` program to use the new `Portfolio` object. Like this:
```python
# pcost.py

import report

def portfolio_cost(filename):
    '''
    Computes the total cost (shares*price) of a portfolio file
    '''
    portfolio = report.read_portfolio(filename)
    return portfolio.total_cost
...
```
Test it to make sure it works:
```python
>>> import pcost
>>> pcost.portfolio_cost('Data/portfolio.csv')
44671.15
>>>
```
Exercise 6.3: Making a more proper container
If making a container class, you often want to do more than just iteration. Modify the `Portfolio` class so that it has some other special methods like this:
```python
class Portfolio:
    def __init__(self, holdings):
        self._holdings = holdings

    def __iter__(self):
        return self._holdings.__iter__()

    def __len__(self):
        return len(self._holdings)

    def __getitem__(self, index):
        return self._holdings[index]

    def __contains__(self, name):
        return any([s.name == name for s in self._holdings])

    @property
    def total_cost(self):
        return sum([s.shares*s.price for s in self._holdings])

    def tabulate_shares(self):
        from collections import Counter
        total_shares = Counter()
        for s in self._holdings:
            total_shares[s.name] += s.shares
        return total_shares
```
Now, try some experiments using this new class:
```python
>>> import report
>>> portfolio = report.read_portfolio('Data/portfolio.csv')
>>> len(portfolio)
7
>>> portfolio[0]
Stock('AA', 100, 32.2)
>>> portfolio[1]
Stock('IBM', 50, 91.1)
>>> portfolio[0:3]
[Stock('AA', 100, 32.2), Stock('IBM', 50, 91.1), Stock('CAT', 150, 83.44)]
>>> 'IBM' in portfolio
True
>>> 'AAPL' in portfolio
False
>>>
```
One important observation about this: generally, code is considered "Pythonic" if it speaks the common vocabulary of how other parts of Python normally work. For container objects, supporting iteration, indexing, containment, and other kinds of operators is an important part of this.
Customizing Iteration
This section looks at how you can customize iteration using a generator function.
A problem
Suppose you wanted to create your own custom iteration pattern.
For example, a countdown.
```python
>>> for x in countdown(10):
...     print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>
```
There is an easy way to do this.
Generators
A generator is a function that defines iteration.
```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
```
For example:
```python
>>> for x in countdown(10):
...     print(x, end=' ')
...
10 9 8 7 6 5 4 3 2 1
>>>
```
A generator is any function that uses the `yield` statement.
The behavior of a generator is different from that of a normal function. Calling a generator function creates a generator object; it does not immediately execute the function body.
```python
def countdown(n):
    # Added a print statement
    print('Counting down from', n)
    while n > 0:
        yield n
        n -= 1
```
```python
>>> x = countdown(10)   # There is NO PRINT STATEMENT
>>> x                   # x is a generator object
<generator object at 0x58490>
>>>
```
The function only executes when `__next__()` is called.
```python
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.__next__()
Counting down from 10
10
>>>
```
`yield` produces a value, but suspends the function's execution. The function resumes on the next call to `__next__()`.
```python
>>> x.__next__()
9
>>> x.__next__()
8
```
When the generator finally returns, the iteration raises a `StopIteration` error.
```python
>>> x.__next__()
1
>>> x.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
```
Observation: A generator function implements the same low-level protocol that the `for` statement uses on lists, tuples, dicts, files, etc.
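You can verify this yourself. The following minimal sketch (using the `countdown()` function from above) checks that a generator object is its own iterator and works with the built-in `iter()` and `next()` functions:

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

x = countdown(3)
print(iter(x) is x)   # True: a generator object is its own iterator
print(next(x))        # 3
print(next(x))        # 2
print(list(x))        # [1] -- list() consumes whatever remains
```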
Exercises
Exercise 6.4: A Simple Generator
If you ever find yourself wanting to customize iteration, you should always think generator functions. They're easy to write: make a function that carries out the desired iteration logic and use `yield` to emit values.
For example, try this generator that searches a file for lines containing a matching substring:
```python
>>> def filematch(filename, substr):
...     with open(filename, 'r') as f:
...         for line in f:
...             if substr in line:
...                 yield line
...
>>> for line in open('Data/portfolio.csv'):
...     print(line, end='')
...
name,shares,price
"AA",100,32.20
"IBM",50,91.10
"CAT",150,83.44
"MSFT",200,51.23
"GE",95,40.37
"MSFT",50,65.10
"IBM",100,70.44
>>> for line in filematch('Data/portfolio.csv', 'IBM'):
...     print(line, end='')
...
"IBM",50,91.10
"IBM",100,70.44
>>>
```
This is kind of interesting: the idea that you can hide a bunch of custom processing in a function and use it to feed a `for`-loop. The next example looks at a more unusual case.
Exercise 6.5: Monitoring a streaming data source
Generators can be an interesting way to monitor real-time data sources such as log files or stock market feeds. In this part, we'll explore this idea. To start, follow these instructions carefully.
The program `Data/stocksim.py` simulates stock market data. As output, it constantly writes real-time data to a file `Data/stocklog.csv`. In a separate command window, go into the `Data/` directory and run this program:
```bash
bash % python3 stocksim.py
```
If you are on Windows, just locate the `stocksim.py` program and double-click on it to run it. Now, forget about this program (just let it run). Using another window, look at the file `Data/stocklog.csv` being written by the simulator. You should see new lines of text being added to the file every few seconds. Again, just let this program run in the background; it will run for several hours (you shouldn't need to worry about it).
Once the above program is running, let's write a little program to open the file, seek to the end, and watch for new output. Create a file `follow.py` and put this code in it:
```python
# follow.py
import os
import time

f = open('Data/stocklog.csv')
f.seek(0, os.SEEK_END)      # Move file pointer 0 bytes from end of file

while True:
    line = f.readline()
    if line == '':
        time.sleep(0.1)     # Sleep briefly and retry
        continue
    fields = line.split(',')
    name = fields[0].strip('"')
    price = float(fields[1])
    change = float(fields[4])
    if change < 0:
        print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
If you run the program, you'll see a real-time stock ticker. Under the hood, this code is kind of like the Unix `tail -f` command that's used to watch a log file.
Note: The use of the `readline()` method in this example is somewhat unusual in that it is not the usual way of reading lines from a file (normally you would just use a `for`-loop). However, in this case, we are using it to repeatedly probe the end of the file to see if more data has been added (`readline()` will either return new data or an empty string).
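This end-of-file behavior of `readline()` is easy to verify on its own. The following minimal sketch (using a throwaway temporary file rather than the stock log) shows that `readline()` returns an empty string once the file is exhausted, rather than blocking or raising an exception:

```python
import os
import tempfile

# Create a one-line file to probe
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line1\n')
    path = tmp.name

f = open(path)
print(repr(f.readline()))   # 'line1\n'
print(repr(f.readline()))   # ''  -- at end of file: no exception, no blocking
f.close()
os.remove(path)
```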
Exercise 6.6: Using a generator to produce data
If you look at the code in Exercise 6.5, the first part of the code produces lines of data whereas the statements at the end of the `while` loop consume the data. A major feature of generator functions is that you can move all of the data production code into a reusable function.
Modify the code in Exercise 6.5 so that the file-reading is performed by a generator function `follow(filename)`. Make it so the following code works:
```python
>>> for line in follow('Data/stocklog.csv'):
...     print(line, end='')

... Should see lines of output produced here ...
```
Modify the stock ticker code so that it looks like this:
```python
if __name__ == '__main__':
    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if change < 0:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
Exercise 6.7: Watching your portfolio
Modify the `follow.py` program so that it watches the stream of stock data and prints a ticker showing information for only those stocks in a portfolio. For example:
```python
if __name__ == '__main__':
    import report
    portfolio = report.read_portfolio('Data/portfolio.csv')

    for line in follow('Data/stocklog.csv'):
        fields = line.split(',')
        name = fields[0].strip('"')
        price = float(fields[1])
        change = float(fields[4])
        if name in portfolio:
            print(f'{name:>10s} {price:>10.2f} {change:>10.2f}')
```
Note: For this to work, your `Portfolio` class must support the `in` operator. See Exercise 6.3 and make sure you implement the `__contains__()` method.
Discussion
Something very powerful just happened here. You moved an interesting iteration pattern (reading lines at the end of a file) into its own little function. The `follow()` function is now a completely general-purpose utility that you can use in any program. For example, you could use it to watch server logs, debugging logs, and other similar data sources. That's kind of cool.
Producers, Consumers and Pipelines
Generators are a useful tool for setting up various kinds of producer/consumer problems and dataflow pipelines. This section discusses that.
Producer-Consumer Problems
Generators are closely related to various forms of producer-consumer problems.
```python
# Producer
def follow(f):
    ...
    while True:
        ...
        yield line          # Produces value in `line` below
        ...

# Consumer
for line in follow(f):      # Consumes value from `yield` above
    ...
```
`yield` produces values that `for` consumes.
Generator Pipelines
You can use this aspect of generators to set up processing pipelines (like Unix pipes).
producer → processing → processing → consumer
Processing pipes have an initial data producer, some set of intermediate processing stages and a final consumer.
producer → processing → processing → consumer
```python
def producer():
    ...
    yield item
    ...
```
The producer is typically a generator, although it could also be a list or some other sequence. `yield` feeds data into the pipeline.
producer → processing → processing → consumer
```python
def consumer(s):
    for item in s:
        ...
```
The consumer is a `for`-loop. It gets items and does something with them.
producer → processing → processing → consumer
```python
def processing(s):
    for item in s:
        ...
        yield newitem
        ...
```
Intermediate processing stages simultaneously consume and produce items. They might modify the data stream. They can also filter (discarding items).
producer → processing → processing → consumer
```python
def producer():
    ...
    yield item          # yields the item that is received by the `processing`
    ...

def processing(s):
    for item in s:      # Comes from the `producer`
        ...
        yield newitem   # yields a new item
        ...

def consumer(s):
    for item in s:      # Comes from the `processing`
        ...
```
Code to set up the pipeline:
```python
a = producer()
b = processing(a)
c = consumer(b)
```
You will notice that data incrementally flows through the different functions.
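To make that incremental flow visible, here is a minimal, self-contained sketch. The function names follow the pattern above, but the data and the print statements are made up purely for illustration. Notice that each item travels through every stage before the next item is even produced:

```python
def producer():
    for item in [1, 2, 3]:
        print('producing', item)
        yield item

def processing(s):
    for item in s:
        print('processing', item)
        yield item * 10         # Transform the item as it flows through

def consumer(s):
    for item in s:
        print('consuming', item)

consumer(processing(producer()))
# producing 1
# processing 1
# consuming 10
# producing 2
# processing 2
# consuming 20
# producing 3
# processing 3
# consuming 30
```

No stage runs ahead of the consumer: the `for`-loop at the end pulls one item at a time through the whole chain.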
Exercises
For this exercise, the `stocksim.py` program should still be running in the background. You're going to use the `follow()` function you wrote in the previous exercise.
Exercise 6.8: Setting up a simple pipeline
Let’s see the pipelining idea in action. Write the following function:
```python
>>> def filematch(lines, substr):
...     for line in lines:
...         if substr in line:
...             yield line
...
>>>
```
This function is almost exactly the same as the first generator example in the previous exercise except that it's no longer opening a file; it merely operates on a sequence of lines given to it as an argument. Now, try this:
```python
>>> lines = follow('Data/stocklog.csv')
>>> ibm = filematch(lines, 'IBM')
>>> for line in ibm:
...     print(line)

... wait for output ...
```
It might take a while for output to appear, but eventually you should see some lines containing data for IBM.
Exercise 6.9: Setting up a more complex pipeline
Take the pipelining idea a few steps further by performing more actions.
```python
>>> from follow import follow
>>> import csv
>>> lines = follow('Data/stocklog.csv')
>>> rows = csv.reader(lines)
>>> for row in rows:
...     print(row)
...
['BA', '98.35', '6/11/2007', '09:41.07', '0.16', '98.25', '98.35', '98.31', '158148']
['AA', '39.63', '6/11/2007', '09:41.07', '-0.03', '39.67', '39.63', '39.31', '270224']
['XOM', '82.45', '6/11/2007', '09:41.07', '-0.23', '82.68', '82.64', '82.41', '748062']
['PG', '62.95', '6/11/2007', '09:41.08', '-0.12', '62.80', '62.97', '62.61', '454327']
...
```
Well, that's interesting. What you're seeing here is that the output of the `follow()` function has been piped into the `csv.reader()` function and we're now getting a sequence of split rows.
Exercise 6.10: Making more pipeline components
Let's extend the whole idea into a larger pipeline. In a separate file `ticker.py`, start by creating a function that reads a CSV file as you did above:
```python
# ticker.py

from follow import follow
import csv

def parse_stock_data(lines):
    rows = csv.reader(lines)
    return rows

if __name__ == '__main__':
    lines = follow('Data/stocklog.csv')
    rows = parse_stock_data(lines)
    for row in rows:
        print(row)
```
Write a new function that selects specific columns:
```python
# ticker.py
...
def select_columns(rows, indices):
    for row in rows:
        yield [row[index] for index in indices]
...
def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    return rows
```
Run your program again. You should see output narrowed down like this:
```python
['BA', '98.35', '0.16']
['AA', '39.63', '-0.03']
['XOM', '82.45', '-0.23']
['PG', '62.95', '-0.12']
...
```
Write generator functions that convert data types and build dictionaries. For example:
```python
# ticker.py
...
def convert_types(rows, types):
    for row in rows:
        yield [func(val) for func, val in zip(types, row)]

def make_dicts(rows, headers):
    for row in rows:
        yield dict(zip(headers, row))
...
def parse_stock_data(lines):
    rows = csv.reader(lines)
    rows = select_columns(rows, [0, 1, 4])
    rows = convert_types(rows, [str, float, float])
    rows = make_dicts(rows, ['name', 'price', 'change'])
    return rows
...
```
Run your program again. You should now see a stream of dictionaries like this:
```python
{ 'name':'BA', 'price':98.35, 'change':0.16 }
{ 'name':'AA', 'price':39.63, 'change':-0.03 }
{ 'name':'XOM', 'price':82.45, 'change':-0.23 }
{ 'name':'PG', 'price':62.95, 'change':-0.12 }
...
```
Exercise 6.11: Filtering data
Write a function that filters data. For example:
```python
# ticker.py
...
def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row
```
Use this to filter stocks to just those in your portfolio:
```python
import report
portfolio = report.read_portfolio('Data/portfolio.csv')
rows = parse_stock_data(follow('Data/stocklog.csv'))
rows = filter_symbols(rows, portfolio)
for row in rows:
    print(row)
```
Exercise 6.12: Putting it all together
In the `ticker.py` program, write a function `ticker(portfile, logfile, fmt)` that creates a real-time stock ticker from a given portfolio, logfile, and table format. For example:
```python
>>> from ticker import ticker
>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'txt')
      Name      Price     Change
---------- ---------- ----------
        GE      37.14      -0.18
      MSFT      29.96      -0.09
       CAT      78.03      -0.49
        AA      39.34      -0.32
...
>>> ticker('Data/portfolio.csv', 'Data/stocklog.csv', 'csv')
Name,Price,Change
IBM,102.79,-0.28
CAT,78.04,-0.48
AA,39.35,-0.31
CAT,78.05,-0.47
...
```
Discussion
Some lessons learned: you can create various generator functions and chain them together to perform processing involving data-flow pipelines. In addition, you can create functions that package a series of pipeline stages into a single function call (for example, the `parse_stock_data()` function).
More Generators
This section introduces a few additional generator-related topics, including generator expressions and the `itertools` module.
Generator Expressions
A generator version of a list comprehension.
```python
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b:
...     print(i, end=' ')
...
2 4 6 8
>>>
```
Differences with List Comprehensions.
- It does not construct a list.
- Its only useful purpose is iteration.
- Once consumed, it can't be reused.
General syntax.
```python
(<expression> for i in s if <conditional>)
```
It can also serve as a function argument.
```python
sum(x*x for x in a)
```
It can be applied to any iterable.
```python
>>> a = [1,2,3,4]
>>> b = (x*x for x in a)
>>> c = (-x for x in b)
>>> for i in c:
...     print(i, end=' ')
...
-1 -4 -9 -16
>>>
```
The main use of generator expressions is in code that performs some calculation on a sequence, but only uses the result once. For example, strip all comments from a file.
```python
f = open('somefile.txt')
lines = (line for line in f if not line.startswith('#'))
for line in lines:
    ...
f.close()
```
With generators, the code runs faster and uses little memory. It’s like a filter applied to a stream.
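A quick way to see the memory difference is with `sys.getsizeof()`, which reports a container's own size (not the size of its elements). This is a minimal sketch with made-up data:

```python
import sys

nums = range(1_000_000)
as_list = [x*x for x in nums]   # Builds all one million results up front
as_gen  = (x*x for x in nums)   # Builds nothing until iterated

print(sys.getsizeof(as_list))   # Several megabytes for the list object alone
print(sys.getsizeof(as_gen))    # A couple hundred bytes, regardless of size
```

The generator expression stays tiny no matter how much data eventually flows through it, because it only ever holds one item at a time.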
Why Generators
- Many problems are much more clearly expressed in terms of iteration.
- Looping over a collection of items and performing some kind of operation (searching, replacing, modifying, etc.).
- Processing pipelines can be applied to a wide range of data processing problems.
- Better memory efficiency.
- Only produce values when needed.
- Contrast to constructing giant lists.
- Can operate on streaming data
- Generators encourage code reuse
- Separates the iteration from code that uses the iteration
- You can build a toolbox of interesting iteration functions and mix-n-match.
itertools module
`itertools` is a library module with various functions designed to help with iterators/generators.
```python
itertools.chain(s1, s2)
itertools.count(n)
itertools.cycle(s)
itertools.dropwhile(predicate, s)
itertools.groupby(s)
itertools.filterfalse(predicate, s)
itertools.islice(s, start, stop)
itertools.repeat(s, n)
itertools.tee(s, ncopies)
itertools.zip_longest(s1, ..., sN)
```
All functions process data iteratively. They implement various kinds of iteration patterns.
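For a taste of what these offer, here is a small sketch chaining a few of them together (the data is made up for illustration):

```python
import itertools

a = [1, 2, 3]
b = [4, 5, 6]

# chain(): iterate over several sequences as if they were one
print(list(itertools.chain(a, b)))          # [1, 2, 3, 4, 5, 6]

# islice(): take a slice of any iterator, even an infinite one
evens = (n for n in itertools.count() if n % 2 == 0)
print(list(itertools.islice(evens, 5)))     # [0, 2, 4, 6, 8]

# groupby(): group consecutive equal items (input must be sorted by key)
names = ['AA', 'AA', 'IBM', 'MSFT', 'MSFT']
for key, group in itertools.groupby(names):
    print(key, len(list(group)))            # AA 2, IBM 1, MSFT 2
```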
More information can be found in the Generator Tricks for Systems Programmers tutorial from PyCon '08.
Exercises
In the previous exercises, you wrote some code that followed lines being written to a log file and parsed them into a sequence of rows. This exercise continues to build upon that. Make sure the `Data/stocksim.py` program is still running.
Exercise 6.13: Generator Expressions
Generator expressions are a generator version of a list comprehension. For example:
```python
>>> nums = [1, 2, 3, 4, 5]
>>> squares = (x*x for x in nums)
>>> squares
<generator object <genexpr> at 0x109207e60>
>>> for n in squares:
...     print(n)
...
1
4
9
16
25
```
Unlike a list comprehension, a generator expression can only be used once. Thus, if you try another `for`-loop, you get nothing:
```python
>>> for n in squares:
...     print(n)
...
>>>
```
Exercise 6.14: Generator Expressions in Function Arguments
Generator expressions are sometimes placed into function arguments. It looks a little weird at first, but try this experiment:
```python
>>> nums = [1,2,3,4,5]
>>> sum([x*x for x in nums])    # A list comprehension
55
>>> sum(x*x for x in nums)      # A generator expression
55
>>>
```
In the above example, the second version, using a generator expression, would use significantly less memory if a large list were being manipulated.
In your `portfolio.py` file, you performed a few calculations involving list comprehensions. Try replacing these with generator expressions.
Exercise 6.15: Code simplification
Generator expressions are often a useful replacement for small generator functions. For example, instead of writing a function like this:
```python
def filter_symbols(rows, names):
    for row in rows:
        if row['name'] in names:
            yield row
```
You could write something like this:
```python
rows = (row for row in rows if row['name'] in names)
```
Modify the `ticker.py` program to use generator expressions as appropriate.
[[../Contents.md|Contents]] | Previous (6.3 Producer/Consumer) | [[../07_Advanced_Topics/00_Overview.md|Next (7 Advanced Topics)]]