# Linux:6 Text in Linux

## Manipulations with text

Dealing with textual matter is the meat of Linux (and of most computing), so there are going to be many chapters about the various aspects of text. This first chapter in this part of the book shows how to view text on your display screen.

There are many ways to view or otherwise output text. When your intention is to edit the text of a file, open it in a text editor. Some kinds of files--such as PostScript, DVI, and PDF files--often contain text in them, but they are technically not text files.

## Perusing Text

Use less to peruse text, viewing it one screen (or "page") at a time. The less tool works on either files or standard output--it is popularly used as the last command on a pipeline so that you can page through the text output of some commands.

zless is identical to less, but you use it to view compressed text files; it allows you to read a compressed text file's contents without having to uncompress it first. Most of the system documentation in the '/usr/doc' and '/usr/share/doc' directories, for example, consists of compressed text files.

You may, on occasion, be confronted with a reference to a command for paging text called more. It was the standard tool for paging text until it gave way to less in the early to mid-1990s; less comes with many more options--its most notable advantage being the ability to scroll backward through a file--but at the expense of being almost exactly three times the size of more. Hence there are two meanings to the saying, "less is more."

### Perusing a Text File

To peruse or page through a text file, give the name of the file as an argument to less.

• To page through the text file 'README', type:

$ less README RET

This command starts less and displays the file 'README' on the screen.

You can move forward through the document a line at a time by typing the down arrow key, and you can move forward through the document a screenful at a time by typing PgDn. To move backward by a line, type the up arrow key, and type PgUp to move backward by a screenful.

To stop viewing and exit less, press Q.

### Perusing Multiple Text Files

You can specify more than one file to page through with less, and you can specify file patterns in order to open all of the files that match that pattern.

• To page through all of the Unix FAQ files in '/usr/doc/FAQ', type:

$ less /usr/doc/FAQ/unix-faq-part* RET

This command starts less, opens in it all of the files that match the given pattern '/usr/doc/FAQ/unix-faq-part*', and begins displaying the first one.

NOTE: When you specify more than one file to page, less displays each file in turn, beginning with the first file you specify or the first file that matches the given pattern. To move to the next file, type :n; to move to the previous file, type :p.

### Commands Available While Perusing Text

The following table gives a summary of the keyboard commands that you can use while paging through text in less. It lists the keystrokes and describes the commands.

KEYSTROKE - COMMAND

Up arrow - Scroll back through the text ("up") one line.

Down arrow - Scroll forward through the text ("down") one line.

Left arrow or Right arrow - Scroll horizontally (left or right) one tab stop; useful for perusing files that contain long lines.

PgUp - Scroll backward through the text by one screenful.

PgDn or SPC - Scroll forward through the text by one screenful.

C-l - Redraw the screen.

/pattern - Search forward through the file for lines containing pattern.

?pattern - Search backward through the file for lines containing pattern.

< - Move to beginning of the file.

> - Move to end of the file.

h - Display a help screen.

q - Quit viewing the file and exit less.

## Outputting Text

The simplest way to view text is to output it to standard output. This is useful for quickly looking at part of a text, or for passing part of a text to other tools in a command line.

Many people still use cat to view a text file, especially if it is a very small file. To output all of a file's contents on the screen, use cat and give the file name as an argument.

This isn't always the best way to peruse or read text--a very large text will scroll off the top of the screen, for example--but sometimes the simple outputting of text is quite appropriate, such as when you just want to output one line of a file, or when you want to output several files into one new file.

This section describes the tools used for such purposes. These tools are best used as filters, often at the end of a pipeline, outputting the standard input from other commands.

### Showing Non-printing Characters

Use cat with the '-v' option to output non-printing characters, such as control characters, in such a way that you can see them. With this option, cat outputs those characters in hat notation, where they are represented by a '^' and the character corresponding to the actual control character (for example, a bell character would be output as '^G').

• To peruse the file 'translation' with non-printing characters displayed in hat notation, type:

$ cat -v translation | less RET

In this example, the output of cat is piped to less for viewing on the screen; you could have piped it to another command, or redirected it to a file instead.

To visually display the end of each line, use the '-E' option; it specifies that a '$' should be output after the end of each line. This is useful for determining whether lines contain trailing space characters.

Also useful is the '-T' option, which outputs tab characters as '^I'.

The '-A' option combines all three of these options--it is the same as specifying '-vET'.
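
As a quick check of these options, the following sketch feeds cat a line containing a tab and a trailing space, and a second line containing a bell character. It assumes GNU cat (the '-E', '-T', and '-A' options are GNU extensions), and the sample text is made up for illustration.

```shell
# Sample line (an assumption for illustration): "a", a tab, "b", a trailing space.
# With -A the tab appears as ^I and the end of the line is marked with $.
shown=$(printf 'a\tb \n' | cat -A)
printf '%s\n' "$shown"

# A bell character (octal 007) shown in hat notation by -v alone.
bell=$(printf 'ding\a\n' | cat -v)
printf '%s\n' "$bell"
```

The space before the '$' in the first line of output is the trailing space that would otherwise be invisible.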

### Outputting a Beginning Part of a Text

Use head to output the beginning of a text. By default, it outputs the first ten lines of its input.

• To output the first ten lines of file 'placement-list', type:

$ head placement-list RET

You can specify as a numeric option the number of lines to output. If you specify more lines than a file contains, head just outputs the entire text.

• To output the first line of file 'placement-list', type:

$ head -1 placement-list RET

• To output the first sixty-six lines of file 'placement-list', type:

$ head -66 placement-list RET

To output a given number of characters instead of lines, give the number of characters to output as an argument to the '-c' option.

• To output the first character in the file 'placement-list', type:

$ head -c1 placement-list RET
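
The head recipes above can be tried on a throwaway file. This is a minimal sketch: the twelve-line sample file is invented for illustration, and it uses the '-n' spelling of the line count, which is equivalent to the bare-number form shown above.

```shell
# Build a twelve-line sample file (hypothetical contents) in a temporary location.
sample=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

default_count=$(head "$sample" | wc -l)   # head alone stops after ten lines
first_line=$(head -n 1 "$sample")         # same result as 'head -1'
first_char=$(head -c 1 "$sample")         # one character (byte) rather than one line

printf 'lines output by default: %s\n' "$default_count"
printf 'first line: %s\n' "$first_line"
printf 'first character: %s\n' "$first_char"

rm -f "$sample"
```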

### Outputting an Ending Part of a Text

The tail tool works like head, but outputs the last part of its input. Like head, it outputs ten lines by default.

• To output the last ten lines of file 'placement-list', type:

$ tail placement-list RET

• To output the last fourteen lines of file 'placement-list', type:

$ tail -14 placement-list RET
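
A small sketch of the same behavior, again on an invented twelve-line file; '-n 2' is the same request as the bare-number form '-2'.

```shell
# Twelve numbered entries in a temporary file (sample data, not from the text).
sample=$(mktemp)
printf 'entry %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

default_first=$(tail "$sample" | head -n 1)   # first of the last ten lines
last_two=$(tail -n 2 "$sample")               # the final two lines
beyond=$(tail -n 50 "$sample" | wc -l)        # asking for more than exists gives the whole file

printf 'default output starts at: %s\n' "$default_first"
printf '%s\n' "$last_two"

rm -f "$sample"
```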

It is sometimes useful to view the end of a file on a continuing basis; this can be useful for a "growing" file, a file that is being written to by another process. To keep viewing the end of such a file, use tail with the '-f' ("follow") option. Type C-c to stop viewing the file.

• To follow the end of the file 'access_log', type:

$ tail -f access_log RET

### Outputting a Middle Part of a Text

There are a few ways to output only a middle portion of a text.

To output a particular line of a file, use the sed tool. Give as a quoted argument the line number to output followed by '!d'. Give the file name as the second argument.

• To output line 47 of file 'placement-list', type:

$ sed '47!d' placement-list RET

To output a region of more than one line, give the starting and ending line numbers, separated by a comma.

• To output lines 47 to 108 of file 'placement-list', type:

$ sed '47,108!d' placement-list RET

You can also combine multiple head or tail commands on a pipeline to get the desired result.

• To output the tenth line in the file 'placement-list', type:

$ head placement-list | tail -1 RET

• To output the fifth and fourth lines from the bottom of file 'placement-list', type:

$ tail -5 placement-list | head -2 RET

• To output the 500th character in 'placement-list', type:

$ head -c500 placement-list | tail -c1 RET

• To output the first character on the fifth line of the file 'placement-list', type:

$ head -5 placement-list | tail -1 | head -c1 RET

In the preceding example, three commands were used: the first five lines of the file 'placement-list' are passed to tail, which outputs the last line in the output (the fifth line in the file); then, the last head command outputs the first character in that last line, which achieves the desired result.

### Outputting the Text between Strings

Use sed to select lines of text between strings and output either just that section of text, or all of the lines of text except that section. The strings can be words or even regular expressions.

Use the '-n' option followed by '/first/,/last/p' to output just the text between the strings first and last, inclusive. This is useful for outputting, say, just one chapter or section of a text file when you know the text that begins each section.

• To output all the text from file 'book-draft' between 'Chapter 3' and 'Chapter 4', type:

$ sed -n '/Chapter 3/,/Chapter 4/p' book-draft RET
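
The line-selection recipes in this section and the preceding one can be compared side by side on a short file. This is a sketch under assumptions: the seven-line sample and its chapter markers are invented, and the last command uses sed's 'd' command on a pattern range to show the complementary selection.

```shell
# A seven-line sample with made-up chapter markers.
sample=$(mktemp)
printf '%s\n' 'Preface' 'Chapter 1' 'alpha' 'Chapter 2' 'beta' 'Chapter 3' 'gamma' > "$sample"

line3=$(sed '3!d' "$sample")                            # delete every line except line 3
range=$(sed '2,3!d' "$sample")                          # keep only lines 2 through 3
third=$(head -n 3 "$sample" | tail -n 1)                # same line as $line3, via a pipeline
between=$(sed -n '/Chapter 1/,/Chapter 2/p' "$sample")  # print only the matching range
outside=$(sed '/Chapter 1/,/Chapter 2/d' "$sample")     # delete the range, keep the rest

printf 'between the markers:\n%s\n' "$between"
printf 'outside the markers:\n%s\n' "$outside"

rm -f "$sample"
```

Note that a pattern range is inclusive: both the line matching the first pattern and the line matching the second are part of the selection.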

To output all of the lines of text except those between two patterns, omit the '-n' option and use the 'd' ("delete") command in place of 'p'.

• To output all the text from file 'book-draft', except that which lies between the text 'Chapter 3' and 'Chapter 4', type:

$ sed '/Chapter 3/,/Chapter 4/d' book-draft RET

### Outputting Text in a Dialect

Debian: 'filters'
WWW: http://www.princeton.edu/~mkporwit/pub_links/davido/slang/
WWW: http://www.mathlab.sunysb.edu/~elijah/src.html

There are all kinds of tools that work as filters on text; this recipe describes a specific group of filters--those that filter their standard input to give the text an accent or dialect, and are intended to be humorous.

Generally speaking, a filter is a tool that works on standard input, changing it in some way, and then passing it to standard output.

• To apply the kraut filter to the text in the file '/etc/motd', type:

$ cat /etc/motd | kraut RET

These commands pass the contents of the file '/etc/motd' to the kraut filter, whose output is then sent to standard output. The contents of '/etc/motd' are not changed.

Some of the dialect filters available include nyc, which gives a "New Yawker" dialect to text, and newspeak, which translates text into the approved language of the thought police, as described in George Orwell's novel, 1984. Hail Big Brother!
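
The general shape of a filter can be sketched without installing the dialect tools. In the example below, 'shout' is a hypothetical stand-in for a filter such as kraut (it merely upper-cases its input), and the sample message file is invented.

```shell
# 'shout' is a hypothetical stand-in for a dialect filter such as kraut:
# it reads standard input, changes it, and writes the result to standard output.
shout() {
    tr '[:lower:]' '[:upper:]'
}

message=$(mktemp)
printf 'welcome to the machine\n' > "$message"

filtered=$(cat "$message" | shout)    # same shape as: cat /etc/motd | kraut
redirected=$(shout < "$message")      # input redirection feeds the filter the same text
original=$(cat "$message")            # the file itself is left untouched

printf '%s\n' "$filtered"
rm -f "$message"
```

Because a filter only reads standard input, it does not matter whether the text arrives from cat, from a redirection, or from another command earlier in the pipeline.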

## Streaming Text

It's been demonstrated that people read and comprehend printed text faster than they read and comprehend text displayed on a computer display screen. Rapid serial visual presentation, or RSVP, is a technique that aims to increase reading speed and comprehension with the use of computer display screens. With this technique, text is displayed streamed on the screen, one word at a time, with pauses between words and punctuation. The average reading time is lowered and comprehension is increased significantly with this technique.

GNOME sview is a "streaming viewer" for X; it streams text a word at a time on the screen, at a default rate of 450 words per minute. Use it to read text files and the X selection, which is text you have selected with the mouse.

To open a file in sview, either specify it as an argument to the command, or choose Open from the File menu in sview, and select the file from there.

• To view the contents of the text file 'alice-in-wonderland' in sview, type:

$ sview alice-in-wonderland RET

To start streaming the text, either press S once, or left-click on the button marked RSVP. Both S and the RSVP button toggle the streaming; the left and right arrow keys control the speed.

The large area with the word 'beginning' in it is where the text is being streamed. The text in the lower-left window is a shrunken view of the entire file; the text in the lower-right window is the paragraph from which the current word comes.

To open another file, choose it from the menu; you can have many files open in sview at once. sview places each file in its own buffer. You can also paste the X selection into a buffer of its own--to switch to a different buffer, choose its name from the Buffer menu.

Type Q to quit reading and exit sview.

The following table lists the keyboard commands used in sview and describes their meaning.

KEYSTROKE - DESCRIPTION

Left arrow - Decrease the stream speed.

Right arrow - Increase the stream speed.

C-o - Open a file.

C-q - Quit viewing text and exit sview.

C-w - Erase the current text buffer.

M-n - Move forward to the next word.

M-p - Move backward to the previous word.

S - Toggle the streaming of text.

X - Display the X selection in its own buffer.

N - Move forward to the next paragraph.

P - Move backward to the previous paragraph.

## Viewing a Character Chart

To view a character chart containing a list of all the valid characters in the ASCII character set and the character codes to use to type them, view the ascii man page.

• To view an ASCII character set, type:

$ man ascii RET

You can use the octal codes listed for each character to type them. The default Linux character set, the ISO 8859-1 ("Latin 1") character set, contains all of the standard ASCII character set plus an additional 128 characters.

To view the ISO 8859-1 character set, which contains an extended set of characters above the standard 128 ASCII characters, view the iso_8859_1 man page.

• To view the ISO 8859-1 character set, type:

$ man iso_8859_1 RET

You can use this page to see all of the characters in this character set and how to input them.

# Analyzing Text

There are many ways to use command-line tools to analyze text, such as counting the number of words in a text, creating a concordance, and comparing texts to see if (and where) they differ. There are also other tricks you can do with text that count as analysis, such as finding anagrams and palindromes, or cutting up text to generate unexpected combinations of words. This chapter covers all these topics.

## Counting Text

Use the "word count" tool, wc, to count characters, words, and lines in text. Give the name of a file as an argument; if none is given, wc works on standard input. By default, wc outputs three columns, displaying the counts for lines, words, and characters in the text.

• To output the number of lines, words, and characters in file 'outline', type:

$ wc outline RET

The following subsections describe how to specify just one kind of count with wc, and how to count text in Emacs.
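
The counting recipes in the subsections that follow can be previewed on a three-line sample. This is a sketch with invented sample text; note that '-c' counts bytes, which equals the number of characters only for single-byte text such as plain ASCII.

```shell
# Three short lines of sample text (made up for illustration).
sample=$(mktemp)
printf 'chapter one\nintro text here\nchapter two\n' > "$sample"

wc "$sample"                              # three columns: lines, words, characters, then the name

lines=$(wc -l < "$sample")                # lines only
words=$(wc -w < "$sample")                # words only
chars=$(wc -c < "$sample")                # characters (bytes) only, newlines included
hits=$(grep chapter "$sample" | wc -l)    # lines containing the string 'chapter'

printf 'lines=%s words=%s chars=%s hits=%s\n' "$lines" "$words" "$chars" "$hits"
rm -f "$sample"
```

Reading from standard input (with '<' or a pipe) makes wc print the bare count without a file name, which is convenient when the number is captured in a variable.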

### Counting the Characters in a Text

Use wc with the '-c' option to specify that just the number of characters be counted and output.

• To output the number of characters in file 'classified.ad', type:

$ wc -c classified.ad RET

### Counting the Words in a Text

Use wc with the '-w' option to specify that just the number of words be counted and output.

• To output the number of words in the file 'story', type:

$ wc -w story RET

To output counts for several files, first concatenate the files with cat, and then pipe the output to wc.

• To output the combined number of words for all the files with a '.txt' file name extension in the current directory, type:

$ cat *.txt | wc -w RET

### Counting the Occurrences of Something

To find the number of occurrences of some text string or pattern in a file or files, use grep to search the file(s) for the text string, and pipe the output to wc with the '-l' option.

• To find the number of lines in the file 'outline' that contain the string 'chapter', type:

$ grep chapter outline | wc -l RET

NOTE: For more recipes for searching text, and more about grep, see the chapter on searching text.

## Making a Concordance of a Text

A concordance is an index of all the words in a text, along with their contexts. A concordance-like functionality--an alphabetical listing of all words in a text and their frequency--can be made fairly easily with some basic shell tools: tr, sort, and uniq.

• To output a word-frequency list of the text file 'naked_lunch', type:

$ tr ' ' ' RET
> ' < naked_lunch | sort | uniq -c RET

These commands translate all space characters to newline characters, outputting the text with each word on its own line; this is then sorted alphabetically, and that output is passed to uniq, which outputs only the unique lines--that is, all non-duplicate lines--while the '-c' option precedes each line with its count (the number of times it occurs in the text).
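
The same pipeline can be tried on a single sentence. In this sketch the sample sentence is invented, the newline is written as the escape '\n' rather than typed literally inside the quotes, and the final 'sort -rn' step (an addition, not part of the recipe above) reorders the list so the most frequent word comes first.

```shell
# One sample sentence (made up for illustration).
text='the cat and the dog and the bird'

# One word per line, sorted so duplicates are adjacent, then counted by uniq -c.
freq=$(printf '%s\n' "$text" | tr ' ' '\n' | sort | uniq -c)
printf '%s\n' "$freq"

# Optional extra step: numeric reverse sort puts the highest count on top.
top=$(printf '%s\n' "$freq" | sort -rn | head -n 1)
printf 'most frequent: %s\n' "$top"
```

The sort step matters: uniq only collapses adjacent duplicate lines, so unsorted input would produce several separate counts for the same word.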

To get a word frequency count--that is, the total number of different words in a text--just pipe the output of the frequency list to wc with the '-l' option. This counts all the lines of its input, which in this case will be the list of unique words, one per line.

• To output a count of the number of unique words in the text file 'naked_lunch', type:

$ tr ' ' ' RET
> ' < naked_lunch | sort | uniq -c | wc -l RET

## Text Relevance

The following recipes show how to analyze a given text for its relevancy to other text, either to keywords or to whole files of text.

You can also use the diff family of tools to analyze differences in text; those tools are especially good for comparing different revisions of the same file.

### Sorting Text in Order of Relevance

Use rel to analyze text files for relevance to a given set of keywords. It outputs the names of those files that are relevant to the given keywords, ranked in order of relevance; if a file does not meet the criteria, it is not output in the relevance listing.

rel takes as an option the keyword to search for in quotes; you can build a boolean expression by grouping multiple keywords in parentheses and using any of the following operators between them:

CODE - DESCRIPTION

| - Logical "or."

& - Logical "and."

! - Logical "not."

Give as arguments the names of the files to rank.

• To rank the files 'report.a', 'report.b', and 'report.c' in order of relevance to the keywords 'saving' and 'profit', type:

$ rel "(saving & profit)" report.a report.b report.c RET

Give the name of a directory tree to analyze all files in the directory tree.

• To output a list of any files containing either 'invitation' or 'request' in the '~/mail' directory, ranked in order of relevancy, type:

$ rel "(invitation | request)" ~/mail RET

• To output a list of any files containing 'invitation' and not 'wedding' in the '~/mail' directory, ranked in order of relevancy, type:

$ rel "(invitation ! wedding)" ~/mail RET

• To output a list of any files containing 'invitation' and 'party' in the '~/mail' directory, ranked in order of relevancy, type:

$ rel "(invitation & party)" ~/mail RET

### Listing Relevant Files in Emacs

The purpose of the Remembrance Agent is to analyze the text you type in an Emacs session and, in the background, find similar or relevant passages of text within your other files. It then outputs in a smaller window a list of suggestions--those files that it has found--which you can open in a new buffer.

When installing the Remembrance Agent, you create three databases of the files to use when making relevance suggestions; when remembrance-agent is running, it searches these three databases in parallel, looking for relevant text. You could create, for example, one database of saved email, one of your own writings, and one of saved documents.

• To toggle the Remembrance Agent in the current buffer, type:

C-c r t

When remembrance-agent is running, suggested buffers will be displayed in the small '*Remembrance*' buffer at the bottom of the screen. To open a suggestion in a new buffer, type C-c r number, where number is the number of the suggestion.

• To open the second suggested file in a new buffer, type:

C-c r 2

## Finding Anagrams in Text

An anagram is a word or phrase whose characters consist entirely of all the characters of a given word or phrase--for example, 'stop' and 'tops' are both anagrams of 'pots'.

Use an to find and output anagrams. Give as an argument the word or quoted phrase to use; an writes its results to the standard output.

• To output all anagrams of the word 'lake', type:

$ an lake RET
• To output all anagrams of the phrase 'lakes and oceans', type:

$ an 'lakes and oceans' RET

To limit the anagrams output to those containing a given string, specify that string with the '-c' option.

• To output only anagrams of the phrase 'lakes and oceans' which contain the string 'seas', type:

$ an -c seas 'lakes and oceans' RET

To print all of the words that some or all letters in a given word or phrase can make, use the '-w' option. This outputs words that are not anagrams, since anagrams must contain all of the letters of the other word or phrase.

• To output all of the words that can be made from the letters of the word 'seas', type:

$ an -w seas RET

This command outputs all of the words that can be formed from all or some of the characters in 'seas', including 'see' and 'as'.

## Finding Palindromes in Text

A palindrome is a word that reads the same both forwards and backwards; for example, "Mom," "madam," and "nun" are all palindromes.

To find palindromes in a file, use this simple Perl "one-liner," and substitute file for the name of the file to check:

perl -lne 'print if $_ eq reverse' file

To check for palindromes in the standard input, specify '-' as the file name to check. This is useful for putting at the end of a pipeline.

• To output all of the palindromes in the system dictionary, type:

$ perl -lne 'print if $_ eq reverse' /usr/dict/words RET

## Text Cut-Ups

A cut-up is a random rearrangement of a physical layout of text, made with the intention of finding unique or interesting phrases in the rearrangement. Software for rearranging text in random ways has existed since the earliest text-processing tools; the popularity of these tools will never die.

The cut-up technique in literature was discovered by painter Brion Gysin and American writer William S. Burroughs in 1959; they believed it brought the montage technique of painting to the written word.

These recipes describe a few of the common ways to make text cut-ups; more free software tools for making cut-ups are listed at http://dsl.org/comp/cutups.shtml.

### Making Simple Text Cut-Ups

To perform a simple cut-up of a text, use cutup. It takes the name of a file as input and cuts it both horizontally and vertically along the middle, rearranges the four sections to their diagonally opposite corners, and then writes that cut-up to the standard output. The original file is not modified.

• To make a cut-up from a file called 'nova', type:

$ cutup nova RET

### Making Random Word Cut-Ups

No simple cut-up filter, Jamie Zawinski's dadadodo uses the computer to go one step beyond--it generates passages of random text whose structure and characters are similar to the text input you give it. The program works better on larger texts, where more subtleties can be analyzed and hence more realistic-looking text is output.

Give as an argument the name of the text file to be used; by default, dadadodo outputs text to standard output until you interrupt it by typing C-c.

• To output random text based on the text in the file 'nova', type:

$ dadadodo nova RET

This command will output passages of random text based on the text in the file 'nova' until it is interrupted by the user.

You can analyze a text and save the analysis to a file of compiled data; this analysis can then be used to generate random text when the original input text is not present. The following table describes this and other dadadodo options.

OPTION - DESCRIPTION

-c integer - Generate integer sentences (default is 0, meaning "generate an infinite amount until interrupted").

-l file - Load compiled data in file and use it to generate text.

-o file - Output compiled data to file file for later use.

-p integer - Pause for integer seconds between paragraphs.

# Formatting Text

Methods and tools for changing the arrangement or presentation of text are often useful for preparing text for printing. This chapter discusses ways of changing the spacing of text and setting up pages, of underlining and sorting and reversing text, and of numbering lines of text.

## Spacing Text

These recipes are for changing the spacing of text--the whitespace that exists between words, lines, and paragraphs.

The filters described in this section send output to standard output by default; to save their output to a file, use shell redirection.

### Eliminating Extra Spaces in Text

To eliminate extra whitespace within lines of text, use the fmt filter; to eliminate extra whitespace between lines of text, use cat.

Use fmt with the '-u' option to output text with "uniform spacing," where the space between words is reduced to one space character and the space between sentences is reduced to two space characters.

• To output the file 'term-paper' with uniform spacing, type:

$ fmt -u term-paper RET

Use cat with the '-s' option to "squeeze" multiple adjacent blank lines into one.

• To output the file 'term-paper' with multiple blank lines output as only one blank line, type:

$ cat -s term-paper RET

You can combine both of these commands to output text with multiple adjacent blank lines squeezed into one and with uniform spacing between words. The following example shows how the output of the combined commands is sent to less so that it can be perused on the screen.

• To peruse the text file 'term-paper' with multiple blank lines squeezed and with uniform spacing between words, type:

$ cat -s term-paper | fmt -u | less RET

Notice that in this example, both fmt and less worked on their standard input instead of on a file--the standard output of cat (the contents of 'term-paper' with extra blank lines squeezed out) was passed to the standard input of fmt, and its standard output (the space-squeezed 'term-paper', now with uniform spacing) was sent to the standard input of less, which displayed it on the screen.

### Single-Spacing Text

There are many methods for single-spacing text. To remove all empty lines from text output, use grep with the regular expression '.', which matches any character, and therefore matches any line that isn't empty. You can then redirect this output to a file, or pipe it to other commands; the original file is not altered.

• To output all non-empty lines from the file 'term-paper', type:

$ grep . term-paper RET

This command outputs all lines that are not empty--so lines containing only non-printing characters, such as spaces and tabs, will still be output.

To remove from the output all empty lines, and all lines that consist of only space characters, use '[^ ]' as the regexp to search for; it matches any line that contains at least one character other than a space. But this regexp will still output lines that contain only tab characters; to remove from the output all empty lines and lines that contain only a combination of tab or space characters, use '[^[:space:]]' as the regexp to search for. It uses the special predefined '[:space:]' regexp class, which matches any kind of space character at all, including tabs.

• To output only the lines from the file 'term-paper' that contain more than just space characters, type:

$ grep '[^ ]' term-paper RET

• To output only the lines from the file 'term-paper' that contain more than just space or tab characters, type:

$ grep '[^[:space:]]' term-paper RET
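
The difference between these squeezing and single-spacing recipes is easiest to see on a file that mixes empty lines, space-only lines, and tab-only lines. The sample below is invented for that purpose; the last pattern is written as the bracket class alone, so that a line holding a single visible character still matches.

```shell
# Eight lines: text, three empty lines, a spaces-only line, text, a tab-only line, one letter.
sample=$(mktemp)
printf 'first\n\n\n\n   \nsecond\n\t\nx\n' > "$sample"

squeezed=$(cat -s "$sample")                  # runs of empty lines collapse to one
nonempty=$(grep . "$sample")                  # empty lines go; space- and tab-only lines stay
nonblank=$(grep '[^[:space:]]' "$sample")     # only lines with visible text remain

# fmt -u on a line with irregular gaps between words (sample text).
uniform=$(printf 'too    many   spaces\n' | fmt -u)

printf '%s\n' "$nonblank"
printf '%s\n' "$uniform"
rm -f "$sample"
```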

If a file is already double-spaced, where all even lines are blank, you can remove those lines from the output by using sed with the 'n;d' expression.

• To output only the odd lines from file 'term-paper', type:

$ sed 'n;d' term-paper RET

### Double-Spacing Text

To double-space text, where one blank line is inserted between each line in the original text, use the pr tool with the '-d' option. By default, pr paginates text and puts a header at the top of each page with the current date, time, and page number; give the '-t' option to omit this header.

• To double-space the file 'term-paper' and write the output to the file 'term-paper.print', type:

$ pr -d -t term-paper > term-paper.print RET

To send the output directly to the printer for printing, you would pipe the output to lpr:

$ pr -d -t term-paper | lpr RET

NOTE: The pr ("print") tool is a text pre-formatter, often used to paginate and otherwise prepare text files for printing.

### Triple-Spacing Text

To triple-space text, where two blank lines are inserted between each line of the original text, use sed with the 'G;G' expression.

• To triple-space the file 'term-paper' and write the output to the file 'term-paper.print', type:

$ sed 'G;G' term-paper > term-paper.print RET

The 'G' expression appends one blank line to each line of sed's output; using ';' you can specify more than one blank line to append (but you must quote this command, because the semicolon (';') has meaning to the shell). You can use multiple 'G' characters to output text with more than double or triple spaces.
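
A short sketch of why this works: 'G' appends a newline followed by the contents of sed's hold space to the current line, and since nothing has been stored in the hold space, the effect is one extra empty line per 'G'. The two-line input here is invented, and the last command applies 'n;d' to undo double spacing.

```shell
# Two input lines; count the lines that come out of each expression.
double=$(printf 'a\nb\n' | sed G | wc -l)        # one blank line after each input line
triple=$(printf 'a\nb\n' | sed 'G;G' | wc -l)    # two blank lines after each input line

# 'n;d' prints a line, reads the next one, and deletes it: every even line is dropped.
single=$(printf 'a\n\nb\n\n' | sed 'n;d')

printf 'double-spaced lines: %s\n' "$double"
printf 'triple-spaced lines: %s\n' "$triple"
printf '%s\n' "$single"
```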

• To quadruple-space the file 'term-paper', and write the output to the file 'term-paper.print', type:

$ sed 'G;G;G' term-paper > term-paper.print RET

### Adding Line Breaks to Text

Sometimes a file will not have line breaks at the end of each line (this commonly happens during file conversions between operating systems). To add line breaks to a file that does not have them, use the text formatter fmt. It outputs text with lines arranged up to a specified width; if no length is specified, it formats text up to a width of 75 characters per line.

• To output the file 'term-paper' with lines up to 75 characters long, type:

$ fmt term-paper RET

Use the '-w' option to specify the maximum line width.

• To output the file 'term-paper' with lines up to 80 characters long, type:

$ fmt -w 80 term-paper RET

### Adding Margins to Text

Giving text an extra left margin is especially good when you want to print a copy and punch holes in it for use with a three-ring binder.

To output a text file with a larger left margin, use pr with the file name as an argument; give the '-t' option (to disable headers and footers), and, as an argument to the '-o' option, give the number of spaces to offset the text. Add the number of spaces to the page width (whose default is 72) and specify this new width as an argument to the '-w' option.

• To output the file 'owners-manual' with a five-space (or five-column) margin to a new file, 'owners-manual.pr', type:

$ pr -t -o 5 -w 77 owners-manual > owners-manual.pr RET

This command is almost always used for printing, so the output is usually just piped to lpr instead of saved to a file. Many text documents have a width of 80 and not 72 columns; if you are printing such a document and need to keep the 80 columns across the page, specify a new width of 85. If your printer can only print 80 columns of text, specify a width of 80; the text will be reformatted to 75 columns after the 5-column margin.

• To print the file 'owners-manual' with a 5-column margin and 80 columns of text, type:

$ pr -t -o 5 -w 85 owners-manual | lpr RET

• To print the file 'owners-manual' with a 5-column margin and 75 columns of text, type:

$ pr -t -o 5 -w 80 owners-manual | lpr RET
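
To see the margin without sending anything to a printer, the offset can be applied to a couple of short lines and inspected on the screen. This sketch assumes GNU pr and uses invented sample text; it leaves out '-w' because the lines are far shorter than any page width.

```shell
# Two short sample lines (made up for illustration), offset by five spaces.
indented=$(printf 'Chapter One\nGetting started\n' | pr -t -o 5)
printf '%s\n' "$indented"

# Piping through cat -A (GNU) makes the leading spaces and line ends visible.
printf '%s\n' "$indented" | cat -A
```

With '-t', pr neither adds a page header nor pads the output to a full page, so the result is just the input lines with the margin in front.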

### Swapping Tab and Space Characters

Use the expand and unexpand tools to swap tab characters for space characters, and to swap space characters with tabs, respectively.

Both tools take a file name as an argument and write changes to the standard output; if no files are specified, they work on the standard input.

To convert tab characters to spaces, use expand. To convert only the initial or leading tabs on each line, give the '-i' option; the default action is to convert all tabs.

• To convert all tab characters to spaces in file 'list', and write the output to 'list2', type:

$ expand list > list2 RET

• To convert only initial tab characters to spaces in file 'list', and write the output to the standard output, type:

$ expand -i list RET

To convert multiple space characters to tabs, use unexpand. By default, it only converts leading spaces into tabs, counting eight space characters for each tab. Use the '-a' option to specify that all instances of eight space characters be converted to tabs.

• To convert every eight leading space characters to tabs in file 'list2', and write the output to 'list', type:

$ unexpand list2 > list RET

• To convert all occurrences of eight space characters to tabs in file 'list2', and write the output to the standard output, type:

$ unexpand -a list2 RET
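Tabs and spaces look the same on the screen, so a small round trip on sample text can make the conversions easier to follow. The sketch below assumes the default tab stops of eight columns; the sample line and variable names are invented for illustration, and sed's l command is used only to make the tabs visible.

```shell
# A sample line with one leading tab and one tab in the middle.
original=$(printf '\tname\tvalue')

# expand turns every tab into spaces up to the next 8-column tab stop.
spaces=$(printf '%s\n' "$original" | expand)

# expand -i converts only the leading tab and leaves the inner one alone.
leading_only=$(printf '%s\n' "$original" | expand -i)

# unexpand -a turns runs of spaces that reach a tab stop back into tabs.
roundtrip=$(printf '%s\n' "$spaces" | unexpand -a)

# sed's "l" command prints each tab visibly as \t.
printf '%s\n' "$spaces" "$leading_only" "$roundtrip" | sed -n l
```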

To specify the number of spaces to convert to a tab, give that number as an argument to the '-t' option.

• To convert every leading space character to a tab character in 'list2', and write the output to the standard output, type:

$ unexpand -t 1 list2 RET

## Paginating Text

The formfeed character, ASCII C-l or octal code 014, is the delimiter used to paginate text. When you send text with a formfeed character to the printer, the current page being printed is ejected and a new page begins--thus, you can paginate a text file by inserting formfeed characters at a place where you want a page break to occur.

To insert formfeed characters in a text file, use the pr filter.

Give the '-f' option to omit the footer and separate pages of output with the formfeed character, and use '-h ""' to output a blank header (otherwise, the current date and time, file name, and current page number are output at the top of each page).

• To paginate the file 'listings' and write the output to a file called 'listings.page', type:

$ pr -f -h "" listings > listings.page RET

By default, pr outputs pages of 66 lines each. You can specify the page length as an argument to the '-l' option.

• To paginate the file 'listings' with 43-line pages, and write the output to a file called 'listings.page', type:

$ pr -f -h "" -l 43 listings > listings.page RET

### Placing Text in Columns

You can also use pr to put text in columns--give the number of columns to output as an option (for example, '-4' for four columns). Use the '-t' option to omit the printing of the default headers and footers.

• To print the file 'news.update' in four columns with no headers or footers, type:

$ pr -4 -t news.update | lpr RET

### Options Available When Paginating Text

The following table describes some of pr's options; see the pr info for a complete description of its capabilities.

OPTION - DESCRIPTION

+first:last - Specify the first and last page to process; the last page can be omitted, so +7 begins processing with the seventh page and continues until the end of the file is reached.

-column - Specify the number of columns to output text in, making all columns fit the page width.

-a - Print columns across instead of down.

-c - Output control characters in hat notation and print all other unprintable characters in "octal backslash" notation.

-d - Specify double-spaced output.

-f - Separate pages of output with a formfeed character instead of a footer of blank lines (63 lines of text per 66-line page instead of 56).

-h header - Specify the header to use instead of the default; specify -h "" for a blank header.

-l length - Specify the page length to be length lines (default 66). If page length is less than 11, headers and footers are omitted and existing form feeds are ignored.

-m - Use when specifying multiple files; this option merges and outputs them in parallel, one per column.

-o spaces - Set the number of spaces to use in the left margin (default 0).

-t - Omit the header and footer on each page, but retain existing formfeeds.

-T - Omit the header and footer on each page, as well as existing formfeeds.

-v - Output non-printing characters in "octal backslash" notation.

-w width - Specify the page width to use, in characters (default 72).

## Underlining Text

In the days of typewriters, text that was meant to be set in an italicized font was denoted by underlining the text with underscore characters; now, it's common practice to denote an italicized word in plain text by typing an underscore character, '_', just before and after a word in a text file, like '_this_'.
Some text markup languages use different methods for denoting italics; for example, in TeX or LaTeX files, italicized text is often denoted with brackets and the '\it' command, like '{\it this}'. (LaTeX files use the same format, but '\emph' is often used in place of '\it'.)

You can convert one form to the other by using the Emacs replace-regexp function and specifying the text to be replaced as a regexp.

• To replace plaintext-style italics with TeX '\it' commands, type:

M-x replace-regexp RET
_\([^_]+\)_ RET
{\\it \1} RET

• To replace TeX-style italics with plaintext _underscores_, type:

M-x replace-regexp RET
{\\it \([^}]+\)} RET
_\1_ RET

Both examples above used the special symbol '\1', which stands for the text matched by the first '\(...\)' construct in the preceding regexp. See Info file 'emacs-e20.info', node 'Regexps' for more information on regexp syntax in Emacs.

To put a literal underline under text, you need to use a text editor to insert an underscore ('_') followed by a C-h character immediately before each character you want to underline; you can insert the C-h in Emacs with the C-q function.

When a text file contains these literal underlines, use the ul tool to output the file so that it is viewable by the terminal you are using; this is also useful for printing (pipe the output of ul to lpr).

• To output the file 'term-paper' so that you can view underbars, type:

$ ul term-paper RET

To output such text without the backspace character, C-h, in the output, use col with the '-b' option. The col tool reads only the standard input, so redirect the file to it.

• To output the file 'term-paper' with all backspace characters stripped out, type:

$ col -b < term-paper RET

## Sorting Text

You can sort a list in a text file with sort. By default, it outputs text in ascending alphabetical order; use the '-r' option to reverse the sort and output text in descending alphabetical order.

For example, suppose a file 'provinces' contains the following:

Shantung
Honan
Szechwan
Hunan
Kiangsu
Kwangtung
Fukien

• To sort the file 'provinces' and output all lines in ascending order, type:

$ sort provinces RET
Fukien
Honan
Hunan
Kiangsu
Kwangtung
Shantung
Szechwan
$

• To sort the file 'provinces' and output all lines in descending order, type:

$ sort -r provinces RET
Szechwan
Shantung
Kwangtung
Kiangsu
Hunan
Honan
Fukien
$

The following table describes some of sort's options.

OPTION - DESCRIPTION

-b - Ignore leading blanks on each line when sorting.

-d - Sort in "phone directory" order, with only letters, digits, and blanks being sorted.

-f - When sorting, fold lowercase letters into their uppercase equivalent, so that differences in case are ignored.

-i - Ignore non-printing characters when sorting.

-n - Sort numerically instead of by character value.

-o file - Write output to file instead of standard output.

## Numbering Lines of Text

There are several ways to number lines of text.

One way to do it is to use the nl ("number lines") tool. Its default action is to write its input (either the files named as arguments, or the standard input) to the standard output, with an indentation and with all non-empty lines preceded by line numbers.

• To peruse the file 'report' with each line of the file preceded by line numbers, type:

$ nl report | less RET

You can set the numbering style with the '-b' option followed by an argument. The following table lists the possible arguments and describes the numbering style they select.

ARGUMENT - NUMBERING STYLE

a - Number all lines.

t - Number only non-blank lines. This is the default.

n - Do not number lines.

pregexp - Only number lines that contain the regular expression regexp.

The default is for line numbers to start with one, and increment by one. Set the initial line number by giving an argument to the '-v' option, and set the increment by giving an argument to the '-i' option.

• To output the file 'report' with each line of the file preceded by line numbers, starting with the number two and counting by fours, type:

$ nl -v 2 -i 4 report RET

• To number only the lines of the file 'cantos' that begin with a period ('.'), starting numbering at zero and using a numbering increment of five, and to write the output to 'cantos.numbered', type:

$ nl -i 5 -v 0 -b p'^\.' cantos > cantos.numbered RET
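The numbering styles are easiest to compare on a few lines of sample input that include a blank line. The following is a minimal sketch; the sample text and the variable names are invented, and awk is used only to pull out the number column.

```shell
# Four sample lines, one of them blank.
sample=$(printf 'alpha\n\nbeta\ngamma')

# Default style (-b t): the blank line gets no number.
printf '%s\n' "$sample" | nl

# -b a numbers every line, blank or not.
printf '%s\n' "$sample" | nl -b a

# Start at 2 and count by fours; awk keeps only the number column
# and tr joins the numbers onto one line.
numbers=$(printf '%s\n' "$sample" | nl -b a -v 2 -i 4 | awk '{ print $1 }' | tr '\n' ' ')
printf '%s\n' "$numbers"
```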

Another way to number lines is to use cat with one of the following two options: the '-n' option numbers each line of its input text, while the '-b' option only numbers non-blank lines.

• To peruse the text file 'report' with each line of the file numbered, type:

$ cat -n report | less RET

• To peruse the text file 'report' with each non-blank line of the file numbered, type:

$ cat -b report | less RET

In the preceding examples, output from cat is piped to less for perusal; the original file is not altered.
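The difference between the two cat options only shows up when the input has blank lines. A small sketch on made-up sample text (the variable names are illustrative only):

```shell
# Three sample lines; the middle one is blank.
sample=$(printf 'alpha\n\nbeta')

# -n numbers all three lines, including the blank one.
printf '%s\n' "$sample" | cat -n

# -b skips the blank line, so only two lines get numbers.
printf '%s\n' "$sample" | cat -b

# Record the last number each option assigned.
last_n=$(printf '%s\n' "$sample" | cat -n | awk 'NF { n = $1 } END { print n }')
last_b=$(printf '%s\n' "$sample" | cat -b | awk 'NF { n = $1 } END { print n }')
echo "cat -n counted to $last_n; cat -b counted to $last_b"
```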

To take an input file, number its lines, and then write the line-numbered version to a new file, redirect the standard output of the cat command to the new file.

• To write a line-numbered version of file 'report' to file 'report.lines', type:

$ cat -n report > report.lines RET

## Reversing Text

The tac command is similar to cat, but it outputs text in reverse order. There is another difference--tac works on records, sections of text with separator strings, instead of lines of text. Its default separator string is the linebreak character, so by default tac outputs files in line-for-line reverse order.

• To output the file 'prizes' in line-for-line reverse order, type:

$ tac prizes RET

Specify a different separator with the '-s' option. This is often useful when specifying non-printing characters such as formfeeds. To specify such a character, use the ANSI-C method of quoting.

• To output 'prizes' in page-for-page reverse order, type:

$ tac -s $'\f' prizes RET

The preceding example uses the formfeed, or page break, character as the delimiter, and so it outputs the file 'prizes' in page-for-page reverse order, with the last page output first.
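Because formfeeds are invisible, it can help to try both kinds of reversal on a tiny sample first. This sketch builds the formfeed with printf rather than ANSI-C quoting, on the assumption that your shell may not support $'...'; the sample text and variable names are invented.

```shell
# Line-for-line reversal (the default separator is the newline).
lines=$(printf 'one\ntwo\nthree\n' | tac | tr '\n' ' ')

# Page-for-page reversal: capture a formfeed in a variable, use it as
# the separator, then make the formfeeds visible as '|' with tr.
ff=$(printf '\f')
pages=$(printf 'page1\fpage2\fpage3\f' | tac -s "$ff" | tr '\f' '|')

printf '%s\n%s\n' "$lines" "$pages"
```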

Use the '-r' option to use a regular expression for the separator string. You can build regular expressions to output text in word-for-word and character-for-character reverse order:

• To output 'prizes' in word-for-word reverse order, type:

$ tac -r -s '[^a-zA-Z0-9-]' prizes RET

• To output 'prizes' in character-for-character reverse order, type:

$ tac -r -s '.\| RET
' prizes RET

To reverse the characters on each line, use rev.

• To output 'prizes' with the characters on each line reversed, type:

$ rev prizes RET

# Searching Text

It's quite common to search through text for a given sequence of characters (such as a word or phrase), called a string, or even for a pattern describing a set of such strings; this chapter contains recipes for doing these kinds of things.

## Searching for a Word or Phrase

The primary command used for searching through text is the rather froglike-sounding tool called grep (its advanced usage is discussed under "Regular Expressions--Matching Text Patterns"). It outputs lines of its input that contain a given string or pattern.

To search for a word, give that word as the first argument. By default, grep searches standard input; give the name of a file to search as the second argument.

• To output lines in the file 'catalog' containing the word 'CD', type:

$ grep CD catalog RET

To search for a phrase, specify it in quotes.

• To output lines in the file 'catalog' containing the phrase 'Compact Disc', type:

$ grep 'Compact Disc' catalog RET

The preceding example outputs all lines in the file 'catalog' that contain the exact string 'Compact Disc'; it will not match, however, lines containing 'compact disc' or any other variation on the case of letters in the search pattern. Use the '-i' option to specify that matches are to be made regardless of case.

• To output lines in the file 'catalog' containing the string 'compact disc' regardless of the case of its letters, type:

$ grep -i 'compact disc' catalog RET

This command outputs lines in the file 'catalog' containing any variation of the pattern 'compact disc', including 'Compact Disc', 'COMPACT DISC', and 'comPact dIsC'.
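To check how case affects a search without touching a real file, you can try both forms on a throwaway catalog. The file contents below are invented for illustration, and grep's '-c' option (which outputs a count of matching lines instead of the lines themselves) is used just to keep the result short.

```shell
# Build a small sample catalog in a temporary directory.
dir=$(mktemp -d)
printf 'Compact Disc player\ncompact disc cleaner\nCD rack\nTape deck\n' > "$dir/catalog"

# Exact-case search: only the first line matches.
exact=$(grep -c 'Compact Disc' "$dir/catalog")

# Case-insensitive search: the first two lines match.
anycase=$(grep -c -i 'compact disc' "$dir/catalog")

echo "exact case: $exact, ignoring case: $anycase"
rm -r "$dir"
```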

One thing to keep in mind is that grep only matches patterns that appear on a single line, so in the preceding example, if one line in 'catalog' ends with the word 'compact' and the next begins with 'disc', grep will not match either line. There is a way around this, described under "Finding Phrases Regardless of Spacing."

You can specify more than one file to search. When you specify multiple files, each match that grep outputs is preceded by the name of the file it's in (you can suppress this with the '-h' option).

• To output lines in all of the files in the current directory containing the word 'CD', type:

$ grep CD * RET

• To output lines in all of the '.txt' files in the '~/doc' directory containing the word 'CD', suppressing the listing of file names in the output, type:

$ grep -h CD ~/doc/*.txt RET

Use the '-r' option to search a given directory recursively, searching all subdirectories it contains. With GNU grep, the '--include' option limits a recursive search to files whose names match a given pattern.

• To output lines containing the word 'CD' in all of the '.txt' files in the '~/doc' directory and in all of its subdirectories, type:

$ grep -r --include='*.txt' CD ~/doc RET

NOTE: There are more complex things you can search for than simple strings, as will be explained in the next section.

## Regular Expressions--Matching Text Patterns

In addition to word and phrase searches, you can use grep to search for complex text patterns called regular expressions. A regular expression--or "regexp"--is a text string of special characters that specifies a set of patterns to match.

Technically speaking, the word or phrase patterns described in the previous section are regular expressions--just very simple ones. In a regular expression, most characters--including letters and numbers--represent themselves. For example, the regexp pattern 1 matches the string '1', and the pattern bee matches the string 'bee'.

There are a number of reserved characters called metacharacters that don't represent themselves in a regular expression, but have a special meaning that is used to build complex patterns. These metacharacters are as follows: ., *, [, ], ^, $, and \.

To specify one of these literal characters in a regular expression, precede the character with a '\'.

• To output lines in the file 'catalog' that contain a '$' character, type:

$ grep '\$' catalog RET

• To output lines in the file 'catalog' that contain the string '$1.99', type:

$ grep '\$1\.99' catalog RET

• To output lines in the file 'catalog' that contain a '\' character, type:

$ grep '\\' catalog RET

The following table describes the special meanings of the metacharacters and gives examples of their usage.

METACHARACTER - MEANING

. - Matches any one character, with the exception of the newline character. For example, . matches 'a', '1', '?', '.' (a literal period character), and so forth.

* - Matches the preceding regexp zero or more times. For example, -* matches '-', '--', '---', '---------', and so forth. Now imagine a line of text with a million '-' characters somewhere in it, all marching off across the horizon, up into the blue sky, and through the clouds. A million '-' characters in a row. This pattern would match it. Now think of the same long parade, but it's a million and one '-' characters--it matches that, too.

[ ] - Encloses a character set, and matches any member of the set--for example, [abc] matches either 'a', 'b', or 'c'. In addition, the hyphen ('-') and caret ('^') characters have special meanings when used inside brackets:

- - The hyphen specifies a range of characters, ordered according to their ASCII value. For example, [0-9] is synonymous with [0123456789]; [A-Za-z] matches one uppercase or lowercase letter. To include a literal '-' in a list, specify it as the last character in a list: so [0-9-] matches either a single digit character or a '-'.

^ - As the first character of a list, the caret means that any character except those in the list should be matched. For example, [^a] matches any character except 'a', and [^0-9] matches any character except a numeric digit.

^ - Matches the beginning of the line. So ^a matches 'a' only when it is the first character on a line.

$ - Matches the end of the line. So a$ matches 'a' only when it is the last character on a line.

\ - Use \ before a metacharacter when you want to specify that literal character. So \$ matches a dollar sign character ('$'), and \\ matches a single backslash character ('\').

In addition, use \ to build new metacharacters, by using it before a number of other characters:

\| - Called the 'alternation operator'; it matches either regexp it is between--use it to join two separate regexps to match either of them. For example, a\|b matches either 'a' or 'b'.

\+ - Matches the preceding regexp as many times as possible, but at least once. So a\+ matches one or more adjacent 'a' characters, such as 'aaa', 'aa', and 'a'.

\? - Matches the regexp preceding it either zero or one times. So a\? matches 'a' or an empty string--which matches every line.

\{number\} - Matches the previous regexp (one specified to the left of this construction) that number of times--so a\{4\} matches 'aaaa'. Use \{number,\} to match the preceding regexp number or more times, \{,number\} to match the preceding regexp zero to number times, and \{number1,number2\} to match the preceding regexp from number1 to number2 times.

\(regexp\) - Group regexp together for an alternative; useful for combination regexps. For example, while moo\? matches 'mo' or 'moo', \(moo\)\? matches 'moo' or the empty string.

NOTE: The name 'grep' derives from a command in the now-obsolete Unix ed line editor tool--the ed command for searching globally through a file for a regular expression and then printing those lines was g/re/p, where re was the regular expression you'd use.

The following sections describe some regexp recipes for commonly searched-for patterns.

### Matching Lines Beginning with Certain Text

Use '^' in a regexp to denote the beginning of a line.

• To output all lines in '/usr/dict/words' beginning with 'pre', type:

$ grep '^pre' /usr/dict/words RET
• To output all lines in the file 'book' that begin with the text 'in the beginning', regardless of case, type:

$ grep -i '^in the beginning' book RET

NOTE: These regexps were quoted with ' characters; this is because some shells otherwise treat the '^' character as a special "metacharacter."

### Matching Lines Ending with Certain Text

Use '$' as the last character of quoted text to match that text only at the end of a line.

• To output lines in the file 'sayings' ending with an exclamation point, type:

$ grep '!$' sayings RET

NOTE: You can also use '$' in a regexp to find words that rhyme with a given word.

### Matching Lines of a Certain Length

To match lines of a particular length, use that number of '.' characters between '^' and '$'--for example, to match all lines that are two characters (or columns) wide, use '^..$' as the regexp to search for.

• To output all lines in '/usr/dict/words' that are exactly two characters wide, type:

$ grep '^..$' /usr/dict/words RET

For longer lines, it is more useful to use a different construct: '^.\{number\}$', where number is the number of characters to match. Use ',' to specify a range of numbers.

• To output all lines in '/usr/dict/words' that are exactly seventeen characters wide, type:

$ grep '^.\{17\}$' /usr/dict/words RET

• To output all lines in '/usr/dict/words' that are twenty-five or more characters wide, type:

$ grep '^.\{25,\}$' /usr/dict/words RET
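The interval construct is easier to trust after trying it on a short list where you can count the characters yourself. The following is a sketch on an invented word list, so it does not depend on '/usr/dict/words' being present on your system.

```shell
# A tiny word list with lengths 2, 3, 5, and 8.
words=$(printf 'ox\ncat\nhorse\nelephant')

# Exactly three characters.
three=$(printf '%s\n' "$words" | grep '^.\{3\}$')

# Five or more characters.
five_plus=$(printf '%s\n' "$words" | grep '^.\{5,\}$' | tr '\n' ' ')

# From two to three characters.
two_to_three=$(printf '%s\n' "$words" | grep '^.\{2,3\}$' | tr '\n' ' ')

printf '%s\n' "$three" "$five_plus" "$two_to_three"
```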

### Matching Lines That Contain Any of Some Regexps

To match lines that contain any of a number of regexps, specify each of the regexps to search for between alternation operators ('\|') as the regexp to search for. Lines containing any of the given regexps will be output.

• To output all lines in 'playlist' that contain either the patterns 'the sea' or 'cake', type:

$ grep 'the sea\|cake' playlist RET
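Alternation can be written more than one way. The sketch below runs three forms against an invented playlist and counts the matching lines: the '\|' operator (supported in basic regexps by GNU grep), the '-E' option for extended regexps where a bare '|' does the same job, and repeated '-e' options. All three should select the same lines; the sample data and variable names are illustrative only.

```shell
# An invented playlist with two matching lines out of three.
playlist=$(printf 'by the sea\npound cake\nblue moon')

# Basic regexp with the \| alternation operator.
basic=$(printf '%s\n' "$playlist" | grep -c 'the sea\|cake')

# Extended regexp: with -E, a plain | means alternation.
extended=$(printf '%s\n' "$playlist" | grep -c -E 'the sea|cake')

# One -e option per pattern; a line matching any of them is selected.
repeated=$(printf '%s\n' "$playlist" | grep -c -e 'the sea' -e 'cake')

echo "$basic $extended $repeated"
```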

### Matching Lines That Don't Contain a Regexp

To output all lines in a text that don't contain a given pattern, use grep with the '-v' option--this option inverts the sense of matching, selecting all non-matching lines.

• To output all lines in '/usr/dict/words' that are not three characters wide, type:

$ grep -v '^...$' /usr/dict/words RET

• To output all lines in 'access_log' that do not contain the string 'http', type:

$ grep -v http access_log RET

### Matching Lines That Only Contain Certain Characters

To match lines that only contain certain characters, use the regexp '^[characters]*$', where characters are the ones to match.

• To output lines in '/usr/dict/words' that only contain vowels, type:

$ grep -i '^[aeiou]*$' /usr/dict/words RET

The '-i' option matches characters regardless of case; so, in this example, all vowel characters are matched regardless of case.
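One detail worth seeing for yourself is that '*' also matches zero characters, so a regexp of this form selects empty lines as well. The sketch below, on invented sample input, counts matches with and without empty lines; repeating the bracket expression once before the '*' requires at least one character.

```shell
# Sample input: two all-vowel lines, one empty line, two other words.
input=$(printf 'Aeiou\nio\nsky\n\nqueue')

# ^[aeiou]*$ also matches the empty line.
with_empty=$(printf '%s\n' "$input" | grep -c -i '^[aeiou]*$')

# Requiring one vowel first leaves the empty line out.
nonempty=$(printf '%s\n' "$input" | grep -c -i '^[aeiou][aeiou]*$')

echo "including empty lines: $with_empty, excluding them: $nonempty"
```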

### Finding Phrases Regardless of Spacing

One way to search for a phrase that might occur with extra spaces between words, or across a line or page break, is to turn all linebreaks in the input into spaces, reduce each run of spaces to a single space, and then grep that.

To do this, pipe the input to tr with the '-s' option, giving '\r\n\f' as the characters to translate and ' ' (a single space) as the character to translate them to; this joins the input into one long line with uniform spacing. (Deleting the linebreaks outright would instead run together the last word of one line and the first word of the next.) Then pipe that to grep with the pattern to search for; since the input is now a single line, grep outputs the whole text when the phrase is found.

• To search across line breaks for the string 'at the same time as' in the file 'notes', type:

$ cat notes | tr -s '\r\n\f' ' ' | grep 'at the same time as' RET

### Finding Patterns in Certain Contexts

To search for a pattern that only occurs in a particular context, grep for the context in which it should occur, and pipe the output to another grep to search for the actual pattern.

For example, this can be useful to search for a given pattern only when it is quoted with an '>' character in an email message.

• To list lines from the file 'email-archive' that contain the word 'narrative' only when it is quoted, type:

$ grep '^>' email-archive | grep narrative RET

You can also reverse the order and use the '-v' option to output all lines containing a given pattern that are not in a given context.

• To list lines from the file 'email-archive' that contain the word 'narrative', but not when it is quoted, type:

$ grep narrative email-archive | grep -v '^>' RET

### Using a List of Regexps to Match From

You can keep a list of regexps in a file, and use grep to search text for any of the patterns in the file. To do this, specify the name of the file containing the regexps to search for as an argument to the '-f' option.

This can be useful, for example, if you need to search a given text for a number of words--keep each word on its own line in the regexp file.

• To output all lines in '/usr/dict/words' containing any of the words listed in the file 'forbidden-words', type:

$ grep -f forbidden-words /usr/dict/words RET

• To output all lines in '/usr/dict/words' that do not contain any of the words listed in 'forbidden-words', regardless of case, type:

$ grep -v -i -f forbidden-words /usr/dict/words RET

### Regexps for Common Situations

The following table lists sample regexps and describes what they match. You can use these regexps as boilerplate when building your own regular expressions for searching text. Remember to enclose regexps in quotes.

USE THIS REGEXP - TO MATCH LINES THAT ...

0\{9\} - contain nine zeroes in a row

^....$ or ^.\{4\}$ - are exactly four characters long

^.\{70\}$ - are exactly seventy characters long

^\* - begin with an asterisk character

^tow.*ing$ - begin with 'tow' and end with 'ing'

[0-9] - contain a number

^[^0-9]*$ - do not contain a number

199[1-5] - contain a year from 1991 through 1995

\(195[7-9]\)\|\(196[0-9]\) - contain a year from 1957 through 1969

\.te\?xt - contain either '.txt' or '.text'

cat[^ ]*gory - contain 'cat' then 'gory' in the same word

cat.*gory - contain 'cat' then 'gory' in the same line

q[^u] - contain a 'q' not followed by a 'u'

\(ftp\|gopher\|http\)://.*\..* - contain any 'ftp', 'gopher', or 'http' URLs

N.*T.*K - contain 'N', 'T', and 'K', with zero or more characters between each

## Searching More than Plain Text Files

The following recipes are for searching data other than in plain text files.

### Matching Lines in Compressed Files

Use zgrep to search through text in files that are compressed. These files usually have a '.gz' file name extension, and can't be searched or otherwise read by other tools without uncompressing the file first.

The zgrep tool works just like grep, except it searches through the text of compressed files. It outputs matches to the given pattern as if you'd searched through normal, uncompressed files. It leaves the files compressed when it exits.

• To search through the compressed file 'README.gz' for the text 'Linux', type:

$ zgrep Linux README.gz RET
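To confirm that zgrep really leaves the file compressed, you can try it on a throwaway file. This is a sketch that assumes gzip and zgrep are installed; the file contents and variable names are invented for illustration.

```shell
# Create a small text file in a temporary directory and compress it.
dir=$(mktemp -d)
printf 'Welcome to Linux\nNothing to see here\n' > "$dir/README"
gzip "$dir/README"               # replaces README with README.gz

# Search the compressed file directly.
found=$(zgrep Linux "$dir/README.gz")
printf '%s\n' "$found"

# The directory still holds only the compressed file.
still_compressed=$(ls "$dir")
rm -r "$dir"
```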

## Outputting the Context of a Search

It is sometimes useful to see a matched line in its context in the file--that is, to see some of the lines that surround it.

Use the '-C' option with grep to output results in context; give the number of context lines to output both before and after each match as an argument to the option. (Current versions of GNU grep require this number; older versions output two lines of context when '-C' was given by itself.) You can also give the number by itself as an option, in place of '-C'.

• To search '/usr/dict/words' for lines matching 'tsch' and output two lines of context before and after each line of output, type:

$ grep -C 2 tsch /usr/dict/words RET

• To search '/usr/dict/words' for lines matching 'tsch' and output six lines of context before and after each line of output, type:

$ grep -6 tsch /usr/dict/words RET
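A numbered list makes the context lines easy to pick out. The sketch below uses seq to generate the numbers 1 through 9 as stand-in input and searches for the line containing 5; the variable names are illustrative only.

```shell
# One line of context on each side of the match.
one_each_side=$(seq 1 9 | grep -C 1 5 | tr '\n' ' ')

# Giving the number by itself works the same way: two lines each side.
two_each_side=$(seq 1 9 | grep -2 5 | tr '\n' ' ')

printf '%s\n%s\n' "$one_each_side" "$two_each_side"
```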

To output matches and the lines before them, use '-B'; to output matches and the lines after them, use '-A'. Give a number as an argument to either of these options to specify that number of context lines.

• To search '/usr/dict/words' for lines matching 'tsch' and output two lines of context before each line of output, type:

$ grep -B 2 tsch /usr/dict/words RET

• To search '/usr/dict/words' for lines matching 'tsch' and output six lines of context after each line of output, type:

$ grep -A6 tsch /usr/dict/words RET

• To search '/usr/dict/words' for lines matching 'tsch' and output ten lines of context before and three lines of context after each line of output, type:

$ grep -B10 -A3 tsch /usr/dict/words RET

## Searching and Replacing Text

A quick way to search and replace some text in a file is to use the following one-line perl command:

$ perl -pi -e "s/oldstring/newstring/g;" filespec RET

In this example, oldstring is the string to search for, newstring is the string to replace it with, and filespec is the name of the file or files to work on. You can use this for more than one file.

• To replace the string 'helpless' with the string 'helpful' in all files in the current directory, type:

$ perl -pi -e "s/helpless/helpful/g;" * RET

You can also search and replace text in an Emacs buffer; to do this, use the replace-regexp function and give both the expression to search for and the expression to replace it with.

• To replace the text 'helpless' with the text 'helpful' in the current buffer, type:

M-x replace-regexp RET helpless RET helpful RET

## Searching Text in Less

There are two useful commands in less for searching through text: / and ?. To search forward through the text, type / followed by a regexp to search for; to search backward through the text, use ?.

When you do a search, the word or other regexp you search for appears highlighted throughout the text.

• To search forward through the text you are perusing for the word 'cat', type:

/cat RET

• To search backward through the text you are perusing for the regexp '[ch]at', type:

?[ch]at RET