UNIT - 4
Simple Filters
Some UNIX commands accept input from standard input or from files, manipulate it, and write the result to standard output. Since these commands perform a filtering operation on data, they are appropriately called “filters”. Filters are used, for example, to display the contents of a file in sorted order or to extract the lines of a file that contain a specific pattern.
Filters are the central tools of the UNIX tool kit, and each filter performs one simple function. Some of the sample files used here have their fields delimited by a pipe (|) or a colon (:). Many filters work well with delimited fields, and some simply won't work without them. The piping mechanism allows the standard output of one filter to serve as the standard input of another. A filter reads data from standard input when it is used without a filename as argument, and from the named file otherwise.
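The two input modes and the piping mechanism described above can be tried with a short sketch. The file name demo.txt and its contents are made up for illustration, and sort and head merely stand in for any filter.

```shell
# Scratch directory so the sketch does not touch existing files.
dir=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$dir/demo.txt"   # demo.txt: hypothetical sample file

# A filter given a filename reads from that file.
from_file=$(sort "$dir/demo.txt")

# The same filter without a filename reads from standard input.
from_stdin=$(sort < "$dir/demo.txt")

# Piping: the standard output of sort becomes the standard input of head.
first_line=$(sort "$dir/demo.txt" | head -n 1)

printf '%s\n' "$first_line"
rm -r "$dir"
```

Both ways of calling sort give the same result; only the source of the data differs.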
The Simple Database: The filters in this unit are demonstrated on a sample text file, emp.lst, which is designed in a fixed format and contains a personnel database. The details of one employee are stored on a single line, and each line has six fields separated by the delimiter |. There are 15 lines in the file.
    $ cat emp.lst
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400
    1006 | chanchal singhvi | director | sales | 03/09/38 | 6700
    6213 | karuna ganguly | g.m. | accounts | 05/06/62 | 6300
    1265 | s.n. dasgupta | manager | sales | 12/09/63 | 5600
    4290 | jayant choudhary | executive | production | 07/09/50 | 6000
    2476 | anil aggarwal | manager | sales | 01/05/59 | 5000
    6521 | lalit choudhury | director | marketing | 26/09/45 | 8200
    3212 | shyam saksena | d.g.m. | account | 12/12/55 | 6000
    3564 | sudhir agarwal | executive | personnel | 06/07/47 | 7500
    2345 | j.b.saxsena | g.m. | marketing | 12/03/45 | 8000
    0110 | v.k. agrawal | g.m. | marketing | 31/12/40 | 9000

1. pr: paginating files : The pr command prepares a file for printing by adding suitable headers, footers and formatted text. pr adds five lines of margin at the top and bottom of each page. The header shows the date and time of last modification of the file along with the filename and page number.
Syntax: $ pr option filename

    $ pr dept.lst
    ...blank lines...
    May 06 10:38 1997 dept.lst page 1

    01:accounts:6213
    02:progs:5423
    03:marketing:6521
    05:production:9876
    06:sales:1006
    ...blank lines...

pr options: The different options for the pr command are:
    -k      prints k (integer) columns
    -t      suppresses the header and footer
    -h      uses a header of the user's choice
    -d      double-spaces the input
    -n      numbers each line, which helps in debugging
    -o n    offsets the lines by n spaces, increasing the left margin of the page
For example, if a file xyz contains a series of 20 numbers, one on each line, then the -k and -t options print the output as follows:

    $ cat xyz | pr -t -5
    1    5    9     13    17
    2    6    10    14    18
    3    7    11    15    19
    4    8    12    16    20

    $ pr +10 chap01       # starts printing from page 10
    $ pr -l 54 chap01     # sets the page length to 54 lines
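The column layout shown above can be reproduced with a small sketch. The file name numbers.txt is hypothetical, and the awk step is only an assumption made here to squeeze pr's padding so that the rows are easy to read and compare; the exact padding pr uses depends on the implementation.

```shell
dir=$(mktemp -d)

# numbers.txt (hypothetical name) holds 1..20, one per line, like the file xyz above.
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' > "$dir/numbers.txt"

# -t suppresses the header and footer; -5 arranges the lines in five columns.
# awk '{ $1 = $1; print }' rebuilds each row with single spaces between columns.
columns=$(pr -t -5 "$dir/numbers.txt" | awk '{ $1 = $1; print }')

printf '%s\n' "$columns"
rm -r "$dir"
```

The numbers run down each column first, so the first row holds every fourth number.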
2. head: displaying the beginning of the file : The head command displays the top of the file. It displays the first 10 lines of the file when used without an option.
Syntax: $ head option filename

    $ head emp.lst

Option: -n to specify a line count

    $ head -n 3 emp.lst
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
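A minimal sketch of the default count and the -n option follows. The file sample.lst and its 15 generated lines are assumptions made for the illustration.

```shell
dir=$(mktemp -d)

# sample.lst (hypothetical) has 15 lines: "record 1" ... "record 15".
awk 'BEGIN { for (i = 1; i <= 15; i++) print "record " i }' > "$dir/sample.lst"

# Without an option head prints the first 10 lines.
default_count=$(head "$dir/sample.lst" | wc -l)

# -n 3 limits the output to the first three lines.
top3=$(head -n 3 "$dir/sample.lst")

printf '%s\n' "$top3"
rm -r "$dir"
```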
3. tail: displaying the end of a file : The tail command displays the end of the file. It displays the last 10 lines of the file when used without an option.
Syntax: $ tail option filename

    $ tail emp.lst

Option: -n to specify a line count

    $ tail -n 3 emp.lst
    3564 | sudhir agarwal | executive | personnel | 06/07/47 | 7500
    2345 | j.b.saxsena | g.m. | marketing | 12/03/45 | 8000
    0110 | v.k. agrawal | g.m. | marketing | 31/12/40 | 9000

This displays the last three lines of the file. We can also address lines from the beginning of the file instead of the end. The +count option allows us to do that, where count represents the line number from which the selection should begin.

    $ tail +11 emp.lst

This displays the file from the 11th line onwards. On POSIX-conforming systems the same selection is written as tail -n +11 emp.lst.
Other options for tail are:
    (i)  -f  monitors the growth of a file
    (ii) -c  extracts bytes rather than lines
Use tail -f when a running program continuously writes to a file and we want to see how the file is growing. We have to terminate this command with the interrupt key.
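The line and byte selections above can be sketched as follows, again on a made-up 15-line file sample.lst. The sketch uses the tail -n +11 spelling; tail -f is left out because it runs until interrupted.

```shell
dir=$(mktemp -d)

# sample.lst (hypothetical) has 15 lines: "record 1" ... "record 15".
awk 'BEGIN { for (i = 1; i <= 15; i++) print "record " i }' > "$dir/sample.lst"

# Last three lines of the file.
last3=$(tail -n 3 "$dir/sample.lst")

# Everything from line 11 onwards; only its first line is kept here.
from_11=$(tail -n +11 "$dir/sample.lst" | head -n 1)

# -c counts bytes: the last 3 bytes are "15" plus the final newline.
last_bytes=$(tail -c 3 "$dir/sample.lst")

printf '%s\n' "$last3"
rm -r "$dir"
```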
1. cut: splitting a file vertically : The cut command is used for splitting a file vertically. The examples use a shorter file: head -n 5 emp.lst | tee shortlist selects the first five lines of emp.lst and saves them to shortlist.
Syntax: $ cut option filename
Options:
    -c    for cutting columns
    -d    for the delimiter (field separator)
    -f    for cutting fields
Cutting columns: We can cut columns by using the -c option with a list of column numbers, delimited by commas.

    $ cut -c 6-22,24-32 shortlist
    a.k.shukla        g.m
    jai Sharma        director
    sanika sar        d.g.m.
    barun sengupta    director
    n.k.gupta         chairman

    $ cut -c -3,6-22,28-34,55- shortlist

The expression 55- indicates column number 55 to the end of the line. Similarly, -3 is the same as 1-3.
Cutting fields: Most files don't contain fixed-length lines, so we have to cut fields rather than columns.

    $ cut -d "|" -f 2,3 shortlist | tee cutlist1
    a.k.shukla | g.m
    jai Sharma | director
    sanika sar | d.g.m.
    barun sengupta | director
    n.k.gupta | chairman

This displays the second and third fields of shortlist and saves the output in cutlist1.
To save the remaining fields (field 1 and fields 4 onwards) we use

    $ cut -d \| -f 1,4- shortlist > cutlist2

2. paste: pasting files : What we cut with cut can be pasted back with the paste command, but vertically rather than horizontally. We can view two files side by side by pasting them. In the previous topic, cut was used to create the two files cutlist1 and cutlist2 containing two cut-out portions of the same file.
Syntax: $ paste option filename
Options:
    -d    for specifying the delimiter
    -s    for joining lines

    $ paste cutlist1 cutlist2
    a.k.shukla | g.m            2233 | sales | 12/12/52 | 6000
    jai Sharma | director       9876 | production | 12/03/50 | 7000
    sanika sar | d.g.m.         5678 | marketing | 19/04/43 | 6000
    barun sengupta | director   2365 | personnel | 11/05/47 | 7800
    n.k.gupta | chairman        5423 | admin | 30/08/56 | 5400

We can specify one or more delimiters with -d:

    $ paste -d "|" cutlist1 cutlist2
    a.k.shukla | g.m | 2233 | sales | 12/12/52 | 6000
    jai Sharma | director | 9876 | production | 12/03/50 | 7000
    sanika sar | d.g.m. | 5678 | marketing | 19/04/43 | 6000
    barun sengupta | director | 2365 | personnel | 11/05/47 | 7800
    n.k.gupta | chairman | 5423 | admin | 30/08/56 | 5400

Here the two pasted parts are separated by the delimiter |. Even though paste uses at least two files for concatenating lines, the data for one file can be supplied through the standard input.
The -s option joins the lines of a single file. Let us consider a file addressbook that contains the details of three persons, one item per line:

    $ cat addressbook
    Sudhakar
    vvsr.sudhakar@gmail.com
    7689567860
    Prateek
    pratsin@yahoo.com
    9128465857
    Manisha
    mani.vara@gmail.com
    9763745348

    $ paste -s -d "||\n" addressbook
    Sudhakar|vvsr.sudhakar@gmail.com|7689567860
    Prateek|pratsin@yahoo.com|9128465857
    Manisha|mani.vara@gmail.com|9763745348

The delimiters in the list are used in a circular manner: the first and second joins use |, the third uses a newline, and then the list starts again.
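The cut and paste round trip can be sketched on a small sample. The file names short_demo.lst, names, rest and addr_demo are hypothetical, and the sample records are written without spaces around the | delimiter, which is an assumption of this sketch rather than the layout shown in the listings.

```shell
dir=$(mktemp -d)

cat > "$dir/short_demo.lst" <<'EOF'
2233|a.k.shukla|g.m.|sales|12/12/52|6000
9876|jai sharma|director|production|12/03/50|7000
5678|sanika sar|d.g.m.|marketing|19/04/43|6000
EOF

# Cutting columns: characters 1 to 4 of the first line.
first_id=$(cut -c 1-4 "$dir/short_demo.lst" | head -n 1)

# Cutting fields: fields 2,3 go to one file, fields 1 and 4 onwards to another.
cut -d '|' -f 2,3 "$dir/short_demo.lst" > "$dir/names"
cut -d '|' -f 1,4- "$dir/short_demo.lst" > "$dir/rest"

# Pasting the two parts side by side with | as the joining delimiter.
first_joined=$(paste -d '|' "$dir/names" "$dir/rest" | head -n 1)

# paste -s joins the lines of one file; the delimiter list | | newline is reused in a cycle.
printf 'Sudhakar\nvvsr.sudhakar@gmail.com\n7689567860\nPrateek\npratsin@yahoo.com\n9128465857\n' > "$dir/addr_demo"
joined_addr=$(paste -s -d '||\n' "$dir/addr_demo")

printf '%s\n' "$first_joined" "$joined_addr"
rm -r "$dir"
```

Note that pasting does not restore the original field order: the fields cut out first now lead each line.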
3. sort: ordering a file : Sorting is the ordering of data in ascending or descending sequence. The sort command orders a file and, by default, sorts on the entire line.
Syntax: $ sort option filename

    $ sort shortlist
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
    9876 | jai Sharma | director | production | 12/03/50 | 7000

This default sorting sequence can be altered by using certain options. We can also sort on one or more keys (fields) or use a different ordering rule.
sort options: The important sort options are:
    -t char        uses delimiter char to identify fields
    -k n           sorts on the nth field
    -k m,n         starts the sort on the mth field and ends it on the nth field
    -k m.n         starts the sort on the nth column of the mth field
    -u             removes repeated lines
    -n             sorts numerically
    -r             reverses the sort order
    -f             folds lowercase to equivalent uppercase
    -m list        merges the sorted files in list
    -c             checks whether the file is sorted
    -o filename    places the output in the file filename
Sorting on the second field (name):

    $ sort -t "|" -k 2 shortlist
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000

Reversing the order, with -r either as a separate option or attached to the key:

    $ sort -t "|" -r -k 2 shortlist
or
    $ sort -t "|" -k 2r shortlist
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000

Sorting on the third field (designation) first and then on the second field (name):

    $ sort -t "|" -k 3,3 -k 2,2 shortlist
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000

Sorting on columns within a field, here the 7th and 8th columns of the fifth field (the year of birth):

    $ sort -t "|" -k 5.7,5.8 shortlist
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000
    2365 | barun sengupta | director | personnel | 11/05/47 | 7800
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    5423 | n.k.gupta | chairman | admin | 30/08/56 | 5400

When sort acts on numerals, strange things can happen. When we sort a file containing only numbers, we get a curious result, because the numbers are compared character by character as text. This can be overridden with the -n (numeric) option.

    $ sort numfile
    10
    2
    27
    4
    $ sort -n numfile
    2
    4
    10
    27

Repeated lines can be removed with the -u option. If we cut out the designation field from emp.lst, we can pipe it to sort to find out the unique designations that occur in the file:

    $ cut -d "|" -f 3 emp.lst | sort -u | tee desigx.lst
    chairman
    d.g.m.
    director
    executive
    g.m.
    manager

Other sort options are:

    $ sort -o sortedlist -k 3 shortlist    # output stored in sortedlist
    $ sort -o shortlist shortlist          # output stored in the same file

The -c option is used to check whether the file has actually been sorted in the default order.

    $ sort -c shortlist

The -m option is used to merge two or more files that are sorted individually.

    $ sort -m foo1 foo2 foo3
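The key, numeric and unique options can be checked with a short sketch. The file sort_demo.lst is hypothetical, its records are written without spaces around the delimiter (an assumption, so that column 7 of field 5 really is the first digit of the year), and LC_ALL=C is set here only to make the collating order predictable.

```shell
dir=$(mktemp -d)
LC_ALL=C; export LC_ALL     # fixed collating order: an assumption of this sketch

cat > "$dir/sort_demo.lst" <<'EOF'
2233|a.k.shukla|g.m.|sales|12/12/52|6000
9876|jai sharma|director|production|12/03/50|7000
5678|sanika sar|d.g.m.|marketing|19/04/43|6000
2365|barun sengupta|director|personnel|11/05/47|7800
EOF

# Helper pipeline: keep only the id field and put the ids on one line.
ids() { cut -d '|' -f 1 | paste -s -d ' ' -; }

by_name=$(sort -t '|' -k 2 "$dir/sort_demo.lst" | ids)          # key starts at field 2
by_year=$(sort -t '|' -k 5.7,5.8 "$dir/sort_demo.lst" | ids)    # columns 7-8 of field 5
by_salary=$(sort -t '|' -k 6,6nr "$dir/sort_demo.lst" | ids)    # numeric, reversed

# Text order versus numeric order on plain numbers.
text_order=$(printf '10\n2\n27\n4\n' | sort | paste -s -d ' ' -)
num_order=$(printf '10\n2\n27\n4\n' | sort -n | paste -s -d ' ' -)

# Unique designations: cut the field, then sort -u.
designations=$(cut -d '|' -f 3 "$dir/sort_demo.lst" | sort -u | paste -s -d ' ' -)

printf '%s\n' "$by_name" "$by_year" "$num_order" "$designations"
rm -r "$dir"
```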
1. grep command: searching for a pattern : This command searches a specified file for a specified pattern and displays the lines containing the pattern.
Syntax: $ grep [-option] pattern <filename>
Options:
    -b    precedes each matching line with its position in the file (a block number on older UNIX systems, a byte offset in GNU grep)
    -i    ignores case
    -v    displays only the lines that do not match the specified pattern
    -c    displays the number of lines in the file that contain the pattern
    -n    displays the matching lines along with their line numbers
Example:

    $ cat emp.txt
    1001 Ram Computer CS
    1002 Merry Electronics ET
    1003 John Computer CS
    $ grep "CS" emp.txt
    1001 Ram Computer CS
    1003 John Computer CS

Regular Expression Character Set
    *           matches zero or more occurrences of the preceding character
    .           matches any single character
    [r1-r2]     matches any single character in the range r1 to r2
    [^abcd]     matches a single character which is not a, b, c or d
    ^pattern    matches the lines that begin with pattern
    pattern$    matches the lines that end with pattern
Example:

    $ grep "Com*" emp.txt
    1001 Ram Computer CS
    1003 John Computer CS

Here Com* matches Co followed by zero or more occurrences of m.
Related commands with grep:
    1. egrep [Extended grep]
    2. fgrep [Fixed grep]
egrep: This command offers more features than grep. Multiple patterns can be searched for by separating them with the pipe symbol (|).
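The options and anchors listed above can be tried on the three-line emp.txt from the example. The sketch writes grep -E for the extended form, which is the current spelling of egrep; the temporary directory is an assumption made to keep the sketch self-contained.

```shell
dir=$(mktemp -d)

cat > "$dir/emp.txt" <<'EOF'
1001 Ram Computer CS
1002 Merry Electronics ET
1003 John Computer CS
EOF

match_count=$(grep -c "CS" "$dir/emp.txt")            # number of lines containing CS
non_matching=$(grep -v "CS" "$dir/emp.txt")           # lines that do not contain CS
numbered=$(grep -n "Electronics" "$dir/emp.txt")      # line number, colon, line
any_case=$(grep -ci "computer" "$dir/emp.txt")        # -c and -i combined
at_start=$(grep "^1003" "$dir/emp.txt")               # pattern must begin the line
at_end=$(grep -c "ET$" "$dir/emp.txt")                # pattern must end the line
either=$(grep -E "Ram|John" "$dir/emp.txt" | wc -l)   # alternation with |

printf '%s\n' "$numbered"
rm -r "$dir"
```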
2. uniq command: locate repeated and non-repeated lines : When we concatenate or merge files, we face the problem of duplicate entries creeping in. We saw how sort removes them with the -u option. UNIX offers a special tool to handle these lines: the uniq command.
Syntax: $ uniq option filename
Consider a sorted dept.lst that includes repeated lines:

    $ cat dept.lst
    01|accounts |6213
    01|accounts |6213
    02|admin |5423
    03|marketing |6521
    03|marketing |6521
    03|marketing |6521
    04|personnel |2365
    05|production |9876
    06|sales |1006

cat displays all lines, including the duplicates, whereas

    $ uniq dept.lst
    01|accounts |6213
    02|admin |5423
    03|marketing |6521
    04|personnel |2365
    05|production |9876
    06|sales |1006

simply fetches one copy of each line and writes it to the standard output. Since uniq requires a sorted file as input, the general procedure is to sort a file and pipe its output to uniq. The following pipeline produces the same output, except that the output is saved in the file uniqlist:

    $ sort dept.lst | uniq - uniqlist

Options:
Selecting the non-repeated lines (-u):

    $ cut -d "|" -f 3 emp.lst | sort | uniq -u
    chairman

Selecting the duplicate lines (-d):

    $ cut -d "|" -f 3 emp.lst | sort | uniq -d
    d.g.m.
    director
    executive
    g.m.
    manager

Counting frequency of occurrence (-c):

    $ cut -d "|" -f 3 emp.lst | sort | uniq -c
    1 chairman
    2 d.g.m.
    4 director
    2 executive
    4 g.m.
    2 manager
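Why uniq needs sorted input can be seen in a small sketch. The file dept_demo.lst is hypothetical and deliberately unsorted, so that uniq alone finds no adjacent duplicates.

```shell
dir=$(mktemp -d)

cat > "$dir/dept_demo.lst" <<'EOF'
03|marketing|6521
01|accounts|6213
03|marketing|6521
02|admin|5423
01|accounts|6213
03|marketing|6521
EOF

# uniq only compares adjacent lines, so unsorted input keeps all six lines.
unsorted_count=$(uniq "$dir/dept_demo.lst" | wc -l)

# After sorting, duplicates are adjacent and collapse to one copy each.
sorted_count=$(sort "$dir/dept_demo.lst" | uniq | wc -l)

only_once=$(sort "$dir/dept_demo.lst" | uniq -u)                       # non-repeated lines
repeated=$(sort "$dir/dept_demo.lst" | uniq -d | paste -s -d ' ' -)    # one copy of each repeated line
counts=$(sort "$dir/dept_demo.lst" | uniq -c | awk '{ print $1 }' | paste -s -d ' ' -)

printf '%s\n' "$only_once" "$repeated" "$counts"
rm -r "$dir"
```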
3. tr command: translating characters : The tr filter manipulates the individual characters in a line. It translates characters using one or two compact expressions.
Syntax: $ tr options expn1 expn2 < standard input
tr takes input only from standard input; it doesn't take a filename as argument. By default, it translates each character in expression1 to its mapped counterpart in expression2. The first character in the first expression is replaced with the first character in the second expression, and similarly for the other characters.

    $ tr '|/' '~-' < emp.lst | head -n 3
    2233 ~ a.k.shukla ~ g.m ~ sales ~ 12-12-52 ~ 6000
    9876 ~ jai Sharma ~ director ~ production ~ 12-03-50 ~ 7000
    5678 ~ sanika sar ~ d.g.m. ~ marketing ~ 19-04-43 ~ 6000

It is easy to define the two expressions as two separate variables and then evaluate them in double quotes.

    $ exp1='|/' ; exp2='~-'
    $ tr "$exp1" "$exp2" < emp.lst

Changing the case of text: the first three lines of the file are converted from lower to upper case.

    $ head -n 3 emp.lst | tr '[a-z]' '[A-Z]'
    2233 | A.K.SHUKLA | G.M | SALES | 12/12/52 | 6000
    9876 | JAI SHARMA | DIRECTOR | PRODUCTION | 12/03/50 | 7000
    5678 | SANIKA SAR | D.G.M. | MARKETING | 19/04/43 | 6000

Deleting characters (-d):

    $ tr -d '|/' < emp.lst | head -n 3
    2233a.k.shukla g.m sales 1212526000
    9876jai Sharma director production 1203507000
    5678sanika sar d.g.m. marketing 1904436000

Compressing multiple consecutive characters (-s):

    $ tr -s ' ' < emp.lst | head -n 3
    2233 | a.k.shukla | g.m | sales | 12/12/52 | 6000
    9876 | jai Sharma | director | production | 12/03/50 | 7000
    5678 | sanika sar | d.g.m. | marketing | 19/04/43 | 6000

Complementing values of expression (-c): with -cd, tr deletes every character that is not in the expression.

    $ tr -cd '|/' < emp.lst | head -n 3
    ||||//|||||//|||||//|
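Each tr operation above can be tried on a single record. The record used here is a shortened, made-up line in the style of emp.lst.

```shell
record='2233|shukla|12/12/52'    # hypothetical sample record

# Translate: | becomes ~ and / becomes -, position by position.
translated=$(printf '%s\n' "$record" | tr '|/' '~-')

# Change case from lower to upper.
upper=$(printf '%s\n' "$record" | tr '[a-z]' '[A-Z]')

# Delete every | and /.
deleted=$(printf '%s\n' "$record" | tr -d '|/')

# Squeeze runs of spaces into a single space.
squeezed=$(printf 'a    b   c\n' | tr -s ' ')

# Complement plus delete: keep only | and / (the newline is removed as well).
kept=$(printf '%s\n' "$record" | tr -cd '|/')

printf '%s\n' "$translated" "$upper" "$deleted" "$squeezed" "$kept"
```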
4. cmp command: comparing two files : The cmp command is used to compare two files.
Syntax: $ cmp option filename filename

    $ cmp file1 file2
    file1 file2 differ: char 4, line 1

file1 and file2 are compared byte by byte, and the location of the first mismatch is echoed to the screen. The two files used here are:

    $ cat file1
    Sumit
    Sudhakar
    Yogiraj
    Mrinal
    $ cat file2
    Sumet
    Sudhakur
    Yogeraj
    Mrunal

If we want to list all the differences between the two files, we use the -l option (the // note is an explanation, not part of the output):

    $ cmp -l file1 file2
     4 151 145      // fourth character has the octal values 151 and 145
    13 141 165
    19 151 145
    26 151 165

The comm command: listing common records : The comm command compares two sorted files, file1 and file2, line by line. It produces three-column output: the first column shows the lines unique to the first file, the second column shows the lines unique to the second file, and the third column shows the lines common to both files.
Syntax: $ comm [options] <file1> <file2>

    $ cat file1
    Australia
    China
    India
    Japan
    New Zealand
    $ cat file2
    California
    China
    India
    Nepal
    Tanzania

    $ comm file1 file2
    Australia
            California
                    China
                    India
    Japan
            Nepal
    New Zealand
            Tanzania

Options:
    -1     suppresses printing of column 1
    -2     suppresses printing of column 2
    -3     suppresses printing of column 3
    -12    prints only the lines in column 3
    -13    prints only the lines in column 2
    -23    prints only the lines in column 1

    $ comm -12 file1 file2
    China
    India

The diff command: displaying the changes needed to make both files identical : The diff command can be used to display file differences. The output consists of lines of context from each file, with the lines of file1 tagged by a < symbol and the lines of file2 tagged by a > symbol. Each group of lines is preceded by an instruction that uses one of the following commands: a (append), d (delete), c (change).
Syntax: $ diff [options] <file1> <file2>

    $ cat file1
    c.k. shukla
    chanchal singhvi
    s.n. dasgupta
    Sanika Sar
    $ cat file2
    anil aggarwal
    barun sengupta
    c.k. shukla
    lalit chowdhury
    s.n. dasgupta

In the output below, the // notes are explanations and not part of the output:

    $ diff file1 file2
    0a1,2                  // append lines 1 to 2 of the second file after line 0 of the first file
    > anil aggarwal
    > barun sengupta
    2c4                    // change line 2 of the first file to line 4 of the second file
    < chanchal singhvi
    ---
    > lalit chowdhury
    4d5                    // delete line 4 of the first file; the files are then in step at line 5 of the second file
    < Sanika Sar
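The three comparison commands can be tried together in one sketch. The file names a, b, c1, c2, d1 and d2 are hypothetical, the cmp and comm files are shortened versions of the ones above, and LC_ALL=C is set here only so that the sorted order comm expects is predictable. cmp and diff exit with a non-zero status when the files differ, which is why the sketch appends || true.

```shell
dir=$(mktemp -d)
LC_ALL=C; export LC_ALL     # fixed collating order: an assumption of this sketch

# cmp -l: byte position followed by the two differing octal values.
printf 'Sumit\nSudhakar\n' > "$dir/a"
printf 'Sumet\nSudhakur\n' > "$dir/b"
cmp_diffs=$(cmp -l "$dir/a" "$dir/b" | awk '{ print $1, $2, $3 }') || true

# comm on two sorted files.
printf 'Australia\nChina\nIndia\nJapan\n' > "$dir/c1"
printf 'California\nChina\nIndia\nNepal\n' > "$dir/c2"
common=$(comm -12 "$dir/c1" "$dir/c2" | paste -s -d ' ' -)        # column 3 only
only_first=$(comm -23 "$dir/c1" "$dir/c2" | paste -s -d ' ' -)    # column 1 only

# diff: keep only the instruction lines (they start with a line number).
printf 'c.k. shukla\nchanchal singhvi\ns.n. dasgupta\nSanika Sar\n' > "$dir/d1"
printf 'anil aggarwal\nbarun sengupta\nc.k. shukla\nlalit chowdhury\ns.n. dasgupta\n' > "$dir/d2"
diff_cmds=$(diff "$dir/d1" "$dir/d2" | grep '^[0-9]' | paste -s -d ' ' -) || true

printf '%s\n' "$cmp_diffs" "$common" "$only_first" "$diff_cmds"
rm -r "$dir"
```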
References
- Sumitabha Das: UNIX – Concepts and Applications, 4th Edition, Tata McGraw Hill, 2006.
- Behrouz A. Forouzan and Richard F. Gilberg: UNIX and Shell Programming, Cengage Learning, 2005.
- M.G. Venkateshmurthy: UNIX & Shell Programming, Pearson Education, 2005.