Episode 002 – tr

The tr, or translate (aka transliterate), command substitutes one or more characters for another set of characters, or deletes a specified set of characters.  The tr command takes input from standard in and writes to standard out.  This simple example of the tr command translates some numbers into a word:

echo "12234" | tr '1234' 'aple'

The output:

apple

The tr command has "translated" each character in set 1 ('1 2 3 4') to its corresponding character in set 2 ('a p l e') and written the word "apple" to standard out.

Sets can be defined with shorthand.  For instance, in the above example the set '1234' could be replaced with the range '1-4' to achieve the same result:

echo "12234" | tr '1-4' 'aple'

Take note that, unlike in some other programs, a range is not specified within brackets (e.g., [1-4]).  Placing a range within brackets may not produce the desired results, because most modern versions of tr treat the brackets as ordinary characters in the set.

This also poses a problem when there is a need to translate "-" to something else or strip it out.  Used at the start of the set, as in tr -ap1, the "-" would be treated as the start of a flag ("-a" in this case).  Used within a set, as in tr ac-k1, the "-" is treated as part of the range "c-k".  Therefore, if you need to translate the "-", put it at the end of the set.  A "-" on its own also works:

echo "hello-hooray-let-the-show-begin" | tr - ' '

This results in the output "hello hooray let the show begin", where each "-" is replaced by a single space.
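
As a minimal sketch of the two usual ways to keep a "-" literal inside a larger set (the sample strings and variable names here are made up for illustration): put the "-" last in the set, or use "--" to mark the end of the flags so a set that starts with "-" is not mistaken for an option.

```shell
# "-" placed last in set 1 is taken literally, not as a range.
# Here "_" becomes a space and "-" becomes "/".
dashed=$(echo "2024-01-15_report" | tr '_-' ' /')
echo "$dashed"

# "--" ends option parsing, so the set "-a" is not read as a flag.
# Here "-" becomes "_" and "a" becomes "A".
flagged=$(echo "a-b" | tr -- '-a' '_A')
echo "$flagged"
```

The first command should print "2024/01/15 report" and the second "A_b".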

Characters in set 1 that do not appear in standard in are simply never used:

echo "12234" | tr '1-5' 'aples'

Since there is no '5' in standard in, nothing is translated to 's', and the output is still 'apple'.

Characters that appear multiple times in set 1 are translated to the character in set 2 that corresponds to their last occurrence:

echo "apple" | tr 'caapa' '12345'

The output is '544le', with a and p being translated to 5 and 4 respectively.  The 'a' appears three times in set 1, paired with 2, 3, and 5, but the tr command only uses the last pairing, so 'a' becomes '5'.

The size of each set is important and depending on what version of tr is installed on your system you may get different results if the sets are not of equal size:

echo 12345 | tr '12345' 'abc'

In this example, if you were running the BSD version of tr, the resulting output would be 'abccc', where the last character in set 2 is repeated until set 2 matches the length of set 1.  The System V version of tr would truncate set 1 to the length of set 2, and the output would be 'abc45'.  GNU tr handles these cases like BSD tr unless the -t (truncate set 1) flag is used.

Set 2 can contain the short form for a repeated character to pad set 2 out to the size of set 1.  The syntax for the repeat is [c*], where c is some character:

echo 12345 | tr '12345' '[a*]bc'

Results in 'aaabc' and:

echo 12345 | tr '12345' 'ab[c*]'

Results in 'abccc'.
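
The repeat syntax also accepts an explicit count, written [c*n], which repeats the character c exactly n times.  This form is not covered above; it is described in the POSIX and GNU documentation for tr.  A small sketch (the variable name is only for illustration):

```shell
# [a*2] expands to "aa" and [b*3] expands to "bbb",
# so set 2 is effectively "aabbb".
counted=$(echo 12345 | tr '12345' '[a*2][b*3]')
echo "$counted"
```

This should print "aabbb".  Spelling out the counts makes the mapping explicit, so the result does not depend on how a given version of tr pads a short set 2.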

Defined classes can be used instead of specifying individual characters in a set.  For example, the set of all letters in the English alphabet is defined by the class [:alpha:], and the set of all digits 0-9 is defined by the class [:digit:].

echo "apple345" | tr '[:alpha:]' '1'

Results in '11111345', while:

echo "apple245!" | tr '[:digit:]' 'abc'

Results in 'appleccc!'.  Notice an exclamation point was tacked on the end there.  It is in neither the [:alpha:] nor the [:digit:] class, so it passes through untouched.  You might wonder why the 245 was converted to three 'c's.  Recall that GNU tr, like BSD tr, repeats the last character of set 2 to pad it out to the length of set 1.  So 0 = a, 1 = b, and 2 through 9 = c.

A full list of character classes can be found in the man or info pages for tr.  Some of the more common classes are:

  • [:alpha:] – all letters
  • [:digit:] – digits 0-9
  • [:alnum:] – all letters and digits
  • [:punct:] – punctuation characters
  • [:lower:] – lower case letters (a-z)
  • [:upper:] – upper case letters (A-Z)
  • [:blank:] – horizontal whitespace (space and tab)
  • [:space:] – horizontal and vertical whitespace

When translating, the character classes [:lower:] and [:upper:] are the only two classes allowed in set 2, and only when the opposite class is used at the corresponding position in set 1 (e.g., tr [:upper:] [:lower:]).  The exception to this rule is when the --delete and --squeeze-repeats options are used together (covered shortly), where set 2 is only used for squeezing.
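
The [:upper:] and [:lower:] pairing is most often used for case conversion.  A minimal sketch, with a made-up sample string; the classes are quoted so the shell does not try to expand them as filename patterns:

```shell
# Lower case to upper case.
shout=$(echo "Hello World" | tr '[:lower:]' '[:upper:]')
echo "$shout"

# Upper case to lower case.
quiet=$(echo "Hello World" | tr '[:upper:]' '[:lower:]')
echo "$quiet"
```

These should print "HELLO WORLD" and "hello world".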

The tr command accepts the following flags:

  • -c, -C, --complement – first complement set 1
  • -d, --delete – delete characters in set 1
  • -s, --squeeze-repeats – replace each run of a repeated character in set 1 with a single occurrence of that character
  • -t, --truncate-set1 – first truncate set 1 to the length of set 2

The -t, or truncate set 1, option reverses the way GNU tr handles set 2 when it is shorter than set 1.  The default behavior is to repeat the last character in set 2 for each character in set 1 beyond the size of set 2.  The -t option truncates set 1 to the size of set 2, so characters in set 1 with no corresponding character in set 2 are left untranslated:

echo "123456" | tr -t '123456' 'abc'

Would produce the output 'abc456', whereas without the -t flag the output would be 'abcccc', as '456' in set 1 would be matched to the 'c' in set 2.

The -d, or --delete, flag does not translate but deletes characters, and on its own does not accept a set 2.  If you try to pass a set 2 without also using -s, it will produce an error.  Set 1 can consist of characters or classes:

echo "123apple45" | tr -d '12345'

echo "123apple45" | tr -d '[:digit:]'

Both produce the same output, 'apple', stripping the numbers '123' and '45' from the output.
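
Deletion works with any of the classes.  As a small sketch with a made-up sample string, this strips punctuation.  The class is quoted because an unquoted [:punct:] is also a valid shell filename pattern and could be expanded by the shell if a matching one-character filename exists in the current directory:

```shell
# Delete every punctuation character; letters and spaces pass through.
no_punct=$(echo "Hello, World!" | tr -d '[:punct:]')
echo "$no_punct"
```

This should print "Hello World".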

The squeeze, -s, flag replaces each run of a repeated character listed in set 1 with a single occurrence when no translation or deletion is to occur.  If translation or deletion is to occur, then set 2 is used and the squeeze occurs after the translation or deletion of the characters specified in set 1.

echo "apple123455p" | tr -s 'ap5'

In this example tr replaces any run of repeated a, p, or 5 with a single instance of the respective character.  The result is 'aple12345p', where the second 'p' and second '5' were "squeezed" out.  The 'a' is untouched because it occurs only once, and the last 'p' is not stripped out because it stands by itself.
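
Two everyday uses of squeeze, sketched with made-up sample strings: collapsing runs of spaces, and translating a delimiter while squeezing the result.  In the second command the commas are first translated to spaces, and the squeeze is then applied to the characters in set 2.

```shell
# Squeeze only: each run of spaces becomes a single space.
single_spaced=$(echo "too    many   spaces" | tr -s ' ')
echo "$single_spaced"

# Translate then squeeze: commas become spaces, then runs of
# spaces (set 2) are collapsed to one.
fields=$(echo "a,,b,,,c" | tr -s ',' ' ')
echo "$fields"
```

These should print "too many spaces" and "a b c".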

The -c, -C, or --complement flag replaces the complement of set 1, that is, all characters not in set 1, with the characters in set 2.  For example:

echo "12345apple" | tr -c '[:alpha:]' 's'

The output of tr is 'sssssapples': every character not in the [:alpha:] set is replaced with 's'.  That covers '12345' and also the trailing newline added by echo, which becomes the final 's'.  The characters 'apple' are left alone because they are in the [:alpha:] set.
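
The complement flag also combines with -d, which turns "delete these characters" into "keep only these characters".  A minimal sketch, using a made-up phone number as input:

```shell
# Delete everything that is NOT a digit, including spaces,
# punctuation, letters, and the newline added by echo.
digits_only=$(echo "tel: +1 (555) 010-9999" | tr -cd '[:digit:]')
echo "$digits_only"
```

This should print "15550109999".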

Grasping the basics of the tr command will allow you to chain flags together to produce more complex results.  A common example of this is to get all the words in a text file:

tr -cs '[:alpha:]' '[\n*]' < somefile.txt

This example makes use of the --complement and --squeeze-repeats flags.  Standard in is the file somefile.txt, which is passed to the tr command.  The tr command first takes the complement of set 1, [:alpha:], which is all digits, spaces, punctuation, etc. (characters that are not letters), and replaces them with \n, the newline character.  The asterisk in set 2 repeats the \n so that the length of set 2 equals that of set 1.  The --squeeze-repeats option then squeezes out extra \n's in the output so that there is only one newline after each word.  Recall that when --squeeze-repeats is used together with translation, set 1 is used for the translation and the squeeze uses the characters in set 2.  The result is a list of words or letters on standard out.

echo "here is a simple line!  Only 10 words long b." | \
tr -cs '[:alpha:]' '[\n*]'

The output of this command is a string of the words and letters:

  • here
  • is
  • a
  • simple
  • line
  • Only
  • words
  • long
  • b

The !, 10, ., and each space were replaced with a newline character, and each run of newlines was squeezed down to one.
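
Once the text is one word per line, it can be piped into other standard tools.  As a rough sketch (the sample sentence and variable names are made up, and sort, uniq, head, and awk are assumed to be available), this finds the most frequent word regardless of case:

```shell
text="The cat and the hat and the bat"

# 1. Split into one word per line.
# 2. Fold to lower case so "The" and "the" count as the same word.
# 3. Sort, count duplicates, and sort by count, highest first.
# 4. Keep the top line and print just the word.
top_word=$(echo "$text" \
  | tr -cs '[:alpha:]' '[\n*]' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn \
  | head -n 1 | awk '{print $2}')
echo "$top_word"
```

For this input the most frequent word is "the", which appears three times.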

One of the handier uses of the tr command is to replace DOS line breaks in text files with the standard newline:

tr -d '\r' < somefiledos.txt > somefileproper.txt

This example strips out, or deletes, the \r (carriage return) half of the DOS end-of-line pair \r\n.  If you end up with a file that is all one continuous line of text, the file used \r alone as its line break, and you need to replace as opposed to delete:

tr '\r' '\n' < somefiledos.txt > somefileproper.txt
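
To try both conversions without a real file, printf can generate the line endings.  This is only a sketch with made-up two-line input:

```shell
# DOS-style input (\r\n): deleting \r leaves plain \n line breaks.
from_dos=$(printf 'one\r\ntwo\r\n' | tr -d '\r')

# Input that uses \r alone: translate each \r to \n instead.
from_cr=$(printf 'one\rtwo\r' | tr '\r' '\n')

# Both should now match ordinary newline-separated text.
expected=$(printf 'one\ntwo\n')
[ "$from_dos" = "$expected" ] && echo "DOS input converted"
[ "$from_cr" = "$expected" ] && echo "CR-only input converted"
```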

For more on the tr command, be sure to check out the man and info pages on your current system.


Thank you very much

 
