Episode 017 - split

The split command breaks a file up into smaller files. For example, if you need to transfer a 3GB file but the storage available for the transfer is limited to 500MB, you can split the 3GB file into about 7 smaller files, each 500MB or less in size. Once the files are transferred, you restore the original by using the cat command to concatenate the pieces back into a single file. The split itself looks like this:

split -b500M some3GBfile

This will generate a number of 500MB files with a naming structure xa[a-z]:

xaa xab xac xad…

When you want to restore the original file use the cat command:

cat xa* > some3GBfile
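
A quick way to convince yourself that the pieces reassemble correctly is to compare the restored file with the original. The following is a minimal sketch, assuming GNU coreutils; the names original.bin and restored.bin and the small 2500-byte size are stand-ins chosen for illustration so it runs in a moment.

```shell
# Work in a throwaway directory so the default x* names cannot clash with anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 2500-byte stand-in for the large file.
head -c 2500 /dev/urandom > original.bin

# Pieces of at most 1000 bytes each: xaa, xab, xac.
split -b1000 original.bin

# The shell expands xa* in sorted order, so cat joins the pieces in sequence.
cat xa* > restored.bin

# cmp -s prints nothing and returns 0 only when the files are identical.
if cmp -s original.bin restored.bin; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
piece_count=$(ls xa* | wc -l)
echo "round trip: $roundtrip, pieces: $piece_count"
```

The same check works on the receiving side of a transfer if you carry a checksum of the original along with the pieces.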

Files can be split by size in bytes, by number of lines, or by line bytes. The default is to split a file every 1000 lines. You can change the number of lines using the -l or --lines= switch:

split -l20 mypoem100lines

This will create 5 files xaa, xab, xac, xad, xae each with 20 lines of the original file.
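
To try this without a poem on hand, the sketch below generates a 100-line file with seq (an assumption made for illustration; any 100-line text file behaves the same) and checks the line count of each piece.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# seq output stands in for a real 100-line poem.
seq 1 100 > mypoem100lines

split -l20 mypoem100lines

# Each of the five pieces should report 20 lines (wc also prints a total).
wc -l xa?

# The second piece starts where the first one stopped.
first_of_xab=$(head -n 1 xab)
file_count=$(ls xa? | wc -l)
echo "xab starts at line $first_of_xab; $file_count files created"
```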

You can specify line bytes with the -C or --line-bytes= switch and then a number:

split -C20 mypoem100lines

In this instance, instead of splitting the file into 5 smaller files each with 20 lines of the poem, split packs as many complete lines as will fit into each output file without exceeding 20 bytes. A line longer than 20 bytes is broken across more than one file. If the file contains plain ASCII text, each character takes 1 byte. Since 20 bytes holds only a few words, a huge number of smaller files could be created.
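
The effect of line bytes is easier to see with lines of a known length. This is a small sketch assuming GNU split; the file name tenbytelines and the 25-byte limit are made up for the illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Five lines of exactly 10 bytes each (9 characters plus the newline).
for i in 1 2 3 4 5; do
    printf 'line-000%s\n' "$i"
done > tenbytelines

# At most 25 bytes per piece: two whole lines fit (20 bytes), a third would not.
split -C25 tenbytelines

xaa_bytes=$(wc -c < xaa)
xaa_lines=$(wc -l < xaa)
piece_count=$(ls xa? | wc -l)
echo "xaa: $xaa_bytes bytes, $xaa_lines lines; $piece_count pieces"
```

With -b25 in place of -C25 the first piece would instead hold exactly 25 bytes and end in the middle of the third line.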

As noted in the first example, you can split files based upon size in bytes with the -b or --bytes= switch. Most newer versions of split will allow you to use size suffixes: K for kilobytes, M for megabytes, G for gigabytes, etc. K, M, G, T, P, E, Z, Y are powers of 1024 and KB, MB, GB, TB, PB, EB, ZB, YB are powers of 1000.

split -b1K mypoem100lines
split -b1KB mypoem100lines

The first command will split the file into a number of smaller files each 1024 bytes or less in size. The second command will do the same thing but each file will be 1000 bytes or less in size. In both cases “or less in size” refers to the last file generated, which holds whatever bytes remain. Most files will not divide evenly, so the last file is usually smaller than the rest.
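
The difference between the two suffixes can be checked directly. The sketch below assumes GNU split and uses a hypothetical 2500-byte file named data, splitting it once with each suffix in its own directory so the default names do not collide.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 2500 bytes of filler; the content does not matter for a size comparison.
head -c 2500 /dev/zero > data
mkdir binary decimal

(cd binary && split -b1K ../data)     # K  = 1024 bytes
(cd decimal && split -b1KB ../data)   # KB = 1000 bytes

binary_first=$(wc -c < binary/xaa)
binary_last=$(wc -c < binary/xac)
decimal_first=$(wc -c < decimal/xaa)
decimal_last=$(wc -c < decimal/xac)
echo "1K:  first piece $binary_first bytes, last piece $binary_last bytes"
echo "1KB: first piece $decimal_first bytes, last piece $decimal_last bytes"
```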

The default output of split is to create files with the naming convention x[a-z][a-z]:

xaa, xab, xac… xba, xbb, xbc… xza, xzb…

The prefix, the default “x”, can be changed by passing a prefix after the input name:

split -l20 mypoem100lines MyPoemSplit

Instead of:

xaa, xab, xac, xad…

The output would be:

MyPoemSplitaa, MyPoemSplitab, MyPoemSplitac…

The suffix “aa” can be altered with a few different switches; -a or --suffix-length=N will generate suffixes of N length, the default is 2. For instance:

split -l20 -a4 mypoem100lines

Would result in a suffix length of 4 characters:

xaaaa, xaaab, xaaac…

The -d or --numeric-suffixes=N will use numeric instead of alphabetic characters:

split -l20 --numeric-suffixes=12 mypoem100lines

The suffix in this case would start at 12 and increment:

x12, x13, x14…

Note that some older versions of split will not allow you to pass a value to --numeric-suffixes.

The -d switch starts numbering at the default of 0. If you want to start with a different number, pass the number using the --numeric-suffixes= switch, as just -dN will most likely throw an error.

Finally, the --additional-suffix= will append an additional suffix to the end of each file:

split -l20 --additional-suffix=part mypoem100lines

Would create a number of files with an ending suffix of “part” :

xaapart, xabpart, xacpart…
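
The prefix and suffix switches can be combined. As a sketch, assuming a reasonably recent GNU split that supports --additional-suffix, the following produces names that sort naturally and carry a file extension; the trailing dot in the prefix and the .txt ending are choices made for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100 > mypoem100lines   # stand-in for a real 100-line file

# Prefix "MyPoemSplit.", three numeric suffix digits, and a ".txt" ending.
split -l20 -d -a3 --additional-suffix=.txt mypoem100lines MyPoemSplit.

ls MyPoemSplit.*
first_piece=$(ls MyPoemSplit.* | head -n 1)
last_piece=$(ls MyPoemSplit.* | tail -n 1)
echo "first: $first_piece, last: $last_piece"
```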

You might wonder what would happen if you run out of suffix increments. For instance, if you executed this:

split -l1 -a1 mypoem100lines

This would begin to output:

xa, xb, xc, xd…

Once it reached xz what would it do? It would fail and report the following message:

split: output file suffixes exhausted

So be careful of your suffix choices when splitting a large file into many smaller files.
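
The failure is easy to reproduce on a small scale. The sketch below assumes GNU split and a generated 100-line file; it records the exit status, counts how many pieces were written before split gave up, and then repeats the split with a longer suffix.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100 > mypoem100lines

# One-letter suffixes allow only xa through xz, so this runs out of names.
split -l1 -a1 mypoem100lines 2> split_error.txt
status=$?
created=$(ls x? | wc -l)
echo "exit status: $status, pieces written before failing: $created"
cat split_error.txt

# Two letters give 26 * 26 = 676 names, enough for all 100 one-line pieces.
rm -f x?
split -l1 -a2 mypoem100lines
created_with_a2=$(ls x?? | wc -l)
echo "pieces with -a2: $created_with_a2"
```

Note that the pieces written before the failure are left on disk, so a failed split needs cleaning up before you retry.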

If you pass the --verbose switch split will elucidate what it is doing. Typically the output would state: “creating file ‘file name’” for each file split creates from the original.

The examples we have looked at dealt with splitting a larger file up into smaller files based upon a specified size. Split has an option to split a larger file up into a specific number of smaller files or chunks with the -n or --number switch. The default option is to pass a single number to the switch:

split -n5 mypoem100lines

This will split mypoem100lines into 5 smaller files:

xaa, xab, xac, xad, xae

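These files will be roughly the same size in bytes. Because this form divides the file by size rather than by line count, a line can be cut in two at a chunk boundary.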

The K/N format divides the file into N chunks in the same way, but instead of writing any files it sends only chunk K to standard out. That is:

split -n2/5 mypoem100lines

Will split mypoem100lines up into the equivalent of 5 equal files but instead of writing those files would output chunk 2 to standard out.

The l/N format will split the file up into N smaller files but will not split lines. Unlike the plain N format, the files will differ in size, since lines are not broken up to even them out.

split -nl/3 mypoem100lines

mypoem100lines is split into 3 files without breaking any line across files.

l/K/N acts just like l/N where lines are not split but instead of writing each file to disk, file K is sent to standard out.
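
One way to see that the chunks written to standard out really cover the whole file is to request each one in turn and join them. This is a sketch assuming GNU split, with a generated file standing in for the poem and rebuilt as a made-up name for the joined result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100 > mypoem100lines

# Print each of the 5 line-aligned chunks in turn and glue them back together.
for k in 1 2 3 4 5; do
    split -nl/"$k"/5 mypoem100lines
done > rebuilt

if cmp -s mypoem100lines rebuilt; then
    chunks_match=yes
else
    chunks_match=no
fi

# Nothing named x* should exist, because every chunk went to standard out.
leftover=$(ls x* 2>/dev/null | wc -l)
echo "chunks rebuild the original: $chunks_match, x* files written: $leftover"
```

This makes K/N handy for handing one slice of a file to a separate process without creating intermediate files.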

There are two other formats for the -n switch: r/N and r/K/N. Like “l”, “r” splits the file on line boundaries without breaking lines, but it does so in a round robin distribution. Again, r/K/N will do a round robin split on lines and output chunk K to standard out instead of writing the splits to files. Round robin means that instead of the first five lines going to one file, the second five lines to another, and so on, the first line of the file goes to the first file, the second line to the second file, the third to the third, and so on until the last file is reached. Split then wraps around to the first file and continues dealing out the remaining lines in the same way. For instance, if we had a 10 line file with each line being a number and split this into 4 files:

split -nr/4 10linefile

This would produce the following output:

xaa:
1
5
9
xab:
2
6
10
xac:
3
7
xad:
4
8
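
The r/K/N form can be tried the same way. As a sketch assuming GNU split and a generated 10-line file, this prints only the lines that round robin would deal to the first of the 4 files.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > 10linefile

# Chunk 1 of 4 in round robin order: every fourth line starting with line 1.
first_share=$(split -nr/1/4 10linefile)
echo "$first_share"
# prints 1, 5 and 9, one per line
```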

Aside from splitting a file out to smaller files split has a function to pass the output to a command using --filter=[command]. For example:

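split -l10 --filter='cat' mypoem100lines

Would split mypoem100lines into chunks of 10 lines and pipe each chunk to the standard input of the cat command, which then writes those 10 lines to standard out. Split also sets the environment variable $FILE (upper case) to the name of the output file it would otherwise have written (xaa, xab, and so on), so the filter command can use that name. Put single quotes around the command so that your shell does not expand $FILE before split runs.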
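
A common use of the filter switch is compressing each piece as it is written. The sketch below assumes GNU split and gzip are installed; GNU split exports the would-be output name to the filter command in the FILE environment variable, which is used here to name each compressed piece.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100 > mypoem100lines

# Single quotes keep $FILE literal so that split, not this shell, fills it in.
split -l10 --filter='gzip > $FILE.gz' mypoem100lines

ls x*.gz
lines_in_first=$(gzip -dc xaa.gz | wc -l)
gz_count=$(ls x*.gz | wc -l)

# Decompressing the pieces in order should reproduce the original.
if gzip -dc x*.gz | cmp -s - mypoem100lines; then
    filter_roundtrip=ok
else
    filter_roundtrip=mismatch
fi
echo "$gz_count compressed pieces, $lines_in_first lines in the first, round trip: $filter_roundtrip"
```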

The -e or --elide-empty-files (elide means omit) will suppress the generation of empty or zero length files. For instance:

split -n100 10linefile

Would produce 100 files from the 10linefile, over half of which would contain no data. Whereas:

split -n100 -e 10linefile

Would suppress the creation of those 0 byte files.
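
The two runs can be compared side by side. This sketch assumes GNU split and a generated 10-line file; the directory names with_empty and without_empty are made up for the comparison. The exact number of non-empty pieces may vary between versions, so only the totals are printed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > 10linefile
mkdir with_empty without_empty

(cd with_empty && split -n100 ../10linefile)
(cd without_empty && split -n100 -e ../10linefile)

total_plain=$(ls with_empty | wc -l)
total_elided=$(ls without_empty | wc -l)

# Count zero-length pieces that survived -e; there should be none.
empty_elided=$(find without_empty -type f -size 0 | wc -l)
echo "without -e: $total_plain files; with -e: $total_elided files, $empty_elided of them empty"
```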

Split will also take input from standard input instead of a file. For instance:

tail -f /var/log/apache/error_log | split -l50

This will split the output of tailing the Apache error_log into files of 50 lines each.
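
Because tail -f never exits, a pipeline that ends on its own is easier to experiment with. In this sketch, which assumes GNU coreutils, seq stands in for any command that writes lines to standard out.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 120 lines arrive on standard input and are cut every 50 lines.
seq 1 120 | split -l50

wc -l xa?
last_lines=$(wc -l < xac)
piece_count=$(ls xa? | wc -l)
echo "$piece_count pieces; the last one has $last_lines lines"
```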

Bibliography:

  • man split
  • info split
