Split utility Performance on Linux

Processing huge files is generally a big deal, and parallel processing is a common way to deal with it. There are several ways to process files in parallel. I used the MPI API for this when I took a Computer Hardware class; it offers many methods for the job. The basic idea is that one node splits the file among all the nodes, each node is responsible for processing its own data, and when a node has finished it informs the coordinator process, as in MPI. For detailed information, see the LAM-MPI User Guide.

Besides the MPI API, we can use the Unix split utility to split our files so that each process takes one file and processes it. Because all the processes read their files at the same time, the total processing time can be reduced.
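
The split-then-process idea can be sketched in a few lines of shell. This is only an illustration under stated assumptions: `split_and_process` and `process_chunk` are hypothetical names, the "processing" is just a line count, and `-d` assumes GNU split.

```shell
#!/bin/sh
# Sketch: split a file into chunks, then process the chunks in parallel.
# process_chunk is a hypothetical placeholder for the real per-file work.

process_chunk() {
    # Example work: store the line count of one chunk next to it.
    wc -l < "$1" > "$1.count"
}

split_and_process() {
    input=$1     # file to split
    lines=$2     # maximum lines per chunk
    prefix=$3    # prefix for the chunk files

    # Same flags as in the tests in this post: 5-character numeric suffixes.
    split -a 5 -d -l "$lines" "$input" "$prefix"

    # Start one background job per chunk, then wait for all of them.
    for chunk in "$prefix"[0-9][0-9][0-9][0-9][0-9]; do
        process_chunk "$chunk" &
    done
    wait
}
```

A call such as `split_and_process output.txt 500000 part` would start one job per chunk. With a small lines-per-file value that can mean a very large number of simultaneous jobs, so a real script would probably limit how many run at once.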

In this post I want to describe the Linux/Unix split utility with a demo. In our first scenario we have a file of roughly 2 GB (output.txt), and we are going to split it into files of various sizes.

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 1907874429 Oct 2 13:51 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 31883
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286016749781
Finished ( Splitting up to 5000 lines per file ) at :1286016757756. Total time:7975 ms.
Started ( Splitting up to 50000 lines per file ) at :1286016757769
Finished ( Splitting up to 50000 lines per file ) at :1286016778449. Total time:20680 ms.
Started ( Splitting up to 500000 lines per file ) at :1286016780335
Finished ( Splitting up to 500000 lines per file ) at :1286016845797. Total time:65462 ms.
[1] + Done nohup ./split.sh > split.txt &

It looks as if splitting takes longer as the number of lines per output file increases. Is this always true? I do not think so:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 1907874429 Oct 2 13:51 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 6335
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018690329
Finished ( Splitting up to 5000 lines per file ) at :1286018699098. Total time:8769 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018699109
Finished ( Splitting up to 50000 lines per file ) at :1286018706011. Total time:6902 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018706022
Finished ( Splitting up to 500000 lines per file ) at :1286018731639. Total time:25617 ms.
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5900
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018566979
Finished ( Splitting up to 5000 lines per file ) at :1286018577839. Total time:10860 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018577850
Finished ( Splitting up to 50000 lines per file ) at :1286018584919. Total time:7069 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018587039
Finished ( Splitting up to 500000 lines per file ) at :1286018607979. Total time:20940 ms.
[1] + Done nohup ./split.sh > split.txt &
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>

However, when the file being split is smaller, the results are more stable:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 41074528 Oct 2 13:53 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5142
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018340006
Finished ( Splitting up to 5000 lines per file ) at :1286018340174. Total time:168 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018340186
Finished ( Splitting up to 50000 lines per file ) at :1286018340344. Total time:158 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018340355
Finished ( Splitting up to 500000 lines per file ) at :1286018340499. Total time:144 ms.
[1] + Done nohup ./split.sh > split.txt &
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 246447168 Oct 2 14:20 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5650
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018462209
Finished ( Splitting up to 5000 lines per file ) at :1286018463166. Total time:957 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018463180
Finished ( Splitting up to 50000 lines per file ) at :1286018464039. Total time:859 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018464054
Finished ( Splitting up to 500000 lines per file ) at :1286018464887. Total time:833 ms.

As a result, no single line count can be called the best for splitting; it depends on many factors. If you use the split utility, you should test with your own data to find the setting that minimizes splitting time.
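
Since the timings above vary from run to run, it may help to repeat each measurement several times before comparing settings. The following is a minimal sketch, not the script used for the tests in this post: `now_ms` and `bench_split` are hypothetical helper names, and it assumes GNU date (for `%N`) and GNU split (for `-d`).

```shell
#!/bin/sh
# Sketch: time the same split several times to see run-to-run variation.
# Assumes GNU date (%N) and GNU split (-d); helper names are hypothetical.

now_ms() {
    # Milliseconds since the epoch; %N (nanoseconds) is a GNU date extension.
    echo $(( $(date +%s%N) / 1000000 ))
}

bench_split() {
    input=$1    # file to split
    lines=$2    # maximum lines per output file
    runs=$3     # how many times to repeat the measurement

    workdir=$(mktemp -d)
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(now_ms)
        split -a 5 -d -l "$lines" "$input" "$workdir/chunk"
        end=$(now_ms)
        echo "lines=$lines run=$i elapsed_ms=$((end - start))"
        # Remove the chunks so every run starts from an empty directory.
        rm -f "$workdir"/chunk*
        i=$((i + 1))
    done
    rmdir "$workdir"
}
```

For example, `bench_split output.txt 50000 5` would print five timings for the 50000-lines setting, which makes it easier to judge whether a difference between two settings is larger than the variation between runs.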

These tests were performed on Linux 2.6.18-194.el5 on a Sun X4150 with eight processors and 16 GB of memory.

By the way, split.sh is below:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> cat split.sh
#!/bin/ksh

# time1: latest timestamp, time2: start of the current run, time3: elapsed time (all in ms).
time1=0;
time2=0;
time3=0;
getTime()
{
# Store the current time, in milliseconds since the epoch, in time1.
time1=`perl -MTime::HiRes -e 'print int(1000 * Time::HiRes::gettimeofday),"\n"'`
}

# Run 1: at most 5000 lines per output file.
getTime
time2=$time1;
echo "Started ( Splitting up to 5000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 5000 output5K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 5000 lines per file ) at :$time1. Total time:$time3 ms."

# Run 2: at most 50000 lines per output file.
getTime
time2=$time1;
echo "Started ( Splitting up to 50000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 50000 output50K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 50000 lines per file ) at :$time1. Total time:$time3 ms."

# Run 3: at most 500000 lines per output file.
getTime
time2=$time1;
echo "Started ( Splitting up to 500000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 500000 output500K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 500000 lines per file ) at :$time1. Total time:$time3 ms."
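
To make the split flags used in split.sh easier to follow, here is a small sketch that applies the same flags to a tiny generated file. The file and prefix names (`sample.txt`, `sample5L`) are made up for the illustration, and `-d` assumes GNU split.

```shell
#!/bin/sh
# Sketch: the split flags from split.sh, applied to a 12-line sample file.
demo_dir=$(mktemp -d)
seq 1 12 > "$demo_dir/sample.txt"    # 12 lines containing 1..12

# -l 5 : at most 5 lines per output file
# -a 5 : output file suffixes are 5 characters long
# -d   : numeric suffixes (00000, 00001, ...) instead of aa, ab, ... (GNU split)
split -a 5 -d -l 5 "$demo_dir/sample.txt" "$demo_dir/sample5L"

# Lists sample.txt plus sample5L00000, sample5L00001 and sample5L00002.
ls "$demo_dir"

# The first two chunks hold 5 lines each, the last one the remaining 2.
wc -l "$demo_dir"/sample5L*
```

With `-a 5` and `-d` there is room for up to 100000 output files per prefix, which matters for the 5000-lines case on a large input.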
