Split Utility Performance on Linux

Processing huge files is generally a hard problem, and parallel processing is a common way to tackle it. There are several ways to process files in parallel. I used the MPI API for this when I was taking a Computer Hardware class, and it offers many methods for the job. The basic idea is that one node splits the file up among all of the nodes; each node is then responsible for processing its own share of the data and, once it has finished, informs the coordinator process, as in MPI. For detailed information, see the LAM-MPI User Guide.

Besides the MPI API, we can use the Unix split utility to split a file into pieces and then let each process take one piece and work on it. Because the processes read their files at the same time, the total processing time can be reduced.
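The idea of splitting and then processing the pieces side by side can be sketched in a few lines of shell. This is only an illustration under my own assumptions: the sample input, the chunk_ prefix and the per-piece work (counting lines with wc) are placeholders, not part of the tests described in this post.

```shell
#!/bin/sh
# Sketch: split a file into pieces and process the pieces in parallel.
# File names and the per-piece work (wc -l) are illustrative placeholders.

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Build a small sample input of 10 lines.
input="$workdir/input.txt"
seq 1 10 > "$input"

# Split into pieces of at most 4 lines each: chunk_aa, chunk_ab, chunk_ac.
split -l 4 "$input" "$workdir/chunk_"

# Start one background job per piece; each job writes its own result file.
for chunk in "$workdir"/chunk_*; do
    wc -l < "$chunk" > "$chunk.count" &
done

# Block until every background job has finished.
wait

# Combine the partial results.
total=0
for result in "$workdir"/chunk_*.count; do
    count=$(cat "$result")
    total=$((total + count))
done
echo "lines processed: $total"
```

Each background job reads only its own piece, so the jobs do not have to coordinate with each other; the wait call plays the role of the coordinator that collects the results.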

In this post I want to describe the Linux/Unix split utility with a demo. In the first scenario we have a file of roughly 2 GB ( output.txt ) and we are going to split it into pieces of various sizes.

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 1907874429 Oct 2 13:51 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 31883
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286016749781
Finished ( Splitting up to 5000 lines per file ) at :1286016757756. Total time:7975 ms.
Started ( Splitting up to 50000 lines per file ) at :1286016757769
Finished ( Splitting up to 50000 lines per file ) at :1286016778449. Total time:20680 ms.
Started ( Splitting up to 500000 lines per file ) at :1286016780335
Finished ( Splitting up to 500000 lines per file ) at :1286016845797. Total time:65462 ms.
[1] + Done nohup ./split.sh > split.txt &

At first sight it looks as if the splitting time grows as the number of lines per output file grows. Is this always true? I do not think so:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 1907874429 Oct 2 13:51 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 6335
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018690329
Finished ( Splitting up to 5000 lines per file ) at :1286018699098. Total time:8769 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018699109
Finished ( Splitting up to 50000 lines per file ) at :1286018706011. Total time:6902 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018706022
Finished ( Splitting up to 500000 lines per file ) at :1286018731639. Total time:25617 ms.
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5900
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018566979
Finished ( Splitting up to 5000 lines per file ) at :1286018577839. Total time:10860 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018577850
Finished ( Splitting up to 50000 lines per file ) at :1286018584919. Total time:7069 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018587039
Finished ( Splitting up to 500000 lines per file ) at :1286018607979. Total time:20940 ms.
[1] + Done nohup ./split.sh > split.txt &
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>

However, when the file being split is smaller, the results are more stable:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 41074528 Oct 2 13:53 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5142
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018340006
Finished ( Splitting up to 5000 lines per file ) at :1286018340174. Total time:168 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018340186
Finished ( Splitting up to 50000 lines per file ) at :1286018340344. Total time:158 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018340355
Finished ( Splitting up to 500000 lines per file ) at :1286018340499. Total time:144 ms.
[1] + Done nohup ./split.sh > split.txt &
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> ls -la output.txt
-rw-r--r-- 1 oracle oinstall 246447168 Oct 2 14:20 output.txt
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test>
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> nohup ./split.sh > split.txt &
[1] 5650
[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> tail -f split.txt
Started ( Splitting up to 5000 lines per file ) at :1286018462209
Finished ( Splitting up to 5000 lines per file ) at :1286018463166. Total time:957 ms.
Started ( Splitting up to 50000 lines per file ) at :1286018463180
Finished ( Splitting up to 50000 lines per file ) at :1286018464039. Total time:859 ms.
Started ( Splitting up to 500000 lines per file ) at :1286018464054
Finished ( Splitting up to 500000 lines per file ) at :1286018464887. Total time:833 ms.

In conclusion, no single line count can be called the best for splitting; it depends on many factors. If you use the split utility, you should test with your own data to find the setting that reduces the splitting time.
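One way to repeat such a test several times per setting is sketched below. It is not the script I used for the measurements in this post. It assumes GNU coreutils (split -d and date +%N), and the helper name time_split, the sample file and the line counts are made up for the example.

```shell
#!/bin/sh
# Sketch: time split for several lines-per-file values, a few runs each.
# Assumes GNU coreutils (split -d, date +%N). All names are placeholders.

# time_split FILE LINES PREFIX: runs split and prints elapsed milliseconds.
time_split() {
    start_ns=$(date +%s%N)
    split -a 5 -d -l "$2" "$1" "$3"
    end_ns=$(date +%s%N)
    echo $(( (end_ns - start_ns) / 1000000 ))
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Small stand-in for output.txt; point this at your own file instead.
seq 1 20000 > "$workdir/sample.txt"

for lines in 500 5000; do
    for run in 1 2 3; do
        ms=$(time_split "$workdir/sample.txt" "$lines" "$workdir/out_${lines}_")
        echo "lines=$lines run=$run elapsed=${ms} ms"
    done
done
```

Running every setting more than once makes it easier to see how much of the difference comes from the setting and how much from other activity on the machine.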

These tests were performed on Linux 2.6.18-194.el5 on a SUN X4150 with eight processors and 16 GB of memory.

By the way, split.sh is shown below:

[oracle@galatats01.no.turkcell.tgc]:/home/oracle/split_test> cat split.sh
#!/bin/ksh

time1=0;
time2=0;
time3=0;

# Store the current time, in milliseconds since the epoch, in time1.
getTime()
{
time1=`perl -MTime::HiRes -e 'print int(1000 * Time::HiRes::gettimeofday),"\n"'`
}

# 5000 lines per output file
getTime
time2=$time1;
echo "Started ( Splitting up to 5000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 5000 output5K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 5000 lines per file ) at :$time1. Total time:$time3 ms."

# 50000 lines per output file
getTime
time2=$time1;
echo "Started ( Splitting up to 50000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 50000 output50K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 50000 lines per file ) at :$time1. Total time:$time3 ms."

# 500000 lines per output file
getTime
time2=$time1;
echo "Started ( Splitting up to 500000 lines per file ) at :$time1"
split -a 5 -d output.txt -l 500000 output500K
getTime
time3=`echo "$time1 - $time2" | bc`
echo "Finished ( Splitting up to 500000 lines per file ) at :$time1. Total time:$time3 ms."

Running Shell Script in PL/SQL on a RAC Environment

Sometimes we need to run a shell script from PL/SQL code. To accomplish this we have two options:

1 ) Create a Java stored class that runs the shell script with the Java Runtime class.
2 ) Use the Oracle-supplied DBMS_SCHEDULER package.

If you choose the second option and work in a RAC environment, you should be aware of the points listed below:

1 ) The Unix user who installed the Oracle Database ( generally oracle ) must have permission to execute the shell script. This is required whether you use Oracle RAC or not.
2 ) In the shell script, use the full path of each Unix utility instead of its bare name. For example, if the script uses the "echo" utility and you do not explicitly source a profile file or set environment variables, you should give its exact path, such as "/bin/echo".
3 ) The shell script that Oracle DB will run must be placed on all of the nodes, at the same path and with the same permissions, because you do not know which node will execute it.
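To illustrate the second point, here is a minimal sketch of a script that does not rely on PATH at all. It is a hypothetical stand-in and not the real demo.sh, whose contents are not shown in this post; the OUT variable and its default location are my own assumptions.

```shell
#!/bin/sh
# Hypothetical script in the spirit of demo.sh (the real one is not shown).
# A job started by the scheduler may not load any profile file,
# so every utility is called through its full path.

ECHO=/bin/echo
DATE=/bin/date

# Output location is an assumption; override OUT to write somewhere else.
OUT="${OUT:-/tmp/demo_output.txt}"

"$ECHO" "Run at $("$DATE" '+%Y-%m-%d %H:%M:%S')" >> "$OUT"
```

Keeping the paths in variables at the top makes it easy to adjust them if a utility lives somewhere else on one of the nodes.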
Assume that you have 2 nodes and that your shell script is located at /home/oracle/demoSh/demo.sh . The instance names that correspond to the nodes are listed below:

node1 : galatats1
node2 : galatats2

By the way, you can find the instance name of your session by querying the v$instance system view.

On instance2 :

[oracle@galatats02.no.turkcell.tgc]:/home/oracle/demoSh> ls -la
total 20
drwxr-xr-x 2 oracle oinstall 4096 Sep 26 01:49 .
drwx------ 6 oracle 501 4096 Sep 26 01:49 ..
-rwxr-xr-x 1 oracle oinstall 1239 Sep 26 01:49 demo.sh
-rw-r--r-- 1 oracle oinstall 20 Sep 26 01:49 output.txt

In order to execute demo.sh from PL/SQL, we create a DBMS_SCHEDULER job like this:

BEGIN
  dbms_scheduler.create_job(
    job_name => 'DEMO',
    job_type => 'EXECUTABLE',
    job_action => '/home/oracle/demoSh/demo.sh',
    start_date => SYSTIMESTAMP,
    number_of_arguments => 0,
    enabled => true,
    auto_drop => true,
    comments => 'Demo');
END;

As soon as the job is created, it starts to run. As soon as it completes, it is dropped automatically, because auto_drop was set to true when the job was created. ( By the way, you do not have to do this. ) How can you check the status of a job that has already been dropped? You can use the USER_SCHEDULER_JOB_RUN_DETAILS view to check the status of every scheduler job run. In a RAC environment you can also see the node on which the job was executed.

If you rename the demo.sh file on node1, the job we created above will succeed only some of the time, depending on which node runs it. If you check the status column of user_scheduler_job_run_details, you will see both successful and failed runs of your job. In a FAILED row you may see this in the ADDITIONAL_INFO column:

ORA-27369: job of type EXECUTABLE failed with exit code: No such file or directory

So, in order to run a shell script successfully via PL/SQL, you have to follow some rules. In this article I tried to explain what they are.

jSch — SSH Api for Java Applications — Ssh Port Forwarding in Java

If you write a Java network application, you may need to open an SSH connection to a remote host for security reasons. To open an SSH connection you need some parameters; the host address, user name and password are the most important of them. To avoid typing the password by hand, you can use a "send / expect" tool, or you can use the jSch API that I am going to describe.

SshHandler is the main class of this example; it works as a wrapper around the API and provides a method that takes the user name and password. Awaker is my application class; it reads a file that holds the connection information, one connection per line.

An example line is:

nxxast01;turkcell;xyz;110.15.122.20;9911;127.0.0.1;9999

The first field is the name of the machine, the second is the user name, the third is the password, the fourth is the remote host address, the fifth is the local port, the sixth is the host that the remote machine forwards the traffic to, and the seventh is the port on that host. I mention this to make the code easier to understand ( by the way, I am not a Java expert :) ).

It means that connections to localhost:9911 are carried over SSH to the machine 110.15.122.20, which then connects to 127.0.0.1:9999 on its own side and passes the data back and forth
( SSH Port Forwarding )
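For comparison, the same example line can be turned into the equivalent OpenSSH command with local forwarding. This is only a sketch to make the field order concrete; the variable names are my own, and the password field is read but not used, because ssh does not take a password on the command line.

```shell
#!/bin/sh
# Sketch: map the seven fields of a location line to an "ssh -L" command.
# Variable names are illustrative; the line is the example from this post.

line='nxxast01;turkcell;xyz;110.15.122.20;9911;127.0.0.1;9999'

# Split the line on ';' into its seven fields.
IFS=';' read -r name user password host local_port remote_host remote_port <<EOF
$line
EOF

# -L local_port:remote_host:remote_port forwards the local port through host.
# -N tells ssh not to run a remote command, only to forward.
ssh_cmd="ssh -N -L ${local_port}:${remote_host}:${remote_port} ${user}@${host}"
echo "$ssh_cmd"
```

The three values after -L are in the same order as the three arguments that the Java code passes to setPortForwardingL.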

SshHandler Class :

package nor;

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.JSchException;
import com.jcraft.jsch.Session;
import com.jcraft.jsch.UserInfo;

/**
 *
 * @author TTASUNGUR
 */
public class SshHandler {
    JSch jsch = new JSch();

    // Opens an SSH session to ps_host and forwards ps_localport on this machine
    // to ps_remotehost:ps_remoteport as seen from ps_host.
    public void openSshConnection(String ps_user, String ps_password, String ps_host, String ps_localport, String ps_remotehost, String ps_remoteport) throws SshConnectionException
    {
        try
        {
            // the password is deliberately not printed
            System.out.println(Global.info_prefix+"It will be trying to connect remote host : Username : "+ps_user+", Password : , Remote Host : "+ps_host+", Localport : "+ps_localport+", Remote : "+ps_remotehost+", RemotePort : "+ps_remoteport);
            Session session = jsch.getSession(ps_user, ps_host, Global.sshport);
            UserInfo ui = new MyUserInfo(ps_password);
            session.setUserInfo(ui);
            session.connect();
            int assigned_port = session.setPortForwardingL(Integer.parseInt(ps_localport), ps_remotehost, Integer.parseInt(ps_remoteport));
            System.out.println(Global.info_prefix+"Ssh connection is established for :"+ps_host+". Localport:"+assigned_port+":"+ps_remotehost+":"+ps_remoteport);
        }
        catch (JSchException ex)
        {
            System.err.println(Global.error_prefix+"Ssh connection could not be established for :"+ps_host+", due to :"+ex.getMessage());
            ex.printStackTrace();
            throw new SshConnectionException(Global.error_prefix+"Ssh connection could not be established for :"+ps_host+", due to :"+ex.getMessage());
        }
    }

    // Supplies the password to jSch so that nobody has to type it.
    public static class MyUserInfo implements UserInfo
    {
        String passwd;

        public MyUserInfo(String password)
        {
            this.passwd = password;
        }

        public String getPassword(){ return passwd; }

        // Answers yes to every question jSch asks, including the host key check.
        public boolean promptYesNo(String str){
            System.out.println(str+"promptYesNo");
            return true;
        }

        public String getPassphrase(){ return null; }
        public boolean promptPassphrase(String message){ return true; }

        // Returning true tells jSch to go on and call getPassword().
        public boolean promptPassword(String message){
            return true;
        }

        public void showMessage(String message){
            System.out.println(message);
        }

        public String[] promptKeyboardInteractive(String destination,
                                                  String name,
                                                  String instruction,
                                                  String[] prompt,
                                                  boolean[] echo){
            return new String[3];
        }
    }
}

Awaker Class :
( the important line is the one that starts with ssh.openSshConnection... )

package nor;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 *
 * @author TTASUNGUR
 */
public class Awaker {

    ExecutorService threadExecutor;
    String s_location_file_name = "rtdf_location_ips.rtdf";
    SshHandler ssh = new SshHandler();
    String sentinal_files_path;

    public Awaker(String sentinal_files_path)
    {
        this.sentinal_files_path = sentinal_files_path;
    }

    // Reads the location file and opens one SSH connection per line.
    public void setStreamLocations(String pstr_location_file_name)
    {
        FileReader l_fr = null;
        String ls_temp = "", ls_temp_arr[];

        try
        {
            File l_f1 = new File(pstr_location_file_name);
            l_fr = new FileReader(l_f1);
            BufferedReader br = new BufferedReader(l_fr);

            try
            {
                // the first line is metadata, so it is skipped
                ls_temp = br.readLine();
            }
            catch (IOException ex)
            {
                ex.printStackTrace();
            }

            while (ls_temp != null)
            {
                try
                {
                    ls_temp = br.readLine();
                    if (ls_temp == null)
                    {
                        // end of file: stop before split() is called on null
                        break;
                    }
                    System.out.println(Global.info_prefix+"Line was read : "+ls_temp);
                    ls_temp_arr = ls_temp.split(Global.location_file_splitter);

                    if (ls_temp_arr.length != Global.numberof_location_file_fields)
                    {
                        System.out.println(Global.error_prefix+"This line has not "+Global.numberof_location_file_fields+" fields");
                    }
                    else
                    {
                        try
                        {
                            // fields: name;user;password;host;local port;remote host;remote port
                            ssh.openSshConnection(ls_temp_arr[1], ls_temp_arr[2], ls_temp_arr[3], ls_temp_arr[4], ls_temp_arr[5], ls_temp_arr[6]);
                            //StreamHandler sh1 = new StreamHandler(ls_temp_arr[0]+"("+ls_temp_arr[3]+")Local Port:"+ls_temp_arr[4],Global.localhost,Integer.parseInt(ls_temp_arr[4]));
                            //threadExecutor.submit(sh1);
                        }
                        catch (SshConnectionException ex)
                        {
                            System.out.println(ex.getMessage());
                        }
                    }
                }
                catch (IOException ex)
                {
                    System.out.println(Global.error_prefix+"The line could not be read!");
                    // stop here, otherwise the loop could keep failing forever
                    break;
                }
            }
        }
        catch (FileNotFoundException ex)
        {
            System.out.println(Global.error_prefix+"Location file is not found : "+pstr_location_file_name);
            ex.printStackTrace();
        }
        finally
        {
            try
            {
                // l_fr is still null if the file could not be opened
                if (l_fr != null)
                {
                    l_fr.close();
                }
            }
            catch (IOException ex)
            {
                ex.printStackTrace();
            }
        }
    }

    public void startThreads()
    {
        threadExecutor = Executors.newCachedThreadPool();

        FileHandler fh;
        fh = new FileHandler(sentinal_files_path);
        fh.run();

        // stream locations will be set
        setStreamLocations(s_location_file_name);

        threadExecutor.shutdown(); // nonsense code
        if (threadExecutor.isTerminated() && threadExecutor.isShutdown())
        {
            // this message should never be displayed,
            // because the application is not expected to finish.
            System.out.println("Info : Application has been finished.");
        }
    }

}

The jSch API can be downloaded from the JCraft project site.
