Vertica Parallel Load

As you may know, there are two types of storage area in Vertica: WOS and ROS. The WOS (Write Optimized Store) holds data in memory, while the ROS (Read Optimized Store) holds data on disk. If you run the COPY command without the DIRECT hint, data is first loaded into the WOS, and the Tuple Mover later moves it from the WOS to the ROS. When you load big files or streams, you can bypass the WOS with the DIRECT hint, which improves load performance.
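For example, here is a minimal sketch of the difference (the table and file names are just placeholders):

-- Without DIRECT: rows land in the WOS first; the Tuple Mover moves them to the ROS later
COPY customers FROM '/data/customer.dat' DELIMITER '~';

-- With DIRECT: rows are written straight to the ROS, better for big loads
COPY customers FROM '/data/customer.dat' DELIMITER '~' DIRECT;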

But the DIRECT hint is not what I want to focus on here; I want to talk about parallel file loading.

Suppose you have a big file, more than 100 GB in size, and you want to load it into Vertica with as much parallelism as possible. Where can you parallelize? In two places: file reading and loading.

For loading parallelism, you can modify the resource pool configuration to increase or decrease the number of worker threads with the EXECUTIONPARALLELISM parameter. If you set this value to 1, only one thread executes each query, which reduces response times when you run many small queries. For long-running queries, however, a single thread means long response times, so for those workloads you should increase the parameter value.
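As a sketch, assuming the load runs in the general pool (the pool name and the value 8 are only examples; pick values that fit your workload):

-- Check the current setting for each pool
SELECT name, executionparallelism FROM resource_pools;

-- Raise the thread count for the general pool
ALTER RESOURCE POOL general EXECUTIONPARALLELISM 8;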

The other kind, and the main subject of this post, is file reading parallelism. If you have multiple machines and one big file, then by default only one machine reads the file during the load:

COPY customers FROM LOCAL
'/data/customer.dat'
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers';

But if you have 4 machines and split the file into 4 pieces (with the Linux split command, for example GNU split's -n l/4 so that lines are not broken, then distribute the pieces across the nodes), you can load much faster than with the previous method. With this method you must also specify the node name for each file:

COPY customers FROM 
'/data/customer.dat' ON v_myv_node0001,
'/data/customer.dat' ON v_myv_node0002,
'/data/customer.dat' ON v_myv_node0003,
'/data/customer.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers';

I have also found that placing multiple files on each node loads much faster still. In the example above, we split the 100 GB customer file into 4 pieces, so each node holds 25 GB of data. If you split each piece again on its node and issue separate COPY commands, you can see the difference:

-- COPY STREAM 1
COPY customers FROM 
'/data/customer1.dat' ON v_myv_node0001,
'/data/customer1.dat' ON v_myv_node0002,
'/data/customer1.dat' ON v_myv_node0003,
'/data/customer1.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej1'
REJECTED DATA '/home/dbadmin/rj1'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers_1';

-- COPY STREAM 2
COPY customers FROM 
'/data/customer2.dat' ON v_myv_node0001,
'/data/customer2.dat' ON v_myv_node0002,
'/data/customer2.dat' ON v_myv_node0003,
'/data/customer2.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej2'
REJECTED DATA '/home/dbadmin/rj2'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers_2';
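Note that the two streams above use distinct stream names and distinct exceptions and rejected-data files, so the concurrent loads do not collide. While they run, you can watch their progress from another session through the load_streams system table (a sketch; check that these columns exist in your Vertica version):

-- Monitor the concurrent loads by stream name
SELECT stream_name,
       accepted_row_count,   -- rows parsed and accepted so far
       rejected_row_count,   -- rows that failed parsing
       read_bytes            -- bytes read from the source files
FROM load_streams
WHERE stream_name IN ('stream_customers_1', 'stream_customers_2');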