Vertica Parallel Load

Vertica has two types of storage area: WOS (Write Optimized Store) and ROS (Read Optimized Store). WOS holds data in memory, while ROS holds it on disk. If you run the COPY command without the DIRECT hint, data is first loaded into WOS and the Tuple Mover later moves it to ROS. When loading big files or streams, you can bypass WOS with the DIRECT hint, which improves load performance.

Actually, this post is not about the DIRECT hint; I want to talk about parallel file loading.

Suppose you have a big file, more than 100 GB in size, and you want to load it into Vertica with as much parallelism as possible. Where can you parallelize? In two places: file-reading parallelism and loading parallelism.

For loading parallelism, you can modify the resource pool configuration to increase or decrease the number of execution threads with the EXECUTIONPARALLELISM parameter. Setting it to 1 makes each query run on a single thread, which reduces response times when you have many small queries, but it leads to long response times for long-running queries; for those you should increase the value.
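As a sketch of changing that parameter, assuming a hypothetical pool named load_pool (EXECUTIONPARALLELISM itself is a real resource pool parameter; the guard lets the script degrade gracefully on machines without the vsql client):

```shell
# Write the tuning statement; "load_pool" is a hypothetical pool name.
echo "ALTER RESOURCE POOL load_pool EXECUTIONPARALLELISM 8;" > tune_pool.sql

# Apply it only where the vsql client is actually installed.
if command -v vsql >/dev/null 2>&1; then
    vsql -f tune_pool.sql
fi
```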

The other method, file-reading parallelism, is the main subject of this post. If you have multiple machines and one big file, then by default only one machine reads the file during the load:

COPY customers FROM LOCAL
'/data/customer.dat'
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers';

But if you have 4 machines and you split that file into 4 pieces (with the Linux split command, then distribute the pieces across the nodes), the load finishes much faster than with the previous method. With this method you must also specify the node name for each file:

COPY customers FROM 
'/data/customer.dat' ON v_myv_node0001,
'/data/customer.dat' ON v_myv_node0002,
'/data/customer.dat' ON v_myv_node0003,
'/data/customer.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers';
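The splitting step mentioned above might look like the following sketch. A small generated file stands in for the real 100 GB export, and the file names are assumptions; the distribution step is shown only as a comment because it depends on your hosts:

```shell
# Build a small stand-in for the real export (same '~' delimiter).
seq 1 1000 | awk '{print $1 "~customer_" $1}' > customer_full.dat

# Split into 4 line-based chunks: customer.00 .. customer.03 (GNU split).
split -n l/4 -d customer_full.dat customer.

# Each chunk would then be shipped to one node, for example:
#   scp customer.00 dbadmin@v_myv_node0001:/data/customer.dat
wc -l customer.0*
```

Line-based splitting (`-n l/4`) matters here: a plain byte split would cut a record in half at each boundary and produce rejected rows on load.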

I have also found that putting multiple files on each node loads faster still. In the example above we have a 100 GB customer file split into 4 pieces of 25 GB each. If you split each piece again on its node and issue separate COPY commands, you can see the difference:

-- COPY STREAM 1
COPY customers FROM 
'/data/customer1.dat' ON v_myv_node0001,
'/data/customer1.dat' ON v_myv_node0002,
'/data/customer1.dat' ON v_myv_node0003,
'/data/customer1.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers_1';

-- COPY STREAM 2
COPY customers FROM 
'/data/customer2.dat' ON v_myv_node0001,
'/data/customer2.dat' ON v_myv_node0002,
'/data/customer2.dat' ON v_myv_node0003,
'/data/customer2.dat' ON v_myv_node0004
DELIMITER '~'
REJECTMAX 1000
EXCEPTIONS '/home/dbadmin/ej'
REJECTED DATA '/home/dbadmin/rj'
ENFORCELENGTH
DIRECT
STREAM NAME 'stream_customers_2';
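The two streams only overlap if they run concurrently in separate sessions; issued one after another in a single session they would serialize. A minimal sketch (statements abbreviated to one node; the vsql guard lets the script run even where the client is absent):

```shell
# Write each COPY statement to its own script (abbreviated here).
cat > stream1.sql <<'EOF'
COPY customers FROM '/data/customer1.dat' ON v_myv_node0001
DELIMITER '~' DIRECT STREAM NAME 'stream_customers_1';
EOF
cat > stream2.sql <<'EOF'
COPY customers FROM '/data/customer2.dat' ON v_myv_node0001
DELIMITER '~' DIRECT STREAM NAME 'stream_customers_2';
EOF

# Launch both loads in parallel sessions and wait for both to finish.
if command -v vsql >/dev/null 2>&1; then
    vsql -f stream1.sql &
    vsql -f stream2.sql &
    wait
fi
```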

The Book: CentOS System Administration Essentials

Packt Publishing, a UK-based tech book publisher, recently published this book, and I am proud to have been one of its reviewers.

CentOS System Administration Essentials


This book is a guide for administrators and for developers who build applications that run on CentOS. It covers many subjects that both groups need to know to work with Linux, and it also has some bonus topics such as installing LDAP, Nginx, Puppet, etc.

The book is not aimed at Linux newcomers; there are many other books better suited to that purpose.

This is also the second book for which I served as a reviewer. The first was Getting Started with Oracle Event Processing 11g.


Memorize Words

I have developed a simple GUI application, written in Python and wxPython, to make memorizing words easier. It builds a dictionary from any delimited text file and then quizzes you on words from that dictionary. The application saves the correct/incorrect count for each word and prioritizes the words you have not learned yet. In future versions I want to improve the question engine so that it picks words more intelligently.

The application ships with two built-in dictionaries: English-Turkish and Spanish-English.
The current version is a release candidate rather than a final release.

This application is really simple and small and has not been tested comprehensively. Feature requests and bug reports would be highly appreciated.

Github repo : https://github.com/afsungur/MemWord

Other explanations:
https://github.com/afsungur/MemWord/blob/master/README.md


Java Pipes vs. Sockets in a Single JVM

In a recent project I needed to implement an InputStream-based data path in Java. There are many ways to achieve this, and in this blog post I will compare the performance of just two of them: sockets and pipes (PipedInputStream and PipedOutputStream).

The scenario is really simple: for each setup, I send 10 million and then 100 million messages from a producer to a consumer. Each message is just the sentence “This is a message sent from Producer”.

I ran each scenario 3 times and took the average, to smooth out OS and environment factors (other processes, etc.). I also used a PrintStream as the writer and a BufferedReader as the reader in both tests.

For pipes, the buffer size is set by passing an int to the PipedInputStream constructor; for sockets, the buffer is set via the BufferedReader constructor. The buffer value for this test was 1024*1024*8 in both scenarios.

Results:

           Pipe     Socket
10M msg    12 sec   26 sec
100M msg   72 sec   239 sec

The main code of the test:

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeAndSocketTester {

	public static void main(String[] args) throws IOException {
		boolean autoFlush = false;
		long messageNumber = 10000000;

		// Pipe scenario: the reader side gets a 10 MB buffer.
		PipedOutputStream pos = new PipedOutputStream();
		PipedInputStream pis = new PipedInputStream(pos, 1024 * 1024 * 10);

		// PipeWriter/PipeReader are simple Thread subclasses (not shown)
		// that push/pull messageNumber messages through the pipe.
		new PipeWriter(pos, messageNumber, autoFlush).start();
		new PipeReader(pis).start();

		// After the pipe run finished, I removed the pipe code above
		// and ran the socket scenario below instead.
		new SocketWriter(4544, messageNumber).start();
		new SocketReader(4544, 1024 * 1024 * 10).start();
	}
}

Oracle Event Processing – Pattern Matching Example 2

In the previous post I showed one of the pattern matching features of Oracle Event Processing, and in this one I will show another example of it.

There are many built-in functions that can be used in CQL, and I think one of the most important is the prev function, which gives easy access to previous elements in the same stream or partition.

Consider this scenario: we have the stock values below, and I would like to detect the following pattern. First, the average of the previous 3 values must be greater than 10; second, the next value must be greater than the last value matched by the first condition.

The CQL code for this requirement looks like the following:

CQL Example

Let's walk through how the pattern is matched with actual numbers (left side: incoming events; right side: the pattern-matching expression):

Events and patterns

 

Run the application:

output

As shown above, after the values 13 and 33 arrived, the pattern was matched and an output event was produced.

 

 

Oracle Event Processing – Pattern Matching Example

Event processing technologies have become popular in recent years. Companies have realized that they need to take action in real time, whether to satisfy customer requirements, to handle component issues in internal systems, for advertising, etc.

Oracle Event Processing offers several methods for processing incoming real-time data. In this blog post I will show one of its pattern matching operations.

Suppose you want to find a stock that first increases, then decreases three times, and then increases again. For example:

6.48;6.47;6.46;6.48;6.49;6.48;6.47;6.46;6.47;6.46;6.45;6.44;6.43;6.42;6.43

In the numbers above, the matching subsequence starts at 6.48 and ends at 6.47: the price first increased from 6.48 to 6.49, then decreased three times (6.48, 6.47, 6.46), then increased from 6.46 to 6.47 again.

I am using Oracle Event Processing 11.1.1.7 for Windows 64-bit, with Cygwin as the Windows terminal.

I modified the default HelloWorld CEP application slightly. The crucial part is the CQL processor: how should the CQL be written to detect this pattern?

OCEP Cql Processor Pattern Matching


Let's look briefly at each clause.

StockProcessor is the name of the CQL processor.
PARTITION BY shortName: partitions the incoming data by shortName (the CQL engine evaluates each partition independently; for example, ORCL, IBM, and GOOG prices are computed separately).
MEASURES: which values/fields appear in the output.
PATTERN: the order of the conditions (see regular expressions for more details).
DEFINE: defines the conditions; "A as" means "define condition A".
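Since the actual processor definition appears only as a screenshot, here is a hedged reconstruction of what such a query might look like in Oracle CQL's MATCH_RECOGNIZE form. The stream, attribute, and correlation names are assumptions, not the exact code from the image:

```sql
SELECT T.shortName, T.lastPrice
FROM stockStream MATCH_RECOGNIZE (
    PARTITION BY shortName
    MEASURES Z.shortName AS shortName, Z.price AS lastPrice
    PATTERN (A B B B Z)
    DEFINE
        A AS A.price > PREV(A.price),
        B AS B.price < PREV(B.price),
        Z AS Z.price > PREV(Z.price)
) AS T
```

Condition A matches the initial increase, the three Bs in the PATTERN clause match the three consecutive decreases, and Z matches the final increase.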

Run the application and type the stock name and prices as shown below:

OCEP Cql Processor Pattern Matching Demo


As soon as the pattern is matched, an output event is produced; in this example I simply print the name and last price of the stock.

I did not take any performance measurements here, but as the number of distinct stocks grows, performance becomes the first concern. I would like to share some performance results in another blog post later.

The Book: Getting Started with Oracle Event Processing 11g

Packt Publishing, a UK-based tech book publisher, will release a much-anticipated book in a few days.

This is also the first book for which I served as a reviewer:

Getting Started with Oracle Event Processing 11g


One of the authors is Robin J. Smith, a great and very experienced IT professional whom I met at a lunch in Istanbul. If you are looking for a book that explains event processing concepts with well-designed examples and goes into real detail on the Oracle Event Processing 11g product, this book is definitely for you.
