Overview

The datawarehouse usually takes each day data as tables or files from several primary systems and loads them into one relational database. Then data are basicaly purified and processed using various mappings (process) into another targets which are usually tables, files or reports. Each mapping can run after its sources contain prepared data using some other processes. And it is not rare if the data from primary systems are prepared in some days in very different time than in other days. So the order of running of processes is not given and should be found out in runtime according to the dependencies between processes and their sources. And also the final targets can be faster loaded when processes can run in parallel.

This procedure as described above is possible using DwWorkflow.

Workflow Example

I'll show this concept on a very simple example. Each day one file is created from a primary system and this file is merged into a table. The file can be merged after the file from previous day is already merged. Then the system is composed of one procedure creating the file, one procedure merging into the table, one table and one file. These two objects and two procedures must be registered in the DwWorkflow server and the two dependencies must also be registered. To run the workflow each day it is necessary to have virtual object and mapping scheduler. It is shown on the following picture.

How this example is registered into the server? One way is to use the console.

	 saveObject('scheduler', True, 10000, 0)
	 saveObject('file', True, 10000, 0)
	 saveObject('table', True, 10000, 0)
	 saveMapping('scheduler', 'wf.mapimpls.DayScheduler', '', None, 'scheduler', 10000)
	 saveMapping('file', 'wf.mapimpls.DummyMappingImpl', None, None, 'file', 10000)
	 saveMapping('table', 'wf.mapimpls.DummyMappingImpl', None, None, 'table', 10000)
	 addDep('scheduler', 'table', 10000, -1)
	 addDep('file', 'scheduler', 10000, 0)
	 addDep('table', 'file', 10000, 0)

The other way is to use the eclipse plugin.

When we look at the commands in the console then only the first parameters are obvious - these parameters are names. To understand the other parameters we have to know more information about the used concept. Each object has state or states. The state is composed of two numbers. First number is n and it means time and the second number is s it means state at the given time. And there are two types of objects either with only one state or with many states. In the meantime we will use only the first option. Now we can understand the last three parameters in saveObject command. The True means an object with only one state, 10000 is s and this number is reserved for the final OK state and the last parameter 0 is n - it's the beginning.

To understand saveMapping it's necessary to know how processes run and a little about generating of mappings. Each process is run by the DwWorkflow server by running of a java class which implements a java interface. This java class is the second parameter. Some processes needs initial information which can be taken from the third parameter which is a xml code because the implemented java class is usually written as a general class. The fourth parameter is a class implementing a java interface which is used for generating of a code for mapping. For example we can have hundrends of files in the primary system with description in a configuration file. Then it is a good idea to generate for each file one procedure in some language according to the description in the configuration file and write a java class which will run these generated procedures. The fifth parameter is the target object and the sixth parameter is s into which the process shifts the target.

Code generating

In a datawarehouse there are usually several types of mappings and each mapping of the same type is very similar each to other. And therefore it is possible to generate the stuff which is always the same across the type and it allows the developer when he/she modifies or corrects or tunes the mapping to concentrate on the core funcionality. It is also possible to assemble the mapping in the runtime but this approach has a big drawback that you can not see the processed source code and therefore it is harder to debug than when you can immediately see the generated code.

Generator Example

One of the common mappings in datawarehouse is the merging mapping. It takes some data from some sources, transforms them into a form of the target and then it compares the transformed sources and the target and merges them into the target. One one how to write these mappings is to write a database view which transforms the sources and then this view is used in the generated mapping. For example we have two source tables and one target.

	create table source_a (a number, b number)
	create table source_b (c number, b number)
	create table target (c number, d number, last_update date)
	create view v_target as
	select a.a c, a.b + b.b d from source_a a inner join source_b b on a.a = b.c

The mappping can look if we use pseudo code as following:

      create procedure p_target is
      begin
         logging stuff p_target start
         setup p_target stuff

         merge into target o using
	 (select c, d, sysdate _d from v_target) n
	 on (o.c = n.c)
         when matched then
         update o.d = n.d, o.last_update = n._d
         when not matched then
         insert into (c, d, last_update) values(n.c, n.d, n._d)

	 logging stuff p_target end
         catch 
         rollback
	 logging stuff p_target error
      end

The DwWorkflow offers an easy template engine. The template for the example would look like:

      create procedure p_%table_name% is
      begin
         logging stuff p_%table_name% start
         setup p_%table_name% stuff

         merge into %table_name% o using
	 (select @cols@ %cols.name%, @cols@sysdate _d from %view_name%) n
	 on (@cols_no_pk@ o.%cols_no_pk.name% = n.%cols_no_pk.name% %cols_no_pk.and% @cols_no_pk@)
         when matched then
         update @cols_no_pk@ o.%cols_no_pk.name% = n.%cols_no_pk%, @cols_no_pk@ o.last_update = n._d
         when not matched then
         insert into (@cols@ %cols.name%, @cols@ last_update) values(@cols@ n.%cols.name%, @cols@ n._d)

	 logging stuff p_%table_name% end
         catch 
         rollback
	 logging stuff p_%table_name% error
      end

The template engine provides support for substituting of a simple variable and loop variables which can be nested. The simple variable is wrapped by % and loop variable are wrapped by @. The values of variables are filled by the generating class which is given as a paremeter into the saveMapping. The generating class in this example takes these parameters partly from the xml configuration and partly from the database system tables. The advantage arises if there are tables with several hundreds of columns. And also it is a common situation that it is needed during some phase of datawarehouse life to change some code. With generating it is only needed to change the template and regenarete all mappings. It is much easier and much less erroneous than to rewrite mappings by hand.

It is not necessary to use for generating the forementioned template engine. It is possible to use a different template engine.

Another examples

Another more complex examples are included in the distribution.