Friday, August 9, 2013

Update Strategy TF



Update Strategy Transformation:-

1)        Update Strategy TF: It is a connected and Active TF (the number of input rows need not equal the number of output rows)
2)        Update Strategy TF is used to insert, update, and delete records in the target table.
3)        US TF can be used only if the Target is a Relational Table.
4)        Update Strategy Expression:
We have to flag rows by assigning the constant values using the update strategy expression. The update strategy expression property is available in the properties tab of the update strategy transformation.
e.g., IIF (condition, DD_INSERT, DD_UPDATE)
Or IIF (condition, 0, 1)
The constants and their numeric equivalents for each database operation are listed below.
·         DD_INSERT: Numeric value is 0. Used for flagging the row as Insert.
·         DD_UPDATE: Numeric value is 1. Used for flagging the row as Update.
·         DD_DELETE: Numeric value is 2. Used for flagging the row as Delete.
·         DD_REJECT: Numeric value is 3. Used for flagging the row as Reject.

Key Points:
1)                    Update strategy works only when we have a primary key on the target table. If there is no primary key available on the target table, then we have to specify a primary key in the target definition in the mapping for the update strategy transformation to work.
2)                    For the row tags DD_DELETE and DD_UPDATE, the table definition in the mapping must have a key identified; otherwise the session created from that mapping will fail.
3)                    Rows tagged with DD_REJECT will be passed on to the next transformation or target and subsequently placed in the appropriate "bad file" if the "Forward Rejected Rows" attribute is "checked" (default). If the attribute is "un-checked", then reject rows will be skipped.


Performance Tuning on Update Strategy TF:
1)        Unless it is really required, don't go for the Update Strategy TF.
2)        We can also update the table at the session level (by setting "Treat source rows as" to Update) instead of using the US TF, if normal updates are required without any conditions.

Important Note on Designing Perspective:

Note: Since the USTF has performance issues, we avoid using it here.
Scenario: Need to update a relational table which has high-volume data.
Assumption 1: Both the source and target tables belong to the same DB.
Assumption 2: The source and target tables belong to different DBs.
Approach:
For Assumption 1, use a join query in the Source Qualifier.
For Assumption 2, use a Joiner TF to join the source and target, and then follow the same approach as above.

1. Use a database JOIN to identify the records to be updated.
2. Insert the records identified for UPDATE into a TEMP table (by using a Router TF).
3. Use post-session SQL on the TEMP table to update the target table (see the sketch below).
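
A minimal SQL sketch of this pattern, assuming hypothetical table and column names (SRC_CUSTOMER, TGT_CUSTOMER, TMP_CUSTOMER_UPD, CUST_ID, CUST_NAME); the actual join keys and columns depend on your schema:

-- Step 1: identify the changed records with a database join
-- (this query can go in the Source Qualifier SQL override)
SELECT s.CUST_ID, s.CUST_NAME
FROM   SRC_CUSTOMER s
JOIN   TGT_CUSTOMER t ON t.CUST_ID = s.CUST_ID
WHERE  s.CUST_NAME <> t.CUST_NAME;

-- Step 2: the mapping inserts these rows into TMP_CUSTOMER_UPD
-- (a plain insert, so no Update Strategy TF is needed)

-- Step 3: post-session SQL on the target connection
UPDATE TGT_CUSTOMER t
SET    t.CUST_NAME = (SELECT m.CUST_NAME
                      FROM   TMP_CUSTOMER_UPD m
                      WHERE  m.CUST_ID = t.CUST_ID)
WHERE  EXISTS (SELECT 1 FROM TMP_CUSTOMER_UPD m
               WHERE  m.CUST_ID = t.CUST_ID);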

Tuesday, August 6, 2013

Aggregator TF in Detail


Aggregator Transformation in Detail:-

1)        Aggregator TF: It is a connected and Active TF (the number of input rows need not equal the number of output rows)

2)        Incremental Aggregation(IA): (In the session Level Properties)
Incremental Aggregation is the process of capturing only the changes in the source and calculating the aggregations on them in a session. This makes the Integration Service update the target incrementally and avoids recalculating the aggregations on the entire source.
If the source changes after the first session execution, then in the next run the entire source data is not considered for the aggregation process; only the newly added source data is considered. This is explained in the section Understanding of IA below. This is useful when the impact on the target is incremental.

Note: We cannot use the sorted input option together with incremental aggregation (the records arriving as SORTED INPUT are pre-sorted only for that particular run, but the data in the historical cache used by IA may not be in sorted order).

3)        Reinitialize Aggregate Cache: (In the session Level Properties)
This treats the run as if the aggregation process is starting for the first time with the workflow session.
It removes all the previous session information related to the Aggregator. This is explained in the section Understanding of IA below.

4)        Sorted Input: (In the mapping, on the Aggregator transformation properties)
This option can be used to improve session performance. You can use it only when the input to the Aggregator transformation is sorted on the group by ports. The Integration Service assumes all the data is sorted by group and performs aggregate calculations as it reads the rows for a group.
The Integration Service then processes the rows in memory as they arrive and does not need to store all groups in the aggregate cache.

If you specify the sorted input option without actually sorting the data, then integration service fails the session.
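
As a sketch: if the Aggregator groups by DEPT_ID and JOB_TITLE (hypothetical column names), the data can be pre-sorted in exactly that order, e.g., in the Source Qualifier SQL override, so the Sorted Input option can be used safely:

-- Sort on the Aggregator's group by ports, in the same order
SELECT EMP_ID, DEPT_ID, JOB_TITLE, SALARY
FROM   EMPLOYEES
ORDER BY DEPT_ID, JOB_TITLE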

IMP: How the NULL values are handled in Aggregator?

You can configure the integration service to treat null values in aggregator functions as NULL or zero. By default the integration service treats null values as NULL in aggregate functions.

        Understanding of IA:
                        After the end of each session execution –
                                    


Notes:
·          If the mapping has been changed and saved, then the existing aggregate cache is deleted and a new aggregate cache is created.
·          If we configure the session with the option Reinitialize aggregate cache, then it deletes the old index and data files and creates new files.
·          On decreasing the number of partitions, or on moving/deleting the aggregate cache files, a new cache is created.

IS --> Integration Service: The Integration Service is nothing but the ETL process. The Integration Service starts one or more Integration Service processes to run and monitor workflows. When you run a workflow, the Integration Service process starts and locks the workflow, runs the workflow tasks, and starts the processes to run sessions. Informatica Integration Services let you access, understand, transform, and manage enterprise data assets. These services profile, cleanse, match, transform, move, federate, and monitor the quality of data, wherever and in whatever form it resides, to ensure that the enterprise receives data that is consistent, accurate, and timely.
Aggregated Cache:
The Power Center server stores key values in the Index cache and output values in the Data cache.
Both the index file and the data file are used to store historical data.
Index Cache (File) -->
The index cache holds group information from the group by ports.
All Group by Columns are in AGGREGATOR INDEX CACHE.
Index Cache (default size) - 12MB (12,000,000 bytes)
The value of the index cache can be increased up to 24MB.
Data Cache (File) -->
DATA CACHE is generally larger than the AGGREGATOR INDEX CACHE.
Columns in Data Cache:
  • Variable ports, if any
  • Non group by input/output ports
  • Non group by input ports used in non-aggregate output expressions
  • Ports containing aggregate functions
Data Cache (default size) - 24MB (24,000,000 bytes)

Nested Aggregate Functions:
You can nest one aggregate function within another aggregate function. You can either use single-level aggregate functions or multiple nested functions in an aggregate transformation. You cannot use both single-level and nested aggregate functions in an aggregator transformation. The Mapping designer marks the mapping as invalid if an aggregator transformation contains both single-level and nested aggregate functions. If you want to create both single-level and nested aggregate functions, create separate aggregate transformations.
Explanation: The nested aggregate function returns only one output row, whereas the single-level aggregate function returns more than one row. Since the numbers of rows returned are not the same, you cannot use both single-level and nested aggregate functions in the same transformation. If you include both single-level and nested functions in the same Aggregator, the Designer marks the mapping or mapplet as invalid, so you need to create separate Aggregator transformations.

Note: We can nest up to two levels only.

Example: MAX(SUM(sales))

Conditional clauses:

You can reduce the number of rows processed in the aggregation by specifying a conditional clause.
Example: SUM(salary, salary > 1000)

This will include only the salaries which are greater than 1000 in the SUM calculation.

Non-aggregate functions:
You can also use non-aggregate functions (such as IIF) in the Aggregator transformation.
Example: IIF(SUM(sales) < 20000, 0, SUM(sales))

Note: Without an Aggregator TF we can compute a sum by using a Sorter TF (if the input is not already sorted), an Expression TF with variable ports to accumulate the sum and set a flag, and a Filter TF to pass only the rows where the flag = 1.


Performance Tuning on AggregatorTF:
1)        Unless it is really required, don't go for the Aggregator TF.
2)        Try to push the aggregation into the source qualifier SQL query (see the sketch after this list).
(Exception: when the source is a flat file, or the aggregated fields are derived inside the mapping)
3)        Ensure that all the columns used in the group by are sorted in the same order as the group by to enhance performance.
4)        “Sorted Input” option should be checked in the aggregator properties to enhance the performance.
(Use sorted input: Sort the data before passing into aggregator).
5)        Filter the unwanted data before aggregating.
6)        Limit the number of input/output or output ports to reduce the amount of data the aggregator transformation stores in the data cache.
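
A minimal sketch of point 2, assuming a hypothetical SALES table with REGION and AMOUNT columns; the Source Qualifier SQL override returns pre-aggregated rows so no Aggregator TF is needed in the mapping:

SELECT REGION,
       SUM(AMOUNT) AS TOTAL_AMOUNT,
       COUNT(*)    AS TXN_COUNT
FROM   SALES
GROUP BY REGION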


Friday, August 2, 2013

LOOK-UP in a Nutshell

LOOK-UP in a nutshell
Informatica LookUp:
·         PASSIVE (till 8.6); can be configured as Active from version 9 onwards. Connected or Unconnected TF
·         Used to look up data in a relational table, view, or flat file
·         Lookup can be CACHED or UnCACHED (default lookups are CACHED).
·         CACHED lookup can be of Static/Dynamic/Persistent(Named/Unnamed cache)


CACHED LookUp:
                During the session run, Informatica reads the lookup data from the database server and stores it in a cache file on the Informatica server.

Persistent Cache (the Integration Service saves the lookup cache files after the session run):
This is useful if you are going to use the same lookup table multiple times. The Informatica server then reads the already cached lookup data and reuses it for other lookups that use the same lookup table. Performance-wise it is better, since the cache file does not change/refresh on every session run.

Note: NON-Persistent Cache (the Integration Service deletes the lookup cache files after the session run if they are not required later):


Return Port:
An unconnected Lookup transformation has one return port, which returns one column of data to the calling transformation.
ACTIVE LOOK-UP (after version 9):
Check "Return All Values on Multiple Match" while creating the Lookup TF to return multiple matching records. (This makes the Lookup ACTIVE.)

Performance Tuning:
1.       To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables; this makes operations run very fast.

If Cache is required or not:
When we cache a lookup, Informatica fires the lookup query to the database and brings the entire set of data from the database into the cache file directory of the Informatica server. Informatica then uses this cache file data whenever the lookup is called in the mapping flow during the session run. By doing this, Informatica saves the time (and effort) of going back to the remote database every time the lookup is called in the mapping.
In case of an un-cached or non-cached lookup, the opposite happens. Here Informatica goes to the database to fetch data every time the lookup is called inside the mapping. How often the lookup is called depends on the number of records coming from the source (for a connected lookup, exactly equal to the number of source records).
There are a couple of things we need to consider while opting for any particular type of lookup. I tend to follow these general guidelines:
§  If the number of records coming from the source is comparatively much smaller than the number of records present in the lookup table, then you should consider using an un-cached lookup. This is because fewer records from the source mean fewer database calls.
a)      Source Table Data < LookUp Table Data --> un-cached lookup
b)      If the lookup is done on the primary key of the lookup table.
§  If the lookup table is small (less than 2 million rows), it is generally a good idea to cache the lookup table.
Source Table Data > LookUp Table Data --> cached lookup

2.       Even after caching, performance can be further improved by minimizing the size of the lookup cache. Reduce the number of cached rows by using a SQL override with a restriction, i.e., a WHERE clause with a condition, as in the sketch below.
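
A minimal sketch of such a lookup SQL override, assuming a hypothetical CUSTOMER_DIM lookup table with an ACTIVE_FLAG column; only the rows that can actually match are cached:

SELECT CUST_ID,
       CUST_NAME,
       CUST_SEGMENT
FROM   CUSTOMER_DIM
WHERE  ACTIVE_FLAG = 'Y'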
    
3.    If the lookup transformation specifies several conditions, place the conditions that use the equality operator '=' first in the list on the Condition tab.

4.       In lookup tables, delete all unused columns and keep only the fields that are used in the mapping.
5.       All data is read into the cache in the order the fields are listed in the lookup ports. If we have a database index that is even partially in this order, the loading of these lookups can be sped up.
If the table that we use for lookup has an index (or if we have the privilege to add an index to the table in the database, do so), then performance improves for both cached and un-cached lookups.
 
Remember Informatica creates its own index on the joining columns and therefore if the joining columns are ordered, Informatica would require less space (and time) to create the indexes.

6.     Wise use of the ORDER BY clause in Lookup

In the SQL override query of the lookup table, there will be an ORDER BY clause. Remove it if not needed, or put fewer column names in the ORDER BY list.

When Informatica fires the lookup query to the database, it appends an ORDER BY clause at the end of the query. However, you can suppress this default behaviour by appending a comment "--" at the end of the override query. You should consider the following factors regarding the default Informatica Lookup ORDER BY –

a.     The ORDER BY clause may create additional processing overhead on the database. When the Informatica lookup fires the SELECT query to the database, it includes all the lookup ports in the SELECT query and in the ORDER BY clause. Hence ensure that there are no unnecessary ports in the lookup; otherwise those ports will be included in the Informatica-generated SELECT statement and also in the ORDER BY clause, creating additional overhead in the database.
b.    Also, do not completely suppress the ORDER BY with '--'.
If you suppress the ORDER BY, the data fetched from the database will not be sorted. Hence Informatica will need to sort this data again in its own cache space in order to build the index cache (yes, the index cache cannot be built on unsorted data). This means that in any case the data has to be sorted, either in the database or inside Informatica.
We can suppress the default ORDER BY with '--' and write our own ORDER BY only on those columns (lookup ports) that are used in the join condition, as in the sketch below.
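
A minimal sketch, again assuming the hypothetical CUSTOMER_DIM lookup with a join only on CUST_ID; the trailing '--' comments out the ORDER BY that Informatica appends, while we supply our own on the join column:

SELECT CUST_ID,
       CUST_NAME,
       CUST_SEGMENT
FROM   CUSTOMER_DIM
ORDER BY CUST_ID --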
7.     Use of Persistent/Dynamic Lookup Cache

If the lookup data is static, use a persistent cache. Persistent caches help to save and reuse cache files. If several sessions in the same job use the same lookup table, then using a persistent cache helps the sessions reuse cache files. In case of static lookups, the cache will be built from the saved cache files instead of from the database, which improves performance.



**
Informatica will not create a cache sized above 2GB. Hence, if our lookup cache size is more than 2GB, then if possible consider the option of joining the lookup source table at the database level itself instead of building a lookup cache.
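
A minimal sketch of that alternative, assuming hypothetical ORDERS (source) and CUSTOMER_DIM (lookup) tables in the same database; the join is done in the Source Qualifier SQL override so no lookup cache is built:

SELECT o.ORDER_ID,
       o.CUST_ID,
       o.AMOUNT,
       c.CUST_SEGMENT
FROM   ORDERS o
LEFT JOIN CUSTOMER_DIM c
       ON c.CUST_ID = o.CUST_ID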

If the target table is the lookup table, then use a dynamic cache. The Informatica server updates the lookup cache as it passes rows to the target.


8.     Select Connected or Unconnected Lookup

In case of a connected lookup, since the lookup is connected in the data-flow pipeline, the lookup is called for each record coming from the source, irrespective of whether the data returned from the lookup is actually used later in the mapping.
In case of an unconnected lookup, this can be controlled by calling the lookup only when it is really needed.

9.     Wisely set the Pre-build Lookup Cache option

Informatica may build the lookup cache even before the first row reaches the Lookup transformation (pre-build scenario), or it may build the lookup cache only when the first row reaches the Lookup transformation (on-demand scenario).
The performance boost of pre-building lookup caches is most apparent when the source query itself takes a considerable amount of time to return rows, since that time can be used simultaneously to pre-build all the lookup caches instead of waiting for the source to return results and then building the other lookup caches on demand.
§   When Informatica pre-builds the lookup caches, how many caches it can build concurrently depends on various factors, including the "Additional Concurrent Pipelines for Lookup Cache Creation" property present at the session level under the "Config Object" tab.
§  One should also check whether the database where Informatica fires the lookup queries has enough capacity to process the lookup queries simultaneously without hampering individual query performance. This depends on various database-level properties, especially database TEMP space availability, session-level connection restrictions, and degree of parallelism (DOP).
§  There can be scenarios where not all the lookups even need to be called (especially if a lookup is unconnected and called based on a condition). There is really no point in pre-building those lookups; they should be built on demand only. This is also the reason why the "Pre-Build Lookup Cache" option is available at the individual lookup level and not as a generic option at the session level.