Friday, August 2, 2013

LOOK-UP in a Nutshell

LOOK-UP in a nutshell
Informatica LookUp:
·         PASSIVE (till 8.6)—Active after Version 9 , Connected or UnConnected TF
·         Used to look up data in a Relational Table, view, or in  Flat file
·         Lookup can be CACHED or UnCACHED (default lookups are CACHED).
·         CACHED lookup can be of Static/Dynamic/Persistent(Named/Unnamed cache)


CACHED LookUp:
                Under session running process, Informatica reads the lookup data from the database server and stores it in the Informatica server.

Persistent Cache (Integration Service saves the lookup cache files after the session run):            
This will be useful, if you are going to use the same look up table multiple times in the existing session. So from the informatica server read the already cached lookup data and use the same for other lookup which uses the same Look up table. Performance wise it is better since the cache file does  not change/refresh on session run.

Note: NON-Persistent Cache (Integration Service deletes the lookup cache files after the session run if it is not required later):


Return Port:
An unconnected Lookup transformation has one return port that returns one column of data to the calling transformation through this port.
ACTIVE LOOK-UP (after version 9):
Check Return All Values on Multiple Match, while creating the Look-up TF to return multiple matching records. (Which makes the Look Up as ACTIVE.)

Performance Tuning:
1.       To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables; this makes operations run very fast.

If Cache is required or not:
When we cache a look-up, Informatica fires the lookup query to database and brings in the entire set of data from the database to the Cache file directory of the Informatica server. Informatica then uses this cache file data whenever the lookup is called in the mapping flow during the session runtime. By doing this Informatica saves the time (and effort) to go back to remote database every time the lookup is called in the mapping.
In case of an un-cached or non-cached lookup, the opposite thing happens. Here Informatica goes to database to fetch data every time the lookup is called inside the mapping. How often the lookup is called depends on number of records coming from source (exactly equalling to number of records in case of a connected lookup)
There is couple of things that we need to consider while opting for any particular type of lookup. I tend to follow these general guidelines:
§  If the number of records coming from source is comparatively much lesser than the number of records present in the lookup table then you should consider using un-cached lookup. This is because less number of records from source ensures less number of database calls.
a)      Source  Table Data < LookUp Table Data  à un-cached lookup
b)      If lookup is done on the primary key of the lookup table.
§  If the lookup table is small (less than 2 million), it’s generally a good idea to cache the lookup table
Source  Table Data > LookUp Table Data  à cached lookup

2.         
Even after caching, the performance can be further improved by minimizing the size of the lookup cache. Reduce the number of cached rows by using a sql override with a restriction.
Use the where clause with condition.
    
3.    If lookup transformation specifies several conditions, then place conditions that use equality operator ‘=’ first in the conditions that appear in the conditions tab.

4.       In lookup tables, delete all unused columns and keep only the fields that are used in the mapping.
5.       All data are read into cache in the order the fields are listed in lookup ports. If we have an index that is even partially in this order, the loading of these lookups can be speeded up.
If the table that we use for look up has an index (or if we have privilege to add index to the table in the database, do so), then the performance would increase both for cached and un cached lookups.
 
Remember Informatica creates its own index on the joining columns and therefore if the joining columns are ordered, Informatica would require less space (and time) to create the indexes.

6.     

Wisely use of ORDER BY clause in Lookup


In the sql override query of the lookup table, there will be an ORDER BY clause. Remove it if not needed or put fewer column names in the ORDER BY list.

When Informatica fires the lookup query to database, it appends an ORDER BY clause at the end of the query. However you can suppress this default behaviour by appending a comment “--“at the end of the override query. You should consider following factors regarding default Informatica Lookup ORDER BY –

a.     ORDER BY clause may create additional processing overhead on database. When Informatica lookup fires the SELECT query to the database, it includes all the lookup ports in the SELECT query and in the ORDER BY clause. Hence ensure that there is no unnecessary port in the lookup. Because otherwise those ports will be included in the Informatica generated SELECT statement and also in ORDER BY clause, thus creating additional overhead in database.
b.    Also, do not completely suppress the ORDER BY with '--'.
If you suppress the ORDER BY, then data fetched from database will not be sorted. Hence Informatica will require to again sort this data in its own cache space in order to generate the Index cache (yes! index cache cannot be built on unsorted data). This means in any case, data has to be sorted - either from database or inside the Informatica.
We can suppress the default ORDER BY with '--' and write our own ORDER BY only on those columns (lookup ports) that are being used in the joining.
7.     

Use of Persistent/Dynamic Lookup Cache

  
If lookup data is static, use persistent cache. Persistent caches help to save and reuse cache files. If several sessions in the same job use the same lookup table, then using persistent cache will help the sessions to reuse cache files. In case of static lookups, cache files will be built from memory cache instead of from the database, which will improve the performance.



**
Informatica will not create cache sized above 2GB. Hence if our lookup cache size is more than 2GB then if possible consider the option of joining the lookup source table in the database level itself instead of building lookup cache.

If target table is the lookup table, then use dynamic cache. The Informatica server updates the lookup cache as it passes rows to the target.  


8.      

Select connected OR Unconnected Lookup

In case of a connected lookup, since the lookup is connected in the data flow pipeline so the lookup will be called for each record coming from source, irrespective of the fact whether the data returned from lookup is at all used later in the mapping or not.
In case of unconnected lookup, this can be controlled by calling the lookup only when it is really needed.

9.      

Wisely set Pre-build Lookup Cache OPTION

Performance boost of pre-building lookup caches is more apparent in cases where the source query itself is taking considerable amount of time to return rows so the same time can be simultaneously used to pre-build all the lookup caches instead of waiting for the source to return results and then starting to build other lookup caches on demand.
Informatica may build the lookup cache even before the first row reaches the Lookup transformation (Pre-build scenario) or it may build the lookup cache only when the first row reaches the lookup transformation (On demand scenario).
§   When Informatica pre-builds the lookup cache, how many caches it can build concurrently depends on various factors including “Additional Concurrent Pipelines for Lookup Cache Creation” property present in session level under “Config Object” tab.
§  One should also check if the database where Informatica is firing the Lookup queries has enough capacity to process the Lookup queries simultaneously without hampering the individual query performance. This depends on various database level properties especially on database TEMP space availability, session level connection restrictions and degree-of-parallelism (DOP)
§  There can be scenario where not all the lookups are even required to be called (especially if the lookup is unconnected and called based on a condition). There is really no point in pre-building those lookups as those lookups should be built on demand only. This is also the reason why “Pre-Build Lookup Cache” option is available in individual lookup level and not as a generic option in session level .


 


No comments:

Post a Comment