Extraction, Transformation & Loading (ETL)

To extract data from an operational environment and remodel it into an informative form in a physically separate information environment, special software is needed: ETL software.

ETL

At a certain point, the need arises to centralize operational data within an ODS or to house it in a data warehouse (see the article Ad hoc desktop analysis). Before any form of analysis can see the light of day, this centralization itself must take place. For this purpose, the company will use a so-called ETL tool. ETL stands for Extraction, Transformation and Loading. An ETL process therefore consists of extracting data from a source system, enriching it with certain logic and loading the transformed data into a new environment. There are in fact two more processes that could be added to the definition, namely Indexing and Analyzing. Since the data is already in the database at the time of indexing, indexing is usually not included in the standard definition of ETL. Analyzing takes place at an even later stage and is considered a separate step for the same reason. This form of analysis has, of course, nothing to do with business analysis: it is a strictly technical process, the gathering of table statistics, that only applies in a DBMS with a cost-based optimizer.
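
To make the three steps concrete, the following minimal sketch runs an ETL flow in Python against SQLite. The table names, columns and enrichment logic are invented for illustration; a real flow would of course connect to two separate systems.

    import sqlite3

    # Hypothetical source (operational) and target (warehouse) databases;
    # in reality these would be two separate systems.
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")

    # A toy operational table so the example is self-contained.
    source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, 1250, "be"), (2, 990, "nl"), (3, 4300, "be")])

    # Extraction: pull the raw rows out of the source system.
    rows = source.execute("SELECT id, amount_cents, country FROM orders").fetchall()

    # Transformation: enrich the data with some logic (convert cents to a
    # decimal amount, normalize the country code to upper case).
    transformed = [(oid, cents / 100.0, country.upper()) for oid, cents, country in rows]

    # Loading: write the transformed rows into the new environment.
    target.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, country TEXT)")
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

    # Indexing and analyzing follow once the data is in the database;
    # ANALYZE gathers the statistics a cost-based optimizer relies on.
    target.execute("CREATE INDEX ix_fact_orders_country ON fact_orders (country)")
    target.execute("ANALYZE")

The last two statements show why indexing and analyzing fall outside the strict definition: they operate on data that has already been loaded.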

Choice

When choosing an ETL tool, it is important that it can successfully complete the three main steps of the ETL process. Indexing and analyzing are database specific and can, if necessary, be done afterwards via self-written scripts. The first generation of ETL tools were code generators: based on so-called metadata, they generated the necessary scripts, which were then started via batch procedures. Such outdated tools often provided little or no transformation capability, meaning the logic had to be shifted to the source system at extraction time or to the data warehouse at reporting time. The newer generation of tools makes it possible to incorporate even the most complex transformations into the process with a minimum of programming. It is sufficient to parameterize the logic; the actual execution is a complete black box. The black box principle may simplify the definition of the logic, but it sacrifices a certain flexibility and also makes you more dependent on the vendor.
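
As an illustration of what such a first-generation code generator amounts to, the hypothetical sketch below derives a plain SQL script from a metadata mapping. The mapping format is invented, but the principle is the point: the tool only emits scripts from metadata, and nothing beyond a column mapping can be expressed.

    # A hypothetical metadata record describing one source-to-target mapping.
    mapping = {
        "source_table": "orders",
        "target_table": "fact_orders",
        "columns": {"id": "id", "amount_cents": "amount", "country": "country"},
    }

    def generate_script(meta: dict) -> str:
        """Generate a plain INSERT ... SELECT script from the metadata;
        note that no transformation logic can be expressed here."""
        src_cols = ", ".join(meta["columns"])
        tgt_cols = ", ".join(meta["columns"].values())
        return (f"INSERT INTO {meta['target_table']} ({tgt_cols})\n"
                f"SELECT {src_cols} FROM {meta['source_table']};")

    print(generate_script(mapping))
    # INSERT INTO fact_orders (id, amount, country)
    # SELECT id, amount_cents, country FROM orders

Any real logic, converting the cent amounts for instance, would have to live in the source system or in the warehouse, exactly the shortcoming described above.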

Investment

ETL tools are generally very expensive, and their purchase is often preceded by lengthy evaluations and comparisons, especially for smaller data warehouse projects. There are cheaper solutions on the market, especially in the code generator category, but they usually require many adjustments before they fit within the corporate structure. Examples of tools include Pervasive, Ascential DataStage, Ab Initio, Oracle Warehouse Builder and Informatica. There are many more on the market, however, and it is therefore advisable to invest the necessary time in the preliminary study and look for the tool that best suits your own company and requires the fewest adjustments.

In-house development

The option of course remains open to program everything within your own IT department, but the cost of the additional workforce will in most cases be higher than that of any ETL tool. The effort is not limited to the actual programming: it is often preceded by a great deal of preliminary work studying every aspect of the ETL flow. The decision to build or buy a tool, the infamous make-or-buy principle, is completely company dependent, but it still comes down to a few basic aspects. First and foremost, the complexity of the data transformation and the quality of the source data must be taken into account. These determine the nature of the transformations and therefore immediately clarify whether or not a code generator is sufficient. Another important aspect is the volume of data that will pass through the ETL flow: some software is simply too light to handle so-called bulk loads, as the sketch below illustrates. A careful search lies ahead.
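
As a rough, hypothetical illustration of the volume aspect, the sketch below contrasts row-by-row inserts with a set-based bulk load. SQLite stands in for a real target database, where each single-row statement would additionally cost a network round trip.

    import sqlite3
    import time

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging_a (id INTEGER, payload TEXT)")
    db.execute("CREATE TABLE staging_b (id INTEGER, payload TEXT)")
    rows = [(i, f"row-{i}") for i in range(100_000)]

    # Row-by-row loading: one statement per record, the pattern that makes
    # lightweight tools buckle under large volumes.
    t0 = time.perf_counter()
    for row in rows:
        db.execute("INSERT INTO staging_a VALUES (?, ?)", row)
    t1 = time.perf_counter()

    # Set-based bulk loading: the whole batch is handed over in one call.
    db.executemany("INSERT INTO staging_b VALUES (?, ?)", rows)
    t2 = time.perf_counter()

    print(f"row by row: {t1 - t0:.3f} s, bulk: {t2 - t1:.3f} s")

A tool that only supports the first pattern may work fine in the evaluation and still prove unusable once production volumes arrive, which is why the expected data volume belongs in the preliminary study.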