Ad hoc desktop analysis is the first step towards decentralizing operational data. Concepts such as the ODS and the data warehouse are gaining importance, and slowly but surely a parallel system emerges that is meant to feed the information needs of users. This article focuses on why this duplication exists and on the important consequences associated with it, such as scheduling and synchronization.
The decision to separate operational data from the OLTP system is often made for technical reasons and is in itself fairly simple. Thinking through the architecture is harder, because it depends largely on how the extracted data will be presented. If the functional purpose is mainly a form of operational reporting, albeit at a more global level than the screen queries, a so-called Operational Data Store, or ODS, will be built. An ODS can be regarded as a copy of the operational environment, but with information that is limited in time. It allows reporting at the same level of detail, but unlike the original database, also called the source system, its data remains available for a much shorter time.

In practice, this basic rule is all too often overlooked: once data has been transferred from the source system, it usually stays there. To avoid running, in the long term, into the very problems that prompted the building of the ODS, a proper archiving method is essential. This is discussed in more detail a little further on in this book.

Technically, creating an ODS is in principle a simple operation, because the granularity of its data perfectly matches that of the source system. Record-to-record synchronization between the two systems can therefore take place based on creation and mutation information. Although the availability of the data is limited in time, this high granularity still imposes certain limitations: any higher-level reporting requires grouping the data online, which lengthens extraction times. An ODS is therefore not really suitable for analytical reporting.
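The record-to-record synchronization described above can be sketched as an incremental copy driven by mutation information. The sketch below is a minimal illustration, assuming a source table with a `mutated_at` column; the table and column names are hypothetical, not taken from the article.

```python
import sqlite3

def sync_increment(source, ods, since):
    """Copy rows created or mutated after `since` from the source into the ODS."""
    rows = source.execute(
        "SELECT id, amount, mutated_at FROM invoices WHERE mutated_at > ?",
        (since,),
    ).fetchall()
    # INSERT OR REPLACE keeps the ODS record-for-record identical to the source
    ods.executemany(
        "INSERT OR REPLACE INTO invoices (id, amount, mutated_at) VALUES (?, ?, ?)",
        rows,
    )
    ods.commit()
    return len(rows)

# Two in-memory databases stand in for the OLTP system and the ODS
source = sqlite3.connect(":memory:")
ods = sqlite3.connect(":memory:")
for db in (source, ods):
    db.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount REAL, mutated_at TEXT)"
    )
source.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, 100.0, "2024-01-01"), (2, 250.0, "2024-01-03")],
)

copied = sync_increment(source, ods, "2024-01-02")
print(copied)  # only rows mutated after the watermark are copied
```

Because the granularity on both sides is identical, the copy needs no transformation at all; the watermark (`since`) simply advances after each run.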
If one wishes to go in that direction, a data warehouse will have to be developed. As mentioned, a data warehouse is a blueprint of the corporate structure, in contrast to an ODS, which is a blueprint of the database structure. Within a data warehouse, the principle of operational reporting is abandoned entirely and the OLTP structure is, as it were, remolded according to the information needs. Although the analytical possibilities of such an architecture are far more extensive, the data is still stored at a relatively low level of detail. This requirement stems from the ad hoc nature of certain forms of reporting, where it is not always clear in advance which direction the analysis will take. Multiple options must therefore remain available, which results in more data.
An ODS and a data warehouse can coexist, even on the same server, to support every form of reporting. It is up to the IT department to make their combined use as transparent as possible for the business user. Since an ODS is fed through a one-to-one relationship with the OLTP system, the extracted data can in turn be used to feed the data warehouse. Because the ODS basically contains less data than the source system, and is isolated from the operational environment, it is pointless to burden the OLTP system twice with the same extraction. It is much easier to use the ODS as the source system for the data warehouse, especially because the aggregation the warehouse requires can be achieved much faster and more efficiently from the ODS. It is of course essential that both systems contain the same form of information.
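Feeding the warehouse from the ODS essentially means rolling detail rows up to a higher granularity. A minimal sketch, assuming ODS invoice lines are aggregated to monthly totals per customer; all field names here are illustrative.

```python
from collections import defaultdict

def aggregate_to_warehouse(ods_rows):
    """Group ODS invoice lines into monthly totals per customer."""
    totals = defaultdict(float)
    for row in ods_rows:
        # Warehouse grain: one row per (customer, YYYY-MM)
        key = (row["customer_id"], row["date"][:7])
        totals[key] += row["amount"]
    return dict(totals)

ods_rows = [
    {"customer_id": 7, "date": "2024-01-05", "amount": 100.0},
    {"customer_id": 7, "date": "2024-01-20", "amount": 50.0},
    {"customer_id": 9, "date": "2024-02-02", "amount": 75.0},
]
print(aggregate_to_warehouse(ods_rows))
# {(7, '2024-01'): 150.0, (9, '2024-02'): 75.0}
```

Running this against the ODS rather than the OLTP tables is exactly the point made above: the operational system is hit only once, and the grouping work lands on the isolated copy.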
Synchronization & scheduling
It goes without saying that the data made available within the information environment must correspond with that in the OLTP tables; in other words, the systems must be kept synchronized. With an ODS, synchronization seems a simple task, because the granularity of both systems is in principle perfectly matched. This simplicity is undone by the possible problem of data delay: a choice must be made between synchronous and asynchronous processing of the transactions.

Synchronous processing implies that every transaction on the OLTP system is immediately repeated on the second system, in this case the ODS. The data in the tables of both systems is therefore identical, apart from a negligible interval of a few milliseconds. In terms of content and availability this seems very attractive, were it not that operational data processing then becomes dependent on two systems. Once again the need for output would hinder input, and it is essential that this is avoided. Asynchronous processing avoids this double dependence, but introduces a certain delay in the availability of the data within the ODS. If we assume that this processing is followed by another load into the data warehouse, we immediately have to contend with a double delay. Updating a data warehouse will of course always be asynchronous, because neither the granularity nor the structure allows a synchronous approach.

Asynchronous processing leads to scheduling, an important aspect of building an information environment. All tasks, often called jobs, must be distributed correctly across the availability of the system. If the scheduling tool provides for it, as many dependencies as possible should be built in, so that the quality and consistency of the data can be guaranteed at all times.
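The dependency principle above can be sketched as a tiny job runner that only starts a job once everything it depends on has finished successfully. The job names and the runner itself are illustrative, not a real scheduling tool.

```python
def run_jobs(jobs, deps):
    """Run jobs respecting dependencies.

    jobs: name -> callable returning True on success.
    deps: name -> list of job names that must succeed first.
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return True
        # A job runs only after all its prerequisites have succeeded
        if not all(run(d) for d in deps.get(name, [])):
            return False
        ok = jobs[name]()
        if ok:
            done.add(name)
            order.append(name)
        return ok

    results = {name: run(name) for name in jobs}
    return results, order

jobs = {
    "load_customers": lambda: True,
    "load_invoices": lambda: True,
}
deps = {"load_invoices": ["load_customers"]}

results, order = run_jobs(jobs, deps)
print(order)  # ['load_customers', 'load_invoices']
```

Even though `load_invoices` appears first in no particular guaranteed position, the dependency forces the customer load ahead of it, which is precisely what guarantees consistent data.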
Let us imagine that a data warehouse loading procedure consists of one job that processes a file with invoice data and another that processes a file containing new customers. If both jobs are scheduled without any form of dependency, it is perfectly possible that, after a problem loading the customers, the data warehouse suddenly contains invoice data referring to customers that have not yet been included in the dimension table. The result is join conditions that no longer hold and invoice data that goes missing in reports, while on the fact side it is assumed that the data has been completely updated.

Data delay is unavoidable in asynchronous processing, so it is important that the acceptable level of delay is determined correctly. An information environment intended for ad hoc or semi-operational purposes is often updated daily, usually at night, to limit the load on the source system as much as possible. Determining the delay, however, depends on the information needs rather than on the system: there is no point in updating data daily if it is only used weekly. With bulk loads, where enormous amounts of data must be processed, a short delay is nevertheless advisable even if the information is only used after a longer period; such incremental processing is carried out purely to relieve the system. The content of the data also plays a role: if the company only invoices monthly, there should be no daily job looking for new invoices when they can never exist. Scheduling is therefore mainly about finding the right compromise between the information needs and the technical possibilities of the system. The consistency of the data, however, is always central.
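The invoice/customer scenario above can be sketched in a few lines: when the customer load fails, the dependent invoice load is skipped, so no fact rows can ever refer to missing dimension rows. The job functions are hypothetical stand-ins for the real loading procedures.

```python
def load_customers():
    # Simulate the failure from the example: the customer file cannot be processed
    raise RuntimeError("customer file could not be processed")

def load_invoices():
    return "invoices loaded"

def run_with_dependency(first, then):
    """Run `then` only if `first` completed without error."""
    try:
        first()
    except Exception as exc:
        # Dependency failed: skip the dependent job to keep facts and
        # dimensions consistent
        return f"skipped dependent job: {exc}"
    return then()

print(run_with_dependency(load_customers, load_invoices))
# skipped dependent job: customer file could not be processed
```

The price of this guard is a slightly larger data delay when something goes wrong, but that is exactly the compromise the text describes: delay is negotiable, consistency is not.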