Data Mining within an information environment

Data mining is a concept that comes up often in discussions of Business Intelligence (BI). It goes hand in hand with related activities such as data cleansing and data enrichment, and its complexity makes it one of the harder BI terms to grasp. Some clarification is in order.

Scavenger hunt

Data mining can be described as a search for useful information within an existing database, with the aim of acquiring knowledge. In the literature, data mining is often referred to as Knowledge Discovery in Databases (KDD). In addition to pure knowledge acquisition, KDD also includes cleaning up files and enriching the data with meaningful information. Data mining is to a certain extent related to the HOLAP technique (see article Online Analytical Processing (OLAP)), because here too one starts from a certain point of view and drills down toward the information. Although data mining mainly focuses on dimensional data, some situations require searching at a much lower level, such as the level of ad hoc desktop analysis (see article Ad hoc desktop analysis) or even the level of operational reporting (see article Operational Querying).

In theory, the KDD process starts with the construction of a data warehouse, followed by knowledge extraction and logical interpretation. In contrast to the classic data warehouse, which is built around the data and the representation of specific business processes, a data warehouse intended for data mining is built from a strategic point of view. In other words, general data warehousing has a global goal in mind, while data mining pursues a very specific goal.


The most common application of data mining is the construction of a CRM data warehouse. CRM stands for Customer Relationship Management and allows a company to maintain a central customer database that is accessible to all interested parties within the organization. The aim is a single, unique record for each customer within the corporate structure, with complete and accurate information that is kept up to date. To safeguard this uniqueness, so-called deduplication procedures are run on every delivery of customer data. Based on the address data provided, they determine whether a customer is already present in the system and whether that customer's data needs to be enriched with additional information. Deduplication procedures often involve complex algorithms that take spelling, phonetic similarity and related elements into account. A CRM data warehouse cannot simply be equated with a traditional data warehouse and is therefore rarely approached as one.
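The deduplication idea described above can be illustrated with a minimal Python sketch. It is not the procedure of any particular CRM product: the record fields (`surname`, `street`, `postal_code`), the 0.8 similarity threshold and the matching rule itself are assumptions chosen for illustration. Phonetic similarity is approximated with a Soundex-style code and spelling similarity with the standard library's `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher


def soundex(word: str) -> str:
    """Soundex-style phonetic code: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    word = re.sub(r"[^A-Z]", "", word.upper())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated consonant counts again
    return (result + "000")[:4]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Assumed rule: same postal code, surnames that sound alike,
    and street lines whose spelling is nearly identical."""
    if normalize(a["postal_code"]) != normalize(b["postal_code"]):
        return False
    if soundex(a["surname"]) != soundex(b["surname"]):
        return False
    similarity = SequenceMatcher(
        None, normalize(a["street"]), normalize(b["street"])
    ).ratio()
    return similarity >= threshold


# Hypothetical records: one already in the system, one newly delivered.
existing = {"surname": "Janssens", "street": "Kerkstraat 12", "postal_code": "2000"}
incoming = {"surname": "Jansens", "street": "Kerkstr. 12", "postal_code": "2000"}
print(is_duplicate(existing, incoming))
```

In practice the threshold and the combination of criteria would have to be tuned against real address data, since a rule that is too loose merges distinct customers and one that is too strict lets duplicates through.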

External data & cleansing

Not all data originates within the company. This is certainly true for CRM, especially at large companies. A customer database has a standard model with a fixed structure for recording addresses and telephone numbers, among other things. That is why companies often decide to purchase databases containing address information for specific target groups. Because the information comes from many different sources, the files will rarely if ever share the same physical structure. In most cases the fields used comply with the standard model, but the detailed structure is often inconsistent: the fields may appear in a different order, or information may be redundant or missing. In addition to the structured data within the database, one is therefore confronted with various forms of unstructured data, which must be brought into the system correctly. This does not mean that the relational customer data that originates within the company is already perfectly structured for use in a CRM data warehouse. In most cases this data also has to go through a cleansing and/or enrichment procedure. The internal relational customer data is, as it were, merged with the external data and transformed as a whole into usable, correct and complete data.
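Bringing a purchased file in line with the standard model can be sketched as follows. This is only an illustration under stated assumptions: the standard field list, the supplier's column names and the mapping between them are all hypothetical. The sketch handles the three inconsistencies mentioned above: a different field order, redundant columns (dropped) and missing columns (left empty for later enrichment).

```python
import csv
import io

# Hypothetical standard model of the customer database.
STANDARD_FIELDS = ["name", "street", "postal_code", "city", "phone"]

# Hypothetical mapping from one supplier's column names to the standard names.
SUPPLIER_MAPPING = {
    "FullName": "name",
    "Address": "street",
    "Zip": "postal_code",
    "Town": "city",
    "Tel": "phone",
}


def to_standard(raw_csv: str, mapping: dict) -> list:
    """Convert supplier rows to the standard model, whatever the column order."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        # Start from an empty standard record so missing fields stay blank.
        row = {field: "" for field in STANDARD_FIELDS}
        for source_name, value in record.items():
            target = mapping.get(source_name)
            # Unmapped (redundant) columns and absent values are skipped.
            if target is not None and value is not None:
                row[target] = value.strip()
        rows.append(row)
    return rows


# Supplier file with a different field order, an extra "Fax" column
# and no telephone column at all.
purchased = (
    "Zip,FullName,Address,Town,Fax\n"
    "2000, An Peeters ,Kerkstraat 12,Antwerpen,03 123\n"
)
for row in to_standard(purchased, SUPPLIER_MAPPING):
    print(row)
```

Keeping one mapping per supplier separates the knowledge about each source from the conversion logic, so a new source only requires a new mapping rather than new code.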

Application: mailings

Large amounts of customer data are ultimately used mostly in marketing, where purchased addresses often form the basis for advertising mailings. To keep the cost of the mailing process as low as possible and to maintain professional communication with customers, it is essential that a given shipment is sent only once to a given address. That is why the deduplication procedures are among the most important steps in the entire process, from loading to use. In its first phase, this form of data mining conflicts somewhat with the general definition, which requires that knowledge be acquired. In mailing processes, this knowledge usually follows at a later stage, depending on the responses to the mailings. A grouped evaluation of the responses usually highlights areas of interest and in turn leads to hypotheses that can serve as a basis for further research. The result is a catalog of thousands of different patterns that would be impossible to obtain with classical analysis tools. It can therefore be said that knowledge is gained in every analysis of mailing results.
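A grouped evaluation of mailing responses could look like the following sketch. The record layout (`region`, `responded`) and the sample values are invented for illustration. The idea is simply to compute a response rate per group, so that groups with a noticeably higher or lower rate can become the starting point for a hypothesis.

```python
from collections import defaultdict


def response_rate_by_group(mailings: list, group_key: str) -> dict:
    """Return the share of mailings that received a response, per group."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for mailing in mailings:
        group = mailing[group_key]
        sent[group] += 1
        if mailing["responded"]:
            responded[group] += 1
    return {group: responded[group] / sent[group] for group in sent}


# Hypothetical mailing results, one record per address mailed.
results = [
    {"region": "north", "responded": True},
    {"region": "north", "responded": False},
    {"region": "south", "responded": False},
    {"region": "south", "responded": False},
]
print(response_rate_by_group(results, "region"))
```

The same function can be applied to any other attribute of the customer record, such as age band or product interest, by passing a different `group_key`.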

Figure 1: data mining

Application: evaluation of viewing figures

Another example of data mining is the evaluation of viewing figures at a television production company. Viewing figures are to a certain extent comparable to the results of a survey, although in this situation the product has already been sold. An evaluation of the viewing figures inevitably influences the programming, because the figures reveal the interest patterns of the viewers. In this way the past guides the future. Figure 1 illustrates in general terms what a data mining management flow looks like. The starting point is a strategic goal, which ultimately leads to actions based on knowledge acquired from the information provided.