White Paper: Data Mining with XpertRule® Miner


Organisations are increasingly storing large amounts of data generated by their operating activities. Such historical data has buried within it patterns relating to the effectiveness of the various business processes. Data mining can discover such patterns in data and is now considered a catalyst for enhancing business processes through avoiding failure patterns and exploiting success patterns.

The potential for discovering knowledge buried in data has created the need for better management of corporate historic data. This has led to the concept of data warehousing, whereby operational data is maintained in a database dedicated to providing business users with online data for business analysis. A data warehouse can be a large corporate database, a departmental database (data mart) or a local database on a single client PC. The quality of knowledge that can be discovered from data is not dependent on the scale and architecture of the data warehouse. Quality is dependent on having the right data and the appropriate data mining tools and development methodology.

The business benefits of data mining have created a scramble by software suppliers to position their products as data mining tools. Anything from simple query and reporting products to the most advanced pattern discovery products has been put forward as a "data mining" tool. This has caused confusion among business users as to what data mining actually means. There are three technologies for the discovery of patterns in data:

  • Query and reporting tools: These allow the user to find answers (confirmation) to queries (patterns) already being suspected. Such tools can be best described as hypothesis driven data exploration tools, with the user volunteering all the patterns to be investigated.
  • OLAP tools: These are advanced forms of query & reporting tools which allow large multi-dimensional databases to be interrogated speedily and graphically. These tools can be best described as visualisation driven data exploration tools. The discovery process is still user driven. However, the user is armed with a multi-dimensional view of the data to drill down at will, thereby aiding the exploration/discovery process.
  • Data Mining Tools: These automate the process of discovering patterns/knowledge in data. They enable business goal driven discovery. For example, instead of the user asking for a report or a graph of sales per region and product - hoping to detect a pattern - the user can instead ask for patterns relating to high sales volumes (a business goal).
Discovering patterns from data (also known as Knowledge Discovery in Databases) combines all of the above technologies, since it requires hypothesis, exploration and automatic discovery. It follows that the above technologies are complementary. In addition to supporting automatic pattern generation, XpertRule Miner also supports the ability to query/report and to visualise/explore the data in conjunction with the discovered patterns.


Important considerations when deploying Data Mining

Data mining is emerging as a mature technology which is being incorporated into mainstream business applications. Data mining has evolved beyond the point where the algorithms are the main criterion for assessing the technology. The important considerations when deploying data mining in an organization are:

  • The need for a data mining process (methodology) supported effectively by the data mining environment.
  • The need for an interactive knowledge discovery environment in which the business knowledge of the user is combined with the power of the discovery algorithms in order to derive business knowledge (patterns) from data.
  • The effective and active deployment of the data mining models and patterns.
  • Flexibility in addressing various computing architectures.
  • Scalability and performance on large data volumes.


Graphical Support for a Data Mining Process

The effectiveness of data mining as a business intelligence tool has been demonstrated by a large number of successful applications. However, to give data mining a wider appeal, it has become apparent that a methodology or process is required to allow people who are not data mining specialists to achieve the same degree of success as seasoned practitioners. Such a systematic and repeatable process will allow data mining to be successfully deployed by many people across organizations. There are a number of initiatives and projects to develop such a process, two of which are partly funded by the European Commission. XpertRule Software has been involved directly in one of these (CRITIKAL) and is a member of the Special Interest Group set up in conjunction with the second (CRISP-DM). It is reassuring to see a common data mining process (methodology) starting to emerge. There is broad agreement on the main tasks within such a process, which are data preparation, data exploration, pattern discovery, pattern validation and pattern deployment.

XpertRule Miner provides a graphical environment for supporting all the stages of the data mining process. The click, drag and drop environment allows non programmers to carry out complex data preparation, mining and deployment processes.

Graphical Data Transformation


Data Sources

XpertRule Miner uses data drivers known as CAF servers to read/write to data sources. The standard ODBC CAF server will support all ODBC compliant data sources. The open architecture of the CAF drivers allows the development of additional CAFs using the API of non ODBC data sources. CAFs for client-server architectures are also available - for example, the TCP/IP STUB CAF.


Data preparation & Transformation

It is now accepted by most data mining practitioners that between 50% and 80% of the total life cycle of a data mining project can be taken up by the data preparation stage. The objective of this stage is to cleanse the data and to transform it into a format suitable for the application of pattern discovery techniques.

XpertRule Miner allows non programmers to carry out complex data transformations using an intuitive drag and drop graphical interface. It can process data tables with millions of records. The data transformation operations supported are:

  • Data Aggregation: This is used to summarise detailed data (e.g. aggregating 1 second data into 5 minute averages) and also to transform time series data into attribute/value (case) data suitable for tree induction and cluster analysis.
  • Data Table Manipulation: This includes record filtering, random sampling, merging, joining and sorting.
  • Column Derivation: This allows the user to define new data columns which are derived from existing data columns. It is also used for data cleansing (processing blanks & outliers) and the grouping or banding of field values. XpertRule Miner supports a comprehensive VB like script for calculations and string manipulations.
  • Data Visualisation & Reporting: XpertRule Miner can generate field statistics, frequency distribution graphs, 2D and 3D multi field graphs and time series graphs. These graphs and reports allow the user to get a better understanding of the raw data, design effective data cleansing and transformation strategies and validate the transformed data. The prepared data can be browsed and explored before the pattern discovery process is started. This gives a better understanding of the data and enables the user to better interpret the discovered patterns.
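The aggregation and banding operations described above can be illustrated with a minimal Python sketch. It is not XpertRule Miner's own script language; the function names, the 300 second window and the "low"/"high" bands are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_windows(readings, window_seconds=300):
    """Average (timestamp_seconds, value) readings over fixed-width time windows."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Every reading is assigned to the window that contains its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def band(value, edges, labels):
    """Map a number to a band label; edges are the upper bounds of all but the last band."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Ten minutes of hypothetical 1 second readings: a flat 10.0, then a flat 20.0.
readings = [(t, 10.0) for t in range(300)] + [(t, 20.0) for t in range(300, 600)]

averages = aggregate_to_windows(readings)  # one average per 5 minute window
banded = {start: band(avg, edges=[15.0], labels=["low", "high"])
          for start, avg in averages.items()}  # derived (banded) column
print(averages)
print(banded)
```

Each 5 minute window becomes one summarised record, and the derived band turns a continuous field into a small set of values of the kind that tree induction and cluster analysis work with.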

3D Graph

Pattern Discovery

In order to address industry wide data mining needs, XpertRule Miner supports a basket of knowledge discovery techniques:

A tree example


Tree Induction: This is goal driven discovery and is the most widely used technique involving the induction of patterns (trees) relating to a business event (goal), such as mortgage arrears, customer attrition, energy consumption, insurance claims, etc.

Interactive Induction

Interactive/incremental Data Mining: This combines automatic tree induction and manual tree construction. It enables the business user to develop tree patterns in collaboration with the induction algorithm. At every node (branch) in the tree, XpertRule Miner shows the importance of the various attributes at that point. The user is given the opportunity to impart their background business knowledge and influence the choice of attribute splits while respecting the information evidence provided by Miner.
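As a rough illustration of ranking attributes at a tree node, the sketch below scores each attribute by information gain with respect to the goal. The white paper does not state which importance measure Miner uses, so information gain is an assumption here, as are the function names and the small mortgage data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of goal values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, goal):
    """Reduction in goal entropy obtained by splitting the records on one attribute."""
    base = entropy([r[goal] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[goal])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def rank_attributes(records, attributes, goal):
    """Return (attribute, gain) pairs, most informative first."""
    scores = {a: information_gain(records, a, goal) for a in attributes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical records reaching one node of the tree.
records = [
    {"employment": "salaried", "region": "north", "arrears": "no"},
    {"employment": "salaried", "region": "south", "arrears": "no"},
    {"employment": "self-employed", "region": "north", "arrears": "yes"},
    {"employment": "self-employed", "region": "south", "arrears": "yes"},
]

ranking = rank_attributes(records, ["employment", "region"], goal="arrears")
for attribute, gain in ranking:
    print(f"{attribute}: {gain:.3f}")
```

In an interactive setting, a ranking of this kind is the evidence presented to the user, who remains free to choose a lower ranked attribute when business knowledge suggests it gives a more meaningful split.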

Association Rules

Discovering Association Rules: This is the discovery of associations between business events. For example, which items are purchased together in a supermarket (basket analysis), which product options are taken up together, which faults occur together, etc. XpertRule Miner supports the discovery of association rules and frequent item sets from transaction data of items or events.
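To make the terms concrete, the following sketch finds frequent item pairs in a handful of baskets and derives rules with their support and confidence. It is a deliberately simple pair counting approach rather than the algorithm used by Miner, and the function name, threshold and baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all transactions containing both items
        if support < min_support:
            continue
        # Confidence: share of transactions with the antecedent that also hold the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical supermarket baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]

rules = pair_rules(transactions, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how often the items occur together across all transactions, while confidence measures how reliably the consequent accompanies the antecedent.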

Discovering Clusters in data: This is the discovery of natural clusters or segmentation in data. An example would be segmenting a mortgage portfolio. XpertRule Miner generates clusters in 'case' (attribute based) data by discovering sets of attribute values that are frequently associated with each other.


Pattern Exploration and Validation

Data visualisation and exploration play an important role throughout the data mining process. During the tree induction process, XpertRule Miner allows user defined reports and data graphs to be updated dynamically as the user is exploring the various nodes and leaves (profiles) of the discovered tree. In addition to giving the user a method of validating the accuracy and meaning of tree patterns, the pattern exploration process helps the user obtain a better understanding of the patterns being discovered and their implications. XpertRule Miner supports a number of tree exploration reports: field statistics, frequency distribution, field propensity/value across profiles and "gain or lift" graphs.
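The figures behind a "gain or lift" graph can be sketched as follows, using the common definition in which profiles are ordered by their goal rate and cumulative shares are compared. Whether Miner computes its graphs in exactly this way is not stated in this paper; the function name and the three profiles are assumptions for illustration.

```python
def gain_table(profiles):
    """profiles: list of (name, records, hits).

    Returns rows of (name, cumulative share of records, cumulative share of hits, lift),
    with profiles ordered from the highest goal rate to the lowest.
    """
    total_records = sum(records for _, records, _ in profiles)
    total_hits = sum(hits for _, _, hits in profiles)
    ordered = sorted(profiles, key=lambda p: p[2] / p[1], reverse=True)

    rows, cum_records, cum_hits = [], 0, 0
    for name, records, hits in ordered:
        cum_records += records
        cum_hits += hits
        share_records = cum_records / total_records
        share_hits = cum_hits / total_hits
        # Lift: how much richer in goal events this selection is than the whole data set.
        rows.append((name, share_records, share_hits, share_hits / share_records))
    return rows


# Hypothetical tree leaves: (profile, number of records, records where the goal occurred).
profiles = [("B", 300, 30), ("A", 100, 40), ("C", 600, 30)]

rows = gain_table(profiles)
for name, share_records, share_hits, lift in rows:
    print(f"{name}: {share_records:.0%} of records, {share_hits:.0%} of goal events, lift {lift:.2f}")
```

A lift well above 1 for the first profiles indicates that the tree concentrates the goal events into a small part of the data, which is one way of validating the usefulness of the discovered pattern.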

ProfilerX Tree Miner


Pattern deployment

Patterns discovered using data mining can be deployed in a number of ways to address the relevant business requirements. XpertRule Miner supports a number of deployment strategies:

  • Reporting and Dissemination: Graphical tree patterns can be generated in Windows Meta File format which allows them to be easily embedded in other Windows applications such as Word, Excel and PowerPoint.
  • Data Filtering: XpertRule Miner can generate the discovered patterns as C code, SQL or SAS procedures. This allows the user to select, for further processing, data records matching the discovered patterns.
  • Decision Support: The tree patterns discovered in XpertRule Miner can be used as part of an online decision support system. This can be achieved by generating the tree patterns as C code or by embedding the tree mining client ProfilerX (shown in the illustration above) as an ActiveX component.
  • Active Deployment: This is where a small number of data and business specialists in an organization can create a specific data mining business scenario (vertical application) to be deployed to a large number of data mining users inside or outside the organization. This is achieved using the tree mining client ProfilerX as an embedded ActiveX component.
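The data filtering strategy above can be pictured with a short sketch that turns a tree into a SQL selection. It does not reproduce the code that XpertRule Miner generates; the nested dictionary layout, the function names and the table and column names are all hypothetical.

```python
def leaf_conditions(node, path=()):
    """Yield (outcome, WHERE condition) for every leaf of a nested-dict tree."""
    if "outcome" in node:
        # A leaf: the conditions along the path from the root define the profile.
        yield node["outcome"], " AND ".join(path) or "1=1"
        return
    for condition, child in node["branches"].items():
        yield from leaf_conditions(child, path + (condition,))


def select_for_outcome(tree, table, outcome):
    """Build a SELECT statement returning the records that match a given outcome."""
    clauses = [f"({where})" for leaf, where in leaf_conditions(tree) if leaf == outcome]
    if not clauses:
        raise ValueError(f"no leaf with outcome {outcome!r}")
    return f"SELECT * FROM {table} WHERE " + " OR ".join(clauses)


# Hypothetical tree for a mortgage arrears goal.
tree = {
    "branches": {
        "employment = 'self-employed'": {
            "branches": {
                "loan_to_value >= 0.9": {"outcome": "high_risk"},
                "loan_to_value < 0.9": {"outcome": "low_risk"},
            }
        },
        "employment <> 'self-employed'": {"outcome": "low_risk"},
    }
}

query = select_for_outcome(tree, "mortgages", "high_risk")
print(query)
```

Each leaf becomes one conjunction of conditions, so the records belonging to a profile can be selected by the database itself for further processing.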


Connectivity, scalability and performance

The data mining tools available today fall into one of two distinct architectures:

  • Client based mining: These data mining tools run on client machines and mine data stored on the same client or data downloaded from a server to the client for mining. These tools limit the size of data that can be mined, typically in the order of tens of thousands of records (table rows). These limits are imposed by client memory/processor speed restrictions, as well as network bandwidth restrictions.
  • Workstation (server) based mining: These tools run on workstations with very thin display clients. While high performance workstations and high bandwidth to the server overcome the limitations of client based mining tools, these tools have the disadvantages of high costs and the need to make copies of the data on the server.

XpertRule Miner addresses the problems associated with both client and workstation based data mining by supporting a multi-tier client-server architecture. This is made possible by engineering the data mining algorithms of Miner to be multi-tier, consisting of Contingency And Frequency (CAF) servers, which summarise the data, and ProfilerX clients, which generate and display patterns interactively. The advantages of this architecture are:

  • Scalability: For stand alone client based data mining, the database, CAF server and ProfilerX client can all reside on the client PC. For small scale client-server data mining, the database can reside on a server, while both the CAF and ProfilerX client can reside on the client PC. For medium scale client-server data mining, the database and CAF server can reside on a server, with the ProfilerX client residing on the client PC. Finally, for large scale client-server data mining, the database can reside on a high performance data warehouse server, the CAF server can reside on a middle-tier server and the ProfilerX client resides on each client PC.
  • Performance: The scalability of the architecture ensures that performance can be optimised regardless of the scale/architecture of data mining. This is achieved through a number of innovative features:
    • The multi tier architecture allows the CAF server, with its high bandwidth requirement, to be placed at the point where it has the maximum bandwidth to the database server, while the ProfilerX client, with its low bandwidth requirements, can be run on client machines.

    • The CAF server can exploit the high performance (parallelism) of a database server by mining the data in-situ (i.e. without moving the data) through the firing of SQL query streams at the database. These intelligent queries will generate the required contingency and frequency counts without needing to read all the source data.

    • The CAF server can cache data from the database server using tokenised highly optimised data structures. This allows data mining of millions of data records (gigabytes of data) in minutes on standard specification Windows 95, 98 or NT machines (e.g. 333 MHz Pentium with 64MB RAM).
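The idea of obtaining contingency and frequency counts in-situ can be illustrated with a single grouped SQL query. This is not the CAF server's actual query stream: SQLite stands in for the database server, and the function, table and column names are assumptions.

```python
import sqlite3


def contingency_counts(connection, table, attribute, goal):
    """Return {(attribute_value, goal_value): count}, computed inside the database."""
    # Identifiers are interpolated directly, so they must come from trusted metadata.
    sql = (
        f"SELECT {attribute}, {goal}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {goal}"
    )
    return {(a, g): n for a, g, n in connection.execute(sql)}


# An in-memory SQLite database standing in for the data warehouse.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE mortgages (employment TEXT, arrears TEXT)")
connection.executemany(
    "INSERT INTO mortgages VALUES (?, ?)",
    [
        ("salaried", "no"),
        ("salaried", "no"),
        ("salaried", "yes"),
        ("self-employed", "yes"),
    ],
)

counts = contingency_counts(connection, "mortgages", "employment", "arrears")
print(counts)
```

Only the summary counts travel back from the database, one row per combination of values, which is why the bandwidth needed beyond this point is low compared with moving the source records.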