Knowledge Discovery in Databases

Divyojyoti Ghosh
3 min read · Jan 25, 2022

Data science is the practice of extracting knowledge from data by identifying patterns in it. Knowledge Discovery in Databases (KDD) is one methodology for identifying such patterns; in other words, it is a process for making sense of data.

In fields such as medicine, astronomy, finance, retail, and marketing, databases have grown so large that analysing and interpreting the data manually is no longer feasible. Analysing this data requires modern technologies with high computational capacity. Users' expectations are also becoming more sophisticated with each passing day. For example, online stores once only needed to show the products available in their stores grouped by category; today their main requirement is to recommend the right products to the right customers.

KDD addresses this problem of data overload by providing a methodology for discovering knowledge in these huge databases. Data mining is the step in KDD that applies algorithms to extract patterns from data.

KDD Roadmap

Fayyad et al. (1996) define Knowledge Discovery in Databases (KDD) as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Because KDD is a process, it involves several steps, and it is both iterative and interactive. The steps are as follows:

  1. Problem Specification: Information about the problem domain is gathered. An understanding of the application domain is developed so that the goal of the whole process can be set.
  2. Resource Gathering: All required resources, such as software, hardware, and data, are gathered. The relevant subset of the database is selected according to the requirements.
  3. Data Cleansing: This step applies cleaning operations such as handling missing values, handling outliers, and balancing the database with respect to the target field. Its output is a cleansed operational database.
  4. Data Pre-processing: The data is changed at the feature level, for example by creating new features (columns) from the original features in the database or by removing features.
  5. Selection of Data Mining Task: There are several data mining tasks, such as classification, clustering, regression, and association. A task is chosen for the database according to the goal of the KDD process.
  6. Model Selection: The data mining algorithm and the method for extracting patterns are chosen. The parameters of the algorithm are also selected here to get the best outcome from KDD.
  7. Data Mining: The patterns are extracted using the chosen data mining task and algorithm. The model is also evaluated in this step: the performance of the algorithm is tested against various evaluation criteria.
  8. Interpretation: The extracted patterns are interpreted using visualization techniques. The data can also be visualized according to the model established in the data mining step.
  9. Acting upon Knowledge: Actions are taken on the basis of the knowledge obtained in the steps above. Working through these steps also clarifies earlier conflicts regarding the problem or the application domain.

All of the above steps are highly iterative, and the process can loop back between any two steps.
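
To make the middle of the roadmap more concrete, here is a minimal sketch of steps 3 to 7 in plain Python. Everything in it is an assumption made for illustration: the toy customer rows, the column meanings, the derived "spend per visit" feature, and the simple threshold classifier are hypothetical and are not taken from Fayyad et al. or from any particular library. A real project would normally use a dedicated data mining toolkit and far more data.

```python
from statistics import mean

# Hypothetical toy rows: (monthly_visits, monthly_spend, bought_premium).
# None marks a missing value that the cleansing step has to handle.
TRAIN_ROWS = [
    (12, 240.0, 1),
    (3, 30.0, 0),
    (6, None, 1),
    (2, 24.0, 0),
    (15, 330.0, 1),
    (4, 36.0, 0),
]
HOLDOUT_ROWS = [
    (None, 50.0, 0),
    (11, 250.0, 1),
]


def column_means(rows):
    """Column means over the known values, used later for imputation."""
    visits = mean(r[0] for r in rows if r[0] is not None)
    spend = mean(r[1] for r in rows if r[1] is not None)
    return visits, spend


def cleanse(rows, means):
    """Step 3 (Data Cleansing): replace missing values with the given means."""
    v_mean, s_mean = means
    return [
        (v_mean if v is None else v, s_mean if s is None else s, y)
        for v, s, y in rows
    ]


def preprocess(rows):
    """Step 4 (Data Pre-processing): derive a new feature, spend per visit."""
    return [(s / v, y) for v, s, y in rows]


def fit_threshold(samples):
    """Steps 5-6: the task is classification; the model is a one-feature
    threshold placed midway between the two class means."""
    pos = mean(x for x, y in samples if y == 1)
    neg = mean(x for x, y in samples if y == 0)
    return (pos + neg) / 2


def accuracy(threshold, samples):
    """Step 7 (evaluation): share of samples the threshold rule gets right."""
    hits = sum((1 if x > threshold else 0) == y for x, y in samples)
    return hits / len(samples)


def run_kdd(train_rows, holdout_rows):
    # Means come from the training rows only, so the holdout rows
    # do not influence the imputation.
    means = column_means(train_rows)
    train = preprocess(cleanse(train_rows, means))
    holdout = preprocess(cleanse(holdout_rows, means))
    threshold = fit_threshold(train)          # step 7: extract the pattern
    return threshold, accuracy(threshold, holdout)


if __name__ == "__main__":
    threshold, acc = run_kdd(TRAIN_ROWS, HOLDOUT_ROWS)
    print(f"threshold={threshold:.2f}, holdout accuracy={acc:.2f}")
```

If the holdout accuracy were poor, the iterative nature of KDD would show up here as going back to an earlier function, for example trying a different derived feature in `preprocess` or a different model in `fit_threshold`, and running the pipeline again.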

References -

[1] Kotu, V. and Deshpande, B., 2019. Data Science (Second Edition). Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-814761-0.00001-0
(https://www.sciencedirect.com/science/article/pii/B9780128147610000010)

[2] Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine, 17(3), pp.37–54.
