Data Cleansing

Divyojyoti Ghosh
6 min read · Feb 2, 2022
Image created by Divyojyoti Ghosh (the author) on www.canva.com using design elements available there.

Data cleaning is the process of preparing quality data for analysis from real-world data, which is generally of low quality. Low-quality data here means data with various irregularities such as missing values, outliers, and noise. In general, data cleaning and preparation take around 80% of the total time of data engineering.

Data-mining algorithms generally assume that the input data has no irregularities and produce outcomes based on the data provided; therefore, it is essential to prepare good-quality data so that the data-mining algorithms work well and produce useful patterns and outputs. Also, cleaning the data generates a dataset that is smaller than the original dataset, which increases the efficiency of the data-mining algorithm.

Data Cleansing includes four operations—

  1. Sampling
  2. Handling Outliers
  3. Handling Missing Values
  4. Balancing Dataset

Sampling

Sampling is selecting a subset of data from the total data. The total data is also known as the population. There are two basic problems with using the whole population. The first is that the total data is often unavailable; for example, if we are working on leaves of plants, it is impossible to collect data about all the leaves on the planet. The second problem is that even where the total data can be collected, as with transaction data, the computation becomes very difficult because it puts too much load on the algorithm and the system on which the algorithm runs.

The basic idea of sampling is to select a subset of data that captures the maximum variability of the population. Several sampling techniques can be used to draw a sample from the whole data —

1. Simple Random Sampling — The main idea behind this technique is to decrease selection bias. In this, records are selected at random and each record in the dataset has an equal chance of being selected.

2. Systematic Sampling — In this technique, the first record is selected at random, and then every record at a fixed interval after it is selected as part of the sample.

3. Stratified Random Sampling — In this technique, the whole population is divided into sub-populations according to the characteristics of the records, and then records are selected at random from each sub-population. The proportion of the sub-groups is kept the same as in the original dataset. Suppose the total population has 100 members, of which 80 are employed and 20 are unemployed. If the total strength of the sample needs to be 10, then 8 members will be selected at random from the employed group and 2 members will be selected at random from the unemployed group. A short sketch of all three techniques follows.
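
Here is a minimal Python (pandas) sketch of the three techniques. The dataset, column names, and sample sizes are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical population of 100 records; the columns are invented for illustration.
df = pd.DataFrame({
    "employment": ["employed"] * 80 + ["unemployed"] * 20,
    "age": range(100),
})

# 1. Simple random sampling: every record has an equal chance of being selected.
simple = df.sample(n=10, random_state=42)

# 2. Systematic sampling: pick a starting record, then take every k-th record.
k = len(df) // 10          # fixed interval
start = 3                  # in practice chosen at random between 0 and k-1
systematic = df.iloc[start::k]

# 3. Stratified random sampling: sample at random within each sub-population,
#    keeping the original proportions (8 employed and 2 unemployed for a sample of 10).
stratified = (
    df.groupby("employment", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)
```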

Outliers

Outliers are those records that have values substantially different from the rest of the records. Outliers are generally the result of an abnormal measurement during data collection or a fault in the instrument used for measurement, or sometimes they are genuinely exceptional cases. Outliers should be removed from the data because they can produce erroneous patterns, and even when they are real exceptional cases, they can produce patterns that cannot be applied to the general data.

There are several ways to detect outliers. We can use box plots: if we visualize the dataset using a box plot, the data points above or below the whiskers can be considered outliers. We can also use the Inter Quartile Range (IQR) method: we calculate the IQR by subtracting the 1st Quartile value (25th percentile) from the 3rd Quartile value (75th percentile). Then we calculate the lower limit as 1st Quartile value - 1.5 x IQR and the upper limit as 3rd Quartile value + 1.5 x IQR. The values beyond these limits are considered outliers. There are also models that can be used to find outliers, such as Isolation Forest and Local Outlier Factor.
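
As a concrete illustration of the IQR method, here is a minimal Python sketch; the numeric values are made up for the example.

```python
import pandas as pd

# Hypothetical numeric attribute; the values are invented for the example.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

q1 = values.quantile(0.25)   # 1st Quartile (25th percentile)
q3 = values.quantile(0.75)   # 3rd Quartile (75th percentile)
iqr = q3 - q1                # Inter Quartile Range

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values beyond the limits are treated as outliers and can be removed.
outliers = values[(values < lower_limit) | (values > upper_limit)]
cleaned = values[(values >= lower_limit) & (values <= upper_limit)]
print(outliers)   # 95 lies above the upper limit
```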

Missing Values

In a dataset, there can be records that have values missing for certain attributes; these are known as missing values. The values can be missing for different reasons, and the handling of these missing values depends on the reason why they are missing.

Missing Values can be classified into three types —

  1. Missing Completely At Random(MCAR)
  2. Missing At Random(MAR)
  3. Missing Not At Random(MNAR)

Missing Completely At Random(MCAR) — When data is missing completely at random, it means that the reason behind its absence is neither its own value nor the value of any other attribute of that record. The data is missing because of human error, equipment error, a technical error, or perhaps because the person was sick.

In this case, the probability of the data being missing is the same for all records; in other words, any value could have been missing for this reason. There is no relationship between the absence of the value and the value itself or any other value in the record.

Missing At Random(MAR) — When the reason a value is missing for a particular attribute is the value of another attribute, the missing values are considered to be of MAR type. For example, the value of the ‘age’ attribute may be missing mostly for people who belong to a rural area, as in rural areas many people are not literate and do not know their birth date. In this case, we can predict the reason but cannot get any idea about the missing value itself.

Missing Not At Random(MNAR) — When a value is missing not at random, it means the reason behind the value being missing is the value itself. For example, people who have a low income are less likely to enter their income. In this case, the mean income computed from the recorded values will always be biased towards a higher number.

There are two ways to handle missing values —

  1. Removal — If there is only a small number of records with missing values for one or more attributes, a choice can be made to remove these records. However, this technique introduces bias if the data is not MCAR.
  2. Imputation — This technique guesses the missing value and replaces it with that guess. There are different ways to do this, such as replacing the missing value with the mean or median of the values in that column if it is numerical, or with the mode if the column is categorical. The nearest neighbour algorithm can also be used for imputation, or a model can be created to predict the value of the missing data. A sketch of both approaches is given after this list.
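
Below is a minimal Python sketch of removal and imputation, assuming a small hypothetical dataset (the column names and values are invented for illustration); the last step uses scikit-learn's KNNImputer as one example of nearest-neighbour imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with missing values; columns and values are invented.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, np.nan, 29],
    "income": [30000, 45000, np.nan, 52000, 41000, np.nan],
    "area":   ["rural", "urban", np.nan, "urban", "rural", "rural"],
})

# 1. Removal: drop records with any missing value
#    (reasonable only when few records are affected and the data is MCAR).
removed = df.dropna()

# 2. Imputation: replace each missing value with a guess.
imputed = df.copy()
imputed["age"]    = imputed["age"].fillna(imputed["age"].median())       # numerical: median
imputed["income"] = imputed["income"].fillna(imputed["income"].mean())   # numerical: mean
imputed["area"]   = imputed["area"].fillna(imputed["area"].mode()[0])    # categorical: mode

# Nearest-neighbour imputation for the numeric columns (scikit-learn's KNNImputer).
knn = KNNImputer(n_neighbors=2)
imputed[["age", "income"]] = knn.fit_transform(df[["age", "income"]])
```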

Balancing

Real-world datasets may have more data for a particular class or group than for others; some groups may have very few entries in the dataset. A classification algorithm may achieve very good accuracy on such a dataset even though it completely ignores a particular class, because that class has very few entries. For example, if there are 90 entries of employed people and 10 entries of unemployed people in a dataset, an algorithm can achieve 90% accuracy even if it classifies all 10 unemployed people as employed. We need to balance the data so that each class of records has equal representation. This can be done in two ways: Under Sampling and Over Sampling.

Under Sampling is a technique in which the majority class’s records are removed from the dataset to make the number of majority class records and minority class records equal.

Over Sampling is just the reverse of Under Sampling: in this technique, records of the minority classes are added, for example using SMOTE (Synthetic Minority Over-sampling Technique), which uses the nearest neighbour algorithm to create new synthetic samples from the available data.
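
As an illustration, here is a minimal Python sketch of both approaches using the imbalanced-learn library; the dataset is generated only for the example and its parameters are assumptions.

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset: roughly 90 majority and 10 minority records.
X, y = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# Under Sampling: randomly remove majority-class records until both classes are equal.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Over Sampling with SMOTE: create new synthetic minority-class records
# from the nearest neighbours of existing minority samples.
X_over, y_over = SMOTE(random_state=42, k_neighbors=3).fit_resample(X, y)

print((y == 1).sum(), "minority records before balancing")
print((y_over == 1).sum(), "minority records after SMOTE")
```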

