Data preprocessing is a crucial step in data mining. It involves transforming raw data into a clean, structured format suitable for mining. Proper data preprocessing improves the quality of the data, enhances the performance of algorithms, and ensures more accurate and reliable results.
In the real world, many databases and data warehouses contain noisy, missing, and inconsistent data because of their huge size. Low-quality data leads to low-quality mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from
Missing: lacking certain attribute values or containing only aggregate data. E.g., Occupation = “”
Missing (incomplete) data may come from
Inconsistent: different versions of the same data appear in different places. For example, a ZIP code may be stored in one table in the numeric format 1234-567, while in another table it may be represented as 1234567.
Inconsistent data may come from
Data preprocessing is used to improve the quality of both the data and the mining results. Its goal is to enhance the accuracy, efficiency, and reliability of data mining algorithms.
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions must be based on quality data. Data preprocessing involves Data Cleaning, Data Integration, Data Reduction, and Data Transformation.
Steps in Data Preprocessing
Data cleaning is a process that "cleans" the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes such as customer income. How can you go about filling in the missing values for this attribute? There are several methods for filling in the missing values.
They are:
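A common and simple option among these is to replace missing numeric values with the attribute mean. The sketch below uses pandas and a hypothetical customers table with an income column; the names and values are illustrative assumptions, not part of the original example.

```python
import pandas as pd
import numpy as np

# Hypothetical customer tuples with missing income values
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income": [52000.0, np.nan, 61000.0, np.nan, 47000.0],
})

# Fill the missing values with the attribute mean
mean_income = customers["income"].mean()           # mean() ignores NaN by default
customers["income"] = customers["income"].fillna(mean_income)
print(customers)
```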
Noise is a random error or variance in a measured variable. Data smoothing techniques are used to eliminate noise and extract useful patterns. The different techniques used for data smoothing are:
Binning: Binning methods smooth a sorted data value by consulting its “neighbourhood,” that is, the values around it. The sorted values are distributed into several “buckets,” or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.
There are three kinds of smoothing by binning: smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries.
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
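The smoothing-by-bin-means variant above can be reproduced with a short script. This is only an illustrative sketch, assuming equal-frequency bins of three values each, as in the example.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # sorted data for price (in dollars)
bin_size = 3                                 # equal-frequency bins of three values each

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = round(sum(bin_values) / len(bin_values))  # e.g. (4 + 8 + 15) / 3 = 9
    smoothed.extend([bin_mean] * len(bin_values))        # replace every value with its bin mean

print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```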
Data integration is the process of combining data from multiple sources into a single, unified view. This process involves identifying and accessing the different data sources and mapping the data to a common format. Different data sources may include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data that is spread across multiple systems or platforms, in order to gain a more complete and accurate understanding of the data.
A data integration strategy is typically described as a triple (G, S, M), where G denotes the global schema, S denotes the schemas of the heterogeneous data sources, and M represents the mappings between the source schemas and the global schema.
Example: To understand the (G, S, M) approach, let us consider a data integration scenario that aims to combine employee data from two different HR databases, database A and database B. The global schema (G) would define the unified view of employee data, including attributes like EmployeeID, Name, Department, and Salary.
In the schema of heterogeneous sources, database A (S1) might have attributes like EmpID, FullName, Dept, and Pay, while database B's schema (S2) might have attributes like ID, EmployeeName, DepartmentName, and Wage. The mappings (M) would then define how the attributes in S1 and S2 map to the attributes in G, allowing for the integration of employee data from both systems into the global schema.
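One lightweight way to picture the mappings M is as attribute-renaming rules applied before the source records are merged under the global schema G. The sketch below assumes the hypothetical schemas S1 and S2 from the example; the employee records themselves are invented purely for illustration.

```python
import pandas as pd

# Hypothetical records from the two HR databases
db_a = pd.DataFrame({"EmpID": [101], "FullName": ["Asha Rao"], "Dept": ["Sales"], "Pay": [48000]})
db_b = pd.DataFrame({"ID": [202], "EmployeeName": ["Liam Chen"], "DepartmentName": ["HR"], "Wage": [52000]})

# Mappings M: source attributes (S1, S2) -> global schema attributes (G)
m_a = {"EmpID": "EmployeeID", "FullName": "Name", "Dept": "Department", "Pay": "Salary"}
m_b = {"ID": "EmployeeID", "EmployeeName": "Name", "DepartmentName": "Department", "Wage": "Salary"}

# Apply the mappings and build the unified view under G
global_view = pd.concat([db_a.rename(columns=m_a), db_b.rename(columns=m_b)], ignore_index=True)
print(global_view)
```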
There are several issues that can arise when integrating data from multiple sources, including:
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
In simple words, data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
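As one illustration, numerosity reduction can be as simple as keeping a random sample of the tuples, which shrinks the data set while roughly preserving its summary statistics. The following is a minimal sketch assuming a hypothetical sales table.

```python
import pandas as pd
import numpy as np

# Hypothetical large sales table
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "item_id": rng.integers(1, 1000, size=100_000),
    "amount": rng.normal(250, 60, size=100_000),
})

# Numerosity reduction: keep a 1% simple random sample of the tuples
sample = sales.sample(frac=0.01, random_state=0)

# The sample is far smaller, but its summary statistics stay close to the original
print(len(sales), len(sample))
print(round(sales["amount"].mean(), 1), round(sample["amount"].mean(), 1))
```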
Data transformation in data mining refers to the process of converting raw data into a format that is suitable for analysis and modelling. The goal of data transformation is to prepare the data for data mining so that it can be used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
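For example, normalization is a typical transformation step; min-max normalization rescales an attribute to the [0, 1] range. Below is a minimal sketch with hypothetical income values.

```python
import pandas as pd

# Hypothetical raw values for an income attribute
income = pd.Series([12000, 35000, 58000, 73600, 98000])

# Min-max normalization: rescale the attribute to the [0.0, 1.0] range
income_scaled = (income - income.min()) / (income.max() - income.min())
print(income_scaled.round(3).tolist())
```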
Method Name | Irregularity | Output |
---|---|---|
Data Cleaning | Missing, noisy, and inconsistent data | Quality data ready for integration |
Data Integration | Different data sources (data cubes, databases, or flat files) | A unified view of the data |
Data Reduction | Data volumes so huge that analysis becomes impractical or infeasible | A reduced data set that maintains the integrity of the original data |
Data Transformation | Raw data in a format unsuitable for analysis | Data prepared for data mining |