Each data record must be unambiguously interpretable. If entries differ in only one characteristic, or only in their ID, a duplicate analysis is advisable, because there is a reasonable suspicion that they describe the same entry.
Tools and Libraries
In Python’s pandas library, the DataFrame class provides a member function, duplicated(), to find duplicate rows based on all columns or on specific columns. It returns a Boolean Series with True for each duplicated row.
Install pandas via command:
pip install pandas
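Once pandas is installed, the behaviour of duplicated() can be tried on a small example. The following is a minimal sketch; the toy DataFrame and the variable names `toy`, `mask_all` and `mask_city` are made up for illustration and are not part of the marketing dataset.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "city": ["Oslo", "Oslo", "Oslo"],
})

# Compare all columns: True marks a row that repeats an earlier row.
mask_all = toy.duplicated()

# Compare only the "city" column via the subset parameter.
mask_city = toy.duplicated(subset=["city"])

print(mask_all.tolist())   # [False, False, True]
print(mask_city.tolist())  # [False, True, True]
```

With the default keep="first", the first occurrence is not flagged; only the later repetitions are marked True.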
Find ambiguous entries
To check whether rows occur multiple times, you can use this code snippet, which checks whether a row is identical to a previous row.
# import pandas
import pandas as pd

# load dataset
df = pd.read_csv(r"C:/Users/Datasets/marketing_campaign.csv", delimiter=";")


def find_ambiguous_sets(dataframe, list_of_columns=None):
    """
    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file
        list_of_columns (list, optional): List of columns to check for duplications.
            Defaults to None to check every column

    Returns:
        pd.DataFrame: Dataframe with ambiguous rows to double check.
    """
    # keep="first" leaves the first occurrence unmarked and flags later repeats
    return dataframe[dataframe.duplicated(list_of_columns, keep="first")]
If used on the preloaded dataframe, the function shows that three rows are duplicates. With the list_of_columns parameter you can restrict the check to a subset of columns to find duplicated column values. The more values there are, the more ambiguous a dataset can be.
In : find_ambiguous_sets(df, list_of_columns=None)
Out:
       ID  Year_Birth   Education  Marital_Status   Income  ...  AcceptedCmp2  Complain  Z_CostContact  Z_Revenue  Response
89   3033        1963      Master        Together  38620.0  ...             0         0              3         11         0
131  4646        1951    2n Cycle         Married  78497.0  ...             0         0              3         11         0
197   326        1973  Graduation         Married  51148.0  ...             0         0              3         11         0
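Entries that differ only by their ID can be found by leaving the ID column out of the comparison. The sketch below assumes the identifier column is named "ID", as in the output above; the toy records and the names `records`, `columns_without_id`, `same_except_id` and `all_columns` are hypothetical.

```python
import pandas as pd


def find_ambiguous_sets(dataframe, list_of_columns=None):
    """Return rows that repeat an earlier row in the given columns."""
    return dataframe[dataframe.duplicated(list_of_columns, keep="first")]


# Hypothetical toy records: rows 0 and 2 differ only in their ID.
records = pd.DataFrame({
    "ID": [1, 2, 3],
    "Year_Birth": [1963, 1951, 1963],
    "Education": ["Master", "2n Cycle", "Master"],
})

# Compare every column except the ID.
columns_without_id = [col for col in records.columns if col != "ID"]

same_except_id = find_ambiguous_sets(records, columns_without_id)
all_columns = find_ambiguous_sets(records)

print(same_except_id)     # the row with index 2
print(all_columns.empty)  # True: no row repeats when the ID is compared too
```

Rows returned this way are candidates for a manual check, not proven duplicates, since two distinct entries may legitimately share all other values.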
Since the absolute number of duplicated rows has little significance on its own, the following function can be used to determine the degree of uniqueness as a percentage. Since only three entries are duplicated, the degree is almost 100 percent.
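def degree_of_unambiguous(dataframe, list_of_columns=None):
    """Computes the percentage of rows that are not duplicates

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file
        list_of_columns (list, optional): List of columns to check for duplications.
            Defaults to None to check every column

    Returns:
        float: Degree to which a dataframe is unambiguous, in percent
    """
    # number of rows that repeat an earlier row
    sum_of_duplicates = dataframe.duplicated(subset=list_of_columns, keep="first").sum()
    # len(dataframe) is the number of rows
    number_of_rows = len(dataframe)
    return ((number_of_rows - sum_of_duplicates) / number_of_rows) * 100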
In : degree_of_unambiguous(df)
Out: 99.86625055728935
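The calculation can be checked by hand on a small example. The sketch below assumes the degree is measured per row, that is, relative to the number of rows; the toy data and the helper name `row_uniqueness_percent` are hypothetical, and treating an empty dataframe as fully unambiguous is an assumption as well.

```python
import pandas as pd


def row_uniqueness_percent(dataframe, list_of_columns=None):
    """Percentage of rows that do not repeat an earlier row (hypothetical helper)."""
    n_rows = len(dataframe)
    if n_rows == 0:
        # Assumption: an empty dataframe counts as fully unambiguous.
        return 100.0
    n_duplicates = dataframe.duplicated(subset=list_of_columns, keep="first").sum()
    return float((n_rows - n_duplicates) / n_rows * 100)


# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({"a": [1, 2, 1, 3], "b": ["x", "y", "x", "z"]})

degree = row_uniqueness_percent(toy)
print(degree)  # 75.0, since 3 of 4 rows do not repeat an earlier row
```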