Unambiguous

Description

Each data record must be unambiguously interpretable. If entries differ in only one characteristic, or only in their ID, a duplicate analysis is advisable because there is reason to suspect that they describe the same entry.

Tools and Libraries

Python

In Python’s pandas library, the DataFrame class provides the member function duplicated() to find duplicate rows based on all columns or on a subset of columns. It returns a Boolean Series with the value True for each duplicated row.
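As a minimal sketch of that behaviour, the toy frame below shows how duplicated() marks rows, once across all columns and once for a subset. The data and the column names Name and City are invented for illustration and are not taken from the marketing dataset.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the marketing dataset.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Ann", "Bob", "Ann", "Bob"],
        "City": ["Kiel", "Bonn", "Kiel", "Bonn"],
    }
)

# All columns: no row is fully identical because every ID differs.
mask_all = toy.duplicated()

# Subset of columns: rows 2 and 3 repeat the Name/City pair of rows 0 and 1.
mask_subset = toy.duplicated(subset=["Name", "City"], keep="first")

print(mask_all.tolist())     # [False, False, False, False]
print(mask_subset.tolist())  # [False, False, True, True]
```

With keep="first", the first occurrence of each group stays unmarked and only the later repeats are flagged.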

Install pandas with the following command:

pip install pandas

Find ambiguous entries

To check whether rows occur multiple times, you can use the following code snippet, which checks whether a row is identical to a previous row.

# import pandas
import pandas as pd

# load dataset
df = pd.read_csv(r"C:/Users/Datasets/marketing_campaign.csv", delimiter=";")


def find_ambiguous_sets(dataframe, list_of_columns=None):
    """Return the rows that repeat an earlier row

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file
        list_of_columns (list, optional): List of columns to check for duplicates. Defaults to None to check every column

    Returns:
        pd.DataFrame: Dataframe with ambiguous rows to double check.
    """

    return dataframe[dataframe.duplicated(list_of_columns, keep="first")]

If used on the preloaded dataframe, the function shows that three rows are duplicates. With the list_of_columns parameter you can restrict the check to a subset of columns to find duplicated column values. The more duplicated values there are, the more ambiguous a dataset can be.

In [1]:  find_ambiguous_sets(df, list_of_columns=None)

Out[1]:         ID  Year_Birth   Education  Marital_Status   Income  ...  AcceptedCmp2  Complain  Z_CostContact  Z_Revenue  Response
         89   3033        1963      Master        Together  38620.0  ...             0         0              3         11         0
         131  4646        1951    2n Cycle         Married  78497.0  ...             0         0              3         11         0
         197   326        1973  Graduation         Married  51148.0  ...             0         0              3         11         0
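The case named in the description, entries that differ only by their ID, can be approached by leaving the identifier out of the comparison. The following is a sketch under that assumption; the records are hypothetical, and only the column names ID, Year_Birth and Education are borrowed from the output above.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 differ only in their ID.
records = pd.DataFrame(
    {
        "ID": [10, 11, 12],
        "Year_Birth": [1963, 1951, 1963],
        "Education": ["Master", "PhD", "Master"],
    }
)

# Compare on every column except the identifier.
columns_without_id = [col for col in records.columns if col != "ID"]

# keep=False marks every member of a duplicate group, not only the repeats,
# which makes it easier to review the candidates side by side.
candidates = records[records.duplicated(subset=columns_without_id, keep=False)]

print(candidates["ID"].tolist())  # [10, 12]
```

Whether such candidates really are the same entry still has to be decided by a manual check; the comparison only narrows down where to look.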

Measure unambiguous

Since the raw number of duplicated rows says little on its own, the following function can be used to determine the degree of uniqueness as a percentage. Since only three entries are duplicated, the degree is close to 100 percent.

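def degree_of_unambiguous(dataframe, list_of_columns=None):
    """Compute the percentage of rows that do not repeat an earlier row

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file
        list_of_columns (list, optional): List of columns to check for duplicates. Defaults to None to check every column

    Returns:
        float: Degree to which a dataframe is unambiguous, in percent
    """

    sum_of_duplicates = dataframe.duplicated(subset=list_of_columns, keep="first").sum()

    # Use the row count; dataframe.size would count cells (rows x columns)
    number_of_rows = len(dataframe)

    return ((number_of_rows - sum_of_duplicates) / number_of_rows) * 100

In [1]:  degree_of_unambiguous(df)

Out[1]:  99.86625055728935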
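To make the arithmetic behind such a percentage easier to follow, here is a small worked sketch on invented data. It assumes the degree is measured per row; the helper name row_uniqueness_percent and the handling of an empty frame are assumptions for illustration, not part of the function above.

```python
import pandas as pd


def row_uniqueness_percent(dataframe, list_of_columns=None):
    """Percentage of rows that do not repeat an earlier row (hypothetical helper)."""
    if len(dataframe) == 0:
        # Assumption: an empty frame contains no ambiguous rows.
        return 100.0
    repeats = dataframe.duplicated(subset=list_of_columns, keep="first").sum()
    return (len(dataframe) - repeats) / len(dataframe) * 100


# Toy data invented for illustration.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Ann", "Bob", "Ann", "Cem"],
    }
)

print(row_uniqueness_percent(toy))            # 100.0, the IDs make every row distinct
print(row_uniqueness_percent(toy, ["Name"]))  # 75.0, one repeat out of four rows
```

The second call shows how the choice of columns changes the result: the same frame is fully unambiguous across all columns but only 75 percent unambiguous when compared by Name alone.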