Relevance
Description
Relevance in data quality refers to the usefulness of collected data and whether this data is needed for further processing. However, the concept of “relevance” can vary depending on the specific context and requirements of your task.
In the following example, a data set is examined for a criterion, i.e. a word or a numerical value. If the data set contains the criterion, it is classified as relevant.
Tools and Libraries
Install pandas
pip install pandas
Python
Load the needed dataset as showed in the following code snippet:
# import pandas
import pandas as pd
# load dataset
df = pd.read_csv(r"C:/Users/Datasets/beachwater.csv", delimiter=";")
print(df)
If we look at the data set, we see that the first column contains the name of the individual beaches. This is a good selection criterion to check the relevance.
In [1]: print(df)
Out[2]: Beach Name Measurement Timestamp Water Temperature Turbidity ... Wave Period Battery Life Measurement Timestamp Label Measurement ID
0 Montrose Beach 08/30/2013 08:00:00 AM 20.3 1.18 ... 3.0 9.4 8/30/2013 8:00 AM MontroseBeach201308300800
1 Ohio Street Beach 05/26/2016 01:00:00 PM 14.4 1.23 ... 4.0 12.4 05/26/2016 1:00 PM OhioStreetBeach201605261300
2 Calumet Beach 09.03.2013 16:00 23.2 3.63 ... 6.0 9.4 09.03.2013 16:00 CalumetBeach201309031600
3 Calumet Beach 05/28/2014 12:00:00 PM 16.2 1.26 ... 4.0 11.7 5/28/2014 12:00 PM CalumetBeach201405281200
4 Montrose Beach 05/28/2014 12:00:00 PM 14.4 3.36 ... 4.0 11.9 5/28/2014 12:00 PM MontroseBeach201405281200
The following function returns whether a search word is present in the dataset. With this simple function, the exploratory part of the data analysis can be discarded if it is clear from the beginning that relevance is not guaranteed.
def check_dataset_relevance(df, keyword):
"""This function gives feedback if a given keywords exists in a dataframe.
Args:
df (dataframe): dataframe to search in
keyword (string): Keyword as criteria to look for in the given dataframe
"""
# Check if the keyword is present in any column of the DataFrame
keyword_found = any(df.apply(lambda row: keyword in str(row), axis=1))
if keyword_found:
print(f"The dataset is relevant. It contains the keyword '{keyword}'.")
else:
print(f"The dataset is not relevant. It does not contain the keyword '{keyword}'.")
With this function we can, for example, search the beach dataset for specific beaches. In our case, we can find out if there is sensor data for Ohio Street Beach.
In [1]: df["height_new"] = df.apply(lambda row: get_inches(row["Height"]), axis=1)
Out[2]: The dataset is relevant. It contains the keyword 'Ohio Street Beach'.
Relevance can be understood in many ways. Therefore, the previous example is only a small excerpt of what is possible.