Consistency
Description
A data set must not have any contradictions within itself or with other data sets. Data are inconsistent if different valid states are not compatible with each other.
Consistency measures whether two data values drawn from different data sets conflict with each other. A common data quality metric for consistency is the percentage of values that match across records.
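The match percentage mentioned above can be sketched with pandas as follows. This is a minimal illustration, not a prescribed method: the helper `match_rate`, the two example data frames and the column names `customer_id` and `country` are all hypothetical.

```python
import pandas as pd

# Hypothetical example data: the same customers as recorded by two systems.
crm = pd.DataFrame(
    {"customer_id": [1, 2, 3, 4], "country": ["DE", "FR", "IT", "ES"]}
)
billing = pd.DataFrame(
    {"customer_id": [1, 2, 3, 4], "country": ["DE", "FR", "AT", "ES"]}
)


def match_rate(left, right, key, column):
    """Return the percentage of shared records whose `column` values agree.

    Hypothetical helper: only records present in both data sets are compared.
    Missing values never compare equal, so they count as mismatches.
    """
    merged = left.merge(right, on=key, suffixes=("_left", "_right"))
    if merged.empty:
        return float("nan")
    matches = merged[f"{column}_left"] == merged[f"{column}_right"]
    return 100.0 * matches.mean()


rate = match_rate(crm, billing, key="customer_id", column="country")
print(f"{rate:.1f}% of country values match")  # prints "75.0% of country values match"
```

Three of the four shared records agree on `country`, so the match rate is 75 percent. Whether missing values should count as mismatches depends on the use case; here they do.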
Consistency matters most when the same data are used by different users. In practice, consistent data usually means that data formats and data types are identical throughout, so that the required level of data quality is maintained.
Inconsistencies in data can arise from changes over time and/or across variables, for example in:
Vintages or time periods
Units
Levels of accuracy
Levels of completeness
Types of inclusion or exclusions.
Tools and Libraries
Python
Install pandas and numpy with the following commands:
pip install pandas
pip install numpy
Standard deviation is an absolute measure of dispersion.
Note
Quote: However, one can find which series is more consistent than another by the coefficient of variation, a relative measure of dispersion: the standard deviation divided by the mean, multiplied by 100.
We can calculate consistency using the standard deviation and mean of the given data:
Note
Code snippet coming soon.
The data set with the lower coefficient of variation is more consistent, and vice versa.
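As a hedged sketch of the comparison described above, the coefficient of variation can be computed with numpy as the standard deviation divided by the mean, times 100. The function name `coefficient_of_variation`, the use of the sample standard deviation (`ddof=1`) and the two example series are assumptions made for illustration.

```python
import numpy as np


def coefficient_of_variation(values):
    """Return the standard deviation as a percentage of the mean.

    Assumptions of this sketch: the sample standard deviation (ddof=1)
    is used, and NaN values are ignored.
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a mean of zero")
    return np.nanstd(values, ddof=1) / abs(mean) * 100


# Hypothetical series with the same mean (50) but different spread.
series_a = [48, 50, 52, 49, 51]
series_b = [20, 80, 50, 35, 65]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)

print(f"CV of series a: {cv_a:.2f}%")
print(f"CV of series b: {cv_b:.2f}%")
print("More consistent:", "series a" if cv_a < cv_b else "series b")
```

Both series have a mean of 50, but series a varies far less around it, so its coefficient of variation is lower and it counts as the more consistent of the two.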
Checking for inconsistent datatypes
When processing and using data such as time series or numerical values, it is essential that the data types within a column do not differ. The following function checks a data set for inconsistent data types and reports which columns are affected.
# importing pandas
import pandas as pd
# importing numpy
import numpy as np

# dictionary of lists (named "data" to avoid shadowing the built-in dict)
data = {
    "column a": [0, 90, np.nan, "wort"],
    "column b": [30, 45, 56, 0],
    "column c": [np.nan, 40, 80, 98],
    "column d": [np.nan, 12, 35, None],
}

# creating a dataframe from the dictionary
df = pd.DataFrame(data)


# define function to check for different datatypes
def check_for_types(dataframe):
    """Check if columns of a dataframe consist of different datatypes

    Args:
        dataframe (pd.DataFrame): Input dataframe, e.g. created from a csv-file
    """
    for dtype, column in zip(dataframe.dtypes, dataframe.columns):
        # pandas falls back to the generic "object" dtype for mixed values
        if dtype == object:
            print(f'{column} contains multiple different datatypes!')
In [1]: check_for_types(df)
column a contains multiple different datatypes!
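A column with the `object` dtype is not necessarily mixed: a column holding only strings has that dtype too. As a possible refinement, the sketch below counts the Python types that actually occur in each column and reports only columns with more than one. The helper name `mixed_type_columns` and the reduced example frame are hypothetical.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(dataframe):
    """Map each column holding more than one Python type to its type counts.

    Hypothetical helper: missing values (NaN/None) are ignored, so they do
    not count as a separate type.
    """
    report = {}
    for column in dataframe.columns:
        type_names = dataframe[column].dropna().map(lambda v: type(v).__name__)
        counts = type_names.value_counts()
        if len(counts) > 1:
            report[column] = counts.to_dict()
    return report


example = pd.DataFrame(
    {
        "column a": [0, 90, np.nan, "wort"],  # integers mixed with a string
        "column b": [30, 45, 56, 0],          # integers only
        "column e": ["x", "y", "z", "w"],     # strings only: object dtype, not mixed
    }
)

result = mixed_type_columns(example)
for column, counts in result.items():
    print(f"{column}: {counts}")
```

Only `column a` is reported here, because it mixes integers with a string. The string-only `column e` is left out even though its dtype is `object`.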