# Completeness

## Description

Measured, stored or recorded data must contain all necessary attributes. Missing entries, often represented as NaN values, can result from faulty operations and reduce completeness. To assess this quality dimension, check whether all expected information is available or whether elements are missing.

## Tools and Libraries

### Python

To measure the completeness of a data set, it makes sense to identify data gaps and, if necessary, to quantify them. The following simple example and functions show how data gaps can be identified.

Install pandas and numpy from the command line:

```
pip install pandas
pip install numpy
```

Then create a small example dataframe:

```
# import pandas
import pandas as pd

# import numpy
import numpy as np

# dictionary of lists (named `data` to avoid shadowing the built-in `dict`)
data = {
    "column a": [0, 90, np.nan, 95],
    "column b": [30, 45, 56, 0],
    "column c": [np.nan, 40, 80, 98],
    "column d": [np.nan, 12, 35, None],
}

# creating a dataframe from the dictionary
df = pd.DataFrame(data)
```

#### Identifying missing data

The example dataframe has four columns in which `None` and `np.nan` values occur. Missing values do not always appear in these formats. How to identify and quantify completely empty cells can be read here <link>.
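
Files often encode gaps with placeholder strings instead of `None` or `np.nan`. As a minimal sketch, assuming a hypothetical CSV snippet in which the strings `missing` and `-` mark gaps (both placeholders and the column names are illustrative assumptions, not taken from a real data set), the `na_values` parameter of `pd.read_csv` can convert such placeholders to `NaN` while loading:

```python
import io

import pandas as pd

# Hypothetical CSV content; "missing" and "-" are assumed placeholder strings.
csv_text = "sensor,value\na,1.5\nb,missing\nc,-\nd,4.0\n"

# Without na_values the placeholders are read as ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values the placeholders are parsed as NaN.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-"])

raw_missing = int(raw["value"].isna().sum())
parsed_missing = int(parsed["value"].isna().sum())
print(raw_missing, parsed_missing)  # 0 2
```

Which strings count as placeholders depends on the data source, so the list passed to `na_values` has to be chosen per data set.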


The dataframe looks as follows:

```
   column a  column b  column c  column d
0       0.0        30       NaN       NaN
1      90.0        45      40.0      12.0
2       NaN        56      80.0      35.0
3      95.0         0      98.0       NaN
```

The following function returns two arrays with the row and column coordinates of the missing data for later computation.

```
# import numpy
import numpy as np


def missing_value_coordinates(dataframe):
    """Finds missing values in a dataframe

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file

    Returns:
        tuple: Two arrays with the row and column indices of the missing values
    """

    # isna() also works for non-numeric columns, unlike np.isnan()
    return np.where(dataframe.isna().to_numpy())
```

Output of the function `missing_value_coordinates()` for the example dataframe:

```
In : missing_value_coordinates(df)

Out: (array([0, 0, 2, 3], dtype=int64), array([2, 3, 0, 3], dtype=int64))
```
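
The two coordinate arrays are positional, so it can help to translate them back into row labels and column names. The following is a minimal sketch that rebuilds the example dataframe and pairs each missing cell with its labels; the variable names `rows`, `cols` and `missing_cells` are illustrative and not part of the original functions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "column a": [0, 90, np.nan, 95],
        "column b": [30, 45, 56, 0],
        "column c": [np.nan, 40, 80, 98],
        "column d": [np.nan, 12, 35, None],
    }
)

# Row and column positions of the missing cells.
rows, cols = np.where(df.isna().to_numpy())

# Translate the positions into (row label, column name) pairs.
missing_cells = [(int(df.index[r]), str(df.columns[c])) for r, c in zip(rows, cols)]
print(missing_cells)
# [(0, 'column c'), (0, 'column d'), (2, 'column a'), (3, 'column d')]
```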

To count the number of missing values column-wise, you can use the following function, which chains two pandas methods.

```
def count_missing_value(dataframe):
    """Counts missing values in a dataframe

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file

    Returns:
        pd.Series: Number of missing values per column
    """

    return dataframe.isnull().sum()
```

This is the output for the example dataframe. It shows in which columns values are missing and how many.

```
In : count_missing_value(df)

Out: column a    1
column b    0
column c    1
column d    2
```
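
The same counting pattern can be applied row-wise or to the whole dataframe. The sketch below rebuilds the example dataframe and uses `sum(axis=1)` for per-row counts; the names `missing_per_row` and `total_missing` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "column a": [0, 90, np.nan, 95],
        "column b": [30, 45, 56, 0],
        "column c": [np.nan, 40, 80, 98],
        "column d": [np.nan, 12, 35, None],
    }
)

# Number of missing values in each row.
missing_per_row = df.isna().sum(axis=1)

# Total number of missing cells in the dataframe.
total_missing = int(df.isna().sum().sum())

print(missing_per_row.tolist())  # [2, 0, 1, 1]
print(total_missing)  # 4
```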

Sometimes one works with data sets in which values are missing from the outset and these gaps are not as easily visible as in the last example. This can be investigated using an open source marketing data set.

```
In :  marketing_df = pd.read_csv(
          r"C:/Users/Goerner/Desktop/Datasets/marketing_campaign.csv", delimiter="\t"
      )

      marketing_df.iloc[10, :7]

Out:  ID                      1994
Year_Birth              1983
Marital_Status       Married
Income                   NaN
Kidhome                    1
Teenhome                   0
```

Counting the NaN values in the first seven columns of this data set gives:

```
In : count_missing_value(marketing_df.iloc[:, :7])

Out:  ID                 0
Year_Birth         0
Education          0
Marital_Status     0
Income            24
Kidhome            0
Teenhome           0
```

#### Removing missing data

Missing data can be problematic, for example for machine-learning algorithms, because many models cannot handle missing values. In this situation it can make sense to remove the rows with missing data.

This takes two steps. The first step is to declare which values count as missing. The second step is to convert these values into NaN values and remove the corresponding rows.

```
# import numpy
import numpy as np


def replace_missing_value(dataframe, values: list):
    """Replaces the given values with np.nan

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file
        values (list): List of values to replace with np.nan

    Returns:
        pd.DataFrame: Dataframe with replaced values
    """
    new_dataframe = dataframe.copy()

    # apply each replacement to the result of the previous one
    for value in values:
        new_dataframe = new_dataframe.replace(value, np.nan)

    return new_dataframe
```

For example, the string `None` and the value `0` are replaced with `NaN` values:

```
In : replace_missing_value(df, values=[str(None), 0])

Out:    column a  column b  column c  column d
0       NaN      30.0       NaN       NaN
1      90.0      45.0      40.0      12.0
2       NaN      56.0      80.0      35.0
3      95.0       NaN      98.0       NaN
```

Rows that contained `None` or `0` are then dropped with `dropna()`, which returns a clean dataframe.

```
In : replace_missing_value(df, values=[str(None), 0]).dropna()

Out:    column a  column b  column c  column d
1      90.0      45.0      40.0      12.0
```
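
Dropping every row that has any gap can discard a lot of data; in the example only one of four rows survives. As a hedged sketch of less strict alternatives, `dropna()` also accepts the `subset` and `thresh` parameters. The choice of `column a` and the threshold of three are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "column a": [0, 90, np.nan, 95],
        "column b": [30, 45, 56, 0],
        "column c": [np.nan, 40, 80, 98],
        "column d": [np.nan, 12, 35, None],
    }
)

# Drop a row only if the value in "column a" is missing.
required_a = df.dropna(subset=["column a"])

# Keep rows that have at least three non-missing values.
mostly_complete = df.dropna(thresh=3)

print(required_a.index.tolist())  # [0, 1, 3]
print(mostly_complete.index.tolist())  # [1, 2, 3]
```

Both calls return new dataframes and leave `df` unchanged.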

#### Measuring completeness

There are several ways to determine the completeness of a data set. The completeness can refer to individual entries, columns or rows. Simple functions for two of these cases are provided below.

Share of complete data series (rows):

```
def complete_data_series(dataframe):
    """Calculates the degree of complete row-wise entries in a given dataframe

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file

    Returns:
        float: Degree of complete row-wise entries in the given dataframe
    """

    # number of rows that contain at least one missing value
    row_count_missing_data = dataframe.isnull().any(axis=1).sum()

    return 1 - (row_count_missing_data / len(dataframe))
```

Share of complete entries (cells):

```
def degree_of_completeness(dataframe):
    """Calculates the degree of complete entries in a dataframe

    Args:
        dataframe (pd.DataFrame): Input dataframe created from a csv-file

    Returns:
        float: Degree of complete entries in the given dataframe
    """

    # total number of missing cells across all columns
    missing_data_count = dataframe.isnull().sum().sum()

    return 1 - (missing_data_count / dataframe.size)
```
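
As a worked example, the sketch below applies the same calculations to the example dataframe and adds a column-wise variant, which the text mentions but for which no function is given. The expressions are written inline so the block runs on its own; `column_completeness` is an illustrative name, not part of the functions above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "column a": [0, 90, np.nan, 95],
        "column b": [30, 45, 56, 0],
        "column c": [np.nan, 40, 80, 98],
        "column d": [np.nan, 12, 35, None],
    }
)

# Share of rows without any missing value (1 of 4 rows is complete).
row_completeness = float(1 - df.isnull().any(axis=1).sum() / len(df))

# Share of cells that are not missing (12 of 16 cells are filled).
cell_completeness = float(1 - df.isnull().sum().sum() / df.size)

# Share of non-missing values per column.
column_completeness = 1 - df.isnull().mean()

print(row_completeness)  # 0.25
print(cell_completeness)  # 0.75
print(float(column_completeness["column d"]))  # 0.5
```

Row-wise completeness is the strictest of the three measures, because a single gap marks the whole row as incomplete.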