Handling Missing Data

노정훈 · July 20, 2023

Trade-offs in Missing Data Conventions

  • A number of approaches have been developed to track the presence of missing data in a table or DataFrame.
  • They revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.

  • In the masking approach, the mask might be an entirely separate Boolean array, or it might involve appropriation of one bit in the data representation to locally indicate the null status of a value.

  • In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN, a special value that is part of the IEEE floating-point specification.

  • Use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation.
  • A sentinel value reduces the range of valid values that can be represented, and may require extra logic in CPU and GPU arithmetic, because common special values like NaN are not available for all data types.
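The two conventions can be sketched with NumPy. The -9999 sentinel below is an arbitrary choice for illustration, and NumPy's `np.ma` masked arrays are one real implementation of the masking approach:

```python
import numpy as np

# Sentinel approach: reserve an in-band value (here -9999, arbitrary) for "missing".
data_sentinel = np.array([1, -9999, 3])
valid = data_sentinel != -9999
print(data_sentinel[valid].sum())  # 4

# Mask approach: keep the data as-is and carry a parallel Boolean array.
# NumPy's masked arrays implement exactly this idea.
masked = np.ma.masked_array([1, 0, 3], mask=[False, True, False])
print(masked.sum())  # 4 -- masked entries are ignored
```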

Missing Data in Pandas

  • The way in which Pandas handles missing values is constrained by its reliance on the Numpy package, which does not have a built-in notion of NA values for non-floating-point data types.
  • For this reason, Pandas has two modes of storing and manipulating null values.
  1. The default mode is to use a sentinel-based missing data scheme, with sentinel values NaN or None depending on the type of the data.
  2. You can opt in to using the nullable data types Pandas provides, which results in the creation of an accompanying mask array to track missing entries. These missing entries are then presented to the user as the special pd.NA value.
  • In either case, the data operations and manipulations provided by the Pandas API will handle and propagate those missing entries in a predictable manner.
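A quick sketch of the two modes side by side (the dtypes shown are what recent Pandas versions produce):

```python
import numpy as np
import pandas as pd

# Default, sentinel-based mode: the integer data is upcast to float64
# so that NaN can be stored in-band.
s_default = pd.Series([1, np.nan, 2])
print(s_default.dtype)   # float64

# Opt-in nullable mode (note the capitalized dtype name): values stay
# integers, and missing entries surface as pd.NA.
s_nullable = pd.Series([1, np.nan, 2], dtype='Int64')
print(s_nullable.dtype)  # Int64
print(s_nullable[1])     # <NA>
```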

None as a Sentinel Value

  • None is a Python object, which means that any array containing None must have dtype=object.
# In[1]
import numpy as np
import pandas as pd

vals1=np.array([1,None,2,3])
vals1
# Out[1]
array([1, None, 2, 3], dtype=object)
  • The dtype=object means that the best common type representation Numpy could infer for the contents of the array is that they are Python objects.
  • Because Python does not support arithmetic operations with None, aggregations like sum or min on such an array will generally lead to an error.
  • For this reason, Pandas does not use None as a sentinel in its numerical arrays.
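For example, summing an object array that contains None raises a TypeError:

```python
import numpy as np

vals1 = np.array([1, None, 2, 3])
try:
    vals1.sum()
except TypeError as e:
    print('sum failed:', e)
```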

NaN: Missing Numerical Data

  • NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
# In[2]
vals2=np.array([1,np.nan,3,4])
vals2
# Out[2]
array([ 1., nan,  3.,  4.])
  • Numpy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
  • NaN is a bit like a data virus; it infects any other object it touches.
  • Regardless of the operation, the result of arithmetic with NaN will be another NaN.
# In[3]
print(1+np.nan)
print(0*np.nan)
# Out[3]
nan
nan
  • This means that aggregates over the values are well defined but not always useful.
# In[4]
vals2.sum(),vals2.min(),vals2.max()
# Out[4]
(nan, nan, nan)
  • Numpy does provide NaN-aware versions of aggregations that will ignore these missing values.
# In[5]
np.nansum(vals2),np.nanmin(vals2),np.nanmax(vals2)
# Out[5]
(8.0, 1.0, 4.0)

NaN and None in Pandas

# In[6]
pd.Series([1,np.nan,2,None])
# Out[6]
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
  • For types that don't have an available sentinel value, Pandas automatically typecasts when NA values are present.
  • If we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA.
# In[7]
x=pd.Series(range(2),dtype=int)
x
# Out[7]
0    0
1    1
dtype: int64

# In[8]
x[0]=None
x
# Out[8]
0    NaN
1    1.0
dtype: float64

Pandas handling of NAs by type

| Typeclass | Conversion when storing NAs | NA sentinel value |
|-----------|-----------------------------|-------------------|
| floating  | No change                   | np.nan            |
| object    | No change                   | None or np.nan    |
| integer   | Cast to float64             | np.nan            |
| boolean   | Cast to object              | None or np.nan    |
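The casts in the table can be checked directly by constructing Series with a None entry:

```python
import pandas as pd

# An NA in integer data forces a cast to float64 ...
print(pd.Series([1, 2, None]).dtype)         # float64
# ... an NA in Boolean data forces a cast to object ...
print(pd.Series([True, False, None]).dtype)  # object
# ... while floating-point data needs no conversion.
print(pd.Series([1.0, None]).dtype)          # float64
```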

Pandas Nullable Dtypes

  • Pandas later added nullable dtypes, which are distinguished from regular dtypes by capitalization of their names.
  • For backward compatibility, these nullable dtypes are only used if specifically requested.
# In[9]
pd.Series([1,np.nan,2,None,pd.NA],dtype='Int32')
# Out[9]
0       1
1    <NA>
2       2
3    <NA>
4    <NA>
dtype: Int32
  • This representation can be used interchangeably with the others in all the operations explored through the rest of this chapter.
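For instance, the NA-handling methods covered below work on the nullable representation as well (a small sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int32')
print(s.isnull().sum())      # 3
print(s.dropna().tolist())   # [1, 2]
print(s.fillna(0).tolist())  # [1, 0, 2, 0, 0]
```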

Operating on Null Values

  • Pandas provides several methods for detecting, removing, and replacing null values in Pandas data structures.
  1. isnull : Generates a Boolean mask indicating missing values.
  2. notnull : Opposite of isnull.
  3. dropna : Returns a filtered version of the data.
  4. fillna : Returns a copy of the data with missing values filled or imputed.

Detecting Null Values

# In[10]
data=pd.Series([1,np.nan,'hello',None])

# In[11]
data.isnull()
# Out[11]
0    False
1     True
2    False
3     True
dtype: bool

# In[12]
data[data.notnull()]
# Out[12]
0        1
2    hello
dtype: object
  • The isnull and notnull methods produce similar Boolean results for DataFrame objects.
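On a DataFrame, isnull returns a Boolean DataFrame of the same shape; a common idiom is to chain it with sum to count missing values per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
print(df.isnull())
print(df.isnull().sum())  # missing values per column: 1, 1, 0
```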

Dropping Null Values

# In[13]
data.dropna()
# Out[13]
0        1
2    hello
dtype: object
  • We cannot drop single values from a DataFrame; we can only drop entire rows or columns.
# In[14]
df=pd.DataFrame([[1     , np.nan, 2],
                 [2     ,      3, 5],
                 [np.nan,      4, 6]])
df
# Out[14]
      0	  1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6
  • By default, dropna will drop all rows in which any null value is present.
# In[15]
df.dropna()
# Out[15]
      0	  1	2
1	2.0	3.0	5
  • Alternatively, you can drop NA values along a different axis by using axis=1 or axis='columns'.
# In[16]
df.dropna(axis=1)
# Out[16]
    2
0	2
1	5
2	6
  • This behavior can be fine-tuned through the how or thresh parameters.

  • The default is how='any' , such that any row or column containing a null value will be dropped.

  • You can also specify how='all' , which will only drop rows/columns that contain all null values.

# In[17]
df[3]=np.nan
df
# Out[17]
      0	  1	2	  3
0	1.0	NaN	2	NaN
1	2.0	3.0	5	NaN
2	NaN	4.0	6	NaN

# In[18]
df.dropna(axis=1,how='all')
# Out[18]
      0	  1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6
  • For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept.
# In[19]
df.dropna(axis=0,thresh=3)
# Out[19]
      0	  1	2	  3
1	2.0	3.0	5	NaN

Filling Null Values

# In[20]
data=pd.Series([1,np.nan,2,None,3],index=list('abcde'),dtype='Int32')
data
# Out[20]
a       1
b    <NA>
c       2
d    <NA>
e       3
dtype: Int32
  • We can fill NA entries with a single value, such as zero.
# In[21]
data.fillna(0)
# Out[21]
a    1
b    0
c    2
d    0
e    3
dtype: Int32
  • We can specify a forward fill to propagate the previous value forward.
# In[22]
data.fillna(method='ffill') # forward fill
# Out[22]
a    1
b    1
c    2
d    2
e    3
dtype: Int32
  • Or we can specify a backward fill to propagate the next value backward.
# In[23]
data.fillna(method='bfill') # backward fill
# Out[23]
a    1
b    2
c    2
d    3
e    3
dtype: Int32
  • In the case of a DataFrame, the options are similar, but we can also specify an axis along which the fills should take place.
# In[24]
df
# Out[24]
      0	  1	2	  3
0	1.0	NaN	2	NaN
1	2.0	3.0	5	NaN
2	NaN	4.0	6	NaN

# In[25]
df.fillna(method='ffill',axis=1)
# Out[25]
      0	  1	  2	  3
0	1.0	1.0	2.0	2.0
1	2.0	3.0	5.0	5.0
2	NaN	4.0	6.0	6.0
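Note that in newer Pandas releases (2.1 and later) the method argument of fillna is deprecated; the dedicated ffill and bfill methods are the recommended replacement and accept the same axis argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])
# Equivalent to df.fillna(method='ffill', axis=1) on older versions:
print(df.ffill(axis=1))
```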