One sentinel for missing data is NaN, a special value that is part of the IEEE floating-point specification. Sentinels such as NaN are not available for all data types.
Pandas marks missing values with NaN or None, depending on the type of the data. Its nullable dtypes use the pd.NA value.
None is a Python object, which means that any array containing None must have dtype=object.
# In[1]
import numpy as np
import pandas as pd
vals1 = np.array([1, None, 2, 3])
vals1
# Out[1]
array([1, None, 2, 3], dtype=object)
dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
When an array contains None, aggregations like sum or min will generally lead to an error.
For this reason, Pandas does not use None as a sentinel in its numerical arrays.
NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
# In[2]
vals2=np.array([1,np.nan,3,4])
vals2
# Out[2]
array([ 1., nan, 3., 4.])
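The contrast between the two sentinels described above can be checked directly. This is a minimal sketch that assumes only NumPy; the names obj_arr, float_arr, and sum_failed are illustrative and do not come from the original text.

```python
import numpy as np

# Illustrative arrays (hypothetical names, not from the original text).
obj_arr = np.array([1, None, 2, 3])      # contains None, so dtype is object
float_arr = np.array([1, np.nan, 3, 4])  # contains NaN, so dtype is float64

print(obj_arr.dtype, float_arr.dtype)

# Summing the object array falls back to Python-level addition,
# which fails as soon as an int is added to None.
try:
    obj_arr.sum()
    sum_failed = False
except TypeError as exc:
    sum_failed = True
    print("TypeError:", exc)
```

The float array keeps a native numeric dtype, which is why NaN rather than None is the practical sentinel for numerical data.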
NaN is a bit like a data virus: it infects any other object it touches.
Regardless of the operation, the result of arithmetic with NaN will be another NaN.
# In[3]
print(1+np.nan)
print(0*np.nan)
# Out[3]
nan
nan
# In[4]
vals2.sum(),vals2.min(),vals2.max()
# Out[4]
(nan,nan,nan)
NumPy provides NaN-aware versions of aggregations that will ignore these missing values.
# In[5]
np.nansum(vals2),np.nanmin(vals2),np.nanmax(vals2)
# Out[5]
(8.0, 1.0, 4.0)
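Because NaN never compares equal to anything, locating it needs a dedicated check. The sketch below, assuming only NumPy, uses np.isnan to build a mask and shows that aggregating over the unmasked entries agrees with np.nansum. The names vals, mask, and manual_sum are illustrative.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN does not compare equal to anything, including itself,
# so an equality test cannot be used to find it.
print(np.nan == np.nan)

# np.isnan builds a Boolean mask of the NaN positions instead.
mask = np.isnan(vals)
print(mask)

# Aggregating over the valid entries only gives the same result
# as the NaN-aware aggregation.
manual_sum = vals[~mask].sum()
print(manual_sum, np.nansum(vals))
```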
# In[6]
pd.Series([1,np.nan,2,None])
# Out[6]
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
If a value in an integer array is set to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA.
# In[7]
x=pd.Series(range(2),dtype=int)
x
# Out[7]
0 0
1 1
dtype: int64
# In[8]
x[0]=None
x
# Out[8]
0 NaN
1 1.0
dtype: float64
Pandas handling of NAs by type
Typeclass | Conversion when storing NAs | NA sentinel value |
---|---|---|
floating | No change | np.nan |
object | No change | None or np.nan |
integer | Cast to float64 | np.nan |
boolean | Cast to object | None or np.nan |
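The conversions in the table can be observed by building a Series from a list that contains a missing value. This is a small sketch for the default (non-nullable) dtypes; the variable names are illustrative, and the exact behavior may vary between pandas versions.

```python
import pandas as pd

# Each list mixes regular values with a missing value (None).
int_with_na = pd.Series([1, 2, None])          # integers plus NA
float_with_na = pd.Series([1.5, 2.5, None])    # floats plus NA
bool_with_na = pd.Series([True, False, None])  # Booleans plus NA

for name, s in [("integer", int_with_na),
                ("floating", float_with_na),
                ("boolean", bool_with_na)]:
    print(name, "->", s.dtype)
```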
# In[9]
pd.Series([1,np.nan,2,None,pd.NA],dtype='Int32')
# Out[9]
0 1
1 <NA>
2 2
3 <NA>
4 <NA>
dtype: Int32
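With a nullable dtype such as Int32, the missing entries stay as NA and the remaining values stay integers. The sketch below illustrates this with the same Series as above; the names s and shifted are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, pd.NA], dtype="Int32")

# Arithmetic keeps the nullable integer dtype and propagates NA.
shifted = s + 1
print(shifted)

# Aggregations skip NA by default.
print(s.sum())

# Number of missing entries.
print(s.isnull().sum())
```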
isnull: Generates a Boolean mask indicating missing values.
notnull: Opposite of isnull.
dropna: Returns a filtered version of the data.
fillna: Returns a copy of the data with missing values filled or imputed.
# In[10]
data=pd.Series([1,np.nan,'hello',None])
# In[11]
data.isnull()
# Out[11]
0 False
1 True
2 False
3 True
dtype: bool
# In[12]
data[data.notnull()]
# Out[12]
0 1
2 hello
dtype: object
The isnull and notnull methods produce similar Boolean results for DataFrame objects.
# In[13]
data.dropna()
# Out[13]
0 1
2 hello
dtype: object
# In[14]
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df
# Out[14]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
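For a DataFrame, isnull returns a Boolean frame of the same shape, which can then be summarized per column or per row. A minimal sketch, rebuilding the same frame so that it runs on its own; the names mask, nulls_per_column, and rows_with_null are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Boolean mask with the same shape as df.
mask = df.isnull()
print(mask)

# Count of missing values in each column.
nulls_per_column = mask.sum()
print(nulls_per_column)

# Rows that contain at least one missing value.
rows_with_null = mask.any(axis=1)
print(rows_with_null)
```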
By default, dropna will drop all rows in which any null value is present.
# In[15]
df.dropna()
# Out[15]
0 1 2
1 2.0 3.0 5
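Alternatively, you can drop NA values along a different axis. Using axis=1 or axis='columns' drops all columns containing a null value.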
# In[16]
df.dropna(axis=1)
# Out[16]
2
0 2
1 5
2 6
Which rows or columns are dropped can be controlled through the how or thresh parameters.
The default is how='any', such that any row or column containing a null value will be dropped.
You can also specify how='all', which will only drop rows/columns that contain all null values.
# In[17]
df[3]=np.nan
df
# Out[17]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[18]
df.dropna(axis=1,how='all')
# Out[18]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept.
# In[19]
df.dropna(axis=0,thresh=3)
# Out[19]
0 1 2 3
1 2.0 3.0 5 NaN
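The thresh parameter works the same way along columns. The sketch below rebuilds the four-column frame and keeps only columns with at least a given number of non-null values; the names keep_2 and keep_3 are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan  # a column with no valid values

# Non-null counts per column are 2, 2, 3, and 0.
print(df.notnull().sum())

# Keep columns with at least 2 non-null values: drops only column 3.
keep_2 = df.dropna(axis=1, thresh=2)
print(keep_2)

# Keep columns with at least 3 non-null values: only column 2 remains.
keep_3 = df.dropna(axis=1, thresh=3)
print(keep_3)
```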
# In[20]
data=pd.Series([1,np.nan,2,None,3],index=list('abcde'),dtype='Int32')
data
# Out[20]
a 1
b <NA>
c 2
d <NA>
e 3
dtype: Int32
# In[21]
data.fillna(0)
# Out[21]
a 1
b 0
c 2
d 0
e 3
dtype: Int32
# In[22]
data.ffill() # forward fill
# Out[22]
a 1
b 1
c 2
d 2
e 3
dtype: Int32
# In[23]
data.bfill() # backward fill
# Out[23]
a 1
b 2
c 2
d 3
e 3
dtype: Int32
For DataFrames, the options are similar, but we can also specify an axis along which the fills should take place.
# In[24]
df
# Out[24]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[25]
df.ffill(axis=1)
# Out[25]
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
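Notice that a forward fill leaves a value as NaN when there is nothing before it to copy, as in the first column of the last row. One possible way to handle this, sketched below under the assumption that filling from the other direction is acceptable for the data, is to chain a backward fill after the forward fill. The frame is rebuilt with a float dtype so the example runs on its own, and the names forward and both are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]], dtype=float)

# Forward fill along each row: the leading NaN in the last row remains.
forward = df.ffill(axis=1)
print(forward)

# A backward fill along the same axis then covers the leading gap.
both = forward.bfill(axis=1)
print(both)
```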