eval and query
# In[1]
import numpy as np
import pandas as pd
rng=np.random.default_rng(42)
x=rng.random(1000000)
y=rng.random(1000000)
%timeit x+y
# Out[1]
4.26 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# In[2]
mask=(x>0.5)&(y<0.5)
# In[3]
tmp1=(x>0.5)
tmp2=(y<0.5)
mask=tmp1&tmp2
When the x and y arrays are very large, allocating the intermediate arrays (tmp1 and tmp2 above) can lead to significant memory and computational overhead.
# In[4]
import numexpr
mask_numexpr=numexpr.evaluate('(x>0.5)&(y<0.5)')
np.all(mask==mask_numexpr)
# Out[4]
True
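To make the overhead concrete, the following sketch (not part of the original notebook) counts the bytes held by the intermediate arrays. It assumes the same one-million-element arrays as above and uses only NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# Each comparison allocates a full-length Boolean array (1 byte per element).
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Bytes held by the two temporaries that exist only to build `mask`.
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(f"inputs: {x.nbytes + y.nbytes} bytes, temporaries: {temp_bytes} bytes")
```

Longer compound expressions create one such temporary per intermediate step, which is the cost an element-by-element evaluator like NumExpr avoids.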
The eval and query tools are essentially Pandas-specific wrappers of NumExpr functionality.
The eval function in Pandas uses string expressions to efficiently compute operations on DataFrame objects.
# In[5]
nrows,ncols=100000, 100
df1,df2,df3,df4=(pd.DataFrame(rng.random((nrows,ncols))) for i in range(4))
# In[6]
%timeit df1+df2+df3+df4
# Out[6]
167 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The same result can be computed via pd.eval by constructing the expression as a string.
The eval version of this expression takes less than half the time in the run below (73.5 ms versus 167 ms), while giving the same result.
# In[7]
%timeit pd.eval('df1+df2+df3+df4')
# Out[7]
73.5 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# In[8]
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))
# Out[8]
True
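The %timeit magic only works inside IPython or Jupyter. As a rough sketch for a plain Python script, the comparison can be repeated with time.perf_counter. The helper best_time and the smaller row count are assumptions made for illustration, and the timings will vary by machine and by whether NumExpr is installed.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
nrows, ncols = 10_000, 100  # smaller than the notebook's 100000 rows, to keep the sketch quick
df1, df2, df3, df4 = (pd.DataFrame(rng.random((nrows, ncols))) for _ in range(4))

# local_dict names the operands explicitly, so the sketch does not depend on the calling scope.
frames = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}


def best_time(func, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


t_plain = best_time(lambda: frames['df1'] + frames['df2'] + frames['df3'] + frames['df4'])
t_eval = best_time(lambda: pd.eval('df1 + df2 + df3 + df4', local_dict=frames))
result_eval = pd.eval('df1 + df2 + df3 + df4', local_dict=frames)

print(f"plain: {t_plain * 1e3:.1f} ms, pd.eval: {t_eval * 1e3:.1f} ms")
```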
pd.eval supports a wide range of operations.
# In[10]
df1,df2,df3,df4,df5=(pd.DataFrame(rng.integers(0,1000,(100,3)))
for i in range(5))
# In[11]
result1=-df1*df2/(df3+df4)-df5
result2=pd.eval('-df1*df2/(df3+df4)-df5')
np.allclose(result1,result2)
# Out[11]
True
# In[12]
result1=(df1<df2)&(df2<=df3)&(df3!=df4)
result2=pd.eval('df1<df2<=df3!=df4')
np.allclose(result1,result2)
# Out[12]
True
# In[13]
result1=(df1<0.5)&(df2<0.5)|(df3<df4)
result2=pd.eval('(df1<0.5)&(df2<0.5)|(df3<df4)')
np.allclose(result1,result2)
# Out[13]
True
pd.eval also supports the literal keywords 'and' and 'or' in Boolean expressions.
# In[14]
result3=pd.eval('(df1<0.5) and (df2<0.5) or (df3<df4)')
np.allclose(result1,result3)
# Out[14]
True
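As a small illustration of why this matters, the sketch below (with made-up 5x3 frames) shows that the plain Python keyword fails on DataFrames, while the same keyword inside a pd.eval string is evaluated element-wise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1, df2 = (pd.DataFrame(rng.random((5, 3))) for _ in range(2))

# Plain Python `and` asks for the truth value of a whole DataFrame, which is ambiguous.
try:
    (df1 < 0.5) and (df2 < 0.5)
    plain_and_failed = False
except ValueError:
    plain_and_failed = True

# Inside pd.eval, `and` is evaluated element-wise, like `&`.
# local_dict names the operands explicitly, so the sketch does not depend on the calling scope.
frames = {'df1': df1, 'df2': df2}
with_and = pd.eval('(df1 < 0.5) and (df2 < 0.5)', local_dict=frames)
with_amp = (df1 < 0.5) & (df2 < 0.5)

same = np.array_equal(with_and.values, with_amp.values)
print(plain_and_failed, same)
```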
pd.eval supports access to object attributes via the obj.attr syntax and indexes via the obj[index] syntax.
# In[15]
result1=df2.T[0]+df3.iloc[1]
result2=pd.eval('df2.T[0]+df3.iloc[1]')
np.allclose(result1,result2)
# Out[15]
True
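More involved constructs, such as conditional statements and loops, are not supported by pd.eval.
For more on np.allclose, which is used for the comparisons above, see the NumPy documentation for np.allclose.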
Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval method that works in similar ways.
The benefit of the eval method is that columns can be referred to by name.
# In[16]
df=pd.DataFrame(rng.random((1000,3)),columns=['A','B','C'])
df.head()
# Out[16]
          A         B         C
0  0.850888  0.966709  0.958690
1  0.820126  0.385686  0.061402
2  0.059729  0.831768  0.652259
3  0.244774  0.140322  0.041711
4  0.818205  0.753384  0.578851
Using pd.eval, we can compute expressions with the three columns.
# In[17]
result1=(df['A']+df['B'])/(df['C']-1)
result2=pd.eval("(df.A+df.B)/(df.C-1)")
np.allclose(result1,result2)
# Out[17]
True
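For comparison, here is a sketch of the same computation written with the DataFrame.eval method, where the column names stand on their own inside the string. The data is regenerated here so the block runs by itself; it is not the notebook's exact df.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# Top-level function: columns are reached through the DataFrame object.
# local_dict names `df` explicitly, so the sketch does not depend on the calling scope.
via_pd_eval = pd.eval('(df.A + df.B) / (df.C - 1)', local_dict={'df': df})

# DataFrame method: column names act as variables inside the expression.
via_df_eval = df.eval('(A + B) / (C - 1)')

print(np.allclose(via_pd_eval, via_df_eval))
```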
DataFrame.eval also allows assignment to any column.
We can use df.eval to create a new column 'D' and assign to it a value computed from the other columns.
If inplace is True, the change is applied to the original DataFrame. With False (the default), the original is left unchanged and a new DataFrame containing the result is returned.
# In[18]
df.eval('D=(A+B)/C',inplace=True)
df.head()
# Out[18]
          A         B         C          D
0  0.850888  0.966709  0.958690   1.895916
1  0.820126  0.385686  0.061402  19.638139
2  0.059729  0.831768  0.652259   1.366782
3  0.244774  0.140322  0.041711   9.232370
4  0.818205  0.753384  0.578851   2.715013
# In[19]
df.eval('D=(A-B)/C',inplace=True)
df.head()
# Out[19]
          A         B         C         D
0  0.850888  0.966709  0.958690 -0.120812
1  0.820126  0.385686  0.061402  7.075399
2  0.059729  0.831768  0.652259 -1.183638
3  0.244774  0.140322  0.041711  2.504142
4  0.818205  0.753384  0.578851  0.111982
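The effect of the inplace flag can be checked with a tiny, hand-made DataFrame. This is only a sketch; the two-row data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [2.0, 4.0]})

# Default (inplace=False): a new DataFrame comes back and the original is unchanged.
returned = df.eval('D = (A + B) / C')
has_d_before = 'D' in df.columns

# inplace=True: the original is modified and the call returns None.
outcome = df.eval('D = (A + B) / C', inplace=True)
has_d_after = 'D' in df.columns

print(has_d_before, has_d_after, outcome)
```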
The DataFrame.eval method supports an additional syntax that lets it work with local Python variables.
# In[20]
column_mean=df.mean(axis=1)
result1=df['A']+column_mean
result2=df.eval('A+@column_mean')
np.allclose(result1,result2)
# Out[20]
True
The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
Note that the @ character is only supported by the DataFrame.eval method, not by the pandas.eval function, because the pandas.eval function only has access to the one (Python) namespace.
DataFrame objects have another method based on string expressions, called query.
# In[21]
result1=df[(df.A < 0.5) & (df.B < 0.5)]
result2=pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1,result2)
# Out[21]
True
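As with the DataFrame.eval examples, this is an expression involving columns of the DataFrame.
It is a filtering operation, however, and cannot be expressed directly using the DataFrame.eval syntax.
For this type of operation, use the query method.
# In[22]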
result2=df.query('A < 0.5 and B < 0.5')
np.allclose(result1,result2)
# Out[22]
True
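One way to see how the two methods relate is the sketch below (data regenerated so that it runs on its own). DataFrame.eval evaluates the condition to a Boolean mask, while query goes one step further and returns the rows where the mask is True.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# df.eval gives the element-wise Boolean mask...
mask = df.eval('A < 0.5 and B < 0.5')

# ...while df.query applies that mask and returns the matching rows.
filtered = df.query('A < 0.5 and B < 0.5')

print(filtered.equals(df[mask]))
```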
The query method also accepts the @ flag to mark local variables.
# In[23]
Cmean=df['C'].mean()
result1=df[(df.A < Cmean) & (df.B < Cmean)]
result2=df.query('A < @Cmean and B < @Cmean')
np.allclose(result1,result2)
# Out[23]
True
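The following sketch contrasts how the @ prefix is treated by query and by the top-level pd.eval function. The data is regenerated so the block is self-contained; the SyntaxError caught here is the behaviour described in the Pandas documentation for top-level eval calls.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])
Cmean = df['C'].mean()

# DataFrame.query: bare names are columns, @-prefixed names are Python variables.
by_query = df.query('A < @Cmean and B < @Cmean')

# pd.eval: every name is a Python variable, so no prefix is used.
# local_dict names them explicitly, so the sketch does not depend on the calling scope.
names = {'df': df, 'Cmean': Cmean}
by_eval = pd.eval('df[(df.A < Cmean) & (df.B < Cmean)]', local_dict=names)

# Using @ in the top-level function is rejected.
try:
    pd.eval('df.A < @Cmean', local_dict=names)
    at_rejected = False
except SyntaxError:
    at_rejected = True

print(by_query.equals(by_eval), at_rejected)
```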
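When deciding whether to use eval and query, there are two considerations: computation time and memory use.
If the temporary arrays created by a compound expression are large compared to your available system memory, it is a good idea to use an eval or query expression.
You can check the approximate size of an array in bytes as follows.
# In[24]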
df.values.nbytes
# Out[24]
32000
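The figure above is consistent with 1000 rows times 4 float64 columns (A to D) times 8 bytes per value. A short sketch of that arithmetic, using a stand-in DataFrame of the same shape rather than the notebook's df:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 4)), columns=['A', 'B', 'C', 'D'])

# 1000 rows x 4 float64 columns x 8 bytes per value
expected = 1000 * 4 * 8

from_values = df.values.nbytes
# memory_usage reports bytes per column; index=False leaves out the row index.
from_memory_usage = int(df.memory_usage(index=False).sum())

print(expected, from_values, from_memory_usage)
```

Each temporary DataFrame created by a compound expression on df would need roughly this much memory again.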
On the performance side, eval can be faster even when you are not maxing out your system memory.
In practice, the difference in computation time between the traditional methods and the eval/query method is usually not significant.
The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.
For more information on eval/query, refer to the following Pandas documentation:
1. pandas.eval
2. pandas.DataFrame.eval
3. pandas.DataFrame.query