High-Performance Pandas: eval and query

노정훈 · July 29, 2023

  • The power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax.
  • While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue (excessive) overhead in computational time and memory use.
  • To address this, Pandas includes some methods that allow you to directly access C-speed operations without costly allocation of intermediate arrays: eval and query.

Motivating query and eval: Compound Expressions

# In[1]
import numpy as np
rng=np.random.default_rng(42)
x=rng.random(1000000)
y=rng.random(1000000)
%timeit x+y
# Out[1]
4.26 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • This is much faster than doing the addition via a Python loop or comprehension.
  • But this abstraction can become less efficient when computing compound expressions.
# In[2]
mask=(x>0.5)&(y<0.5)
  • Because Numpy evaluates each subexpression, this is roughly equivalent to the following code.
# In[3]
tmp1=(x>0.5)
tmp2=(y<0.5)
mask=tmp1&tmp2
  • In other words, every intermediate step is explicitly allocated in memory.
  • If the x and y arrays are very large, this can lead to significant memory and computational overhead.
  • The NumExpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
# In[4]
import numexpr
mask_numexpr=numexpr.evaluate('(x>0.5)&(y<0.5)')
np.all(mask==mask_numexpr)
# Out[4]
True
  • The benefit here is that NumExpr evaluates the expression in a way that avoids temporary arrays where possible, and thus can be much more efficient than Numpy, especially for long sequences of computations on large arrays.
  • The Pandas eval and query tools are essentially Pandas-specific wrappers of NumExpr functionality.
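  • The intermediate allocations described in this section can be made visible with plain NumPy. The following is a minimal sketch (the variable temp_bytes is just an illustrative name) that adds up the size of the temporaries created by the compound mask expression.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# Each subexpression materializes a full-size Boolean array before '&' runs.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# A Boolean element takes one byte, so each temporary is as long as the input.
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(f"inputs:      {x.nbytes + y.nbytes} bytes")
print(f"temporaries: {temp_bytes} bytes")
print(f"result:      {mask.nbytes} bytes")
```

  • The temporaries here are small because they are Boolean; for arithmetic on float64 arrays each temporary is as large as an input array.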

pandas.eval for Efficient Operations

  • The eval function in Pandas uses string expressions to efficiently compute operations on DataFrame objects.
# In[5]
import pandas as pd
nrows,ncols=100000, 100
df1,df2,df3,df4=(pd.DataFrame(rng.random((nrows,ncols))) for i in range(4))
  • To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum.
# In[6]
%timeit df1+df2+df3+df4
# Out[6]
167 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • The same result can be computed via pd.eval by constructing the expression as a string.
  • In the timings shown here, the eval version takes less than half the time of the plain expression (73.5 ms vs. 167 ms), while giving the same result.
# In[7]
%timeit pd.eval('df1+df2+df3+df4')
# Out[7]
73.5 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# In[8]
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))
# Out[8]
True
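  • %timeit is an IPython magic, so it is not available in a plain Python script. A rough sketch of the same comparison using the standard-library timeit module might look like the following. The frame sizes are reduced so that it runs quickly, and the dictionary name frames is an illustrative choice; passing it as local_dict makes the name lookup explicit. The measured times depend on the machine and on whether NumExpr is installed, so no particular speedup should be assumed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
nrows, ncols = 20_000, 50  # smaller than in the cells above, to keep the run short
frames = {f"df{i}": pd.DataFrame(rng.random((nrows, ncols))) for i in range(1, 5)}
expr = "df1 + df2 + df3 + df4"

# Check first that both approaches give the same values.
plain = frames["df1"] + frames["df2"] + frames["df3"] + frames["df4"]
evaluated = pd.eval(expr, local_dict=frames)
same = np.allclose(plain, evaluated)

# Best of 3 repeats, 5 calls per repeat.
t_plain = min(timeit.repeat(
    lambda: frames["df1"] + frames["df2"] + frames["df3"] + frames["df4"],
    number=5, repeat=3))
t_eval = min(timeit.repeat(
    lambda: pd.eval(expr, local_dict=frames),
    number=5, repeat=3))

print(f"same result: {same}")
print(f"plain: {t_plain / 5 * 1e3:.2f} ms per call")
print(f"eval:  {t_eval / 5 * 1e3:.2f} ms per call")
```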
  • pd.eval supports a wide range of operations.
# In[10]
df1,df2,df3,df4,df5=(pd.DataFrame(rng.integers(0,1000,(100,3)))
                     for i in range(5))
  1. Arithmetic operators
# In[11]
result1=-df1*df2/(df3+df4)-df5
result2=pd.eval('-df1*df2/(df3+df4)-df5')
np.allclose(result1,result2)
# Out[11]
True
  2. Comparison operators
# In[12]
result1=(df1<df2)&(df2<=df3)&(df3!=df4)
result2=pd.eval('df1<df2<=df3!=df4')
np.allclose(result1,result2)
# Out[12]
True
  3. Bitwise operators
# In[13]
result1=(df1<0.5)&(df2<0.5)|(df3<df4)
result2=pd.eval('(df1<0.5)&(df2<0.5)|(df3<df4)')
np.allclose(result1,result2)
# Out[13]
True
  • Additionally, it supports the use of the literal keywords 'and' and 'or' in Boolean expressions.
# In[14]
result3=pd.eval('(df1<0.5) and (df2<0.5) or (df3<df4)')
np.allclose(result1,result3)
# Out[14]
True
  4. Object attributes and indices
  • pd.eval supports access to object attributes via the obj.attr syntax and indexes via the obj[index] syntax.
# In[15]
result1=df2.T[0]+df3.iloc[1]
result2=pd.eval('df2.T[0]+df3.iloc[1]')
np.allclose(result1,result2)
# Out[15]
True
  5. Other operators
  • Other operations, such as function calls, conditional statements, loops, and other more involved constructs are currently not implemented in pd.eval.
  • If you'd like to execute these types of expressions, you can use the NumExpr library itself.
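  • As one illustration of going to NumExpr directly, the sketch below evaluates an element-wise conditional with NumExpr's where function and compares it with the NumPy equivalent. NumExpr is an optional package, so the import is guarded here; the arrays x and y and the expression itself are only example inputs.

```python
import numpy as np

try:
    import numexpr  # optional dependency; may not be installed
except ImportError:
    numexpr = None

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# NumPy reference: an element-wise conditional combined with function calls.
expected = np.where(x > 0.5, np.sin(x), np.cos(y))

if numexpr is not None:
    # where(cond, a, b), sin and cos are part of NumExpr's expression language.
    result = numexpr.evaluate("where(x > 0.5, sin(x), cos(y))",
                              local_dict={"x": x, "y": y})
    print(np.allclose(expected, result))
else:
    print("numexpr is not installed; only the NumPy reference was computed")
```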

For details on np.allclose, which is used for the checks above, see:
About np.allclose

DataFrame.eval for Column-Wise Operations

  • Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval method that works in similar ways.
  • The benefit of the eval method is that columns can be referred to by name.
# In[16]
df=pd.DataFrame(rng.random((1000,3)),columns=['A','B','C'])
df.head()
# Out[16]
           A	       B	       C
0	0.850888	0.966709	0.958690
1	0.820126	0.385686	0.061402
2	0.059729	0.831768	0.652259
3	0.244774	0.140322	0.041711
4	0.818205	0.753384	0.578851
  • By using pd.eval, we can compute expressions with the three columns.
# In[17]
result1=(df['A']+df['B'])/(df['C']-1)
result2=pd.eval("(df.A+df.B)/(df.C-1)")
np.allclose(result1,result2)
# Out[17]
True
  • The DataFrame.eval method allows a more succinct form of the same expression, df.eval('(A+B)/(C-1)'): column names are treated as variables within the evaluated expression, and the result is the same.
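  • To make the difference between the two entry points concrete, here is a small sketch that computes the same expression three ways: with ordinary column access, with the top-level pd.eval, and with the DataFrame.eval method. The DataFrame is regenerated here so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# Ordinary column access, evaluated step by step.
result1 = (df["A"] + df["B"]) / (df["C"] - 1)
# Top-level pd.eval: columns are reached through the df variable.
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
# DataFrame.eval: bare column names resolve against df itself.
result3 = df.eval("(A + B) / (C - 1)")

print(np.allclose(result1, result2), np.allclose(result1, result3))
```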

Assignment in DataFrame.eval

  • DataFrame.eval also allows assignment to any column.
  • We can use df.eval to create a new column 'D' and assign to it a value computed from the other columns.
  • If inplace=True, the original DataFrame is modified. With inplace=False (the default), the original is left unchanged and a new DataFrame containing the result is returned.
# In[18]
df.eval('D=(A+B)/C',inplace=True)
df.head()
# Out[18]
           A	       B	       C	        D
0	0.850888	0.966709	0.958690	 1.895916
1	0.820126	0.385686	0.061402	19.638139
2	0.059729	0.831768	0.652259	 1.366782
3	0.244774	0.140322	0.041711	 9.232370
4	0.818205	0.753384	0.578851	 2.715013
  • In the same way, any existing column can be modified.
# In[19]
df.eval('D=(A-B)/C',inplace=True)
df.head()
# Out[19]
           A	       B	       C	        D
0	0.850888	0.966709	0.958690	-0.120812
1	0.820126	0.385686	0.061402	 7.075399
2	0.059729	0.831768	0.652259	-1.183638
3	0.244774	0.140322	0.041711	 2.504142
4	0.818205	0.753384	0.578851	 0.111982
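  • The effect of the inplace argument can be checked directly. The sketch below, using a freshly generated DataFrame and the illustrative names with_d and returned, runs the same assignment with the default inplace=False and then with inplace=True.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# inplace=False (the default): a new DataFrame is returned, df is untouched.
with_d = df.eval("D = (A + B) / C")
print(list(df.columns), list(with_d.columns))

# inplace=True: df itself gains the column and the call returns None.
returned = df.eval("D = (A + B) / C", inplace=True)
print(returned is None, list(df.columns))
```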

Local Variables in DataFrame.eval

  • The DataFrame.eval method supports an additional syntax that lets it work with local Python variables.
# In[20]
column_mean=df.mean(axis=1)  # mean across the columns: one value per row
result1=df['A']+column_mean
result2=df.eval('A+@column_mean')
np.allclose(result1,result2)
# Out[20]
True
  • The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
  • This @ character is supported by the DataFrame.eval method but not by the pandas.eval function, because the pandas.eval function only has access to one (Python) namespace.
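  • A short sketch of the two namespaces in practice: with DataFrame.eval the local variable needs the @ prefix, while with the top-level pd.eval both the DataFrame and the variable are plain Python names, and using @ there is rejected. The name row_mean is an illustrative choice for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])
row_mean = df.mean(axis=1)  # a local Python variable, not a column of df

# DataFrame.eval: A is a column, @row_mean is the local variable.
via_method = df.eval("A + @row_mean")

# pd.eval: only the Python namespace exists, so both names are plain variables.
via_function = pd.eval("df.A + row_mean")

# Using '@' in the top-level function raises an error.
try:
    pd.eval("df.A + @row_mean")
    at_rejected = False
except SyntaxError:
    at_rejected = True

print(np.allclose(via_method, via_function), at_rejected)
```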

The DataFrame.query Method

  • The DataFrame has another method based on evaluated strings, called query.
# In[21]
result1=df[(df.A < 0.5) & (df.B < 0.5)]
result2=pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1,result2)
# Out[21]
True
  • As with the example used in our discussion of DataFrame.eval, this is an expression involving columns of the DataFrame.
  • It cannot be expressed using DataFrame.eval syntax.
  • Instead, for this type of filtering operation, you can use the query method.
# In[22]
result2=df.query('A < 0.5 and B < 0.5')
np.allclose(result1,result2)
# Out[22]
True
  • In addition to being a more efficient computation, this is much easier to read and understand than the masking expression.
  • The query method also accepts the @ flag to mark local variables.
# In[23]
Cmean=df['C'].mean()
result1=df[(df.A < Cmean) & (df.B < Cmean)]
result2=df.query('A < @Cmean and B < @Cmean')
np.allclose(result1,result2)
# Out[23]
True
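  • One way to see how query relates to eval: DataFrame.eval can compute the Boolean mask, but selecting the rows is then a separate indexing step, whereas DataFrame.query does both in one call. The sketch below compares the two; the name threshold is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])
threshold = df["C"].mean()

# DataFrame.eval builds the Boolean mask; applying it is a second step.
mask = df.eval("A < @threshold and B < @threshold")
via_mask = df[mask]

# DataFrame.query builds the mask and selects the rows in one call.
via_query = df.query("A < @threshold and B < @threshold")

print(via_mask.equals(via_query))
```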

Performance: When to Use These Functions

  • When considering whether to use eval and query, there are two considerations: computation time and memory use.
  • Every compound expression involving Numpy arrays or Pandas DataFrames will result in implicit creation of temporary arrays.

  • If the size of the temporary DataFrames is significant compared to your available system memory, then it's a good idea to use an eval or query expression.
  • You can check the approximate size of your array in bytes like this.
# In[24]
df.values.nbytes
# Out[24]
32000
  • On the performance side, eval can be faster even when you are not maxing out your system memory.
  • In practice, however, the difference in computation time between the traditional methods and the eval/query method is usually not significant, and for small arrays the traditional methods can even be faster.
  • The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.
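  • As a rough sketch of the memory side of this decision, the size of one full-size temporary can be estimated from the DataFrame itself. The four-term sum and the count of temporaries below are illustrative assumptions, not a rule from Pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 4)), columns=["A", "B", "C", "D"])

# For a homogeneous float64 frame: rows * columns * 8 bytes.
data_bytes = df.to_numpy().nbytes
# memory_usage also counts the index; deep=True matters for object columns.
total_bytes = int(df.memory_usage(index=True, deep=True).sum())

# A four-term sum such as df + df + df + df, evaluated left to right,
# builds two intermediate frames of this size before the final result.
n_temporaries = 2  # assumption for this illustrative expression
print(data_bytes, total_bytes, n_temporaries * data_bytes)
```

  • If the estimated temporaries are a sizable fraction of available memory, that is the situation in which eval or query is most worth trying.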

For more information on eval and query, refer to the following Pandas documentation pages:
1. pandas.eval
2. pandas.DataFrame.eval
3. pandas.DataFrame.query
