High-Performance Pandas: eval and query

노정훈 · July 29, 2023

The power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax.
While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue (that is, excessive) overhead in computational time and memory use.
To address this, Pandas includes some methods that allow you to directly access C-speed operations without costly allocation of intermediate arrays: eval and query.

Motivating query and eval: Compound Expressions

# In[1]
import numpy as np
import pandas as pd

rng=np.random.default_rng(42)
x=rng.random(1000000)
y=rng.random(1000000)
%timeit x+y

# Out[1]
4.26 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is much faster than doing the addition via a Python loop or comprehension.
But this abstraction can become less efficient when computing compound expressions.

# In[2]
mask=(x>0.5)&(y<0.5)

Because NumPy evaluates each subexpression separately, this is roughly equivalent to the following code.

# In[3]
tmp1=(x>0.5)
tmp2=(y<0.5)
mask=tmp1&tmp2

In other words, every intermediate step is explicitly allocated in memory.
If the x and y arrays are very large, this can lead to significant memory and computational overhead.
The NumExpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.

# In[4]
import numexpr
mask_numexpr=numexpr.evaluate('(x>0.5)&(y<0.5)')
np.all(mask==mask_numexpr)

# Out[4]
True

The benefit here is that NumExpr evaluates the expression in a way that avoids temporary arrays where possible, and thus can be much more efficient than NumPy, especially for long sequences of computations on large arrays.
The Pandas eval and query tools are essentially Pandas-specific wrappers of NumExpr functionality.
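
To make the cost of the intermediate steps concrete, the following minimal sketch uses plain NumPy to count the bytes held by the temporaries in the compound mask expression above. The variable name temporary_bytes is introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# Each comparison materializes a full Boolean array (1 byte per element)
# before the final & combines them.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

temporary_bytes = tmp1.nbytes + tmp2.nbytes
print(f"inputs:      {x.nbytes + y.nbytes:,} bytes")
print(f"temporaries: {temporary_bytes:,} bytes")
print(f"result:      {mask.nbytes:,} bytes")
```

For a Boolean mask the temporaries are small compared to the float inputs, but in an arithmetic expression such as a sum of several float arrays each intermediate result is as large as an input array.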

pandas.eval for Efficient Operations

The eval function in Pandas uses string expressions to efficiently compute operations on DataFrame objects.

# In[5]
nrows,ncols=100000, 100
df1,df2,df3,df4=(pd.DataFrame(rng.random((nrows,ncols))) for i in range(4))

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum.

# In[6]
%timeit df1+df2+df3+df4

# Out[6]
167 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The same result can be computed via pd.eval by constructing the expression as a string.
In the timings shown here, the eval version runs in less than half the time, while giving the same result.

# In[7]
%timeit pd.eval('df1+df2+df3+df4')

# Out[7]
73.5 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# In[8]
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))

# Out[8]
True
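
Since pd.eval delegates to NumExpr when it is available, it can be useful to see that the backend is selectable. The sketch below assumes only the documented engine parameter of pd.eval: with the default engine=None, NumExpr is used if installed and plain Python evaluation otherwise, while engine="python" always uses the latter. The frame sizes are arbitrary and kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Small frames keep the sketch quick; the sizes are arbitrary.
df1, df2, df3, df4 = (pd.DataFrame(rng.random((1000, 10))) for _ in range(4))

expr = "df1 + df2 + df3 + df4"
expected = df1 + df2 + df3 + df4

# engine=None (the default) picks NumExpr when installed, else plain Python.
default_result = pd.eval(expr)
# engine="python" always evaluates with ordinary Python/pandas operations.
python_result = pd.eval(expr, engine="python")

print(np.allclose(expected, default_result))
print(np.allclose(expected, python_result))
```

The speed and memory benefits discussed here come from the NumExpr engine; the Python engine is mainly useful for checking that an expression parses and evaluates as intended.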

pd.eval supports a wide range of operations.

# In[10]
df1,df2,df3,df4,df5=(pd.DataFrame(rng.integers(0,1000,(100,3)))
                     for i in range(5))

Arithmetic operators

# In[11]
result1=-df1*df2/(df3+df4)-df5
result2=pd.eval('-df1*df2/(df3+df4)-df5')
np.allclose(result1,result2)

# Out[11]
True

Comparison operators

# In[12]
result1=(df1<df2)&(df2<=df3)&(df3!=df4)
result2=pd.eval('df1<df2<=df3!=df4')
np.allclose(result1,result2)

# Out[12]
True

Bitwise operators

# In[13]
result1=(df1<0.5)&(df2<0.5)|(df3<df4)
result2=pd.eval('(df1<0.5)&(df2<0.5)|(df3<df4)')
np.allclose(result1,result2)

# Out[13]
True

In addition, it supports the literal keywords and and or in Boolean expressions.

# In[14]
result3=pd.eval('(df1<0.5) and (df2<0.5) or (df3<df4)')
np.allclose(result1,result3)

# Out[14]
True
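
The keyword form is worth a closer look, because outside of pd.eval the same expression fails. The following sketch, using two small frames named a and b purely for illustration, shows that plain Python raises a ValueError for and between DataFrames, while pd.eval with its default parser treats and like the element-wise & operator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.DataFrame(rng.random((5, 3)))
b = pd.DataFrame(rng.random((5, 3)))

# Plain Python: `and` asks for a single truth value, which a DataFrame lacks.
try:
    (a < 0.5) and (b < 0.5)
    plain_and_failed = False
except ValueError:
    plain_and_failed = True

# Inside pd.eval (default 'pandas' parser), `and` acts element-wise like `&`.
with_keyword = pd.eval("(a < 0.5) and (b < 0.5)")
with_operator = (a < 0.5) & (b < 0.5)

print(plain_and_failed)
print(np.array_equal(with_keyword.values, with_operator.values))
```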

Object attributes and indices

pd.eval supports access to object attributes via the obj.attr syntax and to indices via the obj[index] syntax.

# In[15]
result1=df2.T[0]+df3.iloc[1]
result2=pd.eval('df2.T[0]+df3.iloc[1]')
np.allclose(result1,result2)

# Out[15]
True

Other operators

Other operations, such as function calls, conditional statements, loops, and other more involved constructs are currently not implemented in pd.eval.
If you'd like to execute these types of expressions, you can use the NumExpr library itself.

For more on np.allclose, see the following link:
About np.allclose

DataFrame.eval for Column-Wise Operations

Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval method that works in similar ways.
The benefit of the eval method is that columns can be referred to by name.

# In[16]
df=pd.DataFrame(rng.random((1000,3)),columns=['A','B','C'])
df.head()

# Out[16]
           A	       B	       C
0	0.850888	0.966709	0.958690
1	0.820126	0.385686	0.061402
2	0.059729	0.831768	0.652259
3	0.244774	0.140322	0.041711
4	0.818205	0.753384	0.578851

By using pd.eval, we can compute expressions with the three columns.

# In[17]
result1=(df['A']+df['B'])/(df['C']-1)
result2=pd.eval("(df.A+df.B)/(df.C-1)")
np.allclose(result1,result2)

# Out[17]
True

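With the DataFrame.eval method, column names are treated as variables within the evaluated expression, so the same computation can be written without the df. prefix, and the result is what we would expect.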
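
As a minimal sketch of the column-name shorthand, the example below computes the same quantity three ways on a freshly generated frame: with explicit indexing, with the top-level pd.eval function, and with the DataFrame.eval method. The seed and variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# Explicit indexing, evaluated step by step by pandas.
explicit = (df["A"] + df["B"]) / (df["C"] - 1)
# Top-level function: columns are reached through the df variable.
via_function = pd.eval("(df.A + df.B) / (df.C - 1)")
# DataFrame method: column names act as variables in the expression.
via_method = df.eval("(A + B) / (C - 1)")

print(np.allclose(explicit, via_function))
print(np.allclose(explicit, via_method))
```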

Assignment in DataFrame.eval

DataFrame.eval also allows assignment to any column.
We can use df.eval to create a new column 'D' and assign to it a value computed from the other columns.
If inplace=True, the original DataFrame is modified. With inplace=False (the default), the original is left unchanged and a new DataFrame containing the result is returned.

# In[18]
df.eval('D=(A+B)/C',inplace=True)
df.head()

# Out[18]
           A	       B	       C	        D
0	0.850888	0.966709	0.958690	 1.895916
1	0.820126	0.385686	0.061402	19.638139
2	0.059729	0.831768	0.652259	 1.366782
3	0.244774	0.140322	0.041711	 9.232370
4	0.818205	0.753384	0.578851	 2.715013

In the same way, any existing column can be modified.

# In[19]
df.eval('D=(A-B)/C',inplace=True)
df.head()

# Out[19]
           A	       B	       C	        D
0	0.850888	0.966709	0.958690	-0.120812
1	0.820126	0.385686	0.061402	 7.075399
2	0.059729	0.831768	0.652259	-1.183638
3	0.244774	0.140322	0.041711	 2.504142
4	0.818205	0.753384	0.578851	 0.111982
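
The effect of the inplace argument on assignment can be checked directly. This sketch uses a small illustrative frame and contrasts the default call, which returns a new DataFrame, with inplace=True, which modifies the frame and returns None.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=["A", "B", "C"])

# Default (inplace=False): a new DataFrame is returned; df is left untouched.
with_d = df.eval("D = (A + B) / C")
unchanged = "D" not in df.columns

# inplace=True: df itself gains the column and the call returns None.
returned = df.eval("D = (A + B) / C", inplace=True)

print(unchanged, list(with_d.columns))
print(returned, list(df.columns))
```

Leaving inplace at its default makes it harder to modify a frame by accident when chaining operations.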

Local Variables in DataFrame.eval

The DataFrame.eval method supports an additional syntax that lets it work with local Python variables.

# In[20]
column_mean=df.mean(axis=1)
result1=df['A']+column_mean
result2=df.eval('A+@column_mean')
np.allclose(result1,result2)

# Out[20]
True

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
This @ character is only supported by the DataFrame.eval method, not by the pandas.eval function, because the pandas.eval function only has access to one (Python) namespace.
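
The difference between the two namespaces can be sketched as follows. The frame and the variable offset are hypothetical names used only for this illustration. In the top-level function both the DataFrame and the Python variable are referred to by plain name, and an expression containing @ is rejected.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])
offset = 10.0

# DataFrame.eval: columns by bare name, Python variables with '@'.
via_method = df.eval("A + @offset")

# pd.eval: a single namespace, so both names are written without a prefix.
via_function = pd.eval("df.A + offset")

# The '@' prefix is rejected by the top-level function.
try:
    pd.eval("df.A + @offset")
    at_rejected = False
except Exception as exc:
    at_rejected = True
    print(type(exc).__name__)

print(np.allclose(via_method, via_function), at_rejected)
```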

The DataFrame.query Method

The DataFrame has another method based on evaluated strings, called query. Consider the following filtering operation.

# In[21]
result1=df[(df.A < 0.5) & (df.B < 0.5)]
result2=pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1,result2)

# Out[21]
True

As with the example used in our discussion of DataFrame.eval, this is an expression involving columns of the DataFrame.
However, it cannot be expressed using the DataFrame.eval syntax.
Instead, for this type of filtering operation, you can use the query method.

# In[22]
result2=df.query('A < 0.5 and B < 0.5')
np.allclose(result1,result2)

# Out[22]
True

In addition to being a more efficient computation, this is much easier to read and understand than the masking expression.
The query method also accepts the @ flag to mark local variables.

# In[23]
Cmean=df['C'].mean()
result1=df[(df.A < Cmean) & (df.B < Cmean)]
result2=df.query('A < @Cmean and B < @Cmean')
np.allclose(result1,result2)

# Out[23]
True
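
One way to see how query relates to eval is to split the filter into its two steps. In this sketch (seed and names are illustrative), DataFrame.eval on its own yields only the Boolean mask, and the row selection still needs ordinary indexing. query folds both steps into one call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])
threshold = df["C"].mean()

expr = "A < @threshold and B < @threshold"

# Step 1: eval produces a Boolean Series, one entry per row.
mask = df.eval(expr)
# Step 2: indexing with the mask selects the matching rows.
via_mask = df[mask]

# query performs the evaluation and the selection together.
via_query = df.query(expr)

print(mask.dtype, via_query.equals(via_mask))
```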

Performance: When to Use These Functions

When considering whether to use eval and query, there are two considerations: computation time and memory use.
Every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays.
If the size of the temporary DataFrames is significant compared to your available system memory, then it's a good idea to use an eval or query expression.
You can check the approximate size of your array in bytes like this.

# In[24]
df.values.nbytes

# Out[24]
32000
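
The figure above follows from simple arithmetic: 1,000 rows times 4 float64 columns times 8 bytes per value. The sketch below rebuilds a frame of the same shape (with different random values) and compares values.nbytes with the per-column report from DataFrame.memory_usage, excluding the index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 4)), columns=["A", "B", "C", "D"])

# 1000 rows x 4 float64 columns x 8 bytes = 32,000 bytes.
array_bytes = df.values.nbytes

# memory_usage reports bytes per column; index=False leaves out the index.
per_column = df.memory_usage(index=False)

print(array_bytes)
print(per_column)
```

For frames with object (string) columns, values.nbytes counts only the pointers, so memory_usage(deep=True) gives a more realistic estimate.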

On the performance side, eval can be faster even when you are not maxing out your system memory.
In practice, however, the difference in computation time between the traditional methods and the eval/query method is usually not significant.
The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.
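
Because the balance depends on data size and on whether NumExpr is installed, it is reasonable to measure on your own data. The helper compare below is a hypothetical name introduced for this sketch. It times the plain sum and the pd.eval version for a given number of rows and checks that both give the same result. No particular timings are implied.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def compare(nrows, repeats=5):
    """Return (plain_seconds, eval_seconds) per run for summing four frames."""
    df1, df2, df3, df4 = (pd.DataFrame(rng.random((nrows, 10))) for _ in range(4))

    start = time.perf_counter()
    for _ in range(repeats):
        plain = df1 + df2 + df3 + df4
    plain_seconds = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        # pd.eval looks up df1..df4 in this function's local scope.
        evaluated = pd.eval("df1 + df2 + df3 + df4")
    eval_seconds = (time.perf_counter() - start) / repeats

    assert np.allclose(plain, evaluated)
    return plain_seconds, eval_seconds


for nrows in (1_000, 100_000):
    plain_s, eval_s = compare(nrows)
    print(f"{nrows:>7} rows: plain {plain_s * 1e3:.2f} ms, eval {eval_s * 1e3:.2f} ms")
```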

For more information on eval and query, refer to the Pandas documentation:
1. pandas.eval
2. pandas.DataFrame.eval
3. pandas.DataFrame.query
