Vectorized String Operations

노정훈 · July 27, 2023


Introducing Pandas String Operations

Vectorizing operations simplifies the syntax of operating on arrays of data.
For arrays of strings, NumPy does not provide such simple access, so you are stuck with a more verbose loop syntax.

# In[1]
data=['peter','Paul','MARY','gUIDO']
[s.capitalize() for s in data]

# Out[1]
['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values, so this approach requires putting in extra checks.

# In[2]
data=['peter','Paul',None,'MARY','gUIDO']
[s if s is None else s.capitalize() for s in data]

# Out[2]
['Peter', 'Paul', None, 'Mary', 'Guido']

Pandas addresses both needs, vectorized string operations and correct handling of missing data, through the str attribute of Pandas Series and Index objects containing strings.

# In[3]
import pandas as pd
names=pd.Series(data)
names.str.capitalize()

# Out[3]
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
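The behavior described above can be checked with a short sketch. It assumes only that pandas is installed; the variable names `capitalized` and `idx` are illustrative and not part of the original example.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The .str accessor skips missing entries instead of raising AttributeError.
names = pd.Series(data)
capitalized = names.str.capitalize()

# Missing values stay missing; every other element is transformed.
print(capitalized.isna().tolist())
print(capitalized.dropna().tolist())

# The same accessor is available on an Index of strings.
idx = pd.Index(['peter', 'Paul', 'MARY'])
print(idx.str.capitalize().tolist())
```

No explicit None check is needed here, unlike in the list comprehension version.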

Tables of Pandas String Methods

Methods Similar to Python String Methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method of the same name.

# In[4]
monte=pd.Series(['Graham Chapman','John Cleese','Terry Gilliam',
'Eric Idle','Terry Jones','Michael Palin'])

# In[5]
monte.str.lower()

# Out[5]
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

# In[6]
monte.str.len()

# Out[6]
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

# In[7]
monte.str.startswith('T')

# Out[7]
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

# In[8]
monte.str.split()

# Out[8]
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
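Because each of these methods returns a new Series, calls can be chained. The following is a small sketch with made-up input (the `raw` Series is hypothetical, not from the examples above) that strips whitespace and normalizes case in one expression.

```python
import pandas as pd

# Hypothetical messy input with padding, mixed case, and a missing value.
raw = pd.Series(['  graham CHAPMAN ', 'john cleese', None])

# Each call mirrors the built-in str method of the same name
# and is applied element by element.
cleaned = raw.str.strip().str.title()
print(cleaned)

# Boolean-returning methods propagate the missing value as well.
print(cleaned.str.endswith('Cleese'))
```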

Methods Using Regular Expressions

Several methods accept regular expressions (regexps) to examine the content of each string element, and follow some of the API conventions of Python's built-in re module.

Mapping between Pandas methods and functions in Python's re module

Method	Description
`match`	Calls `re.match` on each element, returning a Boolean
`extract`	Calls `re.search` on each element, returning matched groups as strings
`findall`	Calls `re.findall` on each element
`replace`	Replaces occurrences of pattern with some other string
`contains`	Calls `re.search` on each element, returning a Boolean
`count`	Counts occurrences of pattern
`split`	Equivalent to `str.split`, but accepts regexps
`rsplit`	Equivalent to `str.rsplit`; splits from the right on a literal separator

With these, we can do a wide range of operations.

# In[9]
monte.str.extract('([A-Za-z]+)',expand=False)

# Out[9]
0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

# In[10]
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

# Out[10]
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

Here the start-of-string (^) and end-of-string ($) anchors force the pattern to cover the whole string, so only names that begin and end with a non-vowel character are returned.
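The other regexp-based methods in the table work the same way. The sketch below rebuilds the `monte` Series so it runs on its own; the patterns and the result names are illustrative choices, not taken from the original post.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains searches anywhere in the string (like re.search).
has_double_l = monte.str.contains(r'll')

# match only succeeds at the start of the string (like re.match).
starts_with_t = monte.str.match(r'T')

# count returns the number of non-overlapping matches per element.
vowel_counts = monte.str.count(r'[aeiou]')

# replace substitutes each match; regex=True treats the pattern as a regexp.
initials = monte.str.replace(r'^(\w)\w*', r'\1.', regex=True)

print(has_double_l.tolist())
print(starts_with_t.tolist())
print(vowel_counts.tolist())
print(initials.tolist())
```

Passing `regex=True` explicitly to `replace` avoids relying on the default, which has differed between pandas versions.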

For more on Pandas string methods and regular expressions, see these references:
1. About string methods
2. About regular expressions

Miscellaneous Methods

Other Pandas string methods

Method	Description
`get`	Indexes each element
`slice`	Slices each element
`slice_replace`	Replaces slice in each element with the passed value
`cat`	Concatenates strings
`repeat`	Repeats values
`normalize`	Returns Unicode form of strings
`pad`	Adds whitespace to left, right, or both sides of strings
`wrap`	Splits long strings into lines with length less than a given width
`join`	Joins strings in each element of the Series with the passed separator
`get_dummies`	Extracts dummy variables as a DataFrame
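A few of the methods in this table are sketched below on a tiny, made-up Series (`s` and `lists` are hypothetical inputs chosen to keep the results easy to verify by eye).

```python
import pandas as pd

s = pd.Series(['ab', 'cde'])

# pad: fill to a total width; fillchar defaults to a space.
padded = s.str.pad(5, side='left', fillchar='*')

# repeat: duplicate each string the given number of times.
repeated = s.str.repeat(2)

# slice_replace: replace the slice [0:1] of each string with 'X'.
patched = s.str.slice_replace(0, 1, 'X')

# cat with no other Series concatenates all elements into one string.
joined_all = s.str.cat(sep='-')

# join: each element is a list of strings, joined with the separator.
lists = pd.Series([['a', 'b'], ['c', 'd', 'e']])
joined_each = lists.str.join('+')

print(padded.tolist(), repeated.tolist(), patched.tolist())
print(joined_all, joined_each.tolist())
```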

Vectorized item access and slicing

The get and slice operations, in particular, enable vectorized element access from each string.
We can get a slice of the first three characters of each string using str.slice(0, 3).
This behavior is also available through Python's normal indexing syntax; df.str.slice(0, 3) is equivalent to df.str[0:3].

# In[11]
monte.str[0:3]

# Out[11]
0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via df.str.get(i) and df.str[i] is likewise similar.
These indexing methods also let you access elements of the lists returned by split.

# In[12]
monte.str.split().str[-1]

# Out[12]
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
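One detail worth seeing in code is what happens when the requested position does not exist. The sketch below uses a shortened, hypothetical version of the Series; the variable names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'Eric Idle'])

# get(i) and [i] pick the same character from each string.
first_char_get = monte.str.get(0)
first_char_idx = monte.str[0]

# After split, each element is a list, and get indexes into that list.
parts = monte.str.split()
first_names = parts.str.get(0)

# A position past the end of an element gives a missing value, not an error.
third_word = parts.str.get(2)

print(first_char_get.tolist(), first_names.tolist())
print(third_word.isna().tolist())
```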

Indicator variables

The get_dummies method is useful when your data has a column containing some sort of coded indicator.

# In[13]
full_monte=pd.DataFrame({'name':monte,
'info':['B | C | D','B | D','A | C','B | D','B | C','B | C | D']})
full_monte

# Out[13]
              name	     info
0	Graham Chapman	B | C | D
1	   John Cleese	    B | D
2	 Terry Gilliam	    A | C
3	     Eric Idle	    B | D
4	   Terry Jones	    B | C
5	 Michael Palin	B | C | D

The get_dummies routine lets us split out these indicator variables into a DataFrame.

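# In[14]
# Split on the full ' | ' delimiter so the column labels carry no stray spaces
full_monte['info'].str.get_dummies(' | ')

# Out[14]
   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1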

With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
