Vectorized String Operations

노정훈 · July 27, 2023

Pandas

Introducing Pandas String Operations

  • Vectorization of operations simplifies the syntax of operating on arrays of data.
  • For arrays of strings, NumPy does not provide such simple access, so you are stuck with a more verbose loop syntax.
# In[1]
data=['peter','Paul','MARY','gUIDO']
[s.capitalize() for s in data]
# Out[1]
['Peter', 'Paul', 'Mary', 'Guido']
  • This is perhaps sufficient to work with some data, but it will break if there are any missing values, so this approach requires putting in extra checks.
# In[2]
data=['peter','Paul',None,'MARY','gUIDO']
[s if s is None else s.capitalize() for s in data]
# Out[2]
['Peter', 'Paul', None, 'Mary', 'Guido']
  • Pandas addresses both needs, vectorized string operations and correct handling of missing data, through the str attribute of Pandas Series and Index objects that contain strings.
# In[3]
import pandas as pd

names=pd.Series(data)
names.str.capitalize()
# Out[3]
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
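
The same missing-value handling applies to str methods that return numbers. A minimal sketch, reusing the data list from above (the names lengths and upper are only for illustration):

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)

# Missing entries are skipped and come back as missing values,
# instead of raising an AttributeError as a plain loop would.
lengths = names.str.len()
upper = names.str.upper()

print(lengths.isna().tolist())  # which positions are missing
print(upper.dropna().tolist())  # the non-missing results
```

Checking with isna() rather than comparing against None keeps the sketch independent of how a given pandas version represents the missing entry.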

Tables of Pandas String Methods

Methods Similar to Python String Methods

  • Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method, available through the str attribute.
# In[4]
monte=pd.Series(['Graham Chapman','John Cleese','Terry Gilliam',
'Eric Idle','Terry Jones','Michael Palin'])

# In[5]
monte.str.lower()
# Out[5]
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

# In[6]
monte.str.len()
# Out[6]
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

# In[7]
monte.str.startswith('T')
# Out[7]
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

# In[8]
monte.str.split()
# Out[8]
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
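
Because each of these methods returns a new Series, they can be chained much like ordinary Python string methods. A small sketch, assuming a made-up Series called messy with inconsistent spacing and casing:

```python
import pandas as pd

# Hypothetical input with stray whitespace and mixed case
messy = pd.Series(['  graham CHAPMAN ', 'john cleese', ' TERRY gilliam'])

# strip() removes surrounding whitespace, title() capitalizes each word
clean = messy.str.strip().str.title()

print(clean.tolist())
```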

Methods Using Regular Expression

  • Several methods accept regular expressions (regexps) to examine the content of each string element, and follow some of the API conventions of Python's built-in re module.

Mapping between Pandas methods and functions in Python's re module

Method      Description
match       Calls re.match on each element, returning a Boolean
extract     Calls re.search on each element, returning matched groups as strings
findall     Calls re.findall on each element
replace     Replaces occurrences of pattern with some other string
contains    Calls re.search on each element, returning a Boolean
count       Counts occurrences of pattern
split       Equivalent to str.split, but accepts regexps
rsplit      Equivalent to str.rsplit, but accepts regexps
  • With these, we can do a wide range of operations.
# In[9]
monte.str.extract('([A-Za-z]+)',expand=False)
# Out[9]
0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

# In[10]
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
# Out[10]
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
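  • Here, ^ (start of string) and $ (end of string) anchor the pattern, so it matches only names that begin with a character other than an uppercase vowel and end with a character other than a lowercase vowel. In this data, those are the names that start and end with a consonant.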
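
A few of the other regexp methods from the table can be sketched the same way. The patterns and variable names below are illustrative choices, not part of the original notes; regex=True is passed to replace explicitly because its default has differed between pandas versions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains: True where the pattern is found in the string
has_terry = monte.str.contains(r'^Terry')

# count: number of non-overlapping matches in each string
n_vowels = monte.str.count(r'[aeiouAEIOU]')

# replace: shorten the first word to its initial, using a capture group
initials = monte.str.replace(r'^(\w)\w*', r'\1.', regex=True)

print(has_terry.tolist())
print(n_vowels.tolist())
print(initials.tolist())
```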

If you want to know more about Pandas string methods and regular expressions, see these references:
1. About string methods
2. About regular expressions

Miscellaneous Methods

Other Pandas string methods

Method          Description
get             Indexes each element
slice           Slices each element
slice_replace   Replaces slice in each element with the passed value
cat             Concatenates strings
repeat          Repeats values
normalize       Returns Unicode form of strings
pad             Adds whitespace to left, right, or both sides of strings
wrap            Splits long strings into lines with length less than a given width
join            Joins strings in each element of the Series with the passed separator
get_dummies     Extracts dummy variables as a DataFrame
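
A quick sketch of a few methods from this table, using small made-up Series (s and parts are hypothetical names):

```python
import pandas as pd

s = pd.Series(['a', 'bc', 'def'])

# pad: fill each string on the left up to a width of 5
padded = s.str.pad(5, side='left', fillchar='*')

# repeat: repeat each string twice
repeated = s.str.repeat(2)

# cat: with only a separator, concatenates the whole Series into one string
joined = s.str.cat(sep='-')

# join: each element is a list of strings, joined with the separator
parts = pd.Series([['x', 'y'], ['z']])
glued = parts.str.join('+')

print(padded.tolist(), repeated.tolist(), joined, glued.tolist())
```

Note the difference between the last two: cat combines the elements of the Series with each other, while join works inside each element.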

Vectorized item access and slicing

  • The get and slice operations, in particular, enable vectorized access to the elements of each entry (the characters of a string, or the items of a list).
  • We can get the first three characters of each string using str.slice(0,3).
  • This behavior is also available through Python's normal indexing syntax; df.str.slice(0,3) is equivalent to df.str[0:3].
# In[11]
monte.str[0:3]
# Out[11]
0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
  • Likewise, df.str.get(i) and df.str[i] are equivalent.

  • These indexing methods also let you access elements of the lists returned by split.

# In[12]
monte.str.split().str[-1]
# Out[12]
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
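
The equivalences described in this section can be checked directly. A minimal sketch using the same monte Series (same_slice, same_item and tenth are illustrative names):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# slice(0, 3) and [0:3] give the same result
same_slice = monte.str.slice(0, 3).equals(monte.str[0:3])

# get(0) and [0] give the same result
same_item = monte.str.get(0).equals(monte.str[0])

# A position past the end of a string yields a missing value
# rather than an IndexError ('Eric Idle' has only 9 characters).
tenth = monte.str.get(10)

print(same_slice, same_item)
print(tenth.isna().tolist())
```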

Indicator variables

  • The get_dummies method is useful when your data has a column containing some sort of coded indicator.
# In[13]
full_monte=pd.DataFrame({'name':monte,
'info':['B|C|D','B|D','A|C','B|D','B|C','B|C|D']})
full_monte
# Out[13]
             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D
  • The get_dummies routine lets us split out these indicator variables into a DataFrame.
# In[14]
full_monte['info'].str.get_dummies('|')
# Out[14]
   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
  • With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
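
As one example of such a building block, the indicator columns can be joined back onto the names and used for filtering. This is a sketch on a shortened, hypothetical version of the table, and it assumes the codes are separated by '|' with no surrounding spaces (otherwise the spaces become part of the column labels):

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam'],
    'info': ['B|C|D', 'B|D', 'A|C'],
})

# One 0/1 column per code found in the 'info' strings
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicator columns to the names (the indexes line up)
result = full_monte[['name']].join(dummies)

# Select everyone whose info includes code C
has_c = result.loc[result['C'] == 1, 'name'].tolist()
print(has_c)
```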