특정 단어 포함한 리스트/인덱스 추출하기

Yubin's velog ! ·2023년 7월 20일

Data Preprocessing python regular expression

data preprocessing 👩🏻‍💻

목록 보기

1/2

background

엑셀 데이터를 전처리 하는 과정에서 특정 열을 삭제해야 하는 케이스인데, 열의 인덱스로 매번 찾을 수 없다는 점이 나의 struggle point 였다.

삭제해야 하는 열의 순서가 계속 뒤바뀌기기에, 열의 순서가 아닌 열의 이름을 기준점으로 삼아 진행해야 했다.

우선 내가 찾아야 하는 타켓은 "Order" 와 "WBS" 라는 단어를 포함하고 있는 해당 열의 인덱스이다.
partial_list = ["Order", "WBS"]

그러나 주어진 raw data를 보면 공백을 상당수 많이 포함하고 있다. (심지어 공백도 매번 데이터를 뽑을 때마다 다르다 .. )

col_list = ['Cocd', 'Doc.data  ', 'Post.data ', 'Vendor    ', 'Bill of Lading                     ', 'Account   ', 'Cost cente', 'Order       ', 'WBS element             ', 'Prctr     ', 'Amount']

# 정말 공백이 불규칙적으로 존재한다

사실 처음 접근 방법은 in 사용이었는데, 이상하게 이 리스트에서는 왜 안먹히는지 모르겠다.
직접 문자열을 쳐서 하면 문제 없이 작동되지만, xls 에서 xlsx로 변환 후 열 이름 리스트를 사용하면 포함여부를 떠나서 계속 False 값만 반환한다.

그래서 생각해낸 방법은 정규표현식을 사용하여 접근하는 방식이었다.
정규표현식은 특정한 규칙이나 문자열을 찾을 때 유용한 방법으로 알고 있기에 정규 표현식을 사용하였다.

import re

matching_words = []
matching_col_index = []

partial_list = ['Order', 'WBS']

for i in range(len(partial_list[i]):
	
    partial = partial_list[i]
    pattern = re.compile(rf'\w*{partial}\w*') 
    #\w = word의 약자이며, a-z, A-Z, 0-9중 힌개 문자를 의미 / partial 앞뒤 
    # 
   
    
    num = 0
  
    for word in col_list:
    	if re.match(pattern, word):
        	matching_words.append(word)
            matching_col_index.append(num)
            
        num += 1
        
print(matching_words)
print(matching_col_i

breakdown !

# 정규표현식 

for i in range(len(partial_list[i]):
	
    partial = partial_list[i]
    pattern = re.compile(rf'\w*{partial}\w*') # partial words 여부 
  
    num = 0
  
    for word in col_list:
    	if re.match(pattern, word): #match(regex object, regex object를 찾을 대상/string)
        	matching_words.append(word)
            matching_col_index.append(num)
            
        num += 1

compile() : compile regular expression pattern into regex object
- regex object 변경 후 matching operation: re.match or re.search 등을 활용하는 경우가 많음 (여기도 마찬가지!)
rf : raw formatted string (=string pre-fix), allows to emded expression inside curly braces "{}" within the string. rf{a} will be evaluated and inserted into the string at runtime
\w* : stands for words, matches zero or more word characters / a-z, A-Z, 0-9
{partial} : partial_list의 loop에 들어간 요솟값이며, 해당 값이 정규 표현식에 들어가서 매치 여부를 확인

summary

리스트에서 특정/부분 단어가 포함하는지 확인해야 하는 Task !
a. in 연산자 활용
b. 정규표현식 활용

참고

Yubin's velog !

@ubeen

다음 포스트

SAP에서 다운로드 받은 파일: xls 에서 xlsx 로 자동 변환하기

2개의 댓글

happy

2023년 7월 20일

글이 잘 정리되어 있네요. 감사합니다.

1개의 답글