[python] 파일에서 중복된 라인 제거하기

숲사람·2022년 6월 23일

유용한 도구

목록 보기

3/5

python에서 파일 중복을 제거하는 동작을 찾다 발견한 함수에서 이것저것 배운 내용들.
이 함수는 인자로 전달한 파일의 중복라인을 제거해준다.

def remove_duplicated_line(target_file):
    no_dup_lines = set() # set is O(1), better then list O(n)
    with open(target_file, 'r+') as fp:
        file_lines = fp.readlines()
        fp.seek(0)
        for line in file_lines:
            if line not in no_dup_lines:
                fp.write(line)
                no_dup_lines.add(line)
        fp.truncate()

source code from https://stackoverflow.com/a/60981617/17898033

file operation

with open(target_file, 'r+') as fp:

open with option r+
read / write 가 모두 가능한 옵션
fp.seek(0)
file stream pointer를 0으로 이동. 이걸 하지 않고 fp.write() 하면, 현재 파일의 끝에 써진다. 동일한 파일을 지우고 다시 작성하는것이기 때문에 fp를 0으로 이동시켜야한다.
fp.truncate()
인자가 없다면 현재 포지션 부터 뒤의 내용은 지운다. 현재 코드에서 중복된 라인이 지워지면 라인수가 줄어들것이기 때문에, 이 동작을 수행하지 않으면 이전파일 마지막 라인들이 남아있을 수 있다.
truncate(n) : Truncate from the first character of the first line of the file. The file is truncated to n characters; no n means truncation from the current position ; all characters after n are deleted after truncation.
https://blog.katastros.com/a?ID=00950-b42d4da4-394c-4b71-91f6-1ee8e291f043

python의 `set` 자료구조

no_dup_lines = list()

no_dup_lines = set() # set is O(1), better then list O(n)

처음에 no_dup_lines를 list로 사용하려고 했다. 그렇게 해도 동작에는 문제는 없다. 다만 if line not in no_dup_lines: 으로 Lookup 하는 동작의 시간 복잡도가 O(N)일 뿐.

반면 set을 사용하면 Lookup의 시간복잡도가 O(1)이다. 처음에 이름만으로 유추하기를 c++ STL과 동일하게 내부는 rb-tree로 구현되어있을것이라 생각하고 O(logN)일것이라고 생각했다. 하지만 set은 hashtable로 구현이 되어있었다. 따라서 탐색시간이 평균 O(1)이다.

다만 set은 list보다 모든 요소를 순회하는 시간이 더 느리다고 한다. list는 요소의 갯수만 확인하면 되지만 set은 hashtable전체를 순회 해야하기 때문이다.
https://111geon.github.io/DataSet/

python의 set 과 dictionary는 각각 c++ STL의 unordered_set과 unordered_map과 동일한 자료구조라는 점이 새삼 흥미로웠다(?)