[프로그래머스] 문자열 압축

chanyeong kim·2021년 9월 23일

목록 보기

6/51

vue image

📩 -->문제설명

데이터 처리 전문가가 되고 싶은 "어피치"는 문자열을 압축하는 방법에 대해 공부를 하고 있습니다. 최근에 대량의 데이터 처리를 위한 간단한 비손실 압축 방법에 대해 공부를 하고 있는데, 문자열에서 같은 값이 연속해서 나타나는 것을 그 문자의 개수와 반복되는 값으로 표현하여 더 짧은 문자열로 줄여서 표현하는 알고리즘을 공부하고 있습니다.

예를 들어, "ababcdcdababcdcd"의 경우 문자를 1개 단위로 자르면 전혀 압축되지 않지만, 2개 단위로 잘라서 압축한다면 "2ab2cd2ab2cd"로 표현할 수 있습니다. 다른 방법으로 8개 단위로 잘라서 압축한다면 "2ababcdcd"로 표현할 수 있으며, 이때가 가장 짧게 압축하여 표현할 수 있는 방법입니다.

다른 예로, "abcabcdede"와 같은 경우, 문자를 2개 단위로 잘라서 압축하면 "abcabc2de"가 되지만, 3개 단위로 자른다면 "2abcdede"가 되어 3개 단위가 가장 짧은 압축 방법이 됩니다. 이때 3개 단위로 자르고 마지막에 남는 문자열은 그대로 붙여주면 됩니다.

압축할 문자열 s가 매개변수로 주어질 때, 위에 설명한 방법으로 1개 이상 단위로 문자열을 잘라 압축하여 표현한 문자열 중 가장 짧은 것의 길이를 return 하도록 solution 함수를 완성해주세요.

제한사항

s의 길이는 1 이상 1,000 이하입니다.
s는 알파벳 소문자로만 이루어져 있습니다.

입출력 예

s	result
"aabbaccc"	7
"ababcdcdababcdcd"	9
"abcabcdede"	8
"abcabcabcabcdededededede"	14
"xababcdcdababcdcd"	17

💡 solution(사용언어: python)

def solution(s):
    result=[]	# unit별로 슬라이싱한 문자열을 담은 리스트
    answer=''	# 압축되어진 문자열
    answer_list=[]	# unit 별로 압축한 문자열의 길이를 담는 리스트
    if len(s)==1:
        return 1
    for unit in range(1,(len(s)//2)+1):
        if unit==1:
            for j in range(1,len(s)+1,unit):
                result.append(s[j-unit:j])
#             print(result)    
            cnt=1
            j=1
            while cnt <= len(result):
                try:
                    if result[cnt-1]==result[cnt]:
                        cnt+=1
                        j+=1
                    else:
                        q_word=str(j)+result[cnt-1]
                        if q_word[0]=='1'and q_word[1].isalpha():
#                             print(q_word)
                            q_word=q_word.replace('1','')    
                        answer=answer+q_word
                        j=1
                        cnt+=1
                except:
                    q_word=str(j)+result[cnt-1]
                    if q_word[0]=='1'and q_word[1].isalpha():
#                         print(q_word)
                        q_word=q_word.replace('1','')   
                    answer=answer+q_word
                    j=1
                    cnt+=1
#             print(answer)
            answer_list.append(len(answer))
            answer=''
        else:
            result=[]
            for j in range(unit,len(s)+1000,unit):      # 여기서 1000을 더해준 이유가,, 특정 unit은 남은 글자를 포함 안시켜 줘서..
    #             print('unit:',unit,'//','j:',j)
                if s[j-unit:j]!='':
                    result.append(s[j-unit:j])
                if len(s[j-unit:j]) < unit:            # 다 확인했으면 더 이상 비교 안하도록 break
                    break
#             print(result)
            cnt=1
            j=1
            while cnt <= len(result):
                try:
                    if result[cnt-1]==result[cnt]:
                        cnt+=1
                        j+=1
                    else:
                        q_word=str(j)+result[cnt-1]
                        if q_word[0]=='1'and q_word[1].isalpha():
#                             print(q_word)
                            q_word=q_word.replace('1','')   
                        answer=answer+q_word
                        j=1
                        cnt+=1
                except:
                    q_word=str(j)+result[cnt-1]
                    if q_word[0]=='1' and q_word[1].isalpha():
#                         print(q_word)
                        q_word=q_word.replace('1','')     
                    answer=answer+q_word
                    j=1
                    cnt+=1
#             print(answer)
            answer_list.append(len(answer))
            answer=''
    return min(answer_list)

👉 설명

기본적인 컨셉은 unit 수만큼 문자를 잘라서 리스트(result)에 담고, 각 리스트에서 앞 뒤 문자와 같은지 비교하는 흐름이다.

    for unit in range(1,(len(s)//2)+1):
        if unit==1:
            for j in range(1,len(s)+1,unit):
                result.append(s[j-unit:j])

전체 문자열의 길이의 절반 이상의 문자열을 압축할 수 없기 때문에, (len(s)//2)+1까지로 범위를 잡아주었고, 이후 각각 unit 별로 문자열을 슬라이싱한 후 리스트에 담아 주었다.

	if unit==1:
            for j in range(1,len(s)+1,unit):
                result.append(s[j-unit:j])
#             print(result)    
            cnt=1
            j=1
            while cnt <= len(result):
                try:
                    if result[cnt-1]==result[cnt]:
                        cnt+=1
                        j+=1
                    else:
                        q_word=str(j)+result[cnt-1]
                        if q_word[0]=='1'and q_word[1].isalpha():
#                             print(q_word)
                            q_word=q_word.replace('1','')    
                        answer=answer+q_word
                        j=1
                        cnt+=1
                except:
                    q_word=str(j)+result[cnt-1]
                    if q_word[0]=='1'and q_word[1].isalpha():
#                         print(q_word)
                        q_word=q_word.replace('1','')   
                    answer=answer+q_word
                    j=1
                    cnt+=1
#             print(answer)
            answer_list.append(len(answer))
            answer=''

내가 생각한 흐름대로라면 unit을 한개로 잘랐을 때와 2개 이상으로 잘랐을 때 범위가 달라서 구분지어서 코딩을 짰다.
문자열을 1개씩 슬라이싱되어 있는 result에서 cnt라는 변수가 len(result)와 같아 질때까지 앞뒤를 비교하는 흐름이다.
앞뒤가 같으면 cnt와 j(앞뒤로 같은 문자열의 갯수)를 1씩 더해주고, 같지 않으면 그 전까지 같았던 문자들 앞에 j를 새롭게 추가해준다.
마지막에서 리스트의 범위를 초과하는 오류가 떠서 try 구문으로 작성했다.
문자열을 압축시킨 새로운 문자열 q_word의 앞이 1로 시작하고 그 다음이 알파벳이라면 1을 제거해 줬다(q_word가 앞뒤가 같지 않아도 1을 넣어 줬기 때문..)
최종 문자열 answer의 길이를 answer_list에 담았다.
2이상으로 잘랐을 때도 동일한 흐름으로 진행된다.

결과

다른 풀이

def compress(text, tok_len):
    words = [text[i:i+tok_len] for i in range(0, len(text), tok_len)]
    res = []
    cur_word = words[0]
    cur_cnt = 1
    for a, b in zip(words, words[1:] + ['']):
        if a == b:
            cur_cnt += 1
        else:
            res.append([cur_word, cur_cnt])
            cur_word = b
            cur_cnt = 1
    return sum(len(word) + (len(str(cnt)) if cnt > 1 else 0) for word, cnt in res)

def solution(text):
    return min(compress(text, tok_len) for tok_len in list(range(1, int(len(text)/2) + 1)) + [len(text)])

a = [
    "aabbaccc",
    "ababcdcdababcdcd",
    "abcabcdede",
    "abcabcabcabcdededededede",
    "xababcdcdababcdcd",

    'aaaaaa',
]

for x in a:
    print(solution(x))

이런 깔끔하고 파이써닉한 코드도 있다..

🌈 느낀 점

추석때도 하루에 한개씩은 풀어보려고 했는데, 결국 추석 지나고 나서야 한 문제를 풀었다...ㅋㅋㅋ 문제가 쉬워 보였는데 생각보다 고민해야할 부분들이 여러개 있었다. 문자열이 1개일 때 앞뒤로 똑같은 문자열의 갯수가 10개 이상일 때 등등..! 여튼 추석도 지났으니 다시 달려봐야겠다!

출처: 프로그래머스

오류가 있으면 댓글 달아주세요🙂