Building a Program to Bulk-Upload HTML Backup Files to Confluence

singleheart · June 6, 2023

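Motivation

In a previous post I described moving from Quip to Confluence.
Even if you aren't coming from Quip, you may have a large pile of backup files that you need to import into Confluence.
Unfortunately, Confluence's built-in menu can only import one file at a time; it has no bulk import.
(Importing here is not the same as uploading an attachment. When you import a file, its contents replace the body of the Confluence page itself.)

Fortunately, Confluence provides an API for creating and updating pages. Let's first look at the API that retrieves a page's content.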

curl -H "Authorization: Bearer YOUR_TOKEN" "https://confluence-site-url/rest/api/content/PAGE_ID?expand=body.storage" -H "Content-Type: application/json" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1167    0  1167    0     0   9059      0 --:--:-- --:--:-- --:--:-- 10060
{
  "id": "PAGE_ID",
  "type": "page",
  "status": "current",
  "title": "anaconda",
  "body": {
    "storage": {
      "value": "<ul>\n\t<li>Create a python3 environment\n\t<ul>\n\t\t<li>$ conda create -n py36 python=3.6 anaconda</li>\n\t</ul>\n\t</li>\n\t<li>Activate the environment\n\t<ul>\n\t\t<li>source activate py36</li>\n\t</ul>\n\t</li>\n</ul>\n\n",
      "representation": "storage",
      "_expandable": {
        "content": "/rest/api/content/PAGE_ID"
      }
    },
(omitted)
}

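It must be an old page: it describes how to create a conda virtual environment for Python 3.6.
The important point is that we could read the content of a Confluence page through the API. Pages can be created and updated in the same way.

If there is an API, curl could handle everything, so is it really necessary to write a Python program instead of a bash script?
Unfortunately, there is no convenient API that takes an HTML file as an argument and updates the page with its contents as-is.
So I had to read the HTML file in Python and build the API input from it.
On top of that, I wanted to create the pages in Confluence while preserving the original directory structure, so that had to be implemented as well.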

Updating a single page from an HTML file

First, let's check the Confluence API for updating a page's content.

PUT /rest/api/content/456
{
    "version": {
        "number": 2
    },
    "title": "My new title",
    "type": "page",
    "body": {
        "storage": {
            "value": "<p>New page data.</p>",
            "representation": "storage"
        }
    }
}

Here 456 is the ID of the page to update. Confluence keeps a version history, and the version goes up by one every time a page is updated.
In this example page 456 was at version 1, so the update sets version.number to 2. Confluence requires the title even on an update.
The API does not work without a title, even when the title is unchanged. type is also required; when updating page content, set it to page.
body.storage.value holds the actual content, and HTML tags are allowed there. body.storage.representation is required as well.

Now we can guess how to upload HTML: put the contents of the HTML file into body.storage.value.

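import os
import requests

# read the HTML file; the context manager closes it automatically
with open("filename.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()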

token = os.environ["CONFLUENCE_TOKEN"]
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

payload = {
    "version": {"number": new_version},
    "type": "page",
    "title": "Test Page",
    "body": {"storage": {"value": html, "representation": "storage"}},
}

r = requests.put(
    f"{url}/rest/api/content/{page_id}",
    json=payload,
    headers=headers,
)

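This code reads an HTML file and updates a Confluence page with it. Set url to your Confluence base URL and page_id to the ID of the target page.
It would be a problem if anyone could call the API and change content, so Confluence requires an authentication token.
You can create a token in your personal settings menu. It is safer to store it in an environment variable and read it from there.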
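
As a small illustration of the environment-variable approach, here is a minimal sketch that builds the request headers and fails with a readable message when the token is missing. The function name build_headers and the error text are my own assumptions, not part of the original program; only the CONFLUENCE_TOKEN variable name and the two headers come from the code above.

```python
import os
from typing import Mapping


def build_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Return request headers built from the CONFLUENCE_TOKEN environment variable."""
    token = env.get("CONFLUENCE_TOKEN", "").strip()
    if not token:
        # Hypothetical error message; adjust the wording to taste.
        raise RuntimeError("Set CONFLUENCE_TOKEN to your personal access token first.")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    # Demo with a fake environment so that no real secret is printed.
    print(build_headers({"CONFLUENCE_TOKEN": "example-token"}))
```

Passing the environment in as a parameter keeps the function easy to try out without touching the real os.environ.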

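However, this code runs into a problem because of new_version. As mentioned, an update must carry a version one higher than the previous one.
You could open the page and look up the version number in its history, but can this be automated? It can.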

Finding a page's version number

Let's look again at the API that retrieves a page. It includes version information, and it can be called like this.

params = {"expand": "version"}
r = requests.get(
    f"{url}/rest/api/content/{page_id}",
    headers=headers,
    params=params,
)

old_version = r.json()["version"]["number"]
new_version = old_version + 1

Now the version can be incremented automatically.
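
Since the title is mandatory even when it does not change, the same GET response can supply both the next version number and the current title. The following is only a sketch under that assumption; build_update_payload is a hypothetical helper, and the sample dictionary is a hand-written stand-in for a real API response.

```python
def build_update_payload(current: dict, html: str) -> dict:
    """Build the PUT body from the JSON returned by GET ...?expand=version."""
    return {
        "version": {"number": current["version"]["number"] + 1},
        "type": current.get("type", "page"),
        # The title is mandatory even when it does not change.
        "title": current["title"],
        "body": {"storage": {"value": html, "representation": "storage"}},
    }


if __name__ == "__main__":
    # Hand-written stand-in for a real API response.
    sample = {"id": "456", "type": "page", "title": "anaconda", "version": {"number": 1}}
    print(build_update_payload(sample, "<p>New page data.</p>"))
```

The returned dictionary has the same shape as the payload passed to requests.put earlier.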

Making the HTML well-formed

Will the update code above work now? It might, but depending on the HTML it can also fail.
That's because the Confluence API requires the HTML to be strictly well-formed: every tag that is opened must be closed.

Fortunately, fixing up HTML is very easy in Python. Just use BeautifulSoup.

soup = BeautifulSoup(html, "html.parser")
body = soup.prettify()

Uploading body instead should now work.
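
To see in advance whether a fragment would trip the "every tag must be closed" rule, one option is to run it through a strict XML parser. This is a rough sketch using only the standard library, not what the original program does, and is_well_formed is a hypothetical name. It is probably stricter than Confluence itself: for example, named entities such as &nbsp; are rejected by the XML parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a single root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<ul><li>one</li></ul>"))  # all tags closed
    print(is_well_formed("<p>line one<br>line two</p>"))  # <br> is never closed
```

A check like this is only meant for body fragments; a full document with a DOCTYPE would fail once wrapped in the extra root element.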

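Uploading attachments

The code so far updates the page body, but Confluence pages also have attachments. It would be nice to back up and restore those automatically too.
Fortunately there is an API for adding attachments. The API documentation includes an example, so you can try it with curl first.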

with open(filename, "rb") as f:
    files = {"file": (filename, f)}
    header = {
        "Authorization": f"Bearer {token}",
        "X-Atlassian-Token": "nocheck",
    }
    r = requests.post(
        f"{url}/rest/api/content/{page_id}/child/attachment",
        headers=header,
        files=files,
    )

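This is similar to the code used so far, but the header is slightly different. When uploading a file to Confluence you must add "X-Atlassian-Token": "nocheck" to the headers, and Content-Type is left out so that requests can set the multipart one itself.
This code works, but Korean file names can cause trouble depending on how they are encoded.
If a file name is percent-encoded, as in %E1%84%89, convert it with urllib.parse.unquote.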
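
Here is a short sketch of that conversion. The bytes %E1%84%89 are a standalone Hangul jamo, which suggests that such names are stored in decomposed (NFD) form, so this example also composes them with unicodedata.normalize. That extra step is my own assumption and not something the original code does; decode_filename is a hypothetical helper name.

```python
import unicodedata
from urllib.parse import unquote


def decode_filename(name: str) -> str:
    """Percent-decode a file name, then compose Hangul jamo with NFC."""
    decoded = unquote(name)  # "%E1%84%89..." becomes actual characters
    return unicodedata.normalize("NFC", decoded)


if __name__ == "__main__":
    # "%E1%84%89%E1%85%A1" is the decomposed spelling of a single Hangul syllable.
    print(decode_filename("%E1%84%89%E1%85%A1.txt"))
```

If you would rather keep the file name exactly as decoded, drop the normalize call and return the result of unquote directly.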

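Updating the whole set of documents while keeping the hierarchy

Now that we can upload one document, let's see how to update many at once.
Python offers the handy os.walk function, which makes it easy to traverse even a deeply nested directory tree,
but it is awkward to use here, because the goal is to reproduce the hierarchy on the Confluence side as well: each directory becomes a page whose ID has to be passed down to its children.
Instead, I use os.listdir and recurse.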

def recursive_upload(input_path: str, parent_id: int):
    parent_id = create_page(parent_id, input_path)

    for d in os.listdir(input_path):
        if os.path.isdir(os.path.join(input_path, d)):
            recursive_upload(os.path.join(input_path, d), parent_id)
        else:
            page_id = create_page(parent_id, d[:-5])
            path = os.path.join(input_path, d)
            upload_page(path, page_id)

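This function first creates a page for the current directory, then looks at the entries in that directory with os.listdir.
os.listdir returns files and directories alike, so os.path.isdir is needed to tell them apart.
For a directory, recursive_upload calls itself; for a file, it creates the corresponding page (d[:-5] drops the .html extension to get the title) and updates its content.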
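
To check the traversal order without touching Confluence, the same recursion can be run as a dry run that only records which page would be created under which parent. This is a sketch with assumptions: plan_uploads is a hypothetical function, and page titles stand in for the page IDs that the real API would return.

```python
import os
import tempfile


def plan_uploads(input_path: str, parent: str = "ROOT") -> list:
    """Return (parent_title, page_title) pairs in the order pages would be created."""
    title = os.path.basename(os.path.normpath(input_path))
    plan = [(parent, title)]  # one page for the directory itself
    for name in sorted(os.listdir(input_path)):
        path = os.path.join(input_path, name)
        if os.path.isdir(path):
            plan.extend(plan_uploads(path, title))  # recurse into subdirectories
        elif name.endswith(".html"):
            plan.append((title, name[: -len(".html")]))  # file becomes a child page
    return plan


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a tiny throwaway tree: backup/index.html and backup/team/notes.html
        root = os.path.join(tmp, "backup")
        os.makedirs(os.path.join(root, "team"))
        for rel in ("index.html", os.path.join("team", "notes.html")):
            open(os.path.join(root, rel), "w").close()
        for parent, title in plan_uploads(root):
            print(f"{parent} -> {title}")
```

Sorting the directory entries is an extra choice made here so that the plan is reproducible; os.listdir itself returns entries in arbitrary order.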

Creating a page

I explained how to update a page but not how to create one. Confluence provides an API for creating pages as well.

def create_page(parent_id: int, space_key: str, title: str) -> int:
    payload = {
        "type": "page",
        "ancestors": [{"id": parent_id}],
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": "", "representation": "storage"}},
    }

    r = requests.post(
        f"{url}/rest/api/content",
        json=payload,
        headers=headers,
    )

    # return the id of the newly created page
    return r.json()["id"]

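Creating a page requires space_key, which identifies the space the page belongs to.
In Confluence it is often also called the project key. You can easily find it by looking at your Confluence URL.
In fact the earlier code was passing this variable around all along; I left it out to keep the explanation simple.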
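
As an illustration of reading the key off the URL, here is a rough sketch. It assumes the two URL layouts I am aware of, .../display/KEY/Page+Title and .../spaces/KEY/pages/..., and your site may use something else. The function space_key_from_url and the example hosts are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def space_key_from_url(page_url: str) -> Optional[str]:
    """Guess the space key from a page URL, or return None if the URL has no key."""
    path = urlparse(page_url).path
    # Assumed layouts: .../display/<KEY>/<title> and .../spaces/<KEY>/...
    match = re.search(r"/(?:display|spaces)/([^/]+)", path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(space_key_from_url("https://confluence.example.com/display/DEV/Some+Page"))
    print(space_key_from_url("https://confluence.example.com/pages/viewpage.action?pageId=123"))
```

URLs that identify a page only by pageId carry no space key, in which case the function returns None and the key has to be looked up in the space settings instead.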

Full code

Here is the full code, pulling together everything covered above. It also has some exception handling, so it works reasonably well.


# create a confluence page with a html file

import argparse
import os
import re
import requests
import sys
from bs4 import BeautifulSoup
from pathlib import Path, PurePath
from urllib.parse import unquote


# get the confluence url
url = "put your Confluence URL here"

# authenticate with my personal access token
token = os.environ["CONFLUENCE_TOKEN"]
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}


def create_page(parent_id: int, space_key: str, title: str) -> int:
    """Create a new confluence page with the given parent id and return the new page id"""

    # create the json payload
    payload = {
        "type": "page",
        "ancestors": [{"id": parent_id}],
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": "", "representation": "storage"}},
    }

    # post the page
    r = requests.post(
        f"{url}/rest/api/content",
        json=payload,
        headers=headers,
    )

    if r.status_code != 200:
        if r.status_code == 400:
            # search id of the page with the same title
            r = requests.get(
                f"{url}/rest/api/content",
                headers=headers,
                params={"title": title, "spaceKey": space_key},
            )
            return r.json()["results"][0]["id"]
        else:
            print(r.json())
            sys.exit(-1)

    # return the page id
    return r.json()["id"]


def upload_page(file_path: str, page_id: int):
    print(f"Uploading {file_path} to {page_id}")

    params = {"expand": "version"}
    r = requests.get(
        f"{url}/rest/api/content/{page_id}",
        headers=headers,
        params=params,
    )
    if r.status_code != 200:
        print(file_path, page_id)
        print(r.json())
        return

    # create the json payload
    old_version = r.json()["version"]["number"]
    new_version = old_version + 1

    # get the html file
    with open(file_path, "r", encoding="utf-8") as html_file:
        html = html_file.read()
    html = re.sub(r"", "", html)

    # parse the html file
    soup = BeautifulSoup(html, "html.parser")
    if soup.h1:
        soup.h1.decompose()

    # upload attachments
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("blobs"):
            upload_file(a, file_path, page_id)

    body = soup.prettify()
    payload = {
        "version": {"number": new_version},
        "type": "page",
        "title": Path(file_path).stem,
        "body": {"storage": {"value": body, "representation": "storage"}},
    }

    # post the page
    r = requests.put(
        f"{url}/rest/api/content/{page_id}",
        json=payload,
        headers=headers,
    )

    # print the response
    if r.status_code != 200:
        print(r.json())


def upload_file(a: dict, file_path: str, page_id: int):
    href = a["href"]
    href = href.replace("&amp;", "&")
    try:
        with open(os.path.join(os.path.dirname(file_path), href), "rb") as f:
            unquoted = unquote(href[6:])
            unquoted = unquoted.split("&")[0]
            a["href"] = f"{url}/download/attachments/{page_id}/{unquoted}"
            print(f"Uploading {href} as {unquoted}")
            files = {"file": (unquoted, f)}
            attachment_header = {
                "Authorization": f"Bearer {token}",
                "X-Atlassian-Token": "nocheck",
            }
            r = requests.post(
                f"{url}/rest/api/content/{page_id}/child/attachment",
                headers=attachment_header,
                files=files,
            )

            if r.status_code not in [200, 400]:
                try:
                    print(r.json())
                except requests.exceptions.JSONDecodeError:
                    print(r.text)

    except FileNotFoundError:
        print(f"File not found: {href}")


def recursive_upload(input_path: str, space_key: str, parent_id: int):
    """upload all files in the given directory recursively"""
    print(f"Uploading {input_path}...")
    dirname = PurePath(input_path).name
    parent_id = create_page(parent_id, space_key, dirname)

    for d in os.listdir(input_path):
        if os.path.isdir(os.path.join(input_path, d)):
            if d != "blobs":
                recursive_upload(os.path.join(input_path, d), space_key, parent_id)
        else:
            if d.endswith(".html"):
                page_id = create_page(parent_id, space_key, d[:-5])
                path = os.path.join(input_path, d)
                upload_page(path, page_id)


if __name__ == "__main__":
    # get the confluence page id
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--page-id",
        help="confluence page id of the uploading root page",
        type=int,
        required=True,
    )
    parser.add_argument("--input-path", help="path to upload", required=True)
    parser.add_argument(
        "--recursive",
        action="store_true",
        help="if path is a directory, upload all files recursively",
    )
    parser.add_argument("--space-key", help="space key to upload")
    args = parser.parse_args()

    # traverse the directory and upload all files
    if args.recursive:
        assert os.path.isdir(args.input_path), "input_path must be a directory"
        recursive_upload(args.input_path, args.space_key, args.page_id)
    else:
        upload_page(args.input_path, args.page_id)

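Usage

For anyone who actually wants to back up and restore, here is how to run it.

python main.py --page-id [start-page-id] --input-path [start-directory] --recursive --space-key [project-key]

page-id: the page ID in Confluence under which the upload starts
input-path: the top-level directory of the HTML backup files to upload
space-key: the key of the Confluence space. Open Space tools at the bottom left of a page, go to Overview, and look under Space details.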

Source code

The full code is on GitHub:
https://github.com/singleheart/quip-to-confluence
