[Data Engineering] Iceberg Snapshots: Management and Internal Layout

Chan hae OH · July 28, 2024

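1. Introduction


Hello.

While working in data engineering and operations, I keep running into new facts and open questions. I plan to write them up as a series

so that what I have newly learned, and what I had misunderstood, sticks in my memory.

For Iceberg, I am learning from the official documentation and from what I find by searching the web.

Iceberg Documentation

If you spot anything incorrect while reading, please ask me to correct it.

It helps my understanding a great deal. :)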



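2. What is a Snapshot?


Each time the data changes (a commit), Iceberg records the resulting table state as a new version, called a snapshot.

This gives you a powerful way to restore the data easily when it has been handled incorrectly.

What is an Iceberg snapshot?: https://iceberg.apache.org/terms/#snapshot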

https://iceberg.apache.org/docs/nightly/spark-queries/#time-travel



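Iceberg reads the snapshot history from the information recorded under the table's metadata directory.

How long snapshots are retained can be set as a number (for example, a maximum snapshot age or a minimum number of snapshots to keep).

(Figure: rolling back between snapshots using the time travel feature.)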
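
To make the idea concrete, here is a minimal Python sketch of how time travel can choose a snapshot. Given entries shaped like the snapshot log (a timestamp in milliseconds plus a snapshot ID), it returns the latest snapshot committed at or before the requested time. The function name and the sample values are hypothetical, and this illustrates the concept only, not Iceberg's actual implementation.

```python
from bisect import bisect_right
from typing import Optional


def snapshot_as_of(snapshot_log: list[dict], as_of_ms: int) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before as_of_ms."""
    # Sort by commit time so the timestamps can be binary-searched.
    entries = sorted(snapshot_log, key=lambda e: e["timestamp-ms"])
    timestamps = [e["timestamp-ms"] for e in entries]
    idx = bisect_right(timestamps, as_of_ms)
    if idx == 0:
        return None  # no snapshot existed yet at that time
    return entries[idx - 1]["snapshot-id"]


# Hypothetical snapshot log with three commits.
log = [
    {"timestamp-ms": 1000, "snapshot-id": 11},
    {"timestamp-ms": 2000, "snapshot-id": 22},
    {"timestamp-ms": 3000, "snapshot-id": 33},
]
print(snapshot_as_of(log, 2500))
```

Rolling back then amounts to making the chosen snapshot the table's current snapshot again.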



3. Using Snapshots


--trino

USE CATALOG.DB;
SELECT * FROM "TABLE$SNAPSHOTS";

In Iceberg, appending $snapshots to the table name lets you view the table's snapshot history.


You can then change the table's current state to one of those snapshots.
CALL CATALOG.SYSTEM.ROLLBACK_TO_SNAPSHOT('DB', 'TABLE', snapshot_id);
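
When these statements are issued from a script, it can help to build them in one place. The following is a small Python sketch that only formats the two statements shown above as strings. The catalog name `iceberg` is a hypothetical placeholder, the schema and table names are taken from the sample path later in this post, and actually sending the statements to Trino is left out.

```python
def snapshots_query(table: str) -> str:
    """Build a query against the table's $snapshots metadata table."""
    # The table name and the $snapshots suffix are quoted as one identifier.
    return f'SELECT * FROM "{table}$snapshots"'


def rollback_call(catalog: str, schema: str, table: str, snapshot_id: int) -> str:
    """Build the CALL statement that rolls a table back to snapshot_id."""
    # int() rejects anything that is not a numeric snapshot ID.
    return (
        f"CALL {catalog}.system.rollback_to_snapshot("
        f"'{schema}', '{table}', {int(snapshot_id)})"
    )


# "iceberg" is an assumed catalog name; the ID is only an example value.
print(snapshots_query("titanic_iceberg"))
print(rollback_call("iceberg", "stag", "titanic_iceberg", 7997487690611892323))
```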



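4. How Iceberg Loads Snapshots


When Iceberg stores a table on a file system, it uses the following layout.

  • /root_path/<table name>-<uuid>/metadata
  • /root_path/<table name>-<uuid>/data

Each time the data changes through an insert, delete, or similar operation (a commit), new files are written to the metadata and data folders according to their respective purposes.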

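The metadata folder holds the files that record column statistics, partitioning information, snapshot IDs, and so on. They are stored in the following forms.

  • *.metadata.json : table metadata file
  • *.stats : statistics file
  • *.avro : manifest list and manifest files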
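
As a rough illustration, the file types above can be told apart by name alone. The Python sketch below classifies file names from a metadata folder. Treating a `snap-` prefix as a manifest list follows the sample path shown later in this post, and treating every other .avro file as a manifest is an assumption made for this example. The name `abc-m0.avro` is a hypothetical manifest file name.

```python
def classify_metadata_file(name: str) -> str:
    """Classify a file in the metadata folder by its name."""
    if name.endswith(".metadata.json") or name.endswith(".metadata.json.gz"):
        return "table-metadata"
    if name.endswith(".stats"):
        return "statistics"
    if name.endswith(".avro"):
        # Assumption: manifest lists are named snap-<snapshot-id>-...avro,
        # and any other Avro file is a manifest.
        return "manifest-list" if name.startswith("snap-") else "manifest"
    return "other"


names = [
    "00000-74cee829-74a3-43c9-9cc6-dfbb1f477ced.metadata.json",
    "snap-9051391979230492032-1-752058d5-6448-4906-8e45-997a9423f83a.avro",
    "20240723_062207_00392_h42v4-d516069b-1119-492f-8835-a956b912e566.stats",
    "abc-m0.avro",  # hypothetical manifest file name
]
for n in names:
    print(classify_metadata_file(n), n)
```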

What is a manifest file?: https://iceberg.apache.org/terms/#manifest-file

Let's look at the metadata.json file first.


4.1. metadata.json

metadata.json stores the following fields.

https://iceberg.apache.org/spec/#table-metadata-and-snapshots

| Metadata field | JSON representation | Example |
|---|---|---|
| format-version | JSON int | 1 |
| table-uuid | JSON string | "fb072c92-a02b-11e9-ae9c-1bb7bc9eca94" |
| location | JSON string | "s3://b/wh/data.db/table" |
| last-updated-ms | JSON long | 1515100955770 |
| last-column-id | JSON int | 22 |
| schema | JSON schema (object) | See above, read schemas instead |
| schemas | JSON schemas (list of objects) | See above |
| current-schema-id | JSON int | 0 |
| partition-spec | JSON partition fields (list) | See above, read partition-specs instead |
| partition-specs | JSON partition specs (list of objects) | See above |
| default-spec-id | JSON int | 0 |
| last-partition-id | JSON int | 1000 |
| properties | JSON object: { "<key>": "<value>", ... } | { "write.format.default": "avro", "commit.retry.num-retries": "4" } |
| current-snapshot-id | JSON long | 3051729675574590000 |
| snapshots | JSON list of objects: [ { "snapshot-id": <id>, "timestamp-ms": <timestamp-in-ms>, "summary": { "operation": <operation>, ... }, "manifest-list": "<location>", "schema-id": "<id>" }, ... ] | [ { "snapshot-id": 3051729675574597004, "timestamp-ms": 1515100955770, "summary": { "operation": "append" }, "manifest-list": "s3://b/wh/.../s1.avro", "schema-id": 0 } ] |
| snapshot-log | JSON list of objects: [ { "snapshot-id": , "timestamp-ms": }, ... ] | [ { "snapshot-id": 30517296..., "timestamp-ms": 1515100... } ] |
| metadata-log | JSON list of objects: [ { "metadata-file": , "timestamp-ms": }, ... ] | [ { "metadata-file": "s3://bucket/.../v1.json", "timestamp-ms": 1515100... } ] |
| sort-orders | JSON sort orders (list of sort field object) | See above |
| default-sort-order-id | JSON int | 0 |
| refs | JSON map with string key and object value: { "<name>": { "snapshot-id": <id>, "type": <type>, "max-ref-age-ms": <long>, ... } ... } | { "test": { "snapshot-id": 123456789000, "type": "tag", "max-ref-age-ms": 10000000 } } |

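A new `metadata.json` file is written on every commit to record the changes.

The files are stored under /root_path/<table name>-<uuid>/metadata, and the five-digit number at the start of the file name keeps increasing with each new file (for example, a file starting with 00000- is followed by one starting with 00001-).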
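
The numbering can be read back out of the file names. Below is a minimal Python sketch that extracts the leading counter and picks the file with the highest one. The helper names are hypothetical, and the sketch only inspects names. In a real deployment the catalog keeps track of which metadata file is current, so this is for poking around a directory listing, not a substitute for the catalog.

```python
import re
from typing import Optional

# Matches names like 00043-2cb60328-....metadata.json and captures the counter.
_VERSION_RE = re.compile(r"^(\d+)-.+\.metadata\.json$")


def metadata_version(file_name: str) -> Optional[int]:
    """Return the leading counter of a metadata file name, or None."""
    match = _VERSION_RE.match(file_name)
    return int(match.group(1)) if match else None


def latest_metadata_file(file_names: list[str]) -> Optional[str]:
    """Return the metadata file name with the highest counter."""
    versioned = [(metadata_version(n), n) for n in file_names]
    versioned = [(v, n) for v, n in versioned if v is not None]
    return max(versioned)[1] if versioned else None


files = [
    "00043-2cb60328-808b-4fc8-b82f-4f19bf20c838.metadata.json",
    "00044-9e22507b-dfc6-41ec-bc95-1e6b4c37b15b.metadata.json",
    "00000-74cee829-74a3-43c9-9cc6-dfbb1f477ced.metadata.json",
]
print(latest_metadata_file(files))
```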


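4.2. Dissecting a Real metadata.json

Below is the content of the most recent metadata.json.

The snapshots key shows each snapshot's ID and some simple statistics.
The file also contains the keys that are actually put to use.

  • manifest-list : (inside each snapshots entry) the path of an Avro file; opening it reveals further metadata.
  • statistics : the path of a .stats file, which holds detailed statistics for that snapshot.
  • metadata-log : the history of earlier metadata files (metadata.json paths).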
"snapshots" : [ {
    "sequence-number" : 21,
    "snapshot-id" : 9051391979230492032,
    "parent-snapshot-id" : 7997487690611892323,
    "timestamp-ms" : 1721715728610,
    "summary" : {
      "operation" : "append",
      "trino_query_id" : "20240723_062207_00392_h42v4",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "1755",
      "changed-partition-count" : "1",
      "total-records" : "899",
      "total-files-size" : "26785",
      "total-data-files" : "2",
      "total-delete-files" : "1",
      "total-position-deletes" : "8",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/snap-9051391979230492032-1-752058d5-6448-4906-8e45-997a9423f83a.avro",
    "schema-id" : 0
  } ],
  "statistics" : [ {
    "snapshot-id" : 9051391979230492032,
    "statistics-path" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/20240723_062207_00392_h42v4-d516069b-1119-492f-8835-a956b912e566.stats",
    "file-size-in-bytes" : 26455,
    "file-footer-size-in-bytes" : 2337,
    "blob-metadata" : [ {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 1 ],
      "properties" : {
        "ndv" : "891"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 2 ],
      "properties" : {
        "ndv" : "2"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 3 ],
      "properties" : {
        "ndv" : "3"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 4 ],
      "properties" : {
        "ndv" : "891"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 5 ],
      "properties" : {
        "ndv" : "2"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 6 ],
      "properties" : {
        "ndv" : "88"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 7 ],
      "properties" : {
        "ndv" : "7"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 8 ],
      "properties" : {
        "ndv" : "7"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 9 ],
      "properties" : {
        "ndv" : "681"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 10 ],
      "properties" : {
        "ndv" : "248"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 11 ],
      "properties" : {
        "ndv" : "147"
      }
    }, {
      "type" : "apache-datasketches-theta-v1",
      "snapshot-id" : 9051391979230492032,
      "sequence-number" : 21,
      "fields" : [ 12 ],
      "properties" : {
        "ndv" : "3"
      }
    } ]
  } ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1721715728610,
    "snapshot-id" : 9051391979230492032
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1721610200550,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00000-74cee829-74a3-43c9-9cc6-dfbb1f477ced.metadata.json"
  }, {
    "timestamp-ms" : 1721610200640,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00001-739c7fb2-056d-4166-a7b7-c3a77a9ceb87.metadata.json"
  }, {
    "timestamp-ms" : 1721693452637,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00002-93365312-3c0e-4d0a-86b9-1eca07e549a4.metadata.json"
  }, {
    "timestamp-ms" : 1721693453578,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00003-363a2cf8-25ef-485a-911c-ffcba1183571.metadata.json"
  }, {
    "timestamp-ms" : 1721706852326,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00004-ad57cc43-dd6f-40f7-9323-f7dd8ad53459.metadata.json"
  }, {
    "timestamp-ms" : 1721708006959,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00005-2f53e2bb-d21c-4737-8fef-7c2b261a045a.metadata.json"
  }, {
    "timestamp-ms" : 1721709149005,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00006-1f7dc5fe-6127-4133-863b-2dc3e84ff6c8.metadata.json"
  }, {
    "timestamp-ms" : 1721709309911,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00007-2bbc9822-369f-433d-9cfc-5590d717f5c7.metadata.json"
  }, {
    "timestamp-ms" : 1721709310420,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00008-1200b960-a608-4826-82c2-6a813ce2d234.metadata.json"
  }, {
    "timestamp-ms" : 1721709664389,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00009-2ea02df8-4f8f-4a82-8b68-19ae6e195d8a.metadata.json"
  }, {
    "timestamp-ms" : 1721712646178,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00010-322a552d-19ca-4395-b28d-42b3735bb7ec.metadata.json"
  }, {
    "timestamp-ms" : 1721712647092,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00011-fc050bdf-35be-4b4c-b9a6-8fc10e339315.metadata.json"
  }, {
    "timestamp-ms" : 1721712651158,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00012-a26778a1-1f52-41a9-a208-c5f7a364d663.metadata.json"
  }, {
    "timestamp-ms" : 1721712652062,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00013-f2039d7f-6e5c-4fa1-a5a2-756f67d7937b.metadata.json"
  }, {
    "timestamp-ms" : 1721712656185,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00014-eee472f3-841e-4132-8810-8706190e3b1d.metadata.json"
  }, {
    "timestamp-ms" : 1721712657086,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00015-af42fc97-8103-4023-9bff-c1d2ee044410.metadata.json"
  }, {
    "timestamp-ms" : 1721712661144,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00016-161321ba-7f97-4293-9e50-787ab3ba5281.metadata.json"
  }, {
    "timestamp-ms" : 1721712662052,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00017-e6863195-5dac-4547-9302-9fcde4a0b86d.metadata.json"
  }, {
    "timestamp-ms" : 1721712664618,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00018-b4ef46c0-69e3-4c60-a217-505c21c8a4ab.metadata.json"
  }, {
    "timestamp-ms" : 1721712664717,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00019-bff53305-fdaa-4a4b-9409-0cfae4588f4d.metadata.json"
  }, {
    "timestamp-ms" : 1721712669010,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00020-037112df-6864-4f24-9502-3a1efbd0dff7.metadata.json"
  }, {
    "timestamp-ms" : 1721712669122,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00021-393b3050-ab11-43ce-93e1-e0c5b0867ccc.metadata.json"
  }, {
    "timestamp-ms" : 1721712672103,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00022-f1986a89-155f-47de-a866-24af5ffd3e9c.metadata.json"
  }, {
    "timestamp-ms" : 1721712672604,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00023-04e015aa-00c0-4639-9be4-a131013fbf37.metadata.json"
  }, {
    "timestamp-ms" : 1721712731582,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00024-e2365acc-fbd4-4522-91e8-089d65fbfb67.metadata.json"
  }, {
    "timestamp-ms" : 1721712785589,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00025-72910522-7ba6-4860-82c0-2792ea4927ba.metadata.json"
  }, {
    "timestamp-ms" : 1721712785688,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00026-6be29a3c-68c6-492a-a013-a7a1d2e2df44.metadata.json"
  }, {
    "timestamp-ms" : 1721712788039,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00027-e1e52c34-2005-410c-8392-738e2bbf32ac.metadata.json"
  }, {
    "timestamp-ms" : 1721712788130,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00028-33222463-07ee-4ca1-9a35-7633a0404c3b.metadata.json"
  }, {
    "timestamp-ms" : 1721712794270,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00029-73450c20-586b-4062-8707-2c1c1e4f6654.metadata.json"
  }, {
    "timestamp-ms" : 1721712794364,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00030-efcbbf9a-e3da-4cd1-8b49-4f6267ccd2eb.metadata.json"
  }, {
    "timestamp-ms" : 1721712810821,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00031-f4f926b8-f9b1-4bb0-b5ac-817067ac83b2.metadata.json"
  }, {
    "timestamp-ms" : 1721712811312,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00032-d98fb48f-2091-47d3-ab0c-0e52f903c720.metadata.json"
  }, {
    "timestamp-ms" : 1721712814724,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00033-7127f362-3ecd-41b1-949c-4672db53b6f6.metadata.json"
  }, {
    "timestamp-ms" : 1721712815217,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00034-3763f052-01f2-46e1-928a-5be11dd37576.metadata.json"
  }, {
    "timestamp-ms" : 1721713963627,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00035-40336da2-6f3c-4327-8ebb-1bbe3a8f55cd.metadata.json"
  }, {
    "timestamp-ms" : 1721713963717,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00036-99925f01-9a76-4a7e-aec4-bfc0f27a7763.metadata.json"
  }, {
    "timestamp-ms" : 1721713970906,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00037-8d70f3a0-8225-4e5f-976b-dbff76b80057.metadata.json"
  }, {
    "timestamp-ms" : 1721713970994,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00038-bc68452b-9846-4931-b957-b9170ee19e25.metadata.json"
  }, {
    "timestamp-ms" : 1721714231109,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00039-1c13f00f-cb33-4f34-a28b-3cbd1dd4005c.metadata.json"
  }, {
    "timestamp-ms" : 1721714231600,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00040-d28a907d-73ad-42db-8fe2-2c141f0b1b35.metadata.json"
  }, {
    "timestamp-ms" : 1721714970140,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00041-e197463c-62f3-4e95-a3a1-19dfd2626e02.metadata.json"
  }, {
    "timestamp-ms" : 1721715711554,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00042-d2b5f8ab-f40b-40eb-a77a-ffb93c61187d.metadata.json"
  }, {
    "timestamp-ms" : 1721715728610,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00043-2cb60328-808b-4fc8-b82f-4f19bf20c838.metadata.json"
  }, {
    "timestamp-ms" : 1721715728760,
    "metadata-file" : "hdfs://ec2-54-79-119-0.ap-southeast-2.compute.amazonaws.com:9000/hive/lake/stag.db/titanic_iceberg-2e45403dbbdd4ae6b322e85c786ff120/metadata/00044-9e22507b-dfc6-41ec-bc95-1e6b4c37b15b.metadata.json"
  } ]
}
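
To explore a file like this without scrolling through it, the keys discussed above can be pulled out with a few lines of Python. This is a minimal sketch using only the standard library. The sample document is a trimmed stand-in for the real file, and the `hdfs://example/...` paths in it are placeholders, not the actual locations.

```python
import json


def summarize_snapshots(metadata: dict) -> list[dict]:
    """Collect the ID, operation, record count, and manifest list per snapshot."""
    rows = []
    for snap in metadata.get("snapshots", []):
        summary = snap.get("summary", {})
        rows.append({
            "snapshot-id": snap["snapshot-id"],
            "operation": summary.get("operation"),
            "total-records": summary.get("total-records"),
            "manifest-list": snap["manifest-list"],
        })
    return rows


# Trimmed stand-in for a metadata.json file; the paths are placeholders.
sample = json.loads("""
{
  "snapshots": [ {
    "sequence-number": 21,
    "snapshot-id": 9051391979230492032,
    "timestamp-ms": 1721715728610,
    "summary": { "operation": "append", "total-records": "899" },
    "manifest-list": "hdfs://example/metadata/snap-9051391979230492032.avro"
  } ],
  "metadata-log": [ {
    "timestamp-ms": 1721715728760,
    "metadata-file": "hdfs://example/metadata/00044.metadata.json"
  } ]
}
""")

rows = summarize_snapshots(sample)
previous_files = [entry["metadata-file"] for entry in sample["metadata-log"]]
print(rows)
print(previous_files)
```

To run it against a real file, replace the inline sample with `json.load` on an opened metadata.json.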

4.3. How Iceberg Looks Up Snapshots

In Iceberg's core module, TableMetadataParser.java is implemented as follows.

...
public class TableMetadataParser {

  public enum Codec {
    NONE(""),
    GZIP(".gz");

    private final String extension;

    Codec(String extension) {
      this.extension = extension;
    }

    public static Codec fromName(String codecName) {
      Preconditions.checkArgument(codecName != null, "Codec name is null");
      try {
        return Codec.valueOf(codecName.toUpperCase(Locale.ENGLISH));
      } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException(String.format("Invalid codec name: %s", codecName), e);
      }
    }

    public static Codec fromFileName(String fileName) {
      Preconditions.checkArgument(
          fileName.contains(".metadata.json"), "%s is not a valid metadata file", fileName);
      // we have to be backward-compatible with .metadata.json.gz files
      if (fileName.endsWith(".metadata.json.gz")) {
        return Codec.GZIP;
      }
      String fileNameWithoutSuffix = fileName.substring(0, fileName.lastIndexOf(".metadata.json"));
      if (fileNameWithoutSuffix.endsWith(Codec.GZIP.extension)) {
        return Codec.GZIP;
      } else {
        return Codec.NONE;
      }
    }
  }
 
 ...

As expected, this shows that Iceberg obtains the table metadata by reading a metadata.json file.

Reference: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L68C25-L68C37

Reference: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L276
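
For readers who find Python easier to follow, the file-name check in `Codec.fromFileName` above can be rendered roughly as below. This is a hand-written re-expression of the quoted Java for illustration, not code from the Iceberg project, and the function name is hypothetical.

```python
def codec_from_file_name(file_name: str) -> str:
    """Mirror the quoted Codec.fromFileName logic; returns "GZIP" or "NONE"."""
    if ".metadata.json" not in file_name:
        raise ValueError(f"{file_name} is not a valid metadata file")
    # Backward compatibility: older files end with .metadata.json.gz
    if file_name.endswith(".metadata.json.gz"):
        return "GZIP"
    # Otherwise the codec extension sits just before the .metadata.json suffix.
    without_suffix = file_name[: file_name.rfind(".metadata.json")]
    return "GZIP" if without_suffix.endswith(".gz") else "NONE"


print(codec_from_file_name("00001-739c7fb2-056d-4166-a7b7-c3a77a9ceb87.metadata.json"))
```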

Next, SnapshotParser.java, which handles the snapshot entries inside metadata.json, shows how snapshot data is read and written.

...
public class SnapshotParser {

  private SnapshotParser() {}

  /** A dummy {@link FileIO} implementation that is only used to retrieve the path */
  private static final DummyFileIO DUMMY_FILE_IO = new DummyFileIO();
  private static final String SEQUENCE_NUMBER = "sequence-number";
  private static final String SNAPSHOT_ID = "snapshot-id";
  private static final String PARENT_SNAPSHOT_ID = "parent-snapshot-id";
  private static final String TIMESTAMP_MS = "timestamp-ms";
  private static final String SUMMARY = "summary";
  private static final String OPERATION = "operation";
  private static final String MANIFESTS = "manifests";
  private static final String MANIFEST_LIST = "manifest-list";
  private static final String SCHEMA_ID = "schema-id";

  static void toJson(Snapshot snapshot, JsonGenerator generator) throws IOException {
    generator.writeStartObject();
    if (snapshot.sequenceNumber() > TableMetadata.INITIAL_SEQUENCE_NUMBER) {
      generator.writeNumberField(SEQUENCE_NUMBER, snapshot.sequenceNumber());
    }
    generator.writeNumberField(SNAPSHOT_ID, snapshot.snapshotId());
    if (snapshot.parentId() != null) {
      generator.writeNumberField(PARENT_SNAPSHOT_ID, snapshot.parentId());
    }
    generator.writeNumberField(TIMESTAMP_MS, snapshot.timestampMillis());

    // if there is an operation, write the summary map
    if (snapshot.operation() != null) {
      generator.writeObjectFieldStart(SUMMARY);
      generator.writeStringField(OPERATION, snapshot.operation());
      if (snapshot.summary() != null) {
        for (Map.Entry<String, String> entry : snapshot.summary().entrySet()) {
          // only write operation once
          if (OPERATION.equals(entry.getKey())) {
            continue;
          }
          generator.writeStringField(entry.getKey(), entry.getValue());
        }
      }
      generator.writeEndObject();
    }

...

Reference: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L35
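
The part of `toJson` quoted above explains why some keys appear in a snapshot entry only under certain conditions. The Python sketch below re-expresses just that quoted portion. It stops where the excerpt stops, so manifest-list and schema-id are not written. The function name is hypothetical, and the value 0 for `INITIAL_SEQUENCE_NUMBER` is an assumption about `TableMetadata.INITIAL_SEQUENCE_NUMBER`.

```python
import json
from typing import Optional

INITIAL_SEQUENCE_NUMBER = 0  # assumed value of TableMetadata.INITIAL_SEQUENCE_NUMBER


def snapshot_to_dict(
    snapshot_id: int,
    timestamp_ms: int,
    sequence_number: int = 0,
    parent_id: Optional[int] = None,
    operation: Optional[str] = None,
    summary: Optional[dict] = None,
) -> dict:
    """Build the snapshot entry in the same field order as the quoted toJson."""
    out = {}
    # sequence-number is written only when it is past the initial value.
    if sequence_number > INITIAL_SEQUENCE_NUMBER:
        out["sequence-number"] = sequence_number
    out["snapshot-id"] = snapshot_id
    # The first snapshot of a table has no parent, so the key is omitted.
    if parent_id is not None:
        out["parent-snapshot-id"] = parent_id
    out["timestamp-ms"] = timestamp_ms
    # The summary map is written only when an operation is present.
    if operation is not None:
        fields = {"operation": operation}
        for key, value in (summary or {}).items():
            if key == "operation":
                continue  # only write operation once
            fields[key] = value
        out["summary"] = fields
    return out


entry = snapshot_to_dict(
    9051391979230492032, 1721715728610, 21, 7997487690611892323,
    "append", {"operation": "append", "added-records": "1"},
)
print(json.dumps(entry, indent=2))
```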



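5. Managing Snapshots


Snapshots let you return the data to its state before the latest changes, so they are essential for correcting data and guarding against mistakes.

However, keeping too many snapshots can cause the small-files problem on HDFS. And because each snapshot preserves not only metadata but also the data files that make up that table state, data files that are no longer part of the current table cannot be removed while an old snapshot still references them.

For that reason, snapshots need to be cleaned up periodically.

Reference: Deleting Metadata & Snapshots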
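
To illustrate what periodic cleanup means, here is a simplified Python sketch of a retention rule: expire snapshots older than a maximum age while always keeping the newest few. The function and parameter names are hypothetical. Real expiration is more involved (it also has to respect branches, tags, and which files are still referenced), so this only shows the selection idea.

```python
def snapshots_to_expire(
    snapshots: list[dict], now_ms: int, max_age_ms: int, min_to_keep: int = 1
) -> list[int]:
    """Return IDs of snapshots older than max_age_ms, keeping the newest min_to_keep."""
    newest_first = sorted(snapshots, key=lambda s: s["timestamp-ms"], reverse=True)
    expired = []
    for position, snap in enumerate(newest_first):
        if position < min_to_keep:
            continue  # always retain the most recent snapshots
        if now_ms - snap["timestamp-ms"] > max_age_ms:
            expired.append(snap["snapshot-id"])
    return expired


# Hypothetical snapshots; timestamps are small numbers for readability.
snaps = [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 4000},
    {"snapshot-id": 3, "timestamp-ms": 9000},
]
print(snapshots_to_expire(snaps, now_ms=10000, max_age_ms=5000))
```

In practice the actual deletion should be left to the engine's own maintenance command (Trino, for instance, provides an expire_snapshots table procedure) rather than removing files by hand.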



6. Closing Remarks


Making good use of the snapshot and time travel features that Iceberg provides should make data management considerably more efficient.


