챗 봇에게 코딩시키기

이두원·2023년 4월 20일

초보 개발자인 나에게 코드를 짜는 것은 무에서 유를 창조하는 만큼 어려운 일이다.
하지만 챗봇의 성능이 이 정도로 발전했는데, 챗봇만으로 코딩을 해도 괜찮게 뽑지 않을까?

목표 정하기
목표는 Yolo모델을 사용해 주어진 이미지를 인식하고, 인식한 결과를 챗봇에게 보내서 이미지에 대해 설명하도록 하는 코드를 짜는 것이다. 예를 들면
위의 사진을 Yolo를 통해 인식하면 person(0.99), umbrella(0.96)...같은 결과가 나올 것이다.
이 결과를 챗봇에게 보내주고, 이것으로 이루어진 이미지를 설명하라고 하고, 이 해변가에 대해 묘사하는 결과가 나오는 것이 최종 목표가 된다.
어떤 챗 봇을 쓸까?
현재 챗 봇은 openai에서 제작한 Chat gpt시리즈와 Bing Chat등이 있는데, Bing Chat은 gpt4를 토대로 만들어 졌다고 하니 사실상 gpt4가 가장 강력한 녀석으로 보인다. 현재 gpt3.5는 무료이고 gpt4는 월 20달러를 구독해야 사용가능하다. 치킨1마리의 가치를 하기를 기도하며 gpt4를 결제했다.
간보기
일단 Yolo모델에 대해 물어보자. 얼마나 잘 알고 있을까?
다 좋은데, 현재 Yolo는 v8까지 나왔다. 최신 데이터는 반영이 안된건가?

그렇다. 이 친구의 기억은 2021년 9월에 멈춰있었다. 이 친구로 코딩을 하려면 v5로 진행해야겠다.

4.코딩

꽤나 복잡한 요구사항이다. 2021년 9월 이후 기억상실에 걸린 gpt4는 잘 해결할 수 있을까?

GPT-4와 YOLO 모델을 함께 사용하여 이미지를 인식하고 인식한 결과를 자연어로 해설하는 코드를 작성하는 것은 흥미로운 아이디어입니다. 아래는 Python을 사용하여 이 작업을 수행하는 예제 코드입니다.

이 예제에서는 PyTorch 및 torchvision 라이브러리를 사용하여 YOLO 모델을 구현하고, OpenAI API를 사용하여 GPT-4를 호출합니다. 먼저 필요한 라이브러리를 설치해야 합니다.

pip install torch torchvision openai

다음으로, 코드를 작성합니다.

import io
import os
import openai
import requests
import torch
import torchvision.transforms as T

from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Replace with your OpenAI API key
openai.api_key = "your_openai_api_key"

# Load pre-trained YOLO model (using Faster R-CNN with ResNet-50 backbone as an example)
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Download an image from the internet
image_url = "https://example.com/your_image.jpg"
response = requests.get(image_url)
image_data = response.content

# Open the image and convert it to a tensor
image = Image.open(io.BytesIO(image_data)).convert("RGB")
transform = T.Compose([T.ToTensor()])
image_tensor = transform(image)

# Detect objects in the image
with torch.no_grad():
    predictions = model([image_tensor])

# Extract object names and their confidence scores
object_labels = predictions[0]["labels"].tolist()
object_scores = predictions[0]["scores"].tolist()

# Generate a natural language description of the objects detected
detected_objects = ", ".join([f"{score:.1%} {label}" for score, label in zip(object_scores, object_labels)])
prompt = f"Describe an image containing the following objects: {detected_objects}"
response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=50, n=1, stop=None, temperature=0.5)

# Print the generated description
print(response.choices[0].text.strip())

이 코드는 다음을 수행합니다:

이미지를 다운로드하고 변환하여 텐서로 만듭니다.
YOLO 모델 (여기서는 예시로 Faster R-CNN with ResNet-50 backbone을 사용)을 사용하여 이미지에서 객체를 검출합니다.
검출된 객체와 그에 대한 신뢰도 점수를 추출합니다.
검출된 객체 목록을 사용하여 GPT-4에게 이미지에 대한 자연어 설명을 생성하도록 요청합니다.
생성된 설명을 출력합니다.

이 예제는 기본적인 개념을 보여주지만, 실제로 GPT-4와 YOLO를 함께 사용하려면 세부적인 조정이 필요할 수 있습니다. 예를 들어, 사용하는 YOLO 모델에 따라 코드를 수정해야 할 수 있으며, 객체 인식의 정확도와 자연

마지막에 말을 하다 끊긴 했지만 매우 인상적인 답변이 나왔다. 단순히 코드만 던져준게 아니라 친절히 주석에 작동 방식까지 넣어 둔 것이 마치 교과서에 나온 코딩 예제같다. 바로 실행해보자.
결과값이 이상하다. 이유를 물어보자.

GPT-3의 결과값이 원치 않은 형태로 나오는 것은 주로 프롬프트 구성이나 출력 처리 방식 때문입니다. 코드에서 GPT-3 프롬프트는 객체 인식 결과를 바탕으로 작성되며, 객체 인식 결과에서 객체 이름과 확률값이 정확하게 변환되지 않은 것 같습니다.

먼저, 객체 이름과 확률값이 올바르게 변환되도록 코드를 수정해야 합니다. 수정된 코드를 참고하세요:

# Convert class indices to class names and format them with their confidence scores
detected_objects = ", ".join([f"{COCO_INSTANCE_CATEGORY_NAMES[label]} ({score:.1%})" for label, score in zip(object_labels, object_scores)])

# Generate a natural language description of the objects detected
prompt = f"Describe an image containing the following objects: {detected_objects}"
# ... (이하 생략)

위 코드는 객체 이름을 COCO_INSTANCE_CATEGORY_NAMES에서 찾아 확률값과 함께 출력하도록 수정했습니다. 이렇게 하면 GPT-3의 결과값도 더 정확하게 나올 것입니다. 객체 인식 결과가 올바르게 출력되도록 수정한 후에 GPT-3 프롬프트를 다시 확인하십시오.

결과값이 나오긴 하는데 해변가와는 전혀 상관없는 The image contains a pie chart with various percentages listed 라는 결과가 나온다. 이미지에 다양한 백분율이 나열된 원형 차트가 포함되어 있습니다 라는 것인데, 위의 5.5% 28...같은 데이터가 그대로 gpt3에 입력되서 그런게 아닐까? 한번 물어보자.

AI에게 인정받은것 같아 묘하게 기분이 좋다. 문제의 원인을 알았으니 해결책을 물어보자.

GPT-3에게 더 좋은 결과를 얻기 위해서는 라벨의 실제 객체 이름과 신뢰도를 자연어 형태로 입력해야 합니다. 이렇게 하면 GPT-3가 이해하기 쉽고, 이미지에 대한 더 정확한 설명을 생성할 수 있습니다.

예를 들어, 입력 프롬프트를 다음과 같이 구성할 수 있습니다:

detected_objects = ", ".join([f"{object_name} ({score:.1%} confidence)" for object_name, score in zip(object_names, object_scores)])
prompt = f"Describe an image containing the following objects: {detected_objects}"

이렇게 하면 GPT-3에게 전달되는 입력 프롬프트는 다음과 같이 됩니다:

Describe an image containing the following objects: person (99.8% confidence), cup (61.6% confidence), oven (59.0% confidence), ...

이러한 형태의 입력은 GPT-3가 이미지에 대한 더 정확한 설명을 생성하는 데 도움이 됩니다. 객체 라벨을 실제 객체 이름으로 변환하려면, 해당 라벨과 관련된 객체 이름을 포함하는 데이터 구조를 사용할 수 있습니다. 예를 들어, COCO 데이터셋의 경우, 객체 라벨과 이름을 매핑하는 딕셔너리를 사용할 수 있습니다.

예시를 보니 확실히 알 것 같다. gpt3에게는 person (99.8% confidence), cup (61.6% confidence)이런 입력을 집어 놓고 이걸 묘사하라고 시켰으니, 당연히 데이터 차트라고 생각하고 The image contains a pie chart with various percentages listed 라는 결과를 냈을 것이다. gpt3는 죄가 없었다.

이 문제를 해결하기 위해 입력 프롬프트를 수정하여 GPT-3에게 이미지에 대한 명확한 컨텍스트를 제공하는 것이 중요합니다. GPT-3에게 이미지의 장면에 대한 정보를 제공하고, 이미지에서 발견된 객체에 대해 설명하도록 요청할 수 있습니다.

예를 들어, 입력 프롬프트를 다음과 같이 구성할 수 있습니다:

detected_objects = ", ".join([f"{object_name} ({score:.1%} confidence)" for object_name, score in zip(object_names, object_scores)])
prompt = f"Describe a scene in an image containing the following objects: {detected_objects}. Please provide a narrative of what might be happening in the image."

이렇게 하면 GPT-3에게 전달되는 입력 프롬프트는 다음과 같이 됩니다:

Describe a scene in an image containing the following objects: person (99.8% confidence), cup (61.6% confidence), oven (59.0% confidence), ... Please provide a narrative of what might be happening in the image.

이렇게 수정한 프롬프트는 GPT-3에게 이미지에서 발견된 객체들이 어떻게 상호 작용하고 있는지 설명하도록 요청하며, 이를 통해 이미지에 대한 더 정확한 설명을 생성할 수 있습니다.

어...저대로 했지만 결과가 바뀐것 같지는 않다. 좀 더 좋은 방법을 물어보자.
좀 더 보기 좋게 정리한거 같지만, 정작 입력되는 데이터는 변한 것이 없다. gpt3에게 무엇이 있는지 뿐만 아니라, 그게 이미지에서 어디에 있는지 좌표까지 보낸다면 제대로 인식할 수 있지 않을까?

네, 이미지 인식의 결과로 얻은 객체 좌표와 정확도를 GPT-3에게 전달할 수 있습니다. 객체의 좌표는 bounding box의 형태로 표현되며, 일반적으로 (xmin, ymin, xmax, ymax)의 형태로 표현됩니다. 이 정보를 포함한 프롬프트를 작성하여 GPT-3에게 전달하면 됩니다.

예를 들어, 다음과 같은 프롬프트를 사용할 수 있습니다:

이미지에는 여러 객체가 포함되어 있습니다. 한 명의 사람이 99.8%의 확신으로 보이며, 위치는 (xmin1, ymin1, xmax1, ymax1)입니다. 컵이 61.6%의 확신으로 나타나고, 위치는 (xmin2, ymin2, xmax2, ymax2)입니다. 오븐이 59.0%의 확신으로 인식되었으며, 위치는 (xmin3, ymin3, xmax3, ymax3)입니다. 이미지를 자세하게 설명해주세요.

프롬프트에 실제 좌표 값을 채워 넣어 GPT-3에게 전달하면, 이미지 인식 결과에 대한 좌표와 정확도 정보를 모두 고려한 자세한 설명을 얻을 수 있습니다.

bounding box의 좌표를 같이 보낼 수 있다고 한다. 하지만 난 그걸 할 줄 모르는걸? 해줘.

먼저, 객체의 좌표값인 bounding box를 추출하고 프롬프트를 생성해야 합니다. 이를 위해 아래 코드를 수정했습니다:

import io
import os
import openai
import requests
import torch
import torchvision.transforms as T

from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Replace with your OpenAI API key
openai.api_key = "your_OpenAI_API"

# Load pre-trained YOLO model (using Faster R-CNN with ResNet-50 backbone as an example)
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Download an image from the internet
image_url = "https://cdn3.incapto.com/wp-content/uploads/2020/06/Barista.jpg"
response = requests.get(image_url)
image_data = response.content

# Open the image and convert it to a tensor
image = Image.open(io.BytesIO(image_data)).convert("RGB")
transform = T.Compose([T.ToTensor()])
image_tensor = transform(image)

# Detect objects in the image
with torch.no_grad():
    predictions = model([image_tensor])

# Extract object names and their confidence scores
object_labels = predictions[0]["labels"].tolist()
object_scores = predictions[0]["scores"].tolist()
object_boxes = predictions[0]["boxes"].tolist()

# Generate a natural language description of the objects detected
object_descriptions = []
for label, score, box in zip(object_labels, object_scores, object_boxes):
    object_descriptions.append(f"{label} with {score:.1%} confidence at ({box[0]:.1f}, {box[1]:.1f}, {box[2]:.1f}, {box[3]:.1f})")

detected_objects = ", ".join(object_descriptions)
prompt = f"Describe a scene in an image containing the following objects: {detected_objects}. Please provide a narrative of what might be happening in the image."
response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=50, n=1, stop=None, temperature=0.5)

# Print the generated description
print(response.choices[0].text.strip())
print(detected_objects)

이 코드는 이미지에서 발견된 객체의 라벨, 정확도 및 바운딩 박스 좌표를 포함하는 프롬프트를 생성합니다. 이렇게 생성된 프롬프트를 GPT-3에게 전달하면, 이미지 인식 결과에 대한 좌표와 정확도 정보를 모두 고려한 자세한 설명을 얻을 수 있습니다.

정말 되는건가? 두근두근한 마음으로 실행하지만 에러가 뜬다.
다행히도 좌표정보를 추가하면서 gpt3에 입력된 값이 너무 길어져서 발생한 문제였다. 그걸 상위 5개로 줄여서 길이를 줄이는 해결방안을 제시했는데, gpt3입장에서는 분석할 데이터가 줄었으니 더 정확하게 답하지 않을까? 기대해보자.

사진 속에는 세 명의 사람들이 있는데, 그들 중 두 명은 우산 아래 서로 옆에 서 있습니다. 그
3번째 사람은 우산으로부터 몇 피트 떨어져 있고 비로부터 보호받지 못하고 있습니다.

파라솔을 우산으로 인식했기에 비오는 날이라고 추측하고 결과를 낸 것같다. 그래도 위 사진에서 두 명은 같이 있고 한 명은 따로 있는 것을 인식한 것을 보면, 좌표데이터를 기반으로 물체가 어디에 있는지 확실히 인식했다.

5.결론

AI가 얼마나 코딩을 잘하나 궁금해서 해보았는데, 앞으로 신세 많이 지게될것 같은 수준의 성능이었다. 위의 코드에서는 gpt3와 연동해 결과를 냈지만, gpt4 에서는 이미지 입출력을 지원할 수도 있다는 말을 보면 정말로 AI가 사물을 보고 인식하고 그 상황을 이해하는 단계에 도달할 것같다.
그렇게 된다면 AI는 이 세상을 보고 어떻게 이해하게 될까? 사람이라는 생물체가 지배하는 태양계의 작은 행성? 아니면 미세한 우주의 일부분? 그것도 아니면 존재 할 가치가 없는 것? 영화에서 보던 로봇과의 전쟁이 잠깐 생각났지만, 인간도 바보는 아니니까 알아서 잘 하겠지라는 생각을 하며 챗 봇을 종료했다.
마지막으로 최종결과를 뽑은 코드를 붙여놓고 글을 마치겠다.이 코드는 openai의 api키를 발급받고 사용할 수 있다. 아래 코드는 gpt3기반 코드이니 무료로 사용가능하다. 다들 이 유용한 AI를 한번 쯤 써봤으면 좋을 것 같다.

import io
import os
import openai
import requests
import torch
import torchvision.transforms as T

from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Replace with your OpenAI API key
openai.api_key = "your_key"

# Load pre-trained YOLO model (using Faster R-CNN with ResNet-50 backbone as an example)
model = fasterrcnn_resnet50_fpn(weights=True)
model.eval()

# Download an image from the internet
image_url = "https://www.yeosu.go.kr/tour/build/images/p132/p1322008/p1322008678450-1.jpg/666x1x70/666x1_p1322008678450-1.jpg"
response = requests.get(image_url)
image_data = response.content

# Open the image and convert it to a tensor
image = Image.open(io.BytesIO(image_data)).convert("RGB")
transform = T.Compose([T.ToTensor()])
image_tensor = transform(image)

# Detect objects in the image
with torch.no_grad():
    predictions = model([image_tensor])

COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Extract object names and their confidence scores

object_labels = predictions[0]["labels"].tolist()
object_scores = predictions[0]["scores"].tolist()
object_boxes = predictions[0]["boxes"].tolist()
object_names = [COCO_INSTANCE_CATEGORY_NAMES[label] for label in object_labels]

top_n = 5
object_labels = object_labels[:top_n]
object_scores = object_scores[:top_n]
object_boxes = object_boxes[:top_n]


# Generate a natural language description of the objects detected
object_descriptions = []
for label, score, box in zip(object_labels, object_scores, object_boxes):
    object_descriptions.append(f"{label} with {score:.1%} confidence at ({box[0]:.1f}, {box[1]:.1f}, {box[2]:.1f}, {box[3]:.1f})")

detected_objects = ", ".join([f"{COCO_INSTANCE_CATEGORY_NAMES[label]} with {score:.1%} confidence at {tuple(box)}" for label, score, box in zip(object_labels, object_scores, object_boxes)])
prompt = f"Describe a scene in an image containing the following objects: {detected_objects}. Please provide a narrative of what might be happening in the image."
response = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=50, n=1, stop=None, temperature=0.5)

# Print the generated description
print(response.choices[0].text.strip())
print(detected_objects)

이두원

AI 초보 개발자

다음 포스트

챗 봇에게 코딩시키기

Auto-gpt 맛보기

0개의 댓글