The foundations of agentic LLM execution code

2026 · 05 · 04 · 5 min read

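1. The basis of everything: a while loop on a stop signal

The core is simple. The model is not a text generator but a function that proposes the next action. Your code takes the proposal, executes it, appends the result to the context, and calls the model again. Repeat until done.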

while True:
    response = model.generate(messages, tools=tools)
    if response.has_tool_calls():
        results = execute_tools(response.tool_calls)
        messages.append(response)              # assistant turn
        messages.append(results)               # tool results
    else:
        return response.text

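That's all there is. According to an analysis of Claude Code 2.1, only 1.6% of the codebase is AI decision logic; the other 98.4% is deterministic infrastructure: permission gates, context management, tool routing, recovery. The loop itself is a while statement, and the real engineering lives around it.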

The loop ends when the model signals that it has no more tool calls. Only the way that signal is expressed differs by provider.
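To make the loop concrete without any provider SDK, here is a minimal runnable sketch. Everything in it is hypothetical scaffolding, not a real API: `ScriptedModel` replays canned responses in place of a model, and `ToolCall`, `Response`, and the message dict shapes are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    id: str
    name: str
    args: dict


@dataclass
class Response:
    text: str = ""
    tool_calls: list = field(default_factory=list)


class ScriptedModel:
    """Hypothetical stand-in for a real model: replays canned responses in order."""

    def __init__(self, script):
        self._script = list(script)

    def generate(self, messages, tools):
        return self._script.pop(0)


def run_agent(model, tools, user_prompt):
    """Call the model until it stops asking for tools; return final text and transcript."""
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = model.generate(messages, tools=tools)
        if not response.tool_calls:
            return response.text, messages  # stop signal: no tool calls
        messages.append({"role": "assistant", "tool_calls": response.tool_calls})
        for call in response.tool_calls:
            result = tools[call.name](**call.args)
            messages.append({"role": "tool", "call_id": call.id, "content": str(result)})


tools = {"add": lambda a, b: a + b}
model = ScriptedModel([
    Response(tool_calls=[ToolCall("call_1", "add", {"a": 2, "b": 3})]),
    Response(text="The sum is 5."),
])
final_text, transcript = run_agent(model, tools, "What is 2 + 3?")
print(final_text)
```

The second scripted response carries no tool calls, which is the stop signal that ends the loop.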

2. Termination signals by provider (this is the core)

Claude (Messages API)

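Branch on the stop_reason field:

"tool_use" → continue (run the tools, then call again)
"end_turn" → normal termination
"max_tokens" → hit the token limit. Usually treated as an error
"stop_sequence" → hit a user-defined stop sequence
"refusal" → safety refusal
"pause_turn" → a server tool (web_search, etc.) hit its internal iteration cap. Send the response back as-is and it continues where it left off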

while response.stop_reason == "tool_use":
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    tool_results = [
        {"type": "tool_result", "tool_use_id": tu.id, "content": run(tu)}
        for tu in tool_uses
    ]
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(model=..., tools=tools, messages=messages)

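Important details: put the entire response.content into the assistant turn. Text blocks and tool_use blocks can arrive mixed together, and all of them must be preserved. Each tool_result's tool_use_id must match its tool_use. And tool_result goes in under the user role (a Claude quirk).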
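The loop above only covers "tool_use". A production loop has to decide what to do with every other stop_reason listed earlier. One way to keep that decision explicit is a small dispatch function. The action labels here ("run_tools", "resend", "finish", "abort") are names made up for this sketch, and treating "max_tokens" and "refusal" as aborts is a policy choice rather than anything the API requires.

```python
def next_action(stop_reason: str) -> str:
    """Map a Messages API stop_reason to what the loop should do next."""
    if stop_reason == "tool_use":
        return "run_tools"  # execute client tools, append results, call again
    if stop_reason == "pause_turn":
        return "resend"     # append the assistant content as-is and call again
    if stop_reason in ("end_turn", "stop_sequence"):
        return "finish"     # return the text to the caller
    if stop_reason in ("max_tokens", "refusal"):
        return "abort"      # surface as an error instead of looping
    # Fail loudly on values this code has never seen rather than silently finishing.
    raise ValueError(f"unhandled stop_reason: {stop_reason!r}")


for reason in ("tool_use", "pause_turn", "end_turn", "max_tokens"):
    print(reason, "->", next_action(reason))
```

Raising on unknown values means a stop_reason added by the provider later shows up as an error instead of as a silently truncated run.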

OpenAI (Responses API, the 2025~2026 standard)

Chat Completions is legacy. The Responses API is the current standard, and as of 2026 OpenAI is moving toward building the agentic execution loop into it (shell tool, container workspace, built-in compaction).

The key change: Chat Completions takes a messages array, while Responses uses an input + items model. message, function_call, and function_call_output are each separate items. You can also chain turns with previous_response_id, so you don't have to manage conversation state yourself (store: true).

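import json  # needed below for json.loads / json.dumps

from openai import OpenAI
client = OpenAI()

# Define tools once so the follow-up calls in the loop can reuse the same list
tools = [{"type": "function", "name": "...", ...}]

response = client.responses.create(
    model="gpt-5.4",
    input=[{"role": "user", "content": "..."}],
    tools=tools,
)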

while True:
    function_calls = [item for item in response.output if item.type == "function_call"]
    if not function_calls:
        break
    
    tool_outputs = []
    for call in function_calls:
        result = execute(call.name, json.loads(call.arguments))
        tool_outputs.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": json.dumps(result),
        })
    
    response = client.responses.create(
        model="gpt-5.4",
        previous_response_id=response.id,   # state chain
        input=tool_outputs,
        tools=tools,
    )

print(response.output_text)

Termination is simple: when output contains no more function_call items, you're done. Thanks to previous_response_id, there is no need to accumulate a messages array.
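The item-filtering step can be tested without calling the API. The sketch below pulls it into a helper and feeds it fake output items built with `SimpleNamespace`. The helper name `build_function_outputs`, the `registry` dict, and the `get_weather` tool are all hypothetical. Only the item fields (`type`, `name`, `call_id`, `arguments`) mirror the loop above.

```python
import json
from types import SimpleNamespace


def build_function_outputs(output_items, registry):
    """Run every function_call item and return the matching function_call_output items."""
    outputs = []
    for item in output_items:
        if item.type != "function_call":
            continue  # message and other item types are not ours to execute
        try:
            result = registry[item.name](**json.loads(item.arguments))
        except Exception as exc:  # report failures to the model instead of crashing
            result = {"error": str(exc)}
        outputs.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(result),
        })
    return outputs


# Fake response.output: one message item and one function_call item.
fake_output = [
    SimpleNamespace(type="message"),
    SimpleNamespace(type="function_call", name="get_weather",
                    call_id="call_1", arguments='{"city": "Seoul"}'),
]
registry = {"get_weather": lambda city: {"city": city, "temp_c": 21}}
outputs = build_function_outputs(fake_output, registry)
print(outputs)
```

An empty return value is the termination condition: no function_call items means nothing to send back.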

Gemini (google-genai SDK)

The simplest form. If response.candidates[0].content.parts contains a function_call, continue; if not, finish with the text.

from google import genai
from google.genai import types

client = genai.Client()
contents = [types.Content(role="user", parts=[types.Part(text=prompt)])]

while True:
    response = client.models.generate_content(
        model="gemini-3-pro",
        contents=contents,
        config=types.GenerateContentConfig(tools=[tools]),
    )
    parts = response.candidates[0].content.parts
    fn_calls = [p.function_call for p in parts if p.function_call]
    
    if not fn_calls:
        break
    
    contents.append(response.candidates[0].content)  # model turn
    
    fn_responses = []
    for fc in fn_calls:
        result = execute(fc.name, dict(fc.args))
        fn_responses.append(types.Part(function_response=types.FunctionResponse(
            name=fc.name, response={"result": result}
        )))
    contents.append(types.Content(role="user", parts=fn_responses))

print(response.text)

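One thing to watch with Gemini 3: the official recommendation is to leave temperature at its default (1.0). Dropping it to 0.0 can send tool calls into an infinite loop or degrade reasoning. The old GPT-4-era heuristic of "temp 0 for function calls" does not carry over to Gemini 3.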

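The SDK also has automatic function calling built in: pass Python functions (or an MCP ClientSession) directly in tools and it runs the loop for you. When you need explicit control, turn it off in the config with automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True).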

3. Parallel tool calls: now the default

All three providers can return multiple tool calls in a single turn. This is default behavior, not an opt-in, and a loop that assumes one call per turn will break.

# Claude: handle every tool_use and send all tool_results back bundled in a single user message
tool_uses = [b for b in response.content if b.type == "tool_use"]
results = await asyncio.gather(*[run_async(tu) for tu in tool_uses])
messages.append({
    "role": "user",
    "content": [{"type": "tool_result", "tool_use_id": tu.id, "content": r} 
                for tu, r in zip(tool_uses, results)]
})

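The most common bug: if you format parallel tool results incorrectly (for example, splitting each result into its own user turn), Claude loses the pattern it was trained on and stops making parallel calls from the next turn on. All tool_results must go in one message. The same holds for OpenAI: bundle the function_call_output items into one input array.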

To disable: for Claude, set disable_parallel_tool_use=True inside tool_choice; for OpenAI, set parallel_tool_calls=False.
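A runnable sketch of the bundling rule follows, with one failing tool to show that a single failure should not lose the other results. `run_tool` is a hypothetical async runner, and the tool_use blocks are plain dicts here rather than SDK objects. The tool_result fields (`tool_use_id`, `content`, `is_error`) follow the Claude format used above.

```python
import asyncio


async def run_tool(name, args):
    """Hypothetical async tool runner; the sleep stands in for real I/O."""
    await asyncio.sleep(0.01)
    if name == "fail":
        raise RuntimeError("boom")
    return f"{name}:{args}"


async def gather_tool_results(tool_uses):
    """Run all tool calls concurrently and bundle every result into ONE user message."""
    outcomes = await asyncio.gather(
        *(run_tool(tu["name"], tu["input"]) for tu in tool_uses),
        return_exceptions=True,  # keep successful results even if a sibling call fails
    )
    blocks = []
    for tu, outcome in zip(tool_uses, outcomes):  # gather preserves input order
        block = {"type": "tool_result", "tool_use_id": tu["id"]}
        if isinstance(outcome, Exception):
            block["content"] = f"Error: {outcome}"
            block["is_error"] = True
        else:
            block["content"] = outcome
        blocks.append(block)
    return {"role": "user", "content": blocks}


tool_uses = [
    {"id": "toolu_1", "name": "search", "input": {"q": "a"}},
    {"id": "toolu_2", "name": "fail", "input": {}},
]
message = asyncio.run(gather_tool_results(tool_uses))
print(message)
```

Because `asyncio.gather` returns results in the order the awaitables were passed, zipping against `tool_uses` keeps each `tool_use_id` matched to its own result.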

4. Guardrails you must have

Things you cannot leave out in production. The loops above are the educational version; in practice you also need:

Step limit. A confused model will call tools forever. Set a hard cap, something like MAX_STEPS = 25.

for step in range(MAX_STEPS):
    response = ...
    if no_tool_calls: break
else:
    raise AgentExceededMaxSteps()

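Argument validation. Models ignore the schema and hallucinate: an int where a string belongs, a missing required field. Always validate at the tool execution entry point. On failure, don't throw; return the error message as the tool_result so the model can see it and self-correct.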

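def execute(name, args):
    schema = TOOL_SCHEMAS[name]  # bind once so the error message below can reference it
    try:
        validated = schema.validate(args)
        return TOOL_FNS[name](**validated)
    except ValidationError as e:
        # Return the error to the model instead of raising, so it can self-correct
        return {"error": f"Invalid args: {e}. Expected schema: {schema}"}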

Claude's tool_result has an is_error: true flag; setting it makes the model recognize the result unambiguously as an error.
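Putting validation and is_error together might look like the sketch below. The `spec` format (field name mapped to a `(type, required)` pair) is a deliberately tiny stand-in invented for this example; a real system would more likely validate against the tool's JSON Schema with a proper validator library. `to_tool_result` and the lambda tool are hypothetical as well.

```python
def validate_args(args: dict, spec: dict) -> list:
    """Return a list of problems; an empty list means the arguments are valid."""
    problems = []
    for name, (expected_type, required) in spec.items():
        if name not in args:
            if required:
                problems.append(f"missing required field '{name}'")
            continue
        if not isinstance(args[name], expected_type):
            problems.append(
                f"'{name}' should be {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for extra in sorted(args.keys() - spec.keys()):
        problems.append(f"unexpected field '{extra}'")
    return problems


def to_tool_result(tool_use_id, args, spec, fn):
    """Validate first; on failure return an is_error tool_result instead of raising."""
    problems = validate_args(args, spec)
    if problems:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": "Invalid arguments: " + "; ".join(problems),
            "is_error": True,
        }
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": str(fn(**args))}


spec = {"city": (str, True), "days": (int, False)}
forecast = lambda city, days=1: f"{city}/{days}"
bad = to_tool_result("toolu_1", {"city": 42}, spec, forecast)
good = to_tool_result("toolu_2", {"city": "Seoul"}, spec, forecast)
print(bad["content"])
print(good["content"])
```

The error text names the offending field and the expected type, which gives the model something specific to correct on the next turn.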

Cost/budget cap. Track cumulative tokens and cumulative USD. Abort past a threshold. A runaway agent can burn hundreds of dollars an hour before you notice.
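A budget cap can be as small as one object that is charged after every model call. In this sketch the `Budget` class and `BudgetExceeded` exception are hypothetical names, and the per-token prices are placeholders, not real pricing for any model. The token counts would come from the usage data each provider returns with a response.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend crosses the configured cap."""


@dataclass
class Budget:
    max_usd: float
    usd_per_input_token: float   # placeholder price, not a real rate
    usd_per_output_token: float  # placeholder price, not a real rate
    spent_usd: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call's usage; raise once the cap is exceeded."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.spent_usd += (input_tokens * self.usd_per_input_token
                           + output_tokens * self.usd_per_output_token)
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f}, cap is ${self.max_usd:.4f}"
            )


budget = Budget(max_usd=0.01, usd_per_input_token=1e-6, usd_per_output_token=1e-5)
budget.charge(input_tokens=1000, output_tokens=100)  # well under the cap
print(f"{budget.spent_usd:.4f}")
```

Calling `charge` right after each response, before executing any tools, means the run aborts before the next round of side effects rather than after.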

Idempotency. Tools with external side effects (DB writes, payment APIs, email sends) need to be safe to retry. Using tool_use_id as the idempotency key is a common pattern.
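A minimal sketch of the tool_use_id-as-key pattern is below. `IdempotentExecutor` and `send_email` are hypothetical, and the in-memory dict is only for illustration: it does not survive a process crash, so a real deployment would want a durable store or would pass the key through to a downstream API that supports idempotency keys itself.

```python
class IdempotentExecutor:
    """Run each tool_use_id at most once; replay the stored result on retries."""

    def __init__(self, tools):
        self._tools = tools
        self._done = {}  # tool_use_id -> result (in-memory only in this sketch)

    def run(self, tool_use_id, name, args):
        if tool_use_id in self._done:
            return self._done[tool_use_id]  # retry: skip the side effect
        result = self._tools[name](**args)
        self._done[tool_use_id] = result
        return result


sent = []  # records every real send, so duplicates would be visible


def send_email(to):
    sent.append(to)
    return f"sent to {to}"


executor = IdempotentExecutor({"send_email": send_email})
first = executor.run("toolu_1", "send_email", {"to": "a@example.com"})
retry = executor.run("toolu_1", "send_email", {"to": "a@example.com"})
print(first, len(sent))
```

The retry returns the same result the model saw the first time, so the conversation stays consistent while the email goes out once.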

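Context window management. A long agent run eventually fills the context. Two strategies:

Compaction: the OpenAI Responses API supports native compaction (the model compresses prior state into an encrypted, token-efficient form). With Claude, use the memory tool plus a separate summarize step.
Programmatic tool calling (a new Claude feature): instead of pulling every tool result into the model's context, have the model write code that calls the tools like functions inside a code execution environment. Something like looking up budgets for 20 employees finishes in 1 round-trip plus one piece of code instead of 20 round-trips.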

5. Streaming is a separate dimension

The loop and streaming are orthogonal. Turn streaming on and tool_use blocks also arrive incrementally (Claude's content_block_start/content_block_delta/content_block_stop events, Gemini 3's streamFunctionCallArguments, OpenAI Responses' SSE).

Fine-grained tool streaming (Claude): you can receive the tool input JSON token by token before it is complete. This is for UX. But the JSON can only be parsed once the block has finished streaming, so actual tool execution happens after content_block_stop.
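The "buffer until content_block_stop, then parse" rule can be sketched as a small generator. The events here are plain dicts shaped like Claude's streaming events (`content_block_start`, `content_block_delta` with an `input_json_delta`, `content_block_stop`); an SDK would hand you objects rather than dicts, so treat the field access as an assumption to adapt. `collect_tool_calls` is a hypothetical helper name.

```python
import json


def collect_tool_calls(events):
    """Buffer streamed tool_use input per block; yield a call once its block stops."""
    open_blocks = {}  # block index -> {"id", "name", "parts"}
    for event in events:
        etype = event["type"]
        if etype == "content_block_start" and event["content_block"]["type"] == "tool_use":
            block = event["content_block"]
            open_blocks[event["index"]] = {"id": block["id"], "name": block["name"], "parts": []}
        elif etype == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
            if event["index"] in open_blocks:
                open_blocks[event["index"]]["parts"].append(event["delta"]["partial_json"])
        elif etype == "content_block_stop" and event["index"] in open_blocks:
            block = open_blocks.pop(event["index"])
            raw = "".join(block["parts"])
            # Only now is the JSON complete enough to parse.
            yield {"id": block["id"], "name": block["name"],
                   "input": json.loads(raw) if raw else {}}


events = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Checking."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_1", "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Se'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'oul"}'}},
    {"type": "content_block_stop", "index": 1},
]
calls = list(collect_tool_calls(events))
print(calls)
```

The partial fragments can still be forwarded to the UI as they arrive; only execution waits for the complete block.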

OpenAI goes a step further and introduced a WebSocket-based persistent connection to the Responses API (Codex). Instead of an HTTP round-trip on every turn, sampling and tool results go back and forth over the same connection, for a 40% latency reduction.

6. The server-side tools category

Don't get confused here: there are two kinds of "tool".

Client tools (everything described so far): your code executes them. Claude's bash/text_editor/computer/memory have trained-in schemas, but execution is still your responsibility.

Server tools: the provider executes them. Claude's web_search/web_fetch/code_execution/tool_search; OpenAI's web_search/code_interpreter/file_search/shell/MCP; Gemini's Google Search/Code Execution/URL Context. They loop internally within a single API call and hand you only the result. Your loop code is not involved.

Why this matters: when designing an agentic system, delegating a sub-task to a server tool cuts latency and code complexity sharply. Conversely, integration with your own external systems is only possible through client tools.

7. The 2026 shift: who owns the loop?

All three providers are moving toward abstracting the loop away, each with a different flavor:

Anthropic: Claude Agent SDK (runs in your own process) + Claude Managed Agents (hosted runtime, beta, managed-agents-2026-04-01 header, $0.08/session-hour). The two options are cleanly separated. "Messages API if you want to own the loop directly, Agent SDK if you want the runtime in your own process, Managed Agents if you want to hand off hosting too."
OpenAI: absorbing more and more into the Responses API. The direction is to handle shell tool, container workspace, compaction, and skills all within a single API call. "Agentic by default."
Google: consolidating into the Gemini Enterprise Agent Platform (formerly Vertex AI). The SDK automates the loop with automatic function calling; multi-agent patterns live at the platform level.

How to choose: if you need precise control over loop behavior (custom retry, custom step gating, inserting observability hooks, and so on), call Messages/Responses/generate_content directly. Otherwise, the SDK's automatic loop or a managed runtime cuts about 95% of the boilerplate. Note that going managed splits the cost model into token + runtime, so whether it pays off depends on the workload.

8. Common pitfalls

Cache miss. Changing tool definitions mid-conversation breaks the prompt cache. Put the static parts (instructions, tools) in the prefix and the variable parts at the end. This holds for both OpenAI and Anthropic.
Tool descriptions matter more than the schema. A JSON schema only guarantees structural validity. "When and why to call this" has to go in the description for the model to choose correctly. This is why Anthropic recently introduced Tool Use Examples: they bake in usage patterns the schema cannot express.
Many tools means context bloat. 58 tools across 5 servers = 55K tokens. Claude's Tool Search Tool (tool_search_tool_regex_20251119) searches tool definitions on demand instead of putting them all in context every time. Nearly mandatory in MCP environments.
Use strict mode. Claude has strict: true, OpenAI has strict tools, Gemini has responseSchema. These force output to stay within the schema, blocking one layer of argument hallucination.
Always serialize tool results to strings. What the model sees is text in the end. Truncate large results and say so, e.g. "... (showing first 100 of 5000 rows)". Otherwise the context explodes.
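The last pitfall is easy to handle in one helper. This is a minimal sketch: `serialize_rows` and its `max_rows` default are hypothetical, and truncating by row count is only one option (a character or token budget would work the same way).

```python
import json


def serialize_rows(rows, max_rows=100):
    """Serialize a tool result to a string, truncating and saying so when it is large."""
    shown = rows[:max_rows]
    text = json.dumps(shown, ensure_ascii=False)
    if len(rows) > max_rows:
        # Tell the model the result is partial so it can ask for more if needed.
        text += f"\n... (showing first {len(shown)} of {len(rows)} rows)"
    return text


small = serialize_rows([1, 2])
large = serialize_rows(list(range(5000)))
print(large[-40:])
```

The explicit notice matters as much as the truncation: without it the model has no way to know that the data it sees is incomplete.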

---

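One-line summary: the loop itself is a while statement driven by a stop signal, and how reliably you handle termination, parallelism, streaming, cancellation, and budgets determines the quality of the agent. Model choice and prompt tuning come after that.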