Small Python snippets that cut down on chores

2026 · 04 · 26 · 1 min read

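I write a lot of chore-automation scripts. Some clean up data once and get thrown away; others end up running every week before I notice. At first I wrote each one from scratch, but they eventually converged on the same four or five patterns. Stick to the standard library and a script will most likely still run as-is when you reopen it five years later.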

How `pathlib` changed directory work

I no longer use os.path.join or os.listdir. pathlib.Path absorbs the path differences between operating systems, and method chaining makes the intent explicit.

import json
from pathlib import Path

ROOT = Path(__file__).resolve().parent
DATA = ROOT / 'data'
OUT = ROOT / 'out'
OUT.mkdir(exist_ok=True)

# Find every .jsonl file under DATA and convert it
for src in DATA.rglob('*.jsonl'):
    rel = src.relative_to(DATA)
    dst = OUT / rel.with_suffix('.json')
    dst.parent.mkdir(parents=True, exist_ok=True)
    text = src.read_text(encoding='utf-8')
    items = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Write with an explicit encoding so non-ASCII text survives on any platform
    dst.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding='utf-8')
    print(f'{rel} → {len(items)} items')

A single mkdir(parents=True, exist_ok=True) call guarantees the directory tree exists. Method names like with_suffix and relative_to document themselves.
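The path mapping in the loop above can be tried without touching the disk. The following is a minimal sketch using PurePosixPath, which performs path arithmetic only; the helper name mirror_path and the /proj/... paths are made up for illustration and are not part of the original script.

```python
from pathlib import PurePosixPath


def mirror_path(
    src: PurePosixPath, data_root: PurePosixPath, out_root: PurePosixPath
) -> PurePosixPath:
    """Map a file under data_root to the matching .json path under out_root."""
    # relative_to raises ValueError if src is not inside data_root
    rel = src.relative_to(data_root)
    # with_suffix swaps only the final extension; the directories are kept
    return out_root / rel.with_suffix('.json')


# Hypothetical paths, chosen only for the demonstration
data = PurePosixPath('/proj/data')
out = PurePosixPath('/proj/out')
dst = mirror_path(PurePosixPath('/proj/data/2024/jan/events.jsonl'), data, out)
print(dst)  # /proj/out/2024/jan/events.json
```

Because relative_to refuses paths outside the root, a stray input fails loudly instead of being written to an unexpected location.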

A safe `subprocess` wrapper

Calling os.system or subprocess.run directly means filling in the same arguments every time. Write a wrapper once and the call sites stay clean.

import subprocess
from pathlib import Path
from typing import Sequence

def sh(cmd: Sequence[str], *, cwd: Path | None = None, check: bool = True) -> str:
    """Run a command and return its stdout. On failure, raise RuntimeError carrying stderr."""
    result = subprocess.run(
        cmd,
        cwd=cwd,
        capture_output=True,
        text=True,
        check=False,
    )
    if check and result.returncode != 0:
        raise RuntimeError(
            f'{cmd[0]} exited {result.returncode}\n{result.stderr}'
        )
    return result.stdout

# Usage
sha = sh(['git', 'rev-parse', 'HEAD']).strip()
branches = sh(['git', 'branch', '--list']).splitlines()

Never use shell=True. Pass arguments only as a list. Follow these two rules and shell injection practically disappears, because no shell ever parses the command string.
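To see why list arguments are safe, here is a small sketch that passes a string full of shell metacharacters to a child process. It uses sys.executable as the child so it runs anywhere Python does; the hostile string is invented for the demonstration.

```python
import subprocess
import sys

# A value that would run a second command if a shell ever parsed it
hostile = 'hello; echo INJECTED'

# With a list and no shell=True, the string reaches the child as one literal argument
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.argv[1])', hostile],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()
print(echoed)  # hello; echo INJECTED
```

The semicolon is printed back verbatim rather than being treated as a command separator, which is the behavior the two rules above rely on.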

Frequency counts in one line with `Counter`

When analyzing logs or computing word statistics, I no longer write dict.get(k, 0) + 1. collections.Counter is clearer and usually faster.

from collections import Counter
import re

def top_words(text: str, n: int = 20) -> list[tuple[str, int]]:
    words = re.findall(r'[a-zA-Z가-힣]+', text.lower())
    stop = {'the', 'a', 'an', 'is', 'and', 'or', 'of', 'to', 'in'}
    counter = Counter(w for w in words if w not in stop and len(w) > 1)
    return counter.most_common(n)

# Top 20 words across the whole file
for word, count in top_words(Path('post.md').read_text(encoding='utf-8')):
    print(f'{count:>4}  {word}')

most_common(n) includes the sorting. No more typing sorted(d.items(), key=lambda x: -x[1]) every time.
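For comparison, here is a minimal sketch that puts the manual dict pattern and Counter side by side on a tiny made-up list of log levels. Both produce the same ranking.

```python
from collections import Counter

# Sample data invented for the comparison
words = ['error', 'info', 'error', 'warn', 'info', 'error']

# The manual pattern: count with dict.get, then sort by descending count
manual = {}
for w in words:
    manual[w] = manual.get(w, 0) + 1
manual_top = sorted(manual.items(), key=lambda kv: -kv[1])[:2]

# The Counter version: counting and ranking in one expression
counter_top = Counter(words).most_common(2)

print(counter_top)  # [('error', 3), ('info', 2)]
```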

Use generators for big files

With gigabyte-scale log files, reading everything at once with f.readlines() blows up memory. The rule is to read one line at a time in a for loop.

import json
from collections.abc import Iterator
from pathlib import Path

def parse_log(path: Path) -> Iterator[dict]:
    """Yield each log line as a dict, one line at a time. Constant memory."""
    with path.open('r', encoding='utf-8') as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith('#'):
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # silently skip malformed lines

# LOG is a Path to the log file, defined elsewhere
# Even a 100-million-line file is processed with roughly ~10 MB of memory
error_count = sum(1 for entry in parse_log(LOG) if entry.get('level') == 'ERROR')

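yield turns the function into a generator. Each time the caller calls next(), the function resumes and runs only until it yields the next valid entry. Memory use stays roughly constant regardless of file size.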
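The laziness can be observed directly. The sketch below uses io.StringIO as a stand-in for a log file and a variant of the parser, here named parse_lines (a hypothetical name), that accepts an open stream instead of a path. After taking two entries, the rest of the stream is still unread.

```python
import io
import json
from collections.abc import Iterator
from itertools import islice
from typing import TextIO


def parse_lines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed entry per valid line; skip blanks, comments, and bad JSON."""
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


# In-memory stand-in for a log file; contents invented for the demonstration
sample = io.StringIO(
    '# header comment\n'
    '{"level": "INFO"}\n'
    'not json\n'
    '{"level": "ERROR"}\n'
    '{"level": "ERROR"}\n'
)

# Pull only two entries; the generator stops right after the second yield
first_two = list(islice(parse_lines(sample), 2))
# Whatever the generator never reached is still waiting in the stream
remaining = sample.read()

print(first_two)
print(repr(remaining))
```

The last line of the sample is never parsed, because nobody asked for a third entry.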

Settle configuration with a small dataclass

When settings travel around in a global dict, you lose track of where config['retry']['max'] is used. With a dataclass that is loaded in one place, the IDE can track every usage.

import json
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Config:
    api_url: str
    api_key: str
    retry_max: int = 3
    timeout_sec: float = 10.0

    @classmethod
    def load(cls, path: Path) -> 'Config':
        data = json.loads(path.read_text(encoding='utf-8'))
        return cls(
            api_url=data['api_url'],
            api_key=os.environ['API_KEY'],  # the key always comes from the environment
            retry_max=data.get('retry_max', 3),
            timeout_sec=data.get('timeout_sec', 10.0),
        )

config = Config.load(Path('config.json'))

frozen=True makes it immutable. Fields are accessed with dot notation, like .api_url. The IDE catches typos. No going back to the dark days of dicts.
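What frozen=True buys in practice can be shown with a short sketch. The RetryPolicy class below is a hypothetical stand-in with two of the fields from Config, so the example needs neither a config file nor an environment variable.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical stand-in for Config, reduced to two fields."""
    retry_max: int = 3
    timeout_sec: float = 10.0


policy = RetryPolicy()

# Assigning to a field of a frozen dataclass raises FrozenInstanceError
try:
    policy.retry_max = 5
except FrozenInstanceError:
    blocked = True
else:
    blocked = False

# To change a value, build a modified copy; the original stays untouched
patient = replace(policy, timeout_sec=30.0)

print(blocked, policy, patient)
```

dataclasses.replace is the standard-library way to derive a variant, which keeps accidental mutation out while still allowing per-call overrides.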

Closing

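All five patterns above use only the Python standard library. No requests, no pandas. A script with zero dependencies runs without a virtual environment and should still produce the same results five years from now.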

I think one way to become a senior engineer is to stay with the standard library for a long time instead of learning a new library every time. It already contains plenty of good tools.
