Upgrades content processing from a single LLM call to a structured
5-step document reconstruction pipeline:
1. Normalize — clean up colloquialisms, restore punctuation, extract key entities
2. Index Tree — scan the full text → generate a hierarchical table of contents (JSON)
3. Leaf Summarize — detailed per-section summaries (with a 300-character context overlap)
4. Consistency Check — verify and backfill missing entities
5. Assemble — assemble the final Markdown document (no LLM call needed)
- Short texts (< 3000 chars): simple 1-pass fallback
- Long texts: full pipeline (N+4 LLM calls where N = section count)
- worker.py: uses body_md from enricher as Obsidian note body
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
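The length-based dispatch above can be sketched as follows. This is a minimal illustration, not the actual enricher API: the function name, the constant name, and the section-counting interface are assumptions; only the 3000-char threshold and the N+4 call count come from the description above.

```python
SHORT_TEXT_LIMIT = 3000  # chars; below this, one LLM call suffices

def estimate_llm_calls(text: str, section_count: int) -> int:
    """Return how many LLM calls the reconstruction will make.

    Short texts take the simple 1-pass fallback (a single call).
    Long texts run the full pipeline: one leaf summary per section
    plus the fixed-cost steps, N + 4 calls in total (Assemble is
    pure string work and needs no LLM).
    """
    if len(text) < SHORT_TEXT_LIMIT:
        return 1
    return section_count + 4
```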
Use Optional[T] with from __future__ import annotations instead of the
T | None syntax, which requires Python 3.10+.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
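A minimal sketch of the compatibility pattern above. On Python < 3.10, evaluating `T | None` in an annotation raises TypeError; `Optional[T]` works on every supported version, and the `__future__` import additionally defers annotation evaluation. The function and its parameters are hypothetical, chosen only to show the annotation style:

```python
from __future__ import annotations  # annotations become lazy strings

from typing import Optional


def find_title(video_id: str, cache: Optional[dict] = None) -> Optional[str]:
    """Return a cached title, or None when it is unknown."""
    if cache is None:
        return None
    return cache.get(video_id)
```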
- youtube.py: fetch real title via YouTube oEmbed API instead of falling back to video ID
- youtube.py: paragraphize transcript text by grouping sentences (4 per para)
- enricher.py: increase max_tokens 1024→2048 to prevent summary truncation
- web.py: restore paragraph breaks after HTML stripping
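The oEmbed lookup in youtube.py might look like the sketch below. The endpoint (`https://www.youtube.com/oembed`) and its JSON `title` field are YouTube's public oEmbed API; the helper names and the fallback-to-ID behavior on failure are assumptions about the implementation:

```python
import json
import urllib.parse
import urllib.request


def oembed_url(video_id: str) -> str:
    # Public oEmbed endpoint; returns JSON metadata including "title".
    params = urllib.parse.urlencode({
        "url": f"https://www.youtube.com/watch?v={video_id}",
        "format": "json",
    })
    return f"https://www.youtube.com/oembed?{params}"


def fetch_youtube_title(video_id: str) -> str:
    """Fetch the real title, falling back to the video ID on any
    failure (the pre-change behavior)."""
    try:
        with urllib.request.urlopen(oembed_url(video_id), timeout=10) as resp:
            return json.load(resp)["title"]
    except Exception:
        return video_id
```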
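The sentence-grouping change can be sketched like this. The regex split on sentence-final punctuation is an assumption; the actual youtube.py sentence splitter may differ, but the grouping of four sentences per paragraph follows the description above:

```python
import re


def paragraphize(text: str, per_para: int = 4) -> str:
    """Group sentences into blank-line-separated paragraphs of
    `per_para` sentences each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    paras = [
        " ".join(sentences[i:i + per_para])
        for i in range(0, len(sentences), per_para)
    ]
    return "\n\n".join(paras)
```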
- core/vocab.py: extract B1-B2 level vocabulary from English content via Gemini Flash
- core/anki.py: register vocab cards to AnkiConnect (English::Vocabulary deck)
- core/enricher.py: add language detection field + summary_ko (Korean summary)
- core/obsidian.py: render Korean + English summary in note
- daemon/worker.py: call vocab extraction and Anki registration for English content
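The AnkiConnect registration in core/anki.py could look like the sketch below. The request shape (`addNote`, `version: 6`, port 8765) is AnkiConnect's documented API and the deck name comes from the bullet above; the note model ("Basic"), field names, and helper names are assumptions, not the actual core/anki.py interface:

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"


def vocab_note_payload(word: str, definition: str) -> dict:
    """Build an AnkiConnect addNote request for the vocab deck."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": "English::Vocabulary",
                "modelName": "Basic",
                "fields": {"Front": word, "Back": definition},
                "options": {"allowDuplicate": False},
            }
        },
    }


def register_vocab_card(word: str, definition: str) -> int:
    """POST the note to a local AnkiConnect instance and return the
    new note id (requires Anki running with the AnkiConnect add-on)."""
    req = urllib.request.Request(
        ANKI_CONNECT_URL,
        data=json.dumps(vocab_note_payload(word, definition)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        result = json.load(resp)
    if result.get("error"):
        raise RuntimeError(result["error"])
    return result["result"]
```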