Commit Graph

10 Commits

Author SHA1 Message Date
afc9cdcde6 Refactor Playwright to singleton browser with tab-based crawling
- Add PlaywrightBrowserService: singleton Chromium browser with auto-recovery
- Refactor WebCrawlerService/YouTubeTranscriptService to use shared browser tabs
- Fix YouTube transcript: extract from DOM panel + fmt=json3 fallback
- Keep browser window alive (navigate to about:blank instead of calling page.close)
- Add docs: X Window setup, operation manual, crawling guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:18:33 +00:00
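
The singleton-with-auto-recovery idea from the commit above can be sketched without any Playwright dependency. This is a minimal illustration, not the project's PlaywrightBrowserService: the class name `RecoveringSingleton` and its constructor parameters are hypothetical. With Playwright, the factory would presumably launch Chromium and the health check could be something like `Browser.isConnected()`.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical holder for one shared, expensive resource (for example a browser).
 * The resource is created lazily and recreated when the health check fails.
 */
final class RecoveringSingleton<T> {
    private final Supplier<T> factory;      // creates a fresh instance
    private final Predicate<T> isHealthy;   // returns false when the instance is unusable
    private T instance;

    RecoveringSingleton(Supplier<T> factory, Predicate<T> isHealthy) {
        this.factory = factory;
        this.isHealthy = isHealthy;
    }

    /** Returns the shared instance, replacing it first if it is missing or unhealthy. */
    synchronized T get() {
        if (instance == null || !isHealthy.test(instance)) {
            instance = factory.get();
        }
        return instance;
    }
}
```

Callers would ask the holder for the shared instance on every crawl and open a new tab on it, so a crashed browser is replaced on the next request rather than at startup only.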
db4155c36d Add error logging and improve HTTP handling for YouTube transcript fetching
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:38:23 +00:00
56d5752095 Add YouTube cookie support to Playwright fallback for bot bypass
Load cookies.txt (Netscape format) into Playwright browser context
before navigating to YouTube, enabling authenticated access to bypass
bot detection that blocks transcript retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:37:48 +00:00
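
For the cookies.txt loading described in the commit above, a parser for the Netscape format might look like the sketch below. It assumes the usual seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value) and the `#HttpOnly_` line prefix used by some exporters. The names `NetscapeCookies` and `Cookie` are hypothetical, and converting the result into Playwright's cookie type for the browser context is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for Netscape-format cookies.txt lines. */
final class NetscapeCookies {

    record Cookie(String domain, boolean includeSubdomains, String path, boolean secure,
                  long expiresEpochSeconds, String name, String value, boolean httpOnly) {}

    static List<Cookie> parse(List<String> lines) {
        List<Cookie> cookies = new ArrayList<>();
        for (String raw : lines) {
            // Drop a trailing carriage return from files saved with CRLF line endings.
            String line = raw.endsWith("\r") ? raw.substring(0, raw.length() - 1) : raw;
            boolean httpOnly = false;
            if (line.startsWith("#HttpOnly_")) {
                httpOnly = true;
                line = line.substring("#HttpOnly_".length());
            } else if (line.isBlank() || line.startsWith("#")) {
                continue; // comment or empty line
            }
            // Limit -1 keeps a trailing empty field, so cookies with empty values survive.
            String[] f = line.split("\t", -1);
            if (f.length != 7) {
                continue; // malformed line, skipped in this sketch
            }
            long expires;
            try {
                expires = Long.parseLong(f[4].trim());
            } catch (NumberFormatException e) {
                continue;
            }
            cookies.add(new Cookie(f[0], "TRUE".equalsIgnoreCase(f[1]), f[2],
                    "TRUE".equalsIgnoreCase(f[3]), expires, f[5], f[6], httpOnly));
        }
        return cookies;
    }
}
```

Malformed lines are skipped silently here; a real service may prefer to log them, since a broken cookie file would otherwise look like a bot-detection failure.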
677a79978f Use youtube-transcript-api library with Playwright fallback for YouTube transcripts
Replace Jsoup-based approach with io.github.thoroldvix:youtube-transcript-api
as primary method (supports manual/generated captions, language priority).
Playwright headed mode is kept as a fallback when the API fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:26:52 +00:00
1bfe55d5a8 Switch YouTube transcript fetching from Jsoup to Playwright headed mode
Jsoup was blocked by YouTube bot detection. Now uses Playwright with
headed Chromium via Xvfb virtual display to bypass restrictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:11:52 +00:00
bb5a601433 Add YouTube transcript auto-fetch button on Knowledge add page
- YouTubeTranscriptService: fetches captions from YouTube page (ko > en > first available)
- GET /api/knowledge/youtube-transcript endpoint
- Frontend: "Auto-fetch transcript" button (UI label: "트랜스크립트 자동 가져오기") appears when a valid YouTube URL is entered

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 04:20:13 +00:00
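
The caption priority in the commit above (ko > en > first available) could be expressed as follows. This is a sketch only: `CaptionPicker` and `CaptionTrack` are hypothetical names, and how YouTubeTranscriptService actually represents caption tracks is not shown in this log.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper that picks a caption track by language priority. */
final class CaptionPicker {

    record CaptionTrack(String languageCode, String url) {}

    /**
     * Returns the first track matching the earliest preferred language,
     * or the first available track if none of the preferred languages exist.
     */
    static Optional<CaptionTrack> pick(List<CaptionTrack> tracks, List<String> preferred) {
        for (String lang : preferred) {
            for (CaptionTrack track : tracks) {
                if (track.languageCode().equalsIgnoreCase(lang)) {
                    return Optional.of(track);
                }
            }
        }
        return tracks.stream().findFirst();
    }
}
```

Calling `CaptionPicker.pick(tracks, List.of("ko", "en"))` reproduces the stated order, and an empty track list yields an empty `Optional`.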
f0f7b62e3d Add Playwright headless browser as 3rd crawling fallback
Crawl chain: Jsoup → Jina Reader → Playwright (headless Chromium).
Error page detection (403, Access Denied, etc.) triggers next fallback.
Switch to exploded classpath for Playwright driver-bundle compatibility.
Fix Next.js standalone static file serving with symlink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:36:24 +00:00
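
The crawl chain in the commit above (Jsoup → Jina Reader → Playwright, with error-page detection triggering the next fallback) can be sketched generically. The names `CrawlChain`, `Fetcher`, and `Result` and the marker strings are assumptions for illustration; the real WebCrawlerService rules are not shown in this log.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical crawl chain: tries each fetcher in order until one returns usable text. */
final class CrawlChain {

    // Assumed markers; a real detector would likely be more specific.
    private static final List<String> ERROR_MARKERS = List.of("403 forbidden", "access denied");

    /** A named fetcher that returns page text for a URL or throws on failure. */
    record Fetcher(String name, Function<String, String> fetch) {}

    /** The accepted text and the name of the fetcher that produced it. */
    record Result(String source, String text) {}

    static boolean looksLikeErrorPage(String text) {
        if (text == null || text.isBlank()) {
            return true;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        return ERROR_MARKERS.stream().anyMatch(lower::contains);
    }

    static Optional<Result> crawl(String url, List<Fetcher> chain) {
        for (Fetcher fetcher : chain) {
            try {
                String text = fetcher.fetch().apply(url);
                if (!looksLikeErrorPage(text)) {
                    return Optional.of(new Result(fetcher.name(), text));
                }
            } catch (RuntimeException e) {
                // Treat a thrown exception like an error page and try the next fetcher.
            }
        }
        return Optional.empty();
    }
}
```

A plain substring check will also reject a legitimate article that happens to contain "access denied", so a production detector would probably combine it with the HTTP status or page length.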
0cc84354f5 Add Jina Reader API fallback for web crawling
Jsoup fails on bot-blocked sites (403). Now tries Jsoup first,
then falls back to Jina Reader (r.jina.ai) for better coverage.
Supports optional API key via JINA_READER_API_KEY env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:03:09 +00:00
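
As a rough illustration of the Jina Reader fallback in the commit above, the request could be built by prefixing the target URL with the reader endpoint and attaching the optional key as a bearer token. The class name `JinaReaderRequests` and the 30-second timeout are assumptions; the project's actual HTTP client (likely WebFlux-based) is not shown here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical builder for Jina Reader requests using the JDK HTTP client types. */
final class JinaReaderRequests {

    static final String READER_PREFIX = "https://r.jina.ai/";

    /** Builds a GET request; the Authorization header is added only when a key is present. */
    static HttpRequest build(String targetUrl, String apiKey) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(READER_PREFIX + targetUrl))
                .timeout(Duration.ofSeconds(30)) // assumed value
                .GET();
        if (apiKey != null && !apiKey.isBlank()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }
}
```

Passing `System.getenv("JINA_READER_API_KEY")` as the second argument covers the optional case, since it returns null when the variable is unset.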
9929322de0 Implement all core features: Knowledge pipeline, RAG chat, Todos, Habits, Study Cards, Tags, Dashboard
- Google OAuth authentication with callback flow
- Knowledge ingest pipeline (TEXT/WEB/YOUTUBE → chunking → categorization → embedding)
- OCI GenAI integration (chat, embeddings) with multi-model support
- Semantic search via Oracle VECTOR_DISTANCE
- RAG-based AI chat with source attribution
- Todos with subtasks, filters, and priority levels
- Habits with daily check-in, streak tracking, and color customization
- Study Cards with SM-2 spaced repetition and LLM auto-generation
- Tags system with knowledge item mapping
- Dashboard with live data from all modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:43:51 +00:00
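
The SM-2 scheduling mentioned for Study Cards in the commit above follows a well-known update rule, sketched below. The names `Sm2` and `CardState` are hypothetical. This variant computes the next interval from the ease factor before the update and leaves the ease factor unchanged on a failed review, which follows the original SM-2 description; the project's implementation may differ.

```java
/** Hypothetical SM-2 scheduler: quality is 0..5, ease factor never drops below 1.3. */
final class Sm2 {

    record CardState(int repetitions, int intervalDays, double easeFactor) {
        static CardState fresh() {
            return new CardState(0, 0, 2.5);
        }
    }

    static CardState review(CardState state, int quality) {
        if (quality < 0 || quality > 5) {
            throw new IllegalArgumentException("quality must be between 0 and 5");
        }
        if (quality < 3) {
            // Failed recall: restart the repetition sequence, keep the ease factor.
            return new CardState(0, 1, state.easeFactor());
        }
        int interval;
        if (state.repetitions() == 0) {
            interval = 1;
        } else if (state.repetitions() == 1) {
            interval = 6;
        } else {
            interval = (int) Math.round(state.intervalDays() * state.easeFactor());
        }
        // Standard SM-2 ease factor adjustment, clamped at 1.3.
        int miss = 5 - quality;
        double ease = state.easeFactor() + (0.1 - miss * (0.08 + miss * 0.02));
        return new CardState(state.repetitions() + 1, interval, Math.max(1.3, ease));
    }
}
```

Starting from `CardState.fresh()`, successive perfect reviews give intervals of 1 day, then 6 days, then the previous interval multiplied by the ease factor.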
3d2aa6cf46 Add backend/frontend scaffolding with Oracle ADB wallet config
- Backend: Spring Boot 3 + WebFlux, JWT auth, Oracle ADB wallet,
  8 controllers/services/repositories (Auth through Tag), DTOs, exception handling
- Frontend: Next.js 15, TypeScript, Tailwind CSS, AuthContext,
  7 pages (dashboard, knowledge, chat, study, todos, habits, login)
- DB: V1 migration with 12 tables including VECTOR(1024) + HNSW index
- Ops: PM2 ecosystem config, deploy.sh, start-backend.sh
- CLAUDE.md: DB credentials replaced with env var references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:56:26 +00:00