30 Commits

Author SHA1 Message Date
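7bc0464afc gov-scraper: Add filter for regional eligibility restrictions in the announcement body
- generate_checklist.js: exclude an announcement when its body matches the forward pattern 'non-Seoul region + reside/located/within jurisdiction/enrolled (거주/소재/관내/재학)'
- Keep it when Seoul/capital area/nationwide (서울/수도권/전국) appears (Seoul residents are eligible); also keep programs run by Seoul institutions
- The reverse pattern (address + region) is not checked, because it false-positives on institution contact footers
- apply-checklist.md: region (title + host institution + body) + age + gender/target → 109 items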

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 08:13:29 +00:00
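A minimal sketch of the forward-pattern check described in 7bc0464afc. The region list, keyword lists, 10-character window, and the function name isRegionRestricted are assumptions for illustration; the actual lists and logic in generate_checklist.js are not shown in this log.

```javascript
// Hypothetical keyword lists; the real ones in generate_checklist.js are not shown.
const OTHER_REGIONS = ["부산", "대구", "대전", "울산", "제주"];
const RESIDENCY_WORDS = ["거주", "소재", "관내", "재학"]; // reside / located / within jurisdiction / enrolled
const KEEP_WORDS = ["서울", "수도권", "전국"]; // Seoul / capital area / nationwide

// Forward pattern: a non-Seoul region name followed shortly by a residency word.
// The 10-character window is an assumed heuristic.
const FORWARD_PATTERN = new RegExp(
  `(${OTHER_REGIONS.join("|")})[^\\n.]{0,10}(${RESIDENCY_WORDS.join("|")})`
);

function isRegionRestricted(body) {
  // Any Seoul/capital-area/nationwide mention keeps the announcement.
  if (KEEP_WORDS.some((word) => body.includes(word))) return false;
  return FORWARD_PATTERN.test(body);
}
```

Because only the region-then-keyword order is tested, a contact footer such as an address line containing a region name does not trigger exclusion on its own.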
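f3587eb130 gov-scraper: Add gender/special-target filter to the application checklist
- generate_checklist.js: assuming a male applicant, exclude women-only programs; exclude programs reserved for special targets (persons with disabilities/veterans/multicultural families/North Korean defectors)
- Match on title + host institution only (body mentions of 'preferential' (우대) bonus points are not checked, to avoid removing items by mistake)
- Region supplement: add Dalgubeol (달구벌, = Daegu)
- apply-checklist.md: region + age + gender/target applied cumulatively → 117 items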

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 08:03:07 +00:00
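ffdcea009d gov-scraper: Add age (46) filter to the application checklist
- generate_checklist.js: extract the upper age limit from the body ('만 N세 이하', i.e. 'aged N or under', or a range) → exclude when the limit is below 46
- '청년/대학생' (youth/university student) in the title = youth-only, so excluded; kept when a '중장년/만40+이상/연령무관' (middle-aged/aged 40+ or over/no age limit) signal is present
- apply-checklist.md: region (Seoul) + age (46) applied → 252→122 items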

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 07:57:23 +00:00
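The age rule in ffdcea009d can be sketched roughly as follows. The regular expressions and the names extractAgeCap and isExcludedByAge are assumptions for illustration, not the code in generate_checklist.js.

```javascript
// Returns the upper age limit found in the text, or null if none is found.
function extractAgeCap(text) {
  // "만 N세 이하" (aged N or under)
  const upTo = /만\s*(\d{1,2})\s*세\s*이하/.exec(text);
  if (upTo) return Number(upTo[1]);
  // Range such as "만 19세 ~ 만 34세": take the upper bound.
  const range = /만\s*(\d{1,2})\s*세?\s*[~\-]\s*(?:만\s*)?(\d{1,2})\s*세/.exec(text);
  if (range) return Number(range[2]);
  return null;
}

function isExcludedByAge(title, body, applicantAge = 46) {
  const cap = extractAgeCap(body);
  if (cap !== null && cap < applicantAge) return true;
  // Youth-only titles are excluded unless an age-inclusive signal is present.
  const youthOnly = /청년|대학생/.test(title);
  const keepSignal = /중장년|연령\s*무관|만\s*[4-9]\d\s*세\s*이상/.test(`${title} ${body}`);
  return youthOnly && !keepSignal;
}
```

Announcements with no detectable limit fall through to the title rule, so an unparseable body never excludes an item by itself.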
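ad8d200474 gov-scraper: Apply Seoul-residence region filter to the application checklist
- generate_checklist.js: for a Seoul resident, exclude announcements restricted to other regions (matched on prefix/host institution; unambiguous province and wider-region names are also matched in the title and body)
- apply-checklist.md: 252→137 items (115 other-region items excluded), keeping only Seoul + nationwide announcements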

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 07:52:33 +00:00
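afff7a4703 gov-scraper: Add application checklist (apply-checklist.md)
- docs/apply-checklist.md: 252 currently open announcements that accept prospective founders, as checkboxes + URLs grouped by deadline
- scripts/generate_checklist.js: regenerate the checklist from the DB (output to the tracked docs/ directory)
- Tick [x] as each application is completed; the list can be refreshed with the script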

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 07:41:47 +00:00
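e48a45bf71 gov-scraper: Add submission-ready full business plans (with market figures, competitors, sources)
- docs/business-plans-full.md: PSST + TAM/SAM/SOM + competitor comparison tables for 3 apps
- Incorporate market research (parallel research): sources and years cited, estimates marked as such
  - Tasteby: dining-out market KRW 153 trillion, CatchTable/Siksin, short-form restaurant video statistics, Data Voucher
  - Lyricsy: language learning $83.7 billion, 225 million Hallyu fans, Duolingo, lyrics licensing (LyricFind/KOMCA)
  - Parents Story: super-aged society 20.3%, senior-friendly industry KRW 80 trillion, on-device AI, all competitors cloud-based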

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 07:36:52 +00:00
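cdce7b86bb gov-scraper: Add master business plan + announcement matching/extraction scripts
- docs/business-plans.md: draft PSST business plans for 3 apps (Tasteby/Lyricsy/Parents Story)
- scripts/match.js: query announcements by per-app topic keyword matching
- scripts/eligible.js: list currently open announcements that accept prospective founders
- scripts/export_eligible_csv.js: generate a CSV for application tracking (exports/)
- Add exports/ to gitignore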

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 07:24:39 +00:00
82504e2261 gov-scraper: Add Bizinfo (기업마당, bizinfo) Open API source
- BizinfoApiSource: uses bizinfo.go.kr's own crtfcKey, /uss/rss/bizinfoApi.do
- No pagination → determine totCnt, then request everything in one batch (verified with 1,463 items)
- bsnsSumryCn (HTML) body → tags removed with stripHtml, loaded in a single pass (all items DETAILED)
- reqstBeginEndDe "YYYY-MM-DD ~ ..." → parsed into the application period (706 items); free-text values become null
- util: add stripHtml, parsePeriodRange
- Daemon now runs 4 sources: kstartup/bizinfo/mss/smes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 06:33:27 +00:00
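The two utilities named in 82504e2261 might look roughly like this. Only the names stripHtml and parsePeriodRange come from the commit; the bodies, the returned object shape, and the regex-based tag stripping are assumptions (a real HTML parser would be more robust).

```javascript
// Naive tag removal: replaces each tag with a space, then collapses whitespace.
function stripHtml(html) {
  return String(html ?? "")
    .replace(/<[^>]*>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Parses "YYYY-MM-DD ~ YYYY-MM-DD"; any other (free-text) value yields null.
function parsePeriodRange(text) {
  const match = /^\s*(\d{4}-\d{2}-\d{2})\s*~\s*(\d{4}-\d{2}-\d{2})\s*$/.exec(
    String(text ?? "")
  );
  return match ? { start: match[1], end: match[2] } : null;
}
```

Returning null for free-text periods matches the behavior the commit describes for the items that could not be parsed.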
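f2a8f30867 gov-scraper: Add 중소벤처24 (smes) program announcement source
- Extend GenericHtmlSource: application period (period) date parsing, listOnly (list-only) mode
- Add smes (중소벤처24 bizApply) config: extract the PBLN announcement ID from href; load title/field/host institution/application period
- smes detail pages are popup-only (JS dialog) and cannot be crawled directly → loaded as list-only (verified with 18 items)
- util: parseFlexibleDate (handles YY-MM-DD/YYYYMMDD)
- pipeline: skipDetail sources skip the detail stage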

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 05:51:46 +00:00
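A possible shape for the parseFlexibleDate helper mentioned in f2a8f30867. Only the name and the two input formats come from the commit; the ISO string return value, the mapping of two-digit years to 20YY, and the rejection of impossible dates are assumptions.

```javascript
// Accepts "YYYYMMDD", "YY-MM-DD" or "YYYY-MM-DD"; returns "YYYY-MM-DD" or null.
function parseFlexibleDate(text) {
  const value = String(text ?? "").trim();
  let match = /^(\d{4})(\d{2})(\d{2})$/.exec(value);
  if (!match) match = /^(\d{2}|\d{4})[-.](\d{1,2})[-.](\d{1,2})$/.exec(value);
  if (!match) return null;

  // Assumption: two-digit years belong to the 2000s.
  const year = match[1].length === 2 ? 2000 + Number(match[1]) : Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Round-trip through Date to reject values such as February 30.
  const date = new Date(Date.UTC(year, month - 1, day));
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date.toISOString().slice(0, 10) : null;
}
```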
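cbc5ba5663 Add government support program announcement collection daemon (gov-scraper)
- government/ Node daemon: Open API first + HTML as a supplement + discovery strategy
- Strategy-pattern source adapters: KStartupApiSource (public data Open API), GenericHtmlSource (config-driven)
- Node reimplementation of sundol's 3-stage fallback crawler (cheerio→Jina→Playwright CDP), reusing sundol-chrome (9222)
- Connect in Oracle thick mode (Instant Client + sso wallet); load into gov_source/gov_opportunity (deduplicated)
- Verified loading 29,017 K-Startup items + 30 Ministry of SMEs and Startups (mss) items; registered PM2 gov-daemon (60-minute interval)
- Bizinfo (기업마당, bizinfo) is waiting on issuance of its own crtfcKey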

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 04:36:50 +00:00
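The 3-stage fallback crawler in cbc5ba5663 can be illustrated with a generic chain like the one below. The stage objects here are stand-ins; the real cheerio, Jina, and Playwright CDP stages are not shown in this log, and crawlWithFallback and looksLikeErrorPage are hypothetical names.

```javascript
// Treats obvious block pages as failures so the next stage gets a chance.
function looksLikeErrorPage(text) {
  return /403 Forbidden|Access Denied/i.test(text);
}

// Tries each stage in order and returns the first usable result.
async function crawlWithFallback(url, stages) {
  const errors = [];
  for (const { name, run } of stages) {
    try {
      const text = await run(url);
      if (text && !looksLikeErrorPage(text)) return { source: name, text };
      errors.push(`${name}: empty or error page`);
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all stages failed for ${url}: ${errors.join("; ")}`);
}
```

A caller would pass the stages cheapest first, for example [{ name: "cheerio", run: ... }, { name: "jina", run: ... }, { name: "playwright", run: ... }], so the browser is used only when the lighter stages fail.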
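5700449bfd docs: Add incident report for the Chrome CDP crawling outage
Document the symptoms, diagnosis, root cause (Chrome 136+ refusing CDP
on the default profile), remediation, verification, and recurrence-check
checklist for the 2026-05-18 web crawling/YouTube transcript outage
in docs/incident-2026-05-18-chrome-cdp-crawling.md.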

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 01:17:11 +00:00
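9569309e49 Restore crawling: run Chrome CDP permanently as PM2 sundol-chrome
Chrome 136+ refuses remote debugging (CDP) on the default profile directory,
which had caused the third-stage web crawling fallback and YouTube transcript
extraction to fail entirely since April 13. This commit fixes that.

- Move the profile to a non-default directory (~/.config/google-chrome-cdp)
  so CDP is allowed while the login session is kept
- New start-chrome.sh: clean up any existing Chrome and remove stale locks, then
  start with --remote-debugging-port=9222 --remote-debugging-address=127.0.0.1
- ecosystem.config.cjs: add sundol-chrome PM2 app (no manual launches; manage through PM2 only)
  Note: the /usr/local/bin/node change in the frontend script is from earlier work and is included here
- PlaywrightBrowserService: pin CDP_URL to 127.0.0.1 (removes the IPv6 ::1 resolution pitfall)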

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 01:06:58 +00:00
20210830cf Fix TTS: switch to 1.7B with ref_audio, speakable text on all lines
- Use 1.7B model (0.6B had tensor mismatch with cached prompts)
- Speak endpoint uses ref_audio directly (not cached pkl) as fallback
- Cache voice clone prompts in memory on startup
- Add SpeakableText component: 🔊 icon on each p and li element
- Remove old TTSReader sequential approach
- Add global exception handler to TTS server
- Fix profile localStorage caching
- inference_mode + bf16 optimization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:14:06 +00:00
1088b23790 Add Notes, Voice Clone TTS, fix auth persistence and maxTokens
Notes:
- notes table with TEXT/AUDIO types, category support
- Audio upload → OpenRouter Gemini STT → OCI GenAI polish/summary
- Raw STT saved separately in raw_content column
- Polish/summary button for manual re-processing
- Async processing with real-time polling

Voice Clone TTS:
- Qwen3-TTS 1.7B model on A10 GPU via FastAPI server
- Voice profile registration (record/upload → save embedding)
- Profile-based TTS generation API
- TTS web page with recording, profile management, generation

Auth fixes:
- Store both access + refresh tokens in localStorage
- Initialize state from localStorage synchronously (no flash)
- Request interceptor reads token from localStorage every request
- Refresh via body (not just cookie)

Other fixes:
- maxTokens 4096 → 65536 (OCI GenAI Gemini supports up to 65536)
- Fix broken Korean chars in source files
- OpenRouter config for STT
- ffmpeg installed for audio conversion
- Ollama + Gemma 4 E4B installed (STT fallback)
- nginx proxy for TTS server (/api/tts/)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 07:34:18 +00:00
6c2129d42e Add category view, pagination, and persist login across deployments
- Add 2-panel category view: sidebar tree + filtered item list
- Category counts use DISTINCT with descendant inclusion
- Hide empty categories, show category badges on item cards
- Add client-side pagination (10 items/page) for both views
- Persist access token in localStorage to survive page refresh
- Fix token refresh retry on backend restart

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 04:11:08 +00:00
f9f710ec90 Add English level settings, improve content structuring and rendering
- Add english_level column to users table (CEFR with TOEIC mapping)
- Add UserController (GET/PATCH /api/users/me) and Settings page
- Enhance structuring prompts: sequential TOC, no summary sections,
  no content overlap, English expression extraction by CEFR level
- Remove sub-TOC analysis (caused content repetition), use simple
  per-section generation with truncation detection and continuation
- Fix CLOB truncation: explicit Clob-to-String conversion in repository
- Replace regex-based markdown rendering with react-markdown
- Add wallet renewal procedure to troubleshooting docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 23:48:38 +00:00
4cde775809 Switch to user Chrome CDP for YouTube transcript, fix auth and ads
- Replace Playwright standalone browser with CDP connection to user Chrome
  (bypasses YouTube bot detection by using logged-in Chrome session)
- Add video playback, ad detection/skip, and play confirmation before transcript extraction
- Extract transcript JS to separate resource files (fix SyntaxError in evaluate)
- Add ytInitialPlayerResponse-based transcript extraction as primary method
- Fix token refresh: retry on network error during backend restart
- Fix null userId logout, CLOB type hint for structured_content
- Disable XFCE screen lock/screensaver
- Add troubleshooting entries (#10-12) and YouTube transcript guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 21:01:49 +00:00
9abb770e37 Add knowledge structuring feature with incremental LLM processing
- Add structured_content column and STRUCTURING pipeline step
- Split LLM structuring into TOC + per-section calls to avoid token limit
- Save intermediate results to DB for real-time frontend polling (3s)
- Add manual "정리하기" (Organize) button with async processing
- Fix browser login modal by customizing authentication entry point
- Fix standalone build symlinks for server.js and static files
- Add troubleshooting guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:43:21 +00:00
afc9cdcde6 Refactor Playwright to singleton browser with tab-based crawling
- Add PlaywrightBrowserService: singleton Chromium browser with auto-recovery
- Refactor WebCrawlerService/YouTubeTranscriptService to use shared browser tabs
- Fix YouTube transcript: extract from DOM panel + fmt=json3 fallback
- Keep browser window alive (about:blank instead of page.close)
- Add docs: X Window setup, operation manual, crawling guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:18:33 +00:00
db4155c36d Add error logging and improve HTTP handling for YouTube transcript fetching
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:38:23 +00:00
56d5752095 Add YouTube cookie support to Playwright fallback for bot bypass
Load cookies.txt (Netscape format) into Playwright browser context
before navigating to YouTube, enabling authenticated access to bypass
bot detection that blocks transcript retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:37:48 +00:00
677a79978f Use youtube-transcript-api library with Playwright fallback for YouTube transcripts
Replace Jsoup-based approach with io.github.thoroldvix:youtube-transcript-api
as primary method (supports manual/generated captions, language priority).
Playwright head mode kept as fallback when API fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:26:52 +00:00
1bfe55d5a8 Switch YouTube transcript fetching from Jsoup to Playwright head mode
Jsoup was blocked by YouTube bot detection. Now uses Playwright with
headed Chromium via Xvfb virtual display to bypass restrictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 07:11:52 +00:00
9798cda41e Add Axios interceptor for automatic token refresh with mutex pattern
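- api.ts: on a 401 response, automatically refresh → retry; concurrent requests wait in a queue (prevents a race condition)
- auth-context.tsx: wire callbacks into the interceptor (token refresh/logout)
- use-api.ts: remove 401 retry logic (handled by the interceptor)
- build.sh: add a NEXT_PUBLIC environment variable validation step
- CLAUDE.md: add the frontend build procedure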

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 04:42:23 +00:00
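The mutex idea in 9798cda41e (one refresh shared by all concurrent 401s) can be sketched without any HTTP library as a single-flight promise. This is not the Axios interceptor in api.ts; createRefresher and requestWithRefresh are hypothetical names, and the response is assumed to expose a numeric status field.

```javascript
// Wraps a refresh function so that concurrent callers share one in-flight call.
function createRefresher(refreshFn) {
  let inFlight = null;
  return function refreshOnce() {
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(refreshFn)
        .finally(() => {
          inFlight = null; // allow a new refresh after this one settles
        });
    }
    return inFlight;
  };
}

// Sends a request; on 401, waits for the shared refresh and retries once.
async function requestWithRefresh(send, refreshOnce) {
  const response = await send();
  if (response.status !== 401) return response;
  await refreshOnce();
  return send();
}
```

Requests that hit 401 while a refresh is already running simply await the same promise, which plays the role of the waiting queue described in the commit.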
bb5a601433 Add YouTube transcript auto-fetch button on Knowledge add page
- YouTubeTranscriptService: fetches captions from YouTube page (ko > en > first available)
- GET /api/knowledge/youtube-transcript endpoint
- Frontend: "트랜스크립트 자동 가져오기" (Auto-fetch transcript) button appears when valid YouTube URL entered

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 04:20:13 +00:00
f0f7b62e3d Add Playwright headless browser as 3rd crawling fallback
Crawl chain: Jsoup → Jina Reader → Playwright (headless Chromium).
Error page detection (403, Access Denied, etc.) triggers next fallback.
Switch to exploded classpath for Playwright driver-bundle compatibility.
Fix Next.js standalone static file serving with symlink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:36:24 +00:00
0cc84354f5 Add Jina Reader API fallback for web crawling
Jsoup fails on bot-blocked sites (403). Now tries Jsoup first,
then falls back to Jina Reader (r.jina.ai) for better coverage.
Supports optional API key via JINA_READER_API_KEY env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:03:09 +00:00
9929322de0 Implement all core features: Knowledge pipeline, RAG chat, Todos, Habits, Study Cards, Tags, Dashboard
- Google OAuth authentication with callback flow
- Knowledge ingest pipeline (TEXT/WEB/YOUTUBE → chunking → categorization → embedding)
- OCI GenAI integration (chat, embeddings) with multi-model support
- Semantic search via Oracle VECTOR_DISTANCE
- RAG-based AI chat with source attribution
- Todos with subtasks, filters, and priority levels
- Habits with daily check-in, streak tracking, and color customization
- Study Cards with SM-2 spaced repetition and LLM auto-generation
- Tags system with knowledge item mapping
- Dashboard with live data from all modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:43:51 +00:00
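For reference, the SM-2 scheduling step mentioned for Study Cards in 9929322de0 is commonly written as below. This follows the widely published formulation (quality grades 0 to 5, minimum easiness 1.3); whether the repository uses exactly these field names or this variant is an assumption.

```javascript
// One SM-2 review step. card = { repetitions, interval (days), easiness }.
function sm2(card, quality) {
  let { repetitions, interval, easiness } = card;

  if (quality >= 3) {
    // Successful recall: 1 day, then 6 days, then previous interval * easiness.
    if (repetitions === 0) interval = 1;
    else if (repetitions === 1) interval = 6;
    else interval = Math.round(interval * easiness);
    repetitions += 1;
  } else {
    // Failed recall: restart the repetition sequence.
    repetitions = 0;
    interval = 1;
  }

  // Easiness update, clamped at 1.3.
  const delta = 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02);
  easiness = Math.max(1.3, easiness + delta);

  return { repetitions, interval, easiness };
}
```

A new card would typically start at { repetitions: 0, interval: 0, easiness: 2.5 } and be passed through sm2 after each review.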
3d2aa6cf46 Add backend/frontend scaffolding with Oracle ADB wallet config
- Backend: Spring Boot 3 + WebFlux, JWT auth, Oracle ADB wallet,
  8 controllers/services/repositories (Auth~Tag), DTOs, exception handling
- Frontend: Next.js 15, TypeScript, Tailwind CSS, AuthContext,
  7 pages (dashboard, knowledge, chat, study, todos, habits, login)
- DB: V1 migration with 12 tables including VECTOR(1024) + HNSW index
- Ops: PM2 ecosystem config, deploy.sh, start-backend.sh
- CLAUDE.md: DB credentials replaced with env var references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:56:26 +00:00
5d8b0fcdb8 Initial project setup with env template and gitignore
- .gitignore: Java/Maven, Node.js, IDE, OS, credentials
- .env.sample: backend + frontend environment variable template
- README.md: project overview and getting started guide
- CLAUDE.md: development rules and guidelines
- docs/: SUNDOL spec and design patterns guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:37:07 +00:00