# Yuki

A knowledge management system designed as a "Second Brain" for web content.
```
┌─────────────────────────────────────────────────────────────────────────┐
│                      KNOWLEDGE MANAGEMENT SYSTEM                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     nginx (Reverse Proxy)                       │   │
│   │                           Port 80                               │   │
│   │           /api/* → yuki-core      /* → yuki-reader              │   │
│   └───────────────────────────────┬─────────────────────────────────┘   │
│                                   │                                     │
│           ┌───────────────────────┼───────────────────────┐             │
│           │                       │                       │             │
│           ▼                       ▼                       ▼             │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐         │
│   │   yuki-core     │    │   yuki-reader   │    │   Ollama    │         │
│   │   (Unified      │    │   (Web UI)      │    │   (LLM)     │         │
│   │    Backend)     │    │   Port 3000     │    │ Port 11434  │         │
│   │   Port 8000     │    │   (internal)    │    │ (External)  │         │
│   └─────────────────┘    └─────────────────┘    └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
```
## Features

### Content Collection
- Multi-client fetching: httpx → cloudscraper → nodriver fallback chain
- Smart extraction: Site-specific extractors (VnExpress, Wikipedia, Medium, etc.)
- Background crawling: Async jobs with progress tracking
- Scheduled crawling: Cron-based automatic re-crawl
- Content change detection: Track updates to previously crawled content
- Event system: Webhooks for crawl/item events
- Admin dashboard: Monitor jobs and system status
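The multi-client fallback chain above can be sketched as trying each fetcher in order until one succeeds. This is a minimal illustration, not the project's actual implementation; the stub lambdas stand in for the real httpx, cloudscraper, and nodriver clients:

```python
# Illustrative sketch of a fetch-with-fallback chain. The stub clients
# below simulate: httpx raising a timeout, cloudscraper returning nothing,
# and nodriver succeeding -- forcing the chain to fall through in order.
from typing import Callable, Optional

def fetch_with_fallback(url: str, clients: list) -> tuple:
    """Try each (name, fetcher) in order; return (name, html) on first success."""
    errors = []
    for name, fetch in clients:
        try:
            html = fetch(url)
            if html:  # empty/None responses count as failure
                return name, html
            errors.append(f"{name}: empty response")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all clients failed for {url}: {'; '.join(errors)}")

# Stub clients: the first two fail, so the chain falls back to the last.
clients = [
    ("httpx", lambda url: (_ for _ in ()).throw(TimeoutError("blocked"))),
    ("cloudscraper", lambda url: None),
    ("nodriver", lambda url: "<html>rendered page</html>"),
]
name, html = fetch_with_fallback("https://example.com", clients)
print(name)  # nodriver
```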
### Knowledge Processing
- Embeddings: Generate vector embeddings via Ollama
- Named Entity Recognition: Extract people, places, organizations
- Knowledge graph: Build entity relationships
- Semantic search: Find content by meaning, not just keywords
- Entity deduplication: Merge duplicate entities automatically
- Real-time updates: WebSocket for processing status
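At its core, semantic search ranks stored items by vector similarity to the query's embedding. A minimal sketch, using hand-made vectors in place of the real Ollama embeddings and LanceDB storage:

```python
# Toy semantic search: cosine similarity between a query vector and
# stored item vectors. In the real system, vectors would come from the
# nomic-embed-text model and live in LanceDB; here they are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_search(query_vec, items, limit=10):
    """Rank (item_id, vector) pairs by similarity to the query vector."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in items]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:limit]

items = [("doc-ml", [0.9, 0.1, 0.0]), ("doc-cooking", [0.0, 0.2, 0.9])]
results = semantic_search([1.0, 0.0, 0.0], items, limit=1)
print(results[0][0])  # doc-ml ranks first
```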
## Quick Start

### Using Docker (Recommended)

```bash
# Clone the repository
git clone https://github.com/user/yuki.git
cd yuki

# Start all services
docker compose up -d

# View logs
docker compose logs -f
```
Services available at:
- Web UI: http://localhost/
- API: http://localhost/api/
- Health: http://localhost/api/health
### Manual Setup

```bash
# yuki-core
cd packages/yuki-core
uv sync
uv run uvicorn app.main:app --reload
```

Note: yuki-core requires Ollama running locally:

```bash
ollama pull nomic-embed-text
ollama pull qwen2.5:7b  # for NER
```
## API Examples

### Extract Content

```bash
# Single URL
curl -X POST http://localhost/api/extract \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

# Batch URLs
curl -X POST http://localhost/api/extract/batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/1", "https://example.com/2"]}'
```
### Background Crawl

```bash
# Start crawl job
curl -X POST http://localhost/api/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/series", "mode": "collection"}'

# Check status
curl http://localhost/api/crawl/{job_id}

# Cancel job
curl -X DELETE http://localhost/api/crawl/{job_id}
```
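The three endpoints above map to a start / status / cancel lifecycle for crawl jobs. A toy in-memory sketch of that lifecycle (purely illustrative — the real service runs jobs asynchronously with progress tracking):

```python
# Minimal in-memory model of the crawl-job lifecycle: start a job,
# query its status, cancel it. Job ids are random UUIDs, mirroring the
# {job_id} placeholder in the endpoints above.
import uuid

class CrawlJobs:
    def __init__(self):
        self.jobs = {}

    def start(self, url: str, mode: str = "collection") -> str:
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"url": url, "mode": mode,
                             "status": "running", "progress": 0}
        return job_id

    def status(self, job_id: str) -> dict:
        return self.jobs[job_id]

    def cancel(self, job_id: str) -> None:
        self.jobs[job_id]["status"] = "cancelled"

jobs = CrawlJobs()
jid = jobs.start("https://example.com/series")
jobs.cancel(jid)
print(jobs.status(jid)["status"])  # cancelled
```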
### Semantic Search

```bash
# Search by meaning
curl -X POST http://localhost/api/search/semantic \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning applications", "limit": 10}'
```
### Knowledge Graph

```bash
# Get entities
curl "http://localhost/api/entities?type=PERSON&limit=20"

# Get entity relationships
curl http://localhost/api/graph/entity/{entity_id}/relations
```
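Conceptually, the relations lookup above is an adjacency-list query: given an entity id, return its outgoing edges. A sketch with invented entity ids and relation labels (the real graph schema is not shown in this README):

```python
# Toy adjacency-list model of an entity-relations lookup, in the spirit
# of /api/graph/entity/{entity_id}/relations. All ids and relation
# labels here are made up for illustration.
from collections import defaultdict

graph = defaultdict(list)

def add_relation(source: str, relation: str, target: str) -> None:
    graph[source].append({"relation": relation, "target": target})

def relations(entity_id: str) -> list:
    # .get avoids creating an empty entry for unknown entities
    return graph.get(entity_id, [])

add_relation("person:ada-lovelace", "worked_with", "person:charles-babbage")
add_relation("person:ada-lovelace", "wrote_about", "concept:analytical-engine")
print(len(relations("person:ada-lovelace")))  # 2
```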
### Data Access

```bash
# List collections
curl http://localhost/api/collections

# Get items in collection
curl "http://localhost/api/items?collection_id={id}"

# Get plain text content (optimized for AI)
curl http://localhost/api/items/{id}/text
```
## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `YUKI_PORT` | `8000` | API server port |
| `YUKI_DB_PATH` | `./data/yuki-core.lance` | Database path |
| `YUKI_DEFAULT_CLIENT` | `auto` | `httpx` / `cloudscraper` / `nodriver` / `auto` |
| `YUKI_MIN_DELAY_MS` | `1000` | Min delay between requests |
| `YUKI_MAX_CONCURRENT_JOBS` | `10` | Max concurrent crawl jobs |
| `YUKI_OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `YUKI_OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Embedding model |
| `YUKI_OLLAMA_LLM_MODEL` | `qwen2.5:7b` | LLM for NER |
| `YUKI_WORKER_ENABLED` | `true` | Enable background processing |
| `YUKI_MAX_CONCURRENT_TASKS` | `3` | Concurrent processing tasks |
| `YUKI_API_KEY_ENABLED` | `false` | Enable API key auth |
| `YUKI_RATE_LIMIT_ENABLED` | `false` | Enable rate limiting |
| `YUKI_WEBHOOKS_ENABLED` | `false` | Enable webhook events |
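A sketch of how these variables and their defaults might be read at startup. The names and defaults come from the table above; the plain `os.environ`-style parsing is an assumption (the actual app may use a settings library instead):

```python
# Read a subset of the documented environment variables, falling back to
# the table's defaults. Boolean flags are parsed from "true"/"false".
import os

def load_config(env=None):
    env = os.environ if env is None else env
    as_bool = lambda v: v.lower() == "true"
    return {
        "port": int(env.get("YUKI_PORT", "8000")),
        "db_path": env.get("YUKI_DB_PATH", "./data/yuki-core.lance"),
        "default_client": env.get("YUKI_DEFAULT_CLIENT", "auto"),
        "min_delay_ms": int(env.get("YUKI_MIN_DELAY_MS", "1000")),
        "max_concurrent_jobs": int(env.get("YUKI_MAX_CONCURRENT_JOBS", "10")),
        "worker_enabled": as_bool(env.get("YUKI_WORKER_ENABLED", "true")),
        "api_key_enabled": as_bool(env.get("YUKI_API_KEY_ENABLED", "false")),
    }

cfg = load_config({"YUKI_PORT": "9000"})
print(cfg["port"], cfg["worker_enabled"])  # 9000 True
```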
## Project Structure

```
yuki/
├── packages/
│   ├── yuki-core/            # Unified Backend (port 8000)
│   │   ├── app/
│   │   │   ├── api/          # REST endpoints
│   │   │   ├── fetcher/      # HTTP clients (httpx, cloudscraper, nodriver)
│   │   │   ├── processor/    # Content extraction by domain
│   │   │   ├── processors/   # Embedder, NER, Relations
│   │   │   ├── services/     # Business logic
│   │   │   ├── storage/      # LanceDB (12 tables)
│   │   │   └── worker/       # Background processing pipeline
│   │   └── tests/
│   │
│   └── yuki-reader/          # Web UI (port 3000)
│       └── ...
│
├── nginx/                    # Reverse proxy config
├── docs/                     # Documentation
├── docker-compose.yml
└── data/                     # Runtime data (gitignored)
    └── yuki-core/
```
## API Security
When deploying publicly, enable API key authentication:
```bash
# Generate secure API key
python -c "import secrets; print(secrets.token_urlsafe(32))"

# Set environment variables
YUKI_API_KEY_ENABLED=true
YUKI_API_KEY=your-generated-key-here
```
Include API key in requests:
```bash
curl http://localhost/api/items \
  -H "X-API-Key: your-api-key-here"
```
Public endpoints (no API key required):
- `GET /` - API info
- `GET /health` - Health check
- `GET /docs` - Swagger documentation
- `GET /metrics` - Prometheus metrics
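The check described above amounts to: allow public paths through, otherwise require a matching `X-API-Key` header. A minimal sketch (the public paths come from the list above; the standalone function is illustrative — a real deployment would do this in server middleware):

```python
# Illustrative API-key gate: public paths pass, everything else must
# present the correct X-API-Key header. Uses a constant-time comparison
# to avoid timing side channels.
import hmac

PUBLIC_PATHS = {"/", "/health", "/docs", "/metrics"}

def is_authorized(path: str, headers: dict, api_key: str,
                  enabled: bool = True) -> bool:
    if not enabled or path in PUBLIC_PATHS:
        return True
    supplied = headers.get("X-API-Key", "")
    return hmac.compare_digest(supplied, api_key)

print(is_authorized("/health", {}, "secret"))                           # True
print(is_authorized("/api/items", {}, "secret"))                        # False
print(is_authorized("/api/items", {"X-API-Key": "secret"}, "secret"))   # True
```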
## Testing

```bash
cd packages/yuki-core && uv run pytest

# With coverage
uv run pytest --cov=app --cov-report=html
```
## Documentation
- Architecture - System design and data flow
- API Reference - Complete API documentation
- Getting Started - Setup guide
- User Guide - Usage examples
- Roadmap - Planned features
## License

MIT