AGENTS / GITHUB / arxiv-brew
githubinferredactive

arxiv-brew

provenance:github:Surefire618/arxiv-brew

Keyword-based arXiv paper filtering and digest generation. Designed for LLM agent integration.

View Source ↗First seen 16d agoNot yet hireable
README
# arxiv-brew

Keyword-based arXiv paper filtering and digest generation. Designed to be called by LLM agents for automated daily literature monitoring.

## How it works

```mermaid
flowchart LR
    A["Scrape new IDs\nfrom arxiv.org"] --> B["Fetch metadata\nvia Atom API"]
    B --> C["Keyword filter\nagainst topic clusters"]
    C --> D["Download full text\n(HTML / PDF)"]
    D --> E["Generate digest\ngrouped by cluster"]
    C -.->|optional| F["LLM refine\n+ learn keywords"]
    F -.-> C
```

- **Pull** — scrape today's new paper IDs from configured arXiv categories, fetch metadata via Atom API
- **Filter** — match titles and abstracts against your keyword clusters (word-boundary and context-aware)
- **Download** — retrieve full text, HTML preferred, PDF fallback
- **Digest** — extract affiliations, format results grouped by topic cluster
- **LLM Refine** *(optional)* — an agent judges candidate relevance and suggests new keywords, which are persisted for future runs

## Install

```bash
git clone https://github.com/Surefire618/arxiv-brew.git
cd arxiv-brew
pip install .
```

For development:

```bash
pip install -e .
```

Python 3.10+, stdlib only. No external dependencies.

## Setup

```bash
# 1. Create your research profile
arxiv-brew init

# 2. Edit config/my_research.md with your arXiv categories and keywords
```

**Hint:** if you do not want to write `my_research.md` from scratch, you can ask your chatbot to generate it from the template.

Prompt example:

```text
I am setting up arxiv-brew for daily literature monitoring.
Please generate a complete `my_research.md` file in the exact format of `config/my_research.md.template`.

Use your best understanding of my research field to fill in:
- relevant arXiv categories
- 2-5 topic clusters
- word-boundary keywords
- broad keywords
- context keywords

Requirements:
- keep the same section structure as the template
- output only the final markdown file
- prefer precise technical keywords over vague topics
- include both core methods and target materials / systems when relevant
- do not invent personal details; infer only from my research area

My research area:
[describe your field here]
```

# 3. Build the keyword database
arxiv-keywords update
```

## Usage

```bash
# Daily digest (readable text)
arxiv-brew brew

# JSON output (for agents / scripts)
arxiv-brew brew --json

# Also save full JSON to file
arxiv-brew brew --output result.json
```

### Managing keywords (optional)

You can inspect and edit your keyword database at any time.

```bash
# Show current keywords grouped by cluster
arxiv-keywords list

# Add a keyword: arxiv-keywords add <cluster> <keyword>
arxiv-keywords add "ML Potentials" "graph neural network"

# Remove a keyword: arxiv-keywords remove <cluster> <keyword>
arxiv-keywords remove "ML Potentials" "graph neural network"

# After editing config/my_research.md, sync changes (preserves LLM-learned keywords)
arxiv-keywords update

# Full reset from profile (discards LLM-learned keywords)
arxiv-keywords reset
```

### Step-by-step pipeline (optional)

The commands below run individually what `arxiv-brew brew` does in one step.

```bash
arxiv-pull -o papers.json
arxiv-download papers.json -o downloaded.json
arxiv-summarize downloaded.json --digest-dir digests/
```

### Archive management

```bash
arxiv-db status
arxiv-db cleanup --retention-days 14
arxiv-db cleanup --force              # skip confirmation, useful in cron jobs
```

## Configuration

All keywords and categories come from your research profile (`config/my_research.md`). See [`config/my_research.md.template`](config/my_research.md.template) for the format.

The profile defines:
- **Categories** — which arXiv categories to scan (e.g. `cs.CL`, `cond-mat.mtrl-sci`). Full list: https://arxiv.org/category_taxonomy
- **Topic clusters** — groups of keywords (e.g. "NLP", "Reinforcement Learning")
- **Word boundary keywords** — short acronyms matched as whole words only
- **Broad keywords** — generic terms that require a context keyword to co-occur
- **Context keywords** — terms that validate broad keyword matches

## Agent integration

See [docs/agent_integration.md](docs/agent_integration.md) for how to use arxiv-brew with LLM agents (Claude Code, Codex, OpenClaw, etc.).

## CLI reference

| Command | Description |
|---|---|
| `arxiv-brew brew` | Run pipeline, print today's digest |
| `arxiv-brew brew --json` | Same, but JSON output |
| `arxiv-brew init` | Create research profile |
| `arxiv-brew refine` | Apply LLM refinement to candidates |
| `arxiv-keywords list` | Show all keywords |
| `arxiv-keywords add CLUSTER KW` | Add a keyword |
| `arxiv-keywords remove CLUSTER KW` | Remove a keyword |
| `arxiv-keywords update` | Sync from research profile |
| `arxiv-keywords reset` | Rebuild from profile |
| `arxiv-pull` | Pull and filter papers |
| `arxiv-download` | Download full text |
| `arxiv-summarize` | Generate digest |
| `arxiv-db` | Archive management |

All commands support `--help`.

## License

MIT

PUBLIC HISTORY

First discoveredApr 1, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.

Is this yours? Claim it →

METADATA

platformgithub
first seenMar 31, 2026
last updatedMar 31, 2026
last crawled1 day ago
version

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:Surefire618/arxiv-brew)