Skip to content

Add scripts for scraping more models#2

Merged
JerryLife merged 3 commits intoXtra-Computing:devfrom
yuqiannemo:dna_dev/more_models
Mar 22, 2026
Merged

Add scripts for scraping more models#2
JerryLife merged 3 commits intoXtra-Computing:devfrom
yuqiannemo:dna_dev/more_models

Conversation

@yuqiannemo
Copy link
Contributor

No description provided.

Copy link
Collaborator

@JerryLife JerryLife left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good — a few items to address before merging:

Copy link
Collaborator

@JerryLife JerryLife left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for adding the model scraping scripts! A few items to address before merging.


1. Output path — write to configs/

The pipeline stores model lists in configs/ following a {source}_llm_list naming convention:

Existing file Purpose
configs/llm_list.txt Generic model list (HuggingFace)
configs/OpenRouter_llm_list.txt OpenRouter model list
configs/llm_metadata.json Model metadata

The new scripts should follow the same convention. Suggested output paths:

  • closed_source_models.pyconfigs/openrouter_llm_list.jsonl
  • open_source_models.pyconfigs/huggingface_llm_list.jsonl

Use a path relative to the script's location so it works regardless of the working directory:

from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
output_path = ROOT / "configs" / "openrouter_llm_list.jsonl"

2. Add if __name__ == "__main__" guards

Both scripts execute API calls and I/O at module level. This means they cannot be imported without side effects (e.g., for testing or reuse). Please wrap the execution logic:

def main():
    # ... existing logic ...

if __name__ == "__main__":
    main()

3. Add error handling for HTTP requests

closed_source_models.py lines 5–6 — the requests.get() call has no timeout, no status code check, and no exception handling. Suggested fix:

response = requests.get(url, timeout=30)
response.raise_for_status()

4. Declare dependencies

requests and huggingface_hub are not currently declared in the project's dependencies. Either:

  • Add them as optional dependencies in pyproject.toml (e.g., under [project.optional-dependencies]), or
  • Add a comment at the top of each script noting the required packages.

5. modality field is always "unknown"

closed_source_models.py line 43 — architecture.get("modality") does not exist in the OpenRouter API schema. The script already reads input_modalities and output_modalities; consider constructing a modality string from those, e.g.:

input_mods = architecture.get("input_modalities", [])
output_mods = architecture.get("output_modalities", [])
"modality": f"{'|'.join(input_mods)} -> {'|'.join(output_mods)}"

Summary of requested changes

# Item Priority
1 Write output to configs/ with consistent naming (openrouter_llm_list.jsonl, huggingface_llm_list.jsonl) Required
2 Add if __name__ == "__main__" guards Required
3 Add timeout and raise_for_status() to HTTP request Required
4 Declare requests / huggingface_hub dependencies Required
5 Fix dead modality field Minor

@JerryLife
Copy link
Collaborator

Noted that both scripts should be merged into dev branch. Not the main branch.

@JerryLife JerryLife self-assigned this Mar 19, 2026
@JerryLife JerryLife added the enhancement New feature or request label Mar 19, 2026
@yuqiannemo
Copy link
Contributor Author

Noted that both scripts should be merged into dev branch. Not the main branch.

May I ask where the dev branch is? On my side, I saw that only the main branch is available in the current repo. Thanks!

@JerryLife JerryLife changed the base branch from main to dev March 22, 2026 04:54
@JerryLife JerryLife merged commit 358ea19 into Xtra-Computing:dev Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants