Conversation
…ved-tokens-logging
Codecov Report ✅ All modified and coverable lines are covered by tests.
stephantul
left a comment
I think most of the code here is overly defensive: lots of guards and checks. But we control the data sources on both ends, so there is likely no need to check most of these things. Specifically, I think the `get`s with defaults of 0 actually hurt rather than help, because they obscure other bugs and can lead to silent over- or underestimations of the stats.
    index = SembleIndex(model, bm25, vicinity, chunks)
    index._file_sizes = SembleIndex._compute_file_sizes(path, chunks)
This can be `self._compute_file_sizes`. I would not call it here; just make it part of `__init__`, or, if it is fast enough, make it a property. There's also, afaik, no need to make this a staticmethod: calling
    index.recompute_file_sizes()
is completely fine. I realize you do need the root, but the root could be part of the SembleIndex.
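A minimal sketch of the suggestion, assuming the root lives on the index and file sizes become a lazy property (the constructor here is simplified and the names `root`, `file_sizes`, and the chunk's `file_path` attribute are illustrative, not the real API):

```python
from pathlib import Path


class SembleIndex:
    """Sketch only: store the root on the index so file sizes can be a
    lazily computed property instead of a staticmethod called externally."""

    def __init__(self, chunks, root):
        self.chunks = chunks    # assumed: each chunk has a .file_path relative to root
        self.root = Path(root)
        self._file_sizes = None  # computed on first access

    @property
    def file_sizes(self):
        if self._file_sizes is None:
            self._file_sizes = self._compute_file_sizes()
        return self._file_sizes

    def _compute_file_sizes(self):
        # Instance method: reads self.root, so no path argument is needed.
        paths = {chunk.file_path for chunk in self.chunks}
        return {p: (self.root / p).stat().st_size for p in paths}
```

With this shape, callers just read `index.file_sizes` and the computation happens once, on demand.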
    results = search_semantic(target.content, self.model, self._semantic_index, self.chunks, top_k + 1, selector)
    return [r for r in results if r.chunk != target][:top_k]
    results = [r for r in results if r.chunk != target][:top_k]
    if self._file_sizes:
I don't think this ever needs to be checked. It's probably just better to populate file_sizes here if it doesn't exist already.
    """Save stats about a search or find_related call to the stats file."""
    try:
        snippet_chars = sum(len(result.chunk.content) for result in results)
        file_chars = sum(file_sizes.get(path, 0) for path in {result.chunk.file_path for result in results})
It is probably better to disregard chunks for files for which we don't have size information. In fact, I think not having a specific file here points to some other failure.
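A sketch of what the comment suggests, with a hypothetical helper name (`total_file_chars` is not in the PR): sum only over files that actually have a size entry, rather than folding missing ones in as 0.

```python
def total_file_chars(results, file_sizes):
    # Count sizes only for files we have entries for; a path missing from
    # file_sizes likely indicates a bug upstream, so don't count it as 0.
    paths = {result.chunk.file_path for result in results}
    return sum(file_sizes[p] for p in paths if p in file_sizes)
```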
    file_chars = sum(file_sizes.get(path, 0) for path in {result.chunk.file_path for result in results})

    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
Don't write ISO format; just dump the timestamp.
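As a sketch of the suggested change, the record would carry a numeric Unix timestamp instead of an ISO string, so downstream bucketing is plain arithmetic with no string parsing:

```python
from datetime import datetime, timezone

# Store a numeric Unix timestamp rather than an ISO string.
record = {"ts": datetime.now(timezone.utc).timestamp()}
```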
        in_today = record_date == today
        in_last_7 = record_date > seven_days_ago
    except ValueError:
        in_today = in_last_7 = False  # unparseable timestamp: count in All time only
How could this happen? AFAIK we write this ourselves. In any case, you should not parse a timestamp to a date and then reparse it again; just use a timestamp.
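With numeric timestamps, the bucketing above reduces to two comparisons. A sketch, with a hypothetical `bucket_flags` helper (`now` is passed in for testability):

```python
from datetime import datetime, timedelta, timezone


def bucket_flags(ts, now):
    """Sketch: bucket a Unix timestamp without any string parsing."""
    start_of_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    seven_days_ago = now - timedelta(days=7)
    # Both buckets are plain float comparisons against precomputed bounds.
    return ts >= start_of_today.timestamp(), ts >= seven_days_ago.timestamp()
```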
    ]
    for label, bucket in summary.buckets.items():
        saved_chars = max(0, bucket.file_chars - bucket.snippet_chars)
        saved_tokens = saved_chars // 4  # standard ~4 chars/token approximation
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue
Maybe warn here; we don't expect this to happen.
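A sketch of the suggested behavior, wrapped in a hypothetical `load_records` helper: since we write the stats file ourselves, a corrupt line is unexpected and worth surfacing.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_records(lines):
    # A corrupt line in our own stats file is unexpected:
    # warn instead of skipping silently.
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            logger.warning("Skipping unparseable stats line: %r", line)
    return records
```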
        continue
    snippet_chars = record.get("snippet_chars", 0)
    file_chars = record.get("file_chars", 0)
    call_type = record.get("call", "search")
When would we expect any of these to be missing? We write these ourselves.
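A sketch of the implied fix, with a hypothetical `unpack_record` helper: index the keys directly so a missing key raises a loud `KeyError` instead of being masked by a default.

```python
import json


def unpack_record(line):
    # We write the stats file ourselves, so a missing key is a bug;
    # direct indexing fails loudly rather than defaulting.
    record = json.loads(line)
    return record["snippet_chars"], record["file_chars"], record["call"]
```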
        raise ValueError(f"Unknown search mode: {mode!r}")
    else:
        raise ValueError(f"Unknown search mode: {mode!r}")
    if self._file_sizes:
This PR adds `semble savings`, a CLI command that tracks and displays token savings across all searches. Stats are recorded automatically to `~/.semble/savings.jsonl` on every search. There's also a verbose output that shows the calls per call type, but it's not that interesting (yet) since we only have `search` and `find_related` atm. Example output 💅 ✨: