CU-8696nbm9j: Add module to convert vocab vectors #504

Merged (1 commit) on Nov 27, 2024


@mart-r (Collaborator) commented Nov 14, 2024

Adds a module to convert the vocab vectors from the default length (or really any length) to a smaller one.

The default vocab vector length is 300. However, we don't really make use of all this information. Experiments show that we can use a much smaller vector length and retain the same performance. See e.g.: https://gist.github.com/mart-r/e9db909cde1922464bcc753f54006994
Or (somewhat more comprehensively): https://gist.github.com/mart-r/21460286466d17b9f23719ba3f4dc938
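
To illustrate what shortening the vectors could look like, here is a minimal sketch that projects 300-length vectors down to 50 using PCA (via NumPy's SVD). PCA is an assumption made for illustration; this description does not state which technique the module uses. The function name `downsize_vectors` and the random stand-in data are hypothetical and not part of the MedCAT API.

```python
import numpy as np


def downsize_vectors(vectors: np.ndarray, target_len: int) -> np.ndarray:
    """Project row vectors onto their top `target_len` principal components.

    Illustrative only: PCA is an assumed technique, not necessarily
    what the module in this PR does.
    """
    n_rows, n_dims = vectors.shape
    if not 0 < target_len <= min(n_rows, n_dims):
        raise ValueError("target_len must be in 1..min(n_rows, n_dims)")
    # PCA requires the data to be centered on the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_len].T


# Random stand-in for a vocab: 200 words with vectors of length 300.
rng = np.random.default_rng(0)
original = rng.normal(size=(200, 300))
small = downsize_vectors(original, 50)
print(original.shape, "->", small.shape)
```

Any context vectors derived from the original vocab vectors would need to be recomputed or projected in the same way, since they live in the same vector space.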

The benefits of using a smaller vector length mainly boil down to the following (examples use a vector length of 50):

  • Smaller saved vocab on disk
    • The saved vocab can go from 314MB down to 142MB
    • The CDB size will also go down significantly
      • Because the context vectors stored within it depend on the vectors in the Vocab
      • In a MIMIC-IV trained model it went from 1.7GB to 1.3GB
      • This effect can be larger if more concepts have been trained
    • The model pack size will also change accordingly
      • Normal model (MIMIC-IV trained) zip was 1.0GB
      • Downsized model (same model) zip was 410MB
  • Potentially faster load/save times
    • Since the files will be smaller
      • Though I don't have good evidence for that
    • Loading already unpacked (the difference may well be run-to-run variance)
      • Normal: 17.4s
      • Downsized: 17.1s
    • Loading before unpacking
      • Normal: 25.7s
      • Downsized: 22.1s

NOTE:
There might be improvements we could make here:

  • Should this be in another module?
  • Should we add CLI for model pack conversion?
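
If a CLI were added, its argument handling might look something like the sketch below. Everything here is hypothetical: the program name, the argument names and the default of 50 are illustrative assumptions, and the actual conversion call is left out because this description does not show the module's API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical model pack conversion command."""
    parser = argparse.ArgumentParser(
        prog="convert-vocab-vectors",  # hypothetical command name
        description="Downsize the vocab vectors of a model pack.",
    )
    parser.add_argument("model_pack", help="path to the existing model pack")
    parser.add_argument("output", help="path to write the converted model pack")
    parser.add_argument(
        "--vector-length",
        type=int,
        default=50,  # matches the example vector length used above
        help="target vocab vector length",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["in.zip", "out.zip", "--vector-length", "50"])
print(args.model_pack, args.output, args.vector_length)
```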

@tomolopolis
Member

@mart-r mart-r merged commit b96310b into master Nov 27, 2024
7 checks passed
@mart-r mart-r deleted the CU-8696nbm9j-downsize-vocab-vectors branch January 23, 2025 09:53