This repository was archived by the owner on Jan 14, 2021. It is now read-only.

Hi Julien, I have committed the changes to optionally generate the vector in the same step, and to expose the vector params to the plugin #1

Open
wants to merge 5 commits into
base: master


hugopinto

Now the plugin generates a vector file in libsvm format, ready for training, using 4 additional parameters:
minFreq and maxFreq filter out features whose frequency is below the minimum or above the maximum, respectively.
From the remaining ones, we keep the keepNBestAttributes best (before, it was possible to filter either by min/max or by n best).
Finally, we let the user control whether to compactLexicon or not, so that indices remain contiguous instead of having gaps due to the filtered-out features.
CREOLE now has:

  <PARAMETER NAME="minFreq" RUNTIME="true" DEFAULT="1" OPTIONAL="true">java.lang.Integer</PARAMETER>
  <PARAMETER NAME="maxFreq" RUNTIME="true" DEFAULT="2147483647" OPTIONAL="true">java.lang.Integer</PARAMETER>
  <PARAMETER NAME="keepNBestAttributes" RUNTIME="true" DEFAULT="0" OPTIONAL="true">java.lang.Integer</PARAMETER>
  <PARAMETER NAME="compactLexicon" RUNTIME="true" DEFAULT="True" OPTIONAL="true">java.lang.Boolean</PARAMETER>
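
To make the order of the steps concrete, here is a minimal Java sketch of the pipeline described above: frequency cut-offs first, then n-best selection, then optional index compaction, and finally one libsvm-format line per instance. It is not the plugin's actual code. The class `LexiconFilterSketch`, all of its method names, the plain maps standing in for the lexicon, and the rule that `keepNBestAttributes <= 0` means "keep everything" are assumptions made for illustration. The per-attribute score is supplied by the caller, since this description does not say which scoring method the plugin uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the plugin or of the TC API.
final class LexiconFilterSketch {

    /** Keeps attribute indices whose frequency lies in [minFreq, maxFreq]. */
    public static List<Integer> filterByFreq(Map<Integer, Integer> freq, int minFreq, int maxFreq) {
        List<Integer> kept = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : freq.entrySet()) {
            if (e.getValue() >= minFreq && e.getValue() <= maxFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    /** Keeps the n highest-scoring indices; n <= 0 is assumed to mean "keep all". */
    public static List<Integer> keepNBest(List<Integer> indices, Map<Integer, Double> score, int n) {
        if (n <= 0 || n >= indices.size()) {
            return new ArrayList<>(indices);
        }
        List<Integer> sorted = new ArrayList<>(indices);
        sorted.sort(Comparator.comparingDouble((Integer i) -> score.getOrDefault(i, 0.0)).reversed());
        return new ArrayList<>(sorted.subList(0, n));
    }

    /** Maps old indices to new ones: contiguous from 1 if compactLexicon, else unchanged. */
    public static Map<Integer, Integer> buildIndexMap(List<Integer> kept, boolean compactLexicon) {
        List<Integer> sorted = new ArrayList<>(kept);
        Collections.sort(sorted);
        Map<Integer, Integer> remap = new LinkedHashMap<>();
        int next = 1;
        for (int old : sorted) {
            remap.put(old, compactLexicon ? next++ : old);
        }
        return remap;
    }

    /** Formats one instance as a libsvm line: label, then index:value pairs in ascending index order. */
    public static String toLibsvmLine(int label, Map<Integer, Double> features, Map<Integer, Integer> remap) {
        TreeMap<Integer, Double> out = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            Integer idx = remap.get(e.getKey());
            if (idx != null) {
                out.put(idx, e.getValue()); // attributes that were filtered out are dropped
            }
        }
        StringBuilder sb = new StringBuilder().append(label);
        for (Map.Entry<Integer, Double> e : out.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

With compaction enabled, surviving attributes 2 and 4 become indices 1 and 2 in the output file. With it disabled they keep their original indices, and the gaps left by the filtered attributes remain.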

Hugo Pinto added 3 commits April 13, 2011 16:43

@hugopinto
Author

Sorry, I messed up: I just realized that a simple commit does not commit all modified files. Now they are all in git.

@jnioche
Member

jnioche commented Apr 14, 2011

Hi Hugo,

Thanks for sharing this. A few comments below:

  • keepNBestAttributes : I'd rather not use this. It is based on some dubious methods for estimating the usefulness of attributes. If we add this then we'd also need to be able to specify which scoring method to use, the threshold, etc. In practice this is something that only advanced users might be interested in, and as such it should be used on the command line
  • generation of the vector : OK to use the SVM format as this is the only type of trainer currently available. We might want to be able to specify the trainer implementation later
  • option "generateVectors" : obviously a boolean one. I thought you'd add something like this to determine whether or not to generate the vectors.
  • min / max : well, we used to be able to specify that before and I removed it. The reason was that often you don't want to make any assumptions as to which values to choose: you leave everything in, check the accuracy, and then play with various cut-off points. When doing that you'd use the command line in order not to have to regenerate the raw file. I'd leave these options as they currently are, i.e. on the command line

The idea behind the generation of the vectors from the PR is to do what most people do first i.e. make no assumptions as to what works best and try without any filtering.

Any reason not to use the latest stable version of the TC API (1.5?)? Have you changed something on that front?

Hugo Pinto added 2 commits April 19, 2011 12:54
…the end. For some reason, the default Lexicon.saveFile does not work.

It has a dependency on lib-svm, which was not available, so I added it. It seems that liblinear-with-deps is actually missing deps.
The CREOLE file was modified to account for the lib-svm dependency.