diff --git a/pharaoh/README.rst b/pharaoh/README.rst index 181e9f908..fc18a027a 100644 --- a/pharaoh/README.rst +++ b/pharaoh/README.rst @@ -2,243 +2,78 @@ Translation Pipeline ==================== -Giza Commands -------------- - - -* **build model** - - * ``giza translate build_translation_model --config `` - * ``giza translate bm -c `` - * These commands create a translation model with the configuration taken from the translate.yaml config. - * First create an empty directory that the script should run in. Specify that as the "project_path" in the config. - * Then create (or resuse) a directory that the auxilary corpus files (the tokenized files , truecased files, etc.) should be written to. Specify that as the "aux_corpus_files" in the config. - * Additionally, specify the paths to the top most mosesdecoder directory and the topmost irstlm directory - * Next create your corpora. This should be one file in the source language and one in the target language for training, tuning, and testing. You can't use multiple corpora with Moses easily. - - * These could be individual corpora such as KDE4 or Europarl, or they could be hybrid ones - * The create_corpora script can be used to create good hubrid corpora. See its documentation for how to use it. - * Specify the paths to the training, tuning and testing directories. The source and target language training files should be together, the files for tuning should be together, and the files for testing should be together. The three sets can have different directories though. The source and target language files for each should have the same name except for the final extensions. For example: in testdir/ you'd have test.es-en.en and test.es-en.es. When specifying the name in the config, leave off the final language extension. - - * Specify your run settings. - - * Threads is the number of threads for multi-threaded moses/irstlm/mgiza commands in any given process. Pool_size is the number of processes that build_model will run at once. - * phrase_table_name and reordering_name are a bit trickier. In general they are 'phrase_table' and 'reordering_table' in some cases- mainly when doing factorized or OSM models- this name changes to something like ``phrase_table.0-0,1``. This would be found under ``project/0/working/train/model``. As such you can't actually know exactly what the answer is before you run it. Usually this will just cause an error late in the script (around tuning or testing) and you'll have to fix the name at then rerun the whole script or those sections if you feel like editing the initial script. - - * Specify your training_parameters. If you know what you want to run you can just make a simple yaml attribute. If you make a list, as shown in the example, it will run all combinations of the parameters in parallel using as many processes as the pool-size allows. - * Run the build model command in the background. Expect it to take a long time. It should email you if it succeeds, however make sure to monitor if the process is still running. ``ps aux | grep 'moses'`` usually does the trick. - * Look at ``data.csv`` in the project directory to get the results from the test. The highest BLEU score is the best result. - * To see a sample from the model, look at ``project/0/working/test.en-es.translate.es`` (note es will be your target language). 
- * Information about the different configuration options can best be found in the Moses documentation: - - * http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters - * http://www.statmt.org/moses/?n=FactoredTraining.BuildReorderingModel - * http://www.statmt.org/moses/?n=FactoredTraining.AlignWords - * http://www.statmt.org/moses/?n=Moses.AdvancedFeatures - * http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases - - * There are sample configuration files in this directory. translate_full.yaml has all of the possible options, translate_best.yaml has the best options I found, translate_baseline.yaml has the moses documentation's baseline system. - -* **model results** - * ``giza translate model_results --config `` - * ``giza translate res -c `` - * If for some reason build model doesn't run ``model_results`` or you just want to run it again, this command will run it for you - * It takes the json file from build model and writes the data to a csv file and then emails the person in the config - - * **create corpora** - * ``giza translate create_corpora --config `` - * ``giza translate cc -c `` - * These commands create training, tuning, and testing corpora from mutliple different input corpora - * The first thing to do is create the config file. - - * The container_path is the path to the directory that the corpora will be placed in. If you provide just a name then a directory of that name will be placed in the current directory. - * The source section specifies what percentage of a given file goes to each of training, testing, tuning. You provide the name and the path to the source and target corpora and then the percentages that go into each. The percentages must add up to 100. - * The source contributions section specifices the percentage of each corpus that comes from each of the files. create_corpora finds the minimum total length of the corpus such that all of the lines are used at least once. If one corpus has a higher percentage than it has lines, its lines get repeated, emphasizing them more. For example, say ``f1`` is 100 lines, and under sources we allocate 60% to training. Let's say create_corpora finds that the training corpus should be 200 lines and the source_contributions says ``f1`` should comprise 80% of that corpus. Thus 160 lines need to be taken from ``f1``, so ``f1``'s first 60 lines will be put in twice and then we still need to put in 40 more lines so we'll add the first 40 lines in one more time. - * Create corpora creates both languages at the same time, you must specify the paths to each and the script verifies that they are the same length. - - * After creating the config just run the command and move the container wherever you'd like if you didn't specify it correctly off the bat. - -* **merge translations** - - * ``giza translate merge_translations --output --input ...`` - * ``giza translate mt -o -i ...`` - * These comamnds merge two files together line by line. This is useful for looking at different translations of the same file line by line next to each other. - * It annotates each line so that the user can better line things up. - * To use it just specify an output file and a list of input files. - - * The input files don't have to be the same length but it'll stop when it gets to the bottom of the shortest file. - * Currently it only works with 14 files because of the number of default annotations. If you want to use more files than that, just go into the operations file and add more annotations manually. 
- - * If you want to compare multiple models, or compare a model to a "correct" translation, or compare a model to the source language, this is the easiest way to visualize it. - -* **po to corpus** - - * ``giza translate po_to_corpus --po --source --target `` - * ``giza translate p2c --po -s -t `` - * These commands are used for creating corpora from po files - * If you have po files that have been translated by a person and are reliable these will parse through them and write them out line by line to parallel files. - * The source and target flags are used for specifying the output files. They are optional and if left off will use default files. - * If you have po files that are translated I highly recommend using them as corpora since they are the best data you could possibly have and are the most similar to the sentences you'll be translating. - -* **dict to corpus** - - * ``giza translate dict_to_corpus --dict --source --target `` - * ``giza translate d2c --dict -s -t `` - * These commands will turn a dictionary into a corpus - * This can be good for trying to fill in words that don't get translated, though adding dictionaries is not so effective as there are no actual phrases - * Dictionaries for this script can be gotten at http://www.dicts.info/uddl.php . - * This command works almost identically to po_to_corpus, though it doens't work for multiple input files. - -* **translate text doc** +Commands +-------- - * ``giza translate translate_text_doc --config --source --target --protected `` - * ``giza translate tdoc -c -s -t -p `` - * These commands will translate any file according to the model specified by the provided (or default) config. - * The file will be translated line by line, so it is primarily meant for text documents that are just text line after line, however obviously it could "translate" any other structured file - * The source is the file to translate, the target is the name of the file after translation. - * If there are regexes that you don't want to tokenize, --protected will handle them for you. +* **Verifier** - * This is good for not translating file names or urls. - * They will still be translated, but their tokens won't be separated off. Thus most likely if you have a special character in a word like a \`` or a < it will probably not be translated as it will have no precedent. + * ``pharaoh verifier`` + * This command starts a verifier server with the configuration found in the config file. + * It uses gunicorn to run the application. + * Use the verifier to have contributors edit and verify machine translations or translations produced by other contributors. + * First put the translations into the backend MongoDB database; the app then reads the database and lets users choose a file to edit. + * For every sentence, users can either approve it or edit it. + * Two users can't edit the same file at the same time, so they can't accidentally clash. + * Use the admin page to upload or download docs to the database instead of using the following two commands. A minimal launch sequence is sketched below.
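+
+ A minimal launch sequence, as a sketch (the ``mongod`` data path is an arbitrary assumption, and the port matches the command examples below; ``pharaoh verifier`` itself reads the included config)::
+
+    # start the backend database on the host/port the config expects
+    mongod --port 28000 --dbpath ~/data/verifier
+
+    # start the verifier web app (served through gunicorn)
+    pharaoh verifier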
-* **translate po** +* **po-to-mongo** - * ``giza translate translate_po --config --po --protected `` - * ``giza translate tpo -c --po -p `` - * These commands work just like translate text doc, but rather than translating one text doc they can translate one or more po files - * Just provide a link to a po file or a directory of them and it will traverse them all and translate them all. - * The po files will be translated in place so it's important to copy them beforehand. Moreover, the already translated entries will be emptied. + * ``pharaoh po-to-mongo --po ~/docs --username Judah --status approved --source_language en --target_language es --host localhost --port 28000 --dbname verifier`` + * This command takes po files and puts them into MongoDB. + * You can get the same functionality by navigating to the admin page of the verifier and uploading po files there. + * A new set of po files should be uploaded for every distinct translator, as the editor will be tagged with whatever username is provided. + * If the translations are trusted, the status can be uploaded as ``approved``; those entries won't be edited again. - * This is intentional as it makes it so every translation has a known source. It would be bad if we conflated human translations with machine translations. This way each set has a consistent source. +* **mongo-to-po** + * ``pharaoh mongo-to-po --po ~/docs --source_language en --target_language es --host localhost --port 28000 --dbname verifier --all`` + * This command takes the translations in the MongoDB database and puts them into po files. + * It injects the translations into the po files that are provided. + * It only overwrites the untranslated entries in those files. + * It is good practice to copy the po files first so that you have a backup. Setup ----- -* Follow the instructions in MosesSetup.sh. It is not meant to be a script that you simply run, rather go through it line by line or paragraph by paragraph running the commands. -* Be sure to read the comments as you go along, they may tell you alternate commands to run in certain situations. +* Use ``pip install -e .`` to install all dependencies. +* You will also need to install `MongoDB ` +* Start a ``mongod`` instance on any host and port and fix the host and port in the included config to match. +* Make sure you put any users in the database before having them make edits or uploading po files for them. +* This system is made to work well with `Giza ` Workflow -------- -1. Setup Moses, Giza, and IRSTLM as described above -2. Setup your corpora - - 1. Use more data for better results, preferably data similar to the documents you will be translating from - 2. Plan out the train, tune, and test corpora, with almost all data going to train. To do this first find as many parallel corpora as you want out of which you will create your train, tune, and test corpora - 3. If you have any translations in po files, use ``po_to_corpus`` to pull the data out to use as parallel corpora - 4. Use`` create_corpora`` to make your corpora. You will need to first create a ``corpora.yaml`` file similar to the sample one provided specifying how much of each file goes into train, tune, and test respectively and how much of the train, tune, and test copora will have lines from each file. Note that this second part means that the train, tune, or test corpora may have multiple copies of some input corpora. - 5. Put the same data in multiple times (or make it a higher percentage of the train, tune, or test corpus in ``create_corpora``) to weight it higher. For example, if you have sentences in po files that you know are good and relevant to your domain, these may be the best data you have and should be correspondingly waited higher. Alternatively, unless you're creating a translater for parliamentary data, the europarl corpus should probably have a low weight so your translations do not sound like parliamentary proceedings - -3. Build your model - - 1.
Decide what configurations to test and run ``build_model`` with an appropriate config file modeled off of the sample ``translate_full.yaml`` which shows all of the possible settings. Perusing the Moses website will explain a bit more about every setting, but in general most settings either perform faster or perform better. Ones that seem to "do less"- such as by using fewer scoring options, considering only one direction, or considering smaller phrases or words- likely will finish faster but will perform worse. ``translate_best.yaml`` was found to perform very well. ``translate_baseline.yaml`` is the baseline provided by moses. - 2. Wait a while (and read a good book!) while the test runs - 3. At the end of the test look at the out.csv file for the data on how well each configuration did, the BLEU score is the metric you want to look at. - 4. If for some reason the out.csv file isn't there, use ``model_results`` to create it. - -4. Translate your docs - - 1. Use ``translate_po`` to translate your po files. - 2. First copy the files so you have a parallel directory tree, and give ``translate_po`` one of them to translate - -5. Put your docs in MongoDB - - 1. Use ``po_to_mongo`` to move the data into MongoDB - 2. Run this once for every "type" of translation you have. (i.e. Moses, Person1, Person2....), this will make the status and the username correct - 3. You may need to put some users into your database first. Opening up a shell and running ``db.users.insert({"username": "Moses", "num_reviewed": 0, "num_user_approved": 0, "num_got_approved":0, "trust_level": "basic"})`` - 4. ``python po_to_mongo.py ~/docs Jorge approved es 28000 verifier`` - 5. ``python po_to_mongo.py ~/docsMoses Moses SMT es 28000 verifier`` - -6. Run the verifier - - 1. Run the verifier web app and have people contribute to it - -7. Take the approved data from the verifier - - 1. Copy doc directory tree to back it up - 2. Use ``mongo_to_po`` to copy approved translations into the new doc directory tree - 3. This will inject the approved translations into all of the untranslated sentences - -Notes ------ -If you don't want to accidentally convert backticks (`) into apostrophes (') then comment out line 278 of translation_tools/mosesdecoder/scripts/tokenizer/tokenizer.perl: -$text =~ s/\`/\'/g; - -When running any moses .sh files, run with bash, not just sh - -To test, go into the working/train/ folder and run: -``grep ' document ' model/lex.f2e | sort -nrk 3 | head`` - -Get KDE4 corpus from here, it's a mid-size corpus filled with technical sentences: -http://opus.lingfil.uu.se/KDE4.php -Get the PHP Documentation in multiple languages here, which is also good technical documentation: -http://opus.lingfil.uu.se/PHP.php -Other corpora can be found here, the News-Commentary corpus was found to do well: -http://www.statmt.org/wmt13/translation-task.html#download - -These scripts, especailly the tuning and training phases, can take a long time. Take proper measures to background your processes so that they do not get killed part way. -``nohup``- makes sure that training is not interrupted when done over SSH -``nice``- makes sure the training doens't hold up the entire computer. run with ``nice -n 15`` -### Explanation of Moses scripts - -* **Tokenizing** - - * Tokenizing is splitting every meaningful linguistic object into a new word. This primarily separates off punctuation as it's own "word" and escaping special characters - * Running this with the ``-protected`` flag will mark certain tokens to not be split off. 
It takes a file with a list of regex's and anything that matches won't be tokenized. - * After translation use the detokenizer to replace escaped characters with their original form. It does not get rid of the extra spacing added, so use ``-protected`` where this becomes an issue. - -* **Truecasing** - - * Trucasing is the process of turning all words to a standard case. For most words this means making them lower case, but for others, like MongoDB, it keeps them capitalized but in a standard form. After translation you must go back through (recasing) and make sure the capitalization is correct for the language used. The truecaser first needs to be trained to create the truecase-model before it can be used. The trainer counts the number of times each word is in each form and chooses the most common one as the standard form. - -* **Cleaning** - - * Cleaning removes long and empty sentances which can cause problems and mis-alignment. Numbers at the end of the commandare minimum line size and maximum line size: ``clean-corpus-n.perl CORPUS L1 L2 OUT MIN MAX`` - -* **Language Model** - - * The Language model ensures fluent output, so it is built with the target language in mind. Perplexity is a measure of how probable the language model is. IRSTLM computes the perplexity of the test set. The language model counts n-gram frequencies and also estimates smoothing parameters. - - * ``add-start-end.sh``: adds sentence boundary symbols to make it easier to parse. This creates the ``.sb`` file. - * ``build-lm.sh``: generates the language model. ``-i`` is the input ``.sb`` file, ``-o`` is the output LM file, ``-t`` is a directory for temp files, ``-p`` is to prune singleton n-grams, ``-s`` is the smoothing method, ``-n`` is the order of the language model (typically set to 3). The output theoretically is an iARPA file with a ``.ilm.gz`` extension, though moses says to use ``.lm.es``. This step may be run in parallel with ``build-lm-qsub.sh`` - * ``compile-lm``: turns the iARPA into an ARPA file. It appears you need the ``--text`` flag alone (as opposed to ``--text yes``) to make it work properly. - * ``build_binary``: binarizes the ARPA file so it's faster to use - * More info on IRSTLM here: http://hermes.fbk.eu/people/bertoldi/teaching/lab_2010-2011/img/irstlm-manual.pdf - * Make sure to export the irstlm environment variable either in your ``.bash_profile`` or in the code itself ``export IRSTLM=/home/judah/irstlm-5.80.03`` - -* **Training** - - * Training teaches the model how to make good translations. This uses the MGIZA++ word alignment tool which can be run multi-threaded. A factored translation model taking into account parts of speech could improve training though it makes the process more complicated and makes it take longer. - - * ``-f`` is the "foreign language" which is the source language - * ``-e`` is the "english language" which is the target language. This comes from the convention of translating INTO english, not out of english as we are doing. - * ``--parts n`` allows training on larger corpora, 3 is typical - * ``--lm factor:order:filename:type`` +1. Translate your docs - * ``factor`` = factor that the model is modeling. There are separate models for word, lemma, pos, morph - * ``order`` = n-gram size - * ``type`` = the type of language model used. 1 is for IRSTLM, 8 is for KenLM. + 1. Use Giza, which can be found in PyPI, to translate your po files. + 2. First copy the files so you have a parallel directory tree, one for every distinct translator (consider machine translation to be one translator, unless you have multiple systems). - * ``--score-options`` used to score phrase translations with different metrics. ``--GoodTuring`` is good, the other options could make it run faster but make performance suffer. See http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases for more info. - * For informationa about the reordering model, see here: http://www.statmt.org/moses/?n=FactoredTraining.BuildReorderingModel +2. Put your docs in MongoDB -* **Tuning** + 1. Use ``po-to-mongo`` to move the data into MongoDB. + 2. Run this once for every "type" of translation you have (e.g. Moses, Person1, Person2...); this will make the status and the username correct in each case. + 3. You may need to put some users into your database first. Open up a shell and run the following for any users: ``db.users.insert({"username": "Moses", "num_reviewed": 0, "num_user_approved": 0, "num_got_approved":0, "trust_level": "basic"})`` + 4. Alternatively use the admin page to do the same thing. Upload the docs you want. You will need to put them in a ``.tar.gz`` file before uploading them. You can't just upload a directory of docs. - * Tuning changes the weights of the different scores in the moses.ini file. Tuning takes a long time and is best to do with small tuning corpora as a result. It is best to tune on sentences VERY similar to those you are actually trying to translate. +3. Run the verifier -* **Binarize the model** + 1. Run the verifier web app and have people contribute to it. + 2. Make sure to add new users to the database before they begin translating. - * This makes the decoder load the model faster and thus the decoder starts faster. It does not speed up the actual decoding process +4. Take the approved data (or all) from the verifier - * ``-ttable`` refers to the size of the phrase table. For a standard configuration just use 0 0. - * ``-nscores`` is number of scores used in translation table, to find this, open ``phrase-table.gz`` (first use gunzip to unzip it), and then count how many scores there are at the end. - * ``sed`` searches and replaces - * NOTE: The extensions are purposefully left off of the replacements done by sed. This is the way moses intends for it to be used. + 1. Copy the doc directory tree to back it up. + 2. Use ``mongo_to_po`` to copy approved translations into the new doc directory tree. + 3. This will inject the approved translations into all of the untranslated sentences. + 4. Alternatively use the admin page to do the same thing. It will download a new copy of the translations rather than overwriting an old copy as the pharaoh command does. A sketch of the full round trip follows this list.
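+
+ The full round trip, as a sketch (directory names and the ``smt`` status value are illustrative assumptions; the flags are taken from the command examples above, and dropping ``--all`` to fetch only the approved translations is an inference from that flag's description)::
+
+    # keep one copy of the po tree per translator
+    cp -r ~/docs ~/docs-moses
+
+    # load one translator's files into the verifier database
+    pharaoh po-to-mongo --po ~/docs-moses --username Moses --status smt --source_language en --target_language es --host localhost --port 28000 --dbname verifier
+
+    # after review, back up the originals and inject the approved translations
+    cp -r ~/docs ~/docs-backup
+    pharaoh mongo-to-po --po ~/docs --source_language en --target_language es --host localhost --port 28000 --dbname verifier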
-* **Testing the model** +Work to be Done +--------------- - * Running just uses the ``moses`` script and takes in the ``moses.ini`` file. If the model was filtered, binarised, or tuned, the "most recent" ``moses.ini`` file should be used. - * ``detruecase.perl``: recapitalizes the beginnings of words appropriately - * ``detokenizer.perl``: fixes up the tokenization by replacing escaped characters with the original character - * Use ``mail -s "{subject}" {email} <<< "{message}"`` to find out when long running processes are done running +1. Authentication- This is key to it ever being production ready. As part of adding authentication, make adding users a more seamless process. Currently they have to be added to the database manually before they can be used. Making it possible to create users through the app would be good. More importantly, adding better handling for users not in the system is a must. This could use JIRA, or ideally it would be general enough that you can plug in different authentication systems. +2. Upload Docs Fixes- If the documentation ever gets edited (as it always will), the system currently can't handle it well. Having the upload and download scripts handle edits better would be great. Uploading shouldn't overwrite sentences that haven't changed, and it should remove sentences that no longer exist and add new ones in the proper order (this requires fixing sentence numbers for everything). +3. Translations from Scratch- Currently you need a set of docs on top of which to do translations. It would be good to be able to start from a blank slate for a new language, or for a language already in the system. If the language already exists, we shouldn't get multiple blank-slate sets popping up; rather, just one set of blank-slate docs and one set of machine translation verifications. +4. Docs Pages Priorities- The infrastructure exists for prioritizing pages for translation, however there is no good method for actually putting in these priorities. A method in the upload script for adding priorities could fix this. Google Analytics page views could be a good metric. +5. Edit Approved Translations- If someone makes a mistake and it accidentally gets approved, there should be a way for trusted or admin users to unapprove the translation and allow themselves or others to edit it. +6. Gamification- Make it more like a game, with awards, badges, and points for translating more things. A higher reputation score should earn you some privileges, similar to how Stack Overflow works. diff --git a/pharaoh/pharaoh/app/static/javascript/scripts.js index 584b1f515..9d1b26ff0 100644 --- a/pharaoh/pharaoh/app/static/javascript/scripts.js +++ b/pharaoh/pharaoh/app/static/javascript/scripts.js @@ -1,3 +1,19 @@ +/** + * Copyright 2014 MongoDB, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+**/ + $(document).ready(function(){ $(".edit").on('click',edit); $(".approve").on('click',approve); diff --git a/pharaoh/pharaoh/app/views.py index dc233d68c..65e1faa49 100644 --- a/pharaoh/pharaoh/app/views.py +++ b/pharaoh/pharaoh/app/views.py @@ -131,6 +131,9 @@ def unapprove_translation(): @app.route('/edit////423') def lock_error(username, language, file): + ''' This function is called when a user tries to do something in a file + but doesn't have a lock on it and thus can't + ''' return render_template("423.html", username=username, language=language, @@ -138,6 +141,8 @@ def lock_error(username, language, file): @app.route('/download-all//') def download_single_po(language, file): + ''' This function downloads all translations from a single po file + ''' po = generate_fresh_po_text(file, 'en', language, db, True) response = make_response(po) response.headers["Content-Disposition"] = "attachment; filename={0}.po".format(file) @@ -145,6 +150,9 @@ def download_single_po_approved(language, file): + ''' This function downloads all approved translations from a single + po file + ''' po = generate_fresh_po_text(file, 'en', language, db, False) response = make_response(po) response.headers["Content-Disposition"] = "attachment; filename={0}.po".format(file) @@ -153,6 +161,9 @@ @app.route('/download-all//') @app.route('/download-all/') def download_all_po(language): + ''' This function downloads all translations from all + po files + ''' po = generate_all_po_files( 'en', language, db, True) response = make_response(po) response.headers["Content-Disposition"] = "attachment; filename={0}.tar.gz".format(language) @@ -161,6 +172,9 @@ @app.route('/download-approved//') @app.route('/download-approved/') def download_all_po_approved(language): + ''' This function downloads all approved translations from all + po files + ''' po = generate_all_po_files( 'en', language, db, False) response = make_response(po) response.headers["Content-Disposition"] = "attachment; filename={0}.tar.gz".format(language) @@ -168,11 +182,13 @@ @app.route('/admin', methods=['GET']) def admin(): + ''' This function produces an admin page''' files = models.get_file_paths() return render_template("admin.html", file_list=files) @app.route('/upload', methods=['POST']) def upload(): + ''' This function uploads the given tarball to MongoDB''' app.logger.info(request.files['file']) app.logger.info(request.form) put_po_data_in_mongo(request.files['file'],
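The download routes above serve po files and tarballs over plain GET requests, so translations can also be fetched with curl. A sketch under assumptions: the port is gunicorn's default, and ``about`` stands in for a real file path from the database::

    # all translations for one file, then only approved ones for a whole language
    curl -O http://localhost:8000/download-all/es/about
    curl -O http://localhost:8000/download-approved/es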
diff --git a/pharaoh/pharaoh/mongo_to_po.py index 7af1a800f..e9821e342 100644 --- a/pharaoh/pharaoh/mongo_to_po.py +++ b/pharaoh/pharaoh/mongo_to_po.py @@ -38,6 +38,8 @@ def write_po_file(po_fn, source_language, target_language, db, is_all): ''' writes approved or all trnalstions to file :param string po_fn: the path to the current po file to write + :param string source_language: language to translate from + :param string target_language: language to translate to :param database db: mongodb database :param boolean is_all: whether or not you want all or just approved translations ''' @@ -65,6 +67,8 @@ def write_po_file(po_fn, source_language, target_language, db, is_all): def write_mongo_to_po_files(path, source_language, target_language, db_host, db_port, db_name, is_all): ''' goes through directory tree and writes po files to mongo :param string path: the path to the top level directory of the po_files + :param string source_language: language to translate from + :param string target_language: language to translate to :param string db_host: the hostname of the database :param int db_port: the port of the database :param string db_name: the name of the database @@ -84,6 +88,13 @@ def write_mongo_to_po_files(path, source_language, target_language, db_host, db_ write_po_file(fn, source_language, target_language, db, is_all) def generate_fresh_po_text(po_fn, source_language, target_language, db, is_all): + ''' goes through all of the sentences in a po file in the database and writes them out to a fresh po file + :param string po_fn: the path to a given po file as it would be found in the database + :param string source_language: language to translate from + :param string target_language: language to translate to + :param database db: the instance of the database + :param boolean is_all: whether or not you want all or just approved translations + ''' po = polib.POFile() po.metadata = { u'Project-Id-Version': 'uMongoDB Manual', @@ -116,13 +127,21 @@ def generate_fresh_po_text(po_fn, source_language, target_language, db, is_all): entry = polib.POEntry( msgid=unicode(sentence['source_sentence'].strip()), msgstr=unicode(translation), - #comment=unicode(sentence['source_location'].strip()), + comment=unicode(sentence['source_location'].strip()), tcomment=unicode(sentence['sentenceID'].strip()) ) po.append(entry) return getattr(po, '__unicode__')() def generate_all_po_files(source_language, target_language, db, is_all): + ''' goes through all of the files in the database for a pair of languages and + writes them all out to fresh po files. It then tars them up before returning + the value of the tar + :param string source_language: language to translate from + :param string target_language: language to translate to + :param database db: the instance of the database + :param boolean is_all: whether or not you want all or just approved translations + ''' file_list = db['files'].find({'source_language': source_language, 'target_language': target_language}, {'_id': 1, 'file_path': 1}) diff --git a/pharaoh/pharaoh/po_to_mongo.py index 6e74731b8..b52e9e0f7 100644 --- a/pharaoh/pharaoh/po_to_mongo.py +++ b/pharaoh/pharaoh/po_to_mongo.py @@ -104,15 +104,15 @@ def put_po_files_in_mongo(path, username, status, source_language, target_langua def put_po_data_in_mongo(po_tar, username, status, source_language, target_language, db): - '''go through directories and write the po file to mongo - :param string po_fn: the filename for the po file - :param string po_data: the po_file data + '''go through a tar of directories and write the po files to mongo + :param string po_tar: the tar of a set of po files :param string username: the username of the translator :param string status: the status of the translations :param string source_language: The source_language of the translations :param string target_language: The target_language of the translations :param database db: the mongodb database ''' + tar = tarfile.open(fileobj=po_tar) for member in tar.getmembers(): if os.path.splitext(member.name)[1] not in ['.po', '.pot']:
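As the ``put_po_data_in_mongo`` change above shows, uploads are read as a tarball of po files rather than a bare directory. A sketch of building one (the directory name is an assumption)::

    # archive the po/pot files themselves; non-po members are skipped on upload
    tar -czf docs.tar.gz -C ~/docs-moses .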
diff --git a/pharaoh/pharaoh/serialization.py index 20a5d7dd7..d0ac1e26a 100644 --- a/pharaoh/pharaoh/serialization.py +++ b/pharaoh/pharaoh/serialization.py @@ -14,6 +14,10 @@ import json import yaml +import logging + +logger = logging.getLogger('pharaoh.serialization') + def ingest_yaml_list(*filenames): o = [] diff --git a/pharaoh/pharaoh/utils.py index 4b3f6d1ce..69d03e9b1 100644 --- a/pharaoh/pharaoh/utils.py +++ b/pharaoh/pharaoh/utils.py @@ -21,6 +21,10 @@ logger = logging.getLogger('pharaoh.utils') def load_json(file_name, db): + ''' This function loads a json file into a dictionary + :param string file_name: The name of the json file + :param database db: The instance of the mongodb database + ''' file_no_ext = os.path.basename(file_name) file_no_ext = os.path.splitext(file_no_ext)[0] with open(file_name,"r") as file: @@ -31,6 +35,10 @@ def load_json(file_name, db): def get_file_list(path, input_extension=['po', 'pot']): + '''Returns a list of files with certain extensions in a given directory tree + :param string path: The path to the top of the directory tree + :param list input_extension: list of file extensions (without a dot) that are returned + ''' file_list = [] if os.path.isfile(path): return [path]
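For reference, ``get_file_list`` with its default extensions collects roughly what this shell one-liner finds (the path is an assumption)::

    find ~/docs -type f \( -name '*.po' -o -name '*.pot' \)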