Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 2023_VillaIslas_preHispanicMexico #205

Merged
merged 20 commits into from
Mar 18, 2025

Conversation

jbv2
Copy link
Contributor

@jbv2 jbv2 commented Aug 15, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for the listed mentioned keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).

@jbv2 jbv2 changed the title 2023 villa islas science 2023_Villa-IslasScience Aug 15, 2024
@jbv2 jbv2 changed the title 2023_Villa-IslasScience 2023_Villa-IslasScience.janno Aug 15, 2024
@nevrome nevrome added the only .janno This PR does not feature a full package, but only a .janno file label Aug 18, 2024
@stschiff
Copy link
Member

Super @jbv2, we'll take a look! Thanks!

@TCLamnidis TCLamnidis self-assigned this Sep 6, 2024
@nevrome nevrome changed the title 2023_Villa-IslasScience.janno Create 2023_Villa-IslasScience.janno Sep 6, 2024
@stschiff
Copy link
Member

stschiff commented Dec 3, 2024

@jbv2 we were wondering whether you could perhaps reach out to the first author and ask for the genotype data to be added here? That would be ideal. We also happen to have the package no in the Minotaur archive, but author-provided genotype data would be welcome for this repository, too!

@jbv2
Copy link
Contributor Author

jbv2 commented Dec 3, 2024

Yes, I will write back here when I have an answer.

@stschiff
Copy link
Member

Do we have an update on this, @jbv2, any luck? And @nevrome should we re-open this issue? I think it was closed by accident.

@stschiff stschiff assigned jbv2 and unassigned TCLamnidis Jan 31, 2025
@stschiff stschiff removed the only .janno This PR does not feature a full package, but only a .janno file label Jan 31, 2025
@nevrome nevrome reopened this Feb 1, 2025
@stschiff stschiff marked this pull request as draft February 3, 2025 14:11
@jbv2
Copy link
Contributor Author

jbv2 commented Feb 4, 2025

Yes @stschiff
On December 17th, I was asked which panel we wanted. I answered, but I don't have the data yet. I sent her a reminder on Friday in case she missed my previous message. I will update here once I have the data.

@stschiff
Copy link
Member

Cool, thanks for that, @jbv2, let's hope we get the data soon.

@jbv2
Copy link
Contributor Author

jbv2 commented Feb 12, 2025

@stschiff I just got the hg19 plink files. Can someone guide me on what to do next?
Thanks!

@stschiff stschiff deleted the branch poseidon-framework:dev February 13, 2025 07:01
@stschiff stschiff closed this Feb 13, 2025
@stschiff stschiff reopened this Feb 13, 2025
@jbv2 jbv2 marked this pull request as ready for review February 13, 2025 12:38
@nevrome nevrome changed the title Create 2023_Villa-IslasScience.janno Add 2023_Villa-IslasScience Feb 13, 2025
@nevrome
Copy link
Member

nevrome commented Feb 13, 2025

Great that this came together, finally! I added the usual checklist to your comment above, @jbv2. Maybe you could quickly go through that check the boxes. When this is done I'll try to find a reviewer.

I already have a quick review comment about the package name: I think we should avoid - characters, because they cause problems in some file systems. And we should replace the journal name Science with some content-related qualifier. I suggest 2023_VillaIslas_preHispanicMexico, but other suggestions are fine as well 👍

@jbv2
Copy link
Contributor Author

jbv2 commented Feb 17, 2025

@nevrome thanks. I renamed the package and made some minor changes to check all the boxes.

@nevrome nevrome changed the title Add 2023_Villa-IslasScience Add 2023_VillaIslas_preHispanicMexico Feb 18, 2025
@nevrome
Copy link
Member

nevrome commented Feb 18, 2025

@martynamolak Your submission (#253) just got reviewed. Would you, in turn, be available to have a brief look at this package? Specifically the .janno, .bib, and .yml file? Feel free to decline if you're not, at the moment.

@martynamolak
Copy link
Contributor

@nevrome , sorry the message slipped somehow and only read it now.
Sure, I will review it. I hope I will know how to though :)

@martynamolak
Copy link
Contributor

Hi,

This is my first time reviewing a Poseidon package so I hope I will make any sense here.
Bib file looks good so I only have some comments to the janno file:

  • Group_Name
    I'm not sure what Poseidon's policy is here, but these samples are already included in the AADR v62 and the sample IDs and group names are quite different. This is not necessarily a problem. Regardless, I would suggest applying Group_Names that would also make sense to people analysing the data without having to go deep into the original paper. In this sense the AADR Pop labels make more sense (eg. "Mexico_Queretaro_Medieval.SG" as opposed to "Toluquilla;CentralMexico;SierraGorda"). Since Group_Name is ultimately the identifier many people would use to assemble datasets I would argue that use of roughLocation+Period/culture is information we should aim for having in the Group_Name in Poseidon (@nevrome , what's your take on this? or do you think that such information being present already in the package name is sufficient?)

  • The sex assignment in the janno file is not entirely concordant with the sex assignment from the paper supplement. Or actully it rather reports the morphological sex assignment and not the genetic one (eg. 2417J_TOL_a was morphologically M but genetically XX so I think it should go as F rather than as M as currently is in the janno; there are also some samples that you have marked U but have sex assignment in the SI - eg. 333B) . Please verify and correct these.

  • It's not clear to me which columns in janno file are required (although your package passes trident validate so I assume it has all the ones that are strictly obligatory), but since you'd be fixing janno anyway I suggest adding the Country_ISO column (with "MX" for each sample) since it wouldn't be much work for you and might make it easier for someone to process the data.

  • Guanajuato Canada de la Virgen Longitude coordinates are missing "-" putting them in Laos

  • and actually when I was verifying these coordinates the La Ventana Cave is actually located in Arizona so the Country field should probably be changed from "Mexico" to "United States of America"

  • @nevrome Are we aiming for avoiding non-Latin characters in all columns? There is Querétaro and Michoacán in Location column.

  • Date_BC_AD_Stop Date_Note values seem to have an excel automatic fill in issue with numbers of S ("Dates are taken from Date (CE) in SX", which I assume stands for Supplementary Table 1 increment for consecutive entries)

  • are the dates really "contextual"? Some of them are super-precise (eg. 647-825 CE). But I can't find information on the dating in the original paper at all; I'm sure you know the paper a bit better @jbv2 so just wanted to double-confirm this is really the case that they are contextual; if there is indeed no info on how the dates were established, I guess "contextual" is the way to go

  • Nr_SNPs seems to have "Positions covered in hg19 >=1x" values rather than the snps covered (eg. for sample F9_ST_a it is 1 970 412 787 and the snp file has 1 149 511 snps). I guess you could run genoStats or something to get the actual number of SNPs covered

I hope I haven't missed too many things. Also, hope I didn't pick on too many things that didn't really need to be addressed.

@nevrome
Copy link
Member

nevrome commented Feb 21, 2025

Thank you for this thorough review, @martynamolak! Some quick comments:

  1. For the group names in the community archive I think the idea was to stay close to the paper. I get your point, though - these names are less informative, globally. Maybe we should raise this in the next Poseidon meeting. One way to resolve this quickly would be to add the AADR group names as an additional name in the list, so Toluquilla;CentralMexico;SierraGorda;Mexico_Queretaro_Medieval. These labels can then be used in trident's forge language or in xerxes fstats command. Would that be a feasible compromise, @jbv2?
  2. I agree. The column is for the genetic sex.
  3. Very few columns are mandatory. Only Poseidon_ID, Group_Names, and Genetic_Sex. But of course a package with only these is merely an empty dummy. Adding the Country_ISO would be helpful.
  4. Wow - good spotting!
  5. ASCII characters are only mandatory in the Poseidon_ID and the first name in Group_Names. Anywhere else we support UTF-8.
  6. That's a common issue.
  7. contextual actually also applies if the age ranges IS, in fact, derived from a radiocarbon date, but the relevant uncalibrated age and error are not known.
  8. Right - that's a good observation. Eventually we will add a feature to rectify for this: trident rectify could have an option to check and fill the Nr_SNPs Janno-column  poseidon-hs#298

@jbv2 I hope these ToDos are feasible. Feel free to address them when you have time.

@nevrome
Copy link
Member

nevrome commented Mar 13, 2025

@jbv2 Quick reminder that this is still open. Are the ToDos clear or do you have some open questions? Happy to help 👍

@jbv2
Copy link
Contributor Author

jbv2 commented Mar 14, 2025

Ok, sorry for the delay @nevrome, and thanks for the comments @martynamolak !

  1. I added the labels from AADR so it is more informative.
  2. Updated genetic sex, thanks!
  3. Added the Country_Iso
  4. Corrected the coordinates. Thanks again!
  5. Since in Site these characters are supported, I added the "ñ" in Cañada.
  6. Fixed the Supplementary S1 that was auto-filled.
  7. Updated the number of SNPs with the current plink files that were genotyped to the 1240k provided by the author.

Please let me know if you need anything else.

@nevrome
Copy link
Member

nevrome commented Mar 18, 2025

Perfect! Thank you for the clean update. I set the version number to 1.0.0 and will merge now.

@nevrome nevrome merged commit 854f795 into poseidon-framework:dev Mar 18, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants