Add package 2024_Barquera_ChichenItza #211

Merged
merged 2 commits into from
Feb 12, 2025

Conversation


@RodrigoBarquera RodrigoBarquera commented Sep 5, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for all listed keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For instructions on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).
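The checklist item about empty .janno columns can be pre-checked locally before submitting. The following is only a minimal sketch, assuming the .janno file is a tab-separated table with a header row and that missing values are written as n/a or left blank; the function name `empty_columns` and the example rows are hypothetical, and the sketch is no substitute for trident validate.

```python
import csv
import io

# Values treated as "not filled" (assumption: blank or the literal n/a).
MISSING = {"", "n/a"}


def empty_columns(janno_text):
    """Return the names of columns whose values are all blank or n/a."""
    reader = csv.DictReader(io.StringIO(janno_text), delimiter="\t")
    columns = reader.fieldnames or []
    rows = list(reader)
    return [
        col
        for col in columns
        # Short rows yield None for missing fields, hence the `or ""`.
        if all((row.get(col) or "").strip() in MISSING for row in rows)
    ]


# Hypothetical two-sample table, not taken from the actual package.
example = (
    "Poseidon_ID\tGenetic_Sex\tGroup_Name\tCollection_ID\tAlternative_IDs\n"
    "IND001\tM\tGroupA\tn/a\tA-1\n"
    "IND002\tF\tGroupA\tn/a\tA-2\n"
)

print(empty_columns(example))
```

For a real file, the text would be read from disk first, for example with `open(path, encoding="utf-8").read()`. Note that a table with a header but no rows reports every column as empty.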

@nevrome nevrome changed the title added new package named 2024_Barquera_ChichenItza Add package 2024_Barquera_ChichenItza Sep 6, 2024
@stschiff stschiff self-assigned this Sep 9, 2024
@stschiff
Member

stschiff commented Oct 9, 2024

Hi @RodrigoBarquera, this is great. Super that you even entered the relationship columns, which I know is a lot of work.

Sorry for taking so long to give feedback, but I have some points:

  • We actually would like the Collection_ID column to reflect the ID from the actual collection. I see that you've used the column Alternative_IDs for that. I suggest that you simply rename the Alternative_IDs column to Collection_ID and remove the empty Collection_ID column.
  • You have only given date information for the few samples that you've C14-dated. But I'm sure you can also give dates for all samples that have no C14 date, right? We have the value contextual in the Date_Type column for that, and it would be good to fill it in. We generally aspire to have at least contextual dates for every single sample, to facilitate meta-analyses through space and time. Note that with contextual dates, you should only fill columns Date_BC_AD_Start, Date_BC_AD_Median and Date_BC_AD_End, where the median can just be the mid-point of the interval.
  • I see that you've left columns Endogenous, Nr_SNPs, Coverage_on_Target_SNPs, Damage, Contamination, Contamination_Err, Contamination_Meas and Contamination_Note empty. I'm sure this information is available in your paper, right? Do you need help with these? We have three student assistants now who can help with this. Let us know! I would be willing to leave these empty for now, but if it's just about needing help, let us help.
  • The Genetic_Source_Accession_IDs should be filled. They can all have the exact same Project Accession ID entry from the ENA.

Again, let us know if you need help with this and we can ask someone from our team.
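The rule for contextual dates described above (fill only the start, median and end columns, with the median as the mid-point) can be sketched as follows. This is an illustration only: the helper name `contextual_date` and the example interval are hypothetical, and the sketch assumes the usual convention that negative years mean BC and positive years mean AD.

```python
def contextual_date(start, end):
    """Build the date fields for a sample dated only by context.

    start and end are calendar years (assumed: negative = BC, positive = AD).
    The median is the mid-point of the interval, rounded down.
    """
    if start > end:
        raise ValueError("start year must not be later than end year")
    return {
        "Date_Type": "contextual",
        "Date_BC_AD_Start": start,
        "Date_BC_AD_Median": (start + end) // 2,
        "Date_BC_AD_End": end,
    }


# Hypothetical interval of 1200 BC to 800 BC.
fields = contextual_date(-1200, -800)
print(fields["Date_BC_AD_Median"])
```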

@stschiff
Member

stschiff commented Dec 3, 2024

@RodrigoBarquera did you have a chance to look into this, or do you need help from one of us?

@nevrome
Member

nevrome commented Jan 14, 2025

@RodrigoBarquera Another reminder. Please let us know if you would like to hand this over to another assignee.

@nevrome nevrome assigned RodrigoBarquera and unassigned stschiff Jan 14, 2025
@nevrome nevrome requested a review from stschiff January 14, 2025 16:14
@stschiff stschiff removed their request for review January 31, 2025 16:07
@stschiff stschiff marked this pull request as draft January 31, 2025 16:08
@RodrigoBarquera
Author

Hi Stephan! :)

With the help of Thiseas, I finally managed to fill in the missing spots. Two things were left unchanged. First, the dating for all samples: since we found individuals spanning 500 years in the same context, we cannot date them any more precisely by context. Second, the contamination estimates: we don't have them consistently for all libraries of all individuals. If it is absolutely necessary, I can compile them and compute a weighted average of the contamination estimates, but since eager 1 and 2 were both used at the time, I would need to rerun everything on eager 2 to be consistent. This would have to be based on the TF data, since the SG data is in all cases too low to give an accurate estimate.
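For reference, a weighted average over per-library contamination estimates could look like the sketch below. The thread does not say which weights would be used; weighting each library by the number of SNPs behind its estimate is purely an assumption here, and the function name and the numbers are hypothetical.

```python
def weighted_contamination(estimates):
    """Combine per-library estimates into one value per individual.

    estimates: list of (contamination, weight) pairs. The weight is
    assumed to be something like the number of SNPs used per library.
    """
    total_weight = sum(weight for _, weight in estimates)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(value * weight for value, weight in estimates) / total_weight


# Hypothetical libraries: 1% contamination on 1000 SNPs, 3% on 3000 SNPs.
combined = weighted_contamination([(0.01, 1000), (0.03, 3000)])
print(round(combined, 4))
```

The larger library dominates the result, which is the intended effect of weighting by the amount of data behind each estimate.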

@stschiff
Member

Thanks, @RodrigoBarquera. OK, maybe then forget about contamination, that's OK. But for the dates, please write here what the span is. Even something like 1000 - 2000 AD would be better than nothing, given that our repository has data from the last 40,000 years or so. I am happy to input them for you.

@stschiff
Member

Never mind, I just looked at your paper and saw 500-900 AD. That's perfect. I will enter this.

@stschiff
Member

OK, so if nobody objects, I would like to take over this PR now, by merging into a local branch and making the final touches.

@nevrome nevrome assigned stschiff and unassigned RodrigoBarquera Feb 12, 2025
@stschiff stschiff changed the base branch from master to 2024_Barquera February 12, 2025 15:29
@stschiff stschiff marked this pull request as ready for review February 12, 2025 15:29
@stschiff stschiff merged commit 7663b11 into poseidon-framework:2024_Barquera Feb 12, 2025
1 check passed
@stschiff
Member

This PR is being further worked on in #250

@stschiff stschiff mentioned this pull request Feb 12, 2025
20 tasks