Commit ccdeaf0

Merge branch 'simplified_headings'
2 parents 0b166a5 + bcc3435 commit ccdeaf0

26 files changed, +87 -95 lines changed

.github/workflows/addPDF.yml

+5-5
@@ -28,12 +28,12 @@ jobs:
 - name: Modify and prepare main documentation files
 run: |
 # remove everything up to "## Guide for" in software guides
-sed -i '1,/^\s*## Guide for/ {/## Guide for/ !d}' janno_r_package.md
-sed -i '1,/^\s*## Guide for/ {/## Guide for/ !d}' qjanno.md
-sed -i '1,/^\s*## Guide for/ {/## Guide for/ !d}' trident.md
-sed -i '1,/^\s*## Guide for/ {/## Guide for/ !d}' xerxes.md
+sed -i '1,/^\s*# Guide for/ {/# Guide for/ !d}' janno_r_package.md
+sed -i '1,/^\s*# Guide for/ {/# Guide for/ !d}' qjanno.md
+sed -i '1,/^\s*# Guide for/ {/# Guide for/ !d}' trident.md
+sed -i '1,/^\s*# Guide for/ {/# Guide for/ !d}' xerxes.md
 # add extra # at the beginning of janno_details.md
-echo -e "#$(cat janno_details.md)" > janno_details.md
+# echo -e "#$(cat janno_details.md)" > janno_details.md
 
 - name: Convert files
 run: |

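Reviewer note: the `sed` range expression in this hunk is terse, so here is a rough Python sketch of what it appears to do: within the range from line 1 up to the first later line matching `^\s*# Guide for`, every line that does not contain `# Guide for` is deleted, and everything after the range is kept. The function name `strip_preamble` is hypothetical and not part of the workflow; the sketch assumes GNU sed semantics, where the end pattern of a `1,/re/` range is only tested from line 2 onwards.

```python
import re

# Patterns taken from the new sed command in this commit.
RANGE_END = re.compile(r"^\s*# Guide for")   # closes the 1,/.../ address range
KEEP = re.compile(r"# Guide for")            # lines exempt from deletion (!d)


def strip_preamble(lines):
    """Mimic: sed '1,/^\\s*# Guide for/ {/# Guide for/ !d}' (illustrative only)."""
    kept = []
    in_range = True
    for index, line in enumerate(lines):
        if not in_range:
            kept.append(line)          # past the range: keep everything
            continue
        if KEEP.search(line):
            kept.append(line)          # inside the range: keep only matching lines
        # The end pattern is not tested against line 1 (index 0).
        if index > 0 and RANGE_END.search(line):
            in_range = False
    return kept


example = [
    "<popup></popup>",
    "<h1>janno R package</h1>",
    "Some introduction",
    "# Guide for the janno R package v1.0.0",
    "## Installation",
]
print(strip_preamble(example))
```

If the end pattern never matches, the range covers the whole file, so almost everything would be deleted; that makes the heading level in the pattern worth keeping in sync with the Markdown sources.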
janno_details.md

+13-15
@@ -1,10 +1,10 @@
 # .janno file details
 
-### Background
+## Background
 
 The `.janno` file columns are specified in the Poseidon package specification [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv). The following documentation includes additional background information for many of the variables. This should make it more easy to compile the necessary information for both published and unpublished data. The `.pdf` version of the latest version of this document is available [here](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_details.pdf).
 
-### Identifiers
+## Identifiers
 
 The `Poseidon_ID` column represents each sample with an ideally world-wide unique identifier string often equal to the identifier used in the respective accompanying publication. There is no central authority to issue these identifiers, so it remains in the hand of the authors to avoid duplication. We are aware of this inconsistency and hope the aDNA community will eventually come together to establish a mechanism to ensure uniqueness of identifiers. If there are multiple samples from one individual, then they have to be clearly distinguished with relevant suffixes added to the `Poseidon_ID`. `Poseidon_ID`s are also employed in the genetic data files in a Poseidon package and therefore have to adhere to certain constraints.
 
@@ -14,7 +14,7 @@ The `Collection_ID` column stores an additional, secondary identifier as it is o
 
 The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package, and whitespace is not allowed in any of the entries. Assigning group and population names is a hard problem in archeogenetics [@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z), so the `.janno` file allows for more than one identifier.
 
-### Relations among samples/individuals
+## Relations among samples/individuals
 
 To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ [@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491) or BREADR [@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. Across these columns it should be possible to encode all kinds of pairwise, biological relationships an individual might have.
 
@@ -46,7 +46,7 @@ Unlike `Relation_Degree`, `Relation_Type` can be left empty even if there are en
 
 The `Relation_Note` column allows to add free-form text information about the relationships of this individual. This might also include information about the method used to infer the degree and type.
 
-### Spatial position
+## Spatial position
 
 The `.janno` file contains six columns to describe the spatial origin of an individual sample: `Country`, `Country_ISO`, `Location`, `Site` and finally `Latitude` and `Longitude`.
 
@@ -60,11 +60,11 @@ The `Site` column should contain a site name, ideally in the latin alphabet and
 
 The `Latitude` and `Longitude` columns should contain geographic coordinates (WGS84) in decimal degrees (DD) with a precision of not more than five places after the decimal point. This yields a precision of about [1.1132m at the equator](https://en.wikipedia.org/wiki/Decimal_degrees) which is sufficient to describe the position of an archaeological site. Coordinates in other formats like for example Degrees Minutes Seconds (DMS) or in completely different coordinate reference systems should be transformed. There exist many open source software solutions to do that, most based on the [PROJ library](https://proj.org) e.g. the [The World Coordinate Converter](https://twcc.fr/en).
 
-### Temporal position
+## Temporal position
 
 The temporal position of a sample is encoded with seven different columns in the `.janno` file: `Date_C14_Labnr`, `Date_C14_Uncal_BP`, `Date_C14_Uncal_BP_Err`, `Date_BC_AD_Median`, `Date_BC_AD_Start`, `Date_BC_AD_Stop`, `Date_Type`.
 
-#### General structure
+### General structure
 
 The `Date_Type` column handles the general distinction between the most common forms of age information:
 
@@ -74,7 +74,7 @@ The `Date_Type` column handles the general distinction between the most common f
 
 So `Date_C14_Labnr`, `Date_C14_Uncal_BP` and `Date_C14_Uncal_BP_Err` only go along with `Date_Type = C14`, whereas `Date_BC_AD_Median`, `Date_BC_AD_Start`, `Date_BC_AD_Stop` complement both `Date_Type = C14` and `Date_Type = contextual`. Radiocarbon dates that only serve as secondary evidence for a contextual dating should NOT be reported in `Date_C14_Labnr`, `Date_C14_Uncal_BP` and `Date_C14_Uncal_BP_Err`.
 
-#### The columns in detail
+### The columns in detail
 
 Each radiocarbon date has a unique identifier: the "lab number". It consists of a lab code issued by the journal [Radiocarbon](https://radiocarbon.org/laboratories) for each laboratory and a serial number. This lab number makes the date well identifiable and should be reported in `Date_C14_Labnr` with the lab code separated from the serial number with a minus symbol.
 
@@ -90,9 +90,9 @@ In the columns `Date_BC_AD_Median`, `Date_BC_AD_Start`, `Date_BC_AD_Stop` ages a
 
 The column `Date_Note` stores arbitrary free-form text information about the dating of a sample.
 
-### Genetic summary data
+## Genetic summary data
 
-#### Individual properties
+### Individual properties
 
 The `Genetic_Sex` column should encode the biological sex as determined from the DNA read distribution on the X and Y chromosome. It only allows for the entries
 
@@ -106,7 +106,7 @@ The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogr
 
 The `Y_Haplogroup` column holds the respective human Y-chromosome DNA haplogroup in a simple string. To avoid confusion from using different haplotype naming systems, the notation should follow a syntax with the main branch + the most terminal derived Y-SNP separated with a minus symbol (e.g. R1b-P312), similar to that used by [Yfull](https://www.yfull.com/sc/tree/).
 
-#### Library properties
+### Library properties
 
 The `Source_Tissue` column documents the skeletal, soft tissue or other elements from which source material for DNA library preparation was extracted. If multiple samples have been taken from different elements, these can be listed separated by `;`. Specific bone names should be reported with an underscore (e.g. bone_phalanx, tooth_molar).
 
@@ -143,20 +143,18 @@ The `Genotype_Ploidy` column stores whether the genotype calls for this individu
 
 The column `Data_Preparation_Pipeline_URL` should finally store an URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager [@FellowsYates2021](https://doi.org/10.7717/peerj.10947)) by which the sample data was processed.
 
-#### Data yield
+### Data yield
 
 The `Endogenous` column holds the percentage of mapped reads over the total amount of reads that went into the mapping pipeline. That boils down to the DNA percentage of the library that matches the (human) reference. It should be determined from Shotgun libraries (so before any hybridization capture), not on target (i.e. across the whole genome, not specific positions), and before any mapping quality filtering. In case of multiple libraries only the highest value should be reported. The % endogenous DNA can be calculated for example with the [endorS.py](https://github.com/aidaanva/endorS.py) script.
 
 The `Nr_SNPs` column gives the number of SNPs reported in the genotype data files for this individual.
 
 The `Coverage_on_Target_SNPs` column reports the mean fold coverage on the SNP set of the genotype dataset (e.g. 1240K) for the merged libraries of this sample. To calculate the coverage it is necessary to determine which SNPs are covered how many times by the mapped reads. Individual SNPs might be covered multiple times, whereas others may not be covered at all by the highly deteriorated ancient DNA. The coverage for each SNP is therefore a number between 0 and n. The statistic can be determined for example with the QualiMap [@Okonechnikov2015](https://doi.org/10.1093/bioinformatics/btv566) software package. In case of multiple libraries, the total coverage should be given across all libraries.
 
-#### Data quality
+### Data quality
 
 The `Damage` column contains the % damage on the first position of the 5' end for the main Shotgun library used for sequencing or capture. This is an important statistic to verify the age of ancient DNA. In case of multiple libraries you should report a value from the merged read alignment.
 
-##### Contamination
-
 Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD [@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4), contamLD [@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2) or hapCon [@Huang2022](https://doi.org/10.1093/bioinformatics/btac390)), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family.
 
 `Contamination` is a list column to represent the different contamination values estimated for a sample with one or multiple software tools. As usual multiple values are separated by `;`.
@@ -175,7 +173,7 @@ This setup has the consequence that the columns `Contamination`, `Contamination_
 
 The `Contamination_Note` column is a free text field to add additional information about the contamination estimates, e.g. which parameters where used with the respective software tools.
 
-### Context information
+## Context information
 
 The `Genetic_Source_Accession_IDs` column was introduced to link the derived genotype data in Poseidon with the raw sequencing data typically uploaded to archives like the ENA [@Burgin2022](https://doi.org/10.1093/nar/gkac1051) or SRA [@Katz2021](https://doi.org/10.1093/nar/gkab1053). There, projects and individual samples are given clear unique identifiers: Accession IDs. This janno column is supposed to store one or multiple of these Accessions IDs for each individual/sample in Poseidon. If multiple are entered, then they should be arranged by descending specificity from left to right (e.g. project id > sample id > sequencing run id).

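Reviewer note: the context lines of `janno_details.md` quote a precision of about 1.1132 m at the equator for five decimal places in `Latitude`/`Longitude`. The arithmetic behind that figure can be sketched as follows. This is only an illustration: the function name is hypothetical, and the sketch assumes a circular equator with the WGS84 equatorial radius of 6378137 m.

```python
import math

WGS84_EQUATORIAL_RADIUS_M = 6_378_137  # assumption: WGS84 semi-major axis in metres


def ground_resolution_m(decimal_places):
    """Length at the equator of the smallest step expressible with the
    given number of decimal places in decimal degrees (illustrative)."""
    metres_per_degree = 2 * math.pi * WGS84_EQUATORIAL_RADIUS_M / 360
    return metres_per_degree * 10 ** (-decimal_places)


# Five decimal places correspond to roughly 1.11 m at the equator.
print(round(ground_resolution_m(5), 4))
```

Away from the equator a step in longitude covers less ground, so five decimal places remain at least this precise everywhere.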
janno_details.pdf

-369 Bytes
Binary file not shown.

janno_r_package.md

+11-11
@@ -1,6 +1,6 @@
 <popup :custom-text="`<p><a href='https://nevrome.github.io/uni.tuebingen.poseidon.intro.2h.2024'>A short introduction to the Poseidon genotype data management framework</a> by Clemens Schmid: A Poseidon tutorial also covering <a href='https://nevrome.github.io/uni.tuebingen.poseidon.intro.2h.2024/spacetime.html'>the janno R package</a></p>`"></popup>
 
-# janno R package <!-- {docsify-ignore-all} -->
+<h1>janno R package</h1>
 
 `janno` (formerly known as poseidonR) is an R package to simplify the interaction with `.janno` files in Poseidon packages. It provides a dedicated R S3 class `janno` that inherits from `tibble` and allows to tidily read and manipulate the context information stored in them. The code is available on [GitHub](https://github.com/poseidon-framework/janno/).
 
@@ -15,13 +15,13 @@ The guide below explains the main functions in the package. It is available in .
 
 - [🗎 Guide for the janno R package v1.0.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_r_package.pdf) (shown below)
 
-## Guide for the janno R package v1.0.0
+# Guide for the janno R package v1.0.0
 
-### Installation
+## Installation
 
 See the Poseidon website (<https://www.poseidon-adna.org/#/janno_r_package>) or the GitHub repository (<https://github.com/poseidon-framework/janno>) for up-to-date installation instructions.
 
-### Read `.janno` files
+## Read `.janno` files
 
 You can read `.janno` files with
 
@@ -41,7 +41,7 @@ Usually the `.janno` files are first loaded as normal `.tsv` files with every co
 
 `read_janno()` returns an object of class `janno`. This class is derived from the [`tibble`](https://tibble.tidyverse.org/) class, which integrates well with the tidyverse [@Wickham2019](https://doi.org/10.21105/joss.01686) and its packages, e.g. `dplyr` or `ggplot2`.
 
-### Validate `.janno` files
+## Validate `.janno` files
 
 You can validate `.janno` files with
 
@@ -51,7 +51,7 @@ my_janno_issues <- janno::validate_janno("path/to/my/janno_file.janno")
 
 `validate_janno` returns a `tibble` with issues in the respective `.janno` files. For edge cases this validation may yield slightly different results than `trident validate`.
 
-### Write `janno` objects back to `.janno` files
+## Write `janno` objects back to `.janno` files
 
 `janno` objects usually contain list columns, that can not directly be written to a flat text file like the `.janno` file. The function `write_janno` solves that. It employs a helper function `flatten_janno()`, which translates list columns to the string list format in `.janno` files (so: multiple values for one cell separated by `;`).
 
@@ -64,7 +64,7 @@ janno::write_janno(
 )
 ```
 
-### Process age information in `janno` objects
+## Process age information in `janno` objects
 
 `.janno` files contain age information in multiple different columns. See the `.janno` file specification and documentation for a list and detailed explanations of these variables. The function `janno::process_age()` works with this age information to calculate different derived columns, which are then added to the input `janno` object.
 
@@ -83,7 +83,7 @@ janno::process_age(
 
 The `choices` argument contains the list of columns that should be calculated and added by `process_age()`. `n` is the number of samples that should be drawn for `Date_BC_AD_Sample`.
 
-#### Output column `Date_BC_AD_Prob`
+### Output column `Date_BC_AD_Prob`
 
 `Date_BC_AD_Prob` is a list column with a `data.frame` for each `janno` row, so each individual/sample. This `data.frame` stores a density distribution (`sum_dens`) over a set of years BC/AD (`age`). Additionally the boolean column `two_sigma` documents if a given year is within the 2-sigma high-density regions of the distribution. `center` is also a boolean column with only one `TRUE` value for the year that corresponds to the calibrated median age of the sample.
 
@@ -96,15 +96,15 @@ The `choices` argument contains the list of columns that should be calculated an
 
 The density distributions are either the result of (sum) calibration on radiocarbon dates or -- for samples that are only contextually dated -- a uniform distribution over the archaeologically determined age range.
 
-#### Output column `Date_BC_AD_Median_Derived`
+### Output column `Date_BC_AD_Median_Derived`
 
 `Date_BC_AD_Median_Derived` is a simple integer column with the median age (in years BC/AD) as determined from `Date_BC_AD_Prob`.
 
-#### Output column `Date_BC_AD_Sample`
+### Output column `Date_BC_AD_Sample`
 
 `Date_BC_AD_Sample` is again a list column with a vector of `n` ages (in years BC/AD) for each `.janno` file individual/sample. These ages are randomly drawn with `base::sample(prob = ...)` using the probability distribution calculated for `Date_BC_AD_Prob`.
 
-### General helper functions
+## General helper functions
 
 When you are preparing a `.janno` file and want to determine the entries for the columns `Date_BC_AD_Median`, `Date_BC_AD_Start` and `Date_BC_AD_Stop` from radiocarbon dates, then `janno::quickcalibrate()` might come in handy.

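Reviewer note: the `Date_BC_AD_Sample` paragraph in the context lines describes drawing `n` ages weighted by the density stored in `Date_BC_AD_Prob`. A minimal Python sketch of that idea is given below. It is not the package's R implementation: the function names are hypothetical, and sampling with replacement is an assumption that the text does not state.

```python
import random


def uniform_density(start, stop):
    """Uniform density over an inclusive year range, as described for
    contextually dated samples (illustrative helper)."""
    ages = list(range(start, stop + 1))
    weight = 1 / len(ages)
    return ages, [weight] * len(ages)


def sample_ages(ages, densities, n, seed=None):
    """Draw n ages weighted by their density.
    Assumption: sampling with replacement."""
    rng = random.Random(seed)
    return rng.choices(ages, weights=densities, k=n)


# Contextual date between 500 BC and 400 BC (negative years mean BC).
ages, dens = uniform_density(-500, -400)
draws = sample_ages(ages, dens, n=10, seed=42)
print(len(draws), min(draws) >= -500, max(draws) <= -400)
```

Years with zero density are never drawn, so samples always fall inside the dated range.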
pdf_conversion/pandoc_pdf_config.yml

+1-1
@@ -1,5 +1,5 @@
 table-of-contents: true
-shift-heading-level-by: -2
+shift-heading-level-by: -1
 number-sections: true
 highlight-style: tango
 variables:

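Reviewer note: this change pairs with the heading edits in the Markdown files. As far as the Pandoc manual describes `shift-heading-level-by`, a heading shifted below level 1 no longer counts as a section heading (a level-N heading at the start of the document becomes the title under a shift of -N). The sketch below checks that the old layout with a shift of -2 and the new layout with a shift of -1 give the same section levels. The function name and the level lists are illustrative assumptions based on the headings visible in this diff.

```python
from typing import Optional


def shifted_level(level: int, shift: int) -> Optional[int]:
    """Heading level after shifting; None when the result drops below 1
    (such a heading is no longer a numbered section)."""
    new_level = level + shift
    return new_level if new_level >= 1 else None


# Heading levels in the guides: title, section, subsection.
before = [2, 3, 4]   # "## Guide for ...", "### Installation", "#### Output column ..."
after = [1, 2, 3]    # "# Guide for ...",  "## Installation",  "### Output column ..."

old_result = [shifted_level(level, -2) for level in before]
new_result = [shifted_level(level, -1) for level in after]
print(old_result, new_result)
```

Both layouts map the guide title out of the numbered sections and the first real section to level 1, which suggests the PDF outline is meant to stay unchanged by this merge.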
pdf_conversion/pdf_conversion_list.tsv

+1-1
@@ -17,4 +17,4 @@ janno_details.md janno_details.pdf
 janno_r_package.md janno_r_package.pdf
 qjanno.md qjanno.pdf
 trident.md trident.pdf
-xerxes.md xerxes.pdf
+xerxes.md xerxes.pdf
