Merge branch 'docs'

ivan-aksamentov · ivan-aksamentov · commit 4737732ffcf0 · 2025-03-20T06:39:31.000+01:00
diff --git a/docs/user/input-files/03-genome-annotation.md b/docs/user/input-files/03-genome-annotation.md
@@ -20,33 +20,10 @@ The fundamental unit for Nextclade is a single `CDS`.
 
 When a linked `gene` and `CDS` are present (`CDS`s specify their parents by listing the `gene`'s `ID` in the `Parent` attribute), the `gene` is effectively ignored for all purposes but display in the web UI. `CDS` segments are joined if they have the same `ID`, otherwise they are treated as independent.
 
-Example gene map for SARS-CoV-2:
-
-```
-# seqname	source	feature	start	end	score	strand	frame	attribute
-.	.	gene	266	21555	.	+	.	gene=ORF1ab;ID=gene-ORF1ab
-.	.	CDS	266	13468	.	+	.	gene=ORF1ab;ID=cds-ORF1ab;Parent=gene-ORF1ab
-.	.	CDS	13468	21555	.	+	.	gene=ORF1ab;ID=cds-ORF1ab;Parent=gene-ORF1ab
-.	.	CDS	21563	25384	.	+	.	gene=S
-.	.	CDS	25393	26220	.	+	.	gene=ORF3a
-.	.	CDS	26245	26472	.	+	.	gene=E
-.	.	CDS	26523	27191	.	+	.	gene=M
-.	.	CDS	27202	27387	.	+	.	gene=ORF6
-.	.	CDS	27394	27759	.	+	.	gene=ORF7a
-.	.	CDS	27756	27887	.	+	.	gene=ORF7b
-.	.	CDS	27894	28259	.	+	.	gene=ORF8
-.	.	CDS	28284	28577	.	+	.	gene=ORF9b
-.	.	CDS	28274	29533	.	+	.	gene=N
-```
-
-More example annotations can be found in the [Nextclade data repository](https://github.com/search?q=repo%3Anextstrain%2Fnextclade_data++path%3Agenome_annotation.gff3&type=code).
+Example annotations can be found in the [Nextclade data repository](https://github.com/search?q=repo%3Anextstrain%2Fnextclade_data%20path%3Adata%2F**%2F*.gff*&type=code).
 
 Nextclade Web (advanced mode): accepted in "Genome annotation" drag & drop box.
 
 Nextclade CLI flag: `--input-annotation`/`-m`
 
-Note: For historical reasons, Nextclade uses _gene name_ when it really means _CDS_ name. The "gene name" is taken from the `CDS`'s first attribute found in the following list: `Gene`, `gene`, `gene_name`, `locus_tag`, `Name`, `name`, `Alias`, `alias`, `standard_name`, `old-name`, `product`, `gene_synonym`, `gb-synonym`, `acronym`, `gb-acronym`, `protein_id`, `ID`.
-
-It is recommended that the `gene` attribute is used to specify the gene/CDS name.
-
 > 💡 Nextclade CLI supports file compression and reading from standard input. See section [Compression, stdin](./compression.md) for more details.
diff --git a/docs/user/input-files/04-reference-tree.md b/docs/user/input-files/04-reference-tree.md
@@ -8,13 +8,16 @@ Accepted formats: Auspice JSON v2 ([description](https://nextstrain.org/docs/bio
 
 The phylogenetic reference tree which serves as a target for phylogenetic placement (see [Algorithm: Phylogenetic placement](../algorithm/03-phylogenetic-placement.md)). Nearest neighbor information is used to assign clades (see [Algorithm: Clade Assignment](../algorithm/04-clade-assignment.md)) and to identify private mutations, including reversions.
 
-The tree **must** be rooted at the sample that matches the [reference sequence](../terminology.md#reference-sequence). A workaround in case one does not want to root the tree to be rooted on the reference is to attach the mutational differences between the tree root and the reference on the branch leading to the root node. This can be accomplished by passing the reference sequence to `augur ancestral`'s `--root-sequence` argument (see the [`augur ancestral` docs](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/ancestral.html#inputs)).
+> 💡 Nextclade CLI supports file compression and reading from standard input. See section [Compression, stdin](./compression) for more details.
 
-The tree **must** contain a clade definition for every node (including internal): every node must have a value at `node_attrs.clade_membership` (although it can be an empty string).
+### Requirements
 
-The tree **should** be sufficiently large and diverse to meet clade assignment expectations of a particular use-case, study or experiment. Only clades present on the reference tree can be assigned to [query sequences](../terminology.md#query-sequence).
+1. The tree **should** be rooted at the sample that matches the [reference sequence](02-reference-sequence.md). Otherwise the results of the analysis will be incorrect. It is the user's or dataset author's responsibility to ensure that this assumption holds. Nextclade can detect a mismatch in some cases, but not always.
 
-> 💡 Nextclade CLI supports file compression and reading from standard input. See section [Compression, stdin](./compression) for more details.
+   > ⚠️ If you do not want the tree to be rooted on the reference, a workaround is to attach the mutational differences between the tree root and the reference to the branch leading to the root node.
+   > This can be accomplished by passing the reference sequence to `augur ancestral`'s `--root-sequence` argument (see the [`augur ancestral` docs](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/ancestral.html#inputs)).
+
+2. The tree **should** be sufficiently large and diverse to meet clade assignment expectations of a particular use-case, study or experiment. Only clades present on the reference tree can be assigned to [query sequences](01-sequence-data.md).
 
 ### Extensions