Update README.md

cafferychen777 · Mar 17, 2024 · commit bf7a023
1 parent 0ebdb31
Showing 1 changed file with 6 additions and 2 deletions.
diff --git a/...aration/laying-the-foundation-creating-the-microbiomestat-data-object/README.md b/...aration/laying-the-foundation-creating-the-microbiomestat-data-object/README.md
@@ -11,9 +11,13 @@ This section outlines the core components of the MicrobiomeStat data object. We
 
 <figure><img src="../../.gitbook/assets/Screenshot 2023-10-11 at 10.35.59.png" alt=""><figcaption><p>peerj32.obj$feature.tab</p></figcaption></figure>
 
-**Component 2: meta.dat (data.frame)** The **meta.dat** is a structured data.frame that holds metadata corresponding to the samples. The rows represent the samples and they must align precisely with the columns of the feature.tab. Each column in the meta.dat acts as an **annotation** for the samples. The column names should be informative and concise. To facilitate analysis, for continuous/numeric variables,  make sure they are of "numeric" type ("as.numeric" to convert), and for categorical variables, make sure they are of "factor" type with informative factor levels ("as.factor" to convert). It is a good practice to order the levels properly when creating a factor and always put the reference level in the first.  Level ordering is especially important for ordinal variables. For the meta data of a longitudinal dataset, we expect to see the following variables or similar:
+**Component 2: meta.dat (data.frame)** The **meta.dat** is a structured data.frame that holds metadata corresponding to the samples. The rows represent the samples, and they must align exactly with the columns of feature.tab. The row names of meta.dat should be the unique sample IDs. Each column in meta.dat acts as an **annotation** for the samples. Column names should be informative and concise. To facilitate analysis, make sure continuous variables are of "numeric" type (use "as.numeric" to convert) and categorical variables are of "factor" type with informative levels (use "as.factor" to convert). It is good practice to order the levels deliberately when creating a factor and to put the reference level first. Level ordering is especially important for ordinal variables. For the metadata of a longitudinal dataset, we expect to see the following variables or similar:
 
-* **subject**: a factor that indicates individual subjects or experimental units in the study. This can be alphanumeric and unique for each participant. For example, "subj001", "subj002", etc.
+* **subject**: a factor that indicates individual subjects or experimental units in the study. The identifier can be alphanumeric and must be unique for each participant, for example "subj001", "subj002", etc. In a longitudinal study, each subject has multiple samples collected over time, so the number of unique subjects is smaller than the total number of samples. For instance, if you have 20 subjects and plan to sample each subject at 4 time points, you would expect a total of 80 samples (rows) in your meta.dat, but only 20 unique subjects.
+
+However, in reality, not all subjects may have samples at every time point. Some subjects might miss certain sampling time points due to various reasons such as dropout, missed visits, or sample quality issues. In such cases, you will have fewer than the expected number of total samples. For example, if out of the 20 subjects, 5 subjects missed the third time point, you would have a total of 75 samples (rows) in your meta.dat instead of 80.
+
+MicrobiomeStat can handle such unbalanced longitudinal data, where not all subjects have samples at every time point. The key is to ensure that each sample (row) in your meta.dat has a valid subject identifier and a corresponding time point. Subjects with missing time points will simply have fewer rows associated with them in the meta.dat.
 
 * **timepoint**: a factor indicating the specific time points at which measurements were taken. The order of the levels should reflect the order of the time points. For example, if measurements were taken at baseline, 6 months, and 12 months, the levels might be ordered as "baseline", "6mo", "12mo". Continuous time values are also supported: if the exact days or hours of measurement are known, they can be represented as numeric values such as 0, 180, 365 for days since study start.
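
The meta.dat layout described in this hunk can be sketched in base R as follows. This is only an illustration under stated assumptions, not code from the README: the four time-point labels, the day values, and the `subject_timepoint` sample-ID scheme are hypothetical, and the five dropped samples simply mirror the 80-versus-75 example in the text.

```r
# Hypothetical design: 20 subjects, 4 planned time points (labels are assumed).
subjects   <- sprintf("subj%03d", 1:20)
timepoints <- c("baseline", "3mo", "6mo", "12mo")

# Full (balanced) grid: one row per subject-by-time-point sample, 80 rows.
meta.dat <- expand.grid(subject = subjects, timepoint = timepoints,
                        stringsAsFactors = FALSE)

# Five subjects miss the third time point, leaving an unbalanced design.
missed   <- meta.dat$subject %in% subjects[1:5] & meta.dat$timepoint == "6mo"
meta.dat <- meta.dat[!missed, ]

# Factors with an explicit level order; the reference level comes first.
meta.dat$subject   <- factor(meta.dat$subject, levels = subjects)
meta.dat$timepoint <- factor(meta.dat$timepoint, levels = timepoints)

# Optional continuous time in days since study start (illustrative values).
days <- c(baseline = 0, `3mo` = 90, `6mo` = 180, `12mo` = 365)
meta.dat$day <- unname(days[as.character(meta.dat$timepoint)])

# Row names are the unique sample IDs (hypothetical naming scheme); they
# should match the column names of feature.tab.
rownames(meta.dat) <- paste(meta.dat$subject, meta.dat$timepoint, sep = "_")

nrow(meta.dat)                  # total number of samples
nlevels(meta.dat$subject)       # number of unique subjects
table(table(meta.dat$subject))  # how many subjects have 3 vs. 4 samples
```

Subjects with a missed visit simply contribute fewer rows; every remaining row still carries a valid subject and time point, which is the condition the text names for unbalanced data.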