LinDA Unusual p-value Distribution #54

Open
DarioS opened this issue May 29, 2024 · 2 comments

@DarioS

DarioS commented May 29, 2024

Using the data set shared previously via e-mail and fitting

adjNormalCancerFit <- linda(bacteriaMatrix, clinicalTablePairs, "~ Age * `Tissue Type` + Smoking * `Tissue Type` +
                            Gender * `Tissue Type` + `Primary Site` + (1|Patient)", "proportion", is.winsor = FALSE)

I get a strange-looking p-value distribution for many of the coefficients. For example,

[Image: histogram of p-values for one coefficient]

How can it be made to be more uniform, as expected by statistics theory?

cafferychen777 added the "good first issue" label on Sep 24, 2024
@cafferychen777
Owner

Hi @DarioStrbenac,

Thank you for reporting this issue. We've conducted extensive testing of the p-value distribution in LinDA, and I'd like to share our findings:

  1. Test Results
    We created a comprehensive test with simulated null data (no true associations) and found that the p-value distribution can be improved by:
  • Increasing sample size (200+ samples showed better results)
  • Having a sufficient number of features (100+ features)

Our test metrics showed:

  • KS test p-values > 0.03 (indicating no significant deviation from uniform)
  • Density ratios around 2-3 (showing reasonable uniformity)
  • Coefficient of variation < 0.35 (indicating good spread)
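
For readers who want to compute comparable metrics themselves, here is a minimal sketch in base R. It does not reproduce the test described above: the null p-values come from plain two-sample t-tests rather than from LinDA. The definitions of "density ratio" (largest over smallest histogram bin count) and "coefficient of variation" (standard deviation over mean of the bin counts) are assumptions, and `uniformity_metrics` is a hypothetical helper, not a package function.

```r
set.seed(1)

# Hypothetical helper (not part of LinDA or MicrobiomeStat).
# Assumed definitions: density ratio = max / min bin count,
# CV = sd / mean of bin counts, both over equal-width bins on [0, 1].
uniformity_metrics <- function(p, bins = 10) {
  breaks <- seq(0, 1, length.out = bins + 1)
  counts <- as.vector(table(cut(p, breaks = breaks, include.lowest = TRUE)))
  list(
    ks_p          = ks.test(p, "punif")$p.value,  # KS test against Uniform(0, 1)
    density_ratio = max(counts) / min(counts),    # Inf if a bin is empty
    cv            = sd(counts) / mean(counts)
  )
}

# Null p-values: both groups are drawn from the same distribution
null_p <- replicate(1000, t.test(rnorm(50), rnorm(50))$p.value)

m <- uniformity_metrics(null_p)
str(m)
```

With few p-values per coefficient (for example, one per feature), the bin-based metrics become noisy, so the number of bins should be chosen with the feature count in mind.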
  2. Recommendations for Your Analysis
    Based on your formula:

~ Age * `Tissue Type` + Smoking * `Tissue Type` + Gender * `Tissue Type` + `Primary Site` + (1|Patient)

To improve the p-value distribution, consider:

a) Sample Size: Ensure you have sufficient samples relative to the number of parameters in your model

  • Your model has multiple interaction terms, which increases the number of parameters
  • Aim for at least 200 samples if possible

b) Model Complexity:

  • Consider if all interaction terms are necessary
  • You might want to try a simpler model first and add complexity gradually

c) Data Quality:

  • Check for balanced groups in categorical variables
  • Consider the distribution of continuous variables (Age)
  • Examine potential outliers
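
To make the parameter count in (a) and (b) concrete, the fixed-effects design matrix can be built with `model.matrix()` and its columns counted. The sketch below uses mock metadata: the column names follow the formula above, but every value, the two-level coding of each variable, and the three `Primary Site` levels are invented placeholders. The random-effect term `(1|Patient)` is left out because `model.matrix()` only handles fixed effects.

```r
set.seed(1)
n <- 100  # placeholder sample size

# Mock metadata; all values and factor levels are invented for illustration
mockMeta <- data.frame(
  Age            = factor(sample(c("A1", "A2"), n, replace = TRUE)),
  `Tissue Type`  = factor(sample(c("T1", "T2"), n, replace = TRUE)),
  Smoking        = factor(sample(c("S1", "S2"), n, replace = TRUE)),
  Gender         = factor(sample(c("G1", "G2"), n, replace = TRUE)),
  `Primary Site` = factor(sample(c("SiteA", "SiteB", "SiteC"), n, replace = TRUE)),
  check.names = FALSE
)

# Fixed-effects part of the full model and of a main-effects-only model
fullX <- model.matrix(
  ~ Age * `Tissue Type` + Smoking * `Tissue Type` +
    Gender * `Tissue Type` + `Primary Site`,
  data = mockMeta
)
simpleX <- model.matrix(~ Age + `Tissue Type` + Smoking + Gender, data = mockMeta)

# Number of fixed-effect coefficients and samples per coefficient
cat("full model:  ", ncol(fullX), "coefficients,", n / ncol(fullX), "samples each\n")
cat("simple model:", ncol(simpleX), "coefficients,", n / ncol(simpleX), "samples each\n")
```

Running the same two `model.matrix()` calls on the real metadata gives the actual counts, which depend on how many levels `Primary Site` has.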
  3. Diagnostic Steps
    You can try:

# Check sample sizes in each group
table(clinicalTablePairs$`Tissue Type`)
table(clinicalTablePairs$Smoking)
table(clinicalTablePairs$Gender)

# Look at Age distribution
hist(clinicalTablePairs$Age)

# Consider a simpler model first
simpleModel <- linda(bacteriaMatrix, clinicalTablePairs, 
                    "~ Age + `Tissue Type` + Smoking + Gender + (1|Patient)", 
                    "proportion", is.winsor = FALSE)
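
Once a model is fitted, the p-values can be inspected one coefficient at a time. The sketch below assumes that the fit stores one data frame per coefficient in `$output`, each with a `pvalue` column; please confirm this with `str()` on your own result. `mockFit` and its coefficient names are made up so that the example runs without the real data.

```r
set.seed(42)

# Mock stand-in for a fitted object. Assumption: the real result keeps one
# data frame per coefficient in $output, each with a `pvalue` column.
mockFit <- list(output = list(
  coefA = data.frame(pvalue = runif(54)),          # roughly uniform
  coefB = data.frame(pvalue = rbeta(54, 0.5, 1))   # skewed toward zero
))

# Share of p-values below 0.05 for each coefficient
# (about 5% is expected when no feature is associated)
shareBelow <- sapply(mockFit$output, function(res) mean(res$pvalue < 0.05))
print(shareBelow)

# One histogram per coefficient, side by side
op <- par(mfrow = c(1, length(mockFit$output)))
for (nm in names(mockFit$output)) {
  hist(mockFit$output[[nm]]$pvalue, breaks = 20, main = nm, xlab = "p-value")
}
par(op)
```

Replacing `mockFit` with `simpleModel` from the snippet above should give the same view for the real fit, provided the assumed structure holds.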

Could you share:

  1. How many samples do you have in total?
  2. How are they distributed across your categorical variables?
  3. What's the distribution of your Age variable?

This information would help us provide more specific recommendations for your case.

Best regards,
Caffery

@DarioS
Author

DarioS commented Nov 6, 2024

Thank you for evaluating it so thoroughly. The data set has 105 samples and 54 species. Age is categorical, not continuous.

> table(clinicalTablePairs$Age)
Young   Old 
   64    41
> table(clinicalTablePairs$`Tissue Type`)
Normal Cancer 
    51     54
> table(clinicalTablePairs$Smoking)
 No Yes 
 74  31
> table(clinicalTablePairs$Gender)
Female   Male 
    38     67
> table(Tissue = clinicalTablePairs$`Tissue Type`, Smoking = clinicalTablePairs$Smoking)
        Smoking
Tissue   No Yes
  Normal 36  15
  Cancer 38  16
> table(Tissue = clinicalTablePairs$`Tissue Type`, Gender = clinicalTablePairs$Gender)
        Gender
Tissue   Female Male
  Normal     19   32
  Cancer     19   35      

Fitting the simpler model with only main effects, I also see a strange histogram.

[Image: histogram of p-values from the main-effects model]
