v0.2.6.r296

wangyibin · wangyibin · commit eb2e23a8bb22 · 2025-03-22T19:49:46.000+08:00
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,4 +1,4 @@
-## [v0.2.6] - 2025-03-20
+## [v0.2.6] - 2025-03-22
 #### New features
 - `collapse`, rescue the collapsed contigs  
 #### Enhancement
diff --git a/bin/cphasing-rs b/bin/cphasing-rs
diff --git a/cphasing/__init__.py b/cphasing/__init__.py
@@ -33,7 +33,7 @@
 __email__ = ("yibinwang96@outlook.com", "zhangxingtan@caas.cn")
 __license__ = "BSD"
 __status__ = "Development"
-__version__ = "0.2.6.r295"
+__version__ = "0.2.6.r296"
 __url__ = "https://github.com/wangyibin/CPhasing"
 __doc_url__ = "https://wangyibin.github.io/CPhasing"
 __epilog__ =  f"""
diff --git a/cphasing/pqs.py b/cphasing/pqs.py
@@ -677,6 +677,7 @@ def to_hg_df(self, chunks, contig_idx,
                 )
         
         results = list(filter(lambda x: x is not None, results))
+        results = list(filter(lambda x: len(x) > 0, results))
         if len(results) == 0:
             logger.warning("No data found in the given region.")
             return
@@ -988,10 +989,10 @@ def process_chunk_to_cool_global(chunk, binsize,
             chunk = chunk.filter(pl.col("mapq") >= min_mapq).drop("mapq")
     
     bin1_id = (chunk["pos1"] // binsize) + chunk["chrom1"].map_elements(
-            bin_offset_db.get
+            bin_offset_db.get, skip_nulls=False
     ).cast(schema["pos1"])
     bin2_id = (chunk["pos2"] // binsize) + chunk["chrom2"].map_elements(
-        bin_offset_db.get
+        bin_offset_db.get, skip_nulls=False
     ).cast(schema["pos2"])
     chunk = (
         chunk.with_columns([bin1_id.alias("bin1_id"), bin2_id.alias("bin2_id")])
@@ -1053,6 +1054,8 @@ def process_chunk_hg(chunk_name, bed_dict, contigsizes,
     columns = ["chrom1", "pos1", "chrom2", "pos2", "mapq"]
     chunk = pl.scan_parquet(chunk_name).select(columns)
     chunk_name = Path(chunk_name).stem
+
+    
     if min_mapq > 1:
         chunk = chunk.filter(pl.col("mapq") >= min_mapq)
 
diff --git a/docs/faq.md b/docs/faq.md
@@ -1,6 +1,11 @@
 ### The results of the first round partition are unsatisfactory.
 In our two-round partition algorithm, the first-round partition depends on the h-trans errors between homologous chromosomes. If you input a contig assembly with a low level of switch errors, or high-accuracy Pore-C data, the h-trans errors will not be sufficient to cluster all contigs into the correct homologous groups, resulting in unsatisfactory results. You can set `-q1 0` for `hyperpartition` to increase the rate of h-trans errors. However, this parameter may cause an `out of memory` error when you input very large Pore-C data in a porec table or Hi-C contacts in a pairs file. 
 
+### The total size of the chromosomes is significantly smaller than the estimated genome size
+If either of the following two conditions applies, you can switch the mode of `cphasing pipeline` to `--preset sensitive` or `--preset very-sensitive`. 
+1. The amount of input data is low. 2. The input genome is relatively complex, with many homozygous or nearly homozygous regions. Note that both modes may cause some very small contigs to be clustered or ordered incorrectly. In the second case, greedy clustering may occur, in which two highly homologous sets of chromosomes are merged into one group.
+
+
 ### How to set the `-n` parameter when assembling an aneuploid genome. 
 Aneuploid genomes, such as modern cultivated sugarcane, contain unequal numbers of homologous chromosomes. The `-n` parameter can be set to zero (`-n 10:0`) to automatically partition contigs into different chromosomes within a homologous group.     
 However, we also allow the user to input a file with two columns: the first column is the index (1-based) of the first-round partition group, and the second column is the number of chromosomes in each homologous group. Then specify `-n 10:second.number.tsv` in `cphasing pipeline` or `cphasing hyperpartition`.
diff --git a/docs/faq.zh.md b/docs/faq.zh.md
@@ -1,5 +1,11 @@
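 ### The results of the first-round partition are unsatisfactory:   
-In our two-round clustering algorithm, the first round of clustering relies on alignment errors between homologous chromosomes; if the user inputs contigs with a low level of switch errors or high-accuracy Pore-C data, the h-*trans* errors will not be sufficient to cluster contigs from homologous chromosomes together, which easily leads to unsatisfactory results. Users can set `-q1 0` for `hyperpartition` or `pipeline` to increase the h-*trans* error rate. However, when you input a large amount of Pore-C data in a pore table or pairing file, this parameter may cause an out-of-memory error. 
+In our two-round clustering algorithm, the first round of clustering relies on alignment errors between homologous chromosomes; if the user inputs contigs with a low level of switch errors or high-accuracy Pore-C data, the h-*trans* errors will not be sufficient to cluster contigs from homologous chromosomes together, which easily leads to unsatisfactory results. Users can set `-q1 0` for `hyperpartition` or `pipeline` to increase the h-*trans* error rate. However, when you input a large amount of contact data in porec.gz or pairs.gz, this parameter may cause an out-of-memory error. 
+
+
+### The total size of the anchored chromosomes is far below the estimated genome size
+If one of the following two situations applies, you can switch the mode of `cphasing pipeline` to sensitive (`--preset sensitive`) or very sensitive (`--preset very-sensitive`)
+1. The amount of input data is low. 2. The input genome is relatively complex, with many homozygous or nearly homozygous regions. Note that both modes can cause some highly fragmented contigs to be clustered or ordered incorrectly. In addition, in the second situation, greedy clustering is likely to occur, that is, two highly homologous chromosome sets are assigned to the same group.
+
 
 ### How to set the `-n` parameter when assembling an aneuploid genome:  
 Aneuploid genomes, such as modern cultivated sugarcane, contain unequal numbers of homologous chromosomes. We suggest setting the `-n` parameter to zero (`-n 0:0`) to let the program determine the number of groups automatically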
