Vojtěch Zeisek
2017-10-27 14:25:02 UTC
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
They delete columns containing missing values, but I also need to delete
columns containing the base "N" (all columns where the proportion of Ns
exceeds a certain threshold).
Actually, ips::deleteEmptyCells has the option nset=c("-", "n", "?"), so it is
supposed to remove columns/rows containing only the given characters, but if I
use it and export the data (ape::write.dna or ape::write.nexus.data), some
samples still consist only of N characters...
The DNAbin object being processed was originally imported from VCF using vcfR
(read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf, consensus=TRUE,
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked the source code of the above functions, but they seem to only count
NAs and then drop the respective columns. And as sequences in DNAbin are
stored in binary format, I'm a bit stuck here... :(
Any idea how to remove columns with a given proportion of "N" in the sequences?
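
To make the goal concrete, here is a rough sketch of the behaviour I am after.
It assumes the DNAbin object is a matrix (an alignment) and that
as.character() returns lowercase letters, with "n" standing for N. The helper
name drop_n_columns, the argument max_n and the toy alignment are made up for
illustration, and I have not checked this on the real data.

```r
library(ape)

# Hypothetical helper (not part of ape or ips): drop alignment columns
# whose proportion of "n" is above max_n.
drop_n_columns <- function(x, max_n = 0.5) {
  stopifnot(inherits(x, "DNAbin"), is.matrix(x))
  chars <- as.character(x)          # decode the binary storage into letters
  prop_n <- colMeans(chars == "n")  # proportion of N in each column
  x[, prop_n <= max_n, drop = FALSE]
}

# Toy alignment standing in for the real data: 3 samples x 4 sites
toy <- as.DNAbin(rbind(
  s1 = c("a", "n", "c", "n"),
  s2 = c("a", "n", "g", "t"),
  s3 = c("a", "n", "c", "t")
))

# N proportions per column are 0, 1, 0 and 1/3
filtered <- drop_n_columns(toy, max_n = 0.5)
strict <- drop_n_columns(toy, max_n = 0.2)
```

The same idea with rowMeans() would find samples that consist mostly of N.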
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/en/
Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/
Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/