Discussion:
[R-sig-phylo] Removing columns containing "N" in DNA alignment
Vojtěch Zeisek
2017-10-27 14:25:02 UTC
Permalink
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if I
use it and export data (ape::write.dna or ape::write.nexus.data), some samples
consist only of N characters...
The DNAbin object being processed was originally imported from VCF using vcfR
(read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf, consensus=TRUE,
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count NAs
and then drop respective columns. And as sequences in DNAbin are stored in
binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/
Andreas Kolter
2017-10-27 15:02:45 UTC
Permalink
Hello V.
Because you speak of columns I assume you are handling an alignment,
right? If you handle an alignment all sequences have the same length and
you can do as.matrix

Like this?

library(magrittr)
#maximum number of n's
thresh <- 0.005 #0.5%
seq <- as.matrix(seq)
temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
table
seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]

Greetings,
Andreas
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if I
use it and export data (ape::write.dna or ape::write.nexus.data), some samples
consist only of N characters...
The DNAbin object being processed was originally imported from VCF using vcfR
(read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE,
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count NAs
and then drop respective columns. And as sequences in DNAbin are stored in
binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
_______________________________________________
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
_______________________________________________
R-sig-phylo mailing list - R-sig-***@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-***@r-pr
Vojtěch Zeisek
2017-10-27 15:52:28 UTC
Permalink
Thank You,
Andreas, yes, I try to manipulate an alignment. This is nice trick, although
it returns empty alignment regardless threshold value used (I do have some
data in the alignment:-)...
Have a nice weekend,
V.
Post by Andreas Kolter
Hello V.
Because you speak of columns I assume you are handling an alignment,
right? If you handle an alignment all sequences have the same length and
you can do as.matrix
Like this?
library(magrittr)
#maximum number of n's
thresh <- 0.005 #0.5%
seq <- as.matrix(seq)
temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
table
seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
Greetings,
Andreas
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it
is suppose to remove columns/rows containing only the given characters,
but if I
use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF
using vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE,
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count NAs
and then drop respective columns. And as sequences in DNAbin are stored
in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/
Matthew Van Dam
2017-10-27 19:21:29 UTC
Permalink
HI Vojtech,

Below is a function modified from package *ips deleteEmptyCells *function,
it works on "?" percentage in the alignment, but that can be easily
modified with the above suggestions by Emmanuel.

Best,
Matt


DeleteFunkyColumns =

function (DNAbin, cutoff=0.3, quiet = FALSE)

{


isfunky = function(DNAbin) {

nsets = "?" ###

nn = as.raw(2)

names(nn) = "?"

nsets = nn[nsets]

len = dim(DNAbin)[2]

DNAbin[,1:len] == nsets

}

size = dim(DNAbin)

zztop = isfunky(DNAbin) # phylip

colids = which(apply(zztop,2, sum)/dim(DNAbin)[1] >= cutoff)

if (length(colids)!=0) {



DNAbin = DNAbin[, -colids]

} else {

print("The cutoff is too high, try lowering it, nothing will be done to
the alignment.")

}

if (!quiet) {

size = size - dim(DNAbin)

cat("\n\t", size[2], "columns deleted from alignment, they are:",
colids)

}

DNAbin

}


#### example of how to use below

phylip = read.phy("uce-1289.nexus.phylip")

phylip

newphylip = DeleteFunkyColumns(phylip, cutoff=0.37)

newphylip

newphylip = DeleteFunkyColumns(phylip, cutoff=0.25)

newphylip
Post by Vojtěch Zeisek
Thank You,
Andreas, yes, I try to manipulate an alignment. This is nice trick, although
it returns empty alignment regardless threshold value used (I do have some
data in the alignment:-)...
Have a nice weekend,
V.
Post by Andreas Kolter
Hello V.
Because you speak of columns I assume you are handling an alignment,
right? If you handle an alignment all sequences have the same length and
you can do as.matrix
Like this?
library(magrittr)
#maximum number of n's
thresh <- 0.005 #0.5%
seq <- as.matrix(seq)
temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
table
seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
Greetings,
Andreas
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it
is suppose to remove columns/rows containing only the given characters,
but if I
use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF
vcfR2DNAbin(x=myvcf,
Post by Andreas Kolter
Post by Vojtěch Zeisek
consensus=TRUE,
extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count NAs
and then drop respective columns. And as sequences in DNAbin are stored
in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/en/
Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/
Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/
_______________________________________________
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-
[[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-***@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-s
Vojtěch Zeisek
2017-10-27 15:02:45 UTC
Permalink
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported the
file back to R with ape::read.dna. Then I tried the mentioned functions. They
did nothing. When I exported the file to disk, the FASTA file did not contain
any "-", only "n". DO I do something wrong, or is there a bug in ape as it
seems to confuse "n" and "-"?
Sincerely,
V.
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are stored
in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
http://www.ibot.cas.cz/en/
Emmanuel Paradis
2017-10-27 15:44:10 UTC
Permalink
This post might be inappropriate. Click to display it.
Vojtěch Zeisek
2017-10-30 10:26:03 UTC
Permalink
Thank You, Emmanuel,
sorry for late reaction, I forge to copy the data to another computer before I
left at Friday... It is a nice trick, although it returns unchanged alignment.
:-( It is same regardless which threshold value I choose. I created it using
vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With indels
("-") this function worked well.

minua.dna
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
a c g t
0.213 0.280 0.295 0.212

minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5, freq.only=FALSE)
minua.dna.red
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
a c g t
0.213 0.280 0.295 0.212

And yes, the alignment contains various bases...
base.freq(x=minua.dna, all=TRUE)
a c g t r m w
s k y
0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627 0.01724498
0.01034570 0.01635994 0.06292132
v h d b n - ?
0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000 0.00000000

del.colgapsonly.n
function (x, threshold = 1, freq.only = FALSE)
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
foo <- function(x) sum(x == 240)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
<environment: namespace:ape>

BTW, what about generalize the function like this:

del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE, char="-")
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
value <- as.integer(as.DNAbin(char))
foo <- function(x) sum(x == value)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
:-)
Sincerely,
V.
Hi Vojtěch,
toto <- del.colgapsonly
foo <- function(x) sum(x == 4)
and replace 4 by 240. Save and close. Now you can use toto() in the same
way than del.colgapsonly(); for instance, to get the number of N's in
R> toto(woodmouse, freq.only = TRUE)
[1] 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
....
R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240
This gives a lot of possibilities to hack the function. For instance, if
R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48
foo <- function(x) sum(x == 192 | x == 48)
HTH
Best,
Emmanuel
Post by Vojtěch Zeisek
Hm, I tried a dirty hack: I exported the DNAbin object using
ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported the
file back to R with ape::read.dna. Then I tried the mentioned functions.
They did nothing. When I exported the file to disk, the FASTA file did
not contain any "-", only "n". DO I do something wrong, or is there a bug
in ape as it seems to confuse "n" and "-"?
Sincerely,
V.
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are
stored in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/
Andreas Kolter
2017-10-30 11:03:05 UTC
Permalink
Dear Vojtech,
my code snippet works, and the other solutions were also correct. The
problem is with your alignment. Your 3rd sequence only contains N's,
therefore whatever approach you use, all coloumns are filtered out if
you filter all coloumns containing N.
Please open your alignment in a text editor or use
image.DNAbin(youralignment) to visualise your alignment before you
process it in any way.
Best wishes,
Andreas
Post by Vojtěch Zeisek
Thank You, Emmanuel,
sorry for late reaction, I forge to copy the data to another computer before I
left at Friday... It is a nice trick, although it returns unchanged alignment.
:-( It is same regardless which threshold value I choose. I created it using
vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With indels
("-") this function worked well.
minua.dna
117 DNA sequences in binary format stored in a matrix.
All sequences of same length: 11946
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...
a c g t
0.213 0.280 0.295 0.212
minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5,
freq.only=FALSE)
minua.dna.red
117 DNA sequences in binary format stored in a matrix.
All sequences of same length: 11946
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...
a c g t
0.213 0.280 0.295 0.212
And yes, the alignment contains various bases...
base.freq(x=minua.dna, all=TRUE)
a c g t r m
w
s k y
0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627 0.01724498
0.01034570 0.01635994 0.06292132
v h d b n -
?
0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000 0.00000000
del.colgapsonly.n
function (x, threshold = 1, freq.only = FALSE)
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
foo <- function(x) sum(x == 240)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
<environment: namespace:ape>
del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE, char="-")
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
value <- as.integer(as.DNAbin(char))
foo <- function(x) sum(x == value)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
:-)
Sincerely,
V.
Post by Emmanuel Paradis
Hi Vojtěch,
toto <- del.colgapsonly
foo <- function(x) sum(x == 4)
and replace 4 by 240. Save and close. Now you can use toto() in the same
way than del.colgapsonly(); for instance, to get the number of N's in
R> toto(woodmouse, freq.only = TRUE)
[1] 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
....
R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240
This gives a lot of possibilities to hack the function. For instance, if
R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48
foo <- function(x) sum(x == 192 | x == 48)
HTH
Best,
Emmanuel
Post by Vojtěch Zeisek
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported the
file back to R with ape::read.dna. Then I tried the mentioned functions.
They did nothing. When I exported the file to disk, the FASTA file did
not contain any "-", only "n". DO I do something wrong, or is there a bug
in ape as it seems to confuse "n" and "-"?
Sincerely,
V.
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are
stored in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
_______________________________________________
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
_______________________________________________
R-sig-phylo mailing list - R-sig-***@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r
Vojtěch Zeisek
2017-10-30 13:36:56 UTC
Permalink
Thank You, dear Andreas,
yes, I should notice this myself, my fault. Indeed, the nearly empty sample
was problematic, although, I wish to filter columns with N over certain
proportion. After manual edition it works. I just still wonder if there was
something related to the conversion from VCF...
Thank You all for Your help,
V.
Post by Andreas Kolter
Dear Vojtech,
my code snippet works, and the other solutions were also correct. The
problem is with your alignment. Your 3rd sequence only contains N's,
therefore whatever approach you use, all coloumns are filtered out if
you filter all coloumns containing N.
Please open your alignment in a text editor or use
image.DNAbin(youralignment) to visualise your alignment before you
process it in any way.
Best wishes,
Andreas
Post by Vojtěch Zeisek
Thank You, Emmanuel,
sorry for late reaction, I forge to copy the data to another computer before I
left at Friday... It is a nice trick, although it returns unchanged alignment.
:-( It is same regardless which threshold value I choose. I created it
using vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With
indels ("-") this function worked well.
minua.dna
117 DNA sequences in binary format stored in a matrix.
All sequences of same length: 11946
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...
a c g t
0.213 0.280 0.295 0.212
minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5,
freq.only=FALSE)
minua.dna.red
117 DNA sequences in binary format stored in a matrix.
All sequences of same length: 11946
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...
a c g t
0.213 0.280 0.295 0.212
And yes, the alignment contains various bases...
base.freq(x=minua.dna, all=TRUE)
a c g t r m
w
s k y
0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627 0.01724498
0.01034570 0.01635994 0.06292132
v h d b n -
?
0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000 0.00000000
del.colgapsonly.n
function (x, threshold = 1, freq.only = FALSE)
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
foo <- function(x) sum(x == 240)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
<environment: namespace:ape>
del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE, char="-")
{
if (!inherits(x, "DNAbin"))
x <- as.DNAbin(x)
if (!is.matrix(x))
stop("DNA sequences not in a matrix")
value <- as.integer(as.DNAbin(char))
foo <- function(x) sum(x == value)
g <- apply(x, 2, foo)
if (freq.only)
return(g)
i <- which(g/nrow(x) >= threshold)
if (length(i))
x <- x[, -i]
x
}
:-)
Sincerely,
V.
Hi Vojtěch,
toto <- del.colgapsonly
foo <- function(x) sum(x == 4)
and replace 4 by 240. Save and close. Now you can use toto() in the same
way than del.colgapsonly(); for instance, to get the number of N's in
R> toto(woodmouse, freq.only = TRUE)
[1] 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
....
R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240
This gives a lot of possibilities to hack the function. For instance,
R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48
foo <- function(x) sum(x == 192 | x == 48)
HTH
Best,
Emmanuel
Post by Vojtěch Zeisek
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported
the file back to R with ape::read.dna. Then I tried the mentioned
functions.
They did nothing. When I exported the file to disk, the FASTA file did
not contain any "-", only "n". DO I do something wrong, or is there a
bug in ape as it seems to confuse "n" and "-"?
Sincerely,
V.
Post by Vojtěch Zeisek
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and
ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over
certain threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data),
some samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are
stored in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.
--
Vojtěch Zeisek
https://trapa.cz/
Loading...