开发者

Testing if rows of a matrix or data frame are sorted in R

开发者 https://www.devze.com 2023-04-10 00:45 出处:网络
What is an efficient way to test if rows in a matrix are sorted?[Update: see Aaron\'s Rcpp answer - straightforward & very fast.]

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]

I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing开发者_如何转开发 or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.

NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data table comes to mind).


Update 1: To clarify: I'm not testing for sorting intra-row or intra-columns. (Such sorting generally results in an algebraically different matrix.)

As an example for creating an unsorted matrix:

set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))

Its sorted version is:

y <- x[do.call(order, x),]

A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).

Update 2: The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.

I've tried these with rather large matrices, such as 1M x 10, (just changing the creation of x above) and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.

I tested by creating a z matrix:

z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]

In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  3  1  2  2  3  1  3  3  2   2
2  1  1  1  3  2  3  2  3  3   2
3  3  3  1  2  1  1  2  1  2   3

There's no need to look beyond row 2, though vectorization isn't a bad idea.

(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)

Update 3: Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.

Update 4: Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60 for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.

Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.


If y is sorted then do.call(order,y) returns 1:nrow(y).

 testSorted = function(y){all(do.call(order,y)==1:nrow(y))}

note this doesn't compare the matrices, but it doesn't dip out as soon as it finds a non-match.


Well, why don't you use:

all(do.call(order, y)==seq(nrow(y)))

That avoids creating the ordered matrix, and ensures it checks your style of ordering.


Newer: I decided I could use the Rcpp practice...

library(Rcpp)
library(inline)
isRowSorted <- cxxfunction(signature(A="numeric"), body='
  Rcpp::NumericMatrix Am(A);
  for(int i = 1; i < Am.nrow(); i++) {
    for(int j = 0; j < Am.ncol(); j++) {
      if( Am(i-1,j) < Am(i,j) ) { break; }
      if( Am(i-1,j) > Am(i,j) ) { return(wrap(false)); }
    }
  }
  return(wrap(true));
', plugin="Rcpp")

rownames(y) <- NULL # because as.matrix is faster without rownames
isRowSorted(as.matrix(y))

New: This R-only hack is the same speed for all matrices; it's definitely faster for sorted matrices; for unsorted ones it depends on the nature of the unsortedness.

iss3 <- function(x) {
  x2 <- sign(do.call(cbind, lapply(x, diff)))
  x3 <- t(x2)*(2^((ncol(x)-1):0))
  all(colSums(x3)>=0)
}

Original: This is faster for some unsorted matrices. How much faster will depends on where the unsorted elements are; this looks at the matrix column by column so unsortedness on the left side will be noticed much faster than unsorted on the right, while top/bottomness doesn't matter nearly as much.

iss2 <- function(y) {
  b <- c(0,nrow(y))
  for(i in 1:ncol(y)) {
    z <- rle(y[,i])
    b2 <- cumsum(z$lengths)
    sp <- split(z$values, cut(b2, breaks=b))
    for(spi in sp) {
      if(is.unsorted(spi)) return(FALSE)
    }
    b <- c(0, b2)
  }
  return(TRUE)
}


Well, the brute-force approach is to loop and compare, aborting as soon as a violation is found.

That approach can be implemented and tested easily in R, and then be carried over to a simple C++ function we can connect to R via inline and Rcpp (or plain C if you must) as looping is something that really benefits from an implementation in a compiled language.

Otherwise, can you not use something like diff() and check if all increments are non-negative?


You can use your do.call statement with is.unsorted:

issorted.matrix <- function(A) {!is.unsorted(do.call("order",data.frame(A)))}
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号