Reading a large file in functional Scala

Developer https://www.devze.com 2023-04-13 07:54 Source: Internet
I'm attempting to process a large binary file with Scala. If possible, I'd like to use a functional approach. My main method currently looks like this:

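import java.io.BufferedInputStream

def getFromBis(buffer: List[Byte], bis: BufferedInputStream): (Byte, List[Byte], Boolean) = {
    buffer match {
        case Nil =>
            // Buffer exhausted: fetch the next chunk from the stream.
            val buffer2 = new Array[Byte](100000)
            bis.read(buffer2) match {
                case -1 => (-1, Nil, false) // end of stream
                case n =>
                    // read may fill only part of the array, so keep just the first n bytes
                    val buffer3 = buffer2.take(n).toList
                    (buffer3.head, buffer3.tail, true)
            }
        case b :: tail => (b, tail, true)
    }
}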

It takes a list buffer and a buffered input stream. If the buffer isn't empty, it simply returns the head and tail. If it is empty, it reads the next chunk from the file and uses that as the buffer instead.

As you can see, this isn't very functional. I'm trying to do this with as few underlying I/O calls as possible, which is why I'm reading the file in chunks. The problem here is the new Array. Every time I run through the function it creates a new array, and judging by the constantly increasing memory usage as the program runs, I don't think they're getting destroyed.

My question is this: is there a better way to read a large file in chunks using Scala? I'd like to keep a completely functional approach, but at the very least I need a function that can act as a black box for the rest of my functional program.


You almost certainly don't want to store bytes in a List. Each element needs its own list cell object (a header plus head and tail references), whereas an array stores one byte per element. That's really inefficient, and will probably use around 20x more memory than you need.

The easiest way to do this is to create an iterator that stores internal state:

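import java.io.{BufferedInputStream, IOException}
import scala.language.implicitConversions

class BisReader(bis: BufferedInputStream) {
  val buffer = new Array[Byte](100000)
  var n = 0  // number of valid bytes in buffer; -1 once the stream is exhausted
  var i = 0  // index of the next unread byte
  // Refills the buffer from the stream once it has been fully consumed.
  def hasNext: Boolean = (i < n) || (n >= 0 && {
    n = bis.read(buffer)
    i = 0
    hasNext
  })
  def next: Byte = {
    if (i < n) {
      val b = buffer(i)
      i += 1
      b
    }
    else if (hasNext) next
    else throw new IOException("Input stream empty")
  }
}
// Wraps a BisReader so the standard Iterator methods are available on it.
implicit def reader_as_iterator(br: BisReader): Iterator[Byte] = new Iterator[Byte] {
  def hasNext: Boolean = br.hasNext
  def next(): Byte = br.next
}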

One could have BisReader extend Iterator[Byte], but since Iterator isn't specialized, this would require boxing for raw next/hasNext access. This way, you can get low-level (next/hasNext) access at full speed when you need it, and use handy iterator methods otherwise.
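To make the two access styles concrete, here is a minimal sketch of how they might be used. It repeats a compact copy of BisReader so the snippet runs on its own, wraps everything in an object so the implicit conversion has a legal home, and uses an in-memory ByteArrayInputStream as a stand-in for a large file. The names BisReaderDemo, sample, sumLowLevel and sumFunctional are hypothetical and only serve the illustration.

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, IOException}
import scala.language.implicitConversions

object BisReaderDemo {
  // Compact copy of the BisReader above so this snippet runs standalone.
  class BisReader(bis: BufferedInputStream) {
    val buffer = new Array[Byte](100000)
    var n = 0
    var i = 0
    def hasNext: Boolean = (i < n) || (n >= 0 && {
      n = bis.read(buffer)
      i = 0
      hasNext
    })
    def next: Byte =
      if (i < n) { val b = buffer(i); i += 1; b }
      else if (hasNext) next
      else throw new IOException("Input stream empty")
  }

  implicit def reader_as_iterator(br: BisReader): Iterator[Byte] = new Iterator[Byte] {
    def hasNext: Boolean = br.hasNext
    def next(): Byte = br.next
  }

  // Hypothetical helper: an in-memory stream standing in for a large file.
  def sample(bytes: Byte*): BisReader =
    new BisReader(new BufferedInputStream(new ByteArrayInputStream(bytes.toArray)))

  // Low-level access: plain hasNext/next calls on BisReader itself.
  def sumLowLevel(br: BisReader): Long = {
    var total = 0L
    while (br.hasNext) total += br.next
    total
  }

  // Iterator access: BisReader has no foldLeft, so the implicit
  // conversion to Iterator[Byte] is applied here.
  def sumFunctional(br: BisReader): Long =
    br.foldLeft(0L)(_ + _)
}
```

Both functions consume the reader, so each call needs a fresh one: BisReaderDemo.sumLowLevel(BisReaderDemo.sample(1, 2, 3)) and BisReaderDemo.sumFunctional(BisReaderDemo.sample(1, 2, 3)) each return 6.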

Now you've isolated your ugly nonfunctional Java IO stuff in a single class with a clean interface, and can go back to being functional.


Edit: except, of course, IO is order-dependent and has side effects, but the previous method doesn't get around that either.
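As a variation that is not part of the answer above, the same isolation can be sketched with Iterator.continually, exposing whole chunks instead of single bytes. The names ChunkDemo, chunks and countBytes are hypothetical. The sketch also shows the order dependence in practice: one array is reused for every chunk, so each chunk has to be processed before hasNext or next is called again.

```scala
import java.io.InputStream

object ChunkDemo {
  // Hypothetical helper: yields (buffer, validLength) pairs until end of stream.
  // The same array is reused for every chunk, so a pair is only valid until
  // hasNext or next is called on the iterator again.
  def chunks(in: InputStream, size: Int = 100000): Iterator[(Array[Byte], Int)] = {
    require(size > 0, "chunk size must be positive")
    val buf = new Array[Byte](size)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).map(n => (buf, n))
  }

  // Example consumer: count the bytes without ever building a List[Byte].
  def countBytes(in: InputStream): Long =
    chunks(in).foldLeft(0L) { case (total, (_, n)) => total + n }
}
```

A fold such as countBytes handles each chunk fully before asking for the next one, so reusing the array is safe there. Collecting the pairs into a list would not be, because every pair would point at the same overwritten array.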
