Efficient binary serialization for Clojure/Java

Developer https://www.devze.com 2023-04-12 05:47 Source: web
I'm looking for a way to efficiently serialize Clojure objects into a binary format - i.e. not just doing the classic print and read text serialization.

i.e. I want to do something like:

(def orig-data {:name "Data Object"
                :data (get-big-java-array)
                :other (get-clojure-data-stuff)})

(def binary (serialize orig-data))

;; here "binary" is a raw binary form, e.g. a Java byte array
;; so it can be persisted in key/value store or sent over network etc.

;; now check it works!

(def new-data (deserialize binary))

(= new-data orig-data)
=> true

The motivation is that I have some large data structures that contain a significant amount of binary data (in Java arrays), and I want to avoid the overhead of converting these all to text and back again. In addition, I'm trying to keep the format compact in order to minimise network bandwidth usage.

Specific features I'd like to have:

  • Lightweight, pure-Java implementation
  • Support all of Clojure's standard data structures as well as all Java primitives, arrays etc.
  • No need for extra build steps / configuration files - I'd rather it just worked "out of the box"
  • Good performance in terms of processing time
  • Compact binary-encoded representation

What's the best / standard approach to doing this in Clojure?


I may be missing something here, but what's wrong with the standard Java serialization? Too slow, too big, something else?

A Clojure wrapper for plain Java serialization could be something like this:

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize 
  "Serializes value, returns a byte array"
  [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

(defn deserialize 
  "Accepts a byte array, returns deserialized value"
  [bytes]
  (with-open [dis (java.io.ObjectInputStream.
                   (java.io.ByteArrayInputStream. bytes))]
    (.readObject dis)))

user> (= (range 10) (deserialize (serialize (range 10))))
true

There are values that cannot be serialized, e.g. Java streams and Clojure atom/agent/future, but it should work for most plain values, including Java primitives and arrays and Clojure functions, collections and records.
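To make that limit concrete, here is a minimal sketch. The helper name `can-serialize?` is made up for illustration, and the `serialize` wrapper is repeated so the snippet runs on its own. It shows that `serializable?` only inspects the outermost value: a map holding an atom passes the check but still fails when it is actually written.

```clojure
(import '(java.io ByteArrayOutputStream ObjectOutputStream NotSerializableException))

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize
  "Serializes value, returns a byte array (same wrapper as above)."
  [v]
  (let [buff (ByteArrayOutputStream. 1024)]
    (with-open [oos (ObjectOutputStream. buff)]
      (.writeObject oos v))
    (.toByteArray buff)))

(defn can-serialize?
  "Hypothetical helper: true if v can be written in full,
  false if any nested value is not Serializable."
  [v]
  (try
    (serialize v)
    true
    (catch NotSerializableException _
      false)))

(serializable? {:a 1})             ; => true, plain maps implement Serializable
(serializable? (atom 1))           ; => false, atoms do not
(serializable? {:state (atom 1)})  ; => true, the check only sees the map
(can-serialize? {:state (atom 1)}) ; => false, the nested atom makes the write fail
```

So if the data may contain reference types, it is probably safer to attempt the write and handle the exception than to rely on the shallow check.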

Whether you actually save anything depends on the data. In my limited testing on smallish data sets, serializing to text and to binary seem to take about the same time and space.

But for the special case where the bulk of the data is arrays of Java primitives, Java serialization can be orders of magnitude faster and save a significant chunk of space. (Quick test on a laptop, 100k random bytes: serialize 0.9 ms, 100kB; text 490 ms, 700kB.)
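A rough way to reproduce the size side of that comparison is sketched below. The text route here is an assumption on my part (printing the array as a vector of numbers and encoding it as UTF-8), so the exact text size will differ from the figure quoted above depending on how the text is produced. Timing is left out because it varies too much between machines.

```clojure
(import '(java.io ByteArrayOutputStream ObjectOutputStream))

(defn serialize
  "Serializes value, returns a byte array (same wrapper as above)."
  [v]
  (let [buff (ByteArrayOutputStream. 1024)]
    (with-open [oos (ObjectOutputStream. buff)]
      (.writeObject oos v))
    (.toByteArray buff)))

;; 100k random bytes, as in the quick test described above
(def big-array
  (let [arr (byte-array 100000)]
    (.nextBytes (java.util.Random. 42) arr)
    arr))

;; Binary route: Java serialization writes the raw bytes plus a small header
(def binary-size (alength (serialize big-array)))

;; Text route (assumed): print the contents as a vector of numbers, UTF-8 encoded
(def text-size (alength (.getBytes (pr-str (vec big-array)) "UTF-8")))

(println "binary bytes:" binary-size "text bytes:" text-size)
```

Each byte costs one byte in the binary form but several characters (digits, sign, separator) in the printed form, which is where the space difference comes from.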

Note that the (= new-data orig-data) test doesn't work for arrays (it delegates to Java's equals, which for arrays just tests whether it's the same object), so you may want/need to write your own equality function to test the serialization.

user> (def a (range 10))
user> (= a (range 10))
true
user> (= (into-array a) (into-array a))
false
user> (.equals (into-array a) (into-array a))
false
user> (java.util.Arrays/equals (into-array a) (into-array a))
true
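One possible equality function for round-trip tests is sketched here. The names `array?`, `normalize` and `data=` are hypothetical, and the approach (replace every array with a tagged vector of its contents, then use plain `=`) is just one option, not a standard library feature.

```clojure
(require '[clojure.walk :as walk])

(defn array? [x]
  (and (some? x) (.isArray (class x))))

(defn normalize
  "Replaces every array inside x with a tagged vector of its contents,
  so that = can compare the result by value."
  [x]
  (walk/postwalk
    (fn [v]
      (if (array? v)
        ;; keep the array class so a byte[] never equals an int[] with the same numbers
        [::array (class v) (mapv normalize v)]
        v))
    x))

(defn data=
  "Like =, but arrays nested in Clojure collections are compared by content."
  [a b]
  (= (normalize a) (normalize b)))

(def a {:name "x" :data (byte-array [1 2 3])})
(def b {:name "x" :data (byte-array [1 2 3])})

(= a b)     ; => false, the two arrays are different objects
(data= a b) ; => true, same contents
```

This copies the data, so it is meant for tests rather than for hot paths.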


Nippy is one of the best choices imho: https://github.com/ptaoussanis/nippy


Have you considered Google's protobuf? You might want to check the GitHub repository with the interface for Clojure.


If you don't have a schema ahead of time, serializing to text is probably your best bet. To serialize arbitrary data in general, you need to do a lot of work to preserve the object graph, and do reflection to see how to serialize everything...at least Clojure's printer can do a static, no-reflection lookup of the print-method for each item.
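For reference, the text route can be sketched in a few lines. The function names are illustrative, and reading back with `clojure.edn` (rather than the full reader) is my choice here because it does not evaluate code. Note that Java arrays do not print in a readable form, so they would need converting to vectors first.

```clojure
(require '[clojure.edn :as edn])

(defn text-serialize
  "Prints v with the Clojure printer and returns the UTF-8 bytes."
  [v]
  (.getBytes (pr-str v) "UTF-8"))

(defn text-deserialize
  "Reads a value back from UTF-8 bytes using the EDN reader."
  [^bytes bs]
  (edn/read-string (String. bs "UTF-8")))

(def sample {:name "Data Object" :other [1 2.5 :kw #{"a" "b"}]})

(= sample (text-deserialize (text-serialize sample))) ; => true
```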

Conversely, if you really want an optimized wire format, you need to define a schema. I've used Thrift from Java and protobuf from Clojure: neither is loads of fun, but it's not hideously onerous if you plan in advance.
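To illustrate what a fixed schema buys you, here is a hand-written sketch using only `java.io.DataOutputStream`. This is not Thrift or protobuf, and the schema (a `:name` string plus a `:data` byte array) and the function names are assumptions made up for the example.

```clojure
(import '(java.io ByteArrayOutputStream ByteArrayInputStream
                  DataOutputStream DataInputStream))

;; Assumed schema: {:name <string>, :data <byte array>}
;; Wire layout: UTF string with 2-byte length prefix, 4-byte array length, raw bytes.

(defn write-record
  "Encodes a record that follows the assumed schema into a byte array."
  [{:keys [name data]}]
  (let [buff (ByteArrayOutputStream.)
        ^bytes data data]
    (with-open [out (DataOutputStream. buff)]
      (.writeUTF out ^String name)
      (.writeInt out (alength data))
      (.write out data))
    (.toByteArray buff)))

(defn read-record
  "Decodes a byte array produced by write-record."
  [^bytes bs]
  (with-open [in (DataInputStream. (ByteArrayInputStream. bs))]
    (let [name (.readUTF in)
          len  (.readInt in)
          ^bytes data (byte-array len)]
      (.readFully in data)
      {:name name :data data})))

(def rec (read-record (write-record {:name "Data Object"
                                     :data (byte-array [1 2 3])})))
(:name rec)        ; => "Data Object"
(vec (:data rec))  ; => [1 2 3]
```

Because both sides already know the field order and types, nothing about the structure has to be written to the wire, which is why schema-based formats can be so compact.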
