
MySQL Cluster catching up with Cassandra?

Developer (https://www.devze.com), 2023-03-29 14:46. Source: Internet

I have recently been looking at NoSQL solutions for our upcoming, fairly large database. Cassandra looks good, but there are very few resources online about its newer releases: most blogs and articles cover version 0.6, even though it has since added support for Hadoop and Hive. On the other hand, MySQL Cluster is also specifically designed to run on a horizontally scaled setup using commodity servers.

We have been used to the relational model for years, and moving to Cassandra will require rewiring our brains, while the product is still not very mature and the community is not big enough to respond quickly to any particular problem. I checked the DataStax website (one of the professional support providers), and their forums are pretty much dead.

So, how do MySQL Cluster and Cassandra compare, setting aside the relational vs. non-relational question?

Although Cassandra is schemaless, it still provides fairly tabular features such as super columns and subcolumns, so a record can be looked up by multiple column values.

I have also tried my best to find out how Cassandra physically stores updates. For example, when a subcolumn of a row is edited and a large chunk of data is added, how is that record physically stored, and how is it still accessed quickly? In MySQL, columns have a fixed allocated length, so this is not a big issue.


Here are some areas where I suspect Cassandra has an advantage:

  • Excellent support for larger-than-memory data sets
  • Replication: Cassandra supports arbitrary numbers of fully-distributed replicas instead of just partitioned replicas (so, you don't have to have a number of nodes divisible by your replica count in Cassandra, and there are no corner cases to deal with around primary failover), best-in-class support for multiple datacenters, support for synchronous replication as well as asynchronous (important if you're concerned about full durability), and robust self-healing (hinted handoff, read repair, anti-entropy) to make sure you never have to blow away a backup replica and rebuild it from scratch
  • No locking during ALTER TABLE, index creation, etc
  • Substantially simpler and less error-prone administration (compare http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-online-add-node.html and http://wiki.apache.org/cassandra/Operations#Bootstrap). In particular, I would call your attention to how many client or other nodes need to be restarted in the Cassandra scenario: none.

To elaborate a little on the last point: most people who haven't actually run Cassandra on a multi-node cluster don't realize just how well it has been designed for this. For a two-minute taste, see Jake Luciani's demo.


To answer your physical storage question, the key feature that makes Cassandra writes fast is that they are append-only. That is, Cassandra only ever writes sequential blocks to disk; it doesn't need to do any slow seeks to random disk locations during a write.

When a column is updated, two things happen: the write is appended to the commit log (for failure recovery), and the in-memory Memtable is updated. Once the Memtable is full, it is flushed out to disk as a new SSTable. Thus, the length of the data doesn't matter, since you're not trying to fit it into a fixed-length disk structure.

SSTables are read-only - you never go back and overwrite an old value on an update, you just write new ones. On a read, Cassandra first looks in the Memtable for the key. If it doesn't find it, Cassandra scans the SSTables in order from newest to oldest and stops when it finds the key. This gives you the most recent value.
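
The write and read paths described above can be sketched as a toy in-memory model. This is only an illustration of the idea, not Cassandra's actual code: the class name `TinyLSM`, the `memtable_limit` parameter, and the use of Python lists and tuples in place of on-disk files are all assumptions made for the example, and a real implementation also has to deal with timestamps and deletions.

```python
class TinyLSM:
    """Toy model of the commit log / Memtable / SSTable flow (hypothetical names)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # mutable, in-memory
        self.sstables = []     # immutable snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append, for recovery
        self.memtable[key] = value            # 2. update the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the Memtable as a sorted, read-only SSTable; it is never rewritten.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest to oldest
            for k, v in table:
                if k == key:
                    return v                   # first hit is the most recent value
        return None


db = TinyLSM(memtable_limit=2)
db.write("row1", "short")
db.write("row2", "x")                          # Memtable full: flushed as SSTable 0
db.write("row1", "a much longer value " * 10)  # the new length does not matter
db.write("row3", "y")                          # flushed as SSTable 1
```

Note that the old, short value of `row1` is still sitting untouched in the first SSTable; the longer value simply lands in a newer one, which is why no fixed-length slot is needed.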

There are a few optimizations as well. Each SSTable has an associated Bloom filter for its keys, which is a compact probabilistic index that can produce false positives but never false negatives. If the key is not in the Bloom filter, you can safely skip that SSTable as it is guaranteed not to contain the key, although you may occasionally read an SSTable that you didn't have to.
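
To make the "false positives but never false negatives" property concrete, here is a minimal Bloom filter sketch. The sizes (`num_bits`, `num_hashes`) and the salted-SHA-256 hashing scheme are arbitrary choices for illustration and are not what Cassandra uses internally.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; parameters and hashing scheme are illustrative only."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bf = BloomFilter()
for key in ("row1", "row2", "row3"):
    bf.add(key)
```

On a read, an SSTable whose filter returns `False` for the key can be skipped without touching disk. A `True` still requires looking inside, and will occasionally turn out to be a miss.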

When you get too many SSTables, they are merged together into a bigger one in a process called compaction. Essentially this does a big merge sort on the SSTables. This lets Cassandra reclaim the space for values that have been overwritten or deleted, and defragment rows that were spread across multiple SSTables.
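
A rough sketch of compaction as a merge of sorted runs follows. It assumes each SSTable is a list of `(key, value)` pairs sorted by key, and it uses a hypothetical `TOMBSTONE` marker for deletions that is dropped immediately, which is a simplification of what a real distributed store has to do.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


def compact(sstables):
    """Merge sorted SSTables (given oldest to newest) into one sorted table."""
    # Tag entries with the negated table index so that, for equal keys,
    # the entry from the newest table sorts first.
    runs = [
        [(key, -gen, value) for key, value in table]
        for gen, table in enumerate(sstables)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*runs, key=lambda entry: entry[:2]):
        if key == last_key:
            continue                 # older, overwritten version: reclaim its space
        last_key = key
        if value is not TOMBSTONE:   # deleted keys disappear from the output
            merged.append((key, value))
    return merged


old = [("a", "1"), ("b", "2"), ("c", "3")]
new = [("b", "22"), ("c", TOMBSTONE), ("d", "4")]
compacted = compact([old, new])
```

Here `b` keeps only its newer value and `c` is removed entirely, so the single output table is smaller than the two inputs combined.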

See http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/ and http://wiki.apache.org/cassandra/MemtableSSTable for more information.


First, a disclaimer: I work as part of the MySQL Cluster product team.

If you are looking at Cluster, it would be worth starting with the latest 7.2 Development Release, which includes new capabilities that significantly enhance JOIN performance, as well as a new memcached interface that bypasses the SQL layer: http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html

If you are already familiar with MySQL, the following documentation highlights the differences between InnoDB and the current GA 7.1 release: http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndb-innodb-workloads.html

While these don't provide direct comparisons with Cassandra, they do at least give you the latest information on Cluster on which to base any comparison.


Another option these days is a relational model on Cassandra with playORM. As long as you partition your really big tables, you can do joins and everything else you are familiar with using Scalable SQL, like so:

@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS p(:partId) select p FROM TABLE as p INNER JOIN p.security as s where s.securityType = :type and p.numShares = :shares"),

NOTE: TABLE here is a Trades table, and p.security references the Security table. Trades is partitioned, so it can have unlimited partitions; the Security table is smaller, so it is not partitioned. You can still do all the Scalable SQL joins you want.
