Apache Cassandra Data Schema for Twitter Streaming API

Developer https://www.devze.com 2023-03-29 08:37 Source: web

I am aware of Twissandra, which is an example Twitter clone using Cassandra, but I was interested to see if anyone has shared a Cassandra schema not to clone Twitter but to use for storing t

It very much depends on what sort of queries you want to run against the data after you have ingested it. I see from your previous question "Dumping Twitter Streaming API tweets..." that you probably just want to do big batch processing on it.

If this is the case, you just need to worry about load balancing: making sure each node in the cluster handles 1/n of the write load and contains 1/n of the data. Using the RandomPartitioner and inserting one row per tweet, with the status id as the row key, will achieve this.
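To see why hashing the status id spreads the load, here is a minimal sketch in plain Python. It is a simplified model, not Cassandra's exact token computation: it assumes tokens are derived from an MD5 hash of the row key, that the token space is split into equal ranges with one range per node, and that there is no replication. The function names (token_for, node_for, distribution) are made up for illustration.

```python
import hashlib
from collections import Counter

# Assumed size of the token space in this simplified model.
TOKEN_SPACE = 2 ** 127


def token_for(row_key: str) -> int:
    """Map a row key to a token by hashing it with MD5."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def node_for(row_key: str, node_count: int) -> int:
    """Return the index of the node whose token range contains the key."""
    range_size = TOKEN_SPACE // node_count
    # min() guards the last few tokens when TOKEN_SPACE is not evenly divisible.
    return min(token_for(row_key) // range_size, node_count - 1)


def distribution(status_ids, node_count: int) -> Counter:
    """Count how many rows land on each node, one row per status id."""
    return Counter(node_for(str(status_id), node_count) for status_id in status_ids)


# Sequential status ids still scatter evenly, because the hash decides placement.
for node, rows in sorted(distribution(range(1_000_000_000, 1_000_010_000), 4).items()):
    print(f"node {node}: {rows} rows")
```

Even though the ids are consecutive, each node ends up with roughly a quarter of the rows. With an order-preserving partitioner the same ids would all land on one node.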

However, if you want to do queries like "give me all tweets for a given user" you will need a slightly more complicated schema, as the schema suggested above will require you to scan all the data. You could insert multiple tweets per row, the row key being the userid, the column key being the tweet id and the value being the tweet. Then you could use get_slice to answer that query.
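The shape of that wide-row layout can be sketched with an in-memory stand-in. This is not the Cassandra client API: the class UserTimeline and its methods are hypothetical, tweet ids are assumed to be integers, and the get_slice method here only mimics the idea of a column slice (an inclusive start/finish range plus a count limit) over columns kept sorted by tweet id.

```python
import bisect
from collections import defaultdict


class UserTimeline:
    """In-memory model of a column family with one wide row per user.

    Row key: user id. Column name: tweet id. Column value: the tweet.
    Columns within a row are kept sorted by tweet id.
    """

    def __init__(self):
        self._rows = defaultdict(list)  # user_id -> sorted [(tweet_id, tweet)]

    def insert(self, user_id, tweet_id, tweet):
        """Add one column to the user's row, overwriting an existing tweet id."""
        columns = self._rows[user_id]
        idx = bisect.bisect_left(columns, (tweet_id,))
        if idx < len(columns) and columns[idx][0] == tweet_id:
            columns[idx] = (tweet_id, tweet)
        else:
            columns.insert(idx, (tweet_id, tweet))

    def get_slice(self, user_id, start=None, finish=None, count=100):
        """Return up to `count` columns with start <= tweet_id <= finish."""
        columns = self._rows.get(user_id, [])
        lo = 0 if start is None else bisect.bisect_left(columns, (start,))
        hi = len(columns) if finish is None else bisect.bisect_left(columns, (finish + 1,))
        return columns[lo:hi][:count]


timeline = UserTimeline()
timeline.insert(42, 103, "third tweet")
timeline.insert(42, 101, "first tweet")
timeline.insert(42, 102, "second tweet")
print(timeline.get_slice(42, start=102))
```

Answering "all tweets for a given user" touches a single row, so no scan over the whole data set is needed. The trade-off is that a very active user produces a very wide row that lives on one node.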

A good (somewhat related) blog post: http://blog.insidesystems.net/basic-time-series-with-cassandra