Creating a movie db and I dont like the idea of giving each actor/actress and also each tag its own row as if there are 10 million moives total, each has a cast of atleast 20-30 people we will have 200-300 million rows in the table.
And it gets more complex with tags which can be unlimited per movie. So how to best store these 3 items? Ideally these can be modeled as a Many to Many but 开发者_如何学Cstill it will have hundreds of millions of rows. Any better suggestions on storing these? I am using MySQL.
I would dump it all in a textfile but I need to link actors between movies and do some analytics also and allow users to rate actors find movies by tag, etc so need to use a DB.
10 million movies seems pretty ambitious. IMDb's current statistics show that they have less than 1.8M titles and around 3.9M people.
Having said that, I see no problem with creating a table of titles, a table of actors, and a junction table to resolve the many-to-many relationship between the two. The same holds true for tags.

It sounds perhaps a bit of premature optimization here. You could denormalize all actors into a TEXT column of some kind onto the Movie table, but your performance + search would suffer, as well as losing all benefits of relational data.
Suggest to keep the normalized schema, as you were originally thinking:
Movie (ID)
Actor (ID)
Tag (ID) --horror, comedy, etc.
MovieActor (MovieID, ActorID)
MovieTag (MovieID, TagID)
- Create indexes as per normal on the associative entities: MovieActorandMovieTag.
- Load some dummy data in a Test environment. 10 million movies with 100 million actors with 1 million tags. Create associative entries for each as required.
- Baseline and performance test.
- Horizontal partitioning (sharding) if your performance metrics require more performance.
Regardless of the number of movies, or whether the data is DNA sequences: implement a design, test it, judge its performance based on your requirements (user acceptance, SLA, etc)
What's the reason for your aversion to the millions of rows? A perceived performance issue?
It's going to have hundreds of millions of relations somewhere. You do have to capture the mapping between actor and film and as you say, there are 200-300 million of those (although I don't believe there are 10 million movies in existence?)
If you really wanted, you could (for example) pack the ids for actors for a film into multiple columns (or into one column) but that would make searching unpleasant.
10 million movies w/ 20 to 30 cast members each (although the number sound higher than real life) will invariably lead to 200-300 million associations. If you're storing your data in a relational database, each association will naturally be one row in a table linking movies to actors. Each row will be very small (two columns - movie PK and actor PK; possibly an extra surrogate key column); the bulk of the data will be stored in the movies and actors table.
Any other solution (in an SQL database) will store the same amount of data in a less optimal format.
 
         
                                         
                                         
                                         
                                        ![Interactive visualization of a graph in python [closed]](https://www.devze.com/res/2023/04-10/09/92d32fe8c0d22fb96bd6f6e8b7d1f457.gif) 
                                         
                                         
                                         
                                         加载中,请稍侯......
 加载中,请稍侯......
      
精彩评论