I know we can grab information (with PHP) from any site and build our own site from it.
I'm talking about parsing additional content such as movie information (dates, budget, people, etc.) or video file properties from YouTube (size, duration).
I'm interested in how to implement this grabbing process for big sites and large amounts of information.
There seem to be several problems:
- Script execution time. We could write a script that walks through the pages one after another and pushes the content into our MySQL database, but with a large number of pages the run will exceed the execution time limit that ordinary hosting allows (usually about 30 seconds), so the script will die partway through.
- Memory usage. The script will consume a lot of memory while parsing a large number of pages.
- Anti-DDoS protection on the target site (too many requests from one IP address).
The main point of this question is how to get around all these obstacles and build a script that can run all day long without errors.
Are there any other pitfalls we might hit along the way?
Your thoughts?
I will answer this assuming that what you are doing is legal and is going to add value to data that is already readily available. If that is the case, you can contact the sites in question and confirm with them that your screen scraping won't get blocked as a DoS attack. You can give them your IP addresses, etc. and everything will be fine.
There are many ways to make sure your process won't time out or use too much memory. That just comes down to the design of your system. If the content of your site won't be original, please try to make the solution your own at least :) However, if you run into specific issues during your implementation, I'm sure you could get answers to focused questions.
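As one illustration of such a design, here is a minimal sketch of a resumable batch runner. It assumes the script is started repeatedly (for example from cron via the PHP command line, where `max_execution_time` defaults to 0) and that each run handles only a small slice of the URL list. The names `runBatch`, `$stateFile`, `$fetch` and `$store` are hypothetical and not taken from the question.

```php
<?php
declare(strict_types=1);

/**
 * Process one small batch of URLs and remember where to resume.
 * Hypothetical helper: $fetch downloads a page, $store saves it (e.g. to MySQL).
 * Returns the offset at which the next run will start.
 */
function runBatch(
    array $urls,
    string $stateFile,
    int $batchSize,
    callable $fetch,
    callable $store,
    int $delaySeconds = 1
): int {
    // Resume from the offset saved by the previous run, or start at 0.
    $offset = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
    $batch = array_slice($urls, $offset, $batchSize);

    foreach ($batch as $url) {
        $html = $fetch($url);
        $store($url, $html);
        unset($html); // release the page before fetching the next one

        // Save progress after every page so a crash loses at most one page.
        $offset++;
        file_put_contents($stateFile, (string) $offset, LOCK_EX);

        if ($delaySeconds > 0) {
            sleep($delaySeconds); // throttle requests to the remote site
        }
    }

    return $offset;
}
```

Because each run only touches `$batchSize` pages and keeps one page in memory at a time, both the execution time and the memory use stay bounded regardless of how long the full URL list is.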
Edit for clarification
My answer to your question is:
1) Check with the sites you wish to scrape. If they have no problem with it, they will not block your IP address, and you can agree with them on a way to make sure this does not happen. Either use a static IP address or, if your IP address may change, agree on a particular user agent string.
2) Once you've done (1), start developing a solution. Execution time, etc. shouldn't be a problem, so if you encounter particular issues with your solution as you are coding it, come back to Stack Overflow with a question focused on that one issue.
To be clear, if you cannot or will not contact the sites you wish to scrape, please tell us all now.
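For the user agent option in step 1, a rough sketch of sending an agreed string with every request might look like the following. It uses PHP's built-in HTTP stream wrapper; the function names `makeScraperContext` and `fetchPage` and the example user agent value are assumptions made for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Build a stream context that identifies the scraper with an agreed user agent.
 * Hypothetical helper; the user agent value is whatever you arranged with the site.
 */
function makeScraperContext(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

/**
 * Fetch one page with the given context.
 * Returns the body, or null if the request failed.
 */
function fetchPage(string $url, $context): ?string
{
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

A call such as `fetchPage('https://example.com/page', makeScraperContext('MyScraper/1.0 (you@example.com)'))` would then let the site operator recognize your traffic even if your IP address changes.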
I'm talking about parsing some additional content like movie information (dates, budget, persons, etc) or video file properties from youtube (size, duration).
Both IMDb and YouTube have APIs for getting data from their sites, so there is no need to scrape.
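As a small sketch of the API route for the duration property: the YouTube Data API v3 `videos` endpoint returns a video's length in `contentDetails.duration` as an ISO 8601 string such as `PT4M13S`. The helper names `youtubeVideoUrl` and `isoDurationToSeconds` below are hypothetical, and you would need your own API key.

```php
<?php
declare(strict_types=1);

/**
 * Build the request URL for a video's contentDetails.
 * Hypothetical helper; $apiKey is your own YouTube Data API key.
 */
function youtubeVideoUrl(string $videoId, string $apiKey): string
{
    return 'https://www.googleapis.com/youtube/v3/videos?' . http_build_query([
        'part' => 'contentDetails',
        'id'   => $videoId,
        'key'  => $apiKey,
    ]);
}

/**
 * Convert an ISO 8601 duration (days, hours, minutes, seconds) to seconds.
 * Assumes the duration has no year or month components.
 */
function isoDurationToSeconds(string $duration): int
{
    $interval = new DateInterval($duration);

    return $interval->d * 86400
        + $interval->h * 3600
        + $interval->i * 60
        + $interval->s;
}
```

The JSON response from that URL can be decoded with `json_decode`, and the duration string passed to `isoDurationToSeconds` before storing it in MySQL.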
As @paulHadfield said, before you do anything, you need to ask the owner(s) of the website you want to scrape so you won't be mistaken for a DoS attack.
And what exactly are you trying to store in mysql?