
What is the best way to continuously export information from a Scrapy crawler to a Django application database? [duplicate]

https://www.devze.com · 2023-03-24 10:54 · Source: web
This question already has answers here:

  • Can one use the Django database layer outside of Django? (12 answers)
  • Access django models inside of Scrapy (8 answers)

Closed 2 years ago.

I am trying to build a Django app that functions sort of like a store. Items are scraped from around the internet and continuously update the Django project database over time (say, every few days). I am using the Scrapy framework to perform the scraping, and while there is an experimental DjangoItem feature, I would rather stay away from it because it is unstable.

Right now my plan is to create XML files of crawled items with Scrapy's XMLItemExporter (docs here), and load those into the Django project as XML fixtures with loaddata (docs here). This seems okay because if either of the two processes screws up, there is a file intermediary between them. Modularizing the application as a whole also doesn't seem like a bad idea.
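The exporter half of that plan is a standard Scrapy item pipeline. A minimal sketch, where the output path `items.xml` is a hypothetical choice (in practice you would likely timestamp it per crawl):

```python
from pathlib import Path


class XmlExportPipeline:
    """Write every scraped item to an XML file that a separate
    loaddata-style step can pick up later (the file-intermediary
    plan described above)."""

    def open_spider(self, spider):
        # Imported here rather than at module level so the class
        # itself can be defined without Scrapy on the path.
        from scrapy.exporters import XmlItemExporter

        self.file = Path("items.xml").open("wb")
        self.exporter = XmlItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```

One caveat: `XmlItemExporter`'s default `<items><item>…</item></items>` layout is not the `<django-objects>` schema that `loaddata` expects, so a small transform step (or a custom exporter) is still needed between the two processes.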

Some concerns are:

  • That these files might be too large to read into memory for Django's loaddata.
  • That I am spending too much time on this when there might be a better or easier solution, such as exporting directly to the database, which is MySQL in this case.
  • No one seems to have written about this process online, which is strange considering Scrapy is an excellent framework to plug into a Django app in my opinion.
  • There is no definitive guide to manually creating Django fixtures in Django's docs; the documentation seems geared more toward dumping and reloading fixtures from the app itself.
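On the last point, the fixture XML that `loaddata` consumes is simple enough to emit by hand. A stdlib-only sketch, assuming a hypothetical `store.item` model whose field names and types you would swap for your own:

```python
import xml.etree.ElementTree as ET


def items_to_fixture(items, model_label):
    """Render scraped items as Django's <django-objects> fixture XML.

    `items` is a list of {field_name: (field_type, value)} dicts; the
    model label and field types must match your actual Django models.
    """
    root = ET.Element("django-objects", version="1.0")
    for pk, item in enumerate(items, start=1):
        obj = ET.SubElement(root, "object", model=model_label, pk=str(pk))
        for name, (ftype, value) in item.items():
            field = ET.SubElement(obj, "field", name=name, type=ftype)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")


fixture = items_to_fixture(
    [{"name": ("CharField", "Widget"), "price": ("DecimalField", "9.99")}],
    "store.item",
)
```

Written to a file, this loads with `python manage.py loaddata items.xml`. Note that hard-coded sequential pks will overwrite existing rows on each reload, so for a continuously updated store a natural key (or a direct-to-ORM pipeline) is safer.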

The existence of the experimental DjangoItem suggests that Scrapy + Django is a popular enough combination for there to be a good solution here.

I would greatly appreciate any solutions, advice, or wisdom on this matter.


This question is a bit old already, but I'm currently dealing with proper integration of Django + Scrapy as well. My workflow is the following: I've set up Scrapy as a Django management command as described in this answer. From there, a simple Scrapy pipeline saves each scraped item into Django's database using the ORM's QuerySet methods. That's all. I'm currently using SQLite for the database and it works like a charm. Maybe this is still helpful for someone.
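A minimal sketch of such a pipeline, assuming a hypothetical `store.models.Item` model with `url`, `name`, and `price` fields, and Django already configured (e.g. via the management-command setup above) by the time the spider starts:

```python
class DjangoWriterPipeline:
    """Persist each scraped item through the Django ORM."""

    model = None  # resolved lazily so Django is configured first

    def open_spider(self, spider):
        if self.model is None:
            # Hypothetical app/model; swap in your own.
            from store.models import Item
            self.model = Item

    def process_item(self, item, spider):
        # update_or_create keyed on the URL deduplicates across
        # re-crawls: existing rows are updated, not duplicated.
        self.model.objects.update_or_create(
            url=item["url"],
            defaults={"name": item["name"], "price": item["price"]},
        )
        return item
```

Keying `update_or_create` on a natural field like the URL is what makes the "update every few days" crawl idempotent rather than append-only.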


You can use django-dynamic-scraper to create and manage Scrapy scrapers with easy access to Django models. So far I have not run into any problem that plain Scrapy can solve but it can't.

Django-dynamic-scraper documentation

