This shows you the differences between two versions of the page.
start [2017/10/18 10:59] zoza [Scraping and mining Dezeen articles] |
start [2018/04/18 08:01] zoza |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== POSTDOCTORAL RESEARCH ====== | ====== POSTDOCTORAL RESEARCH ====== | ||
+ | |||
+ | ===== Python and SOM ===== | ||
+ | |||
+ | - Python module by Vahid Moosavi of CAAD: **sompy** | ||
+ | |||
+ | - Another Python SOM implementation, **somoclu**: https://somoclu.readthedocs.io/en/stable/index.html | ||
===== Scraping and mining twitter streams ===== | ===== Scraping and mining twitter streams ===== | ||
Line 22: | Line 28: | ||
Run scrapy directly from the shell: | Run scrapy directly from the shell: | ||
- | <code>$ scrapy startproject dezeen # start a project | + | <code>$ scrapy startproject dezeen # start a project</code> |
+ | |||
+ | Detailed instructions here: https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project | ||
+ | |||
+ | Create a _spider_ in the folder dezeen/dezeen/spiders/: a Python file containing a class that declares its name. This name is used to call the spider from the console: | ||
+ | |||
+ | <code>$ scrapy crawl spider_name</code> | ||
+ | |||
+ | It is also important to declare the fields that will be scraped from each page. This is done in the dezeen/items.py file, e.g. as follows (the class is already declared when you start the project): | ||
+ | |||
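+ | <code python>from scrapy import Item, Field  # needed for the class below | ||
+ | class DezeenItem(Item): | ||
+ |     title = Field()        # article title | ||
+ |     link = Field()         # article URL | ||
+ |     description = Field()  # short summary text | ||
+ | </code> | ||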
+ | |||
+ | These fields are later used as keys of the item dictionary (e.g. item['link']). | ||
====== DOCTORAL RESEARCH ====== | ====== DOCTORAL RESEARCH ====== | ||