start [2017/01/12 14:05] zoza [Scraping and mining twitter streams]
start [2018/04/18 08:01] zoza
====== POSTDOCTORAL RESEARCH ======

===== Python and SOM =====

  - **sompy**, a Python SOM module by Vahid Moosavi of CAAD
  - **somoclu**, another SOM implementation for Python: https://somoclu.readthedocs.io/en/stable/index.html
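
Both libraries implement the same core algorithm. As a rough illustration (a minimal sketch in plain numpy, not how sompy or somoclu are actually structured internally), one training pass of a SOM looks like this:

<code python>import numpy as np

def train_som(data, grid_w=5, grid_h=5, n_iter=100, lr0=0.5, sigma0=2.0, seed=0):
    """Tiny SOM sketch: for each sample, find the best-matching unit (BMU)
    and pull nearby units toward the sample with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    n_units = grid_w * grid_h
    weights = rng.random((n_units, data.shape[1]))
    # (x, y) position of each unit on the map, used for neighbourhood distances
    coords = np.array([(i % grid_w, i // grid_w) for i in range(n_units)], dtype=float)
    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)        # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)  # shrinking neighbourhood radius
        for x in data:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))  # neighbourhood function
            weights += lr * h[:, None] * (x - weights)
    return weights
</code>

The libraries above add the parts that matter in practice (batch training, large maps, visualisation), so this sketch is only for understanding the update rule.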
===== Scraping and mining twitter streams =====
  - **mine the tweets** using [[mine-tweets-py|python mining script]]
A useful guide to Twitter text mining in Python: https://marcobonzanini.com/2015/03/23/mining-twitter-data-with-python-part-4-rugby-and-term-co-occurrences/

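The linked guide builds term co-occurrence counts over tweets. The core idea fits in a few lines of plain Python (a hedged sketch with made-up example tweets, not the guide's exact code, which also does proper tokenisation and stop-word removal):

<code python>from collections import Counter
from itertools import combinations

def term_cooccurrences(tweets):
    """Count how often each pair of terms appears in the same tweet."""
    com = Counter()
    for tweet in tweets:
        # unique, sorted terms per tweet so each pair has one canonical order
        terms = sorted(set(tweet.lower().split()))
        com.update(combinations(terms, 2))
    return com

counts = term_cooccurrences(["rugby world cup", "rugby cup final", "world cup final"])
print(counts[("cup", "rugby")])  # "cup" and "rugby" co-occur in 2 tweets
</code>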
===== Scraping and mining Dezeen articles =====

  * with **scrapy**

The setup: Python 3 in a conda environment:
<code>$ conda create -n bots python=3.4             # create a virtual environment named "bots"
$ source activate bots                        # activate the environment; check which is active with: conda info --envs
$ conda install -n bots -c conda-forge scrapy # install scrapy in the named environment
</code>

Run scrapy directly from the shell:
<code>$ scrapy startproject dezeen    # start a project</code>

Detailed instructions here: https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project

Create a //spider// in the folder dezeen/dezeen/spiders/. In it you will write a class that declares its name; this name is then used to call the spider from the console:

<code>$ scrapy crawl spider_name</code>
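
A minimal spider of that shape might look like the following (the start URL and CSS selectors are illustrative assumptions, not the actual Dezeen page markup):

<code python>import scrapy

class DezeenSpider(scrapy.Spider):
    name = "dezeen"                           # the name used by: scrapy crawl dezeen
    start_urls = ["https://www.dezeen.com/"]  # illustrative start page

    def parse(self, response):
        # hypothetical selectors -- adjust to the real page structure
        for article in response.css("article"):
            yield {
                "title": article.css("h2 a::text").get(),
                "link": article.css("h2 a::attr(href)").get(),
            }
</code>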

It is also important to declare fields for the pages that will be scraped. This is done in the dezeen/items.py file, e.g. as follows (the class skeleton is already created when you start the project).

<code python>from scrapy import Item, Field

class DezeenItem(Item):
    title = Field()
    link = Field()
    description = Field()
</code>

These fields will later be used as keys of the item dictionary (e.g. item['link']).
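
Once declared, an Item instance behaves like a dictionary, so scraped values are read and written by field name (the example values below are made up):

<code python>from scrapy import Item, Field

class DezeenItem(Item):
    title = Field()
    link = Field()
    description = Field()

item = DezeenItem(title="A hypothetical article", link="https://www.dezeen.com/example/")
print(item["link"])   # dict-style access by field name
</code>

Assigning to a field that was not declared in the class raises a KeyError, which catches typos in field names early.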
====== DOCTORAL RESEARCH ======
====== other ======

[[ways-to-run-python|ways to run Python]]

[[server maintenance]]