python – Run scrapy from flask in another directory

I want to run Scrapy from Flask in another directory. My directory structure looks like this:

carchatbot
|__carscrape
|     |__carscrape
|           |__spiders
|                 |__cars_spider.py (my spider)
|__bot.py (my flask)

My spider runs fine when I run it directly in the terminal with "scrapy crawl new_car", but I want to run the spider every time someone visits the website, so I can show the user the latest information.

I have already tried ScrapyRT, crochet, CrawlerProcess, CrawlerRunner, etc., but I still can't make my spider run automatically whenever I load the website. This is part of my Flask app:

from flask import Flask, request, jsonify, render_template
import json
import random
import dialogflowpy_webhook

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/webhook', methods=['POST'])
def webhook():
    req = request.get_json(silent=True, force=True)
    res = makeWebhookResult(req)  # defined elsewhere in the app
    return res

if __name__ == "__main__":
    app.run()

What method can be used in my case?
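One pattern that is often suggested for this is to start crochet's reactor thread once and wrap a CrawlerRunner crawl in a blocking call. Below is a minimal sketch, assuming the spider class can be imported from the carscrape package; the class name NewCarSpider, the import path, and the timeout value are assumptions, not taken from the question:

from crochet import setup, wait_for
from flask import Flask, render_template
from scrapy.crawler import CrawlerRunner

from carscrape.carscrape.spiders.cars_spider import NewCarSpider  # hypothetical class name

setup()                  # start crochet's reactor thread once, at import time
app = Flask(__name__)
runner = CrawlerRunner()

@wait_for(timeout=60.0)  # block the Flask view until the crawl's Deferred fires
def run_spider():
    return runner.crawl(NewCarSpider)

@app.route('/')
def index():
    run_spider()         # re-crawl on every page load (slow; consider caching results)
    return render_template('index.html')

if __name__ == "__main__":
    app.run()

Because CrawlerRunner reuses crochet's reactor instead of starting its own, this avoids the "ReactorNotRestartable" error that CrawlerProcess typically hits when called from a web request; whether a full crawl per page load is acceptable latency-wise is a separate question.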

javascript – Scrapy not pulling stock name, price, percentage from Business Insider

I’m trying to pull the ‘Name’, ‘Latest Price’, and ‘%’ fields for each stock from the following site:
https://markets.businessinsider.com/index/components/s&p_500

However, I get no data scraped even though I’ve confirmed that my XPaths work in the Chrome console for those fields. I’ve had an issue with Yahoo! Finance which uses ReactJS, but as far as I can tell that’s not the case with this webpage. Is there a separate issue here with JavaScript? If someone could help me understand how to detect whether ReactJS / JavaScript is being used on a page, etc. and how to deal with that when using Scrapy, that would be super helpful. Thanks in advance!

For reference, I’ve been using this guide:
https://realpython.com/web-scraping-with-scrapy-and-mongodb/
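One quick way to tell whether the data is in the HTML that Scrapy actually receives, rather than injected later by JavaScript, is to inspect the page in scrapy shell. This is a sketch using standard shell helpers; the XPath is simply adapted from the question:

# run:  scrapy shell "https://markets.businessinsider.com/index/components/s&p_500"
view(response)        # opens the downloaded HTML in a browser; if the table is
                      # visible here, the data is not JavaScript-rendered
response.xpath('//*[@id="index-list-container"]//table//tr').getall()[:3]
'index-list-container' in response.text   # quick check that the container exists at all

A common gotcha with XPaths copied from Chrome DevTools is that browsers insert <tbody> elements that may not exist in the raw HTML, so an expression can match in the console but return nothing in Scrapy.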

items.py:

from scrapy.item import Item, Field

class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()

investment_spider.py:

from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem

class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
            "https://markets.businessinsider.com/index/components/s&p_500",
            ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')

        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]

            yield item

output from console:

2020-05-26 00:08:32 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: investment)
2020-05-26 00:08:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform macOS-10.15.4-x86_64-i386-64bit
2020-05-26 00:08:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-26 00:08:32 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'investment',
 'NEWSPIDER_MODULE': 'investment.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['investment.spiders']}
2020-05-26 00:08:32 [scrapy.extensions.telnet] INFO: Telnet Password: 5680517a27f7223b
2020-05-26 00:08:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-05-26 00:08:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-26 00:08:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-26 00:08:32 [scrapy.middleware] INFO: Enabled item pipelines:
['investment.pipelines.MongoDBPipeline']
2020-05-26 00:08:32 [scrapy.core.engine] INFO: Spider opened
2020-05-26 00:08:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-26 00:08:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 22656,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.101634,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 5, 26, 4, 8, 33, 968318),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 52027392,
 'memusage/startup': 52027392,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 5, 26, 4, 8, 32, 866684)}
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)

python: PyCharm cannot use scrapy even after installing it using Anaconda

I installed Anaconda and installed Scrapy through it. Now, when I try to start a new Scrapy project from PyCharm, it says "'scrapy' is not recognized as an internal or external command, operable program or batch file". What should I do?
I have wasted many hours on this. Kindly help.
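That error usually means the terminal (or PyCharm's embedded terminal) is not using the Anaconda environment where Scrapy was installed. A minimal sketch of the usual checks, assuming Scrapy went into the base conda environment (adjust the environment name if you created a separate one); in PyCharm you would also point the project interpreter at the same Anaconda Python:

conda activate base
scrapy version                              # should print the installed Scrapy version
python -m scrapy startproject myproject     # running Scrapy as a module can work even
                                            # when the 'scrapy' wrapper is not on PATH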

Python: with Scrapy, how to combine the printed results of two "titles"


python – Problem with CrawlSpider in Scrapy

I'm trying to extract information from elblogdelnarco.com using the CrawlSpider class, but I can't get it to work. Here is my code:

from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
import scrapy

class MisItems(Item):
    title = Field()
    content = Field()

class MySpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["elblogdelnarco.com"]
    start_urls = ["https://elblogdelnarco.com/2019/06/21/la-vez-que-sicarios-del-cjng-de-el-mencho-presumieron-su-poder-en-calles-de-ciudad-de-mexico-video/"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//a[@class='next page-numbers']/@href",))),
        Rule(LinkExtractor(restrict_xpaths=("//h2[@class='title front-view-title']/a/@href",)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = ItemLoader(MisItems(), response)
        item.add_xpath("title", "//h1[@class='title single-title entry-title']/text()")
        item.add_xpath("content", "(//div[@class='thecontent']/p/b)[1]/text()")
        yield item.load_item()

python – Processing images without downloading using Scrapy Spiders

I'm trying to use a Scrapy Spider to solve a problem (a programming question from HackThisSite):

(1) I have to log in to a website, giving a username and a password (already done)
(2) After that, I have to access an image with a specific URL (the image is only accessible to registered users)
(3) Then, without downloading the image, I have to read its pixels and execute a function on the information
(4) And the result of the function will fill a form and send the data to the server of the website (I already know how to do this step)

So, I can ask again: would it be possible (using a spider) to read an image accessible only to registered users and process it in the spider code?

I have looked into different approaches; using item pipelines is not a good fit (I do not want to download the file to disk).

The code that I already have is:

from scrapy import Spider, FormRequest, Request
from scrapy.utils.response import open_in_browser

class ProgrammingQuestion2(Spider):

    name = 'p2'
    start_urls = ['https://www.hackthissite.org/']

    def parse(self, response):

        formdata_hts = {'username': ...,  # credentials elided in the original
                        'password': ...,
                        'btn_submit': 'Login'}

        return FormRequest.from_response(response,
            formdata=formdata_hts, callback=self.redirect_to_page)

    def redirect_to_page(self, response):

        yield Request(url='https://www.hackthissite.org/missions/prog/2/',
                      callback=self.solve_question_2)

    def solve_question_2(self, response):

        open_in_browser(response)
        img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
        # What can I do here?

I hope to solve this problem using the functions of Scrapy, otherwise it would be necessary to log into the website (sending the form data) again.
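One possible approach (a minimal sketch, not a confirmed solution for this mission: the spider name, callback names, and the use of Pillow are assumptions, only the URLs come from the question) is to request the image from a callback and decode the response body in memory, so nothing is ever written to disk:

import io

from PIL import Image            # assumption: Pillow is installed
from scrapy import Request, Spider

class ImagePixelsSpider(Spider):
    name = 'image_pixels_sketch'
    start_urls = ['https://www.hackthissite.org/missions/prog/2/']

    def parse(self, response):
        # request the image itself; login cookies set earlier in the crawl are
        # reused automatically by Scrapy's CookiesMiddleware
        yield Request('https://www.hackthissite.org/missions/prog/2/PNG',
                      callback=self.process_image)

    def process_image(self, response):
        # response.body is the raw PNG as bytes -- nothing is saved to a file
        img = Image.open(io.BytesIO(response.body))
        pixels = list(img.getdata())   # pixel values, fully in memory
        self.logger.info('image size %s, first pixel %s', img.size, pixels[0])
        # ...run the challenge's function over `pixels` and build the form data here...

Within the question's own spider, the same two callbacks could simply be added after the login FormRequest instead of using a separate spider.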

Python 3.x – How to make loops in scrapy?

I'm scraping the Dmoz site. I want to turn the repeated code in my functions into a loop, because the loop I use is the same in every function and I have to write it again and again. The second thing I want to solve is to loop over the yield response.follow(...) calls, because if I scrape more pages I'll have to write those again and again too. Is there any way to solve these two problems? I tried several times but failed.

        # save and call another page
        yield response.follow(self.about_page, self.parse_about, meta={'items': items})
        yield response.follow(self.editor, self.parse_editor, meta={'items': items})

    def parse_about(self, response):
        # do your stuff on the second page
        items = response.meta['items']
        names = {'name1': 'Headings',
                 'name2': 'paragraphs',
                 'name3': '3 projects',
                 'name4': 'About Dmoz',
                 'name5': 'languages',
                 'name6': 'You can make a difference',
                 'name7': 'additional information'
                 }

        finder = {'find1': 'h2::text, #mainContent h1::text',
                  'find2': 'p::text',
                  'find3': 'li~li+li b a::text, li:nth-child(1) b a::text',
                  'find4': '.nav ul a::text, li:nth-child(2) b a::text',
                  'find5': '.nav~.nav a::text',
                  'find6': 'dd::text, #about-submit::text',
                  'find7': 'li::text, #about-more-info a::text'
                  }

        for name, find in zip(names.values(), finder.values()):
            items[name] = response.css(find).extract()
        yield items
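One way to collapse both repetitions (a minimal sketch; the spider name, URLs, and the page_specs / parse_section names are invented for illustration) is to drive the response.follow calls and the CSS extraction from a single dict, so one loop issues every request and one callback serves every page:

import scrapy

class DmozLoopSpider(scrapy.Spider):
    name = 'dmoz_loop_sketch'
    start_urls = ['https://dmoz-odp.org/']

    # each entry: page to follow -> the CSS selectors to apply on that page
    page_specs = {
        '/about.html': {'Headings': 'h2::text, #mainContent h1::text',
                        'paragraphs': 'p::text'},
        '/editor.html': {'Headings': 'h2::text'},
    }

    def parse(self, response):
        items = {}
        # one loop replaces the hand-written response.follow lines
        for url, selectors in self.page_specs.items():
            yield response.follow(url, self.parse_section,
                                  meta={'items': dict(items), 'selectors': selectors})

    def parse_section(self, response):
        # a single callback replaces parse_about, parse_editor, ... because the
        # bodies only differ in which selectors they apply
        items = response.meta['items']
        for name, css in response.meta['selectors'].items():
            items[name] = response.css(css).extract()
        yield items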

Python scrapy strip blank space in response

I'm trying to strip whitespace in my Scrapy script, but it says the object has no attribute 'strip'.

import scrapy

class GamesSpider(scrapy.Spider):
    name = "games"
    start_urls = [
        'myurl',
    ]

    def parse(self, response):
        for game in response.css('ol#products-list li.item'):
            yield {
                'name': game.css('h2.product-name a::text').extract_first().strip(),
                'age': game.css('.list-price ul li:nth-child(1)::text').extract_first(),
                'players': game.css('.list-price ul li:nth-child(2)::text').extract_first(),
                'duration': game.css('.list-price ul li:nth-child(3)::text').extract_first(),
                'price': game.css('.list-price ul li:nth-child(4)::text').extract_first()
            }
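The usual cause of that error is extract_first() returning None when a selector matches nothing, so there is nothing to call .strip() on. A minimal sketch of the common workaround (the selectors and field names are taken from the snippet above; treat it as an illustration, not a verified fix for this exact page):

import scrapy

class GamesSpider(scrapy.Spider):
    name = "games"
    start_urls = ['myurl']    # placeholder URL, as in the question

    def parse(self, response):
        for game in response.css('ol#products-list li.item'):
            # extract_first(default='') returns '' instead of None when the
            # selector matches nothing, so .strip() is always safe to call
            name = game.css('h2.product-name a::text').extract_first(default='').strip()

            # or: strip only when a value was actually extracted
            age = game.css('.list-price ul li:nth-child(1)::text').extract_first()
            if age is not None:
                age = age.strip()

            yield {'name': name, 'age': age}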

python – Cron for Scrapy

I am trying to run a Scrapy spider through crontab, and I cannot get it to work. I have verified that cron works fine with some Linux commands; everything runs very well.
But when I put in the command to execute the spider, it does not work. I captured the log it produces, and it says it cannot find scrapy.
I've tried creating a .sh file, and also the following crontab entry:

* * * * * cd /home/pedro/Documents/environments/basic/basic/ && scrapy crawl

The spider writes a CSV directly, as configured. It works when I launch the spider by hand with scrapy crawl, and if I run the cron command above in the console, without the asterisks, it also works.
I have consulted hundreds of places and cannot find the reason.
Can anybody help me?
Many thanks in advance
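Cron jobs run with a minimal PATH, so a scrapy executable that lives in a virtualenv or under ~/.local/bin is often not found even though it works in an interactive shell. A sketch of the two usual workarounds (the bin path and the spider name myspider are placeholders; find the real path with which scrapy):

# option 1: call scrapy by its absolute path
* * * * * cd /home/pedro/Documents/environments/basic/basic/ && /home/pedro/Documents/environments/basic/bin/scrapy crawl myspider

# option 2: extend PATH at the top of the crontab so plain 'scrapy' is found
PATH=/home/pedro/Documents/environments/basic/bin:/usr/local/bin:/usr/bin:/bin
* * * * * cd /home/pedro/Documents/environments/basic/basic/ && scrapy crawl myspider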