scraping Facebook

Hi, I was wondering if there is a way of using ScrapeBox to scrape Facebook users by interest. Thank you. Jackie

download – Automatic downloading through web scraping

(TLDR at the bottom) My goal was to automatically download English subtitles from the website ‘https://www.opensubtitles.org’ for all episodes of GoT (I know it’s not even trendy anymore, but I started watching it a week ago, no spoilers pls).

I have been using Python, with urllib.request to retrieve HTML pages and BeautifulSoup (bs4), imported as bs, to parse them.

So I got the link to the page where all seasons and episodes are listed: “https://www.opensubtitles.org/en/ssearch/sublanguageid-eng/idmovie-63130”. (note sublanguageid-eng)

Now, since the website stores multiple English subtitles for each episode, clicking on an episode takes you to the page listing all of its subtitles. Through inspection I found that the <a></a> elements pointing to these pages contain href="/en/search/sublanguageid-eng/imdbid-…" on opensubtitles.org, so I could retrieve all 73 of them automatically with bs’s a.get('href').
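A minimal sketch of that collection step (the substring filter and the browser-style User-Agent are assumptions for illustration, not necessarily what the original code does):

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

series_url = "https://www.opensubtitles.org/en/ssearch/sublanguageid-eng/idmovie-63130"
req = Request(series_url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")

episode_links = []
for a in soup.find_all("a", href=True):
    # assumes the hrefs are relative, e.g. "/en/search/sublanguageid-eng/imdbid-..."
    if "/en/search/sublanguageid-eng/imdbid-" in a["href"]:
        episode_links.append("https://www.opensubtitles.org" + a["href"])

print(len(episode_links))  # should be 73 per the description above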

Here is the issue: within each episode’s subtitles page, clicking on the entry in the download column (e.g. 3798x srt), or opening it in a new tab, starts the download from the browser, while copying the link address and pasting it into the browser does not download anything. (The copied link looks like this: https://www.opensubtitles.org/en/subtitleserve/sub/8219491 and just yields the episode’s subtitles page itself.) Therefore, urllib.request with that URL simply returns an HTML page and does not yield the desired download.

Does anyone know how this is happening and how to work around it to automatically download?

TLDR: The content downloads on click, but not via the link the click redirects to. This hinders automatic download through scraping, as urllib.request yields the HTML page linked to instead of the content to be downloaded. Any explanation or workaround welcome.
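One workaround that is often suggested (a sketch only, assuming the server gates the download on browser-like headers such as User-Agent/Referer; it may just as well require cookies or throttle automated clients):

import requests

download_url = "https://www.opensubtitles.org/en/subtitleserve/sub/8219491"
headers = {
    "User-Agent": "Mozilla/5.0",
    # hypothetical Referer: the episode's subtitles page the link was copied from
    "Referer": "https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-...",
}

resp = requests.get(download_url, headers=headers, allow_redirects=True)
if "html" not in resp.headers.get("Content-Type", ""):
    # the served file is typically a zip archive containing the .srt
    with open("subtitle.zip", "wb") as f:
        f.write(resp.content)
else:
    print("Still got an HTML page:", resp.status_code)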

Web Scraping Dynamically Generated Content Python

This is a small web scraping project I made in 2 hours that targets the website remote.co. I am looking for improvements to my code. I know about the inconsistency between the WebDriverWait and time.sleep() waits, but when I used WebDriverWait to wait until the load_more button was clickable and ran the program, Selenium crashed my WebDriver window and continuously spammed my terminal with 20-30 lines of seemingly useless text.

import scrapy   
from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep


class ScrapeRemote(scrapy.Spider):
    name = 'jobs'
    job_title = input('Enter your desired position: ').replace(' ', '+')
    start_urls = [f'https://remote.co/remote-jobs/search/?search_keywords={job_title}']

    def __init__(self):
        self.driver = webdriver.Chrome(r'C:\Users\leagu\chromedriver.exe')

    def parse(self, response):
        self.driver.get(response.url)

        try:
            load_more = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.XPATH, '/html/body/main/div[2]/div/div[1]/div[3]/div/div/a'))
                )
        except TimeoutException:
            self.log("Timeout - Couldn't load the page!")

        while True:
            try:
                sleep(1.5)
                load_more = self.driver.find_element_by_css_selector('a.load_more_jobs')
                load_more.click()
            except (ElementNotInteractableException, ElementClickInterceptedException):
                try:
                    close_button = WebDriverWait(self.driver, 6).until(
                        EC.element_to_be_clickable((By.CSS_SELECTOR, '#om-oqulaezshgjig4mgnmcn-optin > div > button'))
                        )
                    close_button.click()
                except TimeoutException:
                    self.log('Reached Bottom Of The Page!')
                    break

        selector = scrapy.selector.Selector(text=self.driver.page_source)
        listings = selector.css('li.job_listing').getall()

        for listing in listings:
            selector = scrapy.selector.Selector(text=listing)
            position = selector.css('div.position h3::text').get()
            company = selector.css('div.company strong::text').get()
            more_information = selector.css('a::attr(href)').get()
            yield {
                'position': position,
                'company': company,
                'more_information': more_information
            }

        self.driver.close()
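On the WebDriverWait crash mentioned above: a rough sketch of how the click loop could use an explicit wait instead of sleep(), scrolling the button into view so the click is not intercepted (the selector is taken from the posted code; everything else is an assumption):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def load_all_jobs(driver):
    while True:
        try:
            load_more = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.load_more_jobs'))
            )
            # scroll the button to the middle of the viewport before clicking
            driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", load_more)
            load_more.click()
        except TimeoutException:
            # no clickable "load more" button left: all listings are loaded
            break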

Facebook debugger reports “Curl error: 56 (RECV_ERROR)” when scraping content from my site

Starting from today (16 May 2020), none of today’s articles on my website www.indozone.id can be scraped (e.g. https://www.indozone.id/health/gms7nka/diet-aktor-bollywood-hrithik-roshan-berpuasa-selama-23-jam).

I did not change any of the server configuration, nor any firewall settings on my server. But “Curl error: 56 (RECV_ERROR)” keeps popping up when I use the FB debugger.

And yes, it worked fine yesterday.
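A quick way to narrow this down (a sketch, not a definitive diagnosis): request the article with a normal browser User-Agent and with Facebook's crawler User-Agent and compare. If only the crawler request is reset or times out, something between Facebook and the origin (firewall, CDN, or rate limiting) is likely dropping crawler traffic.

import requests

url = "https://www.indozone.id/health/gms7nka/diet-aktor-bollywood-hrithik-roshan-berpuasa-selama-23-jam"
agents = (
    "Mozilla/5.0",
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
)
for ua in agents:
    try:
        r = requests.get(url, headers={"User-Agent": ua}, timeout=15)
        print(ua.split("/")[0], r.status_code, len(r.content))
    except requests.RequestException as exc:
        print(ua.split("/")[0], "failed:", exc)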

import – Google nGram data scraping

In Chrome, after running the ngram query, open Developer Tools, find the source file whose name starts with "graph?content=", search for the string "var data" (search is under the triple dots in the upper-right corner), and you will find the time series in "var data = ...".
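A minimal sketch of doing the same thing without the browser, assuming the graph page still inlines the series as "var data = [...]" as described above (the query parameters are illustrative, not a documented API):

import json
import re
import requests

url = "https://books.google.com/ngrams/graph"
params = {"content": "Albert Einstein", "year_start": 1900, "year_end": 2019, "smoothing": 3}
html = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"}).text

match = re.search(r"var data = (\[.*?\]);", html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    for series in data:
        print(series.get("ngram"), len(series.get("timeseries", [])))
else:
    print("Could not find the 'var data = ...' assignment in the page.")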

python – web scraping links

I am working on extracting links from the website of a Christmas tree farm directory. First, I used this tutorial method to get all the links. I noticed that the links I wanted did not start with the proper hypertext transfer protocol, so I created a variable to concatenate. Now I'm trying to create an if statement that takes each link and looks for two characters followed by "xmastrees.php". If that's true, my variable is concatenated to the front. If the link does not contain that specific text, it is removed. For example, NYxmastrees.php becomes http://www.pickyourownchristmastree.org/NYxmastrees.php, and ../disclaimer.htm is removed. I have tried several ways but can't seem to find the right one.

This is what I currently have, and I keep encountering a syntax error at del. I commented out that line and got another error saying that my string object doesn't have the 're' attribute. This confuses me because I thought I could use regular expressions with strings?

source = requests.get('http://www.pickyourownchristmastree.org/').text
soup = BeautifulSoup(source, 'lxml')
concatenate = 'http://www.pickyourownchristmastree.org/'

find_state_group = soup.find('div', class_ = 'alert')
for link in find_state_group.find_all('a', href=True):
    if link['href'].re.search('^.B.$xmastrees'):
        states = concatenate + link
    else del link['href']
    print(link['href'])

Error with else del link['href']:

    else del link['href']
           ^
SyntaxError: invalid syntax

Error without else del link['href']:

    if link['href'].re.search('^.B.$xmastrees'):
AttributeError: 'str' object has no attribute 're'
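For what it's worth, a sketch of one way the filtering described above could be written, using a plain regex on the href string and collecting matches into a new list instead of deleting tags (the two-letter pattern is an assumption based on the NYxmastrees.php example):

import re
import requests
from bs4 import BeautifulSoup

base = 'http://www.pickyourownchristmastree.org/'
source = requests.get(base).text
soup = BeautifulSoup(source, 'lxml')

state_links = []
find_state_group = soup.find('div', class_='alert')
for link in find_state_group.find_all('a', href=True):
    href = link['href']
    # keep only hrefs like "NYxmastrees.php": two letters + "xmastrees.php"
    if re.match(r'^[A-Za-z]{2}xmastrees\.php$', href):
        state_links.append(base + href)

print(state_links)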

First Python web scraping project with BeautifulSoup

This is my first Web Scraping project, where I am retrieving current stock information from https://www.tradingview.com/markets/stocks-usa/market-movers-large-cap. This program works as expected, but I certainly think someone with more experience with the language and web scraping could improve it.

#Imports
from bs4 import BeautifulSoup
from colorama import Fore as F
from time import sleep
import requests
import webbrowser
import pandas
import functools
import subprocess
from os import system
import geoip2.database
#Uses Maxmind GeoLite2-City Database for IP Location

#Compatible with most *nix systems only. Please leave feedback if compatibility with Windows is wanted.
#Should I make a function to check internet connection or just let an error arise?
#Beginning of program messages
print("""
 33(32m /$$$$$$ 
 /$$__  $$
| $$  __/
|  $$$$$$ 33(34m_____             ______         
 33(32m____  $$33(34m__  /________________  /_________
 33(32m/$$   $$33(34m_  __/  __ _  __ _  //_/_  ___/
33(32m|  $$$$$$/33(34m/ /_ / /_/ /  / / /  ,<  _(__  )
 33(32m______/ 33(34m__/ ____//_/ /_//_/|_| /____/

    """)
print(F.BLUE + "(!)Enlarge window as much as possible for easier observations" + F.RESET)
sleep(2)

#subprocess.run("clear")
#Variables
stock_chart = {"Value": False, "Data": False}
#Functions
def internet_test():
    proc = subprocess.Popen("ping google.com",
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            shell=True,
                            universal_newlines=True)
    if proc.returncode == 0:
        return True
    return False
def display(df):
    formatters = {}
    for li in list(df.columns):
        max = df[li].str.len().max()
        form = "{{:<{}s}}".format(max)
        formatters[li] = functools.partial(str.format, form)
    print(F.LIGHTGREEN_EX + df.to_string(formatters=formatters,
                                         index=False,
                                         justify="left"))


def search_df(search_str: str, df: pandas.DataFrame) -> pandas.DataFrame:
    results = pandas.concat((df[df["Symbol"].str.contains(search_str.upper())], df[df["Company"].str.contains(search_str, case=False)]))
    return results



#Function for fetching stocks, returns pandas.DataFrame object containing stock info
#Stocks pulled from https://www.tradingview.com/markets/stocks-usa/market-movers-large-cap
def stocks():
    #Set pandas options
    pandas.set_option("display.max_rows", 1000)
    pandas.set_option("display.max_columns", 1000)
    pandas.set_option("display.width", 1000)

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
               " Chrome/80.0.3987.149 Safari/537.36"}

    #Make Request to site
    site = requests.get("https://www.tradingview.com/markets/stocks-usa/market-movers-large-cap", headers)

    #BeautifulSoup Object
    soup = BeautifulSoup(site.content, "html.parser")

    #Process to get the list of stocks !!!SUGGESTIONS FOR EFFICIENCY WELCOME!!!
    html = list(soup.children)[3]
    body = list(html.children)[3]
    div = list(body.children)[5]
    div2 = list(div.children)[9]
    div3 = list(div2.children)[1]
    div4 = list(div3.children)[3]
    div5 = list(div4.children)[1]
    div6 = list(div5.children)[3]
    div7 = list(div6.children)[3]
    div8 = list(div7.children)[1]
    table = list(div8.children)[1]
    tbody = list(table.children)[3]
    stocks = tbody.find_all("tr")
    chart = {"Symbol": [], "Company": [], "Price Per Share": [], "Change(%)": [], "Change(Points)": []}

    #Find each component of stock and put it into a chart
    for stock in stocks:
        symbol = list(stock.find("td").find("div").find("div"))[1].get_text()
        name = stock.find("td").find("div").find("div").find("span").get_text().strip()
        last_price = "$" + stock.find_all("td")[1].get_text()
        change_percent = stock.find_all("td")[2].get_text()
        change_points = stock.find_all("td")[3].get_text()
        chart["Symbol"].append(symbol)
        chart["Company"].append(name)
        chart["Price Per Share"].append(last_price)
        chart["Change(%)"].append(change_percent)
        chart["Change(Points)"].append(change_points)

    panda_chart = pandas.DataFrame(chart)
    return panda_chart


def ip_info(ip):
    print(F.YELLOW + "(!)IP information is approximate.  Please use IPv6 for more accurate results.")
    try:
        reader = geoip2.database.Reader("GeoLite2-City.mmdb")
        print(F.GREEN + "(√)Database Loaded")
    except FileNotFoundError:
        print(F.RED + "(!)Could not open database; Exiting application")
        exit(1)
    #subprocess.run("clear")
    response = reader.city(ip)
    print(F.LIGHTBLUE_EX + """
    ISO Code: {iso}
    Country Name: {country}
    State: {state}
    City: {city}
    Postal Code: {post}
    Latitude: {lat}
    Longitude: {long}
    Network: {net}""".format(iso=response.country.iso_code, country=response.country.name,
                             state=response.subdivisions.most_specific.name, city=response.city.name,
                             post=response.postal.code, lat=response.location.latitude, long=response.location.longitude,
                             net=response.traits.network))
    print("nnEnter "q" to go back to menu or "op" to open predicted location in Google Maps.", end="nnnnnn")
    while True:
        inp = input()
        if inp == "q":
            break
        elif inp == "op":
            webbrowser.open(f"https://www.google.com/maps/search/{response.location.latitude},{response.location.longitude}", new=0)
            break

#Main
def main():
    try:
        global stock_chart
        internet = internet_test()
        print("""33(33mOptions:

          33(94m(1) - Display a chart of popular stocks
          (2) - Search a chart of popular stocks
          (3) - Locate an Internet Protocol (IP) Address
        """)
        while True:
            choice = input(F.YELLOW + "Enter Option Number(1-3)> " + F.WHITE)
            if choice in ("1", "2", "3"):
                break
            print(F.RED + "(!)Option invalid")
        if choice in ("1", "2"):
            if not stock_chart("Value"):
                stock_chart("Value") = True
                stock_chart("Data") = stocks()
            if choice == "1":
                display(stock_chart("Data"))
            else:
                search = input(F.LIGHTBLUE_EX + "Enter name to search for> ")
                display(search_df(search, stock_chart("Data")))
                sleep(1)
        else:
            ip_addr = input(F.GREEN + "Enter an Internet Protocol (IP) Address(IPv4 or IPv6)> ")
            try:
                ip_info(ip_addr)
            except ValueError:
                print(F.RED + "IP Address invalid")
                sleep(1)
        main()
    except KeyboardInterrupt:
        print(F.RED + "(!)Exiting..." + F.RESET)



if __name__ == "__main__":
    main()

All suggestions are welcome!
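One concrete suggestion for the children-index chain in stocks(): select the table rows with CSS selectors instead of walking .children by position. A rough sketch (the column order is an assumption about the current page layout and may break if TradingView changes its markup):

import requests
from bs4 import BeautifulSoup

URL = "https://www.tradingview.com/markets/stocks-usa/market-movers-large-cap"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_rows():
    soup = BeautifulSoup(requests.get(URL, headers=HEADERS).content, "html.parser")
    rows = []
    for tr in soup.select("table tbody tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if len(cells) >= 4:
            rows.append({
                "Symbol/Company": cells[0],
                "Price Per Share": "$" + cells[1],
                "Change(%)": cells[2],
                "Change(Points)": cells[3],
            })
    return rows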

I'll scrape emails for your business for $2

I will scrape emails for your business

I will provide emails for your business.
Emails are collected from top-ranked websites.
I guarantee high accuracy for the given job.
The job will be completed in the specified time.
Emails can be collected, according to your wishes, from the main social media websites.
You will not face any loss if you give me the job.

PLEASE, ORDER NOW …….


web scraping: i need to extract google data value with BeautifulSoup in python

When you go to this link:
"https://www.google.com/search?q=usd+to+clp&rlz=1C1CHBD_esVE816VE816&oq=usd+to+clp&aqs=chrome..69i57j35i39j0l2j69i60l4.1479j0j7&sourceid=chrome&ie=UTF-8"

You can see the exchange rate between USD and CLP (the Chilean peso).

I want to extract that number …

I've already tried doing the following … (and much more)

import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com/search?q=usd+to+clp&rlz=1C1CHBD_esVE816VE816&oq=usd+to+clp&aqs=chrome..69i57j35i39j0l2j69i60l4.1479j0j7&sourceid=chrome&ie=UTF-8")

src = result.content
soup = BeautifulSoup(src, features="html.parser")
value = soup.find('span', class_="DFlfde.SwHCTb").get_text()
print(value)

It would be great if someone could help me, thank you very much.

PS: Yes, I am a newbie, sorry: P
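A sketch of how this is often made to work (assumptions: Google may serve different markup without a browser-like User-Agent, and in soup.find the dot in "DFlfde.SwHCTb" is treated as part of one literal class name, so a CSS selector is used instead; the class names come from the question and can change at any time):

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=usd+to+clp"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
span = soup.select_one("span.DFlfde.SwHCTb")
print(span.get_text() if span else "Exchange-rate span not found in this response.")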

Data entry and scraping, Excel and spreadsheet, PDF to Word, Excel, any typing job for $ 5

Hello, have a nice day. Please check my work record when considering your project; I am confident you will be satisfied with it. Thanks and best regards.

by: shajuctg
Created: –
Category: Data entry
Views: 233

