python – Asynchronous Web Scraper

This is my first asyncio/aiohttp web scraper. I am trying to wrap my head around Python’s asyncio/aiohttp libraries these days, and I am not sure I fully understand them yet, so I would like some constructive reviews and suggestions for improvement.

I’m scraping https://www.spoonflower.com/, which exposes some public APIs for design data and for pricing per fabric type. My goal is to get the design name, creator name and price of each design for every fabric type. The design name and creator name come from this endpoint:

https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en

and the pricing data for each fabric type comes from this endpoint:

https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_&lt;fab_type&gt;?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=&lt;designId&gt;&page_locale=en

Each page has 84 items, and there are 24 fabric types. I first get all the fabric type names and store them in a list so I can loop through it and build the pricing URL dynamically, then I extract designName and screenName from the design page, and finally I extract the price data.
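To make that URL pattern concrete, here is a minimal sketch of how a single pricing URL is put together (the fabric code and design id below are placeholder values, not data pulled from the site):

# Hypothetical illustration of the pricing URL template described above.
# FAB_TYPE and DESIGN_ID are placeholder values, not real data from the site.
FAB_TYPE = "PETAL_SIGNATURE_COTTON"   # one of the 24 fabric codes
DESIGN_ID = 1234567                   # item["designId"] from the design endpoint

pricing_url = (
    "https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/"
    f"FABRIC_{FAB_TYPE}"
    f"?quantity=1&shipping_country=PK&currency=EUR"
    f"&measurement_system=METRIC&design_id={DESIGN_ID}&page_locale=en"
)
print(pricing_url)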

Here is my code:

import asyncio
import aiohttp
import json
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict


item_endpoint = 'https://pythias.spoonflower.com/search/v1/designs?lang=en&page_offset=0&sort=bestSelling&product=Fabric&forSale=true&showMatureContent=false&page_locale=en'

def get_fabric_names():
    res = requests.get('https://www.spoonflower.com/spoonflower_fabrics')
    soup = BeautifulSoup(res.text, 'lxml')
    fabrics = [fabric.find('h2').text.strip() for fabric in soup.find_all('div', {'class': 'product_detail medium_text'})]
    fabric = ["_".join(fab.upper().replace(u"\u2122", '').split()) for fab in fabrics]
    for index in range(len(fabric)):
        if 'COTTON_LAWN_(BETA)' in fabric[index]:
            fabric[index] = 'COTTON_LAWN_APPAREL'
        elif 'COTTON_POPLIN' in fabric[index]:
            fabric[index] = 'COTTON_POPLIN_BRAVA'
        elif 'ORGANIC_COTTON_KNIT' in fabric[index]:
            fabric[index] = 'ORGANIC_COTTON_KNIT_PRIMA'
        elif 'PERFORMANCE_PIQUÉ' in fabric[index]:
            fabric[index] = 'PERFORMANCE_PIQUE'
        elif 'CYPRESS_COTTON' in fabric[index]:
            fabric[index] = 'CYPRESS_COTTON_BRAVA'
    return fabric

async def fetch_design_endpoint(session, design_url):
    async with session.get(design_url) as response:
        extracting_endpoint = await response.text()
        _json_object = json.loads(extracting_endpoint)
        return _json_object['page_results']

async def fetch_pricing_data(session, pricing_endpoint):
    async with session.get(pricing_endpoint) as response:
        data_endpoint = await response.text()
        _json_object = json.loads(data_endpoint)
        items_dict = OrderedDict()
        for item in await fetch_design_endpoint(session, item_endpoint):
            designName = item['name']
            screenName = item['user']['screenName']
            fabric_name = _json_object['data']['fabric_code']
            try:
                test_swatch_meter = _json_object['data']['pricing']['TEST_SWATCH_METER']['price']
            except KeyError:
                test_swatch_meter = 'N/A'
            try:
                fat_quarter_meter = _json_object['data']['pricing']['FAT_QUARTER_METER']['price']
            except KeyError:
                fat_quarter_meter = 'N/A'
            try:
                meter = _json_object['data']['pricing']['METER']['price']
            except KeyError:
                meter = 'N/A'

            # print(designName, screenName, fabric_name, test_swatch_meter, fat_quarter_meter, meter)

            if (designName, screenName) not in items_dict:
                items_dict[(designName, screenName)] = {}
            itemCount = len(items_dict[(designName, screenName)]) // 4
            items_dict[(designName, screenName)].update({
                'fabric_name_%02d' % itemCount: fabric_name,
                'test_swatch_meter_%02d' % itemCount: test_swatch_meter,
                'fat_quarter_meter_%02d' % itemCount: fat_quarter_meter,
                'meter_%02d' % itemCount: meter})
        return items_dict


async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        fabric_type = get_fabric_names()
        design_page = await fetch_design_endpoint(session, item_endpoint)
        for item in design_page:
            for fab_type in fabric_type[0:-3]:
                pricing_url = 'https://api-gateway.spoonflower.com/alpenrose/pricing/fabrics/FABRIC_' + fab_type + '?quantity=1&shipping_country=PK&currency=EUR&measurement_system=METRIC&design_id=' + str(item['designId']) + '&page_locale=en'
                print(pricing_url)

                tasks.append(asyncio.create_task(
                    fetch_pricing_data(session, pricing_url)
                ))

        content = await asyncio.gather(*tasks)
        return content
results = asyncio.run(main())
print(results)

Any ideas and suggestions to make this scraper more Pythonic and efficient are welcome.

Simple Python Recursive Web Scraper

I tried to make a simple recursive web scraper using Python. My idea was to grab all the links, titles and tag names.

Website: https://lifebridgecapital.com/podcast/

Course of Action:

Grab all the tag links from the website:

tag_words_links(url) --> ['https://lifebridgecapital.com/tag/multifamily/', ...]

My script then fetches all the links, tag names and titles from the pages that tag_words_links returned. Some of these pages have pagination and some don’t, so I use an if condition to catch the pages that contain class="page-numbers".
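In isolation, that pagination check looks roughly like this — a condensed, hypothetical sketch of the idea; the tag URL is just an example page from the site:

from requests_html import HTMLSession

# Condensed sketch of the pagination check described above.
# The tag URL is only an example; any tag page from the site would do.
session = HTMLSession()
r = session.get('https://lifebridgecapital.com/tag/multifamily/')

page_urls = []
if 'class="page-numbers"' in r.text:
    # collect the extra page URLs so each one can be parsed like the first page
    page_urls = sorted({a.attrs['href'] for a in r.html.find('a.page-numbers')})
print(page_urls)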

Looking at the full code below, there is clearly a lot of repetition going on, so I would like to make it DRY. Any suggestions and ideas are much appreciated.

Here is the code:

from requests_html import HTMLSession
import csv
import time


def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append(link.find('a', first=True).attrs['href'])

    return links

def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    article_links = _request.html.find('h3 a')
    tag_names = [tag.text for tag in _request.html.find('div.infinite-page-caption')]
    articles = [article.find('a', first=True).attrs['href'] for article in article_links]
    titles = [title.text for title in _request.html.find('h3.gdlr-core-blog-title')]
    if 'class="page-numbers"' in _request.text:
        next_page = _request.html.find('a.page-numbers')
        url = {url.find('a', first=True).attrs['href'] for url in next_page}
        for page in url:
            next_page_request = _session.get(page)
            article_links = next_page_request.html.find('h3 a')
            for article in article_links:
                articles.append(article.find('a', first=True).attrs['href'])
            for title in article_links:
                titles.append([title.text for title in title.find('h3.gdlr-core-blog-title')])
            for tags in article_links:
                tag_names.append([tag.text for tag in tags.find('div.infinite-page-caption')])

    scraped_data = {
        'Title': titles,
        'Tag_Name': tag_names,
        'Link': articles
    }


    return scraped_data


if __name__ == '__main__':
    data = []
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    for link in links:
        data.append(parse_tag_links(link))
        time.sleep(2)

    with open('life-bridge-capital-tags.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=data[0].keys())
        writer.writeheader()
        for row in data:
            writer.writerow(row)
    

is there a free email scraper?


python 3.x – Refactor Web Scraper

I wrote a simple Zoopla real estate scraper just to practice what I have learned so far in Python, requests, BeautifulSoup and web scraping fundamentals in general. Looking at my code, I feel there must be a better and more elegant way to write it, but as a beginner I don’t know it yet. So I would like experienced people here to review my code and suggest enhancements.

import requests
import json
import csv
import time
from bs4 import BeautifulSoup as bs

class ZooplaScraper:

    results = []

    def fetch(self, url):
        print(f'HTTP GET request to URL: {url}', end='')
        res = requests.get(url)
        print(f' | Status code: {res.status_code}')

        return res

    def parse(self, html):
        content = bs(html, 'html.parser')
        content_array = content.select('script[id="__NEXT_DATA__"]')
        content_dict = json.loads(content_array[0].string)
        content_details = content_dict['props']['initialProps']['pageProps']['regularListingsFormatted']

        for listing in content_details:
            self.results.append({
                'listing_id': listing['listingId'],
                'name_title': listing['title'],
                'names': listing['branch']['name'],
                'addresses': listing['address'],
                'agent': 'https://zoopla.co.uk' + listing['branch']['branchDetailsUri'],
                'phone_no': listing['branch']['phone'],
                'picture': listing['image']['src'],
                'prices': listing['price'],
                'Listed_on': listing['publishedOn'],
                'listing_detail_link': 'https://zoopla.co.uk' + listing['listingUris']['detail']
            })

    def to_csv(self):
        with open('zoopla.csv', 'w') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.results[0].keys())
            writer.writeheader()

            for row in self.results:
                writer.writerow(row)

            print('Stored results to "zoopla.csv"')

    def run(self):
        for page in range(1, 5):
            url = 'https://www.zoopla.co.uk/for-sale/property/london/?page_size=25&q=London&radius=0&results_sort=newest_listings&pn='
            url += str(page)
            res = self.fetch(url)
            self.parse(res.text)
            time.sleep(2)
        self.to_csv()


if __name__ == '__main__':
    scraper = ZooplaScraper()
    scraper.run()

Basically, most of what this scraper does is JSON parsing. The problem was that all the data on the website comes from JavaScript inside a script tag, so I have to select that tag, pass its contents to json.loads() and then walk the resulting dict to find the right key–value pairs.
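In isolation, that extraction step looks roughly like this; it is a minimal sketch that reuses the selector and key path from the code above, and the HTML snippet is a made-up stand-in rather than a real Zoopla page:

import json
from bs4 import BeautifulSoup

# Made-up stand-in for a real page; only the embedded script tag matters here.
html = '''
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialProps": {"pageProps": {"regularListingsFormatted": [
    {"listingId": 1, "title": "Example listing", "price": "£500,000"}
]}}}}
</script>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
script = soup.select_one('script[id="__NEXT_DATA__"]')   # grab the embedded JSON blob
data = json.loads(script.string)
listings = data['props']['initialProps']['pageProps']['regularListingsFormatted']
print(listings[0]['title'], listings[0]['price'])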

Facebook Phone Number Scraper | Proxies123.com

  1. BobH
    New Member


    Joined:
    Dec 6, 2014
    Messages:
    10
    Likes Received:
    1
    Trophy Points:
    0
    Gender:
    Male
    Location:
    USA

    Facebook pay per click kicks some serious butt over Google AdWords; FB has their head screwed on big time.
    I’m not sure if everyone knows this, but you can target people by phone number. Does anyone know of a Facebook phone number scraper, or a phone number scraper in general?
    It’s been said the match rate is as high as 70% in some cases, since phone numbers are less likely to change than emails, which have about a 40% match rate.
    Thanks.

     

  2. chaphiop
    New Member


    Joined:
    Dec 10, 2014
    Messages:
    11
    Likes Received:
    0
    Trophy Points:
    0

    You can buy a list of high-quality, proven US buyers here -> http://lists.nextmark.com/
    It’s still little known, so you will have a competitive advantage with this.

     

  3. Alom1988
    VIP


    Joined:
    Dec 10, 2014
    Messages:
    26
    Likes Received:
    1
    Trophy Points:
    0
    Gender:
    Male
    Location:
    London

    Wow good point BobH — I have an email scraper, but not a phone number scraper —
    someone needs to make one!!

     

  4. king
    New Member


    Joined:
    Dec 14, 2014
    Messages:
    9
    Likes Received:
    1
    Trophy Points:
    0

    A phone number scraper hadn’t even crossed my mind. That’s why it’s so good to be a member of a site like this; others always seem to have some bright ideas.

     

  5. tops
    New Member


    Joined:
    Dec 16, 2014
    Messages:
    2
    Likes Received:
    0
    Trophy Points:
    0

    Thanks for the FB share

     

  6. nelina
    New Member


    Joined:
    Dec 12, 2014
    Messages:
    10
    Likes Received:
    0
    Trophy Points:
    0
    Gender:
    Female
    Location:
    argentina

    I never thought about the phone scrape option in Facebook, thanks for the idea.
    I’ll search for one, and when I find it, I’ll post it here :p

     

  7. Alom1988
    VIP


    Joined:
    Dec 10, 2014
    Messages:
    26
    Likes Received:
    1
    Trophy Points:
    0
    Gender:
    Male
    Location:
    London

    Awesome! Do let us know :)
    And I will post it here if I find one as well

     

  8. ishboo
    New Member


    Joined:
    Dec 18, 2014
    Messages:
    20
    Likes Received:
    0
    Trophy Points:
    0
    Gender:
    Female

    I don’t think you can do this on FB. It’s against their terms of service or something like that.

     

  9. addres86
    VIP


    Joined:
    Dec 12, 2014
    Messages:
    2
    Likes Received:
    0
    Trophy Points:
    0

    Can I ask where I could find a good reputable email scraper for Facebook please?

     

  10. tantricmaster
    VIP


    Joined:
    Apr 16, 2015
    Messages:
    526
    Likes Received:
    0
    Trophy Points:
    0

    I thought it was against their ToS. Interesting thread.

     

  11. masteryoung
    New Member


    Joined:
    Apr 20, 2015
    Messages:
    10
    Likes Received:
    0
    Trophy Points:
    0
    Gender:
    Male

    Get UIDs and convert them to emails with this free tool:
    http://goo.gl/qol10y

     

  12. cuso
    VIP


    Joined:
    May 15, 2015
    Messages:
    630
    Likes Received:
    1
    Trophy Points:
    18

    The phone scraper is a really good idea, and yes, IMHO Facebook trumps Google AdWords for us because you can control more variables.

     

  13. offlineseller
    New Member


    Joined:
    Jun 3, 2015
    Messages:
    10
    Likes Received:
    0
    Trophy Points:
    0

    Yeah, it is against the TOS, but that doesn’t stop people from doing it. FB ads are evolving all the time, and they even have a new feeds program similar to Google Shopping.

     

  14. wesdg1978
    New Member


    Joined:
    Jun 23, 2015
    Messages:
    3
    Likes Received:
    0
    Trophy Points:
    0

    Did anyone ever find a phone scraper?

     

  15. Rasfq
    VIP


    Joined:
    Jul 18, 2015
    Messages:
    578
    Likes Received:
    1
    Trophy Points:
    0

    This option is not working any more, but phone numbers for targeting audiences are still cool, for now. FB is going through rapid changes, thanks to BH.

     

  16. cloud
    New Member


    Joined:
    Jul 19, 2015
    Messages:
    10
    Likes Received:
    0
    Trophy Points:
    0

    Phone scraper is as good as email scraper… Not.

     

  17. kingbin
    New Member


    Joined:
    Apr 11, 2017
    Messages:
    10
    Likes Received:
    0
    Trophy Points:
    0

    Nowadays, scraping phone numbers and emails of Facebook users is too hard.

     


javascript – How to turn nodeJS scraper results into a CSV file?

I’m using the Node.js scraper below, which scrapes the Google Play store. However, I’d like the results to end up in a CSV file; currently they just print to the console:

https://github.com/facundoolano/google-play-scraper

Secondly, I would like to somehow combine the “search” and “app details” functions so that I could search for a term and get back the app title, developer name, app URL, and developer email.

Search function:

var gplay = require('google-play-scraper');

gplay.search({
    term: "panda",
    num: 2
  }).then(console.log, console.log);

App Details Function:

var gplay = require('google-play-scraper');

gplay.app({appId: 'com.google.android.apps.translate'})
  .then(console.log, console.log);

ScrapeBox Email Scraper


Hello!

Please tell me:
    – how to limit the email search to a single section (for example, so that I collect emails only from the “Jobs” section);
    – how to get the emails from that section (“Jobs”) together with the URLs where they were found (for example, as an “Email | URL” table or an “email, url” list).

As a result, I need a list or table containing only the emails from the “Jobs” section and the URL where each email was found.


scraper sites – Parsing the new CIA World Factbook

I have a Python script from last year that I used to get basic info on countries from the CIA World Factbook. I relied on the printable HTML version of each country page, but since the recent site reboot I can’t find any similar source. Has anyone else had this problem and/or know of any solutions?

TELEGRAM SCRAPER & ADDER BUNDLE | Proxies123.com

THE PRICE FOR THE ABOVE TELEGRAM SCRAPER & ADDER BUNDLE IS ONLY $100.
ONLY BITCOIN PAYMENTS ARE ACCEPTED.

View Description: https://ibb.co/9VKcypz

Click Here To Buy: https://shoppy.gg/product/1jfYleZ

FAQ

Q: In which language are these scripts written?
A: These scripts are written in the Python programming language.

Q: Do I need to be a Windows user or a Mac user to use them?
A: You can run these scripts on any OS (macOS, Windows, Linux). This is one of the biggest advantages: you are not limited to one OS, which gives you more flexibility.

Q: Do I need to be a programmer in order to use them?
A: No, you don’t need to be a programmer to use them. Everything is already written in the code, and you will get a detailed setup guide that will help you get started even if you only have basic computer knowledge.

Q: Is there any limit on adding members to a group/channel?
A: There is no limit for groups, but for channels Telegram has a restriction of 200 members.

Q: How long do I have access to the scripts?
A: How does lifetime access sound? You will have unlimited and lifetime access.

Q: What if I still face a problem even after reading your setup guide?
A: Don’t worry, you will get my contact details inside the setup guide, so you can contact me whenever you want and I will personally help you set up everything via TeamViewer.

Q: Will I get a refund after I have received the product?
A: No, we don’t offer a refund once you have received the product. All sales are final.

For inquiries: Skype: peterkim08

 

eu – Ryanair using “unauthorised screen scraper” argument to refuse refund

I had a return flight with Ryanair (booked on 29 Oct) that was moved from 1 Dec to 3 Dec and subsequently cancelled. I requested a refund after the cancellation and received an e-mail a few days later confirming that I could get my cash refund through their website.

When I go on the website and provide my e-mail and booking reference, they reject my application and provide the following reason:

“This booking has been identified as one purchased through an unauthorised screen scraper. In order to request a cash refund, the customer must complete our customer verification process. Please click here to complete the form.”

The problem is, the above is a complete fabrication – I have never booked any flight, ever, via anything but the corresponding airline websites, and I can see in my history that I went through the entire booking process with Ryanair.com!

To move forward they require me to fill out a demanding ‘customer verification form’, which asks me to confirm which online travel agency (OTA) I used; but as noted, I never used one. I guess they hope I try to fill it out anyway, allowing them to invalidate my claim by accusing me of falsifying the verification form. From my viewpoint it appears they are actively trying to defraud me, or hoping I do something dumb.

I paid via Mastercard. It is impossible to get through to their customer service. What do you suggest? I don’t want a voucher – I only used Ryanair because, due to COVID, they were the only direct option from my local airport.