What's inside Web Scraping?

Ashim Bajracharya

Software Design Analyst

So, what actually is it?

 

Web scraping, also known as web harvesting, is a technique for extracting (or harvesting, or mining) large amounts of data from different websites and saving it to a local location in a specified format.

It is carried out by a piece of code that sends requests to a specific website. The response is parsed as an HTML document, and the scraper then searches that document for the data we need. The data is then converted to the specified format. The extracted data can be documents, product items, images, videos, text, contact information, emails and phone numbers.
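
The steps above (request, parse, search, convert) can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the class names "name" and "price", the sample markup and the CSV output format are assumptions made for illustration, and a real page will have its own structure.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being read, if any
        self.rows = []       # one dict per product found

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self.rows.append({"name": data.strip(), "price": ""})
        elif self._field == "price" and self.rows:
            self.rows[-1]["price"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def fetch(url):
    """Step 1: send the request and return the HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html):
    """Steps 2 and 3: parse the HTML and pick out the data we need."""
    parser = ProductParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Step 4: convert the extracted rows to the chosen format (CSV here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Kettle</span> <span class="price">24.50</span></li>
  <li><span class="name">Toaster</span> <span class="price">31.00</span></li>
</ul>
"""

rows = extract(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

For a live page, the sample string would be replaced by something like extract(fetch(url)), where the URL is whichever site we are allowed to scrape.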

Amazing Applications of Web Scraping

Web scraping has several effective applications. Some of them are listed below:

  • Weather reporting and analysis.

  • Acquiring auction details.

  • Extracting and mining news from different websites.

  • Obtaining market prices and analyzing them.

  • Extracting contact information of various personalities.

  • Understanding customer experience and feedback by extracting reviews from eCommerce portals and other public forums.

  • Tracking prices across multiple markets.

  • Extracting data from social media sites that allow crawling, to gauge consumer trends and the way consumers react to campaigns.

How can we do web scraping?

There are various technical ways to scrape data from websites. Some of them are listed below:

  • Point and Click Interface

  • Auto Pattern Detection

  • Export scraped data 

  • Export data to file/database

  • Scrape from Multiple Pages

  • Keyword based Scraping

  • Proxy Servers / VPN

  • Regular Expressions

  • Automate browser interaction
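
Two of the items above, regular expressions and keyword based scraping, are easy to illustrate on plain text. The sketch below assumes the page text has already been downloaded. The patterns are deliberately simple assumptions that will not match every email or phone format, and the sample text is made up.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d -]{7,}\d")


def extract_contacts(text):
    """Regular expressions: pull every email and phone-like string out of the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


def lines_with_keyword(text, keyword):
    """Keyword based scraping: keep only the lines that mention the keyword."""
    keyword = keyword.lower()
    return [line.strip() for line in text.splitlines() if keyword in line.lower()]


# Hypothetical page text.
SAMPLE_TEXT = """
Sales: sales@example.com, phone +977 1 5551234
Support: support@example.com
Opening hours: 9 to 5
"""

contacts = extract_contacts(SAMPLE_TEXT)
support_lines = lines_with_keyword(SAMPLE_TEXT, "support")
print(contacts)
print(support_lines)
```

Regular expressions work well for small, predictable patterns like these. For anything that depends on the structure of the page, a parser is usually the safer choice.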

Methodology

There are different methods of web scraping. Let's see some of them below:

  • Manual Scraping

  • Automated Scraping Techniques

    • HTML Parsing

    • DOM Parsing

    • Vertical Aggregation

    • XPath Method

    • Honeypots

    • Text Pattern Matching
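
As a rough illustration of DOM parsing combined with the XPath method, the sketch below builds a tree from a small document and queries it with path expressions. It uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so the sample is written as clean XHTML. Real pages are often messier and usually call for a more lenient HTML parser. The element names and classes are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
SAMPLE_XHTML = """
<html>
  <body>
    <div id="listing">
      <div class="item"><h2>Kettle</h2><span class="price">24.50</span></div>
      <div class="item"><h2>Toaster</h2><span class="price">31.00</span></div>
    </div>
  </body>
</html>
"""

# DOM parsing: turn the text into a tree of elements.
root = ET.fromstring(SAMPLE_XHTML.strip())

# XPath method: select every <div class="item"> anywhere in the tree,
# then read the name and price relative to each matched node.
items = []
for node in root.findall(".//div[@class='item']"):
    name = node.findtext("h2")
    price = node.findtext("span[@class='price']")
    items.append((name, float(price)))

print(items)
```

Selecting by path rather than by raw text position keeps the query readable, but it still depends on the class names staying the same.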

But there are some pitfalls too:

  • The 'robots.txt' file of a website sets out its scraping rules, which limits web scraping wherever a rule does not allow it.

  • HTML can be a problem for the web scraping process because scrapers usually locate HTML tags by their id, class or both. When those values change, the scraping code can break or even return wrong results.

  • User agent spoofing is another pitfall. Every time we visit a website, it obtains browser information from the user agent. Moreover, some websites won't show any content unless we provide a user agent.
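
The robots.txt and user agent pitfalls can both be handled before a request is sent. Here is a minimal sketch using the standard library: it checks a URL against a set of robots.txt rules and attaches a User-Agent header to the request. The rules, the URLs and the user agent string "example-scraper/0.1" are hypothetical. In practice the rules would be read from the site's own robots.txt file.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string identifying our scraper.
USER_AGENT = "example-scraper/0.1"

# Hypothetical rules; a real scraper would load the site's own robots.txt.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def build_request(url):
    """Return a request carrying our user agent, or None if robots.txt disallows the URL."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    return Request(url, headers={"User-Agent": USER_AGENT})


allowed = build_request("https://example.com/products")
blocked = build_request("https://example.com/private/report")

print(allowed.get_header("User-agent"))  # header names are stored capitalized
print(blocked)
```

The request object is only built here, not sent. Passing it to urllib.request.urlopen would perform the actual download.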