Web Scraping Without Beautifulsoup



Note: Scraping a website without the owner's consent is not considered good practice, and it may get your IP address blocked by that website.

Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. In this tutorial, you will learn about several modules that make it easy to scrape web pages in Python.

Sometimes we need to extract information from websites. When a website offers an API, we can use it to get the data. But many websites do not offer one.

This is where web scraping comes into play!

Python is widely used for web scraping because of how easy it makes writing the core logic. Whether you are a data scientist, developer, engineer or someone who works with large amounts of data, web scraping with Python is of great help.

Without a direct way to download the data, you are left with web scraping, which lets you extract large quantities of data in a short period of time.

In this tutorial, we will look at scraping with two powerful Python libraries: BeautifulSoup and Selenium.

BeautifulSoup and urllib

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It does not fetch a web page itself, so here we will use the urllib library to download the page.

First we need to install the BeautifulSoup4 package and the lxml parser using the following commands:

$ sudo pip install beautifulsoup4

$ pip install lxml

OR

$ sudo apt-get install python3-bs4

$ sudo apt-get install python3-lxml

Here I am going to extract the case studies page of the website https://www.botreetechnologies.com

from urllib.request import urlopen

from bs4 import BeautifulSoup

These are the packages we are going to use in our program. Now we download the web page as follows.

response = urlopen('https://www.botreetechnologies.com/case-studies')

The response is still raw HTML, so we need to parse it before BeautifulSoup can search it.

data = BeautifulSoup(response.read(),'lxml')

Here we parsed the HTML content of the page into a BeautifulSoup object using the lxml parser.

As you can see, the page lists many case studies, and I want to read all of them.

Each case study has a title at the top, followed by some details about that case. I want to extract all of that information.

We can select an element by tag, class, id and so on (BeautifulSoup itself does not support XPath; we will use XPath later with Selenium).

You can find the class of an element by right-clicking it in the browser and selecting Inspect Element.

case_studies = data.find('div', { 'class' : 'content-section' })

If several elements on the page have this class, find() returns only the first one. To get all of them, use the find_all() method (findAll() is its older alias).

case_studies = data.find_all('div', { 'class' : 'content-section' })

Each item in this list is a div with class ‘content-section’ together with its child elements. Taking one of them, we read the <h2> tag to get the ‘TITLE’ and the <ul> tag to get its children, the <li> elements.

case_stud = case_studies[0]

case_stud.find('h2').find('a').text

case_stud_details = case_stud.find('ul').find_all('li')

Now we have the list of all children of the ul element.

To get the first element of that list, simply write:

case_stud_details[0]

We can read any attribute of an element. For example, we can get the text of an element with:

case_stud_details[2].text
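The steps above can be put together in one small function. This is a minimal sketch, assuming markup shaped like the description above; the sample HTML, the function name extract_case_studies and the html.parser choice are illustrative assumptions, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the case-studies page.
SAMPLE_HTML = """
<div class="content-section">
  <h2><a href="/case-studies/alpha">Alpha</a></h2>
  <ul><li>Ruby on Rails</li><li>E-commerce</li></ul>
</div>
<div class="content-section">
  <h2><a href="/case-studies/beta">Beta</a></h2>
  <ul><li>Python</li><li>Analytics</li></ul>
</div>
"""


def extract_case_studies(html):
    """Return one dict per case study with its title, link and details."""
    data = BeautifulSoup(html, "html.parser")
    results = []
    # find_all returns every matching div, not just the first one.
    for case_stud in data.find_all("div", {"class": "content-section"}):
        link = case_stud.find("h2").find("a")
        details = [li.text.strip() for li in case_stud.find("ul").find_all("li")]
        results.append(
            {"title": link.text.strip(), "url": link["href"], "details": details}
        )
    return results


case_studies = extract_case_studies(SAMPLE_HTML)
for study in case_studies:
    print(study["title"], study["url"], study["details"])
```

On a real page the html argument would be the content returned by response.read().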

But here I want to click on the ‘TITLE’ of a case study and open its details page to get all the information.

Since we want to interact with the website to get dynamic content, we need to imitate normal user interaction. This cannot be achieved with BeautifulSoup or urllib, so we need a webdriver.

A webdriver opens a new browser window that we can control programmatically. It also lets us simulate user events such as click and scroll.

Selenium is one such webdriver.

Selenium Webdriver

Selenium WebDriver accepts commands, sends them to a browser, and retrieves the results.

You can install Selenium on your system using the following simple command:

$ sudo pip install selenium

To use it, we need to import Selenium in our Python script.

from selenium import webdriver

I am using the Firefox webdriver in this tutorial. Now we are ready to load our web page, which we can do as follows:

self.url = 'https://www.botreetechnologies.com/'

self.browser = webdriver.Firefox()

self.browser.get(self.url)

Now we need to click on ‘CASE-STUDIES’ to open that page.

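We can click on an element in Selenium with the following piece of code (Selenium 4 syntax; older releases spelled this find_element_by_xpath):

from selenium.webdriver.common.by import By

self.browser.find_element(By.XPATH, "//div[contains(@id,'navbar')]/ul[2]/li[1]").click()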

We are now on the case-studies page, where all the case studies are listed with some information.

Here, I want to click on each case study and open its details page to extract all the available information.

So, I created a list of links to all the case studies and loaded them one after the other.

To go back to the previous page, you can use the following piece of code:

self.browser.execute_script('window.history.go(-1)')

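The final Selenium script combines the steps above: open the home page, click through to the case studies, collect the links, and load each one in turn.

And we are done. Now you can extract static web pages or interact with web pages using the steps above.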

Conclusion: Web Scraping with Python Is an Essential Skill to Have

Today, more than ever, companies are working with huge amounts of data. Learning how to scrape data with Python will take you a long way. In this tutorial, you learned Python web scraping with BeautifulSoup.

Python web scraping with Selenium is also a useful skill. Companies need data engineers who can extract data and deliver it for gathering useful insights, and Python web scraping projects are a good way to build that experience.

In this article, we will see how to extract structured information from web pages using BeautifulSoup and CSS selectors.

Web Scraping with BeautifulSoup

Pulling the HTML out

BeautifulSoup is not a web scraping library per se. It is a library that allows you to efficiently and easily pull out information from HTML. In the real world, it is often used for web scraping projects.

So, to begin, we'll need HTML. We will pull the HTML from the HackerNews landing page using the requests Python package.

Parsing the HTML with BeautifulSoup

Now that the HTML is accessible, we will use BeautifulSoup to parse it. If you haven't already, you can install the package with a simple pip install beautifulsoup4. In the rest of this article, we will refer to BeautifulSoup4 as BS4.

We now need to parse the HTML and load it into a BS4 structure.

This soup object is very handy and allows us to easily access many useful pieces of information such as:
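A minimal sketch of the parsing step and of the kind of information the soup object exposes. The inline HTML is a hypothetical stand-in for the downloaded page, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that would normally come from the HTTP response.
html = """
<html>
  <head><title>Hacker News</title></head>
  <body>
    <a href="https://news.ycombinator.com/newest">new</a>
    <a href="https://news.ycombinator.com/ask">ask</a>
  </body>
</html>
"""

# Parse the HTML and load it into a BS4 structure.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                 # text inside <title>
title_tag_name = soup.title.name               # name of the tag itself
link_count = len(soup.find_all("a"))           # number of <a> tags
visible_text = soup.get_text(" ", strip=True)  # all text, joined with spaces

print(page_title, title_tag_name, link_count)
print(visible_text)
```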

Targeting DOM elements

You might begin to see a pattern in how to use this library. It allows you to quickly and elegantly target the DOM elements you need.

If you need to select a DOM element by its tag (<p>, <a>, <span>, ….) you can simply do soup.<tag> to select it. The caveat is that it will only select the first HTML element with that tag.

For example, if I want the first link, I just have to do:

This element will also have many useful methods to quickly extract information:
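As a sketch of both points, here is the first link selected by tag name, followed by a few of the attributes and methods a Tag offers. The markup is a hypothetical example.

```python
from bs4 import BeautifulSoup

html = '<p>Intro</p><a href="/newest" class="nav top">new</a><a href="/ask">ask</a>'
soup = BeautifulSoup(html, "html.parser")

# soup.a is shorthand for soup.find("a"): only the first <a> is returned.
first_link = soup.a

link_text = first_link.text      # text content of the tag
link_href = first_link["href"]   # attribute access, raises KeyError if absent
link_attrs = first_link.attrs    # dict of all attributes
missing_id = first_link.get("id")  # .get() returns None instead of raising

print(link_text, link_href, link_attrs, missing_id)
```

Note that class is a multi-valued attribute, so BS4 returns it as a list.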

This is a simple example. If you want to select the first element based on its id or class it is not much more difficult:

And if you don't want the first matching element but instead all matching elements, just replace find with find_all.
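A short sketch of find and find_all with an id and a class. The markup, the id main and the class score are hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <span class="score">120 points</span>
  <span class="score">87 points</span>
  <span class="age">2 hours ago</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# First element with a given id or class.
main_div = soup.find(id="main")
first_score = soup.find(class_="score")  # class_ because class is a keyword

# Every element with that class.
all_scores = soup.find_all(class_="score")
score_texts = [tag.text for tag in all_scores]

print(main_div.name, first_score.text, score_texts)
```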

This simple and elegant interface allows you to quickly write short and powerful Python snippets.

For example, let's say that I want to extract all links in this page and find the top three links that appear the most on the page. All I have to do is this:
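One way to write this, as a sketch: collect every href and count them with collections.Counter. The markup is a made-up stand-in for the page.

```python
from collections import Counter

from bs4 import BeautifulSoup

html = """
<a href="/a">1</a><a href="/b">2</a><a href="/a">3</a><a href="/c">4</a>
<a href="/a">5</a><a href="/b">6</a><a href="/c">7</a><a href="/d">8</a>
<a href="/a">9</a><a href="/b">10</a><a name="anchor">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# most_common(3) returns the three (link, count) pairs with the highest counts.
top_links = Counter(hrefs).most_common(3)
print(top_links)
```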

Advanced usage

BeautifulSoup is a great example of a library that is both easy to use and powerful.

You can do much more to select elements using BeautifulSoup. Although we won't cover those cases in detail in this article, here are a few examples of advanced things you can do:

  • Select elements with a regexp
  • Select elements with a custom function (links that have Google in them, for example)
  • Iterate over sibling elements
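For a rough idea of what these three look like in practice, here is a small sketch. The markup and the helper name is_google_link are hypothetical.

```python
import re

from bs4 import BeautifulSoup

html = """
<h1>Links</h1>
<p>first</p>
<a href="https://www.google.com/search">Google search</a>
<a href="https://example.com/">Example</a>
<p>second</p>
<h2>Other</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Regular expression: match every tag whose name is h1 to h6.
headings = [tag.name for tag in soup.find_all(re.compile(r"^h[1-6]$"))]


# 2. Custom function: called once per tag, keeps the tag when it returns True.
def is_google_link(tag):
    return tag.name == "a" and "google" in tag.get("href", "")


google_links = [tag["href"] for tag in soup.find_all(is_google_link)]

# 3. Siblings: every tag that follows the first <p> at the same level.
first_p = soup.find("p")
following_tags = [sib.name for sib in first_p.find_next_siblings(True)]

print(headings, google_links, following_tags)
```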

We also only covered how to target elements but there is also a whole section about updating and writing HTML. Again, we won't cover this in this article.

Let's now talk about CSS selectors.

CSS selectors

Why learn about CSS selectors if BeautifulSoup can select all elements with its built-in methods?

Well, you'll soon understand.

Hard DOM

Sometimes, the HTML document won't have a useful class and id. Selecting elements with BS4 without relying on that information can be quite verbose.

For example, let's say that you want to extract the score of a post on the HN homepage, but you can't use class name or id in your code. Here is how you could do it:

Not that great, right?

If you rely on CSS selectors, it becomes easier.
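To make the comparison concrete, here is a sketch of both approaches on the same document. The markup is a simplified, hypothetical stand-in and not the real HN page structure.

```python
from bs4 import BeautifulSoup

# Simplified layout: a title row followed by a row holding the score.
html = """
<table>
  <tr><td><a href="https://example.com/post">Post title</a></td></tr>
  <tr><td><span>42 points</span> <a href="/user">alice</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Without class or id: walk the tree step by step.
table = soup.find("table")
rows = table.find_all("tr")
second_row = rows[1]
cell = second_row.find("td")
score_verbose = cell.find("span").text

# With a CSS selector: the same path expressed as one string.
score_css = soup.select_one("table tr:nth-of-type(2) > td > span").text

print(score_verbose, score_css)
```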

This is much clearer and simpler, right? Of course, this example artificially highlights the usefulness of the CSS selector. But, you will quickly see that the DOM structure of a page is more reliable than the class name.

Easily debuggable

Another thing that makes CSS selectors great for web scraping is that they are easily debuggable. I'll show you how. Open Chrome, then open the developer tools (right-click -> “Inspect”), click on the Elements panel, and press Ctrl-F (or Cmd-F on macOS) to enter search mode.

In the search bar, you'll be able to write any CSS expression you want, and Chrome will instantly find all elements matching it.

Iterate over the results by pressing Enter to check that you are correctly getting everything you need.


What is great with Chrome is that it works the other way around too. You can also right-click on an element, click “Copy -> Copy selector”, and the selector will be copied to your clipboard.


Powerful

CSS selectors, and particularly pseudo-classes, allow you to select any elements you want with one simple string.

Child and descendants

You can select direct children and descendants with:

And you can mix them together:
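A small sketch of the child (>) and descendant (space) combinators, alone and mixed, using BS4's select method. The markup and the id root are hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div id="root">
  <p>direct child</p>
  <section>
    <p>nested <a href="/x">link</a></p>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "#root > p": only <p> elements that are direct children of #root.
direct_children = [p.text for p in soup.select("#root > p")]

# "#root p": every <p> anywhere below #root.
descendants = [p.text for p in soup.select("#root p")]

# Mixed: <a> directly inside a <p> that sits anywhere inside a
# <section> that is itself a direct child of #root.
mixed = [a.text for a in soup.select("#root > section p > a")]

print(direct_children, descendants, mixed)
```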

This will totally work.

Siblings

This one is one of my favorites because it allows you to select elements based on the elements on the same level in the DOM hierarchy, hence the sibling expression.

To select all p coming after an h2 you can use the h2 ~ p selector (it will match two p). You can also use h2 + p if you only want to select p coming directly after an h2 (it will match only one p)
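As an illustration, here is a hypothetical document in which h2 ~ p matches two paragraphs and h2 + p matches one.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <p>before</p>
  <h2>Heading</h2>
  <p>first after</p>
  <div>separator</div>
  <p>second after</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# General sibling: every <p> that comes after the <h2> under the same parent.
general = [p.text for p in soup.select("h2 ~ p")]

# Adjacent sibling: only a <p> that immediately follows the <h2>.
adjacent = [p.text for p in soup.select("h2 + p")]

print(general, adjacent)
```

The paragraph before the h2 is never matched, because both combinators only look forward.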

Attribute selectors

Attribute selectors allow you to select elements with particular attribute values. So, p[data-test='foo'] will match any <p> element whose data-test attribute is exactly foo.
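A brief sketch of a few attribute selector forms. The markup is hypothetical, and the presence and prefix forms are shown only for comparison with the exact match.

```python
from bs4 import BeautifulSoup

html = """
<p data-test="foo">match</p>
<p data-test="foobar">prefix only</p>
<p>no attribute</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact value match.
exact = [p.text for p in soup.select('p[data-test="foo"]')]

# Attribute present, whatever its value.
present = [p.text for p in soup.select("p[data-test]")]

# Value starts with the given string.
prefix = [p.text for p in soup.select('p[data-test^="foo"]')]

print(exact, present, prefix)
```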

Position pseudo classes

If you want to select the last p inside a section, you can also do it in “pure” CSS by leveraging position pseudo-classes. For this particular example, you just need this selector: section p:last-child (without parentheses). If you want to learn more about this, I suggest you take a look at this article
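A sketch of how this behaves, on hypothetical markup. Note that :last-child only matches a p that is the very last child of its parent; :last-of-type is shown alongside as the variant that ignores trailing elements of other types.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <p>one</p>
  <p>two</p>
  <p>three</p>
</section>
<section>
  <p>only</p>
  <div>trailing div</div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches a <p> only when it is the last element child of its parent.
last_child = [p.text for p in soup.select("section p:last-child")]

# Matches the last <p> among its siblings, even if other tags follow it.
last_of_type = [p.text for p in soup.select("section p:last-of-type")]

print(last_child, last_of_type)
```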

Maintainable code

I also think that CSS expressions are easier to maintain. For example, at ScrapingBee, when we do custom web scraping tasks, all of our scripts begin like this:
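One way such a script might start, as a sketch only: it assumes the idea is to declare every selector as a named constant at the top of the file. The constant names, selectors, function name scrape and sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors, grouped so that a DOM change means editing one place.
TITLE_SELECTOR = "article > h1"
PRICE_SELECTOR = "article span[data-role='price']"
LINK_SELECTOR = "article a[href]"


def scrape(html):
    """Extract the fields named by the selector constants above."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one(TITLE_SELECTOR).get_text(strip=True),
        "price": soup.select_one(PRICE_SELECTOR).get_text(strip=True),
        "links": [a["href"] for a in soup.select(LINK_SELECTOR)],
    }


sample = (
    '<article><h1> Widget </h1><span data-role="price">9.99</span>'
    '<a href="/buy">Buy</a><a>no href</a></article>'
)
result = scrape(sample)
print(result)
```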

This makes it easy and quick to fix scripts when the DOM changes. The laziest way to do it is to simply copy/paste what Chrome gives you when you right-click on an element. If you do this, be careful: Chrome tends to add a lot of useless selectors when you use this trick. So do not hesitate to clean them up a bit before using them in your script.

Conclusion

In the end, everything you can do with pure CSS selectors you can also do with BeautifulSoup4's own methods. But I think choosing the former is the best way to go.

I hope you liked this article about web scraping in Python and that it will make your life easier.

If you'd like to read more about web scraping in Python do not hesitate to check out our extensive Python web scraping guide.

You might also be interested in our XPath tutorial.

Happy Scraping,

Pierre de Wulf




