
Today most sites make their content available via APIs, RSS feeds, or some other form of structured data. But what do we do if none of these is provided and we still need the data in a structured form?

That’s where the art of web scraping comes into play. That’s what this article is about…

Extract the top 10 Google search results without wasting time

We are going to build a simple web scraper that extracts the title and URL of each of the top 10 Google search results for any given term.
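
Since the scraper should work "for any given term", the term has to be encoded safely into the request URL first. Here is a minimal sketch of that step using only the JDK. The class name SearchUrlBuilder, the method buildSearchUrl and the base URL constant are assumptions made for illustration; they are not taken from the project's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchUrlBuilder {

    // Assumption: Google's standard search endpoint with the term in the "q" parameter.
    static final String SEARCH_BASE = "https://www.google.com/search?q=";

    // Form-encodes the term so spaces and special characters survive in the query string.
    static String buildSearchUrl(String term) {
        try {
            return SEARCH_BASE + URLEncoder.encode(term, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every JVM, so this should never be reached.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("speed test"));
        System.out.println(buildSearchUrl("java & jsoup"));
    }
}
```

The returned string could then be handed to Jsoup.connect(). Without the encoding step, a term containing & or # would silently change the meaning of the query string.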

Let’s get our hands dirty

To build our scraper we use Java and the Jsoup library.

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

— From the Jsoup Website

Why Java?

We use Java in this example to show that writing a web scraper in this language can be really simple. Many people associate Java with heavy configuration and complexity compared to something like Node.js.

Get the CSS selector for data extraction

We want to extract the title and URL from the search results page. So fire up your browser’s developer tools and find the CSS selector we need…

We hover over the elements we are interested in and analyse the HTML structure. In the screenshot we can see the outer div#res container that contains all the results.

Every single search result is wrapped inside div.g and the HTML looks like this:

<div class="g">
   <div class="rc" data-hveid="27">
      <h3 class="r">
        <a href="">by Ookla - The Global Broadband Speed Test</a>
      </h3>
      <div class="s">…</div>
   </div>
</div>

So the values we want sit on the a tag inside the h3: its text is the title and its href attribute is the URL.

After some experimenting, the minimal selector we end up with is: h3.r a
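
To try the selector without hitting Google at all, we can run it against a small inline HTML string. This is only a sketch: the sample markup, the example.com links and the class name SelectorDemo are made up for illustration and simply mirror the div.g / h3.r structure described above. It needs the Jsoup jar on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorDemo {

    // Hypothetical sample markup: two matching results plus one h3 without the "r" class.
    static final String SAMPLE_HTML =
          "<div id=\"res\">"
        + "<div class=\"g\"><div class=\"rc\"><h3 class=\"r\">"
        + "<a href=\"https://example.com/one\">First result</a></h3></div></div>"
        + "<div class=\"g\"><div class=\"rc\"><h3 class=\"r\">"
        + "<a href=\"https://example.com/two\">Second result</a></h3></div></div>"
        + "<h3><a href=\"https://example.com/ignored\">No r class</a></h3>"
        + "</div>";

    // Returns one "title -> url" line per link matched by the selector.
    static String describe(String html) {
        final Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();
        for (Element link : doc.select("h3.r a")) {
            out.append(link.text()).append(" -> ").append(link.attr("href")).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(SAMPLE_HTML));
    }
}
```

The third link is skipped because its h3 lacks the r class, which is exactly the filtering we want from h3.r a.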

Write the actual code

The code is only a few lines long and well commented, so I will simply post it here…

package de.mphweb;

import org.jsoup.Jsoup;  
import org.jsoup.nodes.Document;  
import org.jsoup.nodes.Element;

public class App {

    //We need a real browser user agent or Google will block our request with a 403 - Forbidden
    public static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";

    public static void main(String[] args) throws Exception {

        //Fetch the page (pass the Google search URL for your term to connect())
        final Document doc = Jsoup.connect("").userAgent(USER_AGENT).get();

        //Traverse the results
        for (Element result : doc.select("h3.r a")) {

            final String title = result.text();
            final String url = result.attr("href");

            //Now do something with the results (maybe something more useful than just printing to console)
            System.out.println(title + " -> " + url);
        }
    }
}
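
If you want "something more useful" than console output, one option is to collect each title and URL into a small value object and write the list out as CSV. The following is a rough sketch under that assumption. SearchResult, csvField and toCsv are hypothetical names that are not part of the Maven project, and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.List;

class ResultExport {

    // Hypothetical value type for one search hit.
    static final class SearchResult {
        final String title;
        final String url;

        SearchResult(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Wraps a field in double quotes and doubles any embedded quotes.
    static String csvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Renders a header row followed by one CSV row per result.
    static String toCsv(List<SearchResult> results) {
        final StringBuilder csv = new StringBuilder("title,url\n");
        for (SearchResult r : results) {
            csv.append(csvField(r.title)).append(',').append(csvField(r.url)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        final List<SearchResult> results = new ArrayList<>();
        // In the scraper these values would come from result.text() and result.attr("href").
        results.add(new SearchResult("Example \"quoted\" title", "https://example.com/a"));
        results.add(new SearchResult("Second, with comma", "https://example.com/b"));
        System.out.print(toCsv(results));
    }
}
```

Quoting every field keeps titles that contain commas or quotes from breaking the columns.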

Download the code

As usual, you can find a runnable Maven project on GitHub.

