Jsoup Library in Java

Introduction to Jsoup

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations. Jsoup handles HTML parsing, content extraction, DOM traversal, and document modification.

Installation

Adding Jsoup to Your Project

To use Jsoup, add the following dependency to your pom.xml if you're using Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- or the latest version -->
</dependency>

For Gradle:

implementation 'org.jsoup:jsoup:1.17.2' // or the latest version

Basic Usage

Parsing HTML from a URL

Jsoup allows you to parse HTML from a URL and extract data easily.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class JsoupExample {
    public static void main(String[] args) {
        try {
            // Parse HTML from a URL
            Document document = Jsoup.connect("https://example.com").get();
            System.out.println(document.title());

            // Extract the first h1 element (selectFirst returns null when nothing matches)
            Element element = document.selectFirst("h1");
            if (element != null) {
                System.out.println("First h1 element: " + element.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation: This example demonstrates how to parse HTML from a URL and extract the document title and the first h1 element.

Output:

Example Domain
First h1 element: Example Domain

Parsing HTML from a String

Jsoup can also parse HTML from a string.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupStringExample {
    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head>"
                    + "<body><p>Hello, Amit!</p></body></html>";

        Document document = Jsoup.parse(html);
        System.out.println(document.title());

        Element body = document.body();
        System.out.println("Body text: " + body.text());
    }
}

Explanation: This example demonstrates how to parse HTML from a string and extract the document title and body text.

Output:

My Page
Body text: Hello, Amit!
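
The introduction mentions real-world HTML, which is often malformed. The following is a minimal sketch, not a definitive reference, of how Jsoup.parse copes with unclosed tags and a missing html/body wrapper. The class name JsoupMalformedExample and the helper countParagraphs are hypothetical names chosen for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupMalformedExample {
    // Hypothetical helper: counts the <p> elements Jsoup builds from the markup.
    static int countParagraphs(String html) {
        Document document = Jsoup.parse(html);
        return document.select("p").size();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no <html>, <head>, or <body> tag
        String html = "<p>Hello, Amit!<p>Hello, Priya!";

        Document document = Jsoup.parse(html);

        // Jsoup still produces a full document with two separate paragraphs
        System.out.println("Paragraph count: " + countParagraphs(html));
        System.out.println("Has body: " + (document.body() != null));
    }
}
```

Because the parser normalizes the input into a complete document, the selection and extraction methods work the same way on tidy and untidy markup.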

Advanced Features

Selecting Elements

Jsoup provides powerful methods to select elements using CSS selectors.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JsoupSelectExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com").get();

            // Select all paragraphs
            Elements paragraphs = document.select("p");
            for (Element paragraph : paragraphs) {
                System.out.println("Paragraph: " + paragraph.text());
            }

            // Select element by ID (getElementById returns null when no element has that ID)
            Element div = document.getElementById("main");
            if (div != null) {
                System.out.println("Element with ID 'main': " + div.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation: This example demonstrates how to select elements with a CSS selector and by ID, and how to extract their text.

Output:

Paragraph: ...
Element with ID 'main': ...
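
A few more selector forms can be tried offline against an inline string. This is a sketch under stated assumptions: the markup, the class name JsoupSelectorSketch, and the parse helper are all made up for illustration and are not part of the example above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorSketch {
    // Hypothetical markup, used only for this illustration
    static final String HTML =
          "<div id=\"main\" class=\"content\">"
        + "<p class=\"intro\">Welcome</p>"
        + "<p>Details</p>"
        + "<a href=\"/about\">About</a>"
        + "</div>"
        + "<p>Footer</p>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = parse();

        // Tag plus class: the <p> with class "intro"
        Element intro = document.selectFirst("p.intro");
        System.out.println("Intro: " + (intro != null ? intro.text() : "(none)"));

        // Child combinator: only <p> elements directly inside #main
        Elements mainParagraphs = document.select("#main > p");
        System.out.println("Paragraphs in #main: " + mainParagraphs.size());

        // Attribute selector: links that carry an href attribute
        for (Element link : document.select("a[href]")) {
            System.out.println("Link: " + link.attr("href"));
        }

        // A lookup that matches nothing returns null rather than throwing
        System.out.println("Missing: " + document.getElementById("missing"));
    }
}
```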

Extracting Attributes

Jsoup allows you to extract attributes from elements.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class JsoupAttributesExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com").get();

            // Select the first link
            Element link = document.selectFirst("a");
            if (link != null) {
                System.out.println("Link text: " + link.text());
                System.out.println("Link href: " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation: This example demonstrates how to extract the href attribute from a link.

Output:

Link text: More information...
Link href: https://www.iana.org/domains/example
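
Links on real pages are often relative, in which case attr("href") returns the relative path as written. The sketch below shows one way to resolve such a link with absUrl, assuming a base URI is known. The class name JsoupAbsUrlExample, the helper resolveFirstLink, and the sample markup are hypothetical. When a page is fetched with Jsoup.connect, the base URI is set from the request URL, so the explicit base URI here is only needed because the HTML comes from a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupAbsUrlExample {
    // Hypothetical helper: returns the absolute URL of the first link,
    // or an empty string when the markup has no link with an href.
    static String resolveFirstLink(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Assumed sample markup with a relative href
        String html = "<a href=\"/domains/example\">More information...</a>";

        System.out.println(resolveFirstLink(html, "https://www.iana.org/"));
    }
}
```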

Modifying HTML

Jsoup allows you to modify the HTML content of a document.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupModifyExample {
    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head>"
                    + "<body><p>Hello, Vikas!</p></body></html>";

        Document document = Jsoup.parse(html);
        Element body = document.body();

        // Replace the body's content with plain text (this removes the <p> element)
        body.text("Hello, Priya!");

        System.out.println(document.html());
    }
}

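Explanation: This example demonstrates how to replace the content of an element. Calling text() on the body replaces all of its child nodes, which is why the <p> element no longer appears in the output.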

Output:

<html>
 <head>
  <title>My Page</title>
 </head>
 <body>
  Hello, Priya!
 </body>
</html>
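
To change a paragraph without discarding the surrounding structure, one option is to select the inner element and modify that instead. The following is a rough sketch of that approach. The class name JsoupModifySketch, the modify helper, and the added class and attribute values are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupModifySketch {
    static final String HTML = "<html><head><title>My Page</title></head>"
                             + "<body><p>Hello, Vikas!</p></body></html>";

    // Hypothetical helper: edits the first paragraph in place and appends a new one.
    static Document modify(String html) {
        Document document = Jsoup.parse(html);

        Element paragraph = document.selectFirst("p");
        if (paragraph != null) {
            paragraph.text("Hello, Priya!");   // replaces only this paragraph's content
            paragraph.addClass("greeting");    // adds a class attribute
            paragraph.attr("lang", "en");      // sets an arbitrary attribute
        }

        // Append a new <p> as the last child of the body
        document.body().appendElement("p").text("Added paragraph");
        return document;
    }

    public static void main(String[] args) {
        System.out.println(modify(HTML).body().html());
    }
}
```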

Extracting Data from Tables

Jsoup can be used to extract data from tables in HTML.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTableExample {
    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Age</th></tr>"
                    + "<tr><td>Amit</td><td>30</td></tr>"
                    + "<tr><td>Priya</td><td>28</td></tr></table>";

        Document document = Jsoup.parse(html);
        Elements rows = document.select("table tr");

        for (Element row : rows) {
            Elements cells = row.select("th, td");
            for (Element cell : cells) {
                System.out.print(cell.text() + " ");
            }
            System.out.println();
        }
    }
}

Explanation: This example demonstrates how to extract data from an HTML table and print it.

Output:

Name Age
Amit 30
Priya 28
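
Printing cells is enough for inspection, but table data is usually easier to work with once each row is keyed by its column header. The sketch below shows one possible way to do that for a simple table with a single header row. The class name JsoupTableRecords and the helper toRecords are hypothetical, and tables with merged cells or multiple header rows would need more handling.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsoupTableRecords {
    static final String HTML = "<table><tr><th>Name</th><th>Age</th></tr>"
                             + "<tr><td>Amit</td><td>30</td></tr>"
                             + "<tr><td>Priya</td><td>28</td></tr></table>";

    // Hypothetical helper: maps each data row to header -> cell text.
    static List<Map<String, String>> toRecords(String html) {
        Document document = Jsoup.parse(html);
        Elements headers = document.select("table tr th");

        List<Map<String, String>> records = new ArrayList<>();
        for (Element row : document.select("table tr")) {
            Elements cells = row.select("td");
            if (cells.isEmpty()) {
                continue; // header row: it has <th> cells but no <td> cells
            }
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < cells.size() && i < headers.size(); i++) {
                record.put(headers.get(i).text(), cells.get(i).text());
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map<String, String> record : toRecords(HTML)) {
            System.out.println(record.get("Name") + " is " + record.get("Age"));
        }
    }
}
```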

Complex Examples

Web Scraping

Jsoup can be used to scrape data from web pages.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JsoupWebScrapingExample {
    public static void main(String[] args) {
        try {
            // Connect to the website and get the document
            Document document = Jsoup.connect("https://en.wikipedia.org/wiki/List_of_Indian_people").get();

            // Select every list item in the article body
            Elements people = document.select(".mw-parser-output ul li");

            for (Element person : people) {
                System.out.println("Person: " + person.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

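Explanation: This example demonstrates how to scrape list items from a Wikipedia page. The selector matches every list item inside the article body, so the output can include entries other than names.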

Output:

Person: ...
Person: ...

Handling Forms

Jsoup can handle form submissions and extract form data.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.Map;

public class JsoupFormExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com/login").get();
            Element form = document.selectFirst("form");

            if (form != null) {
                Map<String, String> formData = formData(form);
                for (Map.Entry<String, String> entry : formData.entrySet()) {
                    System.out.println(entry.getKey() + ": " + entry.getValue());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static Map<String, String> formData(Element form) {
        Map<String, String> data = new java.util.HashMap<>();
        for (Element input : form.select("input")) {
            String name = input.attr("name");
            String value = input.attr("value");
            if (!name.isEmpty()) {
                data.put(name, value);
            }
        }
        return data;
    }
}

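Explanation: This example demonstrates how to collect the name and value of each input element in the first form on a page. The URL is a placeholder, and the output below assumes a login form with empty username and password fields.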

Output:

username:
password:
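
Jsoup also ships a FormElement class whose formData() method gathers the fields a browser would submit, including select elements and skipping unchecked checkboxes. The sketch below applies it to an inline form so it runs without a network connection. The class name JsoupFormDataSketch, the helper submittableFields, and the form markup are assumptions for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsoupFormDataSketch {
    // Hypothetical login form, used only for this illustration
    static final String HTML =
          "<form action=\"/login\" method=\"post\">"
        + "<input type=\"text\" name=\"username\" value=\"amit\">"
        + "<input type=\"password\" name=\"password\">"
        + "<select name=\"role\">"
        + "<option value=\"admin\">Admin</option>"
        + "<option value=\"user\" selected>User</option>"
        + "</select>"
        + "<input type=\"checkbox\" name=\"remember\">"
        + "</form>";

    // Hypothetical helper: returns the submittable fields of the first form.
    static Map<String, String> submittableFields(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();

        List<FormElement> forms = document.select("form").forms();
        if (forms.isEmpty()) {
            return fields; // no form in the markup
        }
        for (Connection.KeyVal field : forms.get(0).formData()) {
            fields.put(field.key(), field.value());
        }
        return fields;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : submittableFields(HTML).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```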

Parsing and Modifying Large HTML Documents

When working with large HTML documents, Jsoup can parse directly from a file on disk. Jsoup.parse builds the complete DOM in memory, so memory use grows with the size of the document.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;

public class JsoupLargeDocumentExample {
    public static void main(String[] args) {
        try {
            // Parse a large HTML file
            File inputFile = new File("path/to/large-file.html");
            Document document = Jsoup.parse(inputFile, "UTF-8");

            // Extract and modify a specific element
            Element element = document.selectFirst("div.content");
            if (element != null) {
                element.text("Updated content");
            }

            System.out.println(document.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

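Explanation: This example demonstrates how to parse an HTML file from disk and modify one element in it. The output below is illustrative; the actual output depends on the contents of the input file.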

Output:

<!DOCTYPE html>
<html>
<head>
  <title>Large Document</title>
</head>
<body>
  <div class="content">Updated content</div>
</body>
</html>
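
For a large document, printing the whole result to the console is rarely useful. A more typical flow writes the modified HTML back to disk. The sketch below is one possible version of that round trip, using a temporary file so it can run anywhere. The class name JsoupFileRoundTrip and both helper names are hypothetical. Turning off pretty-printing is a choice made here to avoid re-indenting the whole document, not a requirement.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsoupFileRoundTrip {
    // Parses the input file, updates div.content if present, and writes the result.
    static void updateContent(Path input, Path output, String newText) throws IOException {
        Document document = Jsoup.parse(input.toFile(), "UTF-8");

        Element element = document.selectFirst("div.content");
        if (element != null) {
            element.text(newText);
        }

        // Skip re-indentation of the serialized document
        document.outputSettings().prettyPrint(false);
        Files.write(output, document.outerHtml().getBytes(StandardCharsets.UTF_8));
    }

    // Runs the round trip on temporary files and returns the written HTML.
    static String runOnTempFile() {
        try {
            Path input = Files.createTempFile("jsoup-input", ".html");
            Path output = Files.createTempFile("jsoup-output", ".html");
            Files.write(input,
                    "<div class=\"content\">Old content</div>".getBytes(StandardCharsets.UTF_8));

            updateContent(input, output, "Updated content");
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnTempFile());
    }
}
```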

Conclusion

Jsoup is a versatile library for working with HTML in Java. This guide covered the basics of parsing HTML from a URL and a string, selecting elements, extracting attributes, modifying HTML, and extracting data from tables, followed by more complex examples: web scraping, reading form data, and working with large documents.

By leveraging Jsoup, you can simplify and enhance your HTML data extraction and manipulation tasks in Java applications. For more detailed information and advanced features, refer to the official Jsoup documentation.
