Parse HTML with Jsoup in Kotlin
This week I had to read the title of an HTML file so that I could reuse it later in my code. My first idea was to simply take the substring between the <title> and </title> tags. After some testing, I quickly realized that this isn’t an ideal solution, since it can return wrong results. For example, if the document contains those tags somewhere else, such as inside a comment, the substring approach could return that content instead of the real title. Then I heard about Jsoup.
What’s Jsoup?
Jsoup is an open-source Java library mainly used for extracting data from HTML. You can also manipulate the HTML and output the changed result.
Or, in the words of its creators:
Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
How to use it
Add Gradle Dependency
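A minimal sketch of the dependency declaration, assuming a Gradle build that uses the Kotlin DSL (build.gradle.kts). The version number shown here is an assumption; check Maven Central for the current release.

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
}

dependencies {
    // The version is an assumption; replace it with the latest release.
    implementation("org.jsoup:jsoup:1.17.2")
}
```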
Examples
This first example is the solution to the problem I described above, and it is really easy to implement.
Jsoup also provides some slightly more advanced features such as loading, filtering, and modifying. In the following examples I will show you the basics of how to use them.
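The original snippet is not reproduced here, so the following is only a sketch of how the title can be read with Jsoup. The function name extractTitle and the sample HTML are made up for illustration; Jsoup.parse() and Document.title() are part of the Jsoup API.

```kotlin
import org.jsoup.Jsoup

// Returns the text of the <title> element, or an empty string if there is none.
fun extractTitle(html: String): String = Jsoup.parse(html).title()

fun main() {
    // Hypothetical input: a commented-out title appears before the real one,
    // which is the kind of content that trips up a plain substring search.
    val html = """
        <html>
          <head>
            <!-- <title>Old Title</title> -->
            <title>My Page</title>
          </head>
          <body><p>Hello</p></body>
        </html>
    """.trimIndent()

    println(extractTitle(html)) // prints "My Page"
}
```

Because Jsoup parses the comment as a comment node rather than an element, only the real title is returned.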
Loading
Jsoup combines fetching and parsing the HTML into a single step that returns a Document. It parses the markup into the same DOM that a modern browser would build.
The connection can also be customized, for example with a user agent, a timeout, extra headers, or cookies.
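As a rough sketch, loading can look like this. The helper names loadFromUrl and loadFromString are hypothetical, and https://example.com/ is only a placeholder URL.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Fetches the page at the given URL and parses the response into a Document.
// Throws an IOException if the request fails.
fun loadFromUrl(url: String): Document = Jsoup.connect(url).get()

// Parses HTML that is already in memory. The base URI is used to resolve relative links.
fun loadFromString(html: String, baseUri: String): Document = Jsoup.parse(html, baseUri)

fun main() {
    val remote = loadFromUrl("https://example.com/")
    println(remote.title())

    val local = loadFromString("<a href=\"/docs\">Docs</a>", "https://example.com/")
    println(local.selectFirst("a")?.absUrl("href")) // prints "https://example.com/docs"
}
```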
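A sketch of a customized connection could look like the following. The user agent string, the cookie, and the helper name buildConnection are made-up values for illustration; the Connection methods themselves are part of the Jsoup API.

```kotlin
import org.jsoup.Connection
import org.jsoup.Jsoup

// Builds a configured connection without sending the request yet.
fun buildConnection(url: String): Connection =
    Jsoup.connect(url)
        .userAgent("my-crawler/1.0")       // hypothetical user agent string
        .timeout(5_000)                    // timeout in milliseconds
        .header("Accept-Language", "en")   // extra request header
        .cookie("session", "abc123")       // hypothetical cookie
        .followRedirects(true)

fun main() {
    // The request is only sent when get() (or post()) is called.
    val doc = buildConnection("https://example.com/").get()
    println(doc.title())
}
```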
Filtering
With Jsoup, you can use the same kind of selectors you would use in CSS or in JavaScript.
You can also navigate through the whole DOM tree via each node's parents, siblings, and children.
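Here is a small sketch of selector-based filtering. The sample markup and the helper name contentLinks are assumptions made for the example.

```kotlin
import org.jsoup.Jsoup

// Returns the href values of all links inside elements with the class "content".
fun contentLinks(html: String): List<String> =
    Jsoup.parse(html)
        .select("div.content a[href]")
        .map { it.attr("href") }

fun main() {
    val html = """
        <div class="content">
          <p id="intro">Welcome</p>
          <a href="/docs">Docs</a>
          <a href="/blog">Blog</a>
        </div>
        <a href="/imprint">Imprint</a>
    """.trimIndent()

    val doc = Jsoup.parse(html)
    println(doc.selectFirst("#intro")?.text()) // prints "Welcome"
    println(contentLinks(html))                // prints "[/docs, /blog]"
}
```

select() returns every match, while selectFirst() returns only the first one or null if nothing matches.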
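Navigation can be sketched as follows. The helper name neighbours and the sample list are hypothetical; parent(), previousElementSibling(), nextElementSibling(), and children() are Jsoup Element methods.

```kotlin
import org.jsoup.Jsoup

// Returns the parent's id plus the texts of the previous and next sibling elements.
fun neighbours(html: String, id: String): List<String?> {
    val element = Jsoup.parse(html).getElementById(id) ?: return emptyList()
    return listOf(
        element.parent()?.id(),
        element.previousElementSibling()?.text(),
        element.nextElementSibling()?.text()
    )
}

fun main() {
    val html = """<ul id="menu"><li>One</li><li id="two">Two</li><li>Three</li></ul>"""

    println(neighbours(html, "two")) // prints "[menu, One, Three]"

    // children() lists the direct child elements of a node.
    val menu = Jsoup.parse(html).getElementById("menu")
    println(menu?.children()?.size) // prints "3"
}
```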
Modifying
Jsoup offers jQuery-like methods, so you can use attr(), text(), or html() much as you would in jQuery.
You can also create, append, or delete elements in your DOM pretty easily with Jsoup.
Finally, you can convert the Document back to an HTML string for output.
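A minimal sketch of these setters, assuming a made-up snippet with one link and one empty div. The helper name updateLink is hypothetical.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Changes the first link's target and label, and fills the element with id "note".
fun updateLink(doc: Document): Document {
    val link = doc.selectFirst("a") ?: return doc
    link.attr("href", "https://example.com/new")      // set an attribute
    link.text("New link")                             // replace the text content
    doc.selectFirst("#note")?.html("<b>Updated</b>")  // replace the inner HTML
    return doc
}

fun main() {
    val doc = updateLink(Jsoup.parse("""<a href="/old">Old link</a><div id="note"></div>"""))
    println(doc.selectFirst("a")?.attr("href")) // prints "https://example.com/new"
    println(doc.selectFirst("a")?.text())       // prints "New link"
    println(doc.select("#note b").text())       // prints "Updated"
}
```

Called with one argument, these methods read a value; called with a new value, they set it and return the element so that calls can be chained.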
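Creating, appending, and deleting elements might look like this. The list markup, the class name "obsolete", and the helper name editList are assumptions for the example.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Adds one item at each end of the first list and removes outdated items.
fun editList(doc: Document): Document {
    val list = doc.selectFirst("ul") ?: return doc
    list.appendElement("li").text("Three")  // create a new <li> and append it
    list.prependElement("li").text("Zero")  // create a new <li> as the first child
    doc.select("li.obsolete").remove()      // delete all matching elements
    return doc
}

fun main() {
    val html = """<ul><li>One</li><li class="obsolete">Old</li><li>Two</li></ul>"""
    val doc = editList(Jsoup.parse(html))
    println(doc.select("li").eachText()) // prints "[Zero, One, Two, Three]"
}
```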
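Serializing a Document can be sketched like this. The helper name toHtml and its pretty parameter are hypothetical; outerHtml() and outputSettings().prettyPrint() are part of the Jsoup API.

```kotlin
import org.jsoup.Jsoup

// Parses the input and serializes the resulting Document back to an HTML string.
fun toHtml(html: String, pretty: Boolean = true): String {
    val doc = Jsoup.parse(html)
    doc.outputSettings().prettyPrint(pretty) // false keeps the output on one line
    return doc.outerHtml()
}

fun main() {
    println(toHtml("<p>Hello</p>"))                 // indented, multi-line output
    println(toHtml("<p>Hello</p>", pretty = false)) // compact output
}
```

Note that Jsoup normalizes the input while parsing, so the output contains html, head, and body elements even if the input was only a fragment.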
Reflection
What went well?
The whole project is pretty well documented and I did not have any problems using the library for my specific needs.
What needs improvement?
The first PR I created used the substring solution. In the future I should look for better solutions right at the start, not after the first PR.