Over the years I've made repeated use of the jsoup library, so I figured it'd be nice to put out a little primer on using it with CFML.
From the official site:
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
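jsoup allows you to do such things as parse HTML from a string, file, or URL, extract data with DOM methods and selectors, modify elements, and sanitize markup.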
There are a few ways to go about integrating jsoup into an application.
Note: This assumes CommandBox version 3.7+
From the CommandBox CLI, create a new project directory with mkdir cfml-jsoup-example and cd to the new folder. From there, run init cfml-jsoup-example to create a box.json for your project.
#> mkdir cfml-jsoup-example
#> cd cfml-jsoup-example
#> init cfml-jsoup-example
Inside of box.json you want to add a dependency with a JAR endpoint that points to the URL where the JAR file is located. The installed JAR will always be homed in a directory named after the JAR file, but you can place that folder in any "root" folder of your choice. Once the file is saved, running install from the project root pulls the JAR down.
{
"name":"cfml-jsoup-example",
"dependencies":{
"jsoup-1.10.3":"jar:https://jsoup.org/packages/jsoup-1.10.3.jar"
},
"installPaths":{
"jsoup-1.10.3":"lib\\jsoup-1.10.3"
}
}
To manually install jsoup, you can simply go to the official site download page and pull down the latest core library release. Place the JAR file in a folder of your choice within your project.
Now we need to add the JAR to the Java class path. You can map it in your project's Application.cfc via this.javaSettings.
this.javaSettings = { loadPaths: ["./your_dir"] };
To learn more on integrating 3rd party Java libraries in CFML, check out the CFDocs - Java Integration Guide.
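As a rough sketch of how those pieces fit together, here's a minimal Application.cfc. It assumes the JAR was installed to lib/jsoup-1.10.3, matching the installPaths entry in the box.json shown earlier. The application name and the reloadOnChange value are only illustrative, so adjust the path to wherever your JAR actually lives.

component {
    // Illustrative application name (an assumption, use your own)
    this.name = "cfml-jsoup-example";

    this.javaSettings = {
        // Assumed folder holding jsoup-1.10.3.jar, relative to this file
        loadPaths: ["./lib/jsoup-1.10.3"],
        // Skip watching the folder for changed JARs on every request
        reloadOnChange: false
    };
}

With that in place, createObject("java", "org.jsoup.Jsoup") should resolve from any template in the application.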
jsoup can build a Document from a string of HTML-like data, whether it is defined inline or read in from a file as a string.
<cfscript>
// Create the jsoup object
Jsoup = createObject("java", "org.jsoup.Jsoup");
// HTML string
html = "<html><head><title>CFML & jsoup Example</title></head><body>Content about CFML and jsoup.</body></html>";
// Parse the string
document = Jsoup.parse(html);
// Extract content
title = document.title();
body = document.body().text();
writeOutput("
<div>Title: #title#</div>
<div>Body: #body#</div>
");
</cfscript>
The example code loads the Jsoup class and calls its static parse() method on a string of HTML. This returns a Document object that we can act on with its methods.
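If all you have is a snippet of markup rather than a full page, jsoup also offers parseBodyFragment(). Here's a small sketch using a made-up snippet; the fragment is parsed into the <body> of a new Document.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
// A made-up fragment with no <html> or <body> wrapper
fragment = "<p>Content about <b>CFML</b> and jsoup.</p>";
// jsoup creates a document shell and parses the fragment into its <body>
document = Jsoup.parseBodyFragment(fragment);
// text() drops the markup and returns only the text content
writeOutput(document.body().text());
</cfscript>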
Consider the following example HTML...
<!DOCTYPE html>
<html>
<head>
<title>CFML & jsoup Example</title>
<meta charset="UTF-8">
<meta name="keywords" content="jsoup,cfml,java,html">
<meta name="description" content="Examples for using CFML and jsoup.">
<meta name="author" content="@tonyjunkes">
</head>
<body>
<header id="header">Getting Started With CFML & jsoup</header>
<div>Some content...</div>
<a href="#">A link to useful info</a>
</body>
</html>
And CFML...
<cfscript>
// Create the jsoup object
Jsoup = createObject("java", "org.jsoup.Jsoup");
// Create the File object
JFile = createObject("java", "java.io.File");
// Get the absolute file path
fileName = expandPath("./path/to/file.html");
// Parse the File object and extract data
document = Jsoup.parse(JFile.init(fileName), "utf-8");
header = document.getElementById("header");
writeOutput(header.text());
</cfscript>
The example code creates a Java File object from the path to the HTML file and passes it to Jsoup.parse() along with the character set. This returns a Document to act on with its methods.
We can connect to an external source using jsoup's connect() method.
<cfscript>
// Create the jsoup object
Jsoup = createObject("java", "org.jsoup.Jsoup");
// Connect
siteAddress = "https://jsoup.org/";
document = Jsoup.connect(siteAddress).get();
// Do things to act on the Document...
// Dump the object
writeDump(document);
</cfscript>
So we take a website URL and pass it to Jsoup.connect() and, so long as the site resolves to a valid page, we are returned a Document object to act on. The example above only dumps the returned object to show the various functions available to use on the collected content.
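Since a remote site can be slow or unreachable, it may be worth setting a user agent and timeout and wrapping the call in a try/catch. This is only a sketch; the user agent string and the 10 second timeout are arbitrary values I picked for illustration.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
try {
    document = Jsoup.connect("https://jsoup.org/")
        // Arbitrary user agent string (an assumption, use your own)
        .userAgent("Mozilla/5.0 (compatible; cfml-jsoup-example)")
        // timeout() takes milliseconds as a Java int
        .timeout(javaCast("int", 10000))
        .get();
    writeOutput("Fetched: " & encodeForHTML(document.title()));
} catch (any e) {
    // Connection errors, timeouts and HTTP error statuses end up here
    writeOutput("Request failed: " & encodeForHTML(e.message));
}
</cfscript>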
Using the same example HTML file content displayed earlier, we will grab various metadata from a Document object.
<cfscript>
// Create object, pass in file and parse
Jsoup = createObject("java", "org.jsoup.Jsoup");
JFile = createObject("java", "java.io.File");
fileName = expandPath("./path/to/file.html");
document = Jsoup.parse(JFile.init(fileName), "utf-8");
title = document.title();
head = document.head();
writeOutput("Title: #title#");
writeDump(head);
</cfscript>
Once we have parsed the HTML source, we can access data like the title with title() or everything in the <head> element with head().
From the Document object, we can use the select() method and pass in selector syntax, similar to jQuery, as the parameter to match and retrieve the metadata values.
<cfscript>
// Create object, pass in file and parse
Jsoup = createObject("java", "org.jsoup.Jsoup");
JFile = createObject("java", "java.io.File");
fileName = expandPath("./path/to/file.html");
document = Jsoup.parse(JFile.init(fileName), "utf-8");
// Get metadata
description = document.select("meta[name=description]").first().attr("content");
keywords = document.select("meta[name=keywords]").first().attr("content");
writeOutput("
<p>Description: #description#</p>
<p>Keywords: #keywords#</p>
");
</cfscript>
We pass in a selector parameter to query the meta elements, which returns an Elements object. Calling first() gives us the first matching Element, and we can then read its attribute values using attr(). We also have access to various helper methods like first(), last(), next() and prev().
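One thing to watch for: if nothing matches the selector, first() has no element to hand back and the chained attr() call will fail. As a hedged alternative, attr() can be called directly on the Elements object, where it reads from the first match that has the attribute and returns an empty string when there is none. The inline HTML and the robots lookup below are made up for the example.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
html = '<html><head><meta name="description" content="Examples for using CFML and jsoup."></head><body></body></html>';
document = Jsoup.parse(html);
// Reads the content attribute from the first matching meta element
description = document.select("meta[name=description]").attr("content");
// There is no robots meta tag here, so this comes back as an empty string
robots = document.select("meta[name=robots]").attr("content");
writeOutput("<p>Description: " & encodeForHTML(description) & "</p>");
if (len(robots)) {
    writeOutput("<p>Robots: " & encodeForHTML(robots) & "</p>");
} else {
    writeOutput("<p>No robots meta tag found.</p>");
}
</cfscript>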
We can get the raw HTML source of a Document object by calling an inherited method: html().
<cfscript>
// Create the jsoup object and connect
Jsoup = createObject("java", "org.jsoup.Jsoup");
siteAddress = "https://jsoup.org/";
document = Jsoup.connect(siteAddress).get();
writeDump(document.html());
</cfscript>
The html() method is inherited from the Element class, which Document extends.
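Because html() lives on Element, it works on any element and not just the whole document. Here's a quick sketch comparing it with outerHtml() and text() on a single element; the inline markup is a trimmed-down, made-up version of the earlier example file.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
html = '<html><body><header id="header">Getting Started With <em>CFML</em></header></body></html>';
document = Jsoup.parse(html);
header = document.getElementById("header");
// html() returns the markup inside the element
inner = header.html();
// outerHtml() also includes the element's own tag
outer = header.outerHtml();
// text() returns only the text, with all tags removed
plain = header.text();
// Encode the values so the markup is displayed rather than rendered
writeOutput("<div>html(): " & encodeForHTML(inner) & "</div>");
writeOutput("<div>outerHtml(): " & encodeForHTML(outer) & "</div>");
writeOutput("<div>text(): " & encodeForHTML(plain) & "</div>");
</cfscript>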
Link attributes and content can be obtained using the same selector methods demonstrated earlier.
<cfscript>
// Create object, pass in file and parse
Jsoup = createObject("java", "org.jsoup.Jsoup");
JFile = createObject("java", "java.io.File");
fileName = expandPath("./path/to/file.html");
document = Jsoup.parse(JFile.init(fileName), "utf-8");
// Get an array of links
links = document.select("a[href]");
for (link in links) {
writeOutput("
<div>Link: #link.attr("href")#</div>
<div>Text: #link.text()#</div>
");
}
</cfscript>
In this example, we see how to get the href value using the attr() method and also how to obtain the text within the actual <a> element by using text().
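Relative links are common in real pages, so it can help to resolve them against a base URI. The sketch below assumes a made-up relative link and uses the jsoup site as the base URI passed to parse(); absUrl() then returns the resolved address.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
// Made-up markup containing a relative link
html = '<p><a href="/download">Download</a></p>';
// The second argument is the base URI used to resolve relative URLs
document = Jsoup.parse(html, "https://jsoup.org/");
link = document.select("a[href]").first();
// The raw attribute value, exactly as written in the markup
relative = link.attr("href");
// The same link resolved against the base URI
absolute = link.absUrl("href");
writeOutput("<div>Relative: " & encodeForHTML(relative) & "</div>");
writeOutput("<div>Absolute: " & encodeForHTML(absolute) & "</div>");
</cfscript>

When the document comes from Jsoup.connect(), the base URI is already set from the request URL, so absUrl() works without any extra setup.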
Once we find the <form> element in the document, we can use selectors to iterate and grab <input> data.
Consider this HTML...
<!DOCTYPE html>
<html>
<head>
<title>CFML & jsoup Example</title>
</head>
<body>
<form id="contact" name="contact" action="/">
<label>Name:</label>
<input name="fullname" value="Tony Junkes">
<label>E-Mail:</label>
<input name="email" value="fake@email.com">
<label>Message:</label>
<textarea name="message">Message here...</textarea>
</form>
</body>
</html>
And CFML...
<cfscript>
// Create object, pass in file and parse
Jsoup = createObject("java", "org.jsoup.Jsoup");
JFile = createObject("java", "java.io.File");
fileName = expandPath("./path/to/file.html");
document = Jsoup.parse(JFile.init(fileName), "utf-8");
// Get the form and inputs
contactForm = document.getElementById("contact");
inputs = contactForm.getElementsByTag("input");
// Iterate through the inputs
for (input in inputs) {
key = input.attr("name");
value = input.attr("value");
writeOutput("
<div>Name: #key#</div>
<div>Value: #value#</div>
");
}
</cfscript>
So we've used getElementById() to find the <form> element and then getElementsByTag() to grab all of the <input> elements within the form. At this point, we can iterate through the collection of inputs and use the element methods to act on the data.
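Note that the loop above skips the <textarea>, since it isn't an <input> and keeps its value as text rather than in a value attribute. One possible way to cover both, sketched here with a shortened inline copy of the form, is to select both tag names and read each field with val().

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
html = '<form id="contact"><input name="fullname" value="Tony Junkes"><textarea name="message">Message here...</textarea></form>';
document = Jsoup.parse(html);
contactForm = document.getElementById("contact");
// A comma in the selector matches either tag
fields = contactForm.select("input, textarea");
for (field in fields) {
    key = field.attr("name");
    // val() reads the value attribute of an input or the text of a textarea
    value = field.val();
    writeOutput("<div>" & encodeForHTML(key) & ": " & encodeForHTML(value) & "</div>");
}
</cfscript>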
jsoup provides a collection of classes and methods for sanitizing HTML. Similar to AntiSamy, you can use a premade or custom Whitelist class object that specifies valid and invalid elements in a document. This whitelist object is then passed to a Cleaner class object which checks the document against the whitelist rules and removes any invalid content.
<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
Whitelist = createObject("java", "org.jsoup.safety.Whitelist");
Cleaner = createObject("java", "org.jsoup.safety.Cleaner");
html = "<html><head><title>My title</title></head><body><center>Body content</center></body></html>";
filter = Whitelist.none();
valid = Jsoup.isValid(html, filter);
if (valid) {
writeOutput("The document is valid!");
} else {
invalidData = Jsoup.parse(html);
writeOutput("The document is not valid!");
writeDump(invalidData.html());
cleanDocument = Cleaner.init(filter).clean(invalidData);
writeOutput("The document has been cleaned.");
writeDump(cleanDocument.html());
}
</cfscript>
This example takes simple HTML content and validates it against the Whitelist returned by the none() method. This is a pre-defined Whitelist that allows only text and no HTML markup inside of the <body>. When that Whitelist is passed to the Cleaner, the clean() method is called to remove any markup that isn't allowed and leave only valid content.
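For one-off cleaning of a string, Jsoup.clean() wraps the parse and Cleaner steps into a single call. Here's a hedged sketch using basic(), one of jsoup's predefined whitelists, on some made-up markup. It assumes a jsoup version that still ships the Whitelist class, such as the 1.10.3 release used here; later releases renamed it to Safelist.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
Whitelist = createObject("java", "org.jsoup.safety.Whitelist");
// Made-up input with an event handler attribute and an image
dirty = '<p>Hello <b>world</b> <a href="http://example.com/" onclick="steal()">link</a> <img src="x.png"></p>';
// basic() allows simple text formatting and links, but not images or event handlers
clean = Jsoup.clean(dirty, Whitelist.basic());
// Encode both values so the markup is displayed rather than rendered
writeOutput("<div>Before: " & encodeForHTML(dirty) & "</div>");
writeOutput("<div>After: " & encodeForHTML(clean) & "</div>");
</cfscript>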
The predefined Whitelist options are none(), simpleText(), basic(), basicWithImages() and relaxed().
Here are a few more examples I thought were worth mentioning because jsoup is so cool.
This example gets the inner content of an <a> element and replaces the element with only the content, using a TextNode class object.
<cfscript>
// Create Java objects
Jsoup = createObject("java", "org.jsoup.Jsoup");
TextNode = createObject("java", "org.jsoup.nodes.TextNode");
// Create some markup...
html = '<html><head><title>Hello World!</title></head><body><h1>A Header</h1><p>Some content. <a href="##">A cool link.</a></p></body></html>';
// Parse it into a Jsoup Document
document = Jsoup.parse(html);
// Create a Node object
link = document.select("a").first();
node = TextNode.init(link.text(), "");
link.replaceWith(node);
writeDump(label="Original HTML", var="#html#");
writeDump(label="Link Text", var="#link.text()#");
writeDump(label="Modified HTML", var="#document.body().toString()#");
writeDump(node);
</cfscript>
Using a TextNode, we can store the content inside the <a> element. Then we call the replaceWith() method on the element to switch out the HTML for plain text.
Using the same select() method, we can pass in a regular expression string to filter results by using ~= instead of = in the attribute selector.
<cfscript>
// Create Java objects
Jsoup = createObject("java", "org.jsoup.Jsoup");
siteAddress = "https://jsoup.org/";
document = Jsoup.connect(siteAddress).get();
links = document.select("a[href~=^((?!##|html).)*$]");
original = [];
for (link in document.select("a[href]")) {
original.append(link.attr("href"));
}
filtered = [];
for (link in links) {
filtered.append(link.attr("href"));
}
// Original links
writeDump(label="Original Links", var="#original#");
// Filtered links
writeDump(label="Filtered Links", var="#filtered#");
</cfscript>
In this example, we grab links from the home page of the jsoup site. The filtered links use a regex to exclude any URLs that contain a # or the string html.
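A full regex isn't always needed. For simple cases, the attribute selectors for prefix (^=), suffix ($=) and substring (*=) matching may be easier to read. The links in this sketch are made up for illustration.

<cfscript>
Jsoup = createObject("java", "org.jsoup.Jsoup");
html = '<a href="https://jsoup.org/">Home</a><a href="/download.html">Download</a><a href="##top">Top</a>';
document = Jsoup.parse(html);
// href starts with "https"
secure = document.select("a[href^=https]");
// href ends with ".html"
pages = document.select("a[href$=.html]");
// href contains "down" anywhere in the value
downloads = document.select("a[href*=down]");
writeOutput("<div>Starts with https: " & secure.size() & "</div>");
writeOutput("<div>Ends with .html: " & pages.size() & "</div>");
writeOutput("<div>Contains down: " & downloads.size() & "</div>");
</cfscript>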
jsoup is a super powerful library for working with and manipulating HTML. The possibilities are endless when working with node structured documents.
For more info on its classes and methods, check out the jsoup API Docs.
To help break into these examples, I've put together a little project that can be run from CommandBox. You can find it at GitHub - cfml-jsoup-example.
Cheers!