There is a debate on whether HTML classes belong in your content. As in, classes that are strictly related to the presentation of that content. Sometimes the use of these classes is unavoidable. A callout paragraph, a pull quote, a carousel in the middle of a post... you'll need classes to style and add functionality to these things.
While you sometimes need them, the less you write them into the actual post content, in my opinion, the better.
Why avoid writing HTML with classes in content?
The main reason is that these HTML classes are fragile as they're tied to your current theme. On the next redesign there's a chance these classes will change or require different structure. Or at least, over time, certain classes will be forgotten, new classes will emerge, duplicate classes will happen, and it will get messy.
Changing HTML in templates is easy, as one template is responsible for lots of pages. But changing HTML inside content is hard. Posts are individual items, sometimes hundreds or thousands of them, that may need to be updated manually, one post at a time.
But I need those HTML classes!
No worries, WordPress is flexible enough to allow us to generate HTML and insert it into the right spot.
Your content remains pure. No more fragile HTML. Remaining pure, you can easily transform and adapt your post content to your presentational needs.
All these transformations can happen with code. Next time you update the design you'll update the transformation function to generate the right HTML. Just like templates, you make the updates in one place and it affects all the content at once.
No more updating posts manually.
Strategies for adapting content
Among all the tools offered by WordPress, we're going to use:
Shortcodes
the_content filter
I'll quickly explain how the two above work and provide some real-world examples of things you can do with them.
Shortcodes
Shortcodes allow you to define a macro that expands to something of your choice. They're basically a sort of HTML tag that wraps content and accepts attributes. For example, you could put this in post content:
Then write code to have it transform into:
And then have the power to change that output anytime.
WordPress has extensive documentation for Shortcodes, but I'll provide a simple example.
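A minimal sketch of what such a shortcode could look like. The `callout` tag, the `type` attribute, and the class names are made up for illustration. The handler is a plain function, so it can also be called outside WordPress; inside WordPress, `shortcode_atts()` is the usual helper for merging defaults.

```php
<?php
// Hypothetical handler: turns [callout type="..."]...[/callout] into a styled paragraph.
function callout_shortcode($atts, $content = '')
{
    // WordPress passes an empty string when the shortcode has no attributes,
    // so cast to an array before merging in the defaults.
    $atts = (array) $atts + array('type' => 'note');

    // Keep only characters that are safe inside a class name.
    $type = preg_replace('/[^a-z0-9_-]/i', '', $atts['type']);

    return '<p class="callout callout-' . $type . '">' . $content . '</p>';
}

// Register the shortcode only when running inside WordPress.
if (function_exists('add_shortcode')) {
    add_shortcode('callout', 'callout_shortcode');
}

// In a post:   [callout type="warning"]Back up first.[/callout]
// Rendered as: <p class="callout callout-warning">Back up first.</p>
```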
This is a contrived example, but if you include the code above in your `functions.php` file, you can create a post with the following content:
that will render this HTML:
Filters
WordPress has many filters available. A filter is a function that has the opportunity to transform something before it's returned to the entity that requested it. Filters are mainly used by plugins, and are what make WordPress so customizable.
The filter we're going to use is the_content, which has a page in WordPress' codex.
The following is a basic example of how to use it.
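A minimal sketch, assuming a hypothetical function name and a placeholder site (`example.com`): the filter receives the post HTML as a string and must return the string it wants WordPress to use.

```php
<?php
// Hypothetical filter callback: receives the post HTML and returns the modified HTML.
function append_source_note($content)
{
    return $content . '<p class="source-note">Originally published on example.com.</p>';
}

// Hook into the_content only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('the_content', 'append_source_note');
}
```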
This will add text to the end of a post, which can be useful for RSS scrapers.
Getting more out of the_content
The documentation for the the_content filter provides similar examples to the one above, so let's do something different. We'll get to some real world practical examples after we look at the tech involved.
Say you already write pure posts and transform them on the client side with JavaScript. This is a pretty common scenario. Say you write in markdown and do triple-backtick code blocks. Those convert to HTML like...
But say your syntax highlighting library requires the code blocks like this:
You might be doing something like...
That works, but it requires a bunch of DOM effort on every single page load. It would be better to fix that HTML before it even comes to the browser. We'll cover the solution to this in the examples below.
Combined with an HTML (technically, XML) parser such as libxml, the the_content filter lets us move DOM transformations back to the server, relieving the browser. Reducing the amount of JavaScript required on the front end is definitely a good goal.
libxml has bindings for PHP that are usually available in standard installations. Make sure that your server has PHP > 5.4 and libxml > 2.6. You can check that by inspecting the output of phpinfo() or by using the command line:
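One way to check from the command line is a tiny script, sketched here under the assumption that you save it as `check.php` (a made-up file name) and run `php check.php`. It prints the PHP version, the libxml version PHP was built against, and whether the DOM extension is loaded.

```php
<?php
// Collect the versions that matter for the techniques in this post.
$versions = array(
    'php'    => PHP_VERSION,
    // LIBXML_DOTTED_VERSION is only defined when the libxml extension is present.
    'libxml' => defined('LIBXML_DOTTED_VERSION') ? LIBXML_DOTTED_VERSION : 'not available',
);

printf("PHP %s, libxml %s\n", $versions['php'], $versions['libxml']);

// DOMDocument lives in the DOM extension, so it has to be loaded as well.
printf("DOM extension loaded: %s\n", extension_loaded('dom') ? 'yes' : 'no');
```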
If your server doesn't fulfill these requirements you should ask your system administrator to update the required packages.
Parsing a post
The filter we added will receive the raw HTML of the post and return the transformed content.
We're going to use the DOMDocument class to load and transform the HTML. We'll use the loadHTML instance method to parse the post and the saveHTML method to serialize the transformed document back to a string.
There's a little catch: this class will automatically add the <!doctype html> definition and will also automatically wrap the content in <html> and <body> tags. This is because libxml was designed to parse full pages, not just parts of them, as we're doing.
One potential solution is to set some flags when loading the HTML, but this isn't perfect either. When loading the HTML, libxml expects to find a single root element, but posts could have more than one root element (usually, you have many paragraphs). In that case, libxml will throw some errors.
The better solution I came up with is to subclass DOMDocument and override the saveHTML function to strip those html and body tags. When loading the HTML I don't set the LIBXML_HTML_NOIMPLIED flag, so it doesn't throw any error.
It's not ideal, but it gets the job done.
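A sketch of one way to write such a subclass. Rather than stripping the wrapper tags from the string, this version serializes only the children of `<body>`, which has the same effect. It assumes PHP 7.1 or later for the nullable type hint. The `#[\ReturnTypeWillChange]` line is read as an attribute on PHP 8 and as a plain comment on PHP 7.

```php
<?php
// Sketch of a DOMDocument subclass whose saveHTML() returns only the post fragment.
class MSDOMDocument extends DOMDocument
{
    #[\ReturnTypeWillChange]
    public function saveHTML(?DOMNode $node = null)
    {
        // Serializing one specific node keeps the default behaviour.
        if ($node !== null) {
            return parent::saveHTML($node);
        }

        // Otherwise skip the doctype, <html> and <body> that libxml added
        // and serialize only what is inside <body>.
        $body = $this->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return parent::saveHTML();
        }

        $html = '';
        foreach ($body->childNodes as $child) {
            $html .= parent::saveHTML($child);
        }
        return $html;
    }
}

// Usage: load without LIBXML_HTML_NOIMPLIED, then serialize the fragment.
$doc = new MSDOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<p>First paragraph.</p><p>Second paragraph.</p>');
libxml_clear_errors();
$fragment = $doc->saveHTML();
```

Posts with non-ASCII characters usually also need an encoding hint when loading; that is left out here to keep the sketch short.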
Now we need to use MSDOMDocument instead of DOMDocument in our filter functions. If you're going to create more than one filter, I advise you to parse the post just once and pass the MSDOMDocument instance around. When all transformations are done we'll get back the HTML string.
Content Altering Examples
We've learned that we can use shortcodes and libxml to reduce the amount of HTML we directly have to insert in the post. It can be a little hard to understand what results we can get, so let's go through some real world examples.
Many of the following examples come from the production version of MacStories. Other examples are Chris' ideas, which could be easily added to CSS-Tricks one day (or are already in use).
Pull quotes
Your site could have pull quotes. The desired HTML could be something like:
To achieve something like this, I'd suggest a shortcode:
In your post you would then do this
The author is optional and the function that handles that omits it from the HTML altogether if it is not set.
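A minimal sketch of a pull quote handler along those lines. The `pullquote` tag name, the markup, and the class name are assumptions for illustration, not the markup MacStories actually uses.

```php
<?php
// Hypothetical handler for [pullquote author="..."]...[/pullquote].
function pullquote_shortcode($atts, $content = '')
{
    // The author attribute is optional and defaults to an empty string.
    $atts = (array) $atts + array('author' => '');

    $html = '<blockquote class="pullquote"><p>' . trim($content) . '</p>';

    // Emit the attribution only when an author was given.
    if ($atts['author'] !== '') {
        $html .= '<cite>' . htmlspecialchars($atts['author'], ENT_QUOTES, 'UTF-8') . '</cite>';
    }

    return $html . '</blockquote>';
}

// Register the shortcode only when running inside WordPress.
if (function_exists('add_shortcode')) {
    add_shortcode('pullquote', 'pullquote_shortcode');
}
```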
There are many advantages to this:
If you need different HTML or different HTML classes you can update the output from the function in one place.
If you want to scrap pull quotes entirely, you can start returning an empty string from the function.
If you want to add a feature (e.g. click to tweet) you update the output from the function.
Twitter/Instagram embeds
One of the greatest features of WordPress is, in my opinion, automatic embedding. Say you want to insert external content in your post: chances are you might get the job done just by inserting the URL on its own line. No more hunting for the correct embed code. And most importantly, you don't have to keep that up to date.
This relies on oEmbed, and the WordPress documentation lists the supported providers.
WordPress has a hook to customize such embeds. If you want to wrap the embedded content in a div you can do something like this:
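A sketch using the `embed_oembed_html` filter, which passes the embed HTML along with the URL, the attributes, and the post ID. The function name and the `embed` class names are made up; deriving a per-provider class from the host name is my own addition.

```php
<?php
// Hypothetical callback: wraps oEmbed output in a div with a per-provider class.
function wrap_embed($html, $url = '', $attr = array(), $post_id = 0)
{
    $class = 'embed';

    // Derive a class such as "embed-twitter-com" from the embedded URL's host.
    $host = (string) parse_url((string) $url, PHP_URL_HOST);
    if ($host !== '') {
        $class .= ' embed-' . trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($host)), '-');
    }

    return '<div class="' . $class . '">' . $html . '</div>';
}

// Hook in only when running inside WordPress; 4 is the number of accepted arguments.
if (function_exists('add_filter')) {
    add_filter('embed_oembed_html', 'wrap_embed', 10, 4);
}
```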
Syntax highlighting
You can process your code blocks on the server to add line numbers to every line. With that you should be able to just insert the code in pre and code blocks.
This is achieved using the_content filter and libxml:
Search for all code blocks
Get all lines by splitting on newlines
Wrap each line in a span
Apply CSS
The handler also changes the classes (as explained in the earlier example) as required by the syntax highlighter.
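The steps above can be sketched as follows. This is not the MacStories handler: the function name and the `line` and `line-numbers` classes are assumptions, and the body-children serialization is inlined so that the block stands on its own.

```php
<?php
// Hypothetical the_content filter: wraps each line of a <pre><code> block in
// <span class="line"> so that CSS counters can number the lines.
function number_code_lines($content)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about unknown tags
    // The XML declaration is a commonly used hint that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $content; // nothing parseable, leave the post untouched
    }

    // 1. Search for all code blocks (inline <code> outside <pre> is skipped).
    foreach (iterator_to_array($doc->getElementsByTagName('code')) as $code) {
        $pre = $code->parentNode;
        if ($pre->nodeName !== 'pre') {
            continue;
        }

        // 2. Get all lines by splitting on newlines.
        $lines = explode("\n", rtrim($code->textContent, "\n"));
        while ($code->firstChild !== null) {
            $code->removeChild($code->firstChild);
        }

        // 3. Wrap each line in a span.
        foreach ($lines as $line) {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'line');
            $span->appendChild($doc->createTextNode($line . "\n"));
            $code->appendChild($span);
        }

        // Assumed class name for the stylesheet and the highlighter to key off.
        $pre->setAttribute('class', trim($pre->getAttribute('class') . ' line-numbers'));
    }

    // Serialize only the children of <body>, not the wrapper libxml added.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

if (function_exists('add_filter')) {
    add_filter('the_content', 'number_code_lines');
}
```

Note that this sketch replaces the contents of the code element with plain text, so it should run before any highlighter adds its own markup.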
You can use CSS counters to generate the numbers:
As a real-world example from MacStories, we can write Markdown like this:
Which processes into HTML, then is sent through that filter, ending up like this:
Which renders like this, with our syntax highlighter:
Rewriting URLs
When we switched to HTTPS at MacStories we faced an issue with mixed content warnings. Old posts linked to images hosted on Rackspace using the HTTP protocol. Whoops.
Fortunately Rackspace also serves content over HTTPS, but the URL is slightly different.
We decided to add a filter to change those URLs. Editors will link images using the HTTPS URL, but this filter can work around HTTP URLs inserted by mistake. Goodbye mixed content warnings.
This is achieved by adding a the_content filter and running a regular expression substitution.
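A minimal sketch of such a substitution. The real Rackspace host names are not given here, so `cdn.example.com` and `secure-cdn.example.com` are placeholders, and the function name is made up.

```php
<?php
// Hypothetical filter: rewrites plain-HTTP asset URLs to their HTTPS equivalent.
function force_https_assets($content)
{
    // Placeholder hosts; substitute the HTTP and HTTPS hosts of your own CDN.
    return preg_replace(
        '#http://cdn\.example\.com/#i',
        'https://secure-cdn.example.com/',
        $content
    );
}

if (function_exists('add_filter')) {
    add_filter('the_content', 'force_https_assets');
}
```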
You can do something similar to CDN-ify image links: if your image URLs have a well-defined pattern (so that you don't change a URL of something that's not an image) use a similar approach. Otherwise it's better if you parse the HTML to change just the src attribute of the images.
Adding IDs to headings
Having the id attribute set on all headings allows you to link to a specific section (e.g. when you have a Table of Contents or want to share a link scrolled to the correct section).
If you write in HTML, you can add them manually. But that's tedious. If you write in Markdown you have to make sure that your Markdown processor adds them (Jetpack does not). In any case, authoring them adds redundancy to your content.
You can automate the process using libxml in a the_content filter:
Search for all headings
Generate the slug
Set that slug as id attribute
The filter is this:
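A sketch of a filter that follows those steps. The function names are made up, and `heading_slug()` is a deliberately simple ASCII-only stand-in; inside WordPress, `sanitize_title()` would be the more robust choice.

```php
<?php
// Hypothetical slug helper: lowercases and replaces anything non-alphanumeric with "-".
function heading_slug($text)
{
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower(trim($text)));
    return trim($slug, '-');
}

// Hypothetical the_content filter: gives every heading without an id a unique one.
function add_heading_ids($content)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $content;
    }

    $xpath = new DOMXPath($doc);

    // Remember ids that are already in the post so we never duplicate them.
    $seen = array();
    foreach ($xpath->query('//*[@id]') as $element) {
        $seen[$element->getAttribute('id')] = true;
    }

    // Search for all headings, generate the slug, set it as the id attribute.
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $heading) {
        $slug = heading_slug($heading->textContent);
        if ($heading->hasAttribute('id') || $slug === '') {
            continue;
        }
        // Append -2, -3, ... until the id is unique.
        $id = $slug;
        $n = 2;
        while (isset($seen[$id])) {
            $id = $slug . '-' . $n++;
        }
        $seen[$id] = true;
        $heading->setAttribute('id', $id);
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

if (function_exists('add_filter')) {
    add_filter('the_content', 'add_heading_ids');
}
```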
This filter also prevents the generation of duplicated ids.
Removing wrapping paragraphs
If automatic embedding is my favorite feature of WordPress, automatic paragraph wrapping is the thing I hate the most. This issue is well known.
Using RegEx to remove them works, but isn't well suited for working with HTML tags. We can use libxml to remove the wrapping paragraph from images and other elements, such as picture, video, audio, and iframe.
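A sketch of how that could look, assuming a made-up function name. The XPath expression only matches paragraphs that contain exactly one element and no visible text, so a paragraph with an image and a sentence next to it stays wrapped.

```php
<?php
// Hypothetical the_content filter: unwraps media elements that sit alone in a <p>.
function unwrap_media_paragraphs($content)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // <picture>, <video> etc. may trigger warnings
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $content;
    }

    // Media elements that are the only element child of a paragraph
    // which has no non-whitespace text of its own.
    $query = '//p[count(*) = 1 and not(text()[normalize-space()])]'
           . '/*[self::img or self::picture or self::video or self::audio or self::iframe]';

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $media) {
        $paragraph = $media->parentNode;
        // Put the media element where the paragraph was.
        $paragraph->parentNode->replaceChild($media, $paragraph);
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

if (function_exists('add_filter')) {
    // A late priority so it runs after WordPress has added the paragraphs.
    add_filter('the_content', 'unwrap_media_paragraphs', 20);
}
```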
Adding rel=noopener
Recently we became aware of a security issue regarding links opening in a new tab.
Adding the rel=noopener attribute will fix the issue, but that's not something editors should have to remember to do. It also doesn't play nice with Markdown, because you'd have to write links in plain HTML.
libxml can help us:
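A sketch with a made-up function name: it looks for links with `target="_blank"` and appends `noopener` to whatever `rel` values are already there.

```php
<?php
// Hypothetical the_content filter: adds rel="noopener" to links that open a new tab.
function add_noopener($content)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $content;
    }

    foreach ($doc->getElementsByTagName('a') as $link) {
        if ($link->getAttribute('target') !== '_blank') {
            continue; // only links opening in a new tab are affected
        }
        // Keep existing rel values such as "nofollow" and add noopener once.
        $rel = preg_split('/\s+/', $link->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('noopener', $rel, true)) {
            $rel[] = 'noopener';
        }
        $link->setAttribute('rel', implode(' ', $rel));
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

if (function_exists('add_filter')) {
    add_filter('the_content', 'add_noopener');
}
```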
Considerations
I've been using the techniques explained above since the launch of MacStories 4 and haven't had any major issues. Writers can focus solely on writing great content. All presentation-related transformations are documented in code and can easily be ported over to the new version or updated to the new design. It's a big win. I won't have to create a `legacy-theme.css` file to style or fix old (and poor) decisions.
With content filters, you can pretty much do whatever you want. With shortcodes, you'll need to be careful not to create overly-specialized shortcodes that look like the old raw HTML you had in the past. For example
Some of these attributes may not make sense in the future, so it's up to you to decide on attributes that seem well-suited and abstracted enough to live forever. Still, even a bad shortcode is better than no content abstraction at all.
In the end: do what you think is best and think twice before implementing. Always ask yourself "will I need this when the next design goes live?"
Leverage WordPress Functions to Reduce HTML in your Posts is a post from CSS-Tricks