Make WordPress Core

Opened 6 years ago

Last modified 2 months ago

#45387 new defect (bug)

Valid HTML get mangled on the frontend

Reported by: youknowriad's profile youknowriad Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: Editor Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Open the HTML editor, paste this HTML code

<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>

And preview the post in frontend.

The HTML is a big mangled in the output, the link don't show up properly.

I suspect that it's related to one of the_content filters.

Reproduced in 4.9.8

Change History (22)

#1 @pbiron
6 years ago

First, the HTML you've got is invalid. It should be:

<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is &gt; an aria label">http://google.com</a></p>

[Note that > has been replaced by &gt; in the aria-label attribute]

Is the (classic) editor supposed to try to "fix" bad HTML? If so, then there is a bug somewhere. If not, then I'd say this "operator error" and this ticket should be closed as invalid.

Last edited 6 years ago by pbiron (previous) (diff)

#2 follow-up: @youknowriad
6 years ago

This html is valid. You can try by pasting in the w3c official validator https://validator.w3.org/nu/#textarea

<!DOCTYPE html>
<html lang="fr">
<head>
<title>title test</title>
</head>
<body>
<p>To make this thing happen, go to Pages <a href="http://google.com/" target="_blank" rel="noreferrer noopener" aria-label="This is > an aria label">http://google.com</a></p>
</body>
</html>

#3 @chrisvanpatten
6 years ago

There's also some additional context on why > is valid inside HTML attributes in this Gutenberg ticket.

#4 in reply to: ↑ 2 @pbiron
6 years ago

Replying to youknowriad:

This html is valid. You can try by pasting in the w3c official validator https://validator.w3.org/nu/#textarea

You're right, my bad. It is just < and & that must be replaced by character entities or references. But using &gt; does allow the HTML to rendered correctly on the front-end, so the problem is probably in wptexturize() (which is hooked to the_content).

Let me see if I can find why it is confused by > in an attribute value.

#5 @pbiron
6 years ago

The problem is in _get_wptexturize_split_regex() (which is called by wptexturize()) at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L713.

The regex interprets the > in the attribute value as the end of start tag (which is why using the character entity does not cause a problem), which results in the HTML getting "mangled" on the front end.

Unfortunately, I don't know how to fix it...I've never been good with regex assertions/lookahead/etc.

The same regex problem also exists in get_html_split_regex() at https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L673.

#6 follow-up: @aduth
6 years ago

Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0 :

https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=43571#L687

It's unclear from the history whether it's safe to be dropped without any additional changes.

Related:

I'm not certain whether this is on any person's radar for being addressed; my suspicion is it's not.

For the regular expression itself, I sense it may not be possible in a single expression to match the end of an opening tag while exempting ">" which occur within an attribute value.

It's unfortunate that this forces us into a position of needlessly escaping content, as it feels like compounding technical debt. The issue will remain for non-Gutenberg HTML entry.

#7 in reply to: ↑ 6 @pbiron
6 years ago

Replying to aduth:

Maybe worth pointing out that the function is documented to have been slated for removal by v4.5.0 :

I haven't looked through the history but I interpreted that comment as calling for replacing that regex with the one in get_html_split_regex()...which has the same problem.

#8 @afercia
6 years ago

#46114 was marked as a duplicate.

#9 @mcsf
3 years ago

This issue has come up again in a different form in the following Gutenberg bug report:

https://github.com/WordPress/gutenberg/issues/11789#issuecomment-847242464

(In a gist, <script>alert('rock & roll')</script> works as expected, but <script>if (3 < 4) alert('rock & roll')</script> yields an alert message with an escaped &#038;.)

In this case, we can't really work around the issue by escaping content (as described in #6), nor by escaping HTML attributes (as described in #1): we are dealing with a <script> tag and so special characters like < and & should somehow be preserved.

#10 @joshuanan
2 years ago

I hope this will get more attention soon. The issue @mcsf mentioned is really a pain for JS conditional logic. Consider that the following code will throw an error ("Uncaught SyntaxError: '#' not followed by identifier") when entered into a Gutenberg HTML block directly or as the output of a short code entered in Gutenberg, because & is converted to an HTML entity.

<script>

x = 5;
y = 10;
z = 10

if (x < y) {

  console.log("X is Greater than Y.");
}
else if (x !== y && x !== z) {

  console.log("X is More than Y, and Unique From Z.");
}
</script>
Last edited 2 years ago by joshuanan (previous) (diff)

#11 @ninetyninew
16 months ago

We have recently been made aware of an issue which sounds like this one found by use of the twentytwentythree theme.

An example is if, for example, you are using the twentytwentythree theme with WooCommerce, if you add some custom inline HTML and JS code via the woocommerce_after_add_to_cart_quantity hook.

We added a select field with some options, the values of these options are URLs and when selected it redirects the user to the URL in the value, these URLs are concatenated strings based on some PHP conditions and add query args like &something=1 to the string which is then set as the value of the select options, the build of the URL and the functionality which does the redirect upon selection are in an inline <script> tag, the resulting URLs don't remain &something=1 and get converted to #038;something=1, this then means we cannot use the $_GET['something'].

If we switch from twentytwentythree to a classic theme, the conversion of & to #038; doesn't occur.

Last edited 16 months ago by ninetyninew (previous) (diff)

This ticket was mentioned in Slack in #core-editor by ndiego. View the logs.


16 months ago

#13 @alexhay
14 months ago

With projects like Tailwind CSS and Alpine.js gaining a lot of popularity over the last few years, i've seen increasing numbers of developers troubleshooting why features using greater/smaller than characters break the markup in blocks. A resolution to this would save a lot of devs a lot of headache.

#14 @rickiwylder
11 months ago

I've recently experienced this too. Trying to paste an Active Campaign form embed into the HTML block. The embed is one part style, one part html markup and the last part is a script tag where all validation logic etc is happening.

This code:

if (elem.type != 'radio' && elem.type != 'checkbox')


Becomes:

if (elem.type != 'radio' &#038;&#038; elem.type != 'checkbox')


which breaks the script. Sounds like what @ninetyninew reported although I'm using a custom classic theme.

#15 @allocater
9 months ago

Still happening in WordPress 2023. I inverted my logic

if(a && b)

became

if(!(a || b))

since only & is affected 🤷‍♂️

#16 @ninetyninew
8 months ago

Still seeing this issue occuring on several websites.

This ticket was mentioned in PR #5697 on WordPress/wordpress-develop by co6x0.


8 months ago
#17

  • Keywords has-patch has-unit-tests added

Ensures valid HTML is worked correctly by wptexturize(), wp_html_split(), etc.
I started working on this PR when I noticed that using TailwindCSS child selectors would break the layout of block theme (also reported in Trac ticket: 57381).

I have identified a problem with the regular expression defined in _get_wptexturize_split_regex() used in wptexturize().
This problem seemed to be affecting get_the_block_template_html() and causing the block theme layout collapse described above.
Changing this regex fixes the layout issue.

Also, wp_html_split() uses almost the same regex.
Other trac tickets caused by this function will also be fixed by updating to a similar regex.

According to the HTML reference at html.spec.whatwg.org, attribute values can contain a variety of characters.
With this in mind, I have modified the regex to exclude matching characters within quotation marks.
This fixes the misplacement of GREATER-THAN SIGN(>) and prevents other valid HTML structures from being mishandled.

I've included tests to cover these changes in tests/phpunit/tests/formatting/wpTexturize.php and tests/phpunit/tests/formatting/wpHtmlSplit.php. If there's anything I've missed, please let me know.

Trac ticket: https://core.trac.wordpress.org/ticket/45387
Trac ticket: https://core.trac.wordpress.org/ticket/43457
Trac ticket: https://core.trac.wordpress.org/ticket/57381

co6x0 commented on PR #5697:


8 months ago
#18

Added commit.
Removed tranformation of & to &#038; in HTML attribute values modified by <https://core.trac.wordpress.org/ticket/35008>.

This ticket seems to have been created because the W3C HTML Validator found it to be invalid HTML, but as of now, the & in the URL is valid.

@dmsnell commented on PR #5697:


2 months ago
#20

howdy! just wanted to stop by and mention that I've been exploring updating these same functions using the HTML API, which provides a full spec-compliant parse of the HTML stream.

You can find some rough notes on the broader roadmap

#21 @jonsurrell
2 months ago

This Gutenberg issue seems to be caused by wptexturize mangling otherwise valid HTML: https://github.com/WordPress/gutenberg/issues/56597

Last edited 2 months ago by jonsurrell (previous) (diff)

@co6x0 commented on PR #5697:


2 months ago
#22

@dmsnell
Thank you for letting me know.
Handling this with regular expressions has been challenging, so it would be wonderful if we could address it using the HTML API.

Please let me know if there's anything I can help with!
I will close this PR now.

Note: See TracTickets for help on using tickets.