Make WordPress Core

Opened 4 months ago

Closed 3 months ago

Last modified 2 months ago

#60841 closed feature request (duplicate)

Create token lookup class.

Reported by: dmsnell
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 6.5
Component: General
Keywords: has-patch has-unit-tests
Focuses:
Cc:

Description

In the HTML API it would be nice to perform spec-compliant attribute encoding and decoding. This requires a divergence from current Core behavior; namely, relying on the full set of defined named character references rather than a list that has been assembled piecemeal over time.

Specifically when parsing HTML character references I keep finding the need for two operations:

  • is this substring an HTML character reference?
  • what, if any, is the next character reference in this string at this offset?

For the first question it's possible to use in_array( $substring, $character_reference_names, true ), even though this may not be optimal. It's a reasonable first step.

For the second question I'm unaware of a strong implementation available in PHP or WordPress. The motivating change is a move away from splitting a string into potential character references and non-character-reference content and then checking whether each piece is valid, and toward a single-pass parser that is less likely to be confused by strings that are almost character references. There is potential for improving the speed of handling character references, but that is not a driving goal.

<?php
// Attempt to read the longest character reference name at this offset.
$name = $entity_set->read_token( $html, $at );
if ( false !== $name ) {
        // Copy the plain text preceding the reference, then its decoded value.
        $decoded .= substr( $html, $was_at, $at - $was_at ) . entity_lookup( 'attribute', $name );
        $at      += strlen( $name );
        // Advance the copy cursor past the recognized reference.
        $was_at   = $at;
} else {
        ++$at;
}

https://github.com/WordPress/wordpress-develop/pull/5337/files#diff-e2bc0d3d983191acdb2effe67311dc37666eae53d59983281b34a7b4eed238acR1124-R1163

A final need involves mapping a token to a value, but this may be best relegated to another class or application code. For example, mapping from named character reference to UTF-8 bytes representing the corresponding Code Point.
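For instance (illustrative only; the entity_lookup() call above is a stand-in rather than an existing Core function), such a map might relate names to byte sequences like this:

<?php
// "copy;" names U+00A9 COPYRIGHT SIGN, whose UTF-8 encoding
// is the two bytes 0xC2 0xA9 ("©").
$named_character_references = array(
        'copy;' => "\xC2\xA9",
        'amp;'  => '&',
);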


My proposal is to add a new class, WP_Token_Set, providing at least two methods for normal use (a usage sketch follows the list):

  • contains( $token ) returns whether the passed string is in the set.
  • read_token( $text, $offset = 0 ) indicates if the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the longest matching sequence.
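
As a rough sketch of how these two methods might behave (hedged: the construction call and token values below are illustrative, not the real character reference data):

<?php
// Hypothetical usage of the proposed class.
$token_set = WP_Token_Set::from_array( array( 'amp;', 'lt;', 'ltcc;' ) );

$token_set->contains( 'lt;' );  // true
$token_set->contains( 'foo;' ); // false

// read_token() returns the longest match: at offset 1 in "&ltcc;" it
// should find "ltcc;" rather than stopping at the shorter prefix "lt;".
$token_set->read_token( '&ltcc;', 1 ); // "ltcc;"
$token_set->read_token( '&x;', 1 );    // false: no token starts there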

It also provides utility functions for pre-computing these tables. The class is designed for relatively static cases: the lookup code is generated once and then stays unchanged over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded (see the HTML5 spec). A usage sketch follows the list below.

  • static::from_array( array $words ) generates a new token set from the given array of tokens.
  • to_array() dumps the set of tokens into an array of string tokens.
  • static::from_precomputed_table( $table ) instantiates a token set from a precomputed table, skipping the computation for building the table and sorting the tokens.
  • precomputed_php_source_table() generates PHP source code which can be loaded with the previous static method for maintenance of the core static token sets.
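
A hedged sketch of the intended maintenance workflow (the $entities variable is hypothetical, standing in for the parsed entities.json data):

<?php
// Build once from the canonical list of names.
$set = WP_Token_Set::from_array( array_keys( $entities ) );

// Generate PHP source for the static table; the output would be
// committed to Core rather than rebuilt on every request.
$php_source = $set->precomputed_php_source_table();

// At runtime, instantiate from the committed table, skipping the
// build-and-sort step entirely.
$set = WP_Token_Set::from_precomputed_table( $precomputed_table );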

Having a semantic class for this work provides an opportunity to optimize lookup without demanding that the user-space (or Core-space) code change. There are more methods that could be useful but which aren't included in the proposal because they haven't been necessary:

  • add( $token ) and remove( $token ) for dynamically altering the table.

Also, this is currently limited to storing tokens of byte length <= 256 due to practical implementation details (it has not been necessary to store longer tokens).

How can I provide feedback?

I'm happy to do all the work of documenting the class, adding tests, and cleaning up the code for inclusion. What I'd appreciate from you is feedback on the idea, on the naming of the class and its methods, on whether you have considered similar ideas before, and on the general approach taken in the linked PRs.

Thank you all.

Change History (16)

This ticket was mentioned in PR #5373 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#1

Trac ticket: Core-60841

esc_attr() does an in_array() lookup for an array with many entries. could we replace that with an efficient lookup data structure?

for a normative test attribute value: from 4.1 µs to 3.9 µs
for an extreme case (12 MB single-page.html file): from 90 ms to 84 ms

  • the key length can be tuned
  • for HTML4+ legacy comparison this approach doesn't work nearly as well as it does for HTML5 named character references, of which there are more than 2,200
  • adding a small set of common checks up-front didn't show any demonstrable impact. e.g. if ( $i === 'gt' || $i === 'lt' || $i === 'amp' ) { return "&$i;"; }

probably in_array() is just as fast for small arrays, but the idea of collapsing tokens into a linear string has merit for cache locality, whereas the arrays _always_ involve indirect memory lookups; a rough sketch of this idea appears at the end of this comment. (doing so here with a fixed width slowed down both tests, KEY_SIZE = 8)

  • ❓ will PHP end up storing subsequent array values in contiguous memory when declared as an array+string literal? how can we test this?
  • ❓ would binary search on a linear string improve lookup? (seems doubtful since we can probably pre-load enough of the string into memory that it doesn't help, unless the linear string is very long)

I like the idea of having a class to delegate this behavior. with static literal lists it would be trivial to precompute and perform more analysis. e.g. for small data sets have a WP_Small_Token_Set that only stores a linear list (or is this equivalent to storing a long key list?)
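
To make the linear-string idea concrete, here is a hedged sketch (the function names are made up for illustration; the real PR differs): group tokens by a short key, pack each group's suffixes into one length-prefixed string, and scan that string sequentially.

<?php
// Illustrative only: pack tokens sharing a short key into one string so
// a membership test scans contiguous bytes instead of chasing pointers.
function build_groups( array $tokens, int $key_size = 2 ): array {
        $groups = array();
        foreach ( $tokens as $token ) {
                $key    = substr( $token, 0, $key_size );
                $suffix = (string) substr( $token, $key_size );
                // One length byte per suffix: fine while tokens stay under 256 bytes.
                $groups[ $key ] = ( $groups[ $key ] ?? '' ) . chr( strlen( $suffix ) ) . $suffix;
        }
        return $groups;
}

function group_contains( array $groups, string $token, int $key_size = 2 ): bool {
        $key = substr( $token, 0, $key_size );
        if ( ! isset( $groups[ $key ] ) ) {
                return false;
        }
        $suffix = (string) substr( $token, $key_size );
        $packed = $groups[ $key ];
        for ( $i = 0, $end = strlen( $packed ); $i < $end; $i += $length ) {
                $length = ord( $packed[ $i++ ] );
                if ( substr( $packed, $i, $length ) === $suffix ) {
                        return true;
                }
        }
        return false;
}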

This ticket was mentioned in PR #5337 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#2

  • Keywords has-unit-tests added

Trac ticket: Core-60841

Important note

In a test I had esc_attr() return its input unchanged, doing no processing, and could measure no significant impact on page runtime, so this PR is entirely exploratory and is now focused on:

  • could this be more maintainable?
  • does this impose a performance penalty?
  • could this lead to better sanitization?

Benchmark on single-page.html 12 MB test document:

  • trunk runs in 90 ms using 79.3 MB of memory.
  • branch runs in 193 ms using 73.8 MB of memory.
  • branch without $allowedentitynames runs in 184 ms using 73.8 MB of memory.
  • branch matching legacy behavior is at 156 ms using 73.9 MB of memory.
  • This branch is currently less than twice as slow but uses noticeably less memory. Since everything other than esc_attr() consumes 41 MB, the comparison is roughly 38 MB against 33 MB, meaning this branch uses only about 86% as much memory as trunk.

Benchmark on a short string more typical of a real value:

  • trunk runs in 4.0 µs / 40.5 MB
  • branch runs in 4.6 µs / 40.6 MB
  • branch without $allowedentitynames runs in 3.9 µs / 40.6 MB
  • branch matching legacy behavior runs in 3.0 µs / 41.0 MB
  • On short or normal inputs the performance difference between the two is thus insignificant.

---

The existing implementation of esc_attr() runs a jumble of regular-expression and other search passes over its input.

In this patch, if the site uses UTF-8, an exploratory single-pass custom parser is used to escape the attribute values.
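
A minimal sketch of what single-pass means here (hedged: the function name is made up, this is not the PR's actual implementation, and the character-reference handling discussed on the ticket is elided):

<?php
// Walk the input exactly once, copying safe runs verbatim and replacing
// the bytes that must be encoded in a double-quoted attribute context.
function escape_attribute_value( string $text ): string {
        $escaped = '';
        $at      = 0;
        $length  = strlen( $text );

        while ( $at < $length ) {
                // Copy the longest run of bytes that need no escaping.
                $span     = strcspn( $text, '&"<>', $at );
                $escaped .= substr( $text, $at, $span );
                $at      += $span;

                if ( $at >= $length ) {
                        break;
                }

                switch ( $text[ $at ] ) {
                        case '&': $escaped .= '&amp;';  break;
                        case '"': $escaped .= '&quot;'; break;
                        case '<': $escaped .= '&lt;';   break;
                        case '>': $escaped .= '&gt;';   break;
                }
                ++$at;
        }

        return $escaped;
}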


@dmsnell commented on PR #5373:


3 months ago
#3

Rebased from e98508f10d17b606cf3f8761c82bb0f8750b263d to 502e4f5bcbe1271d57efe81aebe4ac19884e6cb7

#4 @dmsnell
3 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #60698.

@gziolo commented on PR #5373:


2 months ago
#5

Some CI jobs report that some static test helpers are private and cannot be accessed.

@dmsnell commented on PR #5373:


2 months ago
#6

I've been reviewing the implementation and I have a concern. As far as I can tell, case-insensitive behavior is unreliable on some strings, specifically multibyte strings with upper/lowercase variants like ü.

Have you considered this?

Yes, absolutely.

I did add case insensitivity during the testing cycle somewhat as an afterthought, but I like having it. It's intended to serve the purpose for which the class was made, which means focusing on ASCII casing. This may sound strange, but almost all basic casing functions are primarily if not entirely ASCII-oriented, because proper case folding is a much more complicated topic, and no, mb_strtoupper() is not enough, at all.

So I think the expectation that this works for ü is more subjective. My gut feeling is that we can clarify this in the documentation and let it be. There's no way this is going to properly detect more complicated case variants, but then again, nothing in this domain does. Otherwise we can remove it, but I like how the option gives us the ability to implement functions like WP_HTML_Processor::is_special() without allocating a new string and upper-casing it.

@jonsurrell commented on PR #5373:


2 months ago
#7

It seems fine to me if we don't support case-insensitive non-ASCII token matching as long as we're clear about it.

@dmsnell commented on PR #5373:


2 months ago
#8

It seems fine to me if we don't support case-insensitive non-ASCII token matching as long as we're clear about it.

Ha! I went to add these and I already noted this in the function docblock.

'case-insensitive' to ignore ASCII case or default of 'case-sensitive'

https://github.com/WordPress/wordpress-develop/assets/5431237/f9d3bf66-9307-4f67-b4bb-6e8f1aecacfa

Text issues are apparently always on my mind. Let me know if you still think this is unclear. I'm worried that if we try to add more documentation when it's already right at the cursor when calling the function, it won't be any more helpful, just more noise.

@jonsurrell commented on PR #5373:


2 months ago
#9

I already noted this in the function docblock … Let me know if you still think this is unclear.

I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

8.2.0 Case conversion no longer depends on the locale set with `setlocale()`. Only ASCII characters will be converted.

That seems like exactly what we want, but I've looked around the PHP source, the changelog, etc. and I haven't managed to find exactly what changed or examples of the difference.
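
For illustration, a minimal sketch of the ASCII-only behavior (assuming PHP >= 8.2; before 8.2 the single-byte functions could vary with the active locale):

<?php
strtolower( 'DIV' );   // "div": ASCII characters are folded
strtolower( 'Ü' );     // "Ü": the UTF-8 bytes pass through unchanged
mb_strtolower( 'Ü' );  // "ü": multibyte-aware, yet still not full case folding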

@dmsnell commented on PR #5373:


2 months ago
#10

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

I've struggled to test this with the HTML API. in effect, I believe some system locales could cause a few characters to be case-folded differently, but this is such a broken system that I don't consider it a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it.

@dmsnell commented on PR #5373:


2 months ago
#11

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

@dmsnell commented on PR #5373:


2 months ago
#12

I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't ideal; I was looking for more examples and didn't see any. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@jorbin commented on PR #5373:


2 months ago
#13

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't ideal; I was looking for more examples and didn't see any. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@subgroup isn't an annotation that is a part of PHPUnit. I think this group makes sense, thanks!

@jorbin commented on PR #5373:


2 months ago
#14

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdraw this suggestion. I do think it might make sense to link that page and mention in the README that the entities JSON should never change, for others who come across this with the same thinking as me.

@dmsnell commented on PR #5373:


2 months ago
#15

I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

Thanks again @aaronjorbin!

It's already mentioned at the top of the README, in the character reference class (as well as in the autogenerator script it comes from), in the PR description, and in the Trac ticket description.

Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be?
