Make WordPress Core

Opened 3 months ago

Closed 2 months ago

Last modified 11 days ago

#61182 closed enhancement (fixed)

Normalize UTF-8 charset slug detection.

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.6 Priority: normal
Severity: normal Version: 6.6
Component: General Keywords: has-unit-tests has-patch
Focuses: Cc:

Description

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?
One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached? If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

Change History (14)

This ticket was mentioned in PR #6535 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#1

Trac ticket: Core-61182

There are several exist places in Core that attempt to detect if a blog charset is UTF-8. Each place attempts to perform the same check, except the logic is spread throughout and there's no single method provided to make this determination in a consistent way. The _canonical_charset() method exists, but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method taking an optional charset slug and indicating if it represents UTF-8, examining all of the allowable variants of that slug. Associated code is updated to use this new function, including _canonical_charset().

Finally, the test functions governing _canonical_charset() have been rewritten as a single test with a data provider instead of as separate test functions.

This patch raises a couple of questions:

  • Is there a need to normalize the ISO-8859-1 charsets since htmlspecialchars() does not need it?
  • Should WordPress normalize invalid variations of ISO-8859-1, such as "latin1", so that htmlspecialchars() does not choke?

One important question I'd like your help on, if you know the answer: _is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached?_ If it's already cached then I suspect further caching or use of static vars will only bloat the memory footprint of Core without speeding it up.

#2 in reply to: ↑ description @SergeyBiryukov
3 months ago

Replying to dmsnell:

One important question I'd like your help on, if you know the answer: is it still important to avoid calling get_option( 'blog_charset' ) multiple times in a row, or is that already cached?

It is already cached, so performance should not be an issue here.

#3 @dmsnell
3 months ago

Thank you @SergeyBiryukov - I had hoped and thought so, but was not confident. I'll re-examine how it's used because I think we have some old code adding caches ontop of other caches, which in this case only seems like it would bloat the memory use, albeit insignificantly.

#4 @dmsnell
2 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 58147:

Normalize UTF-8 charset slug detection.

There are several exist places in Core that attempt to detect if a blog charset
is UTF-8. Each place attempts to perform the same check, except the logic is
spread throughout and there's no single method provided to make this
determination in a consistent way. The _canonical_charset() method exists,
but is marked private for use.

In this patch the new unicode module provides is_utf8_charset() as a method
taking an optional charset slug and indicating if it represents UTF-8,
examining all of the allowable variants of that slug. Associated code is
updated to use this new function, including _canonical_charset(). If no slug
is provided, it will look up the current get_option( 'blog_charset' ).

Finally, the test functions governing _canonical_charset() have been
rewritten as a single test with a data provider instead of as separate test
functions.

Developed in https://github.com/WordPress/wordpress-develop/pull/6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Props dmsnell, jonsurrell.

#5 @dmsnell
2 months ago

In 58148:

Quick-fix: Normalize UTF-8 charset slug detection.

In the merge of [58417] the new file was missed. This commit adds the missing required file.

Developed in https://github.com/WordPress/wordpress-develop/pull/6535
Discussed in https://core.trac.wordpress.org/ticket/61182

Fixes #61182.
Follow-up to [58147].
Props dmsnell, jonsurrell.

#7 @SergeyBiryukov
2 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Thanks for the commit!

Requiring an extra file just for a single function does not seem ideal to me, could is_utf8_charset() be moved to either wp-includes/functions.php or wp-includes/load.php instead?

#8 @dmsnell
2 months ago

it could @SergeyBiryukov, but there are a few reasons I think it may be fine to leave.

  • load.php is already 2000 lines and can be slow in an IDE to work in if the IDE is applying styling lints and code sniffs.
  • there are many other common Unicode-related functions that I think fit there, including some that are already scattered and duplicated among kses and formatting and elsewhere.
  • "unicode" is a somewhat well-formed category on its own. If looking for some related functionality I might look for a string.php, strings.php, utf8.php , or unicode.php but wouldn't think to lookup load.php

Happy to adapt; just wanted to share that when I proposed this I did consider leaving it even inside of functions.php where _canonical_charset() lives, but for similar reasoning felt it was a good fit for a new home.

#9 @swissspidy
2 months ago

  • Keywords needs-patch added; has-patch removed

I don't think these are compelling enough reasons to create a new file just for this 1 function. If we want to move more similar functions together in a new file, then that can be done separately in one go.

Also, right now, this separate location causes a fatal error when a site is in maintenance mode.

<br />
<b>Fatal error</b>:  Uncaught Error: Call to undefined function is_utf8_charset() in /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php:7481
Stack trace:
#0 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(4310): _canonical_charset('utf-8')
#1 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3804): _wp_die_process_input('Briefly unavail...', 'Maintenance', Array)
#2 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php(3787): _default_wp_die_handler('Briefly unavail...', 'Maintenance', Array)
#3 /Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/load.php(388): wp_die('Briefly unavail...', 'Maintenance', Array)
#4 /Users/pascalb/Local Sites/media-experiments/app/public/wp-settings.php(76): wp_maintenance()
#5 /Users/pascalb/Local Sites/media-experiments/app/public/wp-config.php(95): require_once('/Users/pascalb/...')
#6 /Users/pascalb/Local Sites/media-experiments/app/public/wp-load.php(50): require_once('/Users/pascalb/...')
#7 /Users/pascalb/Local Sites/media-experiments/app/public/wp-login.php(12): require('/Users/pascalb/...')
#8 {main}
  thrown in <b>/Users/pascalb/Local Sites/media-experiments/app/public/wp-includes/functions.php</b> on line <b>7481</b><br />

wp_maintenance() is called early in wp-settings.php, before unicode.php is loaded, and exits. wp_maintenance() manually loads functions.php and then exits. Hence the error.

That's another reason why this should be moved to functions.php

This ticket was mentioned in PR #6568 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#10

  • Keywords has-patch added; needs-patch removed

Trac ticket: Core-61182

This caused issues in maintenance mode, and it's not warranted to have its own module. This will live alongside _canonical_charset(), it's partner function.

Props: dmsnell, sergeybiryukov, swisspiddy.
Follow-up to: [58148].

#11 @dmsnell
2 months ago

Nice catch @swissspidy - I wasn't aware of the need to test in maintenance mode. I'll try and remember that.

Since both of you felt it important anyway to merge this into functions.php, I've done so in #6568

#12 @dmsnell
2 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 58169:

Move is_utf8_charset() into functions.php

This caused issues in maintenance mode, and it's not warranted to have
its own module. This will live alongside _canonical_charset(), it's
partner function.

Fixes: #61182.
Props: dmsnell, sergeybiryukov, swisspiddy.
Follow-up to: [58148].

This ticket was mentioned in Slack in #core-committers by dmsnell. View the logs.


11 days ago

Note: See TracTickets for help on using tickets.