Make WordPress Core

Opened 3 years ago

Closed 17 months ago

#54088 closed defect (bug) (worksforme)

Uploading media containing Norwegian letter å does not automatically readjust it to become aa.

Reported by: paaljoachim Owned by: audrasjb
Milestone: Priority: normal
Severity: normal Version:
Component: Media Keywords: has-patch has-unit-tests dev-feedback
Focuses: Cc:

Description

I did a test yesterday and noticed when uploading an image containing Norwegian letters æ ø å that the å did not convert to aa.

It looked like this:
æ -> ae (converted)
ø -> o (converted)
å -> å (did not convert)

Attachments (4)

Uploading-image-containing-Norwegian-Letters.gif (969.9 KB) - added by paaljoachim 3 years ago.
Norwegian letter å does not convert to aa
Screenshot 2022-04-27 at 00.22.58.png (223.4 KB) - added by nielslange 2 years ago.
Screenshot 2022-10-04 at 20.09.27.jpg (39.5 KB) - added by paaljoachim 22 months ago.
Search for Unicode å
Capture d’écran 2023-03-01 à 16.17.39.png (71.4 KB) - added by audrasjb 17 months ago.
WP 6.2 beta 3

Change History (61)

@paaljoachim
3 years ago

Norwegian letter å does not convert to aa

This ticket was mentioned in Slack in #core by paaljoachim. View the logs.


3 years ago

#2 @SergeyBiryukov
3 years ago

  • Component changed from General to Media

#3 @paaljoachim
3 years ago

I focused on this topic because I am redoing tutorials on my WordPress tutorial site. This is an old tutorial that I believe is likely no longer needed: https://www.easywebdesigntutorials.com/cleaning-up-filenames-that-have-non-utf8-characters-in-them/ (I am adding it here just in case there are aspects of the tutorial that are still needed.) Thanks.

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


3 years ago

#5 @antpb
3 years ago

  • Milestone changed from Awaiting Review to 5.9

For anyone digging into this, the solution will likely be within the remove_accents() function used by sanitize_file_name().

https://developer.wordpress.org/reference/functions/remove_accents/

#6 follow-up: @antpb
3 years ago

I need to do some more digging, but at an initial glance the logic for converting å only turns it into a, and judging from the video provided it is seemingly not even doing that.

https://github.com/WordPress/wordpress-develop/blob/e83a341cc082864edf69257fded43d70d8a27685/src/wp-includes/formatting.php#L1254

#7 @antpb
3 years ago

  • Keywords needs-patch needs-unit-tests added

#8 in reply to: ↑ 6 @knutsp
3 years ago

Replying to antpb:

I need to do some more digging, but at an initial glance the logic for converting å only turns it into a

Some background:
"å" does not stem from a ligature. The letter stems from Old Norse "á", a longer and darker form of the sound written as "a". Swedish has had it since the 16th century, Norwegian since 1917 and Danish since 1948. Danish still uses "aa" in many geographical names (as an alternative, official spelling), but this is not the case in Norway and Sweden (only in old family names).

A few years ago there was a suggestion here on Trac to transliterate "å" to "aa" in slugs, instead of just "a" (as initially in WP). There was some opposition to this in Scandinavia, at least in Norway (advocated by me). Generally, but especially in the Norwegian variant Nynorsk, there has been stronger opposition to using "aa" for "å". This is because in some words the next letter is also an "a", giving "aaa" in words like "Tåa" and "Åa". But also because it doesn't add readability and just makes the word longer. So I say, at least as my personal opinion, keep it like that. We are used to it and don't complain. Keep the special case for Danish.

The main thing here and now is of course to make it work properly for filenames.

#9 follow-up: @paaljoachim
3 years ago

Hei Knut. Thank you for adding the additional information!

My name is Paal (American/English spelling would likely be Paul). Same spelling as my father. In modern Norway Paal is spelled Pål. So the alternative to using å is usually aa. If I write Norwegian with an English keyboard I would use the aa instead of å.

I agree the main thing here is making it work properly for filenames.
I would prefer a conversion of å to aa, but if there is "a lot" of resistance to aa then a single a would also be totally fine.

Last edited 3 years ago by paaljoachim

#10 @johnbillion
3 years ago

There's already a test for this but only for the remove_accents() function, not that it actually applies those transformations to the name of an uploaded file. https://github.com/WordPress/wordpress-develop/blob/16b04903feec8216bdd2e6230f4ad511a9238db1/tests/phpunit/tests/formatting/removeAccents.php#L15

#11 in reply to: ↑ 9 ; follow-up: @knutsp
3 years ago

Replying to paaljoachim:

If I write Norwegian with an English keyboard I would use the aa instead of å.

Good point, and this was mentioned back then. Also international standards on the field. However, writing is slightly different than slugs, as distinguishing between "a" and "å" might feel needed.

So, it was argued from a conservative point of view, don't change what works just fine. That a change was made just for Danish surprised me a bit, but that effectively silenced the discussion.

Small thing. If there is a need on WP for standardization across our relatively small Scandinavian languages, "aa" will be just fine by me.

I have linked to this in the Norwegian Slack.

#12 @bjornjohansen
3 years ago

Oh, no! Please don’t transliterate å to aa, nor ø to oe. Æ is (originally) a ligature, so it’s fine to use ae. Visually, ae is close to an æ, so it’s easy to read. Texts where å is transliterated to aa (or ø to oe) are really hard to read, as that breaks the “look at the full word to recognize and read it” feature in the brain.

It also looks like it was written by Henrik Ibsen 150 years ago. As Paal mentions, in modern Norway Paal is spelled Pål.

Surnames, which are rarely changed/updated, became common in the period when e.g. aa was still used. They became mandatory in 1923, when å had recently been introduced to Norwegian and had not yet been introduced into Danish (which Norwegian was very heavily based on). Over the last 100 years, a lot of family names have been updated to use å, but this is not something that people change lightly, so it’s still common to see aa there. First names using aa are rare.

In the WP context this is only done for normalizing slugs and filenames. Using the longer versions makes them … well … longer. As Knut also mentioned, having “aaa” is not exactly ideal.

I see no reason to make the slugs longer and less readable, to conform to an old and conservative method that is irrelevant and outdated to most people. I tried to find what The Language Council of Norway (Språkrådet) has to say about it, but could not find anything.

If anything gets changed in WP regarding this, we would need a filter on the transliteration table, so people can choose what they like.
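
As a rough illustration of the idea of a filterable transliteration table, here is a minimal standalone sketch. Everything in it is hypothetical: `sketch_default_table()`, `sketch_slugify()` and the `$filter` callable are made-up names, the table is a three-entry stand-in rather than the real one in remove_accents(), and in WordPress itself this would presumably go through a filter hook rather than a callable argument.

```php
<?php
// Hypothetical default table; a stand-in, not WordPress's actual table.
function sketch_default_table(): array {
    return array(
        "\u{00E6}" => 'ae', // æ
        "\u{00F8}" => 'o',  // ø
        "\u{00E5}" => 'a',  // å
    );
}

// Hypothetical helper: $filter stands in for a site-level hook that may
// adjust the table before it is applied.
function sketch_slugify( string $text, ?callable $filter = null ): string {
    $table = sketch_default_table();
    if ( null !== $filter ) {
        $table = $filter( $table );
    }
    // strtr() with an array replaces every key found in $text.
    return strtr( $text, $table );
}

// A site that prefers the older "aa" spelling overrides a single entry.
$prefer_aa = function ( array $table ): array {
    $table["\u{00E5}"] = 'aa';
    return $table;
};

echo sketch_slugify( "p\u{00E5}l" ), "\n";             // pal
echo sketch_slugify( "p\u{00E5}l", $prefer_aa ), "\n"; // paal
```

The default stays short ("a"), and only sites that opt in get the longer form, which is the trade-off discussed in this comment.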

#13 follow-up: @paaljoachim
3 years ago

Hei @bjornjohansen

I do think the most common approach when not able to use Norwegian letters is to use æ = ae, ø = o and å = aa. But I do feel your passion here. Having å become a or aa in a filename does not really matter to me. The important part is actually the process being done, and that the å becomes converted in a filename.

It sounds like you really really really want to instead see å converted to a...:)
That is fine by me..:)

Last edited 3 years ago by paaljoachim

#14 in reply to: ↑ 13 @bjornjohansen
3 years ago

Replying to paaljoachim:

It sounds like you really really really want to instead see å converted to a...:)

Haha, yes. I’m a bit passionate about this. It’s personal :-)

BTW, it looks like filenames are keeping æ, ø, and å. So it’s just in the slugs where æ and ø are transliterated, while å isn’t.

#15 @smit08
3 years ago

Hi @paaljoachim

Which WordPress version are you seeing this issue in? In the latest WordPress version it works fine, but in version 4.9.8 I do see this issue.

#16 @paaljoachim
3 years ago

Hi @smit08

I just retested.
I noticed that "å" remains "å" in WordPress 5.8.1. I tested by naming an image æøå, and only the å was not converted.

Å should become either aa or a instead.
It is fine by me if we change å to a or aa. It is nice to get a fix in place.

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


3 years ago

#18 @sabernhardt
3 years ago

  • Milestone changed from 5.9 to 6.0

This ticket was mentioned in Slack in #core-media by joedolson. View the logs.


3 years ago

This ticket was mentioned in Slack in #core by costdev. View the logs.


2 years ago

#21 @costdev
2 years ago

This ticket was discussed in the bug scrub. @paaljoachim, do you think this ticket is likely to move towards resolution during the 6.0 cycle?

#22 @paaljoachim
2 years ago

Thank you for bringing this up @costdev

Let's go with Bjørn's @bjornjohansen passion for converting å to a...:)
We need someone/dev to create a patch converting å to a. It would be nice to get it into WP 6.0.

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


2 years ago

#24 @antpb
2 years ago

  • Owner set to antpb
  • Status changed from new to assigned

#25 @nielslange
2 years ago

I just ran a quick test and this problem seems to affect more letters than only å. As seen in the screenshot above, it also affects and . That said, I only tested å, æ, and . There might be many more characters affected.

#26 @nielslange
2 years ago

  • Owner changed from antpb to nielslange

#27 in reply to: ↑ 11 @SergeyBiryukov
2 years ago

Replying to knutsp:

Replying to paaljoachim:

If I write Norwegian with an English keyboard I would use the aa instead of å.

Good point, and this was mentioned back then. Also international standards on the field. However, writing is slightly different than slugs, as distinguishing between "a" and "å" might feel needed.

So, it was argued from a conservative point of view, don't change what works just fine. That a change was made just for Danish surprised me a bit, but that effectively silenced the discussion.

For reference, [26585] / #23907 appears to be the related change.

Replying to johnbillion:

There's already a test for this but only for the remove_accents() function, not that it actually applies those transformations to the name of an uploaded file. https://github.com/WordPress/wordpress-develop/blob/16b04903feec8216bdd2e6230f4ad511a9238db1/tests/phpunit/tests/formatting/removeAccents.php#L15

There is a test with some of the mentioned characters for sanitize_file_name() too, see [48603] / #22363. If that doesn't always work as expected, something else might be involved or missing.

Last edited 2 years ago by SergeyBiryukov

#28 @costdev
2 years ago

  • Milestone changed from 6.0 to 6.1

With 6.0 RC1 tomorrow, I'm moving this ticket to the 6.1 milestone.

#29 @paaljoachim
2 years ago

Hi @nielslange

I feel like this ticket got sidetracked with other characters. In Norway for the regular language we in general use æøå. Only the å needed to be converted to either a or aa. It seemed like it was about to be fixed for WP 6.0 with the decision to convert å to a.

Instead of adding in additional characters that are not used in the regular written language of Norway it would have been better to open a new trac ticket with the additional characters.

I am not sure what actually happened in this ticket....

Last edited 2 years ago by paaljoachim

This ticket was mentioned in Slack in #core by paaljoachim. View the logs.


2 years ago

This ticket was mentioned in Slack in #core by costdev. View the logs.


2 years ago

#32 @costdev
2 years ago

  • Keywords dev-feedback added

This ticket was discussed in the bug scrub. The characters mentioned in comment 25 should be discussed in their own ticket.

This ticket should continue on the discussion about changing å to a. Please be mindful of backwards compatibility as the discussion continues.

I'll also add dev-feedback to help draw more attention to this ticket.

This ticket was mentioned in PR #2688 on WordPress/wordpress-develop by nielslange.


2 years ago
#33

  • Keywords has-patch has-unit-tests added; needs-patch needs-unit-tests removed

#34 @nielslange
2 years ago

@paaljoachim and @costdev As asked by you above, I've only addressed the problem with the Norwegian letter å. I noticed, that å can appear with two different Unicode character code points.

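<?php

// Decomposed form: "a" (U+0061) followed by a combining ring above (U+030A).
// mb_ord() reports only the first code point, so this prints int(97).
var_dump( mb_ord( "a\u{030A}" ) );

// Precomposed form: U+00E5, Latin small letter a with ring above. Prints int(229).
var_dump( mb_ord( "\u{00E5}" ) );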

I've added the character, that hasn't been converted, to remove_accents in /wp-includes/formatting.php and updated the unit test test_remove_accents_latin1_supplement in /tests/phpunit/tests/formatting/removeAccents.php.
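
To make the two-code-point point concrete, here is a minimal standalone sketch of why a lookup keyed only on the precomposed character misses the other form. It is an illustration under stated assumptions, not the code from the PR: `sketch_transliterate()` is a hypothetical helper, the three-entry table is a made-up stand-in for the much larger table in remove_accents(), and the second form is assumed to be "a" followed by the combining ring above (U+030A).

```php
<?php
// Hypothetical, much-reduced table keyed on precomposed characters only.
$table = array(
    "\u{00E6}" => 'ae', // æ
    "\u{00F8}" => 'o',  // ø
    "\u{00E5}" => 'a',  // å, precomposed
);

// Same table plus an entry for the decomposed sequence (assumed: a + U+030A).
$extended = $table + array( "a\u{030A}" => 'a' );

// Hypothetical helper: plain string replacement, no Unicode normalization.
function sketch_transliterate( string $text, array $table ): string {
    return strtr( $text, $table );
}

$precomposed = "h\u{00E5}milton";  // å stored as one code point
$decomposed  = "ha\u{030A}milton"; // å stored as a + combining ring above

var_dump( sketch_transliterate( $precomposed, $table ) === 'hamilton' );   // bool(true)
var_dump( sketch_transliterate( $decomposed, $table ) === $decomposed );   // bool(true): left unchanged
var_dump( sketch_transliterate( $decomposed, $extended ) === 'hamilton' ); // bool(true)
```

Both inputs look identical on screen, which may explain why some testers in this ticket saw the conversion work while others did not.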

Last edited 2 years ago by nielslange

#35 @paaljoachim
2 years ago

Thanks @nielslange

#36 follow-up: @knutsp
2 years ago

Very happy this gets in. Also happy that "å" still will transliterate to "a" in slugs, at least when using Norwegian locale.

As noted in my #comment:8 this character is not purely Norwegian, but part of the common Danish/Norwegian alphabet. Very common in locales da_DK, nb_NO and nn_NO.

I guess this transliteration of file names will happen independent of locale?

#37 @paaljoachim
2 years ago

I am redoing older tutorials and I am wondering if this tutorial is still valid for various languages?
(Norwegian letters are mostly taken care of, as we know.)
https://www.easywebdesigntutorials.com/cleaning-up-filenames-that-have-non-utf8-characters-in-them

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


23 months ago

This ticket was mentioned in Slack in #core-media by joedolson. View the logs.


22 months ago

#40 in reply to: ↑ 36 @audrasjb
22 months ago

  • Keywords dev-feedback removed

Replying to knutsp:

As noted in my #comment:8 this character is not purely Norwegian, but part of the common Danish/Norwegian alphabet. Very common in locales da_DK, nb_NO and nn_NO.

I guess this transliteration of file names will happen independent of locale?

Yes, with PR2688, it will :)

#41 @audrasjb
22 months ago

  • Owner changed from nielslange to audrasjb
  • Status changed from assigned to accepted

I think we're good to go with PR2688. Self assigning for final testing and commit.

#42 @audrasjb
22 months ago

Ok, we need to update the Docblock, too.
I'll modify the provided PR accordingly.

#43 @audrasjb
22 months ago

@nielslange @paaljoachim what is the Code for ?

We already have this in remove_accents():

 * | U+00E5  | å     | a           | Latin small letter a with ring above   |

Is it a different character? Which one?

#44 @audrasjb
22 months ago

  • Keywords dev-feedback added

@paaljoachim
22 months ago

Search for Unicode å

#45 @paaljoachim
22 months ago

Heya @audrasjb

I just made a search for the unicode for å and found the above.
https://unicode-table.com/en/search/?q=%C3%A5

#46 @audrasjb
22 months ago

Thanks @paaljoachim, but in that case, it looks like the character is already covered by remove_accents().

See the following links:

#47 @audrasjb
22 months ago

  • Milestone changed from 6.1 to 6.2

With WP 6.1 RC 1 scheduled today (Oct 11, 2022), there is not much time left to address this ticket. Let's move it to the next milestone.

This ticket was mentioned in Slack in #core by costdev. View the logs.


18 months ago

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


17 months ago

#50 @joedolson
17 months ago

@paaljoachim Can you re-test this and confirm whether this issue is still happening? I just tested against trunk and 6.1.1, and I can't reproduce the issue with å failing to change to a in uploaded media file names.

I'm wondering whether this issue was specific to your environment or was fixed inadvertently through some other commit since September 2021.

Regarding the choice of letter to change to: it seems to be somewhat contentious, so in the interest of not churning code unnecessarily, I feel that we should leave it as it is.

#51 @peterwilsoncc
17 months ago

I'm able to reproduce locally on trunk 6.2-beta3-55400-src.

Media handling config (summary of site health report):

  • Active editor: WP_Image_Editor_GD
  • GD version: 2.3.3
  • GD supported file formats: GIF, JPEG, PNG, WebP, BMP, XPM
  • Ghostscript, ImageMagick: not available

Media Page

  1. Visit Admin > Media
  2. Drag image named håmilton.jpg into window
  3. Image remains håmilton.jpg and the resized images also use the letter å.

Block Editor

  1. Create new post
  2. Drag image named håmilton.jpg into block editor
  3. New block created, saves image and the resized images with the letter å

Block editor two

  1. Create new post
  2. Add image block.
  3. Click media library
  4. Drag image into media library, select once uploaded
  5. New block complete, saves image and the resized images with the letter å

It seems the issue in the initial report is the inconsistency in how WordPress handles non-ASCII characters. æ and ø are converted to ASCII while å is not. I understand why this is not optimal.

What I am wondering is, is there any technical limitation introduced by retaining the letter å in the file name? Specifically, is it common that browsers or servers are unable to load the image as a result?

Please excuse any ignorance as I am a monolingual English speaker.

#52 @paaljoachim
17 months ago

Retesting using WordPress 6.2 beta 2.
Twenty Twenty Three.
Brave browser.

I dragged the following images into the Media library.
I uploaded: ståck-of-pebbles-gråphic.jpg it was converted into stack-of-pebbles-graphic.jpg
I uploaded: Sun-grådients-ååå.jpg it was converted into Sun-gradients-aaa.jpg
I uploaded: Moon-stars-åå.jpg it was converted into Moon-stars-aa.jpg
I uploaded: Leaves-white-å.jpg it was converted into Leaves-white-a.jpg.

I also tested dragging a couple of the images into the Block Editor and these were also converted. So as far as I noticed the conversion is working as it should.

This ticket was mentioned in Slack in #core by mukeshpanchal27. View the logs.


17 months ago

#54 @audrasjb
17 months ago

I tested with WordPress 6.2 beta 3, without applying the patch, with a file named håmilton.png and it is converted to hamilton.png.

If I'm not misunderstanding, this is the expected result, isn't it?

#55 @paaljoachim
17 months ago

"If I'm not misunderstanding, this is the expected result, isn't it?"

The å -> a.

As I understand it. It is the expected result.

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


17 months ago

#57 @antpb
17 months ago

  • Milestone 6.2 deleted
  • Resolution set to worksforme
  • Status changed from accepted to closed

With as many confirmations as we have in this ticket, this seems okay to close for now as worksforme.

If this is not correct, feel free to reopen the issue and target it for 6.3!
