man sscanf: %d is deprecated in C or glibc?

Question

I was just reading the glibc sscanf man page (from the Linux man-pages package) and I found the following:

The following conversion specifiers are available:
(...)

d    Deprecated. Matches an optionally signed decimal integer; the next pointer must be a pointer to int.

i    Deprecated. Matches an optionally signed integer; the next pointer must be a pointer to int. The integer is read in base 16 if it begins with 0x or 0X, in base 8 if it begins with 0, and in base 10 otherwise. Only characters that correspond to the base are used.

o    Deprecated. Matches an unsigned octal integer; the next pointer must be a pointer to unsigned int.

(...)

How come %d is deprecated? It seems that all int specifiers are deprecated.
What does it mean and what is there to replace them?

man 3 sscanf does not indicate deprecation for my toolchain. You should cite yours. — Jeff Holt, Commented Dec 4, 2023 at 18:37
There is an explanation in the "BUGS" subsection of the man page — Eugene Sh., Commented Dec 4, 2023 at 18:41
The reasoning is given in the BUGS section of the Linux manpage @DanielWalker linked. — Shawn, Commented Dec 4, 2023 at 18:41
They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation, which is for warnings of obsolete features that are planned to be removed. It's the opinion of the man page author, not from the language specification. — Barmar, Commented Dec 4, 2023 at 18:49
@Barmar They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation... How very Microsoft of them. ;-) — Andrew Henle, Commented Dec 4, 2023 at 18:56

John Bollinger · Accepted Answer · 2023-12-04 19:21:55Z

How come %d is deprecated? It seems that all int specifiers are deprecated.

They are not deprecated in the sense that that term is ordinarily used in software documentation. There is no plan for their removal from the language and there are no direct replacements. The ISO committee responsible for maintaining the language standard has not expressed any opinion that they should be avoided, though there are indeed workarounds available to avoid their use.

The deprecation notices on some Linux manual pages that you are asking about constitute an inappropriate liberty taken by the maintainer of that version of the documentation. It is explained in the BUGS section of the same page:

Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined Behavior for invalid input. See C11 7.21.6.2/10 ⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is a bug in the ISO C standard, and not an inherent design issue with the API. However, current implementations are not safe from that bug, so it is not recommended to use them. Instead, programs should use functions such as strtol(3) to parse numeric input. This manual page deprecates use of the numeric conversion specifiers until they are fixed by ISO C.

The manual page maintainer is both unfortunately opinionated and atypically aggressive. It is a somewhat controversial opinion that it constitutes a bug in the standard for the affected functions to have undefined behavior for invalid input. It is a valid opinion that that is a good reason to avoid numeric conversion specifiers, but that author is not empowered to deprecate the functions in the sense that readers of the manual page would typically understand. The conventional approach to a situation like this would be to add references to the BUGS section at appropriate places in the manual text, possibly even with a brief explanatory note. Deprecation labels are not that, no matter how they are explained elsewhere in the document.

With that said, the scanf-family functions are overall difficult to use correctly. Some around here are prone to recommend avoiding them entirely, and that should certainly be considered. If you do avoid them, then that moots the issue.
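For readers who do follow the BUGS note's advice, here is a minimal sketch of a strtol(3)-based conversion to int. The helper name parse_int and its 1/0 return convention are assumptions made for illustration, not part of any library. Note that strtol itself skips leading whitespace, so this sketch accepts it too.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert the whole string s to an int.
 * Returns 1 and stores the value in *out on success, 0 otherwise. */
static int parse_int(const char *s, int *out)
{
    char *end;

    errno = 0;                      /* strtol only sets errno on failure */
    long v = strtol(s, &end, 10);

    if (end == s)                   /* no digits were consumed */
        return 0;
    if (*end != '\0')               /* trailing characters after the number */
        return 0;
    if (errno == ERANGE)            /* value does not fit in a long */
        return 0;
    if (v < INT_MIN || v > INT_MAX) /* fits in long, but not in int */
        return 0;

    *out = (int)v;
    return 1;
}
```

A caller would use it as `if (parse_int(text, &x)) ...`. Every failure mode is reported through the return value, so no input reaches the case that 7.21.6.2p10 leaves undefined.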

I'm just as frustrated with the manpages' maintainer's deprecation-happy attitude as you are, but I do not think there is any reasonable counterargument to the claim that 7.21.6.2p10's rule "if the result of the conversion cannot be represented in the object, the behavior is undefined" is a design defect in the standard. The only reason I haven't filed a DR is that I consider *scanf unfit for purpose anyway, for reasons that are much harder to fix. — zwol, Commented Dec 5, 2023 at 19:02
At least every occurrence of "Deprecated." should have mentioned the reason and suggested alternative: "Deprecated, prefer strtol." or "Deprecated, see BUGS."! — Bergi, Commented Dec 5, 2023 at 20:18
I understand the maintainer's attitude. Because the input is typically not under the control of the programmer, the UB is a potential opportunity for an exploit or at least denial of service triggered with malicious input. But many programs process input from known sources, and then this is not an issue. I'm also skeptical of the suggestion to replace a well-tested and versatile tool like scanf with one's own code. strtol is not perfectly trivial to use properly either (what's again the condition that the token was entirely read?) and there still is the issue of tokenizing etc. — Peter - Reinstate Monica, Commented Dec 6, 2023 at 0:38
@Pod, "deprecated" is an inaccurate description of what that author conveys in their explanation, so those labels are misleading at minimum. But inasmuch as readers are prone -- with good reason -- to interpret manual pages for C standard library functions to present an accurate representation of the language's specifications for them, yes, conveying a different impression is an inappropriate liberty. This answer already describes the conventional and appropriate approach that the manual page should take to express the kind of concerns at issue. — John Bollinger, Commented Dec 6, 2023 at 13:35
Maybe "Don't use with untrusted input; see BUGS." Cc: @zwol — alx - recommends codidact, Commented Dec 6, 2023 at 14:00

Barmar · Accepted Answer · 2023-12-04 19:00:21Z

This is explained in the BUGS section of the man page:

Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined Behavior for invalid input. See C11 7.21.6.2/10 ⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is a bug in the ISO C standard, and not an inherent design issue with the API. However, current implementations are not safe from that bug, so it is not recommended to use them. Instead, programs should use functions such as strtol(3) to parse numeric input. This manual page deprecates use of the numeric conversion specifiers until they are fixed by ISO C.

So it's not deprecated by the C language specification, the author of the man page is using this notation to indicate that they're not safe to use.

However, this is only a problem in practice if the input being read might not contain validly formatted data. If you're reading a file that is formatted reliably, you can use these specifiers safely.

This actually seems to be an inconsistency in the language spec, because it also says that the function returns the number of valid conversions (or EOF if an input failure occurs before the first conversion). It makes no sense to say that a conversion failure is undefined behavior and also say what it returns in that case, and most implementations return the value properly.

The man-page author is being overly pedantic in recommending against these specifiers, in my opinion.

What's doubly-bad about putting this in a man page? The authors of the man page are the authors of the implementation they're complaining about. The C standard does not preclude the glibc authors from defining the behavior of their own implementation.
– Andrew Henle
Commented Dec 4, 2023 at 19:04

They're complaining about "Use of the numeric conversion specifiers produces Undefined Behavior for invalid input." That's a "bug" glibc devs are free to fix - no one is stopping them from defining the behavior for their implementation. GCC, for example, has -fwrapv that defines the behavior of signed integer overflow - behavior that otherwise, were the "logic" of this man page followed - would "deprecate" all integer operations in C. "They could overflow and cause undefined behavior!!!" Would a compiler that "deprecates" every use of + between integral arguments be sane?
– Andrew Henle
Commented Dec 4, 2023 at 19:24

@AndrewHenle They're not complaining about an implementation. They say it's a bug in the ISO C specification, because of the inconsistency of saying that it's undefined and then saying that it returns the number of valid conversions.
– Barmar
Commented Dec 4, 2023 at 19:27

That's asinine. There's no contradiction in returning the number of valid conversions and "this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined". Glibc devs can implement what they want. Where's the contradiction in signed integer overflow results in undefined behavior and 6.5.6 Additive operators, paragraph 5's "The result of the binary + operator is the sum of the operands".
– Andrew Henle
Commented Dec 4, 2023 at 20:27

@AndrewHenle: I wouldn't be surprised if the intent of the man page is to warn that these aren't safely portable, even if glibc did check for overflow. (In practice I'm sure at worst the behaviour on integer overflow in glibc scanf is wrapping, since they compile to asm that has to work for non-overflowing cases, and the conversion loops are simple total = total*base + digit unless they check for overflow like in some other parts of glibc, such as in handling %12d conversions for printf. Parsing the 12 does check for overflow, making it unfortunately kinda slow for the common small case.)
– Peter Cordes
Commented Dec 5, 2023 at 7:17

Peter Cordes · Accepted Answer · 2023-12-06 04:04:08Z

This notice in the man page is for the benefit of people trying to write portable programs.
Since there has been speculation about what glibc itself does in this case, I decided to check.

The glibc source code actually avoids signed-overflow UB, at least in the conversion function scanf("%d") uses. At worst you could say the conversion result is undefined with glibc, but not the behaviour of the whole program. int on GNU systems doesn't have trap values (it's 2's complement) so this can't make your program crash or misbehave, other than perhaps not having a numeric value that matches what you might get from other ways of parsing the string. e.g. if your code looked at the last decimal digit as well as using sscanf to convert, you could have -1 even though the last decimal digit was even.

glibc sets errno to ERANGE after a scanf integer conversion that overflowed long or unsigned long, for conversions of long or narrower.
(%lld on a 32-bit system would only check for overflow of long long.)

I checked with this test program:

#include <stdio.h>

int main(){
        int tmp = 0xcccccccc;
        int conv_result = scanf("%d", &tmp);
        printf("successful conversions = %d,  result = %d = %#x\n",
                                       conv_result, tmp, (unsigned)tmp);
}

With input that fits in a long (64-bit on x86-64 GNU/Linux), we get that value truncated to int.
With larger input, glibc detects overflow and produces -1 (actually LONG_MIN or LONG_MAX according to the sign, in this case LONG_MAX which gets truncated to -1 when narrowing to int).

For example it converts 1111111111111111111111111111111 as -1, but 1111111111111111111 as 734294471 = 0x2bc471c7. See it on Godbolt with 2 executors that feed stdin with those inputs. It treats this as a successful conversion either way, scanf returning 1, e.g.

successful conversions = 1,  result = -1 = 0xffffffff

I used GDB to single-step into scanf with glibc 2.38-7 on my Arch GNU/Linux system (letting debuginfod fetch the library source code, very helpful). It eventually reached __strtol_l (https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#215) after a bunch of stdio overhead and copying characters one at a time into a tmp buffer, checking the base each time to see if it should be checking for hex or base-10 digits. Yikes, not efficient.

https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#466 is the actual part of that function which checks for overflow with something like total >= ULONG_MAX/10, and checks the trailing decimal digit of ULONG_MAX against the new digit being converted, before doing the total = total*base + digit.

// glibc/stdlib/strtol_l.c
INT
INTERNAL (__strtol_l) (const STRING_TYPE *nptr, STRING_TYPE **endptr,
               int base, int group, locale_t loc)
{
...
    if (c >= L_('0') && c <= L_('9'))
      c -= L_('0');
...  // check for grouping characters like ' if enabled
    else if (ISALPHA (c))
      c = TOUPPER (c) - L_('A') + 10;
    else
      break;

// my comments added:
// c is the new digit converted to integer in the [0,base) range
// i is the total to be returned
    if ((int) c >= base)
      break;
    /* Check for overflow.  */
    if (i > cutoff || (i == cutoff && c > cutlim))   // cutoff and cutlim were set from a lookup table according to base
      overflow = 1;
    else
      {
      use_long:             // goto label from a loop using narrower types, if LONG isn't the same size as long
        i *= (unsigned LONG int) base;
        i += c;
      }
    }

...
  if (__glibc_unlikely (overflow))
    {
      __set_errno (ERANGE);
#if UNSIGNED
      return STRTOL_ULONG_MAX;
#else
      return negative ? STRTOL_LONG_MIN : STRTOL_LONG_MAX;
#endif
    }
...

(Yes, the loop could skip overflowing digits and still process a later smaller digit, but the later code doesn't use i at all if overflow is set.)
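The cutoff/cutlim test in that excerpt can be isolated into a small standalone sketch. This is a simplified illustration and not glibc's code: the function name parse_ulong_dec is hypothetical, it handles only unsigned base-10 digits, and unlike glibc it stops at the first overflowing digit instead of consuming the rest of the number.

```c
#include <limits.h>

/* Hypothetical, simplified version of the overflow check:
 * base 10 only, no sign, no whitespace, no locale handling. */
static unsigned long parse_ulong_dec(const char *s, int *overflow)
{
    /* Largest total that can still be multiplied by 10 safely ... */
    const unsigned long cutoff = ULONG_MAX / 10;
    /* ... and the largest digit that may follow when total == cutoff. */
    const unsigned long cutlim = ULONG_MAX % 10;
    unsigned long total = 0;

    *overflow = 0;
    for (; *s >= '0' && *s <= '9'; s++) {
        unsigned long digit = (unsigned long)(*s - '0');

        /* Test before multiplying, so the arithmetic never wraps. */
        if (total > cutoff || (total == cutoff && digit > cutlim)) {
            *overflow = 1;
            break;
        }
        total = total * 10 + digit;
    }
    /* Saturate on overflow, as strtoul does. */
    return *overflow ? ULONG_MAX : total;
}
```

Because the comparison happens before `total * 10 + digit` is evaluated, the multiplication is only performed when its result is known to be representable.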

0x90 · Accepted Answer · 2023-12-21 18:58:16Z

We can all see that man7 does indeed list it as deprecated, but no-one here is answering the pertinent question that was asked of "why".

How come %d is deprecated? It seems that all int specifiers are deprecated.

The man pages describe the state of a current POSIX distribution. Thus each system may have its own set of man pages, and the documentation on one can differ from another. Ideally you'd consult your local man page with man sscanf. However, the online man pages, e.g. at man7, are convenient. But note that they're describing a system that isn't yours, or perhaps even an idealised system that doesn't exist.

You should always be wary about reading the man pages for a system that you aren't programming for, as they can document older or newer versions of the same interface.

In this instance, man7 is hosting the man pages as used by the Linux kernel team and the GNU libc team. This particular change, marking the sscanf integer specifiers as deprecated, was made in a15d34326c581eab10 a year ago and is contained in the released man-pages-6.02. The latest change, adding the BUGS note, was made in 1f9949d11f499e5758f7e21 and is contained in man-pages 6.03. Whether that change ends up in your distribution's man pages is another matter.

The discussion surrounding this is actually about ERANGE, and you can follow that in a few places, e.g.

Someone even asks the same question as OP. The response can be seen at From: Alejandro Colomar @ 2023-01-20 13:12 UTC. Some snippets:

Should it really be deprecated?

While the interface of sscanf(3) numeric conversions is not mis-designed and could be fixed, it is not correctly implemented, nor even standardized.

I think it's correct to deprecate unless there's a clear effort to fix it.

Is the undefined behavior here a real world issue anywhere, or is this just a theoretical issue based on interpretation of the C standard?

All implementations of sscanf(3) produce Undefined Behavior (UB), AFAIK. How much you consider UB to be a real-world issue differs for each programmer, but I tend to consider all UB to be as bad as nasal demons. I'm not saying UB shouldn't exist, just that you shouldn't invoke it. And a function that is used for scanning user input is one of those places where you really want to avoid invoking UB.

One common aspect of man page documentation is that they draw a distinction between the POSIX compatible interface and the interface as used by their system. Both are available on man7.org:

You'll notice the 3p version doesn't list %d as deprecated. Therefore %d is only deprecated on the systems documented by man7.org.

If you wish to stop using scanf (and sscanf, fscanf), then there's a handy guide available

Good point to distinguish Posix man pages from the system ones. I didn't even know the Posix ones exist. — Peter - Reinstate Monica, Commented Dec 6, 2023 at 20:17
Therefore %d is only deprecated on the systems documented - Sort of the reverse; I think they care more about people writing portable code that has to work on non-GNU systems. Glibc internally does avoid signed-overflow UB, and even sets errno = ERANGE for numbers that don't fit in a long or unsigned long. But for narrower conversions like %d instead of %ld, it truncates LONG_MAX or LONG_MIN to int on overflow, as shown in my answer. (But this is just the current behaviour, it's not documented to keep doing that; more useful might be saturating to narrow type limits.) — Peter Cordes, Commented Dec 7, 2023 at 0:26
"All implementations of sscanf(3) produce Undefined Behavior" is really not understanding Standardese. On almost all implementations, I expect Unspecified Behavior, if not Implementation-Defined Behavior. As Peter Cordes notes, glibc is one of the implementations that does not produce Undefined Behavior, so the original statement is factually wrong. — MSalters, Commented Dec 7, 2023 at 11:16

Jander · Accepted Answer · 2023-12-04 18:58:18Z

As pointed out in the comments (thanks to @JeffHolt, @Eugene-sh, @DanielWalker, @Barmar), the answer is indeed in the BUGS section:

BUGS
   Numeric conversion specifiers
       Use of the numeric conversion specifiers produces Undefined
       Behavior for invalid input.  See C11 7.21.6.2/10 
       ⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩.  This is
       a bug in the ISO C standard, and not an inherent design issue
       with the API.  However, current implementations are not safe from
       that bug, so it is not recommended to use them.  Instead,
       programs should use functions such as strtol(3) to parse numeric
       input.  This manual page deprecates use of the numeric conversion
       specifiers until they are fixed by ISO C.

I do agree that "deprecate" here means "express disapproval of" (as per @Barmar's comment).

John Bode · Accepted Answer · 2023-12-05 17:39:07Z

To echo everyone else, this use of "deprecated" is weird. They really mean "not recommended", not "no longer supported".

Here's the issue the author of the man page is complaining about:

Assume the code

int x;
printf( "Gimme a number: " );
if ( scanf( "%d", &x ) == 1 )
  do_something_with( x );
else
{
  // handle input error
}

and the input

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

This is a syntactically valid decimal integer constant:

6.4.4.1 Integer constants
...

integer-constant :
    decimal-constant integer-suffix_opt
    octal-constant integer-suffix_opt
    hexadecimal-constant integer-suffix_opt

decimal-constant :
    nonzero-digit
    decimal-constant digit

nonzero-digit: one of
    1 2 3 4 5 6 7 8 9

digit: one of
    0 1 2 3 4 5 6 7 8 9

and the scanf function will match the longest sequence of characters that satisfies the %d conversion:

7.21.6.2 The fscanf function
...
9 An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence.²⁸⁵⁾ The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.

No field width is specified, so that entire input will be converted and assigned to x and scanf will return 1 to indicate success; the problem is that input will overflow and result in undefined behavior.

Using a %d or %i or %o (or %s or pretty much any conversion specifier) without an explicit field width opens you up to accepting input that could lead to numeric overflow or worse.
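As a rough sketch of the field-width point: a width of 9 means at most nine characters (sign included) are converted, so the result always fits in an int, assuming int is at least 32 bits wide. The helper name scan_int_bounded is hypothetical, and %n is used only to report how much input was consumed.

```c
#include <stdio.h>

/* Hypothetical helper. Assumes int has at least 32 bits, so any
 * field of at most 9 characters (sign included) is representable. */
static int scan_int_bounded(const char *s, int *out, int *consumed)
{
    int n = 0;

    /* %9d converts at most 9 characters; %n records how many
     * characters of s have been read so far. */
    if (sscanf(s, "%9d%n", out, &n) != 1)
        return 0;

    *consumed = n;
    return 1;
}
```

The caller still has to decide what to do when the number was cut short, for instance by checking whether `s[consumed]` is another digit and rejecting the input in that case.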

This is one of those areas where C has no blade guards and will cut you if you aren't careful. The optional bounds-checking version (scanf_s) only makes sure none of the arguments are NULL; it doesn't check for numeric overflow.

*scanf is only really appropriate if you know your input is well-behaved. If you can't guarantee your input is well-behaved, then you shouldn't use *scanf at all; instead, use fgets to read input as text and perform some basic sanity checks for length and content before attempting to do any conversions.
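A minimal sketch of that fgets-based approach might look like the following. The helper name read_int_line, the 32-byte buffer, and the nine-digit limit are illustrative assumptions. The digit limit relies on int being at least 32 bits wide, which makes the final conversion incapable of overflowing.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one line from in and convert it to int.
 * Returns 1 on success, 0 if the line is rejected, EOF at end of input. */
static int read_int_line(FILE *in, int *out)
{
    char buf[32];

    if (fgets(buf, sizeof buf, in) == NULL)
        return EOF;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[--len] = '\0';              /* strip the newline */
    } else if (!feof(in)) {
        /* Length check: the line did not fit. Discard the rest of it. */
        int ch;
        while ((ch = getc(in)) != EOF && ch != '\n')
            ;
        return 0;
    }

    /* Content check: optional sign followed by 1 to 9 decimal digits. */
    size_t i = (buf[0] == '+' || buf[0] == '-') ? 1 : 0;
    size_t ndigits = len - i;
    if (ndigits < 1 || ndigits > 9)
        return 0;
    for (size_t k = i; k < len; k++)
        if (!isdigit((unsigned char)buf[k]))
            return 0;

    /* At most 9 digits, so the value fits in a 32-bit int. */
    *out = (int)strtol(buf, NULL, 10);
    return 1;
}
```

Rejecting over-long lines outright, rather than converting a prefix, keeps the behavior easy to reason about: a line is either a small, well-formed integer or it is an error.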

+1 Thank you for actually explaining what the other answers are talking about instead of just saying "it's undefined on bad input" like they are saying. I had no idea what "bad input" they were talking about. — Stev, Commented Dec 8, 2023 at 0:21
