### PHP and “string” comparisons

May. 20th, 2010 10:42 pm
PHP, among other problems, is a dynamically and (problematically) weakly typed language. What this means is that variables are cast, willy-nilly, to work in whatever fashion the programmer or the PHP interpreter feels is appropriate for the occasion. For example, a string `"1"`

is equivalent to the integer value `1`

. Or at least equivalent-ish.

The equality test operator, `==`

, is defined in PHP for strings as for other built-in types. However, as the official documentation states,

If you compare a number with a string or the comparison involves numerical strings, then each string is converted to a number and the comparison performed numerically. These rules also apply to the switch statement.

When a string is evaluated in a numeric context, the resulting value and type are determined as follows.

If the string does not contain any of the characters '.', 'e', or 'E' and the numeric value fits into integer type limits (as defined by PHP_INT_MAX), the string will be evaluated as an integer . In all other cases it will be evaluated as a float .

The value is given by the initial portion of the string . If the string starts with valid numeric data, this will be the value used. Otherwise, the value will be 0 (zero). Valid numeric data is an optional sign, followed by one or more digits (optionally containing a decimal point), followed by an optional exponent. The exponent is an 'e' or 'E' followed by one or more digits.

In the typical PHP context, where scripts are expected to deal with form input and so forth, this seems to make a lot of sense—everything arrives as string data, but the string `"123"`

clearly encodes a number. Well, if it all worked properly, maybe it wouldn’t be so bad. But note that little subtlety above, that you might not expect if you hadn’t either seen it or read it in the docs: *If the string starts with valid numeric data, this will be the value used. Otherwise, the value will be 0 (zero).* This means that the following are all true:

"1" == 1
"a" != 1
"a" == 0

Yes—because any string that *isn’t* a number gets converted to zero, this is what you get. I saw this cause a nasty bug only today. (Personally, I prefer `strcmp()`

*et al* for string comparisons. It’s clunky, but at least I know what it does, in *all* cases…I think. This is PHP, so one can never be quite sure.)

Another subtle consequence of the (accurate) definition from the documentation: *If you compare a number with a string or the comparison involves numerical strings…*, then it performs a numeric conversion. Thus, if at least one operand is an integer, the comparison is numeric; if both operands are numeric strings, the comparison is numeric; if both operands are strings, but only *one* of them is numeric, then it’s a regular string comparison. This makes sense…sort of…but combined with the 0 quirk above, this means that *equality in PHP is not always transitive!*

"a" == 0 // true
0 == "0" // true
"a" == "0" // false!

Normally you expect equality to be transitive, that is, if `A` = `B` and `B` = `C`, then *obviously* `A` = `C`. In PHP, though, this is not necessarily true: `"a"`

= 0 and 0 =`"0"`

, but `"a"`

≠`"0"`

! This follows the specification presented by the PHP documentation (the last comparison fails because both operands are strings but only one of them is numeric, so the comparison is lexical), but it doesn’t make much mathematical or common sense.

In fact, since a given binary relation ~ on a set A is said to be an equivalence relation if and only if it is reflexive, symmetric and transitive

[Wikipedia], the “Equal” operator `==`

in PHP is not, in fact, a valid equivalence relation at all.

This is not the only problem, however. A different problem—and one that, unlike the 0-comparisons above, I do not find mentioned or justified in the documentation, is that *integers are parsed differently by the regular parser and the string conversion parser*. This baffles me; not only is it stupid and weird, but it’s also strange that they don’t just reuse the same routines. The problem is introduced by the fact that PHP, like many other languages, accept integer literals in base 10 (decimal), base 16 (hexadecimal), and base 8 (octal). Octal integer literals are denoted by prefixing them with a `0`

, thus `01`

means “1 in octal notation” (which equals 1 in decimal notation), `010`

means “10 in octal notation” (which equals 9 in decimal notation), and `09`

is invalid—it means “9 in octal notation”, which makes no sense.

Well, it turns out that for reasons best known to the PHP developers themselves, the automatic conversion of strings to numbers in PHP is handled by something analogous to the C library function `strtod()`

, whose input format is described as such:

The expected form of the (initial portion of the) string is optional leading white space as recognized by isspace(3), an optional plus ('+') or minus sign ('-') and then either (i) a decimal number, or (ii) a hexadecimal number, or (iii) an infinity, or (iv) a NAN (not-a-number).

In other words, integer *literals* in PHP accept octal notation, but automatic conversions of strings to integers do not. Thus,

01 == 1 // true, fine
"01" == 1 // true, fine
010 == 10 // false, fine -- it's equal to 8
"010" == 10 // true! The conversion assumes decimal notation
"010" == 010 // false!

This also means that casting a string `$s`

to an integer, `$x = (int)$s`

, is not equivalent to `eval`

ing it, `eval("\$x = {$s}")`

.

On a side note, octal numbers are handled in a pretty weird way to begin with. As the documentation warns you,

If an invalid digit is given in an octal integer (i.e. 8 or 9), the rest of the number is ignored.

Thus,

"09" == 0 // true
"09" == 09 // false; recall that "09" is decimal

This form of behaviour is why I dislike PHP so intensely. As the Zen of Python reminds us, Explicit is better than implicit

, and Errors should never pass silently

(Unless explicitly silenced

). A language that silently squashes errors and returns 0 or `null`

or some similar “empty-ish” value instead of warning you that something went wrong is a language that is *not* engineered to help you discover your errors, a language that would rather let you produce incorrect output than crash. (Crashing is way better than incorrect output. At least you *know* something is wrong. Silent logic errors kill people and crash space probes.)

Keep in mind when you code that in general you *don’t know* whether some string may be numeric or not—if it’s input (direct user input, data from a database, what have you), then the string *might happen to be* numeric, and you won’t know unless you check (e.g. with `is_numeric()`

).

If you can’t get away from PHP (always an attractive option), I suggest that you stick with `strcmp()`

and its relatives (`strncmp()`

, `strcasecmp()`

, and so on) if you want to compare strings, and *explicit* casts to integers (or floats), *with validation* (cf. `is_numeric()`

), if you want to compare numbers. The bugs that are likely to arise from the inconsistencies above may be rare, but they can be subtle and they can be damnably annoying.

For the sake of completeness, the script that I used to discover and verify the above:

<?php
function run_test($test_string) {
eval("\$result = ($test_string) ? 'true' : 'false';");
echo "$test_string => $result\n";
}
$tests = array(
' "1" == 1 ',
' "a" == 1 ',
' "a" == 0 ',
' "a" == "0" ',
' "0" == 0 ',
' "01" == 1 ',
' "010" == 10 ',
' 010 == 10 ',
' "010" == 010 ',
' "0x10" == 0x10 ',
' "09" == 09 ',
' "0" == 09 ',
' "0" == "09" ',
' "a" == 1e-1000 ',
' 1e-1000 == "1e-1000"',
'"1e-1000" == "0" ',
'"1e-1000" == "a" ',
);
foreach ($tests as $test) {
run_test($test);
}
$s = "010";
echo "\"$s\" == (int)\"$s\" ? " . ($s==(int)$s ? 'yes' : 'no') . "\n";
eval("\$x = {$s};");
echo "\"$s\" == $s ? " . ($s==$x ? 'yes' : 'no') . "\n";