Why using PHP sessions causes invalid HTML and XHTML to be generated, and how to fix it.
This document is an article contributed to the QA Interest Group. Feedback, suggestions and corrections are welcome, and should be sent to the publicly archived mailing-list www-qa.
In HTML (and XHTML, along with other SGML and XML applications)
certain characters have special meaning, a prime example being <,
which indicates the beginning of a tag. Such characters cannot be
simply typed into a document if you wish them to display - otherwise
how could the user agent tell the difference between
b<a
(meaning b is less than a) and
b<a
(meaning b followed by the start of an
anchor)?
In order to display reserved characters, HTML and XHTML provide a mechanism called character references. The syntax of these is an ampersand, followed by a code that identifies the character, followed by a semicolon. For example, the "less than" character is represented as &lt;.
Giving the ampersand special meaning makes it, like <, a
reserved character, so it also needs to be represented by a character
reference for it to be used in a document: &amp;
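As a small illustration, PHP's built-in htmlspecialchars function performs this replacement for you. The sketch below uses a made-up string, $text, which is an assumption for the sake of the example.

```php
<?php
// Hypothetical text that contains two reserved characters.
$text = 'b<a & c';

// htmlspecialchars replaces reserved characters with character references.
$escaped = htmlspecialchars($text);

echo $escaped, "\n"; // prints: b&lt;a &amp; c
```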
Now for a small confession - there are exceptions to these rules, although they are not relevant when dealing with the issues caused by PHP sessions.
HTML and XHTML include blocks of what is called CDATA, where HTML
special characters no longer have special meaning. Inside such blocks
character references are no longer processed, so an ampersand must be
typed as an ampersand, and not as its character reference. In HTML,
the content of <script> and <style> elements is CDATA, while in XHTML
CDATA sections have to be marked explicitly. You can avoid the problem
by placing scripts and style sheets in separate files and referencing
them with <link> and <script src="…">.
The other exceptions are that sometimes the semi-colon is optional, and sometimes ampersands can be represented without being encoded as entities. In these situations it is never wrong to represent the character as a character reference terminated by a semicolon, so I won't go into more detail.
PHP has session handling code built in. This enables data to be stored on the server while remaining associated with a specific user (for, roughly, a single visit to the site).
To link the data with a user, the website has to hand the user agent a token which identifies it. This token is stored in a cookie, but not all user agents support cookies, and most of those which do allow them to be turned off.
PHP provides a fallback mechanism. If it discovers that cookies are not accepted by the client, it rewrites every link on the page to include that token in a query string. I believe this used to be enabled by default, but testing shows that, at least for the Fedora package of PHP 4.3.11 (Fedora release 2.4 of that package), it isn't. It can be turned on by setting the session.use_trans_sid directive.
This is, in theory, a pretty elegant solution to the problem (discounting the issues of the token hanging around for third parties to harvest from public computers, bookmarking, link sharing, etc.), but the implementation is flawed.
For links with no query string, there isn't a problem: PHP appends
?PHPSESSID= followed by a random hexadecimal number. For links that
do have a query string, PHP appends &PHPSESSID= followed by the token.
Ampersand characters used as argument separators pose no problem in plain old URLs. In URLs encoded in HTML, however, they still mean "start of character reference" (subject to the aforementioned exceptions, for which the above example does not qualify).
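The difference between the plain URL and the same URL written into a page can be sketched as follows. The token value, the page name and the variable names are all hypothetical; the point is only that the ampersand has to become a character reference once the URL is placed in an attribute.

```php
<?php
// Hypothetical session token and page URL, for illustration only.
$token = 'abc123';
$url = 'page.php?item=42&PHPSESSID=' . $token;

// As a plain URL the ampersand is fine, but inside an HTML attribute
// it must be written as the character reference &amp;.
$href = htmlspecialchars($url, ENT_QUOTES);

echo '<a href="' . $href . '">Next page</a>', "\n";
// prints: <a href="page.php?item=42&amp;PHPSESSID=abc123">Next page</a>
```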
Most users won't notice a problem, because the majority of user agents are rather good at working around mistakes by authors. However, that does not mean authors should ignore the problem.
The character that PHP uses to separate arguments is configurable with the arg_separator.output directive. This can be set in a number of ways and is the solution suggested in the PHP manual.
The php.ini file contains the central configuration data for an install of PHP on a computer. You can specify a character reference to use there.
arg_separator.output = "&amp;"
The Apache web server can set PHP directives in all the usual places. This allows different directives to be set on a per site or per directory basis (in, for example, a <Location> block or .htaccess file).
php_value arg_separator.output &amp;
PHP configuration directives can be set on a per script basis with the ini_set function. Put the code to set the directives at the top of your script.
<?php ini_set('arg_separator.output','&amp;'); ?>
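One way to see the directive at work without starting a session is through http_build_query, which falls back on arg_separator.output when it is not given a separator of its own. This is a minimal sketch; the argument names and the list.php page are assumptions made for the example.

```php
<?php
// Set the separator for this script only.
ini_set('arg_separator.output', '&amp;');

// Hypothetical query arguments, for illustration only.
$args = array('item' => 42, 'page' => 2);

// http_build_query uses arg_separator.output when no separator is passed.
$query = http_build_query($args);

echo '<a href="list.php?' . $query . '">Next</a>', "\n";
// prints: <a href="list.php?item=42&amp;page=2">Next</a>
```

Bear in mind that a query string built this way is already encoded for HTML, so it is probably not suitable for places that expect a plain URL, such as a Location header.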
Since the ampersand character has special meaning in HTML, the specification suggests that query string parsers allow the use of a semicolon as an argument separator. PHP comes preconfigured to accept this, so you can alter the output code to use a semicolon instead of an ampersand using the same techniques.
arg_separator.output = ";"
php_value arg_separator.output ;
<?php ini_set('arg_separator.output',';'); ?>
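Since configurations vary between installations, it may be worth confirming that yours does accept the semicolon before switching. The sketch below reads the arg_separator.input directive, which lists the characters PHP treats as separators in incoming URLs; the variable names are made up for the example.

```php
<?php
// Read the list of characters PHP accepts as separators in incoming URLs.
$accepted = (string) ini_get('arg_separator.input');

// True only if the semicolon is among them.
$semicolonAccepted = strpos($accepted, ';') !== false;

echo $semicolonAccepted
    ? "Semicolons are accepted as argument separators.\n"
    : "Only these separators are accepted: " . $accepted . "\n";
```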
The final option is to switch off the URL rewriting fallback altogether. This has a number of advantages from a security point of view, as it reduces the chance of the session token leaking to third parties. As a side effect it will render your session code useless for visitors who disable, block or otherwise do not support cookies (this has accessibility implications).
session.use_trans_sid = 0
php_value session.use_trans_sid 0
Whether this directive can be set on a per script basis depends on which version of PHP you are using. Where it can be set, the syntax is as follows:
<?php ini_set('session.use_trans_sid','0'); ?>
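Because the run-time setting is not guaranteed to take effect, a script can check the result rather than assume it. The following is a sketch under the assumption that it runs before the session is started; the log message and variable names are illustrative only.

```php
<?php
// Ask PHP to stop rewriting links; ini_set returns false if it could not.
$previous = ini_set('session.use_trans_sid', '0');

if ($previous === false) {
    // Hypothetical fallback: note the failure, then set the directive in
    // php.ini or the Apache configuration instead.
    error_log('session.use_trans_sid could not be changed at run time');
}

// Report the value that is actually in effect.
$rewriting = (bool) ini_get('session.use_trans_sid');
echo $rewriting ? "URL rewriting is on\n" : "URL rewriting is off\n";
```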