CMS MADE SIMPLE FORGE

CMS Made Simple Core

 

[#11775] Encoding failure ('\xC3') when CMSMS runs on Windows (with solution proposal)

Created By: James Searby (jsearby)
Date Submitted: Wed Mar 21 07:09:00 -0400 2018

Assigned To:
Version: 2.2.7
CMSMS Version: 2.2.7
Severity: Major
Resolution: Fixed
State: Closed
Summary:
Encoding failure ('\xC3') when CMSMS runs on Windows (with solution proposal)
Detailed Description:
When CMSMS runs on a Windows server, submitting page content that contains "juq'à" fails with:
Incorrect string value: '\xC3' for column 'word' at row 1

Reason:
When the CMSMS PHP search module indexes words, it uses
- html_entity_decode, which produces UTF-8 output (the default since PHP 5.4.0)
and
- preg_split, which without the /u modifier uses the default server encoding
(so not UTF-8 when running on Windows)

As a consequence, a string like
x6A x75 x71 x27 xC3 xA0    (juq'à in UTF-8, as output by html_entity_decode)
cannot be split correctly by preg_split on Windows: the two-byte sequence
xC3 xA0 appears to be cut apart, leaving the lone xC3 named in the error.
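
The failure can be approximated on any platform with a short sketch. It rests on an assumption that is not stated in this report: that the Windows single-byte locale treats byte 0xA0 as whitespace. The sketch emulates that by listing \xA0 explicitly in the character class. The sample phrase and variable names are illustrative only.

```php
<?php
// Decode the entity to UTF-8, as the search module does before splitting.
$phrase = html_entity_decode("juq'&agrave; test", ENT_QUOTES, 'UTF-8');
$hex = bin2hex($phrase); // raw bytes of the decoded phrase, in hex

// ASSUMPTION: emulate a single-byte locale in which byte 0xA0 counts as
// whitespace by adding \xA0 to the class. No /u modifier, so PCRE sees bytes.
$words = preg_split('/[\s\xA0,!.;:\?()+-\/\\\\]+/', $phrase);

// The first "word" now ends in a lone 0xC3 byte, which is not valid UTF-8.
// preg_match('//u', ...) returns 1 only for a valid UTF-8 subject.
$firstIsValidUtf8 = (preg_match('//u', $words[0]) === 1);

echo $hex, "\n";
echo bin2hex($words[0]), "\n";
var_dump($firstIsValidUtf8);
```

A word that ends in a lone xC3 is exactly the kind of value a UTF-8 column would reject with "Incorrect string value".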

Resolution:
In search.tools.php change the following
$words = preg_split('/[\s,!.;:\?()+-\/\\\\]+/', $phrase);
==>
$words = preg_split('/[\s,!.;:\?()+-\/\\\\]+/u', $phrase);

This way we force preg_split to treat the pattern and the subject as UTF-8 as well.
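
As a rough check of the proposed change, the sketch below runs the patched pattern on the bytes from this report. The sample phrase and variable names are illustrative and not taken from search.tools.php.

```php
<?php
// "juq'à" followed by another word, written as explicit UTF-8 bytes.
$phrase = "juq'\xC3\xA0 test";

// Patched pattern: /u makes PCRE read the subject as UTF-8 code points,
// so the two bytes of "à" are handled as one character.
$words = preg_split('/[\s,!.;:\?()+-\/\\\\]+/u', $phrase);

// With /u, a subject that is not valid UTF-8 makes preg_split() return false,
// so a caller may want to guard against that before looping over the result.
$invalid = preg_split('/[\s,!.;:\?()+-\/\\\\]+/u', "juq'\xC3");

var_dump($words);
var_dump($invalid);
```

One design point to keep in mind: because /u rejects malformed input instead of splitting it, code that iterates over the result without checking for false could behave differently on content that is not valid UTF-8.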

Fix risk assessment:
- This fix will not change the behaviour on Linux where UTF8 is already the
default for preg_split
- This fix will not alter behaviour for any PHP >= 5.4.0 as

=> I would also propose adding an install-time check that default_charset is
set to UTF-8.
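
A minimal sketch of what such a check might look like follows. The helper name is_utf8_charset is hypothetical and not part of CMSMS, and how the installer would report the result is left open.

```php
<?php
// Hypothetical helper (not part of CMSMS): true when a default_charset
// value names UTF-8, accepting spellings such as "UTF-8", "utf-8", "utf8".
function is_utf8_charset(string $charset): bool
{
    $normalized = strtolower(str_replace('-', '', trim($charset)));
    return $normalized === 'utf8';
}

// ini_get() returns false for an unknown option, hence the string cast.
$current = (string) ini_get('default_charset');
if (!is_utf8_charset($current)) {
    echo "Warning: default_charset is '{$current}', expected UTF-8\n";
}
```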

Testing:
I've run quite a few tests locally and remotely, and the change seems to solve the problem.

Thanks
James


History

Comments
Date: 2018-03-21 15:15
Posted By: Robert Campbell (calguy1000)

Fixed in svn for the next version of 2.2.x and for 2.3.
The 2.3 series will also perform tests on the default_charset ini setting.

Thanks for identifying the issue and providing a clear and precise report.
      
Updates

Updated: 2018-07-30 17:53
state: Open => Closed

Updated: 2018-03-21 15:15
resolution_id: => 7