Converting OpenPSA database to UTF8
The story of converting (large) OpenPSA installation to UTF-8 (starting from ISO-LATIN-1)
Format is more of a log of activities with comments, hopefully I'll have extra time to edit it to nicer HOWTO someday, certain parts apply also to converting plain MidCOM sites (that do not have serialized data) if they use custom schemas with text in attachments
- Start testing and refining the process
- Create a mysql dump (dump.sql) of the database to convert
- Copy blobs to a testing blobdir
- Configure apache testing host (remember to set MidgardParser to russian)
- use iconv to convert the mysql dump
- iconv -o dump.sql.utf8 -f iso-8859-1 -t utf-8 dump.sql
- Create a script to search and clean serialize strings (with slight modifications [remove some escapes] this can be used with repligard dumps as well), see script1
- After some iterations the script seems to work as it should and produces dump.sql.utf8.cleaned
- Import (with mysql) the dump.sql.utf8.cleaned to a separate db
- Rename some OpenPSA host to match the testing host
- Start testing for the first time (use OpenPSA as it contains most serialized strings)
- Serialization fixer seems to work fabulously, no serialization errors found
- Found problems with nemeinloc, since the strings are stored in db base64 encoded iconv can't touch them
- Create functions for converting nemeinloc libraries between charsets (on OpenPSA devel server
- Repligard import the new nemeinloc to get the conversions functions
- Alternatively you can use the 1.10.1 OpenPSA release
- Run script2 to convert the localizations
- If you installed the 1.10.1 OpenPSA, only convert the Aegir libraries
- Testing with OpenPSA again
- Looks fine, start migration for real
- Migrate for real
- Disabled modifications by commenting out midgard-data.conf (and put up "under maintenance" banner to global documentroot)
- Alternatively you can for example disable all logins by setting username to "" in mysql (make sure you have a backup to retvieve them from afterwards)
- Drop test db (do not delete blobs)
- Create real UTF8 db
- Repeat the dump,convert,import,fix localization process for real DB
- Alternatively you can just rename the testing blobdir and cp -aux to update
- Backup midgard-data.conf
- Change MidgardParser and MidgardDatabase directives in midgard-data.conf for new UTF8 db
- Uncommented midgard-data.conf to get sites visible
- If you're doing the conversion you should propably convert the text attachments (see below) first
- Testing with real sites
- Found out that some MidCOM schemas are attachments for storing text
- Convert thext attachments to UTF8
- Get a list of suspect attachments' locations
- mysql> select location from blobs where name like "data_%";
- Run search/replace to get a file like textfile1
- Use script3 to run iconv for each suspect attachment
- More testing
- Found out that the charset in Content-Type header seems not to be correct
- Add AddDefaultCharSet UTF-8 directives to midgard-data.conf, either to each VirtualHost or on global level
- Even more testing
- Noticed some sites have hardcoded charset in <meta http-equiv> tags in style or in site settings (/midcom-admin/site-settings/), these need to be manually fixed
- Recommend that you use latest release FS-MidCOM have auto-detection of charset for sites that do not have any encodings set (or they will have to set encoding to utf-8 in /midcom-admin/settings)
- Kaukola made some changes HTMLArea (or its configuration), I'll pester him for details
- All done in "three easy steps".
Special thanks to Kaukola and Bergie for handling most of the testing and thus saving me a lot of time, also Juhana was present for moral support though he spent most of time installing Ubuntu Linux on his laptop.
Scripts/examples
Script1
serialize_cleaner.php (PHP4-CLI script)
#!/usr/bin/php
<?php
ini_set('max_execution_time', 0); //Unlimited
ini_set('memory_limit', '350M'); //This cannot be 0 for unlimited, set to about 3 times the input file size
$file=$argv[1];
$fp=fopen($file, 'r');
if (!$fp) exit();
while (!feof($fp)) {
$data.=fread($fp, 4096);
}
fclose($fp);
//echo "DEBUG: strlen(data)=".strlen($data)." filesize($file)=".filesize($file)."\n";
$preg='/s:([0-9]+):\\\"(.*?)\\\";/';
//echo "DEBUG: preg=$preg\n";
preg_match_all($preg, $data, $matches);
/*
print_r($matches);
reset($matches);
*/
while (list ($k, $origFullStr) = each ($matches[0])) {
$origLen=$matches[1][$k];
$origStr=$matches[2][$k];
$newLen=strlen($origStr);
//echo "DEBUG: origFullStr=$origFullStr, origLen=$origLen, newLen=$newLen \n";
if ($newLen!=$origLen) {
$newFullStr="s:$newLen:\\\"$origStr\\\";";
echo "DEBUG: string key $k has changed lenght (orig: $origLen, new: $newLen)\n";
echo " ** O: $origFullStr \n";
echo " ** R: $newFullStr \n";
if (!$cache[$origFullStr]) { //For performance we cache information on which strings have already been replaced
$data=str_replace($origFullStr, $newFullStr, $data);
$cache[$origFullStr]=TRUE;
}
}
}
echo "DEBUG: Writing $file.new \n";
$fp2=fopen($file.".new", 'w');
fwrite($fp2, $data);
fclose($fp2);
echo " ** Done \n";
?>
Script2
run in Midgard console (found under OpenPSA Manager)
<?php
mgd_include_snippet("NemeinLocalization/init");
//Convert all nemeinloc registered libraries
while (list ($k, $libPath) = each ($localization_libraries)) {
nnloc_convert_charset_lib($libPath, 'iso-8859-1', 'utf-8'); //Comment out if using OpenPSA 1.10.1
}
//Convert double-converted (we got UTF-8 versions in localization import) libraries back
nnloc_convert_charset_lib('/NemeinLocalization/Libraries/Errors', 'utf-8', 'iso-8859-1'); //Comment out if using OpenPSA 1.10.1
nnloc_convert_charset_lib('/NemeinLocalization/Libraries/Generic', 'utf-8', 'iso-8859-1'); //Comment out if using OpenPSA 1.10.1
//Convert Aegir (not-registered) libraries
nnloc_convert_charset_lib('/AegirCore/Localization/AegirCore', 'iso-8859-1', 'utf-8');
nnloc_convert_charset_lib('/AegirAddOn_Gallery/Localization', 'iso-8859-1', 'utf-8');
nnloc_convert_charset_lib('/AegirAddOn_Schedule/Localization', 'iso-8859-1', 'utf-8');
nnloc_convert_charset_lib('/AegirAddOn_Webstats/Localization', 'iso-8859-1', 'utf-8');
//Add calls for any other AegirAddOns you have installed
?>
Script3
text_blobs_convert.php (PHP4-CLI script)
#!/usr/bin/php
<?php
$fp=fopen($argv[1], 'r');
while (!feof($fp)) {
$line=ltrim(rtrim(fgets($fp, 4096)));
//echo "DEBUG: line=\"$line\"\n";
if (!$line) continue;
$cmd1="cp $line $line.backup";
$cmd2="iconv -o $line.utf8 -f iso-8859-1 -t utf-8 $line";
$cmd3="mv -f $line.utf8 $line";
// echo "DEBUG: cmd1=\"$cmd1\"\n ** cmd2=\"$cmd2\" \n ** cmd3=\"$cmd3\" \n";
exec($cmd1, $out1, $ret1);
exec($cmd2, $out2, $ret2);
exec($cmd3, $out3, $ret3);
echo "DEBUG: ret1=$ret1, ret2=$ret2, ret3=$ret3\n";
}
?>
Textfile1
example attachment_locations.txt file
/var/www/blobs_utf8/6/3/caf1e259538b11a900e12a3a3f43e54e /var/www/blobs_utf8/6/4/de7ea7b9cc6328822dd094911c3168f7 /var/www/blobs_utf8/3/4/46b5e42e037f2ff766816d769d859be9 /var/www/blobs_utf8/3/3/3dd6ae80d71c4c949579eec6f236ca95