2009-10-13

Picky XML Interpreter and a Solution for Encoding Custom Node Names

Writing and Reading XML:

I have been working a lot with Adobe's XML object in ExtendScript. It has some nice tools, but the biggest problem I've had is creating XML with custom node names.
For example, I needed to write some Object Style names, not as data, but as the names of the nodes. This is because I needed to write some data that applied to the Object Styles, so I wanted each XML node name to match the name of its Object Style.
That is fine if the name of the object style is something simple like 'Headline', but add a space as in 'Headline Frame' and suddenly there are problems. You can write the XML file because that is just text, and you can read the text file, but as soon as you ask ExtendScript to make a new XML object it will complain about the 'token'. That makes sense because an XML node name can't contain a space. Nor can it contain < > & or %. Fine. So I wrote a function to encode those things. See entityReference.
That works for writing and rereading the file, and you can verify that the file is encoded. But when the file is read, those items are automatically converted back into the normal strings, and Adobe's XML parser still balks at the character.

Plan E: It at least works
After several other attempts, I finally settled on the encoding used in the two functions below. They encode every special character (everything except the alphanumerics and the underscore, which is what \w matches) as U_ followed by a 4-digit hexadecimal Unicode character number. For example, an — (em dash) is:
U_2014
I would have liked to use the more traditional 0x followed by the Unicode number, as in:
0x2014
but that places a digit (zero) at the start of the code. If the character needing to be encoded is the first character, then the XML node name would begin with 0 and, once again, Adobe's ExtendScript XML parser balks. So I settled on U_ even though it is nonstandard.
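
The leading-digit problem described above can be illustrated with a small check. This is only a sketch and not part of the original functions: `looksLikeSafeNodeName` is a hypothetical helper, and its pattern is a deliberately conservative ASCII subset (a letter or underscore first, then word characters), which is narrower than what the XML specification actually allows.

```javascript
// Hypothetical helper: conservative, ASCII-only test for a usable node name.
// Assumption: the first character must be a letter or an underscore, and the
// rest may be letters, digits, or underscores. Real XML permits more.
function looksLikeSafeNodeName ( name ) {
    return /^[A-Za-z_][A-Za-z0-9_]*$/.test ( name ) ;
}

var uPrefixOk = looksLikeSafeNodeName ( 'U_2014' ) ;           // true
var zeroXOk   = looksLikeSafeNodeName ( '0x2014' ) ;           // false: starts with a digit
var spaceOk   = looksLikeSafeNodeName ( 'Headline Frame' ) ;   // false: contains a space
```

Note that the same check also rejects any name that simply begins with a digit, so a style name such as '1 Column' would probably still need extra handling even after encoding.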

Built-In Encoding Options
ExtendScript contains 3 pairs of encoding / decoding functions, but all three will trigger a complaint from the XML parser, because each one writes its escapes with a %, which is itself not allowed in a node name.
escape (aString) <--> unescape (stringExpression)
encodeURI (text) <--> decodeURI (uri)
encodeURIComponent (text) <--> decodeURIComponent (uri)
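
To see why none of the built-in pairs help, here is a short sketch that runs each encoder on a sample style name. The variable names are only for illustration; the three functions are the standard JavaScript globals listed above.

```javascript
// Sample style name containing one space.
var styleName = 'Headline Frame' ;

// Each built-in encoder turns the space into a percent escape.
var viaEscape             = escape ( styleName ) ;               // 'Headline%20Frame'
var viaEncodeURI          = encodeURI ( styleName ) ;            // 'Headline%20Frame'
var viaEncodeURIComponent = encodeURIComponent ( styleName ) ;   // 'Headline%20Frame'

// The space is gone, but the % that replaces it is just as unusable
// in a node name.
var stillHasPercent = viaEscape.indexOf ( '%' ) !== -1 ;         // true
```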

Using the epsEntitify functions:
If you need to write non-standard node names to an XML file, just run the first function on the node names before creating them and writing the file. Here is the result of one node that includes two spaces:

Quote 1 Quote
becomes
QuoteU_00201U_0020Quote
It is at least marginally readable if you want to read the XML file itself.

Then when you read the XML, use the second function to decode the node names. It works. The decoder returns very quickly if there is nothing to decode.

//
function epsEntitify ( str ) {
    //-------------------------------------------------------------------------
    //-- E P S E N T I T I F Y
    //-------------------------------------------------------------------------
    //-- Generic: Yes for ExtendScript.
    //-------------------------------------------------------------------------
    //-- Purpose: To replace the XML Reserved Characters in a passed string
    //-- with custom values based upon hexadecimal versions of their
    //-- Unicode values and with a U_ prefix.
    //-------------------------------------------------------------------------
    //-- Arguments: A string to clean up.
    //-------------------------------------------------------------------------
    //-- Calls: pad() to pad the hexadecimal value to 4 digits.
    //-------------------------------------------------------------------------
    //-- Returns: a string with all non-word characters replaced with their
    //-- hexadecimal value in a Unicode format such as 'U_0020' for a space.
    //-------------------------------------------------------------------------
    //-- Sample Use:
    //~ var unfitForXML = '<> close % percent' ;
    //~ var safeForXML = epsEntitify ( unfitForXML ) ;
    //-------------------------------------------------------------------------
    //-- Notes: Using the .toString(16) method to convert to a hexadecimal value
    //-------------------------------------------------------------------------
    //-- Written: 2009.10.12 by Jon S. Winters of electronic publishing support
    //-- eps@electronicpublishingsupport.com
    //-------------------------------------------------------------------------
    //-- Create a regular expression pattern for acceptable characters
    //-- ( \w matches letters, digits, and the underscore )
    var AlphaNumeric = new RegExp ('\\w');
    //-- Create a return array ( it will be converted to a string at the end )
    var eString = new Array (str.length) ;
    //-- Loop through every character.
    for ( var si = str.length - 1 ; si >= 0 ; si-- ) {
        //-- Get a reference to the indexed character
        var activeCharacter = str.charAt ( si ) ;
        //-- If that character is included in the regular expression
        //-- pattern then add it to the return array
        if ( AlphaNumeric.test(activeCharacter) ) {
            eString[si] = activeCharacter ;
        }
        else {
            //-- It isn't an allowed character, convert it to a hexadecimal
            //-- value. This uses a special feature of the built-in
            //-- .toString() method: passing 16 returns the number in base 16
            eString[si] = 'U_' + pad ( str.charCodeAt ( si ).toString(16) , 4 , '0' ) ;
        }
    }
    //-- Convert the array to a string and send it back.
    return eString.join ('') ;
}
//
//
function epsUnEntitify ( str ) {
    //-------------------------------------------------------------------------
    //-- E P S U N E N T I T I F Y
    //-------------------------------------------------------------------------
    //-- Generic: Yes, but has a very specific purpose.
    //-------------------------------------------------------------------------
    //-- Purpose: To take a string that has been processed with the
    //-- epsEntitify() function and return it to its original values.
    //-- The pair was written to encode XML files in ExtendScript
    //-------------------------------------------------------------------------
    //-- Arguments: str: the string to decode
    //-------------------------------------------------------------------------
    //-- Calls: Nothing.
    //-------------------------------------------------------------------------
    //-- Returns: The string decoded.
    //-------------------------------------------------------------------------
    //-- Written: 2009.10.13 by Jon S. Winters of electronic publishing support
    //-- eps@electronicpublishingsupport.com
    //-------------------------------------------------------------------------
    //-- Create the custom pattern to find the special strings used by the
    //-- epsEntitify function. Note the parentheses, which are used for
    //-- a backreference in the .exec() method later. Only lowercase hex
    //-- digits are matched, which is what .toString(16) produces.
    var p = new RegExp ( 'U_([0-9a-f][0-9a-f][0-9a-f][0-9a-f])' , 'gm' ) ;
    //-- Loop through every match of that pattern
    //-- Using the .test() method which returns true only if the
    //-- passed string has a match.
    while ( p.test ( str ) ) {
        //-- Reset the pointer because the string gets
        //-- shorter each time
        p.lastIndex = 0 ;
        //-- Use the .exec() method to determine the original string
        //-- and the backreference.
        //-- The result will have at least 2 values. [0] is the
        //-- original string, and [1] is the backreference
        var r = p.exec ( str ) ;
        //-- Convert the backreference into a base 10 number and
        //-- then create a string using that character number.
        var origChar = String.fromCharCode ( parseInt ( r [1] , 16) ) ;
        //-- Use a basic search / replace to replace the base string
        //-- with the original character.
        //-- By using a regular expression and the 'gm' flags this
        //-- can replace multiple matches at the same time.
        str = str.replace ( new RegExp ( r [0] , 'gm' ) , origChar ) ;
    }
    return str ;
}
//
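
The listing above calls pad(), which is not shown in this post. The following is a minimal sketch of how the pair could be exercised outside InDesign, under some stated assumptions: `pad` here is a hypothetical stand-in with the same three-argument call shape, and `encodeNodeName` / `decodeNodeName` are compact, hypothetical equivalents of epsEntitify() and epsUnEntitify(), not the original functions. The decoder uses a function as the replacement argument, which standard JavaScript supports; whether a particular ExtendScript version accepts that has not been verified here.

```javascript
// Hypothetical stand-in for the pad() helper that epsEntitify() calls.
function pad ( value , length , padChar ) {
    var result = String ( value ) ;
    while ( result.length < length ) {
        result = padChar + result ;
    }
    return result ;
}

// Same rule as epsEntitify(): keep \w characters and replace everything
// else with U_ plus four lowercase hexadecimal digits.
function encodeNodeName ( str ) {
    return str.replace ( /\W/g , function ( ch ) {
        return 'U_' + pad ( ch.charCodeAt ( 0 ).toString ( 16 ) , 4 , '0' ) ;
    } ) ;
}

// Single-pass decoder: every U_xxxx token is replaced once, left to right.
function decodeNodeName ( str ) {
    return str.replace ( /U_([0-9a-f]{4})/g , function ( match , hex ) {
        return String.fromCharCode ( parseInt ( hex , 16 ) ) ;
    } ) ;
}

var original = 'Quote 1 Quote' ;
var encoded  = encodeNodeName ( original ) ;   // 'QuoteU_00201U_0020Quote'
var decoded  = decodeNodeName ( encoded ) ;    // 'Quote 1 Quote'

// Limitation of this scheme: text that already looks like a token passes
// through the encoder untouched and is then decoded, so it does not survive.
var lossy = decodeNodeName ( encodeNodeName ( 'U_0041' ) ) ;   // 'A'
```

The last line shows a trade-off of this encoding: because the underscore counts as a word character, a name that already contains U_ followed by four lowercase hex digits does not survive the round trip. That is unlikely for style names, but worth knowing.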

1 comment:

  1. Oooh handy - I was about to write a string-to-legal-XML-node-name function and you've got the basis of it here.

    The list of legal node name characters is a bit wider than \w but coming up with something fast to detect the illegal character ranges is proving "interesting".
