
Working With Text Conversion Styles

The TEXT-CONVERSION-STYLE class lets you specify text conversion parameters that represent different character sets for importing or exporting text. For example, if your KB required translation for three different character sets, you could create three text-conversion-style items, each representing one of the character sets that you required. To support the Gensym, Cyrillic, and Japanese character sets, you could create these three text conversion styles:

Use this text-conversion-style item...    For importing and exporting...
gensym-text-style                         Gensym character set text
cyrillic-text-style                       Cyrillic text
shift-jis-text-style                      Japanese text

Once you create the text conversion styles your KB requires, any item that interacts with text conversion can use them, as described in Using a Custom Text Conversion Style.

To create a text-conversion-style item:

  1. Choose:

  2. Position the new item on the workspace.

  3. Click on the item to display its menu.

  4. Choose table.

Naming the Conversion Style

You must name each TEXT-CONVERSION-STYLE item. Other items refer to text conversion styles by name.

Determining the External Character Set to Use

The External-character-set-to-use attribute lets you choose from the following character sets. The default is the symbol gensym. The external character set determines how G2 encodes characters whenever the text conversion style is in use.

Character set                 Description
us-ascii                      7-bit, single-byte character set
latin-1                       8-bit, single-byte character set ISO-8859-1
latin-2                       8-bit, single-byte character set ISO-8859-2
latin-3                       8-bit, single-byte character set ISO-8859-3
latin-4                       8-bit, single-byte character set ISO-8859-4
latin-cyrillic                8-bit, single-byte character set ISO-8859-5
latin-arabic                  8-bit, single-byte character set ISO-8859-6
latin-greek                   8-bit, single-byte character set ISO-8859-7
latin-hebrew                  8-bit, single-byte character set ISO-8859-8
latin-5                       8-bit, single-byte character set ISO-8859-9
latin-6                       8-bit, single-byte character set ISO-8859-10
jis                           7-bit, JIS X 0208 (Japanese)
jis-euc                       8-bit, JIS X 0208
shift-jis                     Shift-JIS-encoded JIS X 0208 (Japanese)
ksc                           7-bit, KS C 5601 (Korean)
ksc-euc                       8-bit, KS C 5601
unicode                       Unicode as a series of 16-bit character codes.
unicode-byte-swapped          Unicode as a series of 16-bit character codes, but as byte-swapped codes.
unicode-ucs-2                 8-bit byte sequences of Unicode in UCS-2 format, most significant byte first.
unicode-ucs-2-byte-swapped    8-bit byte sequences of Unicode in UCS-2 format, least significant byte first.
unicode-utf-7                 Standard 7-bit encoding of Unicode.
unicode-utf-8                 Standard 8-bit encoding of Unicode.
gensym                        The Gensym character set, as used in G2 and related products since Version 1.0, modified to handle Unicode.
x-compound-text               X compound text with a subset of ISO 2022 escapes.
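
To make the byte-level differences among these options concrete, here is a minimal sketch in Python, which is not part of G2. The mapping from G2 character-set symbols to Python codec names is an assumption made for illustration, the function names are hypothetical, and only a few sets with well-known Python equivalents are included.

```python
# Hypothetical mapping from a few G2 character-set symbols to Python codecs.
# The pairing is an assumption for illustration, not G2's implementation.
CODEC_FOR = {
    "us-ascii": "ascii",
    "latin-1": "iso8859-1",
    "latin-cyrillic": "iso8859-5",
    "shift-jis": "shift_jis",
    "jis-euc": "euc_jp",
    # Python's UTF-16 codecs match UCS-2 only for characters in the
    # Basic Multilingual Plane, which is all this sketch uses.
    "unicode-ucs-2": "utf-16-be",               # most significant byte first
    "unicode-ucs-2-byte-swapped": "utf-16-le",  # least significant byte first
    "unicode-utf-8": "utf-8",
}


def export_text(text: str, character_set: str) -> bytes:
    """Encode text using the codec assumed for the named character set."""
    return text.encode(CODEC_FOR[character_set])


def import_text(data: bytes, character_set: str) -> str:
    """Decode external bytes using the codec assumed for the named character set."""
    return data.decode(CODEC_FOR[character_set])


if __name__ == "__main__":
    # The same Cyrillic letter (U+0414) produces different bytes per character set.
    for name in ("latin-cyrillic", "unicode-ucs-2",
                 "unicode-ucs-2-byte-swapped", "unicode-utf-8"):
        print(name, export_text("\u0414", name).hex())
```

The two UCS-2 variants carry the same 16-bit code and differ only in byte order, which is why a reader and a writer must agree on the style in use.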

Using a Replacement Character

You can specify a replacement character to use in the event that Unicode does not have a character code for any imported character, or for any exported character that Unicode cannot represent. In the Replacement-character attribute, specify a one-character string or character code. The default is none, which means that any unrepresented characters will be omitted.
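
The effect of the replacement character on export can be sketched as follows. This is a Python illustration under stated assumptions, not G2 code: the function name and parameters are hypothetical, and the sketch encodes one character at a time, which is adequate only for stateless codecs such as ASCII or the ISO-8859 family.

```python
from typing import Optional


def export_with_replacement(text: str, codec: str,
                            replacement: Optional[str] = None) -> bytes:
    """Encode text, substituting or omitting characters the codec cannot represent."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            if replacement is not None:
                # A replacement character was supplied: emit it instead.
                out += replacement.encode(codec)
            # With replacement=None (mirroring the default "none"),
            # the unrepresented character is simply omitted.
    return bytes(out)
```

For example, exporting the text "na\u00efve" to ASCII drops the accented letter when no replacement is given, and emits "?" in its place when the replacement is "?".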

Specifying the Han-Unification Mode

You can specify whether Japanese, Korean, or Chinese is preferred when translating Chinese characters into non-Unicode character sets such as gensym.

In the Han-unification-mode attribute, choose:

The default mode is japanese.

Specifying the External Line Separator

Line separators vary among different character sets. The External-line-separator attribute lets you specify what characters are used to indicate the end of one text line and the beginning of the next.

The External-line-separator choice applies only when exporting text. When importing text, G2 separates lines of text whenever it sees any of the available options. The exception is the Unicode line separator options, which G2 searches for only when the current Text-conversion-style uses one of the Unicode character sets. Character set options are described in Determining the External Character Set to Use.

These are the six possible line separators:

Line Separator                Description
per-platform                  This is the default value. G2 determines the current operating system and selects the line separator for that platform. If G2 cannot determine the operating system, or has no separator defined for it, G2 uses LF.
CR                            The carriage return character, which is ASCII 13 decimal and Unicode 000D hexadecimal.
LF                            The linefeed character, which is ASCII 10 decimal and Unicode 000A hexadecimal.
CRLF                          The two-character carriage return and linefeed sequence.
unicode-line-separator        Unicode 2028 hexadecimal.
unicode-paragraph-separator   Unicode 2029 hexadecimal.

Although you can choose any of these line separators, not every option is applicable to every external character set. For example, neither unicode-line-separator nor unicode-paragraph-separator can be expressed in ASCII.
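
The import and export rules for line separators can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the per-platform option is left out because its operating-system mapping is not reproduced here.

```python
import re
from typing import List

# Separator symbols from the table above, mapped to their characters.
SEPARATORS = {
    "CR": "\r",
    "LF": "\n",
    "CRLF": "\r\n",
    "unicode-line-separator": "\u2028",
    "unicode-paragraph-separator": "\u2029",
}


def import_lines(text: str, unicode_charset: bool) -> List[str]:
    """Split on any separator; Unicode separators count only for Unicode sets."""
    pattern = r"\r\n|\r|\n"  # CRLF is listed first so it matches as one separator
    if unicode_charset:
        pattern += "|\u2028|\u2029"
    return re.split(pattern, text)


def export_lines(lines: List[str], separator: str = "LF") -> str:
    """Join lines with the single separator chosen for export."""
    return SEPARATORS[separator].join(lines)


def separator_is_expressible(separator: str, codec: str) -> bool:
    """Report whether the separator can be encoded in the given Python codec."""
    try:
        SEPARATORS[separator].encode(codec)
        return True
    except UnicodeEncodeError:
        return False
```

In this sketch, `separator_is_expressible("unicode-line-separator", "ascii")` is false, which corresponds to the restriction noted above.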

Using a Custom Text Conversion Style

The Text-conversion-style attribute appears in all items that interact with text conversion:

GFI is a superseded capability. For more information see Appendix F, Superseded Practices.

When at least one text conversion style exists in a KB, you can direct any one of the previous items to use that particular style by including its name in the Text-conversion-style attribute.


Note: Providing a named TEXT-CONVERSION-STYLE for any one of these items causes G2 to use that style for all other items that require one.

Using the Default Text Conversion Style

If you do not provide your own TEXT-CONVERSION-STYLE, and an item requires one, G2 uses a system-defined text-conversion-style. The relevant attribute values of the system-defined class are as follows:

This attribute...          Has this value...
External-character-set     gensym
Replacement-character      8-bit replacement char: none
Han-unification-mode       japanese
External-line-separator    per-platform
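
As a way of picturing these defaults, the following Python sketch models a text conversion style as a record whose fields default to the values in the table. The class, field, and function names are hypothetical, chosen for illustration; G2 itself defines the style as an item with attributes, not as Python data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TextConversionStyle:
    """Hypothetical model of a style; defaults mirror the system-defined values."""
    name: Optional[str] = None
    external_character_set: str = "gensym"
    replacement_character: Optional[str] = None  # None stands for "none"
    han_unification_mode: str = "japanese"
    external_line_separator: str = "per-platform"


SYSTEM_DEFAULT = TextConversionStyle()


def resolve_style(style_name: Optional[str],
                  styles: Dict[str, TextConversionStyle]) -> TextConversionStyle:
    """Return the named style, or the system default when no name is given."""
    if style_name is None:
        return SYSTEM_DEFAULT
    return styles[style_name]  # raises KeyError for an unknown name
```

A KB-specific style then needs to state only the attributes that differ, for example `TextConversionStyle(name="shift-jis-text-style", external_character_set="shift-jis")`.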

The system-defined text-conversion-style is generally designed to import and export text as it was done in G2 Version 4.0. For text whose external encoding was not specified in 4.0, such as Greek, Hebrew, Arabic, and Georgian, such a comparison is meaningless, but the definition of the Gensym character set clarifies the interpretation that should be assigned.

The reason japanese is used as the default Han-unification-mode is that Han (Chinese) characters are infrequently used in Korean writing, but frequently used in Japanese writing.

Working with G2-Stream Objects

Several text-oriented system procedures create G2-STREAM objects as part of opening and closing files external to G2. You can specify a particular TEXT-CONVERSION-STYLE for the G2-STREAM object to use as described in Using a Custom Text Conversion Style.

Failure to specify a particular text-conversion-style causes G2 to use the system-defined style, which assumes that the external character set is Gensym.


Note: Changing the Text-conversion-style attribute while reading from or writing to a G2-STREAM causes G2 to signal an error.


Copyright © 1997 Gensym Corporation, Inc.