2014-11-26

Three more RAW types:


Revision as of 09:43, 26 November 2014


= Analysis =




With Delphi XE, Embarcadero decided that the one-byte-per-character String type, which uses an ANSI-compatible encoding restricted to a certain set of languages (usually defined by the system locale), is not versatile enough. So they introduced a “code-page aware” string type that allows each string to be held in a different encoding. A broad set of encodings is offered, including one-byte ANSI code pages and Unicode in 1-, 2- and 4-byte variants. In principle, each string variable in a project can be defined to use a different encoding. Hence the type “String” now comes in several “brands” (which can be denoted by a number in brackets in a variable or type name definition). When the brackets are left out, the default is 2-byte Unicode, which makes a lot of sense on Windows, as the OS uses this encoding internally.
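To make the idea of several encoding “brands” for the same text concrete, here is a minimal sketch in Python. Python codec names stand in for Delphi/fpc code page numbers, and the helper name `encoded_sizes` is made up for this illustration; it only shows how the storage size of one text differs between a one-byte ANSI code page and the 1-, 2- and 4-byte Unicode variants.

```python
# Illustration only: Python codecs stand in for Delphi/fpc code pages.
TEXT = "Grüße"

ENCODINGS = {
    "cp1252": "one-byte ANSI code page (Western European)",
    "utf-8": "Unicode, 1-byte code units",
    "utf-16-le": "Unicode, 2-byte code units",
    "utf-32-le": "Unicode, 4-byte code units",
}


def encoded_sizes(text):
    """Return {encoding name: number of bytes needed to hold text}."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


if __name__ == "__main__":
    for name, size in encoded_sizes(TEXT).items():
        print(f"{name:10} {size:2} bytes  {ENCODINGS[name]}")
```

The same five characters need 5, 7, 10 and 20 bytes respectively, which is why a per-string choice of encoding can matter for storage.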

Delphi provides automatic code-conversion of the string content whenever it seems necessary.




Unfortunately, Embarcadero forces a peculiar mix of static and dynamic “branding” of string variables. Each variable gets a certain encoding brand through its declaration; on the other hand, the content of each string variable additionally carries a (potentially) dynamically definable setting of both the number of bytes per character and the encoding to be used to interpret the content whenever necessary. IMHO this is the cause of major misconceptions. Supposedly they feared being bashed for the run-time overhead that fully dynamic string encoding handling would need. With the partly dynamic handling, the decision whether or not to do an automatic conversion when accessing a string can obviously be made at compile time rather than at run time.




= The problem =




Simple assignments and compiler built-in functions such as “pos()” and “copy()” can decide in each instance what to do, i.e. whether or not to auto-convert the content based on the encoding brand of the arguments (by implicitly faking fully dynamically encoded arguments). Functions and properties that are not built in, however, need to be defined with a fixed string brand. Hence, when they are used with differently encoded Strings, the compiler will introduce very time-consuming and loss-prone auto-conversion (unless the function itself is provided in several brands).
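The “loss-prone” part can be illustrated with a small Python model. The function `through_fixed_brand` is hypothetical: it mimics a routine whose parameter has a fixed code page, so that the argument is converted on the way in and the result converted back on the way out. Python's `errors="replace"` substitutes `?` for characters the code page lacks; what a real conversion routine substitutes depends on the platform, so treat this as a sketch of the effect only.

```python
def through_fixed_brand(text, brand):
    """Model a call into a routine whose string parameter has a fixed code page.

    The argument is converted to `brand` on the way in and back to Unicode on
    the way out; characters the code page cannot represent are replaced.
    """
    converted = text.encode(brand, errors="replace")  # implicit conversion in
    return converted.decode(brand)                    # implicit conversion out


if __name__ == "__main__":
    print(through_fixed_brand("Ωmega", "cp1252"))  # the Greek letter is lost
    print(through_fixed_brand("Ωmega", "utf-8"))   # survives unchanged
```

Besides the information loss, each such call pays for two conversions that a brand-agnostic interface would not need.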

This problem is much more urgent with fpc than with Delphi, as fpc is supposed to run on multiple OSes that might use different system-wide default encodings (e.g. UTF-8 on Linux and UTF-16 on Windows). While Delphi XE just creates Windows software, and projects that explicitly use strings in an encoding other than the system default (e.g. read from a file or retrieved from a network resource) are rather rare, fpc needs to be able to handle user software that can be compiled for multiple OSes with different encoding defaults.



The most obvious candidate for pain in this respect is “TStrings”. Lots of string handling classes are derived from this basic class. Storing the strings and retrieving the content could easily, and with close to no overhead, be done in a fully dynamic way: each string comes with its branding notes anyway. But the fixed branding of the string type used in the Delphi-compatible TStrings interface forces auto-conversion in all cases but one. Another example is the interface of GUI libraries such as Lazarus or mse: portable programs would benefit from a common interface for all OSes, but forcing any fixed encoding independent of the OS does not seem appropriate either, as permanent conversion on the way between the user code and the OS could not be avoided at all. Of course the fpc RTL could be mentioned as well.




= A possible solution =


Delphi provides just one encoding brand, “RawByteString”, for which the compiler does not force auto-conversion on assignment to or from another brand. But AFAIK, with this brand auto-conversion is disabled for all assignments, and some assignments may even be forbidden. This obviously is not what we want here.



Hence the suggestion is to introduce yet another string encoding brand, one that is fully dynamic and takes part in auto-conversion. It might be called the “DynamicString” encoding brand and might be assigned the encoding brand number $F000 or $FF00 (to leave some room for encoding brands more similar to RawByteString, which AFAIK is $FFFF).



By this, all normal String handling introduced by operators and compiler built-in functions stays completely unchanged, unless we explicitly use the DynamicString brand. Calls to “old-style” user or library functions that use the traditional Delphi String brands also stay as defined by Delphi.

If one or more of the partners in an assignment is a DynamicString, the compiler needs to adhere to some simple rules and to generate code that checks the dynamic encoding brand of the String(s) involved.

If the target is dynamic: just assign the source (be it dynamic, static, or even a RawByteString).



If the target is not dynamic (and not a Raw String), generate code to check the encoding brand of the source (this is just a single Word, so there is no noticeable overhead). If it is the same as the static (compile-time) brand of the target, just assign. If the encoding is different, call the appropriate conversion library function (which would be necessary in a similar case with a purely static encoding paradigm as well).

If the target is a Raw String (static compile-time brand), generate code to check the encoding brand of the source (again just a single Word, so no noticeable overhead). If the source is a Raw String as well, just assign. If the encoding is different, behave like Delphi would (I don't know what it does when assigning something to a RawByteString).



One more case: if the source is a DynamicString that happens to have been assigned “RAW” content, but the target is static and not RAW, behave like Delphi would when assigning Raw to any other encoding.
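The assignment rules above can be sketched as a small Python model. Everything here is an assumption made for illustration: `Payload`, `assign`, `DYNAMIC`, `RAW` and `delphi_rule` are invented names, Python codec names stand in for code page numbers, and the cases the text defers to Delphi's existing behaviour are deliberately left unimplemented rather than guessed.

```python
from dataclasses import dataclass

DYNAMIC = "dynamic"  # hypothetical brand proposed in the text
RAW = "raw"          # stands in for RawByteString


@dataclass(frozen=True)
class Payload:
    """String content plus the encoding tag stored alongside it."""
    data: bytes
    codepage: str    # a Python codec name, or RAW


def delphi_rule(case):
    # The text defers these cases to Delphi's behaviour without spelling it
    # out, so this sketch refuses to guess.
    raise NotImplementedError(f"defer to Delphi: {case}")


def assign(target_brand, source):
    """Return the payload a variable of target_brand holds after target := source."""
    if target_brand == DYNAMIC:
        return source                  # dynamic target: take the source as it is
    if target_brand == RAW:
        if source.codepage == RAW:
            return source              # raw to raw: just assign
        return delphi_rule("non-raw content assigned to a raw target")
    if source.codepage == RAW:
        return delphi_rule("raw content assigned to a non-raw target")
    if source.codepage == target_brand:
        return source                  # tags match: one comparison, no conversion
    text = source.data.decode(source.codepage)
    return Payload(text.encode(target_brand), target_brand)  # tags differ: convert


if __name__ == "__main__":
    utf8 = Payload("Grüße".encode("utf-8"), "utf-8")
    print(assign(DYNAMIC, utf8))
    print(assign("utf-16-le", utf8))
```

Note that the only run-time cost added for matching brands is the comparison of the two tags; the conversion branch is the same work a purely static scheme would do.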

Now it's obvious how a “dynamic” version of TStrings would work.


Reworking the code for storing and retrieving with TStringList would be easy to do. Of course more complex classes and functions (such as sorting) would need some more effort.




Hence, if you set the same (static) encoding brand for all the variables you use with a TStrings-based store, it will provide exactly the same functionality, with no noticeable overhead compared to the Delphi way of always using UTF-16.



An advantage of the dynamic TStringList, besides the obvious enhanced versatility for (portable) code written with a String encoding brand other than UTF-16, is that you can force it to use any encoding you like by just assigning your strings to an appropriate variable before storing (no special consideration is needed when retrieving). So you can e.g. keep a compact UTF-8 store while using Delphi-style UTF-16 (more easily handled regarding pos() and friends) in your user code. The conversion between UTF-8 and UTF-16 is fast and not prone to information loss.
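As a rough illustration of the compact-store idea, the following Python sketch models a store that keeps every item in the encoding it arrives in and converts only on retrieval when the requested encoding differs. `DynamicStringList` and `to_utf8` are hypothetical names for this example, not existing fpc classes or routines, and byte strings with codec names stand in for branded Pascal strings.

```python
class DynamicStringList:
    """Toy model of a TStrings-like store that keeps each item in whatever
    encoding it arrives in (hypothetical; not an existing fpc class)."""

    def __init__(self):
        self._items = []                      # list of (bytes, codec name)

    def add(self, data, codepage):
        self._items.append((data, codepage))  # stored untouched, no conversion

    def get(self, index, target_codepage):
        data, codepage = self._items[index]
        if codepage == target_codepage:
            return data                       # same encoding: hand back as is
        return data.decode(codepage).encode(target_codepage)

    def stored_bytes(self):
        return sum(len(data) for data, _ in self._items)


def to_utf8(utf16_bytes):
    """Models the 'assign to a UTF-8 variable before storing' step."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    user_text = "Hello, wörld".encode("utf-16-le")  # user code works in UTF-16
    store = DynamicStringList()
    store.add(to_utf8(user_text), "utf-8")          # compact UTF-8 storage
    print(store.stored_bytes(), "bytes stored;", len(user_text), "bytes as UTF-16")
    print(store.get(0, "utf-16-le") == user_text)   # round trip loses nothing
```

For this mostly-ASCII sample the UTF-8 store takes 13 bytes against 24 bytes in UTF-16, and retrieving it as UTF-16 yields the original bytes again.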

= Three more RAW types =




While we are defining a String type brand that is not Delphi compatible, we should of course add the obviously missing types RawWordString, RawDWordString, and RawQWordString.

= See Also =


* [[FPC Unicode support]]
