This article discusses the alpha field coding that a SIM card uses to show UCS2 text on the Mobile Equipment's display (the phone's display).
Alpha fields? UCS2? What are they? Okay, let's start with the definitions...
Terms and Definitions
What are alpha fields?
Alpha fields are text strings that are labeled as Alpha Identifier. You can find them in some EFs in the 3GPP TS 11.11 specification and in every STK menu that has an Alpha Identifier TLV in TS 11.14.
What is UCS2?
UCS2 (Universal Character Set, coded on 2 bytes) is a character encoding that codes each character in 2 bytes. You can read more about UCS on Wikipedia.
What is 3GPP?
3GPP is a standards organization that maintains the GSM technical specifications. You can find more about this organization here. The alpha field coding for UCS2 is described in Technical Specification number 11.11 (TS 11.11), which can be downloaded here.
Alpha Fields Formats
There are 3 formats that the SIM can use to code UCS2 text.
'80' format
The encoding for the '80' format is as follows:
- The first octet/byte is '80'
- The following octets are the 16-bit UCS2 characters in big-endian format (most significant byte first)
Example:
We have 3 UCS2 characters: Sকদ
The character codes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The alpha field coding for this format is: '80 0053 0995 09A6'. As simple as that!
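As a rough illustration of the '80' format, here is a minimal sketch in plain Java (not Java Card). The class name Alpha80Sketch and its methods are made up for this example. It assumes the text holds only characters from the Basic Multilingual Plane, so that Java's UTF-16BE bytes are the same as big-endian UCS2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only
final class Alpha80Sketch {
    /** Builds an '80' alpha field: the tag octet followed by big-endian UCS2 bytes. */
    static byte[] encode(String text) {
        // Assumption: no surrogate pairs, so UTF-16BE equals UCS2 here
        byte[] ucs2 = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[1 + ucs2.length];
        out[0] = (byte) 0x80;
        System.arraycopy(ucs2, 0, out, 1, ucs2.length);
        return out;
    }

    /** Formats bytes as upper-case hex without separators. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

For the example string, Alpha80Sketch.toHex(Alpha80Sketch.encode("Sকদ")) gives 800053099509A6, i.e. (1 + 2 * 3) bytes.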
'81' format
The encoding for the '81' format is as follows:
- The first octet is '81'
- The second octet is the number of characters in the string
- The third octet holds bit 15 to bit 8 of the 16-bit Base Pointer, with bit 16 and bits 7 to 1 set to zero: 0xxxxxxxx0000000
- The following octets are the coded characters, with the following rules:
  - If the MSB (most significant bit) is zero, the remaining 7 bits contain a GSM Default Alphabet character
  - If the MSB is one, the remaining 7 bits are an offset that is added to the Base Pointer to give the UCS2 character
Example:
We have 3 UCS2 characters: Sকদ
The character codes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The alpha field coding for this format is: '81 03 13 53 95 A6'.
How do we get that value?
The first octet is '81'.
The second octet is '03' since we have 3 characters.
The third octet is the Base Pointer. Looking at the characters whose high byte (first two hex digits) is not '00', we get '0995' and '09A6'. In binary:
         16                1  (bit position)
'0995' = 0000 1001 1001 0101
'09A6' = 0000 1001 1010 0110
          **********
Both characters share bit 16 to bit 8, so the Base Pointer is '0980' (bit 7 to bit 1 cleared).
Bit 15 to bit 8 of the Base Pointer are 00010011b, so the third octet is '13'.
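The fourth octet is the first character "S". Since it is in the default alphabet, we simply set the MSB (bit 8) to zero and keep the low 7 bits of "S":
"S" = '0053' = 0000 0000 0101 0011
                          ********
(0 + 1010011)b = 01010011b = '53'
TIPS: when the character is '00XX' with XX below '80', the octet is simply the low byte XX. This holds for characters such as letters and digits, whose GSM Default Alphabet code equals their UCS2 low byte; a few characters (for example "@" and "$") have a different code in the GSM Default Alphabet.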
The fifth octet is the character "ক" ('0995').
To encode this character, we calculate its offset from the Base Pointer.
Offset = '0995' - '0980' = '15' = 0010101b (only 7 bits)
The coded character has its MSB set to 1. Hence the value is (1 | 0010101)b = '95'
The sixth octet is the character "দ" ('09A6').
Doing the same as for the fifth octet ('09A6' - '0980' = '26'), we get 'A6'.
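The '81' steps above can be sketched in plain Java as follows. This is only an illustration: Alpha81Sketch is a hypothetical class, and it assumes that every character below '0080' has a GSM Default Alphabet code equal to its UCS2 low byte, which is true for letters and digits but not for every symbol.

```java
// Hypothetical helper, for illustration only
final class Alpha81Sketch {
    /** Encodes text as an '81' alpha field, or throws if the text does not fit this format. */
    static byte[] encode(String text) {
        if (text.length() > 255) {
            throw new IllegalArgumentException("too many characters for one length octet");
        }
        byte[] out = new byte[3 + text.length()];
        int base = -1; // base pointer, unknown until the first non-default character
        for (int i = 0; i < text.length(); i++) {
            int c = text.charAt(i);
            if (c < 0x80) {
                // Assumption: the GSM default alphabet code equals the UCS2 low byte
                out[3 + i] = (byte) c;
                continue;
            }
            if (c > 0x7FFF) {
                throw new IllegalArgumentException("bit 16 of the base pointer must be zero");
            }
            if (base < 0) {
                base = c & 0x7F80; // clear bits 7 to 1
            }
            if ((c & 0x7F80) != base) {
                throw new IllegalArgumentException("characters span more than one half page");
            }
            out[3 + i] = (byte) (0x80 | (c - base)); // MSB set, 7-bit offset
        }
        out[0] = (byte) 0x81;
        out[1] = (byte) text.length();
        out[2] = (byte) (base < 0 ? 0 : base >> 7); // bits 15 to 8 of the base pointer
        return out;
    }
}
```

For "Sকদ" this returns the six bytes 81 03 13 53 95 A6, matching the worked example.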
'82' format
The encoding for the '82' format is as follows:
- The first octet is '82'
- The second octet is the number of characters in the string
- The third and fourth octets are the full 16-bit Base Pointer
- The following octets are the coded characters, with the following rules:
  - If the MSB (most significant bit) is zero, the remaining 7 bits contain a GSM Default Alphabet character
  - If the MSB is one, the remaining 7 bits are an offset that is added to the Base Pointer to give the UCS2 character
Example:
We have 3 UCS2 characters: Sকদ
The character codes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The alpha field coding for this format is: '82 03 09 95 53 80 91'.
How do we get that value?
The first and second octets are straightforward.
For the third and fourth octets, the Base Pointer, we can take the lowest code among the characters that are not in the default alphabet. Of course, it is better to have a UCS2 look-up table that gives the Base Pointer for each specific character set. In this example, I set the Base Pointer to '0995'.
The fifth octet is the character "S" ('0053'). Since it is in the default alphabet, the octet value is '53'.
The sixth octet is the character "ক" ('0995').
To encode this character, we calculate its offset from the Base Pointer.
Offset = '0995' - '0995' = '00' = 0000000b (only 7 bits)
The coded character has its MSB set to 1. Hence the value is (1 | 0000000)b = '80'
The seventh octet is the character "দ" ('09A6').
Offset = '09A6' - '0995' = '11' = 0010001b (only 7 bits)
The coded character has its MSB set to 1. Hence the value is (1 | 0010001)b = '91'
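Going the other way, an '82' field can be decoded by reversing these steps. The following is a minimal plain Java sketch; Alpha82Sketch is a hypothetical class, and default alphabet octets are assumed to map directly to the same UCS2 low byte.

```java
// Hypothetical helper, for illustration only
final class Alpha82Sketch {
    /** Decodes an '82' alpha field into a Java string. */
    static String decode(byte[] field) {
        if (field.length < 4 || (field[0] & 0xFF) != 0x82) {
            throw new IllegalArgumentException("not an '82' alpha field");
        }
        int count = field[1] & 0xFF; // number of characters
        if (field.length < 4 + count) {
            throw new IllegalArgumentException("field is shorter than its character count");
        }
        // Third and fourth octets: full 16-bit base pointer, high byte first
        int base = ((field[2] & 0xFF) << 8) | (field[3] & 0xFF);
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            int b = field[4 + i] & 0xFF;
            if ((b & 0x80) == 0) {
                // MSB clear: assumed default alphabet code equal to the UCS2 low byte
                sb.append((char) b);
            } else {
                // MSB set: base pointer plus the 7-bit offset
                sb.append((char) (base + (b & 0x7F)));
            }
        }
        return sb.toString();
    }
}
```

Decoding the bytes 82 03 09 95 53 80 91 gives back the three characters "Sকদ".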
Which one to choose: '80', '81', or '82'?
- Whenever possible, use the '81' format.
  - Strong point: '81' needs the least memory, i.e. (3 + N) bytes for N characters.
  - Weak point: this format only works when all non-default-alphabet characters lie in one 128-character half page, i.e. between 'XX00' and 'XX7F' or between 'XX80' and 'XXFF', with codes no higher than '7FFF'
- If '81' is impossible, try the '82' format
  - Strong point: '82' needs only slightly more memory than '81', i.e. (4 + N) bytes
  - Weak point: this format only works when all non-default-alphabet characters lie within 128 consecutive codes starting at the Base Pointer
- If neither '81' nor '82' is possible, you must use '80'
  - Strong point: '80' covers the whole UCS2 range from '0000' to 'FFFF'
  - Weak point: the number of bytes required is large, i.e. (1 + 2 * N) bytes
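The selection rules above could be sketched like this in plain Java. AlphaFormatChooser and its methods are hypothetical names. The sketch treats every character below '0080' as a default alphabet character, and it assumes that the '82' Base Pointer may be any 16-bit value.

```java
// Hypothetical helper, for illustration only
final class AlphaFormatChooser {
    /** Returns the format tag (0x81, 0x82 or 0x80) suggested by the rules above. */
    static int choose(String text) {
        int min = Integer.MAX_VALUE;
        int max = -1;
        for (int i = 0; i < text.length(); i++) {
            int c = text.charAt(i);
            if (c < 0x80) {
                continue; // assumed to be a default alphabet character
            }
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        if (max < 0) {
            return 0x81; // only default alphabet characters: any base pointer works
        }
        if (max <= 0x7FFF && (min & 0x7F80) == (max & 0x7F80)) {
            return 0x81; // all characters share one half page
        }
        if (max - min <= 0x7F) {
            return 0x82; // all offsets from the lowest code fit in 7 bits
        }
        return 0x80;
    }

    /** Number of octets the alpha field needs for charCount characters. */
    static int encodedLength(int tag, int charCount) {
        switch (tag) {
            case 0x81: return 3 + charCount;
            case 0x82: return 4 + charCount;
            default:   return 1 + 2 * charCount;
        }
    }
}
```

Note that for very short strings '80' can be just as small: with 2 characters, '80' and '81' both need 5 bytes, so a real implementation may also want to compare encodedLength results before deciding.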
Sample source in Java
The following is an example of source code for UCS2 to alpha field conversion, and vice versa. It is written for Java Card and relies on javacard.framework.Util.
import javacard.framework.Util;

/**
 * Converts UCS2 characters into Alpha Fields format according to 3GPP TS 11.11 Annex B, and
 * vice versa.
 *
 * @author SDK
 */
public class AlphaFields
{
    /**
     * Converts UCS2 alpha field into UCS2 bytes
     *
     * @param src
     *            source byte array
     * @param srcOff
     *            offset to first octet ('80', '81', or '82') in source byte array
     * @param srcLen
     *            length of alpha field
     * @param dest
     *            destination byte array
     * @param destOff
     *            offset to store the result in destination
     * @return number of bytes stored in destination
     */
    public static short convertAlphaFieldToUcs2(byte[] src, short srcOff, short srcLen, byte[] dest,
            short destOff)
    {
        short base; // base of UCS2 half page for '81' and '82' format
        short nChar = 0; // number of UCS2 characters
        short i; // loop counter
        switch (src[srcOff])
        {
            case (byte) 0x80:
                // if first octet is '80', all following bytes are 16-bit UCS2 characters;
                // copy them to the destination buffer, excluding the '80' byte
                srcLen--;
                Util.arrayCopyNonAtomic(src, (short) (srcOff + 1), dest, destOff, srcLen);
                return srcLen;
            case (byte) 0x81:
                // if first octet is '81', second octet is number of characters
                nChar = Util.makeShort((byte) 0, src[(short) (srcOff + 1)]);
                // third octet is the base pointer bit 15 to 8: 0hhhhhhhh0000000
                // we need to shift the bits left 7 times to get the base pointer.
                base = (short) ((short) (src[(short) (srcOff + 2)] & 0x00FF) << 7);
                // skip 3 bytes ('81', number of characters, and base pointer)
                srcOff += 3;
                // jump to for loop below
                break;
            case (byte) 0x82:
                // if first octet is '82', second octet is number of characters
                nChar = Util.makeShort((byte) 0, src[(short) (srcOff + 1)]);
                // third and fourth octet are 16-bit base pointer
                base = Util.getShort(src, (short) (srcOff + 2));
                // skip 4 bytes ('82', number of characters, and 2 bytes base pointer)
                srcOff += 4;
                break;
            default:
                // unknown format
                return 0;
        }
        // for every byte in data under '81' and '82'
        for (i = 0; i < nChar; i++)
        {
            // if MSB is not set, meaning GSM default alphabet, set the output into 00XX
            if (src[srcOff] >= 0)
            {
                dest[destOff] = 0;
                dest[(short) (destOff + 1)] = src[srcOff];
            }
            // if MSB is set, meaning the UCS2 character is base pointer plus 7-bit of the value
            else
            {
                Util.setShort(dest, destOff, (short) (base + (byte) (src[srcOff] & 0x7F)));
            }
            // next iteration
            srcOff++;
            destOff += 2;
        }
        // return number of UCS2 bytes
        return (short) (nChar * 2);
    }
    /**
     * Converts UCS2 bytes into Alpha field format. The '81' or '82' format is chosen automatically
     * when the characters allow it; otherwise the '80' format is used.
     *
     * @param src
     *            source byte array
     * @param srcOff
     *            offset to first UCS2 byte in source byte array
     * @param srcLen
     *            number of UCS2 bytes
     * @param dest
     *            destination byte array
     * @param destOff
     *            offset to store the result in destination
     * @return number of bytes stored in destination
     */
    public static short convertUcs2ToAlphaField(byte[] src, short srcOff, short srcLen, byte[] dest,
            short destOff)
    {
        short i; // loop counter
        short min = (short) 0x7FFF; // lowest character code with a non-zero high byte
        short max = (short) 0; // highest character code with a non-zero high byte
        short temp; // temporary short
        short outOff; // offset in destination byte array
        // Scan the character range only if there is more than one UCS2 character
        // (srcLen counts bytes, two per character)
        if (srcLen > 2)
        {
            // Determine the minimum and maximum of all characters whose high byte is not '00'
            for (i = 0; i < srcLen; i += 2)
            {
                if (src[(short) (srcOff + i)] != 0)
                {
                    temp = Util.getShort(src, (short) (srcOff + i));
                    // Codes 8000 to FFFF are negative as a short and are not handled here
                    if (temp < 0)
                    {
                        // set max to min+130 so that the range check below fails
                        max = (short) (min + 130);
                        break;
                    }
                    if (min > temp)
                    {
                        min = temp;
                    }
                    if (max < temp)
                    {
                        max = temp;
                    }
                }
            }
        }
        // '81' and '82' only save space for more than 2 characters (srcLen counts bytes), and
        // every offset from the minimum must fit in 7 bits (0 to 127)
        if (srcLen > 4 && (short) (max - min) < (short) 128)
        {
            // Set number of characters for both '81' and '82' format
            dest[(short) (destOff + 1)] = (byte) (srcLen / 2);
            // If bit15 to bit8 of minimum and maximum are the same, we can use '81' format
            // Since we have checked that the range is less than 128, we can simply check bit8
            if ((byte) (min & 0x80) == (byte) (max & 0x80))
            {
                // Alpha field 81
                dest[destOff] = (byte) 0x81;
                // Base pointer bit15 to bit8
                min = (short) (min & 0x7F80);
                dest[(short) (destOff + 2)] = (byte) ((short) (min >> 7) & 0x7F);
                outOff = (short) (destOff + 3);
            }
            // Otherwise bit8 differs and we shall use '82' format
            // Example:
            // min = 0x0514 = 0000 0101 0001 0100
            // max = 0x0593 = 0000 0101 1001 0011
            //                ********* ^
            //                          +------------ conflict bit8
            else
            {
                // Alpha field 82
                dest[destOff] = (byte) 0x82;
                // Set minimum as base pointer (two bytes)
                outOff = Util.setShort(dest, (short) (destOff + 2), min);
            }
            // For each UCS2 character
            for (i = 0; i < srcLen; i += 2)
            {
                // If high byte is '00', the character is treated as default alphabet
                // We set the value using 7 bits of the low byte (this assumes the low byte
                // equals the character's GSM default alphabet code)
                if (src[(short) (srcOff + i)] == 0)
                {
                    dest[outOff] = (byte) (src[(short) (srcOff + i + 1)] & 0x7F);
                }
                // If the high byte is not '00', then get the difference between the character code
                // and the base pointer. Assign the value as 7-bit difference and set the MSB
                else
                {
                    temp = (short) (Util.getShort(src, (short) (srcOff + i)) - min);
                    dest[outOff] = (byte) (temp | 0x80);
                }
                // next iteration
                outOff++;
            }
            // return the output length (3+N bytes for '81' and 4+N bytes for '82')
            return (short) (outOff - destOff);
        }
        // The characters cannot fit into a half page or a code is above 0x7FFF: use '80' coding
        // first octet is 0x80
        dest[destOff] = (byte) 0x80;
        // following octets are the UCS2 bytes
        Util.arrayCopyNonAtomic(src, srcOff, dest, (short) (destOff + 1), srcLen);
        // return the output length (1 + 2 * N bytes for '80')
        return (short) (srcLen + 1);
    }
}