Coding of Alpha fields in the SIM for UCS2

This article discusses the alpha field coding that a SIM card uses to show UCS2 text on the Mobile Equipment's display (the phone's display).

Alpha fields? UCS2? What are they? Okay, let's start with the definitions...

Terms and Definitions

What are alpha fields?
Alpha fields are text strings that are labeled as Alpha Identifier. You can find them in some EFs in the 3GPP TS 11.11 specification and in every STK menu that has an Alpha Identifier TLV in TS 11.14.

What is UCS2?
UCS2 (Universal Character Set, coded on 2 bytes) is a character encoding that codes each character in 2 bytes. You can read more about the UCS on Wikipedia.

What is 3GPP?
3GPP is a standards organization that maintains the GSM technical specifications. You can find more about this organization here. The Alpha Field coding for UCS2 is described in Technical Specification number 11.11 (TS 11.11), which can be downloaded here.

Alpha Fields Formats

There are 3 formats that the SIM can use to code UCS2 text.

'80' format

The encoding for the '80' format is as follows:

The first octet/byte is '80'
The following octets are the 16-bit UCS2 characters in Big Endian format (most significant byte first).

Example:
We have 3 UCS2 characters: Sকদ
The characters in bytes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The coding for Alpha field for this format is: '80 0053 0995 09A6'. As simple as that!
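
The '80' coding above can be sketched in plain Java (not Java Card). The names Format80Sketch, encode80 and toHex are made up for this illustration, and the sketch assumes that every character of the string fits in one 16-bit UCS2 value.

```java
final class Format80Sketch {

    // Builds an '80' alpha field: the octet 0x80 followed by each UCS2
    // character as two octets, most significant octet first.
    static byte[] encode80(String text) {
        byte[] out = new byte[1 + 2 * text.length()];
        out[0] = (byte) 0x80;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);            // assumed to be one UCS2 character
            out[1 + 2 * i] = (byte) (c >> 8);   // high octet
            out[2 + 2 * i] = (byte) c;          // low octet
        }
        return out;
    }

    // Hex dump helper, used only to show the result.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // S, U+0995 and U+09A6 from the example above
        System.out.println(toHex(encode80("S\u0995\u09A6"))); // prints 800053099509A6
    }
}
```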

'81' format

The encoding for the '81' format is as follows:

The first octet is '81'
The second octet is the number of UCS2 characters
The third octet is the Base Pointer: it holds bit 15 to bit 8 of the 16-bit base value, while bit 16 and bits 7 to 1 are zero: 0xxxxxxxx0000000
The following octets are the coded characters, with the following rule:

If the MSB (most significant bit) is zero, the remaining 7 bits contain a GSM Default Alphabet character
If the MSB is one, the remaining 7 bits are an offset that is added to the Base Pointer; the result is the UCS2 character

Example:
We have 3 UCS2 characters: Sকদ
The characters in bytes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The coding for Alpha field for this format is: '81 03 13 53 95 A6'.
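
One way to check a coded value like this is to decode it again. The following plain Java sketch applies the '81' rules listed above. The class and method names are hypothetical, and the sketch assumes that a default alphabet octet has the same value as the UCS2 code point, which is true for "S" but not for every GSM Default Alphabet character.

```java
final class Decode81Sketch {

    // Decodes an '81' alpha field into a Java string.
    static String decode81(byte[] field) {
        if ((field[0] & 0xFF) != 0x81) {
            throw new IllegalArgumentException("not an '81' alpha field");
        }
        int count = field[1] & 0xFF;        // number of characters
        int base = (field[2] & 0xFF) << 7;  // octet 3 holds bits 15..8 of the base pointer
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int octet = field[3 + i] & 0xFF;
            if ((octet & 0x80) == 0) {
                // MSB clear: default alphabet (assumption: same value as the code point)
                sb.append((char) octet);
            } else {
                // MSB set: base pointer plus 7-bit offset
                sb.append((char) (base + (octet & 0x7F)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] field = {(byte) 0x81, 0x03, 0x13, 0x53, (byte) 0x95, (byte) 0xA6};
        for (char c : decode81(field).toCharArray()) {
            System.out.printf("%04X%n", (int) c); // one code point per line
        }
    }
}
```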

How can we get that value?

First, the first octet is '81'.

The second octet shall be '03' since we have 3 UCS2 characters.

The third octet is the Base Pointer. If we look at all UCS2 characters whose high byte (the first two hex digits) is not '00', we get '0995' and '09A6'. In binary we get:

        16                 1 (bit position)
'0995' = 0000 1001 1001 0101
'09A6' = 0000 1001 1010 0110
          **********
        Base pointer '0980' coded as '13'

So, the Base pointer value is 00010011b or '13'.
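
The bit picking above is a shift by 7 followed by a mask. Here is a small plain Java sketch of that arithmetic, with hypothetical helper names:

```java
final class BasePointer81Sketch {

    // Octet 3 of an '81' field: bit 15 to bit 8 (counted from 1) of the code point.
    static int basePointerOctet(int codePoint) {
        return (codePoint >> 7) & 0xFF;
    }

    // The 16-bit base pointer that such an octet stands for.
    static int basePointer(int octet) {
        return (octet & 0xFF) << 7;
    }

    public static void main(String[] args) {
        System.out.printf("%02X%n", basePointerOctet(0x0995)); // prints 13
        System.out.printf("%04X%n", basePointer(0x13));        // prints 0980
    }
}
```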

The fourth octet is the first character "S". Since it is in the default alphabet, we simply set the MSB (bit 8) to zero and keep the 7 bits of "S":
"S" = '0053' = 0000 0000 0101 0011
********
(0 + 1010011)b = 01010011b = '53'

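TIPS: when you get '00XX' for a character whose GSM Default Alphabet code equals its UCS2 low byte (as for the basic Latin letters and digits), the octet is simply the low byte XX.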

The fifth octet is for character "ক" ('0995').
To encode this character, we calculate the additional offset from the Base Pointer.
Additional value = '0995' - '0980' = '15' = 0010101b (only 7 bit)
The coded character has MSB set to 1. Hence the value is (1 | 0010101)b = '95'

The sixth octet is the character for "দ" ('09A6').
Proceeding in the same way as for the fifth octet ('09A6' - '0980' = '26'), we get 'A6'.
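
The whole walk-through can be put together as a minimal plain Java sketch. The names are hypothetical. It assumes that characters below '0080' are default alphabet characters with the same code, and it rejects any string whose other characters do not share one half page.

```java
final class Encode81Sketch {

    // Encodes a string as an '81' alpha field (3 + N octets).
    static byte[] encode81(String text) {
        if (text.length() > 255) {
            throw new IllegalArgumentException("character count must fit in one octet");
        }
        int base = -1; // half-page base, unknown until a non-default character is seen
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                continue; // assumption: coded directly as a 7-bit default alphabet value
            }
            if (c > 0x7FFF) {
                throw new IllegalArgumentException("bit 16 of the base pointer must be zero");
            }
            int halfPage = c & 0x7F80; // clear the 7 offset bits
            if (base == -1) {
                base = halfPage;
            } else if (base != halfPage) {
                throw new IllegalArgumentException("characters span more than one half page");
            }
        }
        if (base == -1) {
            base = 0; // only default alphabet characters were found
        }
        byte[] out = new byte[3 + text.length()];
        out[0] = (byte) 0x81;
        out[1] = (byte) text.length();
        out[2] = (byte) (base >> 7); // bits 15..8 of the base pointer
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[3 + i] = (byte) (c < 0x80 ? c : (0x80 | (c - base)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : encode81("S\u0995\u09A6")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```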

'82' format

The encoding for the '82' format is as follows:

The first octet is '82'
The second octet is the number of UCS2 characters
The third and fourth octets are the full 16-bit Base Pointer for the UCS2 characters
The following octets are the coded characters, with the following rule:

If the MSB (most significant bit) is zero, the remaining 7 bits contain a GSM Default Alphabet character
If the MSB is one, the remaining 7 bits are an offset that is added to the Base Pointer; the result is the UCS2 character

Example:
We have 3 UCS2 characters: Sকদ
The characters in bytes are: '0053' for "S", '0995' for "ক", and '09A6' for "দ".
The coding for Alpha field for this format is: '82 03 09 95 53 80 91'.
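
As a check, this value can be decoded with the '82' rules. The plain Java sketch below uses hypothetical names and again assumes that a default alphabet octet equals the UCS2 code point, which holds for "S".

```java
final class Decode82Sketch {

    // Decodes an '82' alpha field into a Java string.
    static String decode82(byte[] field) {
        if ((field[0] & 0xFF) != 0x82) {
            throw new IllegalArgumentException("not an '82' alpha field");
        }
        int count = field[1] & 0xFF;                              // number of characters
        int base = ((field[2] & 0xFF) << 8) | (field[3] & 0xFF);  // full 16-bit base pointer
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int octet = field[4 + i] & 0xFF;
            if ((octet & 0x80) == 0) {
                // MSB clear: default alphabet (assumption: same value as the code point)
                sb.append((char) octet);
            } else {
                // MSB set: base pointer plus 7-bit offset
                sb.append((char) (base + (octet & 0x7F)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] field = {(byte) 0x82, 0x03, 0x09, (byte) 0x95, 0x53, (byte) 0x80, (byte) 0x91};
        for (char c : decode82(field).toCharArray()) {
            System.out.printf("%04X%n", (int) c); // one code point per line
        }
    }
}
```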

How can we get that value?

The first and second octets are quite clear.

For the third and fourth octets, the Base Pointer, we can take the lowest value among the UCS2 characters that are not in the default alphabet. Of course, it is better to have a UCS2 look-up table that indicates the Base Pointer for each specific character set. In this example, I set the Base Pointer to '0995'.

The fifth octet is the character "S" ('0053'). Since it is in the default alphabet, the octet value is '53'.

The sixth octet is the character "ক" ('0995').
To encode this character, we calculate the additional offset from the Base Pointer.
Additional value = '0995' - '0995' = '00' = 0000000b (only 7 bit)
The coded character has MSB set to 1. Hence the value is (1 | 0000000)b = '80'

The seventh octet is the character for "দ" ('09A6').
Additional value = '09A6' - '0995' = '11' = 0010001b (only 7 bit)
The coded character has MSB set to 1. Hence the value is (1 | 0010001)b = '91'
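
The '82' steps can be sketched in plain Java as follows. The names are hypothetical, the Base Pointer is taken as the lowest non-default character (as in this example), and characters below '0080' are assumed to be default alphabet characters with the same code.

```java
final class Encode82Sketch {

    // Encodes a string as an '82' alpha field (4 + N octets).
    static byte[] encode82(String text) {
        if (text.length() > 255) {
            throw new IllegalArgumentException("character count must fit in one octet");
        }
        int min = Integer.MAX_VALUE;
        int max = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80) { // not coded as a default alphabet octet
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
        }
        if (max < 0) {
            min = 0; // only default alphabet characters: any base pointer works
        } else if (max - min > 127) {
            throw new IllegalArgumentException("offsets must fit in 7 bits");
        }
        byte[] out = new byte[4 + text.length()];
        out[0] = (byte) 0x82;
        out[1] = (byte) text.length();
        out[2] = (byte) (min >> 8); // base pointer, high octet
        out[3] = (byte) min;        // base pointer, low octet
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[4 + i] = (byte) (c < 0x80 ? c : (0x80 | (c - min)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : encode82("S\u0995\u09A6")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```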

Which one to choose: '80', '81', or '82'?

Whenever possible, use the '81' format.

Strong point: '81' requires the least memory, i.e. (3 + N) bytes for N characters.
Weak point: this format only works when all characters outside the default alphabet lie in the same 128-character half page, i.e. between 'XX00' and 'XX7F' or between 'XX80' and 'XXFF', and below '8000' because bit 16 of the Base Pointer is always zero.

If '81' is impossible, try the '82' format.

Strong point: '82' requires only slightly more memory than the '81' format, i.e. (4 + N) bytes.
Weak point: this format only works when all characters outside the default alphabet fit within 128 consecutive code points starting at the Base Pointer.

If '81' and '82' are not possible, you must use '80'.

Strong point: '80' can cover the whole UCS2 range from '0000' to 'FFFF'
Weak point: the number of bytes required is large, i.e. (1 + 2 * N) bytes
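
The selection order above can be written down as a small plain Java sketch. The names are hypothetical, characters below '0080' are assumed to be coded as default alphabet octets, and the sketch follows only the order given here. It does not check, for example, whether '80' would be just as short for a very short string.

```java
final class FormatChoiceSketch {

    // Returns 0x81, 0x82 or 0x80 following the preference order described above.
    static int chooseFormat(String text) {
        int min = Integer.MAX_VALUE;
        int max = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80) { // not coded as a default alphabet octet
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
        }
        if (max < 0) {
            return 0x81; // only default alphabet characters: '81' with base pointer 0
        }
        if (max <= 0x7FFF && (min & 0x7F80) == (max & 0x7F80)) {
            return 0x81; // all characters share one half page
        }
        if (max - min <= 127) {
            return 0x82; // offsets from the lowest character fit in 7 bits
        }
        return 0x80;
    }

    // Number of octets needed for n characters in the given format.
    static int encodedLength(int format, int n) {
        switch (format) {
            case 0x81: return 3 + n;
            case 0x82: return 4 + n;
            default:   return 1 + 2 * n;
        }
    }

    public static void main(String[] args) {
        int format = chooseFormat("S\u0995\u09A6");
        System.out.printf("%02X %d%n", format, encodedLength(format, 3)); // prints 81 6
    }
}
```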

Sample source in Java

The following is an example of Java Card source code (it relies on javacard.framework.Util) for UCS2 to Alpha Field conversion, and vice versa.

 import javacard.framework.Util;

 /**
  * Converts UCS2 characters into Alpha Fields format according to 3GPP TS 11.11 Appendix B, and   
  * vice versa.   
  *   
  * @author SDK  
  */  
 public class AlphaFields  
 {  
   
 /**  
  * Converts UCS2 alpha field into UCS2 bytes  
  *   
  * @param src  
  *      source byte array  
  * @param srcOff  
  *      offset to first octet ('80', '81', or '82') in source byte array  
  * @param srcLen  
  *      length of alpha field  
  * @param dest  
  *      destination byte array  
  * @param destOff  
  *      offset to store the result in destination  
  * @return number of bytes stored in destination  
  */  
 public static short convertAlphaFieldToUcs2(byte[] src, short srcOff, short srcLen, byte[] dest,  
     short destOff)  
 {  
   short base;     // base of UCS2 page for '81' and '82' format  
   short nChar = 0;  // number of UCS2 characters  
   short i;      // loop counter  
   
   switch (src[srcOff])  
   {  
   case (byte) 0x80:  
     // if first octet is '80', any following bytes are 16 bit UCS2 characters   
     // copy all the bytes to destination buffer, excluding the '80' byte  
     srcLen--;  
     Util.arrayCopyNonAtomic(src, (short) (srcOff + 1), dest, destOff, srcLen);  
     return srcLen;  
   
   case (byte) 0x81:  
     // if first octet is '81', second octet is number of characters  
     nChar = Util.makeShort((byte) 0, src[(short) (srcOff + 1)]);  
     // third octet holds bit 15 to bit 8 of the base pointer: 0hhhhhhhh0000000
     // shift the octet left 7 times to get the 16-bit base pointer
     base = (short) ((short) (src[(short) (srcOff + 2)] & 0x00FF) << 7);  
     // skip 3 bytes ('81', number of characters, and base pointer)  
     srcOff += 3;  
     // jump to for loop below  
     break;  
   
   case (byte) 0x82:  
     // if first octet is '82', second octet is number of characters
     nChar = Util.makeShort((byte) 0, src[(short) (srcOff + 1)]);
     // third and fourth octet are 16-bit base pointer
     base = Util.getShort(src, (short) (srcOff + 2));
     // skip 4 bytes ('82', number of characters, and 2 bytes base pointer)
     srcOff += 4;  
     break;  
   
   default:  
     // handle of unknown format  
     return 0;  
   }  
   
   // for every byte in data under '81' and '82'  
   for (i = 0; i < nChar; i++)  
   {  
     // if MSB is not set, meaning GSM default alphabet, set the output into 00XX  
     if (src[srcOff] >= 0)  
     {  
       dest[destOff] = 0;  
       dest[(short) (destOff + 1)] = src[srcOff];  
     }  
     // if MSB is set, meaning the UCS2 character is base pointer plus 7-bit of the value  
     else  
     {  
       Util.setShort(dest, destOff, (short) (base + (byte) (src[srcOff] & 0x7F)));  
     }  
     // next iteration  
     srcOff++;  
     destOff += 2;  
   }  
   
   // return number of UCS2 bytes  
   return (short) (nChar * 2);  
 }  
   
 /**  
  * Converts UCS2 bytes into Alpha field format. The conversion is made automatically to use '81' or  
  * '82' for optimization purposes.  
  * <p>   
  *   
  * @param src  
  *      source byte array  
  * @param srcOff  
  *      offset to first UCS2 byte in source byte array  
  * @param srcLen  
  *      number of UCS2 bytes   
  * @param dest  
  *      destination byte array  
  * @param destOff  
  *      offset to store the result in destination  
  * @return  
  *      number of bytes stored in destination  
  */  
 public static short convertUcs2ToAlphaField(byte[] src, short srcOff, short srcLen, byte[] dest,  
     short destOff)  
 {  
   short i;              // looping counter  
   short min = (short) 0x7FFF;     // the minimum range  
   short max = (short) 0;       // the maximum range  
   short temp;             // temporary short  
   short outOff;            // offset in destination byte array   
   
   // Scan for the character range only if there is more than one UCS2 character (srcLen counts bytes)
   if (srcLen > 2)  
   {  
     // Determine the minimum and maximum range of all characters  
     for (i = 0; i < srcLen; i += 2)  
     {  
       if (src[(short) (srcOff + i)] != 0)  
       {  
         temp = Util.getShort(src, (short) (srcOff + i));  
         // Cannot process UCS2 page for range 8000 to FFFF  
         if (temp < 0)  
         {  
           // set max to min+130 so that the range check below fails and '80' coding is used
           max = (short) (min + 130);  
           break;  
         }  
         if (min > temp)  
         {  
           min = temp;  
         }  
         if (max < temp)  
         {  
           max = temp;  
         }  
       }  
     }  
   }  
   
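   // No character outside the '00XX' page was found: use '0000' as base pointer
   if (max == 0)
   {
     min = 0;
   }

   // '81' and '82' only pay off for more than 2 characters (srcLen counts bytes), and all
   // characters must fit in a half page (7-bit offsets, 0 to 127)
   if (srcLen > 4 && (short) (max - min) < (short) 128)
   {
     // Set number of characters for both '81' and '82' format
     dest[(short) (destOff + 1)] = (byte) (srcLen / 2);

     // If the bit15 to bit8 for minimum and maximum are the same, we can use '81' format
     // Since we have checked that the range is less than 128, we can simply check bit8
     if ((byte) (min & 0x80) == (byte) (max & 0x80))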
     {  
       // Alpha field 81  
       dest[destOff] = (byte) 0x81;  
       // Base pointer bit15 to bit8 (all 8 bits of the octet are significant)
       min = (short) (min & 0x7F80);
       dest[(short) (destOff + 2)] = (byte) ((short) (min >> 7) & 0xFF);
       outOff = (short) (destOff + 3);  
     }  
     // Otherwise the bit8 has conflict and we shall use '82' format  
     // Example:  
     // min = 0x0514 = 0000 0101 0001 0100  
     // max = 0x0593 = 0000 0101 1001 0011  
     //         *********^  
     //             +------------ conflict bit8  
     else  
     {  
       // Alpha field 82  
       dest[destOff] = (byte) 0x82;  
       // Set minimum range as base pointer (two bytes)    
       outOff = Util.setShort(dest, (short) (destOff + 2), min);  
     }  
   
     // For each UCS2 characters  
     for (i = 0; i < srcLen; i += 2)  
     {  
       // If high byte is '00', the character is default alphabet format  
       // We set the value using 7 bit of the low byte   
       if (src[(short) (srcOff + i)] == 0)  
       {  
         dest[outOff] = (byte) (src[(short) (srcOff + i + 1)] & 0x7F);  
       }  
       // If the high byte is not '00', then get the difference between the character code  
       // and the minimum. Assign the value as 7 bit difference and set the MSB  
       else  
       {  
         temp = (short) (Util.getShort(src, (short) (srcOff + i)) - min);  
         dest[outOff] = (byte) (temp | 0x80);  
       }  
       // next iteration  
       outOff++;  
     }  
     // return the output length (3+N bytes for '81' and 4+N bytes for '82')  
     return (short) (outOff - destOff);  
   }  
   
   // The characters cannot fit into a half page, or a character is above 0x7FFF: must use '80' coding
   // first octet is 0x80  
   dest[destOff] = (byte) 0x80;  
   // following octets are the UCS2 bytes  
   Util.arrayCopyNonAtomic(src, srcOff, dest, (short) (destOff + 1), srcLen);  
   // return the output length (1 + 2 * N bytes for '80', N being the number of characters)
   return (short) (srcLen + 1);
 }

 }