Friday, December 24, 2010

Implementing Multiple Languages in MySQL

How do you enable multiple-language support in MySQL?

For a better understanding, first read "What is Unicode and how does it work?". Then go through this article, which is entirely practical.

Steps:
1. List all the character sets MySQL supports:
mysql> SHOW CHARACTER SET;

2. List all the collations for a particular character set:
mysql> SHOW COLLATION LIKE '%latin%';

This returns the collations for the latin character sets.

3. We are going to use the utf8 character set.

Collation is resolved by priority: a COLLATE clause in the query comes first, then the column definition, then the table default, and finally the database default.
So if you run this query:

  mysql> SELECT c1 COLLATE utf8_bin FROM aa;

it overrides every collation defined elsewhere and uses utf8_bin. If the query does not name a collation, the column collation is used; failing that, the table collation, and so on.
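
The priority order described above can be modelled as a simple lookup. This is only a toy sketch in Python, not how MySQL is implemented; the function name resolve_collation and its parameters are made up for illustration.

```python
def resolve_collation(query=None, column=None, table=None, database=None):
    """Return (level, collation) for the highest-priority level that is set.

    Hypothetical helper: it only mirrors the order query > column > table
    > database described in the text.
    """
    levels = (
        ("query", query),
        ("column", column),
        ("table", table),
        ("database", database),
    )
    for level, collation in levels:
        if collation is not None:
            return level, collation
    raise ValueError("no collation defined at any level")


# A COLLATE clause in the query wins over everything else.
print(resolve_collation(query="utf8_bin", table="latin1_swedish_ci"))

# Without it, the next defined level is used.
print(resolve_collation(table="latin1_swedish_ci", database="utf8_general_ci"))
```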

4. Remember one thing: for this to work, the character set and collation on the client side must match the ones configured on the server side.

SET character_set_client = 'utf8';
SET character_set_connection = 'utf8';
SET character_set_database = 'utf8';
SET character_set_results = 'utf8';
SET character_set_server = 'utf8';  

This keeps the character set at utf8 on the server side, and all further work proceeds on that basis. You can set these values in the my.cnf file, or run these statements at the start of each session.
Your character set is now utf8 everywhere, and next you need to set the collation:

SET collation_connection = 'utf8_bin';
SET collation_database = 'utf8_bin';
SET collation_server = 'utf8_bin';

This sets the collation of the server.

7. Now your database is ready to deal with multiple languages. Create a table:

CREATE TABLE aa
(
c1 VARCHAR(20)
);

INSERT INTO aa VALUES ('Њ');

This inserts the data. Now run a SELECT query and you should get the same character back. If you get a question mark or a junk value instead, the configuration is not complete yet.
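
Where the question mark and the junk come from can be reproduced outside MySQL. The following is a small Python sketch, not MySQL itself: latin-1 and cp1252 are stand-ins I picked for "a charset that is not utf8", and the variable names are arbitrary.

```python
text = "\u040a"  # the character inserted above (CYRILLIC CAPITAL LETTER NJE)

# Stored as UTF-8, the character survives as two bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\xd0\x8a'

# A charset with no slot for this character can only keep a placeholder.
lossy = text.encode("latin-1", errors="replace")
print(lossy)  # b'?'

# Correct UTF-8 bytes read back with the wrong charset turn into junk.
junk = utf8_bytes.decode("cp1252")
print(junk == text)  # False

# Read back as UTF-8, the original character returns.
print(utf8_bytes.decode("utf-8") == text)  # True
```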


8. Now change the table-level or column-level collation (whichever has the higher priority) and check again: the same value is displayed as junk. That confirms the settings are taking effect, and your table is ready to accept multilingual values.

Now enable the Hindi language in the OS.

Step A: To enable Indic languages in Windows XP
1. Go to Start > Control Panel > Regional and Language Options and click
the Languages tab.
Tick the check box to Install files for complex scripts... and click OK.

2. A confirmation message appears. Click OK.

3. You will be required to place the Windows XP CD in the CD drive to enable
Indic languages, including Hindi.
4. Reboot the system.


Thanks and Regards
Kamesh
(First ever blog, 100% written and devised by me)


What is UTF-8?
UTF-8 stands for UCS Transformation Format, 8-bit, where UCS stands for Universal Character Set. It is the most widely used encoding on the web.

Start with ASCII. ASCII defines 128 values, numbered 0 to 127. They cover the alphabet in lower case (a to z) and upper case (A to Z), the digits, some non-printable characters such as carriage return and newline (\n), and the punctuation symbols needed to write a proper sentence. You can say it contains every symbol on the keyboard. The next question is how ASCII is stored in a computer, since, as you know, all values are stored on disk as bits (0 and 1).

[Image: ASCII list with its characters]


The second major point is how it is stored on disk.

By now it should be clear how the computer deals with ASCII and how it is stored. In the end, everything on your computer lives and breathes in 0s and 1s.
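
To make the ASCII-to-bits mapping concrete, here is a short Python sketch. The helper name ascii_bits is made up for this example; it prints the number and the 8-bit pattern behind a few characters.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit pattern stored for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")


# An upper-case letter, a lower-case letter, and the newline control character.
for ch in "Az\n":
    print(repr(ch), ord(ch), ascii_bits(ch))
# 'A' 65 01000001
# 'z' 122 01111010
# '\n' 10 00001010
```

Note that the first bit is always 0, because ASCII only uses the values 0 to 127.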

ASCII contained only the English characters and some punctuation symbols. But a byte can hold 256 different values (0 to 255), and only 128 of them were allotted to standard ASCII, so the question became what to do with the other 128. Big players in IT started using them for whatever symbols and languages they needed, but none of this was standardized worldwide. One machine would read an extended ASCII value as one character and another machine would read it as something else, which created a mess. This is where UTF-8 came into the picture.
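
The extended ASCII mess is easy to reproduce. In this Python sketch one byte above 127 is decoded with three legacy code pages; the three code pages are just examples I chose, not ones named in this post.

```python
raw = bytes([0xE9])  # one byte from the upper 128 values

# The same byte means a different character in each legacy code page.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "iso8859_5")}

for name, ch in decoded.items():
    # Print the Unicode code point so the output stays plain ASCII.
    print(f"{name}: U+{ord(ch):04X}")
```

Three machines, one byte, three different characters.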

To understand how UTF-8 works, you first need to know a few terms: collation, encoding, and character set. For those, follow the link below.


What are collations and charsets? (click on the link)


Presuming you are familiar with the terms mentioned above, we can proceed. As I said, UTF-8 uses up to 4 bytes per character.
The first range is the same as ASCII and nothing changes there, so the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
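
The ranges above can be written down as a small Python function and checked against Python's own UTF-8 encoder. The function name utf8_length and the sample characters are my own choices for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a Unicode code point."""
    if code_point < 0x80:        # 0..127: the ASCII range
        return 1
    if code_point < 0x800:       # the next 1,920 code points
        return 2
    if code_point < 0x10000:     # rest of the Basic Multilingual Plane
        return 3
    if code_point <= 0x10FFFF:   # the other planes
        return 4
    raise ValueError("not a valid Unicode code point")


# One sample code point per range, compared with the built-in encoder.
for cp in (0x41, 0x16A, 0x939, 0x1F600):
    built_in = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), built-in says {built_in}")
```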


How is Unicode stored on disk?

To understand the way UTF-8 works, we have to examine the binary representation of each byte. If the first bit (the high-order bit) is zero, then it’s a single-byte character, and we can directly map its remaining bits to the Unicode characters 0 – 127. If the first bit is a one, then this byte is part of a multi-byte character (either its first byte or one of the continuation bytes that follow).

For a multi-byte character (any character whose Unicode number is 128 or above), we need to know how many bytes will make up this character. This is stored in the leading bits of the first byte in the character. We can identify how many total bytes will make up this character by counting the number of leading 1’s before we encounter the first 0. Thus, for the first byte in a multi-byte character, 110xxxxx represents a two-byte character, 1110xxxx represents a three-byte character, and so on.
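
The counting rule can be sketched in a few lines of Python. The function name sequence_length is made up for this example; it looks only at the first byte of a character.

```python
def sequence_length(first_byte: int) -> int:
    """Total bytes in a UTF-8 character, judged from its first byte."""
    if first_byte & 0x80 == 0:
        return 1  # high bit is 0: a single-byte (ASCII) character

    # Count the leading 1 bits before the first 0.
    ones = 0
    mask = 0x80
    while first_byte & mask:
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    if ones > 4:
        raise ValueError("UTF-8 characters are at most 4 bytes long")
    return ones


print(sequence_length(0b01000001))  # 1
print(sequence_length(0b11000101))  # 2
print(sequence_length(0b11100000))  # 3
print(sequence_length(0b11110000))  # 4
```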


Let's go through an example.

Take the character Ū (U+016A), the one shown in the image. Its decimal value is 362, and it requires 2 bytes to store in UTF-8.
In binary, 362 is 101101010. In hexadecimal it is 16A, and written out as three hex digits in binary that is 0001 0110 1010.
UTF-8 has its own encoding standard, though, so the value is rearranged to follow the leading-ones rules above. After that we get the hex value C5AA, and that is what is stored on disk.


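How it will be stored
The first bit is 1 because this is a multi-byte character. One more byte follows the first one, so the second bit is also 1.
That gives 11xxxxxx:xxxxxxxx.

The run of ones ends with a zero, so the third bit is 0: 110xxxxx:xxxxxxxx (1110xxxx would mark a three-byte character, and so on). The second byte is a continuation byte, and continuation bytes always start with 10, which gives 110xxxxx:10xxxxxx.

With the prefixes in place, the bits of the character fill the x positions. There are 11 of them, so 362 is written as 00101101010: the first five bits (00101) go into the first byte and the last six (101010) into the second.

So the UTF-8 binary form of the character is 11000101:10101010, which is C5AA in hex.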
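
The same steps can be done by hand in Python and compared with the built-in encoder. This is a sketch for the two-byte case only, and the function name encode_two_byte is my own.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point from the two-byte range (128..2047) as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    first = 0b11000000 | (code_point >> 6)           # 110 + top five bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10 + low six bits
    return bytes([first, second])


encoded = encode_two_byte(362)
print(" ".join(format(b, "08b") for b in encoded))  # 11000101 10101010
print(encoded.hex().upper())                        # C5AA

# The built-in encoder agrees with the manual bit work.
print(encoded == chr(362).encode("utf-8"))          # True
```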




What is an encoded byte in UTF-8?

The encoded bytes are the hexadecimal value stored for a particular symbol or character, for example C5AA for the character in the example above.