Thursday, December 23, 2010

Maintaining hierarchical data in Flat database tables

Introduction hierarchical 

Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for. The tables of a relational database are not hierarchical (like XML), but are simply a flat list. Hierarchical data has a parent-child relationship that is not naturally represented in a relational database table.
For our purposes, hierarchical data is a collection of data where each item has a single parent and zero or more children (with the exception of the root item, which has no parent). Hierarchical data can be found in a variety of database applications, including forum and mailing list threads, business organization charts, content management categories, and product categories. For our purposes we will use the following product category hierarchy from an fictional electronics store:
These categories form a hierarchy in much the same way as the other examples cited above. In this article we will examine two models for dealing with hierarchical data in MySQL, starting with the traditional adjacency list model.

The Adjacency List Model

Typically the example categories shown above will be stored in a table like the following (I'm including full CREATE and INSERT statements so you can follow along):

CREATE TABLE category(
        category_id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(20) NOT NULL,
        parent INT DEFAULT NULL
);

INSERT INTO category VALUES(1,'ELECTRONICS',NULL),(2,'TELEVISIONS',1),(3,'TUBE',2),
        (4,'LCD',2),(5,'PLASMA',2),(6,'PORTABLE ELECTRONICS',1),(7,'MP3 PLAYERS',6),(8,'FLASH',7),
        (9,'CD PLAYERS',6),(10,'2 WAY RADIOS',6);

SELECT * FROM category ORDER BY category_id;
+-------------+----------------------+--------+
| category_id | name                 | parent |
+-------------+----------------------+--------+
|           1 | ELECTRONICS          |   NULL |
|           2 | TELEVISIONS          |      1 |
|           3 | TUBE                 |      2 |
|           4 | LCD                  |      2 |
|           5 | PLASMA               |      2 |
|           6 | PORTABLE ELECTRONICS |      1 |
|           7 | MP3 PLAYERS          |      6 |
|           8 | FLASH                |      7 |
|           9 | CD PLAYERS           |      6 |
|          10 | 2 WAY RADIOS         |      6 |
+-------------+----------------------+--------+
10 rows in set (0.00 sec)

In the adjacency list model, each item in the table contains a pointer to its parent. The topmost element, in this caseelectronics, has a NULL value for its parent. The adjacency list model has the advantage of being quite simple, it is easy to see thatFLASH is a child ofmp3 players, which is a child of portable electronics, which is a child of electronics. While the adjacency list model can be dealt with fairly easily in client-side code, working with the model can be more problematic in pure SQL.

Retrieving a Full Tree

The first common task when dealing with hierarchical data is the display of the entire tree, usually with some form of indentation. The most common way of doing this is in pure SQL is through the use of a self-join:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4
FROM category AS t1
LEFT JOIN category AS t2 ON t2.parent = t1.category_id
LEFT JOIN category AS t3 ON t3.parent = t2.category_id
LEFT JOIN category AS t4 ON t4.parent = t3.category_id
WHERE t1.name = 'ELECTRONICS';

+-------------+----------------------+--------------+-------+
| lev1        | lev2                 | lev3         | lev4  |
+-------------+----------------------+--------------+-------+
| ELECTRONICS | TELEVISIONS          | TUBE         | NULL  |
| ELECTRONICS | TELEVISIONS          | LCD          | NULL  |
| ELECTRONICS | TELEVISIONS          | PLASMA       | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS  | FLASH |
| ELECTRONICS | PORTABLE ELECTRONICS | CD PLAYERS   | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | 2 WAY RADIOS | NULL  |
+-------------+----------------------+--------------+-------+
6 rows in set (0.00 sec)

Finding all the Leaf Nodes

We can find all the leaf nodes in our tree (those with no children) by using a LEFT JOIN query:

SELECT t1.name FROM
category AS t1 LEFT JOIN category as t2
ON t1.category_id = t2.parent
WHERE t2.category_id IS NULL;

+--------------+
| name         |
+--------------+
| TUBE         |
| LCD          |
| PLASMA       |
| FLASH        |
| CD PLAYERS   |
| 2 WAY RADIOS |
+--------------+

Retrieving a Single Path

The self-join also allows us to see the full path through our hierarchies:

SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4
FROM category AS t1
LEFT JOIN category AS t2 ON t2.parent = t1.category_id
LEFT JOIN category AS t3 ON t3.parent = t2.category_id
LEFT JOIN category AS t4 ON t4.parent = t3.category_id
WHERE t1.name = 'ELECTRONICS' AND t4.name = 'FLASH';

+-------------+----------------------+-------------+-------+
| lev1        | lev2                 | lev3        | lev4  |
+-------------+----------------------+-------------+-------+
| ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS | FLASH |
+-------------+----------------------+-------------+-------+
1 row in set (0.01 sec)

The main limitation of such an approach is that you need one self-join for every level in the hierarchy, and performance will naturally degrade with each level added as the joining grows in complexity.

Limitations of the Adjacency List Model

Working with the adjacency list model in pure SQL can be difficult at best. Before being able to see the full path of a category we have to know the level at which it resides. In addition, special care must be taken when deleting nodes because of the potential for orphaning an entire sub-tree in the process (delete the portable electronics category and all of its children are orphaned). Some of these limitations can be addressed through the use of client-side code or stored procedures. With a procedural language we can start at the bottom of the tree and iterate upwards to return the full tree or a single path. We can also use procedural programming to delete nodes without orphaning entire sub-trees by promoting one child element and re-ordering the remaining children to point to the new parent.

Source : Net-search i came across this article

New to Unicode ??? Go through this

In order for a computer to be able to store text and numbers that humans can understand, there needs to be a code that transforms characters into numbers. The Unicode standard defines such a code by using character encoding.

Character Encoding

All character encoding does is assign a number to every character that can be used. I could if I really wanted to make a character encoding right now. For example, I could say "A" becomes the number 13, "a" = 14, "1" = 33, "#" = 123 and so on. My character encoding scheme might work brilliantly on my computer but when I come to send some text to another computer I will have problems. It won't know what I'm talking about unless it understands my encoding scheme too.
This is where industry wide standards come in. If the whole computer industry uses the same character encoding scheme, all the computers can display the same characters.

What is Unicode?

ASCII which stands for American Standard Code for Information Interchange became the first widespread encoding scheme. However, it is limited to only 128 character definitions. Which is fine for the most common English characters, numbers and punctuation but is a bit limiting for the rest of the world. They naturally wanted to be able to encode their characters too. And, for a little while depending on where you were, there might be a different character being displayed for the same ASCII code. In the end, the other parts of the world began creating their own encoding schemes and things started to get a little bit confusing. Not only were the coding schemes of different lengths, programs needed to figure out which encoding scheme they were meant to be using.
It became apparent that a new character encoding scheme was needed and the Unicode standard was created. The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible. These days the Unicode standard defines values for over 100,000 characters and can be seen at the Unicode Consortium. It has several character encoding forms, UTF standing for Unicode Transformation Unit:
  • UTF-8: only uses one byte (8 bits) to encode English characters. It can use a sequence of bytes to encode the other characters. UTF-8 is widely used in email systems and on the Internet.
  • UTF-16: uses two bytes (16 bits) to encode the most commonly used characters. If needed, the additional characters can be represented by a pair of 16-bit numbers.
  • UTF-32: uses four bytes (32 bits) to encode the characters. It became apparent that as the Unicode standard grew a 16-bit number is too small to represent all the characters. UTF-32 is capable of representing every Unicode character as one number.

Code Points

A code point is the value that a character is given in the Unicode standard. The values according to Unicode are written as hexadecimal numbers and have a prefix of "U+". For example to encode the characters I looked at earlier, "A" is U+0041, "a" is U+0061, "1" is U+0031, "#" is U+0023. These code points are split into 17 different sections called planes. Each plane holds 65,536 code points. The first plane, which holds the most commonly used characters, is known as the basic multilingual plane.

Code Units

The encoding schemes are made up of code units. They are way to provide an index for where a character is positioned on a plane. For instance, with UTF-16 each 16-bit number is a code unit. The code units can be transformed into code points. For example, the flat note symbol "♭" has a code point of U+1D160 and it lives on the second plane of the Unicode standard. It would be encoded using the combination of the following two 16-bit code units: U+D834 and U+DD60 .
For the basic multilingual plane the values of the code points and code units are identical. This allows a shortcut for UTF-16 that saves a lot of storage space. It only needs to use one 16-bit number to represent those characters.