
This article is for beginners who are often puzzled by the big term "Unicode", and for those who ask questions like "how do I store non-English or non-ASCII text in a database and get it back?" A few months ago I was in the same situation, and most of the questions I saw came down to the same thing: "how do I get non-ASCII text from the database and print it in the application?" This article is meant to address all of these questions for users and beginner programmers.

This article will help you understand what Unicode is and why it is used. I will also explain a few points about its encodings (what UTF-8 and UTF-16 are, how they differ, and why to use them), and then move on to using these characters in several kinds of .NET applications. I will also use an ASP.NET web application to show the same thing in a web-based environment. The .NET Framework supports all of these encodings and code pages, so you can share your data among various products that understand the Unicode standard. It also provides multiple classes to help you kick-start an application that uses Unicode characters to support global languages.

Finally, I will be using a database example (I will be using Microsoft SQL Server) to show how to write and extract the data from the database. It is quite simple, no big deal at least for me. Once that has been done, you can download and execute the commands on your machine to test the Unicode characters yourself. Let us begin now.

I will not cover the Unicode standard in depth; instead, I will focus on the .NET implementation of Unicode. Also note that the character codes in this article are given in decimal form, not in the hexadecimal U+XXXX notation. At the end, I will also show how to convert a decimal value into its hexadecimal form.
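Since both notations come up whenever people talk about character codes, here is a minimal sketch of moving between a decimal value and the U+XXXX form. The class and method names (CodePointFormat, ToUPlus, FromUPlus) are my own illustrative choices and not part of the .NET Framework; only Int32.ToString and Convert.ToInt32 are framework calls.

```csharp
using System;

// Hypothetical helper for illustration; the names are assumptions.
public static class CodePointFormat
{
    // Formats a decimal character code in the U+XXXX notation.
    public static string ToUPlus(int codePoint)
    {
        // "X4" produces uppercase hexadecimal, padded to at least four digits.
        return "U+" + codePoint.ToString("X4");
    }

    // Parses the hexadecimal part of "U+XXXX" back into a decimal value.
    public static int FromUPlus(string notation)
    {
        return Convert.ToInt32(notation.Substring(2), 16);
    }
}
```

For example, CodePointFormat.ToUPlus(97) gives U+0061 for "a", and CodePointFormat.FromUPlus("U+0915") gives 2325 for the Hindi character क.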

Starting with Unicode

What Unicode is

Unicode is a standard for character encoding. You can think of it as a standard for converting every character to its binary notation and every binary notation back to its character. A computer can only store binary data, which is why text has to be converted into a binary representation before it can be stored on the machine.

Originally, developers and programmers had few schemes for representing data in languages other than English, largely because application globalization was not common back then. Only English was used, and the early code pages included codes for English letters (lower and upper case) and some special characters. ASCII is one of them: it encodes 128 characters in 7 bits. ASCII does not only include codes for text; it also includes control codes that direct how text should be rendered, many of which are no longer used. It was the most widely used standard because the technology was limited and it fulfilled the needs of its time.

As computers became more widely used, developers wanted their applications to work in their clients' own locales, and a new standard was required. Otherwise, every developer could have created a code page of his own to represent various characters, and that would have removed the unity among machines. Unicode originated in the late 1980s (see the history section in Wikipedia) but was slow to be adopted because of its larger size of 2 bytes for every character. It could represent far more characters than ASCII: the original 2-byte design allowed 65,536 characters. That later proved to be too few, and today Unicode defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), enough for all of the world's current writing systems. That is why Unicode is used so widely: it supports characters globally and ensures that text sent from one machine is mapped back to the correct string on another, with no data lost (by data loss I mean sentences not being rendered back correctly).

Unicode is Different

Beginners stumble upon UTF-8, UTF-16, and UTF-32, and then finally upon Unicode, and they think these are all different things. Well, no, they are not. The actual thing is just Unicode, a standard. UTF-8, UTF-16, and UTF-32 are the names of encoding schemes that store Unicode characters in units of different sizes. UTF-8 uses 1 byte for a character where it can (but remember, it can span up to 4 bytes if required; at the end of this article I will explain which of these schemes you should use and why, so please read the article to the end), and so on.

UTF-8

UTF-8 is a variable-length Unicode encoding. Its basic unit is 8 bits (1 byte), but a character can span up to 4 bytes, so this scheme can hold every Unicode character. It was designed to be backward compatible with ASCII, for machines and software that do not support Unicode at all: the first 128 characters are the ASCII codes and take 1 byte each. The next 1,920 characters take 2 bytes each and cover most of the widely used alphabets, such as the rest of the Latin script, Greek, Arabic, and so on. All remaining code points take 3 or 4 bytes. (See the Wikipedia article for UTF-8.)
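To see the variable length for yourself, here is a small sketch that asks .NET how many bytes UTF-8 needs for a piece of text. Utf8Sizes and its methods are hypothetical names of my own; Encoding.UTF8.GetByteCount, Encoding.UTF8.GetBytes, and BitConverter.ToString are the framework calls doing the work.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class Utf8Sizes
{
    // Number of bytes UTF-8 needs to encode the text.
    public static int ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // The UTF-8 bytes as hexadecimal pairs separated by dashes.
    public static string Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

With this, ByteCount gives 1 for "a", 2 for "\u03B1" (α), 3 for "\u0915" (क), and 4 for "\U0001F600" (an emoji outside the first 65,536 code points). Hex("\u0915") gives E0-A4-95.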

UTF-16

UTF-16 is also a variable-length Unicode encoding; the difference is that its unit is 2 bytes, so a character takes either 2 bytes or 4 bytes depending on its code point. It was initially a fixed 2-byte encoding, but it was made variable-length because 2 bytes are not enough for every character.
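The 2-byte and 4-byte cases can be checked with a short sketch. Utf16Units is a hypothetical helper name of my own; Encoding.Unicode is the .NET name for UTF-16 (little endian), and char.IsSurrogatePair reports whether two 2-byte units together form one character.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class Utf16Units
{
    // Number of bytes UTF-16 needs to encode the text.
    public static int ByteCount(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }

    // True when the text starts with a character that needs two 2-byte units.
    public static bool UsesSurrogatePair(string text)
    {
        return text.Length >= 2 && char.IsSurrogatePair(text, 0);
    }
}
```

For "\u0915" (क) this reports 2 bytes and no surrogate pair; for "\U0001F600" it reports 4 bytes and a surrogate pair.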

UTF-32

UTF-32 uses exactly 32 bits (4 bytes) per character. Regardless of code point, character set, or language, this encoding always uses 4 bytes for each character. The only good thing about UTF-32 (as noted in Wikipedia) is that the characters are directly indexable, which is not possible in the variable-length UTF encodings. I believe its biggest disadvantage is the 4-byte size per character, even if you are only going to use Latin or ASCII characters.
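Here is a rough sketch of what "directly indexable" means: because every character is exactly 4 bytes, the n-th character always starts at byte n * 4. Utf32Index is a hypothetical name of my own. Encoding.UTF32 in .NET is little endian, so the sketch assembles the value from the bytes in that order.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class Utf32Index
{
    // Number of bytes UTF-32 needs to encode the text.
    public static int ByteCount(string text)
    {
        return Encoding.UTF32.GetByteCount(text);
    }

    // Returns the code of the character at the given position (0-based),
    // found by jumping straight to byte index * 4.
    public static int CodePointAt(string text, int index)
    {
        byte[] bytes = Encoding.UTF32.GetBytes(text);
        int offset = index * 4;
        // Little-endian: the least significant byte comes first.
        return bytes[offset]
            | (bytes[offset + 1] << 8)
            | (bytes[offset + 2] << 16)
            | (bytes[offset + 3] << 24);
    }
}
```

For the three-character text "a\U0001F600\u0915", ByteCount gives 12, CodePointAt(text, 1) gives 128512, and CodePointAt(text, 2) gives 2325, with no scanning of the earlier characters.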

Getting to the .NET Framework

Enough of the small background on the Unicode standard. Now I will continue with an overview of the support for Unicode in the .NET Framework. That support is based on the primitive type char. A char in the .NET Framework is 2 bytes and holds one UTF-16 code unit. You can generally choose whichever Unicode encoding you want when reading or writing characters and strings, but in memory you can think of the default as UTF-16 (2 bytes).
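One practical consequence of the 2-byte char is that a string's Length counts 2-byte units, not characters. The sketch below is only an illustration under that assumption: CharFacts is a hypothetical name of my own, and it counts whole characters by treating each surrogate pair as one.

```csharp
using System;

// Hypothetical helper for illustration; the names are assumptions.
public static class CharFacts
{
    // Size of a .NET char in bytes.
    public static int BytesPerChar()
    {
        return sizeof(char);
    }

    // Counts Unicode characters (code points) rather than 2-byte units.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A surrogate pair is two chars that together form one character.
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the second half of the pair
            }
            count++;
        }
        return count;
    }
}
```

For "a\U0001F600", Length is 3 while CountCodePoints gives 2, and BytesPerChar gives 2.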

Char (C# Reference) .NET documentation

Char Structure (System)

The preceding documents contain different content but are similar: char is the System.Char structure in the .NET Framework. By default, the .NET Framework supports Unicode characters and renders them on the screen without any separate code from you; you only need to make sure the encoding of the data source is correct. All of the application types in the .NET Framework support Unicode, such as WPF, WCF, and ASP.NET applications. You can use Unicode characters in all of them, and .NET will render the codes as their characters. Do read the following section.

Console applications

Console applications are a good point to note here, because I said that every .NET application supports Unicode but I didn't mention console applications. Well, the problem is generally not the Unicode support, nor the platform, nor the console framework itself. It is that console applications do not support graphics. Yes, displaying a variety of characters is a graphical matter, and you should read about glyphs.

When I started testing Unicode support in console applications, I was surprised to see that it doesn't depend only on the underlying framework or the library being used; there is another factor that you should consider: the font family of your console. If you open the properties of your console window, you will find multiple fonts to choose from.

Let us now try a few basic characters from the range 0-127, then move beyond that range to see how the console application behaves and how other applications respond to the same data.

ASCII codes

First I will try ASCII codes (well a very basic one, "a") in the code to see if the console behaves correctly or messes something up. I used the following code to be executed in the console application:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleUnicode
{
    class Program
    {
        static void Main(string[] args)
        {
            // Characters to try: a α क
            char a = 'a';
            // Print the character and its code. Note that the first UTF-8 byte
            // equals the character code only for ASCII; (int)a gives the code in general.
            Console.WriteLine(String.Format("{0} character has code: {1}", a,
                Encoding.UTF8.GetBytes(a.ToString())[0].ToString()));
            // Just for the sake of pausing
            Console.Read();
        }
    }
}

The response to this code was like this:

You can see that there is no difference here between ASCII and Unicode, because "a" is 97 in both of them. That was pretty basic. Now, let us take a step further.

Non-ASCII codes

Let us now try Greek letters, the first one in the row, alpha. If we execute code similar to the preceding and replace the "a" with alpha, you will see the following result:

Well so far so good.

Let us take a big step now, why not try Hindi? People regularly ask how to store Hindi letters in a database and get them back. Let us now try Hindi characters in the console application.

Nope, I didn't type a question mark! That was meant to be a "k" sounding character in Hindi, but it isn't. It is a "q" sounding question mark. Why is that so?

That was not a problem with Unicode, but with the console application's limited support for global fonts. To support this answer, I wrote another line of code that stores the character in a text file that supports Unicode. The following code writes the bytes of the character to the file (using UTF-8 encoding).

Note: Notepad can support ASCII, Unicode, and other schemes, so be sure your file supports the character set before saving the data.

File.WriteAllBytes("F:\\file.txt", Encoding.UTF8.GetBytes(a.ToString()));

The preceding code was executed for the same character. On the console a question mark was printed, but the file showed something else.

This shows that the characters are widely supported in the .NET Framework, but the font matters too. The glyph for a character has to be available in the font for the character to be rendered; otherwise, the application just shows a substitute such as the question mark (in other frameworks, a square box denotes that the character is not supported).
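The file experiment can also be run as a round trip: write the text as UTF-8, read it back, and compare. This is only a sketch; Utf8FileRoundTrip is a hypothetical name of my own, and the path is whatever file you choose to pass in. File.WriteAllText and File.ReadAllText are the framework calls.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class Utf8FileRoundTrip
{
    // Writes the text to the file as UTF-8, then reads it back as UTF-8.
    public static string WriteThenRead(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

If WriteThenRead(path, "\u0915") returns the same string that went in, the character survived the trip through the file, whatever the console chose to display.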

Conclusion

So according to this hypothesis, if your console application has a problem displaying Unicode characters, you need to ensure that the character you're trying to display is supported by the font family you're using. The Hindi character above is exactly such a case: the console's font family does not support it. This ends the discussion of Unicode characters in console applications, until you update the font family to one that supports that code page (or at least that code point).
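Besides the font, the console's output encoding can also decide whether a character survives: if the output encoding cannot represent a character, it is replaced (typically with a question mark) before the font is even consulted. As a hedged sketch, and still assuming a console font that contains the needed glyphs, you can switch the output encoding to UTF-8 before writing. ConsoleSetup is a hypothetical name of my own; Console.OutputEncoding is the framework property.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class ConsoleSetup
{
    // Asks the console to encode everything written to it as UTF-8.
    public static void UseUtf8Output()
    {
        Console.OutputEncoding = Encoding.UTF8;
    }
}
```

Call ConsoleSetup.UseUtf8Output() once at the start of Main, then use Console.WriteLine as usual. Whether the character then appears correctly still depends on the console font.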

Unicode support in other application frameworks

Now let us see how well Unicode is supported in other frameworks, such as WPF and ASP.NET. I will not go into Windows Forms, because the process is similar to WPF. ASP.NET and WPF have a wide variety of fonts and glyphs that can support nearly all characters. So, let us continue from the software frameworks to a web framework, and finally test the SQL Server database with each of these frameworks to see what it is like to support Unicode characters.

Let me define the data source first

Before I continue to any framework, I would like to introduce the data source that I will use in the article to show how you can read and write the data in Unicode format from multiple data sources. In this article, I will use:

Notepad, which supports multiple encodings: ASCII, Unicode, and so on.
SQL Server database to store the data in rows and columns.

You can use either of these data sources (the first one is available to you if you're using a Windows-based OS); both support writing and reading Unicode data. If you're going to create the file and write the data from code, there is nothing else to do. Otherwise, if you're going to create and name a new file yourself, make sure you select UTF-8 encoding (not "Unicode", which in Notepad means UTF-16) before hitting the save button. Otherwise the file may be saved in Notepad's default ANSI encoding, and any characters that encoding cannot represent would be lost. You can use Notepad as the data source, or SQL Server if you have it; both can satisfy your needs.
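To make the data loss concrete, here is a small sketch that pushes text through an encoding and back. I use Encoding.ASCII as a stand-in for any non-Unicode encoding; EncodingLoss and RoundTrip are hypothetical names of my own.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; the names are assumptions.
public static class EncodingLoss
{
    // Encodes the text to bytes and decodes those bytes with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

RoundTrip("\u0915", Encoding.UTF8) returns the original क, while RoundTrip("\u0915", Encoding.ASCII) returns "?": the character was replaced when it was encoded, and nothing can bring it back afterwards.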

Using SQL Server Database

You can use a SQL Server database in your project too. If you're going to use the source code given here, you will need a sample database, and inside that database (a newly created one or your current testing database) a new table to hold the Language and UnicodeData columns. You can run the following SQL command to create it.

CREATE TABLE UnicodeData (
    Language nvarchar(50),
    UnicodeData nvarchar(500)
);

Make sure the correct database is selected before you create the table, or run the USE DATABASE_NAME command before this one. Initially, I filled the table with the following data.

Language	UnicodeData
Arabic	بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ
Hindi	यूनिकोड डेटा में हिंदी
Russian	рцы слово твердо
English	Love for all, hatred for none!
Urdu	یونیکوڈ ڈیٹا میں اردو

Now there is quite enough data and languages to test our frameworks again. I am sure the console would never support it, so why even try? Yet if you want to see the output in a console application, I won't deny you.

WPF and Unicode

The only problem in the console application was the limited set of character glyphs in its font family, and WPF overcomes that. In WPF you can use many fonts (system fonts or your own custom fonts) to display different characters in your applications the way you want them to appear.

WPF supports all of the characters; we will see why I say that. First, let us print the same characters as before, now in WPF, using plain text. Once I have done that, I will check whether fonts are a factor in WPF or not. Stay tuned.

'a' character

First of all, I will try printing the 'a' character on the screen to see what its code is; it should be the same as in ASCII. Consider the following code:

// text is a TextBlock control in WPF and String.Format can be used to format the string
text.Text = String.Format("character '{0}' has a code: {1}", "a", ((int)'a'));

Int32 can map to all of the characters in Unicode and can store their decimal value.

Now the preceding code, once executed, would print the following output.

Quite similar output to that of the console application. Moving forward now.

'α' character

Now, moving to the Greek character and trying it out results in the following screen:

'क' character

Now for the problematic one: the Hindi character. When we change the code to print क, we get:

This shows that WPF really does support the character, because the font family, Segoe UI, supports these Unicode characters, which in this instance are from the Hindi alphabet.

Testing SQL Server data in WPF

We saw how the console application treated the data. Now let us test how our WPF application treats Unicode data coming from SQL Server: does it display the raw data correctly on the screen, or do we need to do something with it first? I will use the SqlClient classes and run a SqlCommand over a SqlConnection to my database.

You will need a connectionString for your SQL Server database.

// Requires: using System.Data.SqlClient;
// Create the connection.
using (SqlConnection conn = new SqlConnection("server=your_server;database=db_name;Trusted_Connection=true;"))
{
    // DO REMEMBER TO OPEN THE CONNECTION!!!
    conn.Open();
    // Connection established
    SqlCommand command = new SqlCommand("SELECT * FROM UnicodeData", conn);
    // For better readability
    text.FontSize = 13;
    // Header row, followed by a blank line
    text.Text = "Language\t | \tUnicodeData" + Environment.NewLine + Environment.NewLine;
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // Write the data
            text.Text += reader[0] + "\t \t | \t" + reader[1] + Environment.NewLine;
        }
    }
}

Now the WPF shows me the following output on the screen.

The preceding image shows that no extra effort is required from us for the data to be rendered; WPF does that for us.

Adding and retrieving the data

People often say that they stored the data in the correct format, but when they try to extract it, they get it back in the wrong format. Hindi, Arabic, Urdu, and Japanese users usually ask such questions, so I thought I should also give an overview of what happens when a user stores data in the data source. I used the following code to insert 3 rows into the database.

SqlCommand insert = new SqlCommand(@"INSERT INTO UnicodeData (Language, UnicodeData)
    VALUES (@lang, @data)", conn);
var lang = "language in every case";
var udata = "a few characters in that particular language";
// Adding the parameters; string values are sent as nvarchar,
// so the Unicode text is preserved
insert.Parameters.Add(new SqlParameter("@lang", lang));
insert.Parameters.Add(new SqlParameter("@data", udata));
if (insert.ExecuteNonQuery() > 0)
{
    // Stop the app, to view this message!
    MessageBox.Show("Greek data was stored into database, now moving forward to load the data.");
}

The data I inserted was:

Greek	Ελληνικών χαρακτήρων σε Unicode δεδομένων
Chinese	祝好运
Japanese	幸運

So now the database table should look like this:

Fonts do matter in WPF too

In the console application, the font family mattered. The same question arises here: "Does the font family matter in WPF too?" The answer is, "Yes, it does." But the underlying process is different. The WPF framework maps characters to glyphs for every font family, and if the selected font cannot supply a glyph for a character, WPF falls back to a default font family that does support that character.

If you read the FontFamily class documentation on MSDN, you will find an interesting section named "Font Fallback", which states the following.

Quote:

Font fallback refers to the automatic substitution of font other than the font that is selected by the client application. There are two primary reasons why font fallback is invoked:

The font that is specified by the client application does not exist on the system.
The font that is specified by the client application does not contain the glyphs that are required to render the text.

Now wait, it doesn't end there. This doesn't mean that WPF would use a custom font in place of that font, or draw a box or a question mark. What actually happens is wonderful (in my opinion): WPF uses a default fallback font family and thus provides a default, non-custom font for those characters. You should read that documentation to understand fonts in WPF. Anyhow, let us change the font in our WPF application and see what happens.