Last updated: Aug 26, 2024

Optical Character Recognition (OCR) technology has changed the way we interact with documents, images, and text data. By converting scanned images and PDFs into searchable and editable text, OCR opens up possibilities for automation, data extraction, and text analysis. In this tutorial, we will walk you through using Tesseract OCR in C# with IronOCR, a .NET library that simplifies OCR processes. Whether you're working on Windows Forms, ASP.NET, or any other .NET project type, this guide will show you how to extract text from images quickly and efficiently.
IronOCR is more than just a library; it's a robust solution that encapsulates the Tesseract OCR engine within a user-friendly .NET wrapper. By using IronOCR, you get access to the advanced capabilities of Tesseract, coupled with enhanced features like error correction, language support, and cross-platform compatibility. The library is designed for developers who want to integrate OCR functionality into their .NET applications with minimal effort and maximum flexibility.
Begin by creating a new C# project in Visual Studio. You can choose any project type, such as a Console App, Windows Forms, or ASP.NET application. Once your project is set up, you'll need to install the IronOCR package via NuGet.
Open Visual Studio. I am using Visual Studio 2019, but you can use any version.
Select “Create New Project”, then select the Windows Forms Application template.
Click “Next”. Name the Project, select Location, and click “Next”.
Click “Next” and select the “target framework”. I have chosen .NET 5.0, but you can choose your preferred option. Click “Finish”. The Windows Forms Application will be created as shown below.
Before proceeding further, we need to install the NuGet package for IronOCR.
Open the NuGet Package Manager Console from Tools > NuGet Package Manager > Package Manager Console.
The Package Manager Console will open as shown below.
Type “Install-Package IronOcr” in the Package Manager Console and press Enter.
IronOCR will begin installing in your project. Once the installation is complete, open your Windows Form and design your application.
For this tutorial, we'll create a simple Windows Forms application that allows users to select an image, perform OCR, and display the extracted text. Start by designing your form with the following controls:
Your form might look something like this:
Now that the interface is ready, let's write the code to handle image selection and OCR processing.
Double-click on the “Select Image” button.
The following code will appear:
private void SelectImage_Click(object sender, EventArgs e)
Write the following code inside this function:
private void SelectImage_Click(object sender, EventArgs e)
{
    OpenFileDialog open = new OpenFileDialog();
    // image filters
    open.Filter = "Image Files(*.jpg; *.jpeg; *.gif; *.bmp)|*.jpg; *.jpeg; *.gif; *.bmp";
    if (open.ShowDialog() == DialogResult.OK)
    {
        // display image in picture box
        pictureBox1.Image = new Bitmap(open.FileName);
        // image file path
        ImagePath.Text = open.FileName;
    }
}
Next, double-click on the “Convert to Text” button and the following code will appear:
private void ConvertToText_Click(object sender, EventArgs e)
Add the following namespace at the top of the file: using IronOcr;
Next, add the following code inside the ConvertToText_Click() function:
private void ConvertToText_Click(object sender, EventArgs e)
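A minimal sketch of what the body of this handler might look like, assuming IronOCR's `IronTesseract` class, its `Read` method, and the `Text` property of the result. `ImagePath` is the text box filled in by `SelectImage_Click`; `richTextBox1` and `Form1` are hypothetical names for the output control and the form, so adjust them to match your own designer file.

```csharp
using System;
using System.Windows.Forms;
using IronOcr;

public partial class Form1 : Form
{
    private void ConvertToText_Click(object sender, EventArgs e)
    {
        // IronTesseract is IronOCR's wrapper around the Tesseract engine.
        var ocr = new IronTesseract();

        // Read the image whose path was stored by SelectImage_Click.
        var result = ocr.Read(ImagePath.Text);

        // richTextBox1 is a hypothetical name for the control that shows the text.
        richTextBox1.Text = result.Text;
    }
}
```

The three statements in the body correspond to creating the engine, reading the image, and displaying the recognized text.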
As you can see, we only needed to write three lines of code to perform this task, thanks to IronOCR.
Let’s run the Project.
Press Ctrl + F5 to run the Project.
Click on the “Select Image” button to select the image.
Select an image of your choice. I am selecting a snapshot of an article, but you can select any of your choosing.
Next, click the “Convert to Text” button to extract all the text from this article image as shown below.
You can see that the text was extracted from the image of the article accurately and with very little effort. IronOCR has made this job very easy.
One of the standout features of IronOCR is its support for over 150 languages. Whether you need to extract text in English, Chinese, Arabic, or any other language, IronOCR makes it straightforward.
To extract text in a language other than English, you need to install the corresponding language package via NuGet. For example, to work with Chinese, use the following command:
Install-Package IronOcr.Languages.Chinese
Once the language package is installed, update your code to specify the language: IronOcr.Language = OcrLanguage.ChineseSimplified;
For example:
private void ConvertToText_Click(object sender, EventArgs e)
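A hedged sketch of the handler with the language set, assuming the `IronOcr.Languages.Chinese` package is installed and that `IronTesseract` exposes the `Language` property used in the line above. The local variable `ocr` plays the role of the instance the text calls `IronOcr`; `richTextBox1` and `Form1` are again hypothetical names.

```csharp
using System;
using System.Windows.Forms;
using IronOcr;

public partial class Form1 : Form
{
    private void ConvertToText_Click(object sender, EventArgs e)
    {
        var ocr = new IronTesseract();

        // The one extra line: switch recognition to Simplified Chinese.
        ocr.Language = OcrLanguage.ChineseSimplified;

        var result = ocr.Read(ImagePath.Text);

        // richTextBox1 is a hypothetical name for the output control.
        richTextBox1.Text = result.Text;
    }
}
```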
Let’s do the test again.
We can see that we have converted our Chinese-language image into text with just one extra line of code. The IronOCR .NET library is accurate, efficient, and easy to use in a .NET application.
Let’s look at the following example to see how we can achieve the same goal using Tesseract OCR. We can keep the same Windows Form as in the previous example and change only the code in the ConvertToText_Click handler. Everything else will remain the same as before.
Write the following command in the NuGet Package Manager Console.
Install-Package Tesseract
After installing the NuGet package, you must install the language files manually in the project folder. One could say that this is a drawback of this particular library. Download the language files from the following link. Unzip the archive and copy the tessdata folder into the Debug folder of your project.
Next, write the following code inside the ConvertToText_Click function:
private void ConvertToText_Click(object sender, EventArgs e)
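A minimal sketch of this handler, assuming the `Tesseract` NuGet package (the .NET wrapper that provides `TesseractEngine`, `Pix`, and `Page`) and a `tessdata` folder containing `eng.traineddata` next to the executable. As before, `richTextBox1` and `Form1` are hypothetical names for the output control and the form.

```csharp
using System;
using System.Windows.Forms;
using Tesseract;

public partial class Form1 : Form
{
    private void ConvertToText_Click(object sender, EventArgs e)
    {
        // "./tessdata" is resolved relative to the working directory,
        // which is why the folder is copied into the Debug output folder.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(ImagePath.Text))
        using (var page = engine.Process(img))
        {
            // richTextBox1 is a hypothetical name for the output control.
            richTextBox1.Text = page.GetText();
        }
    }
}
```

The engine, the image, and the page all hold native resources, so each is wrapped in a `using` statement to release it when the handler returns.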
Press Ctrl + F5 to run the project. Select the image file you want to convert. I have selected the same file in the English language as in the previous example. Click the “Convert to Text” button to extract the text from the image. The following window will appear:
Tesseract also supports images featuring different languages. However, we have to add separate language files into our project folder.
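As a rough illustration of switching languages with the `Tesseract` package, the sketch below passes the language code to the engine. It assumes the matching trained-data file (for example `chi_sim.traineddata` for Simplified Chinese) has already been copied into the `tessdata` folder; `TesseractLanguageSketch` and `ReadText` are hypothetical names introduced only for this example.

```csharp
using Tesseract;

public static class TesseractLanguageSketch
{
    // Reads the image at imagePath using the given Tesseract language code,
    // e.g. "eng" for English or "chi_sim" for Simplified Chinese.
    public static string ReadText(string imagePath, string language)
    {
        using (var engine = new TesseractEngine(@"./tessdata", language, EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }
}
```

A call such as `TesseractLanguageSketch.ReadText(ImagePath.Text, "chi_sim")` would then replace the hard-coded language in the click handler.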
It is now clear that the IronOCR .NET library is easier to use and easier to understand.
IronOCR's versatility makes it a valuable tool in various industries and applications. Here are some common use cases:
While IronOCR is user-friendly, you might encounter some common errors during implementation. Here’s how to troubleshoot them:
Beyond basic OCR, IronOCR offers several advanced features:
IronOCR stands out as a top choice for developers integrating OCR in C# applications, offering a seamless experience with its easy integration, support for over 150 languages, and powerful features like image pre-processing and multithreading. Whether you're building simple or complex OCR solutions, IronOCR simplifies text extraction from images, catering to developers of all experience levels. Start your journey with IronOCR by downloading the library and exploring its extensive documentation. With regular updates and a free trial by Iron Software, you have everything you need to build robust, OCR-powered applications. Happy coding!