Information Extraction and Retrieval

Today, with the emergence of electronic resources and electronic information exchange, applied and theoretical research in information processing, storage, extraction and retrieval is of great interest to the computing community.

Information Extraction (IE) is the task of identifying specific pieces of information (data) in an unstructured or semi-structured textual document and transforming them into structured data. IE uses hand-made or system-generated rules and methods to recognize, extract and classify specific data from unstructured electronic resources, and presents the resulting unambiguous data to the user as output. These data are frequently used by other intermediary systems, for example for database indexing, Internet search, information retrieval or Question Answering, and are sometimes used by the user directly.

Information Retrieval (IR), on the other hand, is a technology for finding the information a user requests across various resources and presenting all the relevant results to the user. IR is used for finding information in documents, databases and web documents, or for searching metadata.

There are many differences between IE and IR. An IR system aims to find relevant data and delivers whole documents to the user, whereas an IE application analyses the input data and presents only the specific information the user is interested in; it delivers facts to the user or to other systems and applications. An IE system is more difficult and knowledge-intensive to build and apply, and because IE is more computationally intensive than IR, it is also harder to implement. For the user, however, IE is potentially much more efficient than IR, because it can reduce the time people spend reading system results. In the Table below I show a comparison between IE and IR systems in their applications, implementation and results. It suggests that IE is more applicable and useful than IR: if an IE system can overcome its weaknesses in domain independence and in the precision of its results, it can satisfy the user more than IR.
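The difference between delivering documents and delivering facts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the company name "Acme", the product "A10" and the single hand-made rule are all invented here and do not come from any real system.

```python
import re

# Hypothetical mini-corpus, invented for illustration only.
DOCS = {
    "d1": "Acme released the A10 phone in March 2008. Acme is based in Kuala Lumpur.",
    "d2": "The A10 phone review: battery life is good.",
    "d3": "Weather in Kuala Lumpur is humid all year.",
}

def retrieve(query, docs):
    """IR: return the ids of whole documents containing every query term.

    Terms are matched as case-insensitive substrings, which is a
    simplification of what a real search engine does.
    """
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]

# IE: one hand-made rule that extracts a single kind of fact (a product release).
RELEASE_RULE = re.compile(
    r"(?P<company>[A-Z]\w+) released the (?P<product>\w+) phone "
    r"in (?P<date>[A-Z]\w+ \d{4})"
)

def extract_releases(docs):
    """IE: return structured facts instead of documents."""
    facts = []
    for doc_id, text in docs.items():
        for match in RELEASE_RULE.finditer(text):
            facts.append({"source": doc_id, **match.groupdict()})
    return facts

if __name__ == "__main__":
    print(retrieve("a10 phone", DOCS))   # document ids the user still has to read
    print(extract_releases(DOCS))        # structured records, ready for a database
```

The IR function leaves the reading to the user, while the IE function returns a record that another system could index directly. The price is that the rule only works for this one sentence pattern, which reflects the domain-dependence weakness mentioned above.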

For instance, when a user looks for a mobile phone with a simple specification on the Internet, a robust and well-known search engine such as Google or Yahoo returns thousands upon thousands of search results (see Figure below). Which of the results are relevant to the user? Which of them contain the exact data the user is looking for? Does the user have enough time to read all of these results? Such systems use Information Retrieval (IR) to answer the user's query, but most of the time the results cannot satisfy the user, because the user needs exact and relevant information. To overcome this problem, Information Extraction (IE) is used within the IR system. One type of IE is Named Entity Recognition (NER): the task of learning to extract names such as Person names, Location names, Organization names, Dates and Times from natural language text.




Five types of IE task were defined in the final MUC-7 (SAIC 1998), (MUC 1998). They are:


a) Named Entity Recognition (NER): finds and classifies names.
b) Co-reference resolution (CO): identifies identity relations between entities.
c) Template Element construction (TE): adds descriptive information to NE results (using CO).
d) Template Relation construction (TR): finds relations between TE entities.
e) Scenario Template production (ST): fits TE and TR results into specified event scenarios.

In simpler terms: NE is about finding entities; CO is about which entities and references (such as pronouns) refer to the same thing; TE is about what attributes the entities have; TR is about what relationships exist between entities; and ST is about the events that the entities participate in.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a sub-problem of information extraction. It involves processing structured and unstructured resources and identifying expressions that refer to People, Places, Organizations, Dates and Times (see Figure). NER is a fundamental task in information extraction, because it extracts information by analyzing natural language. The term Named Entity Recognition and the other terms above were introduced at the sixth Message Understanding Conference.


Since the Named Entity Recognition (NER) task seems relatively simple, a domain-independent system with a high accuracy rate that also resolves the senses of names is expected.

At first, building an NE system seems easy, but despite the many studies in this area, a large number of ambiguous cases remain that make the NER task difficult and reduce the performance of existing methods.

For illustration, consider the examples below:

When is "The White House" an organization, and when is it a location?
When is "June" a person name, and when is it a month name?

In "He visited Bush at White House", White House is a location, but in "The White House has asked the Department of Justice to look into …", White House is an organization.
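The two White House sentences can be told apart by looking at the neighbouring words. The following Python sketch does this with two small cue lists. The cue words and the function name `classify_mention` are assumptions chosen for this example; they are not a published method and would not scale beyond toy sentences.

```python
import re

# Assumed cue words, chosen only to cover the two example sentences.
LOCATION_CUES = {"at", "in", "to", "near"}                 # word before the name
ORGANIZATION_CUES = {"has", "said", "asked", "announced"}  # word after the name

def classify_mention(sentence, name):
    """Guess the sense of `name` in `sentence` from its neighbouring words.

    Returns "ORGANIZATION", "LOCATION", "UNKNOWN", or None when the
    name does not occur in the sentence.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    target = name.lower().split()
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != target:
            continue
        # Skip an optional article directly before the name.
        j = i - 1
        if j >= 0 and tokens[j] == "the":
            j -= 1
        before = tokens[j] if j >= 0 else None
        after = tokens[i + n] if i + n < len(tokens) else None
        if after in ORGANIZATION_CUES:
            return "ORGANIZATION"   # the name acts (has asked, said, ...)
        if before in LOCATION_CUES:
            return "LOCATION"       # the name follows a place preposition
        return "UNKNOWN"
    return None

if __name__ == "__main__":
    print(classify_mention("He visited Bush at White House", "White House"))
    print(classify_mention(
        "The White House has asked the Department of Justice to look into",
        "White House"))
```

The same surface string receives different labels only because the context differs, which is the point of the example: the decision cannot be made from the name alone.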

For humans, NER in daily dialog and text reading is intuitively very simple; only occasionally do they have trouble recognizing very difficult or unfamiliar names. Many named entities are proper names, most of which begin with a capital letter and can easily be recognized that way, and for unknown words people can consult a dictionary or other sources. For a machine, however, the task is hard. One might think that named entities can be classified easily using dictionaries and some fixed grammatical rules, because most named entities are proper nouns, but this is mistaken. New proper nouns are created continuously, and there are more than 500 different languages in the world, each with its own grammar rules that change and grow over time. It is therefore impossible to add all proper nouns to a dictionary, so dictionary-based systems are neither sufficient nor language-independent enough to recover all named entities. Even when named entities are registered in the dictionary, it is not easy to decide their senses. The main problem in NER is semantic (sense) ambiguity: a proper noun has different senses depending on the context.
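Both limits of the dictionary approach, missing entries and multiple senses, show up even in a tiny lookup table. The Python sketch below is a toy: the entries in `GAZETTEER` and the function names are assumptions made for illustration, not a real resource.

```python
# A toy dictionary (gazetteer) mapping lower-cased names to their possible senses.
# The entries are assumptions made for this illustration.
GAZETTEER = {
    "june": {"PERSON", "DATE"},
    "white house": {"LOCATION", "ORGANIZATION"},
    "kuala lumpur": {"LOCATION"},
}

def lookup(name):
    """Return every sense the dictionary lists for `name` (possibly none)."""
    return GAZETTEER.get(name.lower(), set())

def dictionary_verdict(name):
    """Classify by dictionary alone and report where that breaks down."""
    senses = lookup(name)
    if not senses:
        return "UNKNOWN"      # a new or unlisted proper noun
    if len(senses) > 1:
        return "AMBIGUOUS"    # listed, but the sense depends on context
    return next(iter(senses))

if __name__ == "__main__":
    for name in ("Kuala Lumpur", "June", "White House", "Toyserkan"):
        print(name, "->", dictionary_verdict(name))
```

Only the first name gets a usable answer. Adding more entries reduces the UNKNOWN cases but increases the AMBIGUOUS ones, so a larger dictionary does not remove the need for context.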

NER involves two main processing tasks: first, the identification of proper names, and second, the classification of these names into a set of predefined categories of interest, such as person names, organizations (companies, government organizations, committees, etc.), locations (cities, countries, rivers, etc.), and date and time expressions. These may seem to be two independent tasks that can be separated and handled on their own, but we will show that there are strong relations between the identification and the classification of each name; classifying names without regard to the recognition step leaves the system ineffective.
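A minimal two-step pipeline makes the link between the steps visible. The Python sketch below uses a capitalization heuristic for identification and context cues for classification. The word lists and function names are assumptions for this example, and it is not the fuzzy SVM method described in the papers listed further down.

```python
import re

# Step 1 heuristic: one or more consecutive capitalized words.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

# Assumed word lists, kept tiny on purpose.
NON_NAMES = {"He", "She", "It", "The", "In"}       # capitalized only by position
PERSON_TITLES = {"Mr", "Mrs", "Dr", "Prof"}
LOCATION_PREPOSITIONS = {"in", "at", "from"}

def identify(sentence):
    """Step 1: return (start, end, text) spans that look like proper names."""
    spans = []
    for match in CANDIDATE.finditer(sentence):
        text = match.group()
        if text in NON_NAMES or text in PERSON_TITLES:
            continue  # drop sentence-initial function words and bare titles
        spans.append((match.start(), match.end(), text))
    return spans

def classify(sentence, span):
    """Step 2: label a span using the word immediately before it."""
    start, _end, _text = span
    preceding = sentence[:start].split()
    previous = preceding[-1].rstrip(".") if preceding else ""
    if previous in PERSON_TITLES:
        return "PERSON"
    if previous in LOCATION_PREPOSITIONS:
        return "LOCATION"
    return "UNKNOWN"

def recognize(sentence):
    """Run both steps and return (name, category) pairs."""
    return [(span[2], classify(sentence, span)) for span in identify(sentence)]

if __name__ == "__main__":
    print(recognize("He met Dr. Mamat in Singapore"))
```

The classifier reads the word just before the span that the identifier produced, so a wrong span boundary in step 1 gives step 2 the wrong context. This is one concrete way the two steps depend on each other.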

PRPI 2007

About PRPI 2007
Poster presentation. PRPI 2007, University Putra Malaysia. November 2007

My Poster



Photos



ICCSIT 2008

About ICCSIT
The 2008 International Conference on Computer Science and Information Technology (ICCSIT 2008) was held successfully in Singapore in August 2008. Distinguished keynote speakers included Meritorious Prof. Dr. S. M. Aqil Burney from University of Karachi; Dean Prof. Dr. V. Saravanan from Karunya University; Dr. Li Wenzheng, Dean Professor from Beijing Technology and Business University; HOD Prof. Dr. S.R. Bhadra Chaudhuri from Bengal Engineering & Science University; and Eng. Saleemullah Sohoo. It was attended by over 50 delegates from around the world. The ICCSIT 2008 proceedings were compiled, cataloged and published by the IEEE Computer Society, and have been included in the IEEE Computer Society Digital Library (CSDL), IEEE Xplore and INSPEC.

http://www.iccsit.org/

My Paper
Alireza Mansouri, Lilly Suriani Affendy, Ali Mamat, "A New Fuzzy Support Vector Machine Method for Named Entity Recognition", In Proceedings of the IEEE International Conference on Computer Science and Information Technology (ICCSIT), Singapore, September 2008.

Photos






ITSim 2008

About ITSim
The 3rd International Symposium on Information Technology 2008 (ITSim2008) aims to provide an excellent opportunity to share and exchange technologies and applications in the information technology and communication fields for professionals, engineers, academics and industrial people worldwide. The symposium will be held at the Kuala Lumpur Convention Centre, from 26th to 29th of August 2008, hosted by Universiti Kebangsaan Malaysia and co-sponsored by IEEE Malaysia Section. The theme is “Cognitive Informatics: Bridging Natural and Artificial Knowledge”. http://www.itsim.org/

My Paper
Alireza Mansouri, Lilly Suriani Affendy, Ali Mamat, Rabiah Abdul Kadir, "Semantically Factoid Question Answering Using Fuzzy SVM Named Entity Recognition", In Proceedings of the IEEE 3rd International Symposium on Information Technology (ITSim), Kuala Lumpur, Aug 2008.

Photos












MySEC2007

About MySEC2007
The Third Malaysian Software Engineering Conference (MySEC'07) is the leading regional conference on software engineering that aims to bring together researchers and practitioners from academia, industry and government to advance the state-of-the-art research and practice in software engineering.
http://www.fsktm.upm.edu.my/mysec07/

My Paper
Alireza Mansouri, Ali Mamat, Lilly Suriani Affendy, "Named Entity Recognition and Extraction using Machine learning Methods", In Proceedings of the 3rd Malaysian Software Engineering Conference (MySEC), Selangor, Malaysia, December 2007.

Photos









My Journal Papers

ISI Journal:
1: Alireza Mansouri, Lilly Suriani Affendy, Ali Mamat, "Context-based Named Entity Recognition Using Fuzzy Support Vector Machine", Accepted and under review in Springer's Information Retrieval Journal.

International Journals:

1: Alireza Mansouri, Lilly Suriani Affendy, Ali Mamat, "Named Entity Recognition Using a New Fuzzy Support Vector Machine", International Journal of Computer Science and Network Security, 2008. 8(2): p. 320.

2: Alireza Mansouri, Lilly Suriani Affendy, Ali Mamat, "Named Entity Recognition Approaches", International Journal of Computer Science and Network Security, 2008. 8(2): p. 339.

BIODATA

Alireza Mansouri was born in Toyserkan, Iran, in 1971.
He went to Mohamadyyeh School, Hejrtas School, Shariatty School, and Amirkabir School for his primary education. He obtained his bachelor's degree in the field of applied mathematics in computer science from Islamic Azad University in 1997.

From 1993 until 2005 he worked at several companies, namely:
Mohaseb Machine Co, Iran, Jul 1993 – Sep 1996
Bonyad Machine Co, Iran, Sep 1996 – March 1997
Ferdos Rayaneh Co, Iran, March 1997 – Oct 1999
Pak Pasteurized Dairy Product Co (public joint stock), Iran, Oct 1999 – March 2002
In March 2002 he joined Behnoush Iran Co (public joint stock), where he was manager of the computer department until Jan 2005.
He left Iran in 2006 to continue his studies at University Putra Malaysia in the field of databases, in the department of computer science.