Information Extraction and Retrieval

Today, with the emergence of electronic resources and electronic information exchange, applied and theoretical research in information processing, storage, extraction and retrieval is of great interest to the computing community.

Information extraction (IE) is the task of identifying specific pieces of information (data) in an unstructured or semi-structured textual document and transforming them into structured data. IE uses hand-made or system-generated rules and methods to recognize, extract and classify specific data from unstructured electronic resources as input, and presents these classified, unambiguous data to the user as output. Such data are frequently used by other systems, for example for database indexing, Internet surfing, information retrieval or Question Answering, and are sometimes used by the user directly.

Information Retrieval (IR), on the other hand, is a technology for finding the relevant information a user requests from various resources and presenting all the relevant results to the user. IR is used for finding information in documents, databases and web documents, or for searching metadata.

There are many differences between IE and IR. An IR system aims to find relevant data and delivers whole documents to the user, whereas an IE application analyses the input data and presents only the specific information the user is interested in, delivering facts to the user or to other systems and applications. An IE system is more difficult and knowledge-intensive to build and apply, and because IE is more computationally intensive than IR, it is also harder to implement. On the other hand, IE is potentially much more efficient than IR because it can reduce the time people spend reading system results. In the Table below I show a comparison between IE and IR systems in their applications, implementation and results. It suggests that IE can be more applicable and useful than IR: if an IE system can overcome its weaknesses in domain independence and precision of results, it can satisfy the user better than IR.
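To make the contrast concrete, here is a minimal sketch of the two styles of output. The toy corpus, the `retrieve` and `extract_prices` functions and the price pattern are all made up for illustration; real IR and IE systems are far more elaborate.

```python
import re

# Hypothetical toy corpus, invented for this example.
DOCS = {
    "doc1": "The Nokia 1100 is a simple hand phone. Price: 25 USD.",
    "doc2": "Review of laptops and tablets released this year.",
    "doc3": "Samsung E1200 hand phone, basic features. Price: 19 USD.",
}

def retrieve(query, docs):
    """IR style: return the ids of whole documents that contain every query term."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower() for term in terms)]

# Hand-made extraction rule (an assumption): "Price: <amount> <3-letter currency>".
PRICE = re.compile(r"Price:\s*(\d+)\s*([A-Z]{3})")

def extract_prices(docs):
    """IE style: return structured facts instead of whole documents."""
    facts = []
    for doc_id, text in docs.items():
        match = PRICE.search(text)
        if match:
            facts.append({"doc": doc_id,
                          "amount": int(match.group(1)),
                          "currency": match.group(2)})
    return facts

hits = retrieve("hand phone", DOCS)                           # documents to read
facts = extract_prices({doc_id: DOCS[doc_id] for doc_id in hits})  # facts to use
print(hits)
print(facts)
```

With IR the user still has to read every returned document; with IE the user (or another program) receives the price facts directly, at the cost of writing a rule that only works for this kind of text.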

For instance, when a user looks for a hand phone with a simple specification on the Internet, a robust and well-known search engine like Google or Yahoo returns thousands and thousands of search results (see Figure below). Which of the results are relevant to the user? Which of the results contain the exact data that the user expects and is looking for? Does the user have enough time to read all of these results? Such systems use Information Retrieval (IR) to answer the user's query, but most of the time the results cannot satisfy the user, because the user needs exact and relevant information. To overcome this problem, Information Extraction (IE) is used within the IR system. One type of IE is Named Entity Recognition (NER): the task of learning to extract names such as Person names, Location names, Organization names, Dates and Times from natural language text.
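As a rough illustration of what an NER component produces, the sketch below uses hand-made rules only: a tiny gazetteer (name list) and two regular expressions. The gazetteer entries, the patterns and the `recognize` function are assumptions made up for this example; practical NER systems usually learn their rules or models from annotated text.

```python
import re

# Hypothetical gazetteer: known names mapped to their entity class.
GAZETTEER = {
    "Google": "ORGANIZATION",
    "Yahoo": "ORGANIZATION",
    "Kuala Lumpur": "LOCATION",
}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical hand-made patterns for dates ("3 March 2008") and times ("10:30").
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\b")),
]

def recognize(text):
    """Return (entity text, entity class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by position in the text, then drop the position.
    return [(entity, label) for _, entity, label in sorted(found)]

sample = "Google opened an office in Kuala Lumpur on 3 March 2008 at 10:30."
entities = recognize(sample)
for entity, label in entities:
    print(label, entity)
```

The weakness of such hand-made rules is visible immediately: any organization or location missing from the gazetteer is not recognized, which is one reason NER is normally treated as a learning task.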




Five types of IE task were defined in the final MUC-7 (SAIC 1998), (MUC 1998). They are:


a) Named Entity Recognition (NER), finds and classifies names.
b) Co-reference resolution (CO), identifies identity relations between entities.
c) Template Element construction (TE), adds descriptive information to NE results (using CO).
d) Template Relation construction (TR), finds relations between TE entities.
e) Scenario Template production (ST), fits TE and TR results into specified event scenarios.

In simpler terms: NE is about finding entities; CO is about which entities and references (such as pronouns) refer to the same thing; TE is about what attributes the entities have; TR is about what relationships exist between the entities; ST is about the events that the entities participate in.
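One way to picture how the five tasks build on each other is to look at the data each one would produce for a short text. The sketch below is filled in by hand for the invented sentence "Acme Corp hired Jane Smith. She will lead its research lab."; the class names, field names and labels are hypothetical and are not the official MUC-7 template format.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """NE result: a name found in the text and its class."""
    text: str
    label: str

@dataclass
class TemplateElement:
    """TE result: an entity plus descriptive attributes.
    The mentions list holds the CO result (all references to the same entity)."""
    entity: Entity
    mentions: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

@dataclass
class TemplateRelation:
    """TR result: a relation between two template elements."""
    kind: str
    source: TemplateElement
    target: TemplateElement

@dataclass
class ScenarioTemplate:
    """ST result: an event with its participants and relations."""
    event: str
    participants: list
    relations: list

# Hand-filled example values (assumptions, not the output of a real system).
acme = TemplateElement(Entity("Acme Corp", "ORGANIZATION"),
                       mentions=["Acme Corp", "its"],
                       attributes={"type": "company"})
jane = TemplateElement(Entity("Jane Smith", "PERSON"),
                       mentions=["Jane Smith", "She"],
                       attributes={"role": "research lab lead"})
employee_of = TemplateRelation("EMPLOYEE_OF", source=jane, target=acme)
hiring = ScenarioTemplate("HIRING", participants=[jane, acme],
                          relations=[employee_of])

print(hiring.event, [p.entity.text for p in hiring.participants])
```

Each later structure refers to the earlier ones, which mirrors the dependency between the tasks: TE needs NE and CO, TR needs TE, and ST needs both TE and TR.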

1 comment:

MMina said...

Hello Mr Mansouri
I am a student at the Science & Research branch of Azad University in software engineering. I am looking for a good topic for my seminar, and I have read something about IR & IIR. I have passed "Symbolic processing" from AI; it is about deductive databases, probabilistic Datalog, reasoning, logics, ... and "IR in probabilistic Datalog". I would like to work on the latter (IR in probabilistic Datalog) or on intelligent information retrieval. Would you please help me?

Thanks
M.Minazadeh