Information Flow 2.10 ~~~ October 2003 ~~~ From Finding to Grasping~ Applications for Extraction

~~~ Ramana Rao's INFORMATION FLOW ~~~ Issue 2.10 ~~~ Oct 2003 ~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Information Flow is an opt-in monthly newsletter.  Your email
address was entered on www.ramanarao.com or www.inxight.com.
You may forward this issue in its entirety.
Send me your thoughts and questions:         [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~ IN THIS ISSUE ~~~ October 2003 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Introduction
* Shifting from Finding Documents to Grasping Statements
* Enterprise Applications for Information Extraction



~~~ Introduction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This month I've concentrated on trying to capture some thoughts
that I've found myself explaining a lot.  This issue is a first
pass at explaining the types of applications as well as some
particular cross-industry applications enabled by information
extraction technology.  In a nutshell, extraction focuses on
capturing some useful portion of the meaning embedded in the
sentences of the documents of content collections.

And as I'm about to turn into a pumpkin, I've granted myself a
break from gathering articles for the 49 classics of Information
Flow and other links.  I hope you don't mind.

~~~ Shifting from Finding Documents to Grasping Statements ~~~~~

Stretching a point to make it, I often claim that content may be
the most underutilized asset in large organizations today.
Organizations typically buy or create large amounts of content at
great costs, yet little attention goes to truly leveraging it in
broader knowledge processes.  By content, I mean collections of
electronic textual documents including research reports, product
collaterals, development specifications, internal memos, sales
materials, patents and invention proposals, press releases, news
articles, scientific literature, email messages, and so on.

Search has been the focus of past efforts to leverage
organizational content, yet the shortcomings of traditional
search are widely understood.  For example, search typically
focuses on helping users find documents, yet users aren't really
interested in the documents per se, but rather what they say.
Herein lies a key insight: content is made out of human language
statements about the world.  So it makes sense that the next leg
of the journey in fully utilizing content will depend on
technologies that focus on processing the statements in the
content, not just finding documents.

Here I mean not just single statements that stand out for their
uniqueness or relevance to our pursuits, but also patterns over
entire collections.  There is signal and meaning in the stocks
and flows of content, and we can go after these with software.
For many, this immediately conjures up the spectre of solving the
grand scientific challenge of natural language understanding by
machines, but let's hold that thought and first take a look at
various types of content use applications.

Types of Applications

All content use applications have something to do, surprise, with
content and with users.  Particularly, they all enable some kind
of interaction between the information needs of humans and the
meaning-bearing streams of content.  Differences in the nature of
interaction and handoff between the system and the user define
distinct types of applications.  First, activity may be driven by
the user or by the flow of content.  Second, the focus may be on
providing documents or elements of documents to the user, or
rather on analyzing or processing the contents of statements
contained in the documents.  These two axes capture four basic
types of applications:

Retrieval -- users find and understand relevant documents
Routing   -- system routes relevant documents to people 
Mining    -- users explore or analyze collections or flows
Alerting  -- system generates events or reports 
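
The two axes can be written down as a small lookup table.  This
is only an illustrative sketch in Python; the names Driver, Focus
and classify are my own labels for this example, not part of any
product or standard.

```python
from enum import Enum

class Driver(Enum):
    USER = "user"        # activity initiated by the user
    CONTENT = "content"  # activity triggered by the content flow

class Focus(Enum):
    DOCUMENTS = "documents"    # deliver documents or parts of them
    STATEMENTS = "statements"  # analyze what the documents say

# The four application types, indexed by the two axes.
APPLICATION_TYPES = {
    (Driver.USER, Focus.DOCUMENTS): "Retrieval",
    (Driver.CONTENT, Focus.DOCUMENTS): "Routing",
    (Driver.USER, Focus.STATEMENTS): "Mining",
    (Driver.CONTENT, Focus.STATEMENTS): "Alerting",
}

def classify(driver: Driver, focus: Focus) -> str:
    """Return the application type for a point on the two axes."""
    return APPLICATION_TYPES[(driver, focus)]

print(classify(Driver.CONTENT, Focus.STATEMENTS))  # Alerting
```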

In retrieval applications, activity is user-initiated based on
information needs that arise during tasks or projects.  Retrieval
applications are certainly the most widely-deployed and
understood type of application.  Information retrieval has been
an active field for almost the entire history of computing, and
the Internet has catapulted it into the mainstream.  Though the
focus with retrieval is on finding documents, the requirement of
relevance underscores the importance of knowing what a document
is about.  So, even here, the use of content analysis can
dramatically improve retrieval systems.

Routing flips the retrieval paradigm by turning the pull of
retrieval into the push of content-triggered delivery, for
example, to an email box.  Routing makes sense when information
needs are not just one-time, but instead recur based on broader
roles or organizational needs.  A simple example is a syndication
service that matches new documents against saved queries or user
profiles.  Broader organizational applications include routing of
documents to the right people for further processing, e.g. routing
patents to examiners or support cases to relevant specialists.
Because routing "pushes" content at people, it requires finer
discrimination on what the document is about, otherwise the push
might quickly feel like a shove.
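
As a rough illustration of matching new documents against saved
profiles, here is a minimal Python sketch.  The profile names,
the term sets and the min_overlap threshold are all made up for
the example, and plain word overlap stands in for the finer
discrimination a real routing system would need.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def route(document: str, profiles: dict[str, set[str]],
          min_overlap: int = 2) -> list[str]:
    """Return the users whose profile terms overlap the document.

    min_overlap is a hypothetical threshold: raising it makes
    the push more selective.
    """
    words = tokenize(document)
    return sorted(
        user for user, terms in profiles.items()
        if len(words & terms) >= min_overlap
    )

# Hypothetical saved profiles, one set of terms per recipient.
profiles = {
    "examiner_a": {"patent", "semiconductor", "lithography"},
    "support_b": {"printer", "driver", "install"},
}
doc = "New patent filing on semiconductor lithography masks"
print(route(doc, profiles))  # ['examiner_a']
```

Setting min_overlap too low is exactly how a push turns into a
shove: every document that shares a single word with a profile
would be delivered.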

While retrieval and routing applications can be improved by
finer-grained processing of contents, mining and alerting
applications absolutely require such processing.  Mining
applications enable users to explore the statistics of content
collections or flows looking for interesting patterns or
occurrences.  Mining applications turn text documents into
structured data that can be combined with other data sources and
integrated into statistical or business intelligence
applications.  Alerting applications are the routing-style
obverse of mining.  They notify users when particular patterns or
events occur in content flows.
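
To make the pairing of mining and alerting concrete, here is a
small Python sketch.  It assumes a content flow given as (day,
text) pairs and uses crude substring matching on a list of watch
terms; the sample headlines, the terms and the threshold are
invented for illustration.

```python
from collections import Counter, defaultdict

def mine(flow, watch_terms):
    """Mining: turn a flow of (day, text) pairs into a table of
    per-day document counts for each watched term."""
    table = defaultdict(Counter)
    for day, text in flow:
        lowered = text.lower()
        for term in watch_terms:
            if term in lowered:  # crude substring match
                table[day][term] += 1
    return table

def alerts(table, threshold):
    """Alerting: report (day, term, count) whenever a count
    reaches the threshold."""
    return [
        (day, term, count)
        for day in sorted(table)
        for term, count in sorted(table[day].items())
        if count >= threshold
    ]

# Hypothetical news flow.
flow = [
    ("2003-10-01", "Acme announces merger with Globex"),
    ("2003-10-02", "Analysts question the Acme merger"),
    ("2003-10-02", "Globex merger talks continue"),
    ("2003-10-02", "Acme recalls product"),
]
table = mine(flow, ["merger", "recall"])
print(alerts(table, threshold=2))
# [('2003-10-02', 'merger', 2)]
```

The table is the structured data a user could explore or hand to
a business intelligence tool; the alert is the same table watched
by the system on the user's behalf.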

Picking Statements Apart

All the types of applications described above depend on analyzing
content to "understand" some portion of the meaning of its
statements.  Somewhere between one extreme of completely
depending on humans to extract meaning and the other of expecting
machines to fully understand content themselves (whatever that
may mean), we can target particularly useful aspects of meaning
and particularly reliable extraction methods.

Content analysis can be viewed as the processing of content into
structured representations or databases that capture some
aspects of the meaning of the content's statements.  To get at
meaning, we can ask: what is the statement talking about, and
what is it saying about that?  These questions highlight the two
basic mechanisms for meaning in statements.  Statements "refer"
to objects in the world and they "say" something about them.

A search index can be seen as a trivial example of such a
structured database.  It provides a table of how many times and
where words are used in the documents of a content collection.
Its model of the world is that the world has documents in it,
and that the words used in a document tell you what the document
is about.  At the other extreme is a semantic network of the type
typical of knowledge-based systems in artificial intelligence.
Semantic networks try to model the complete "meaning" of the
statements in a way that the meaning is accessible to machine
reasoning systems.
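
The word index end of this spectrum is simple enough to sketch in
a few lines of Python.  This is a toy version under my own
assumptions (whitespace-and-punctuation tokenizing, documents
held in a dictionary), not a description of any particular search
engine.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} over a collection."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            index[word][doc_id].append(pos)
    return index

# Hypothetical two-document collection.
docs = {
    "d1": "Acme acquires Globex",
    "d2": "Globex founder leaves Globex",
}
index = build_index(docs)
print(dict(index["globex"]))  # {'d1': [2], 'd2': [0, 3]}
```

The index knows how often and where "globex" occurs, but nothing
about Globex being a company or about what happened to it.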

In between these two structures, we can imagine a database that,
like the semantic network, truly refers to objects in the world,
but that makes more limited types of statements.  These
statements are of high value in particular applications and can
be reliably generated from textual content.  Again, it's about
looking for sweet spots that balance utility and viability.

For example, consider a collection of articles about company
events.  The world covered by the statements in the collection
is familiar.  It includes people, companies, roles people play in
companies, corporate events (e.g. founding, bankruptcy, mergers
and acquisitions) and so on.  A structured database over this
space of objects and relationships would capture more meaning
than a simple word index, while not providing the structure to
answer arbitrary questions about the contents of the articles.

This kind of analysis technology is called information
extraction.  It includes what is called entity extraction,
figuring out what objects in the world a statement is talking
about, and fact extraction, figuring out what the statement is
saying about them.  It is the key technology for moving forward
in our efforts to leverage content.  It focuses
on the meaning of statements in content and on the problem of
graspability rather than that of findability.
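
A toy Python sketch may help fix the distinction between entities
and facts.  It assumes a single hand-written pattern for one kind
of company event and invented company names; real extraction
systems rely on much richer linguistic analysis than a regular
expression.

```python
import re

# Hypothetical pattern: a company is a run of capitalized words.
# This would wrongly absorb a capitalized word that starts a
# sentence, one of many reasons it is only a sketch.
COMPANY = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"
ACQUISITION = re.compile(
    rf"({COMPANY}) (?:acquired|acquires|bought) ({COMPANY})"
)

def extract_facts(text):
    """Return (entities, facts) found by the toy pattern."""
    facts = [
        {"event": "acquisition",
         "buyer": m.group(1),
         "target": m.group(2)}
        for m in ACQUISITION.finditer(text)
    ]
    entities = sorted(
        {f[role] for f in facts for role in ("buyer", "target")}
    )
    return entities, facts

text = "Acme Corp acquired Globex. Initech bought Acme Corp."
entities, facts = extract_facts(text)
print(entities)  # ['Acme Corp', 'Globex', 'Initech']
for fact in facts:
    print(fact["buyer"], "->", fact["target"])
```

The entity list answers what the statements are talking about;
the fact records answer what is being said about them.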

~~~ Enterprise Applications for Information Extraction ~~~~~~~~~

The word application is often used in the context of enterprise
software to mean the problem area addressed by the software.  A
number of problems or organizational needs that can benefit from
extraction technology are showing up across many industries.
Solutions to them typically depend on more sophisticated
retrieval or routing capabilities, or on the mining or alerting
capabilities enabled by information extraction.  A quick survey
of some of these cross-industry applications shows considerable
resemblances among them.

Regulatory compliance.  Increasingly, large businesses or
organizations are being required by laws or proactive policies
to disclose various communications or documents to the public or
to governmental agencies; or to monitor or restrict certain
communications with their customers; or to retain or destroy
documents for some period of time or under certain conditions.
Examples of regulations include the filing requirements on
customer complaints related to pharmaceuticals, HIPAA in the
healthcare industry, and of course, the most visible of such
regulatory acts, namely, Sarbanes-Oxley in the area of corporate
accountability.  A typical example of an application of
extraction technology in this arena is to monitor emails between
brokers and their clients for inappropriate messages and forward
them to compliance officers.

Legal Discovery.  In preparing for litigation, law firms, on
behalf of their clients, dig through thousands or millions of
documents looking for evidence to build their cases.  Indices of
the people, organizations, and subjects and maps of the
communications can help focus or prioritize discovery work.  As a
case develops, it also becomes important to re-search based on
new lines of thought.  Because many of the documents are informal
and are created by different people, it is important to be able
to deal with vocabulary and name variation.  These highlighted
aspects of legal discovery also apply to many of the other
collection-oriented applications below.
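
As one small illustration of handling name variation, the Python
sketch below groups different written forms of a personal name
under a crude key of surname plus first initial.  The key and the
sample names are my own assumptions; the key would wrongly merge,
say, Robert Smith and Roberta Smith, so treat it as a starting
point only.

```python
import re
from collections import defaultdict

def name_key(name: str) -> tuple[str, str]:
    """Reduce a personal name to (surname, first initial).

    Assumes the name contains at least one alphabetic word.
    """
    name = name.strip()
    if "," in name:  # "Smith, Robert" -> "Robert Smith"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    parts = re.findall(r"[A-Za-z]+", name)
    return parts[-1].lower(), parts[0][0].lower()

def group_variants(names):
    """Group name strings that share the same crude key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)

# Hypothetical names as they might appear across documents.
names = ["Robert Smith", "Smith, Robert", "R. Smith", "Jane Doe"]
for key, variants in group_variants(names).items():
    print(key, variants)
```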

Mergers and acquisitions.  Mergers depend crucially on being able
to integrate the content resources and activities of multiple
organizations, particularly because large mergers are usually
followed by attrition and headcount reduction.  Meanwhile, the
new organization typically has to handle all the same workload,
so it becomes all the more important to be able to understand
what information is available and to use it after the merger.

Corporate Licensing.  Many large corporations accumulate large
intellectual property (IP) portfolios through Research and
Development as well as Mergers and Acquisitions.  Increasingly,
corporations look to external sources to license key technologies
and look for revenue opportunities from licensing their own IP.
Beyond the patents of a company, this application requires
dealing with other internal documents, the patents of others, and
external scientific, technology, and marketplace documents.

Competitive intelligence.  Monitoring the market for competitive
and marketplace dynamics is one of the oldest applications of
search technology.  Yet, this application is fundamentally about
the fine-grained understanding of the interactions between the
players, products, technologies, strategies, actions and so on in
the marketplace.  In the past, large companies tended to serve
this function through small departments staffed with skilled
research librarians and competitive intelligence specialists who
followed well-defined methodologies.  This approach hasn't been
able to keep pace with the increasingly complex competitive and
marketplace landscape, nor with the increasing variety or amount
of available information and user needs across large global
organizations.

Product Development.  Companies produce large amounts of content
during research and development, and also acquire publicly or
commercially available content.  For example, life sciences
companies leverage public content funded by government agencies,
e.g. the National Institutes of Health, as well as content from
large electronic publishers.  Pressure on the pharmaceutical
industry to improve its drug discovery and development processes
is mounting rapidly.  Though work has gone into integrating and
curating structured data sources (e.g. experimental data),
internal textual content remains relatively underutilized.

Marketplace Feedback.  Internet content sources and customer
email and surveys contain valuable feedback for an organization.
Monitoring statements made about a company or its products in the
press, on websites, in blogs, in discussion groups, and directly
to the customer support organization can help evaluate brand
perception and company reputation.  Such monitoring can help tune
corporate and product marketing activities, as well as help focus
product development efforts on important areas for improvement or
greater opportunity.

Customer self-support.  All successful product companies must
ultimately focus on support costs for their products.  One
strategy that many companies are pursuing is to publish product
and support information through interfaces that allow their
customers to retrieve relevant support information directly.
Besides mitigating costs for the company, a positive user
experience that leads to solving the customer's problem can also
enhance the company's brand.

Supplier management.  Large organizations that provision products
and services from a large number of suppliers often struggle with
product documentation and service level agreements.  Internal
users (e.g. product development) often must depend on an internal
service department to figure out how to find necessary
information.  Suppliers and products and agreements are
constantly changing, so manual organization efforts quickly fall
behind.  

Government Intelligence and Investigation.  Intelligence agencies
are focused on the mission of preventing terrorist attacks, while
Law Enforcement organizations are looking for clues that might
help them find the bad guys.  These missions have access to huge
repositories of textual content gathered by multiple agencies and
departments including field reports, collection summaries,
immigration records, Web page content, emails, message traffic,
open source news feeds and the like.  These mining and alerting
applications fundamentally depend on extracting important
entities like people, places, organizations, various kinds of
indications and warnings, and connecting them through statements
made in documents.

All these different applications rely on seeing content as a
series of possibly useful statements about objects of interest.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ramana Rao is Founder and CTO of Inxight Software, Inc.
Copyright (c) 2003 Ramana Rao.  All Rights Reserved.
You may forward this issue in its entirety.

See:  http://www.ramanarao.com
Send:   [email protected]
Archive:  http://www.ramanarao.com/informationflow/archive/
Subscribe:  mailto:[email protected]
Unsubscribe:  mailto:[email protected]