~~~ Ramana Rao's INFORMATION FLOW ~~~ Issue 2.10 ~~~ Oct 2003 ~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Information Flow is an opt-in monthly newsletter. Your email
address was entered on www.ramanarao.com or www.inxight.com.
You may forward this issue in its entirety.
Send me your thoughts and questions: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~ IN THIS ISSUE ~~~ October 2003 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Introduction
* Shifting from Finding Documents to Grasping Statements
* Enterprise Applications for Information Extraction

~~~ Introduction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This month I've concentrated on capturing some thoughts that I've found myself explaining a lot. This issue is a first pass at explaining the types of applications, as well as some particular cross-industry applications, enabled by information extraction technology. In a nutshell, extraction focuses on capturing some useful portion of the meaning embedded in the sentences of the documents of content collections.

And as I'm about to turn into a pumpkin, I've granted myself a break from gathering articles for the classics of Information Flow and other links. I hope you don't mind.

~~~ Shifting from Finding Documents to Grasping Statements ~~~~~

Stretching a point to make it, I often claim that content may be the most underutilized asset in large organizations today. Organizations typically buy or create large amounts of content at great cost, yet little attention goes to truly leveraging it in broader knowledge processes. By content, I mean collections of electronic textual documents including research reports, product collateral, development specifications, internal memos, sales materials, patents and invention proposals, press releases, news articles, scientific literature, email messages, and so on.

Search has been the focus of past efforts to leverage organizational content, yet the shortcomings of traditional search are widely understood. For example, search typically focuses on helping users find documents, yet users aren't really interested in the documents per se, but rather in what they say. Herein lies a key insight: content is made of human language statements about the world. So it makes sense that the next leg of the journey in fully utilizing content will depend on technologies that focus on processing the statements in the content, not just on finding documents. Here I mean not just single statements that stand out for their uniqueness or relevance to our pursuits, but also patterns over entire collections. There is signal and meaning in the stocks and flows of content, and we can go after these with software.

For many, this immediately conjures up the spectre of solving the grand scientific challenge of natural language understanding by machines, but let's hold that thought and first take a look at various types of content-use applications.

Types of Applications

All content-use applications have something to do, surprise, with content and with users. In particular, they all enable some kind of interaction between the information needs of humans and the meaning-bearing streams of content. Differences in the nature of interaction and handoff between the system and the user define distinct types of applications. First, activity may be driven by the user or by the flow of content. Second, the focus may be on providing documents or elements of documents to the user, or rather on analyzing or processing the contents of statements contained in the documents. These two axes capture four basic types of applications:

Retrieval -- users find and understand relevant documents
Routing   -- system routes relevant documents to people
Mining    -- users explore or analyze collections or flows
Alerting  -- system generates events or reports
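
To make the two axes concrete, here is a minimal sketch in Python. The enum and mapping names are my own invention, purely for illustration:

  from enum import Enum

  class Initiator(Enum):
      USER = "user-driven"        # activity starts with a user's need
      CONTENT = "content-driven"  # activity starts with arriving content

  class Focus(Enum):
      DOCUMENTS = "provide documents"    # hand documents to the user
      STATEMENTS = "process statements"  # analyze what the documents say

  # Crossing the two axes yields the four basic application types.
  APPLICATION_TYPES = {
      (Initiator.USER, Focus.DOCUMENTS): "Retrieval",
      (Initiator.CONTENT, Focus.DOCUMENTS): "Routing",
      (Initiator.USER, Focus.STATEMENTS): "Mining",
      (Initiator.CONTENT, Focus.STATEMENTS): "Alerting",
  }

  for (initiator, focus), name in APPLICATION_TYPES.items():
      print(f"{name}: {initiator.value}, {focus.value}")
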
In retrieval applications, activity is user-initiated based on information needs that arise during tasks or projects. Retrieval applications are certainly the most widely deployed and understood type of application. Information retrieval has been an active field for almost the entire history of computing, and the Internet has catapulted it into the mainstream. Though the focus with retrieval is on finding documents, the requirement of relevance underscores the importance of knowing what a document is about. So, even here, the use of content analysis can dramatically improve retrieval systems.

Routing flips the retrieval paradigm by turning the pull of retrieval into the push of content-triggered delivery, for example, to an email box. Routing makes sense when information needs are not just one-time, but instead recur based on broader roles or organizational needs. A simple example is a syndication service that matches new documents against saved queries or user profiles. Broader organizational applications include routing of documents to the right people for further processing, e.g., routing patents to examiners or support cases to relevant specialists. Because routing "pushes" content at people, it requires finer discrimination of what a document is about; otherwise the push can quickly feel like a shove.

While retrieval and routing applications can be improved by finer-grained processing of contents, mining and alerting applications absolutely require such processing. Mining applications enable users to explore the statistics of content collections or flows, looking for interesting patterns or occurrences. Mining applications turn text documents into structured data that can be combined with other data sources and integrated into statistical or business intelligence applications. Alerting applications are the routing-style obverse of mining. They notify users when particular patterns or events occur in content flows.

Picking Statements Apart

All the types of applications described above depend on analyzing content to "understand" some portion of the meaning of its statements. Somewhere between one extreme of completely depending on humans to extract meaning and the other of expecting machines to fully understand content themselves (whatever that may mean), we can target particularly useful aspects of meaning and particularly reliable extraction methods.

Content analysis can be viewed as the processing of content into structured representations or databases that capture some aspects of the meaning of the content's statements. To get at meaning, we can ask: what is the statement talking about, and what is it saying about that? These questions highlight the two basic mechanisms for meaning in statements. Statements "refer" to objects in the world and they "say" something about them.

A search index can be seen as a trivial example of such a structured database. It provides a table of how many times and where words are used in the documents of a content collection. Its model of the world is that the world has documents in it, and that the words used in a document tell you what the document is about.
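
To make the search index example concrete, here is a minimal sketch of such a word table in Python; the documents are invented for illustration:

  from collections import defaultdict

  # Toy documents standing in for a content collection.
  documents = {
      "doc1": "acme acquires widgetco",
      "doc2": "widgetco founder joins acme board",
  }

  # A tiny inverted index: word -> document -> word positions.
  # This records how many times and where each word is used.
  index = defaultdict(lambda: defaultdict(list))
  for doc_id, text in documents.items():
      for position, word in enumerate(text.split()):
          index[word][doc_id].append(position)

  print({doc: positions for doc, positions in index["widgetco"].items()})
  # -> {'doc1': [2], 'doc2': [0]}
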
At the other extreme is a semantic network of the type typical of knowledge-based systems in artificial intelligence. Semantic networks try to model the complete "meaning" of the statements in a way that makes the meaning accessible to machine reasoning systems.

In between these two structures, we can imagine a database that, like the semantic network, truly refers to objects in the world, but that makes more limited types of statements. These statements are of high value in particular applications and can be reliably generated from textual content. Again, it's about looking for sweet spots that balance utility and viability.

For example, consider a collection of articles about company events. The world covered by the statements in the collection is familiar. It includes people, companies, roles people play in companies, corporate events (e.g., founding, bankruptcy, mergers and acquisitions), and so on. A structured database over this space of objects and relationships would capture more meaning than a simple word index while not providing the structure to answer arbitrary questions about the contents of the articles.

This kind of analysis technology is called information extraction. It includes entity extraction, figuring out what objects in the world a statement is talking about, and fact extraction, figuring out what the statement is saying about them. It is the key technology for moving forward in our efforts to leverage content. It focuses on the meaning of statements in content and on the problem of graspability rather than that of findability.
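
Here is a minimal sketch of what entity and fact extraction over company-event statements might produce. A single invented regular-expression pattern stands in for the much richer linguistic analysis a real extraction system performs; the sentences are made up:

  import re

  sentences = [
      "Acme Corp acquired WidgetCo in 1999.",
      "Jane Smith founded WidgetCo.",
  ]

  # One toy pattern: <entity> acquired|founded <entity>.
  FACT_PATTERN = re.compile(
      r"(?P<subject>[A-Z]\w*(?: [A-Z]\w*)*) "
      r"(?P<event>acquired|founded) "
      r"(?P<object>[A-Z]\w*(?: [A-Z]\w*)*)"
  )

  facts = []
  for sentence in sentences:
      match = FACT_PATTERN.search(sentence)
      if match:
          # Entity extraction: what the statement talks about.
          # Fact extraction: what it says about those entities.
          facts.append((match["subject"], match["event"], match["object"]))

  print(facts)
  # -> [('Acme Corp', 'acquired', 'WidgetCo'),
  #     ('Jane Smith', 'founded', 'WidgetCo')]

Each extracted row is exactly the kind of limited but reliable statement that a structured database over company events would store.
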
~~~ Enterprise Applications for Information Extraction ~~~~~~~~~

The word application is often used in the context of enterprise software to mean the problem area addressed by the software. A number of problems or organizational needs that can benefit from extraction technology are showing up across many industries. These solutions typically depend on the more sophisticated retrieval or routing capabilities, or the mining or alerting capabilities, enabled by information extraction. A quick survey of some of these cross-industry solutions shows considerable resemblance among them.

Regulatory compliance. Increasingly, laws or proactive policies require large businesses and organizations to disclose various communications or documents to the public or to governmental agencies; to monitor or restrict certain communications with their customers; or to retain or destroy documents for some period of time or under certain conditions. Examples of regulations include the filing requirements on customer complaints related to pharmaceuticals, HIPAA in the healthcare industry, and, of course, the most visible of such regulatory acts, Sarbanes-Oxley in the area of corporate accountability. A typical application of extraction technology in this arena is to monitor emails between brokers and their clients for inappropriate messages and forward them to compliance officers.

Legal discovery. In preparing for litigation, law firms, on behalf of their clients, dig through thousands or millions of documents looking for evidence to build their cases. Indices of the people, organizations, and subjects, and maps of the communications, can help focus or prioritize discovery work. As a case develops, it also becomes important to re-search based on new lines of thought. Because many of the documents are informal and are created by different people, it is important to be able to deal with vocabulary and name variation. These highlighted aspects of legal discovery also apply to many of the other collection-oriented applications below.

Mergers and acquisitions. Mergers depend crucially on being able to integrate the content resources and activities of multiple organizations, particularly because large mergers are usually followed by attrition and headcount reduction. Meanwhile, the new organization typically has to handle all the same workload, so it becomes all the more important to be able to understand what information is available and to use it after the merger.

Corporate licensing. Many large corporations accumulate large intellectual property (IP) portfolios through research and development as well as mergers and acquisitions. Increasingly, corporations look to external sources to license key technologies and look for revenue opportunities from licensing their own IP. Beyond the patents of a company, this application requires dealing with other internal documents, the patents of others, and external scientific, technology, and marketplace documents.

Competitive intelligence. Monitoring the market for competitive and marketplace dynamics is one of the oldest applications of search technology. Yet this application is fundamentally about the fine-grained understanding of the interactions among the players, products, technologies, strategies, actions, and so on in the marketplace. In the past, large companies tended to serve this function through small departments staffed with skilled research librarians and competitive intelligence specialists who followed well-defined methodologies. This approach hasn't been able to keep pace with the increasingly complex competitive and marketplace landscape, nor with the increasing variety and amount of available information and user needs across large global organizations.

Product development. Companies produce large amounts of content during research and development, as well as acquire publicly or commercially available content. For example, life sciences companies leverage public content funded by government agencies, e.g., the National Institutes of Health, as well as content from large electronic publishers. Pressure is rapidly mounting in the pharmaceutical industry to improve drug discovery and development processes. Though work has gone into integrating and curating structured data sources (e.g., experimental data), internal textual content remains relatively underutilized.

Marketplace feedback. Internet content sources, customer email, and surveys contain valuable feedback for an organization. Monitoring statements made about a company or its products in the press, on websites, in blogs, in discussion groups, and directly to the customer support organization can help evaluate brand perception and company reputation. Such monitoring can help tune corporate and product marketing activities, as well as help focus product development efforts on important areas for improvement or greater opportunity.

Customer self-support. All successful product companies must ultimately focus on support costs for their products. One strategy that many companies are pursuing is to publish product and support information through interfaces that allow their customers to retrieve relevant support information directly.
Besides mitigating costs for the company, a positive user experience that leads to solving the customer's problem can also enhance the company's brand.

Supplier management. Large organizations that provision products and services from a large number of suppliers often struggle with product documentation and service-level agreements. Internal users (e.g., product development) often must depend on an internal service department to figure out how to find necessary information. Suppliers, products, and agreements are constantly changing, so manual organization efforts quickly fall behind.

Government intelligence and investigation. Intelligence agencies are focused on the mission of preventing terrorist attacks, while law enforcement organizations are looking for clues that might help them find the bad guys. These missions have access to huge repositories of textual content gathered by multiple agencies and departments, including field reports, collection summaries, immigration records, Web page content, emails, message traffic, open-source news feeds, and the like. These mining and alerting applications fundamentally depend on extracting important entities like people, places, organizations, and various kinds of indications and warnings, and on connecting them through statements made in documents.

All these different applications rely on seeing content as a series of possibly useful statements about objects of interest.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ramana Rao is Founder and CTO of Inxight Software, Inc.
Copyright (c) 2003 Ramana Rao. All Rights Reserved.
You may forward this issue in its entirety.

See: http://www.ramanarao.com
Send: [email protected]
Archive: http://www.ramanarao.com/informationflow/archive/
Subscribe: mailto:[email protected]
Unsubscribe: mailto:[email protected]