Capture, Distribution and Evolution of Information Needs in a Process-Oriented Knowledge Management Environment

Jochen Schäfer

14. April 2003

Department of Computer Science
University of Kaiserslautern

Working group Artificial Intelligence
Prof. Dr. Michael M. Richter

Betreuer:
Prof.Dr. Michael M. Richter
Dr. Harald Holz

Introduction
KM support in MILOS
Collaborative Filtering
- Motivation
- CF
Task-specific CF
Conclusion
- Discussion
- Related work
  - GroupLens
  - CALVIN
  - CASPER
  - PHOAKS
  - Entree
  - Google
- Outlook
Details
- Data structures
- Query expression parser
Glossary
Bibliography
List of Figures
List of Tables
List of Algorithms
Footnotes
About this document ...

Introduction

"Our ability to imagine complex applications will always exceed our ability to develop them"

Grady Booch

Introduction

In a software engineering (SE) project there are many tasks that have to be dealt with, e.g. software design, project planning, implementation, testing and more. The staff depends on various pieces of information to fulfill their tasks as efficiently as possible. Typically, this information has to be obtained by the particular employees themselves. For the purpose of planning and managing a software project in an automatic and systematic way through process models, Process-centered Software Engineering Environments (PSEE) were developed.

Processes are a set of activities with the aim of reaching a goal. These activities will be called tasks in the following. A process model is a representation of such a process. The key elements of a process model are the tasks, the team members ^1.1 that execute these tasks, products that are produced by the tasks, e.g. source code and resources that are needed to execute the tasks [GJ96].

Garg et al [GJ96] specify three areas that are connected by a PSEE:

Process engineering: Definition and maintenance of process models.
Software engineering: Development and maintenance of software products, which are created according to the process model.
Project management: Coordination and supervision of all activities in software engineering.

Each PSEE is designed around a set of heuristics and assumptions about what information is needed by what agent for which task. For this information a variety of access concepts are defined, e.g. a to-do list or that each task has input and output documents.

Most existing approaches have one or more of the following drawbacks:

Many relevant information resources exist outside of the PSEE which cannot be found easily in the overwhelming amount of irrelevant information resources, a prime example being web sites. For more details see [Kön00].
Most approaches only have static links to information resources, i.e. the information items may be task-specific, but they are not able to adapt to the current situation. E.g. if the required tool set for a task changes, these static links do not change also. They do not have a feedback loop to
- allow users to post requests for additional information,
- learn lessons from these requests about other information resources in the future.

In order to show how this problem can be addressed, Könnecker [Kön00] introduced the concept of information needs into the existing PSEE MILOS (Minimally Invasive Long-term Organizational Support).

His approach makes it possible to explicitly model recurrent information needs with questions and presents ways to satisfy them by helping finding answers to the questions.

Although his concept constitutes a major advancement, there are some problems with the original implementation that this thesis addresses.

Goals of this thesis

First, there is a high up-front modelling effort, since the original implementation demanded that several aspects of an information need, especially preconditions, are modelled correctly by a group of knowledge engineers^1.2. The knowledge department is required to capture and maintain the information needs for each task. The situation becomes especially unpleasant if there is no knowledge department, so that other agents, who usually are not trained in modelling recurrent information needs, are hindered at doing their original work.

Second, there are some structural and conceptual problems that can make maintaining recurrent information needs difficult. E.g. each information need can contain only one information source.

Third, useful information found by colleagues during enactment of earlier similar tasks are not captured and distributed to agents causing superflous information searches.

The goal of this thesis is to lower the burden of managing and handling information resources by providing the agents with tools that allow them to easily capture new information resources and that automatically recommend these resources to certain other agents.

In order to do this we will

introduce privately managed information needs for tasks, so that each agent can maintain his own set of information needs without having to contact the knowledge department. To further ease the acquisition of information items, we will enable the agents to use internal or external message boards by means of a ``post information request'' action (cf section 2.4.2),
refactor Könnecker's design in order to remove the structural problems (cf chapter 2.4.1),
introduce peer-recommended lists of captured information resources that are created by making recommendations using Collaborative Filtering (CF) and task-specific information (cf chapter 4). The use of Collaborative Filtering is based on the assumption that agents that used similar information resources in the past will do so in the future.

Structure of this thesis

In Chapter 2 we briefly describe the PSEE MILOS and the concepts introduced by Könnecker [Kön00]. Furthermore, we describe the refactoring done to overcome some limitations of the original information need support by explaining what the limitations were and how we solved them.

In Chapter 3 we give a short introduction to Collaborative Filtering that will be used for the extensions described in Chapter 4.

In Chapter 4 we will lay out how we want to capture information about the use of information needs, how to distribute them to other users, and how to evolve them so that the distribution of information needs gets more relevant in respect to tasks.

Chapter 5 compares the approach presented here to related works, summarizes it and outlines future work.

The appendix A describes technical and implementational details that do not appear in the thesis. A glossary of important terms used in this thesis is contained in Appendix B.

Knowledge management support in MILOS

"I hear and I forget. I see and I remember. I do and I understand."

Confucius

In the following we will describe the PSEE MILOS and the original information need implementation.

MILOS

MILOS is a system to support software development in distributed project groups by providing tools for process modelling, project planning and project plan enactment.

MILOS provides a process modelling component to define process models based upon three basic types: processes, methods and parameters. Relations of instances of these types define what needs to be done in a project that is conducted by the process model. A process is a description of tasks that need to be executed to fulfill a project goal. A method describes how a process goal can be achieved. The process model can contain a set of methods for every process. A parameter describes products that are used or created within a process, e.g. the programming specification for the software product. A process model is shown in the process model editor of MILOS (figure 2.1). In this example, the process ``Component-based development'' is decomposed by a method ``Iterative enhancement'' into the subprocesses ``Requirements analysis'', ``Component design'', ``Component implementation'', ``Component testing'' and ``System integration''. For the subprocess ``Component design'' the methods ``OMT'', ``UML'' and ``SDL'' are defined.

**Figure 2.1:** Process model editor in MILOS (from [Kön00])

While process models are generic reusable descriptions of how the processes should be executed (i.e. containing best-practices with respect to processes), concrete projects have specific needs which are satisfied with a project plan. A project plan can be composed of several process models. A project plan that is derived from the process model in figure 2.1 is shown in the project plan editor of MILOS (figure 2.2).

**Figure 2.2:** Project plan editor in MILOS (from [Kön00])

Project enactment guides the execution of the project plan and leads to creation of the actual products. The MILOS system has a workflow engine to provide the enactment environment. Figure 2.3 shows the To-Do-GUI of MILOS. The current agent Barbara can manage her assigned tasks and the products of each of these tasks. Furthermore, the agent can access messages from events that MILOS creates during project enactment. E.g. if the project plan changes, the affected agents will get notified. The change notification can be defined by ECA (Event Condition Action) rules. This allows on-the-fly changes of project plans.

**Figure 2.3:**To-Do-GUI of MILOS (from [Kön00])

In his diploma thesis [Kön00] Könnecker introduced the concept of information needs into MILOS. An information need describes a situation where an agent requires certain information items to fulfill his current task or to enhance the quality of the produced objects. Information needs are explicitly modelled in MILOS with information need objects (INO). In this context INOs represent information needs as questions. Furthermore, an INO contains preconditions that formally determine when it arises, and an information source and corresponding access specifications, that help answering the question.

Mayer described in this diploma thesis [May01] how Orenge (Empolis http://www.empolis.com), a case based reasoning (CBR) tool, was integrated into MILOS. In particular, this extension provided the means to characterize tasks by attribute-value-pairs. We will explain the details of the support provided by this extension in section 2.3.

In the following we will explain the concepts, that were introduced in [Kön00].

Representing information needs

Concepts

Planning, management and execution of projects are difficult tasks. To decide what actions have be taken, information has to be accessed, e.g. to decide which agent is suitable for performing a certain task. Further difficulties arise when project plans, requirements and documents change, so that required information changes.

In MILOS, the representation of information needs are covered by the following aspects:

what information items might be useful (usually as a question),
where and how can this information be found (i.e. a information source that provides access to the wanted information),
when and to whom it might be useful (i.e. constraints on activities, role, skills etc.).

Information sources help satisfying the information need by giving access to information that might be helpful in task enactment.

The motiviation for the original implementation was to enable users of the Information Assistant (IA) to have information resources that may need for fulfilling their tasks presented to them proactively.

In the following we will distinguish between information needs and sources, being abstract concepts, and information need objects (INO) and information source objects (ISO) which are the implementation objects in MILOS.

Information need and information source objects

The UML diagram in figure 2.4 shows the interdependence of class InfoNeed, which represents an INO, and class InfoSource, which represents an information source object (ISO). One can recognize that an INO can contain an ISO. But an ISO can be associated with many INOs.

**Figure 2.4:** UML class diagramm of Könnecker's INO implementation

An information need as an object (INO) contains the following attributes:

Representation: The representation describes the information need and is a text that usually comes in the form of a question. Because the representation may fit for a set of processes with common aspects, the representation can contain attributes, which will be replaced with the current values of the attributes before the INO is presented to the user. E.g. consider the representation ``What agent is assigned to process <process>?''. For a process named ``Implement GUI'' the attribute ``<process>'' will be replaced with ``Implement GUI'', so that the representation becomes ``What agent is assigned to process Implement GUI?''.
Supported role: The role concept lets a knowledge engineer refine the context in which the information need should be used. Some information needs can only be used reasonably by agents with a specific role, e.g. an information need about programming details is usually not valid for project planners.
Information category: The information category can be used to further refine the usage context. Categories can help to structure the information needs, which otherwise would be presented in a large unstructured set.
Precondition: The precondition defines which conditions have to be satisfied in the current context of an information need. During enactment of a task the precondition will be evaluated. The INO will only be presented to the user if the result evaluates to true.
Information source: The information source represents the resource from which information can be retrieved to satisfy the information need.
Query representation: The query representation specifies a query to the information source. Like the representation the query can contain attributes that will be replaced before the query is executed.

**Table 2.1:** Information need object representation (from [Kön00])
INO attribute	Example value
Representation	What bugs have been reported for the system <system> and the agent <assigned agent>?
Information source	Bugzilla Search-for-Bugs
Query	?bug_status=NEW&agent_name=<assigned_agent>&product=<system>
Supported role	Programmer
Information category	product creation
Precondition	system.numberBugs > 0

An example for an INO is shown in table 2.1. For further details on this implementation see [Kön00].

An information source has attributes as follows:

Information source name: The name is an unique identifier for the information source.
Contents description: The contents description explains in more details the purpose of the information source to help the user identify the information source.
Interface type: The interface type identifies by which means the information is accessed. Könnecker had implemented two types, CGI for access of web sites and ``MILOS Internal'' for access of information that are internal to the MILOS system.
Query execution access: This attribute defines the access point of the information source. During execution of a query, this will be used to form the actual query that is sent to the information source.
Query engine access: The engine access provides a means to access the information source and execute a query manually. This is also useful for creation and testing of a query.
Query syntax: This attribute specifies the syntax of queries to help users in creating a query.
Example query: This attribute serves as an example for a valid query and as a possibility to test the query access.

**Table 2.2:**Information source object representation (from [Kön00])
ISO attribute	Example value
Information source name	Mozilla.org Search-for-Bugs
Contents description	Information source to search for a list of recorded bugs of currently managed systems.
Interface type	CGI
Query execution access	http://wwwagr.informatik.uni-kl.de/~bugzilla/buglist.cgi
Query engine access	http://wwwagr.informatik.uni-kl.de/~bugzilla/query.cgi
Query syntax	?bug_status={NEW, CLOSED, ASSIGNED, VERIFIED}&email1=String&product=System&...
Example query	?bug_status=NEEW&emailtype1=substring&email1=holz&emailassigned_to1=1

An example for an ISO is displayed in table 2.2. For further information on representation of information sources consult [Kön00].

Graphical user interface

Figure 2.5 shows the Information Assistant (IA) as implemented by Könnecker. The tree display in the upper area shows all available information needs sorted by roles and categories and whose precondition holds in the current task's context.

**Figure 2.5:** The Information Assistant (from [Kön00])

By pressing the button ``Answer'', the user initiate the execution of the information source's query and the result of the query will be presented to the user. The user is also able to access the information source directly by pressing the button ``Access to IS...'' in the lower part of the IA one can create a new INO by posting a question.

The creation and editing of information needs is done with the Information Need Manager. Since INOs are always associated with project plan or process model objects, the knowledge engineer has to select the appropriate object, e.g. ``Implementation'' in figure 2.6, in the upper left part, causing the already created information needs, if any, to be displayed in the upper right part. The knowledge engineer is now able to create new INOs and edit or delete existing information needs.

**Figure 2.6:** The Information Need Manager (from [Kön00])

The attributes of a selected INO are displayed in the lower part of the manager, where they can be edited by pressing the appropriate ``Edit''-button.

For the attribute ``Information Source'' of an INO, the user selects one of the defined ISOs that are specified within the Information Source Manager (cf figure 2.7).

**Figure 2.7:** The Information Source Manager (from [Kön00])

Information sources are created by using the Information Source Manager (figure 2.7). The operation of the Information Source Manager is similar to the operation of the Information Need Manager, except that all ISOs are global to the system, so that there is only one list of ISOs.

Workflow

Figure 2.8 depicts an example workflow from IA usage scenario, which we will describe in the following:

Agent Barbara has opened her To-Do-List (a) and selected the task ``Implement WFE component''. By pressing the button ``Info Assistant ...'' she invokes the Information Assistant (b).

The IA displays some information needs in diverse categories. Agent Barbara is interested in information ``What bugs have been report for the project 'Distributed workflow'?'' at the moment, so she select this INO and presses the ``Answer'' button.

The IA computes and executes the query and presents the result, in this example a bug report from Bugzilla.

**Figure 2.8:** An example IA workflow (from [Kön00]): (a) To-Do-List, (b) Information Assistant, (c) query result

Case Based Reasoning

The classic approach to Case Based Reasoning (CBR) is to split cases into a case specification part and a solution part. Then the CBR system uses the case specification to find similar cases and adapt the solutions of the retrieved cases to the current problem. The last step is to retain the learned case and save it to the database of known cases. This is called R⁴ model derived from the four steps Retrieve, Reuse, Revise and Retain [AP94] and is depicted in figure 2.9.

**Figure 2.9:** CBR cycle according to Aamodt and Plaza [AP94]

In [Bur00b], Burke argues that Collaborative Filtering has some similarities to case-based reasoning:

In Collaborative Filtering a user votes on items. The item-vote-pair can be regarded in case-based reasoning terms as a case the user adds. The problem in Collaborative Filtering lies in the fact that a prediction has to be made about how much the user will like a new item he has not voted on yet. In case-based reasoning terms, this is a problem situation, where a case is specified containing an item that the user has not voted on. The solution to the problem corresponds to the prediction.

Orenge integration

Mayer describes in his diploma thesis [May01] how he integrated similarity-based knowledge management into MILOS with Orenge.

The main concept is the characterization of project plan and workflow objects. This enables the user to annotate the plan elements with attribute-value-pairs, on which one can define similarity measures [Ric,RA99]. To formalize characterization objects and ease the creation of similarity measures, the characterization objects are instances of characterization types that are organized in a type hierarchy. Mayer calls the set of characterization types SEDM (Software Engineering Domain Model) [May01]. Figure 2.10 depicts an excerpt from an example characterization type hierarchy of process types.

**Figure 2.10:** Characterization type hierarchy
SEDMProcess DesignProcess ImplementationProcess Implement_SW_System Implement_SystemComponent MigrationProcess ReviewProcess TestingProcess BlackBox_Testing GlassBox_Testing

Every time a process is created in MILOS, a new case containing the characterization of the new process is recorded. So the case base is made up of characterizations of previous processes.

The user can define queries on the characterizations. Furthermore the user is able to define weights for every attribute in a query, thereby emphasizing or deemphasizing the attribute.

**Figure 2.11:** Characterization editor

Limitations

The information need support in MILOS as presented above has several limitation and weaknesses which we will explain in this section. For each of the limitations we will furthermore describe how we solve them.

We can divide the problems into structural ones and conceptual ones. The former are determined by deficiencies in the representation of INOs, while the latter are determined by restrictions in the functionality of the IA.

Structural problems

Single information source for an INO

The implementation of Könnecker [Kön00] uses a one-to-one relationship between INO and ISO. This makes it impossible to offer multiple answers for the same information need. If one wanted to know possible sources of information about Java programming, many information sources are available that can help in satisfying the information needs, e.g. agents, who know about Java or web sites about Java. This means that a set of INOs would have to be created that all represent the same information need, but have different information sources.

To solve this issue, we allow every INO to have a list of ISOs. Since every information source needs its own query, the query cannot be specified in an INO anymore. This leads to the concept of ``Information Source Usage Recommendation''^2.1 (ISUR), which consists of the query and the ISO. Instead of referencing an ISO, an INO now references the usage recommendation. Section A.1.2 describes the representation of an ISUR.

Preconditions

Each INO has a precondition that controlles whether or not an INO is shown in the IA. One has to put all conditions in one single expression, which would become very hard to maintain over time.

This issue is addressed by the following steps:

We introduced usage recommendations (ISUR) as suggested in [Hol02] and detailed in section 2.4.1.1.
We introduced a set of preconditions that was included in INOs as well as in ISURs. The preconditions of INOs control the appearance of INOs in the user's IA, while the precondition of ISURs control the validity of ISURs in regard to helping to satisfy an information need.
1. Activity: Constraints concerning the actual activity. Example: EJB development.
2. Role: Constraints concerning the role of the owning agent. Example: Agent is a programmer.
3. Skill: Constraints concerning the skill of the owning agent. Example: Agent is EJB novice.
4. Topic: Constraints concerning the required topics of the information need. Example: Task's key topic is ``EJB''.

Each aspect can specify a condition expressed in a formal language based on Java. Also included is a means to include variable expressions from the MILOS context that can be evaluated with the result replacing the expression. E.g. ``<process:name>'' would be evaluated to the name of the current process. For more details on this parser refer to the appendix A.2.

Single information source type

Furthermore, there is only one type of ISO implemented, which would constrain the possible information sources. The use of the existing ISOs for other types of information sources, e.g. bookmarks, would be possible but counter-intuitive, making usage much more difficult.

Our solution to this issue was to identify the common attributes of all information source objects, such as name and description, and to introduce a new abstract class InfoSourceBase, which contains these common attributes. The old class InfoSource now inherits these attributes from InfoSourceBase. Now we are able to derive as many new types of information sources as we need, e.g. bookmarks (ISBookmark) or agents (ISAgent). The old class hierarchy is shown in figure 2.4, whereas the new class hierarchy is shown in section A.1.

No refinement support for information needs

Here we have three issues:

Könnecker's implementation only allows for one category per INO, but we realized that there is the possibility that an INO could be filed in more than one category. E.g. consider an information need ``What informations about EJB exist?''. This information need can be categorized under ``Distribution techniques'' as well as under ``Database connections''.
We have no way of supporting the user in satisfying a complex information need. E.g. consider again the above information need. Since EJB is a very complex topic, we want a way to ease the task of finding answers to this information need.
MILOS allows refinement of processes by means of subtyping similar to the inheritance facility known to object-oriented programming languages. E.g. process types ``Glass-Box Testing'' or ``Black-Box Testing'' could be defined to be subtypes of ``Testing''. Since INOs are inherited by a subtype, it can happen that some inherited INOs have to be modified to be suitable. An example is depicted in figure 2.12.

We solved the first issue by allowing that INOs can have more than one category. Furthermore every category can be nested into any other, so a hierarchy of categories can be created. E.g. if we have a category ``EJB'' containing all information needs concerning EJB, the list of contained INOs would be rather long and unmanageable. So we can create sub categories like ``DB technology'' or ``Application servers'' to better manage INOs.

The second issue was solved by introducing the concept of sub information needs according to [Hol02]. Sub information needs are a set of information needs that are referenced by another information need. The sub information needs are supposed to help in satisfying the referencing information need by decomposition into easier information needs. Since sub information needs are itself information needs, they too can have sub information needs.

**Figure 2.12:** Specialization example according to [Hol02]

As a solution to the third issue Holz [Hol02] introduces a specialization relation on INOs that make it possible to refine information needs inherited from a process super type, so that it fits into the context of the subtype.

There are three aspects that need to be refined in an inherited INO: The representation, the precondition and the ISURs. For details on how the specialization relations on each aspect are achieved see [Hol02].

An example for specizialization of the representation is detailed in figure 2.12: The information need ``What tools are available for glass-box testing?'' that is attached to process type ``Glass-Box Testing'' is a specialization of information need ``What tools are available for testing?'' which is attached to the process type ``Testing''.

Conceptual problems

No support for generic information resources

The original concept was designed to only allow INOs to be created as top level objects in the IA. Use of the IA showed that an agent sometimes wants to add information items without formulating questions. E.g. if a Java programmer wants to bookmark URLs about programming language information, he would need to create an INO with a question like ``Where can I find language information on the programming language Java ?''.

Therefore, we decided to integrate the concept of information resources to enable a more general approach to handling objects in the user's IA. Information resources now subsume INO and ISUR (cf figure 2.3). This concept allows to use ISURs for modelling direct information offers that are not answers to information needs.

**Figure 2.13:** Information Resources

No personalization support

The original implementation of the IA only has a single list of information needs. Generally this list will be maintained only by knowledge engineers, since the modelling of such globally available information needs requires knowledge engineering skills. Through this we have a high up-front modelling effort that hinders other agents to do their original work (cf section 2.4.2.3).

Therefore we face two problems:

What would happen if the current user had an information need that is not reflected in the global list?
How can we collect and track which information needs actually will occur during enactment of a task?

We solved both problems by introducing a private list in the user's IA. The private list in the agent's IA now enables him to bookmark web pages or documents that he finds useful for completing his task without involving the knowledge department every time. Furthermore, we do not need to assume that there is an existing or functioning knowledge department. Instead we assume that as an effort to make development more efficient, our system is introduced and the appropriate departments or working groups are built up, but not yet working completely.

A user may still have questions about information items, i.e. information needs. To further ease the process of managing information needs we provide support to access discussion forums by means of an ``information request'', which is reflected in the user's IA through an automatically created INO. Examples are shown in figures 2.13. By this means the user is able to get recommendations about possible information sources through internal or external message boards, e.g. the jGuru message boards (http://www.jguru.com).

**Figure 2.14:** IA in a posting scenario: (a) IA allows it user to post a request (b) transparently to a forum (c).

High up-front modelling effort

The original approach requires users to make a high effort in modelling the information needs. The modelling effort expresses itself in the need of correct expressions in predicate logic in the preconditions of the information resources. This is especially a problem for users that are not formally trained in creating models.

As a consequence we wish for alternatives that would lower the effort for using and managing resources. One approach we choose is to also allow a privately managed information need list that does not need much modelling effort, if any. For details see section 2.4.2.2.

The following question arises here: How can a user benefit from the private INOs from other users without formalizing the preconditions of INOs? Formalizing the preconditions for the globally available INOs is very important to prevent an information overload.

Based on this private information need list, we created a third information need list, the peer list, which is built by some generic heuristics known from social filtering techniques^2.2.

Collaborative Filtering tries to tackle the problem of growing quantity of information by letting users vote on items they are interested in or find relevant. The main assumption for this information management technique is that people with similar votings will have similar interests, which enables the Collaborative Filtering system to recommend items to a user based on his previous recommendations.

The next chapter will describe Collaborative Filtering approaches in more details.

Overview of Collaborative Filtering

"The way to get things done is not to mind who gets the credit for doing them."

Benjamin Jowett

Motivation

We want to simplify creation and management of information needs by providing agents with access to relevant information resources other agents found useful.

We will use the information recorded and provided by the Orenge case base to identify similar tasks and show their associated information needs.

Collaborative Filtering

Collaborative Filtering (CF) [RIS⁺94] is an information retrieval technique. CF collects ratings of items by users and calculates recommendations for new items. Therefore systems that use CF are often called recommender systems.

The goal is to either guess the rating a user would give to an item or to give recommendations which items could be relevant for a user.

In the literature CF algorithms are distinguished by their approach:

Memory-based algorithms (cf [BHK98,RIS⁺94]) consider all information in the user data base which consists of a set of user votes on items.
Model-based algorithms (cf [BHK98]) use the user data base to estimate or learn a model to use for prediction or recommendation.

Furthermore, a distinction is made between explicit or implicit voting:

Explicit voting means that the user has to rate an item on a given scale, which is usually discrete. E.g. many online help systems let the user vote whether the selected help article was helpful or not.
Implicit voting means that the recommendation system interprets user actions as votes on certain items. E.g. the Entree restaurant recommender [Bur00a] votes on selected restaurants, depending on whether the user begins browsing the recommendations or browses for further recommendations.

There are a number of issues of recommender systems that must be considered. Since the data space is usually sparsely populated, recommender systems need a lot of votes on items to make useful recommendations. Especially recommendations of new items or for new users tend to be not very useful.

Another issue is that users tend to vote only on items they really like. So most recommender systems let the user make only positive votes.

Memory-based algorithms

Usually a predicted vote is calculated by computing the weighted sum of all related votes on the selected item. The weights needed for computing this sum can reflect distance, similarity or correlation.

Common Formulae

Memory-based algorithms generally calculate the predicted vote on an item as follows.

The mean vote for user x is defined as:

v_{x, j}

(3.1)

with I_x being the set of items that user x already has voted on and v_{x, j} being the votes of user x on item j.

The predicted vote of the active user x on item j is computed as follows:

p_{x, j} =

w(x, y)

v_{y, j} -

(3.2)

where n is the number of users in the CF database, that have non-zero weights, i.e. that have voted on j before. is a normalizing factor, so that the absolute values of the weights add up to unity. w(x, y) is a weight function.

We can now categorize model based algorithms by their weight function.

Correlation

Correlation first appeared in the context of the GroupLens project [RIS⁺94]. Usually the Pearson correlation coefficient will be used as a correlation function. The Pearson correlation is a linear correlation coefficient for two random variables X₁ and X₂ and is defined as:

(X₁, X₂) =

where

[X]is the expectation for the random variable X. The value of (X₁, X₂)is always in the range of [- 1, 1] (cf [Pra01]).

This resolves to the following formula, when used with common formulae:

w(x, y) =

(3.3)

where the summations over j are over the items that both users x and y have voted on.

Similarity

A common technique of information retrieval to measure the similarity of two documents is to interpret each document as a vector of word frequencies and calculate the similarity as cosine of both vectors [SM83]. We can adopt this technique for CF by letting users take the role of documents and letting votes take the role of word frequencies.

Another possible weight is distance as opposed to similarity.

Model-based algorithms

From a probabilistic point of view we can calculate the expected value of votes, i.e. the probability that the current user will vote with a particular value on an item. For this we need a model that has to be learned. In the following we will consider two alternative probabilistic models, i.e. cluster models and Bayesian network models.

Cluster models

The idea of this model is that the voting vectors of users with similar interests tend to cluster in the data space, forming classes of users. Therefore instead of considering each user's preferences, the model groups the preference of each class of users. For details see [BHK98,RBS00].

Bayesian network models

An alternative to cluster models are Bayesian network models that are based on the well-known theorem of Bayes. In this model a node represents an item and the states of the nodes correspond to the possible votes. For details see [BHK98,CH97,CHM97].

Problems with Collaborative Filtering

There are several issue to consider regarding CF.

The biggest issue is the ``missing data'' problem. The sparse distribution of user data makes it hard to generate good recommendations for new items or new users. A new user has not made enough votes so that other users can be considered similar. For new items there are too few votes on them to recommend them to any user. Burke [Bur00b,Bur00a,Bur99] calls this problem ``ramp-up problem''.

One way to address this problem is to introduce default voting. We assume some default vote value for items for which we don't have explicit votes. We can use a union of voted items ( I_x I_y) for matching instead of an intersection of votes ( I_x I_y) for the standard algorithm (cf section 3.2.1).

Furthermore one has to consider that CF is very insensitive to preference changes and is subject to statistical anomalies [BHK98]. This is true for memory-based as well as model-based algorithms.

To solve the problem of statistical anomalies we can use a technique called inverse user frequency which is derived from inverse document frequency, a technique common to information retrieval [SM83]. The idea is that very common words are not useful for information retrieval, so weights for these words should be reduced. In the context of CF this translates to the fact that items that are liked by a lot of people are not suited as indicators for correlation. Burke [Bur00b] refers to this problem as the ``banana problem''. Derived from the observation that in the USA bananas are bought by almost all shoppers, as a result bananas would be almost always get a good vote. As a consequence the weight for item ``banana'' should be reduced.

In their observations Breese et.al. [BHK98] find that in 23 out of 24 test cases, there is a statistical significant performance improvement.

We define f_j = log , where n_j is the number of users with votes on item j and n is the total number of users. As a consequence, if all users vote on item j, f_j becomes zero. In the case of correlation (equation 3.3) f_j is treated as a frequency, so that a higher f_j will become a higher weight. The weight function is then defined as:

w(x, y)	=		(3.4)
where
U	=	f_j f_jv_{x, j}² - f_jv_{x, j} ²
V	=	f_j f_jv_{y, j}² - f_jv_{y, j} ²

Performance

According to [BHK98] Bayesian network and correlation models perform best most of the time. The focus of this empirical analysis was the quality of the recommendations, but it also considered the size of the models and runtime performance. While memory based algorithms tend to use a smaller code base, they model the whole user data base and thus are typically slower than model based algorithms. On the contrary, model based algorithms use a learned model of the user data base, which is smaller than the whole user data base.

In terms of runtime performance, probabilistic models perform much better than memory based models, but one has to consider that this performance advantage is bought at the cost of inaccuracy. This is because the learned model does not reflect the whole knowledge about users' voting behaviours.

Another issue to consider is the impact of learning the probabilistic model, especially if the model changes frequently. To learn a model, the system maintainer has to do off-line learning runs using special test data.

Collaborative task-specific information resource Filtering

``Not all who wander are lost''

J.R.R. Tolkien

Introduction

In this chapter we will describe how we use Collaborative Filtering (CF) to support the agents in creating and maintaining information resources needed to fulfill their tasks. In the following we describe how the phases Capture, Distribution and Evolution of information resources are supported.

Motivation

The problem to be solved is to find a means of calculating the relevance of information resources in regard to the characterization of a task and sorting them by their relevance.

Here is a formal description for the problem:

Given:

I = (i₁, i₂,...), a set of information resources.
U = (u₁, u_2,...), a set of users.
T = (t₁, t₂,...), a set of tasks.
C = (c₁, c₂,...), a set of categories.
v = (i_k, u_l, t_m, v_{k, l, m}, c_n), a voting with v_{k, l,
m}being the value of user u_l's vote on item i_k in task t_m.
V = (v₁, v₂,...), a set of votings.
Current user u is working on task t.

Wanted:

A partial order , such that i₁ i₂ iff user u finds information resource i₂ more relevant than i₁ in the context of t.

Capture

Objective

Our objective is to capture information about information resources, so that we can make recommendations about them. We distinguish two kinds of information: (i) newly created information resources and (ii) votings on information resources. We will do implicit voting wherever possible.

We will do this by applying the following heuristics^4.1:

Any globally defined information resource that is accessed by the user's IA will be labelled as SHOWN by that user.
Any global information resource that is marked as being useless will be labelled as USELESS by that user.
Any new privately created information resource will automatically be labelled as SHOWN by that user.
Any information resource that was recommended and presented in the user's IA will be labelled as SHOWN by that user.
Any other known information resources gets the DEFAULT label, which means that there was actually no voting made.

The captured information will be recorded separately for each task. For details on the stored data see section 4.1.1.

Explanation of the heuristics

In this section, we will motivate the heuristics.

First, we think that if the user makes the effort to actually access a global information resource, then it is worth to be counted as being voted on (item 1). If the user however feels that the information resource was not useful for his task after all, he can mark it USELESS (item 2). Because of the effort of privately creating information resources we automatically label such information resources as SHOWN (item 3).

Any recommendation that is accessed will also be counted as voted on (item 4).

Because recommendations are usually distributed very sparsely (cf section 3.2.3), we expect a lot of undefined values, so we introduce a default voting (item 5). The meaning of this default voting is that the user has not voted on the item.

The motivation for recording captured information by task is two-fold: First, we want to do task-specific recommendations, so that it does not seem to be appropriate to compare all captured information globally. Secondly, we argue that although an information resource may be useful for one task, it may be less useful for another task. E.g. consider as an information resource the installation manual of a development environment, which is attached to all tasks that use this development environment as a tool. If the task at hand is concerned with the setup of the development environment, the information resource is probably useful. If the task is only concerned with programming using this tool, that information may not be very useful. Therefore the possibility of recording distinct votings on information resources in different tasks is required.

For the votes we set the values as follows:

SHOWN: This value will be set to 10. Despite the fact that we currently have only one positive vote, we will be able to introduce some positive votes that are in the range of 1...10 in the future. Expanding this range would invalidate the already captured information.
USELESS: This value will be set to -10. Again, we now have only one negative vote. Here it is also important to have some flexibility.
DEFAULT: This value will be set to 0, since this value is neutral concerning the summation in the presented CF formulae (cf section 3.2.1).

A consequence of these values is that a division by zero is possible, so that this fact has to be considered.

The mean vote (formula 3.1) can contain a division by zero, if user x has not voted. In such a case it is safe to set the mean vote to DEFAULT, since the default vote has the meaning of not having voted.

The correlation weight (formula 3.3) can become undefined in many situations. The weight has then to be set to 0, since this value means there is no linear correlation between both users at all (cf [Lan03]).

Finally, the predicted vote (formula 3.2) can have a division by zero through the factor kappa , if no votes are made at all. kappa can be set to 0, since the whole term will become 0 anyway.

Distribution

To be useful, the captured information must be used to present relevant information resources in regard to the current task to the user. This section describes how this distribution is done.

The recommender should do its suggestions proactively, i.e. without requiring the user to initiate the recommendation. So the recommendations will be displayed automatically in the middle section of the user's IA.

Furthermore the user should be able to focus on a recommendation by selecting it. Then the user's IA should display information resources which were accessed by users that also accessed this particular information resource. Figure 4.1 depicts how this will look. In this example, one can see privately created and peer recommended information resources. Since the user created the information needs ``EJB specification'' and ``EJB Tutorial'', the IA recommends the information need ``EJB demo applet?''. By selecting this information need, the bookmarks ``VisualAge EJB FAQ'' and ``JBoss Webserver Tutorial'' are selected.

As a Collaborative Filtering algorithm we use the standard memory-based approach with the Pearson correlation as a weight function (cf formula 3.3).

**Figure 4.1:**Information Assistant with focused recommendation

Task-specific recommendations

Our aim is to make task-specific recommendations, i.e. to make recommendations based on information about the user's current task that is available in the form of the task's characterization (cf section 2.3.1).

In order to make these task-specific recommendations, we assume to have the following information available:

a CF data base with the characterizations of processes and information resources.
a similarity measure on task's characterizations.

First, we have captured recommendation data and attached it to the task's characterization (cf section 4.2).

Second, we can define similarity measures with Orenge tools which define how items of a particular characterization type can be compared. With these measures we can find out whether some information resources have some characterization attributes similar to those of the current task.

We extend the Pearson correlation (formula 3.3) by including a similarity coefficient as follows:

w(x, y) =

(4.1)

where sim(t_x, y, j) is the similarity of task t_x and the most similar task of user y, in that user y has voted on information resource j. With this terms, we emphasize information resources belonging to similar tasks. By this means users will get better recommendations in regard to task relevance.

All this information we gathered will be used in the Collaborative Filtering algorithm in which information resources take the role of items.

The implementation is based on the algorithm 1.

Pseudo code of recommender system ^4.2 ^4.3

Furthermore, the agent can make votings on the recommended information resources themselves to further emphasize the relevance of a particular information resource.

Example scenario

The following example is situated in the GUI programming group of a software company. The purpose of this group is to build the GUIs for the company's software.

**Figure 4.2:** Mock-up GUI of the address book

Agent Bob has the task to implement a dialog with the Swing toolkit, with which he is not familiar. This dialog should show the entries of an address book, stored in an Oracle data base. To achieve this goal, he has to use the JTable class, which implements a generic table.

Agent Anna has some experience with Swing and with JTable in particular. She has implemented an application with a table that allowed to view database entries from a DB2 data base.

A mock-up GUI of the address book dialog to be implemented by Bob can be seen in figure 4.2.

Tasks

The following section describes the tasks of agents Anna and Bob. These task descriptions formalize the characteristics of the tasks. In the following, we adopt the task representation introduced by [Hol02] and set the descriptions itself in typewriter font, e.g. t.name, where ``name'' denotes an attribute of object ``t''. Around set values we put braces, e.g. {A, B}. We call Anna's task t and Bob's task s. Both tasks have characterizations of type ``Implementation Process''.

Agent Anna
- t.name = ``Implement DB2 address book''
- t.programLanguage = Java1.2
- t.keyTopics = {Swing, DB2, JTable}
- t.tools = {VAJ3.5}
Agent Bob
- s.name = ``Implement Oracle address book''
- s.programLanguage = Java1.2
- s.keyTopics = {Swing, Oracle, JTable}
- s.tools = {VAJ3.5}

As we can see the tasks of Anna and Bob have one difference, i.e. whereas Anna had to program a data base viewer for DB2, Bob has to implement a viewer for Oracle.

Used information resources

The following information resources were used in this example. Table 4.1 depicts the bookmarks in the first column, the owner^4.4 of the information resources in the second column, and the category of the information resource in the third column. The fourth column depicts the object, which the information resource is attached to.

**Table 4.1:** Information resources which are used in the scenario
Information resources	Owner	Category	Attached to object
JDK 1.2 API Documentation http://java.sun.com/products/jdk1.2/docs/api	global	Java	Implementation process
Sun Java Swing tutorial http://java.sun.com/docs/books/tutorial/uiswing	global	Java	Implementation process
jGuru Swing FAQ http://www.jguru.com/faq/Swing	global	Java	Implementation process
VisualAge Database Access tutorial for DB2	global	Databases	Implementation process
Sun Swing examples http://developer.java.sun.com/developer/codesamples/swing.html	Anna	Java	Implement DB2 address book
jGuru Swing forum http://www.jguru.com/forums/Swing	Anna	Java	Implement DB2 address book
Google search on "JTable" http://www.google.de/search?q=JTable	Anna	Java	Implement DB2 address book
DB2 manual	Anna	Databases	Implement DB2 address book
Google search on "JTable" and "sample" http://www.google.de/search?q=JTable	Bob	Java	Implement Oracle address book
Oracle manual	Bob	Databases	Implement Oracle address book

We assume the following characterization for the following items:

Item1: VisualAge Database Access tutorial for DB2
- s.programLanguage = {Java1.2}
- s.keyTopics = {Swing, DB2, JTable}
- s.tools = {VAJ3.5, DB2}
Item2: DB2 manual
- s.programLanguage = {Any}
- s.keyTopics = {DB2}
- s.tools = {DB2}
Item3: Oracle manual
- s.programLanguage = {Any}
- s.keyTopics = {Oracle}
- s.tools = {Oracle}

The information need life cycle

In this section we want to make clear how we picture the use of the recommendation system. The aim of this section is to make clear how the recommendation system should ideally work.

Prerequisites

Agent Anna had to program a database viewer, therefore she had to use the Swing Tables API. The task in the project plan is called ``Implement DB2 address dialog'' and has a characterization of type ``Implementation Process'' (cf figure 2.10). From the To-Do dialog she opens the Information Assistant for this task.

The IA would be resemble the one displayed in figure 4.3. On the right, there is the section of globally available information needs. On the left, there is the section of private information needs, i.e. information needs created by Anna and manageable only by Anna.

The normal situation would be that at least some common characterizations and their information resources were created by the knowledge department. There are already several categorized global information needs available such as ``Where can I find documents on Java?''. Also shown is the ISUR ``Java Homepage'', which is a book mark to http://java.sun.com. In the middle section the peer information will be displayed. This information is recommended by the system we will describe here.

**Figure 4.3:**IA after use by Anna

Furthermore, we can see in figure 4.3 that Anna has created some additional information resources for her own use as detailed in section 4.4.2. From the global list of information resources Anna executed the jGuru Swing FAQ and VAJ DB access tutorial information resources.

Recommended information resources

Now Bob is assigned his task. When he opens his IA, he should have some recommendations, as depicted in figure 4.4.

**Figure 4.4:** Bob's IA with peer recommendations.

Bob should have recommendations for the following information resources:

The jGuru FAQ on Swing, because Anna has accessed this global information resource.
The VisualAge Database Access tutorial for DB2 (item1), since Anna has accessed this global information resource and the characterization for Bob's task s and this information resource are similar.
The privately created information resources of Anna except for the DB2 manual (item2).

All these information resources have characterization that the users can attach. Item2 is not shown on the peer list since the characterization does not fit concerning the attribute keyTopic. The attribute keyTopic of item2 contains the value ``DB2'', whereas the attribute keyTopic of task s does not contain ``DB2''.

Numerical example

This section's aim is to show how an actual calculation of a recommendation value is done. For simplicity's sake, we assume here that the similarity of tasks s and t are 1.0. This is reasonable, because (i) we can set similarity of tasks to 1.0 if no similarity measure is present, and (ii) in this example, the information resources are all in the same categories and the characterization type of both is the same.

Table 4.2 depicts the matrix needed to calculate the recommendation value.

**Table 4.2:**Calculation table
Information Resources \ Agents	Anna	Bob
JDK 1.2 API documentation	SHOWN	SHOWN
Sun Java Swing tutorial	SHOWN	SHOWN
jGuru Swing FAQ	SHOWN	SHOWN
VAJ DB Access tutorial for DB2	SHOWN	DEFAULT
Sun Swing examples	SHOWN	DEFAULT
jGuru Swing forum	SHOWN	DEFAULT
Google search on „JTable„	SHOWN	DEFAULT
DB2 manual	SHOWN	DEFAULT
Google search on „JTable sample„	DEFAULT	SHOWN
Oracle manual	DEFAULT	SHOWN

Now the mean vote of Anna is v_Anna = 8 and that of Bob is v_Bob = 5. The correlation between Anna and Bob computes to w(Anna, Bob) = - 100. The predicted vote for the information resource ``VAJ DB Access tutorial for DB2'' then computes to p_{Bob, VAJ DB} = 4. This decomposes the following way: The summation term of formula 3.2 computes to -200 and because the absolute sum of all votes is 200, kappa = 1/200.

Evolution

Initialization

Since we had to deal with the aforementioned ramp-up problem (cf section 3.2.3) of Collaborative Filtering and the possibly missing knowledge department, we thought of the following solution.

Because of the missing knowledge department, we assume that there are no or just a few globally defined information resources. We further assume that there are not enough votes to make good recommendations and that users do not correlate well.

For bootstraping purposes we will allow users to access all known private information resources of all others users. This will enable the user to make votings by looking at the information resources of others.

To make access easier the list of information resources will be sorted by the following three criteria in descending relevance:

by the characterization object the information resource is attached to, e.g. the characterization type ``Implementation process'',
by the categories the information source is filed into, and
by the task similarity to the current task.

This sorting will provide the agent with a means of accessing relevant information resources in regard to his current task first, but still have access to all known privately created information resources.

We think that this approach is feasible since we cannot assume the presence of a knowledge department that could initialize the voting data. Furthermore we have no training data, so that we cannot train a model-based CF system.

We expect that through this approach the ramp-up problem is not solved, but at least lessened. This approach will especially help the agents with their information searches if new kinds of projects are started. Consider a company that did RMI^4.5 programming project in the past, and now has an assignment for a CORBA^4.6 programming project. The agent will certainly not find any direct information on CORBA in the information data of the company, but they can at least get some ideas about distributed computing problems and where to find further information.

Continuous Feedback

The aspect of evolution in our concept is expressed by three points:

by voting as discussed above,
by allowing to attach privately created information resources to objects of the characterization type hierarchy (cf section 2.3.1),
by making modeling request to the knowledge department, if one exists.

First, voting makes it possible to get feedback about the agents' behaviour in regard to information retrieval, which is further emphasized by voting on the recommended information.

Second, we allow the agents to access the characterization type hierarchy and all instances of the respective types and attach privately created information resources to them. Furthermore the agent can attach characterizations to the information resource, e.g. by specifying the key topics of the information resource. This will allow finer control over the task-specific relevance. Consider a newly employed programmer, for example, who added as a information resource a document about the coding guidelines of his company. This is surely important for the task at hand, but is also important to other implementation processes, so it would be helpful to attach this information resource to the characterization type ``Implementation process'' (cf figure 2.10).

Third, a modeling request will be posted in a special discussion forum, similar to the information request (cf section 2.4.2.2). This will allow communication of the agents with the knowledge department. The topic of these discussions is the modelling of privately created information resources and integration into the global list of information resources. By this means the knowledge department is supported in creating appropriate information resources by the agents, who are using these resources in everyday work.

We expect that the system will gradually make better recommendations for established users and tasks. For new kind of tasks, e.g. Java programs instead of C programs, the recommendations will neither be plentiful nor particularily well suited. But again, with continuous use of our concept, the results will get better.

We further expect that, since we include the categorization of the information resources in the captured recommendation data, an ``oral tradition'' of categorizing information resources will evolve.

Conclusion

"Only those who dare to fail greatly can ever achieve greatly."

Robert Francis Kennedy

In this chapter, we summarize our approach, relate it with other works and outline future work.

Discussion

In the introduction (cf chapter 1) we described three issues we wanted to remedy:

High up-front modelling effort of globally created information needs,
structural and conceptual issues of the original information need implementation,
information resources of agents are not captured and distributed to other agents.

The first issue has been addressed in this thesis by introducing privately managed lists of information resources (cf section 2.4.2.2), which proved to be a good tool for dealing with information resources. Our observations showed that the ability to privately create information resources eases the task of software development by circumventing the high modelling effort.

The second issue has been addressed by refactoring the former concept of information needs by Könnecker to remedy the structural and conceptual problems, as laid out in chapter 2. This allowed us to solve the third issue.

The third issue has been addressed by creating a peer list of information resources by using task-specific information and Collaborative Filtering (cf chapter 4). We selected a memory-based correlation approach for the CF algorithm, which we extend by a task-specific term (cf formula 4.1). We select the set of information resources, for which a recommendation should be computed, by using task-specific information from the MILOS system (cf algorithm 1).

Related work

GroupLens

Resnick et al [RIS⁺94] extend Usenet news by means of special news servers, called Better Bit Bureaus (BBB), that compute scores predicting how much the user will like articles, and collect the ratings after the user read an article.

In the GroupLens system, a user rates an article by assigning to it a number from 1 to 5, with 5 being the highest rating. Optionally, the system can also keep track of the time the user needed to read the article. In order to help to identify ratings from the same person, but still preserving anonymity, the user chooses a pseudonym, which doesn't have to be same name as the one used to post articles. To make a prediction, a BBB uses a memory-based algorithm with a Pearson correlation as described in section 3.2.

The main assumption is that people who made similar votes will vote similarly in the future. Another important assumption is that people may agree in some topics, but not others, e.g. they may agree on the evaluation of technical articles, but not on jokes. Therefore ratings are separated for each newsgroup.

Although being one of the first groups to use Collaborative Filtering, Resnick et al already pointed out that during an initial phase recommender systems have problems making a recommendation. They also indicate that further research on the issue of performance had to be made.

Our recommendation system follows GroupLens with the assumption that it is necessary to separate the voting information, but we also use the information about the user's current task. Furthermore, we are able to use arbitrary information sources, not only newsgroups.

CALVIN

The CALVIN (Context Accumulation for Learning Varied Informative Notes) project investigates lessons learned systems that support the process of finding information relevant to a particular research task [LBMW]. The CALVIN system has three primary tasks:

learn about information sources by automatically recording cases that represent the consulted information sources.
learn about the domain by providing tools for incremental aggregation, refinement and capture of records of lessons learned from information sources.
proactively present useful information resources to the user as he browses for information.

CALVIN describes a user's research context by three classes of indices: (i) the user-specified task, (ii) a set of topics on which the user searches information, and (iii) a description of the content of information resources consulted by the user (i.e. a set of automatically generated keywords). As the user browses, the system maintains a global context of the user-viewed resources (i.e. user selected task and topics) and compares it with former contexts. If the similarity between two contexts exceeds a certain threshold, both are considered similar and the resources of the former context are presented to the user.

Three lessons can be learned by using CALVIN. First, it stores cases containing information about resources that previous users found useful in a similar context. Second, it captures lessons about annotations that the user attaches to information resources. Third, CALVIN supports the use of concept mapping tools for recording, elucidation and sharing of lessons on important concepts and relationships in the domain.

The approach taken by the CALVIN system does not take into account that users may have different skills and preferences, which will cause CALVIN to recommend irrelevant information resources to the user. Our approach addresses this issue by combining Collaborative Filtering with similarity-based retrieval.

CASPER

CASPER is a research project that investigates how artificial intelligence can be used to improve information access [RBS00]. In addition to the information items accessed, CASPAR uses three types of implicit relevancy information in its user profiles for its recommendation algorithm:

Revisit data: The amount of times that a user accesses an information item seems to correlate with a user's interest well.
Read time data: The time a user spends reading an information item seems an even better measure for user interest.
Activity data: It seems that the best way to judge a user's interest is to assess information about actions performed by the user. E.g. if the user emails himself about the information item this can be seen as indication that a real interest in this item exists.

CASPER uses a cluster-based approach to Collaborative Filtering that can identify both direct and indirect user relationships. User profiles become clustered into virtual communities. Users in such a community correlate quite well with each other, but not with users in other communities. A community is built by identifying the maximal set of users such that every user has a similarity value greater than a threshold with at least another community member.

Rather than applying an memory-based algorithm to the k-nearest neighbours, CASPER recommends the most frequently occured information items in the target user's virtual community. This leads to a richer recommendation base by identifying a larger set of users similar to the target user. A disadvantage of this approach is that CASPER is not able to evaluate direct links between all pairs of users.

Our approach neither uses any of these data types nor applies a cluster-based algorithm. Future work has to assess the value and applicability of this approach to MILOS.

PHOAKS

Most Collaborative Filtering approaches assume that all users are equal concerning their role. PHOAKS (People Helping One Another Know Stuff) [THA+97] assumes that users tend to play distinct producer or consumer roles, which are specialized and different. The consequence of this is that a few participants in the discussions make the effort to recommend a resource to others while the rest consumes this information. As a consequence, the frequency of recommendation of a resource is considered as a means of quality measure.

PHOAKS also reuses recommendations from online conversations. This is done by scanning the messages for URLs according to four rules:

If the message is cross-posted to too many newsgroups the URL does not count as a recommendation, because the message does not seem to be very specific to the newsgroup.
If the URL is part of the poster's signature it also does not count as a recommendation.
If the URL appears in a quoted section of the message it also does not count as a recommendation.
If the textual context of the URL suggests that this was a recommendation and not an advertisement it will be counted as a recommendation.

The extraction of recommendations allows for implicit voting, i.e. without requiring any user of PHOAKS to vote for a recommendation.

Our approach also assumes that users take distinct roles, but it does not make assumptions about the consumer or producer roles. Our concept is that each user automatically takes the producer role, while automatically taking the consumer role getting recommendations. Consequently, the distinction between consumer and producer does not seem to fit well.

Entree

Entree [RKB97, BHY96] is a restaurant recommender system that uses a hybrid CBR/CF approach.

Burke argues in [Bur00a, Bur99, Bur00b] that by combining Case-based Reasoning and Collaborative Filtering it is possible to overcome the weaknesses of both techniques. Table 5.1 depicts the trade-offs of both techniques and of an ideal hybrid.

**Table 5.1:** Trade-offs between CBR and CF techniques (from [Bur00a])
Technique	Pluses	Minuses
Knowledge-Based	A. No ramp-up required	H. Knowledge engineering
	B. Detailed qualitative preference feedback	I. Suggestions ability is static
	C. Sensitive to preference changes
Collaborative Filtering	D. Can identify niches precisely	J. Quality depended on large historical data set
	E. Domain knowledge not needed	K. Subject to statistical anomalies on data
	F. Quality improves over time	L. Insensitive to preference changes
	G. Personalized recommendations
Ideal hybrid	A, B, C, D, F, G	H

The retrieval strategy uses a cascading approach. First a CBR retrieval is executed and the result is sorted into buckets by similarity. Since the system only presents a fixed number of recommendations, e.g. ten, it can happen that a bucket has more items than is needed for the recommendation, e.g. the topmost bucket contains 20 items. Burke considers this result ``under-discriminated''. Eliminating under-discriminated results would require further knowledge engineering and perhaps a more detailed representation of the items.

Burke now argues that Collaborative Filtering can help to find the best recommendations of the under-discriminated result. This is done by ranking each item of the under-discriminated result. Here the ramp-up problem of Collaborative Filtering is not very risky, because if the worst thing, random ratings, happens, the system can still select equally good items.

Our approach shares with ENTREE the concept of refining the results from similarity-based retrievals with Collaborative Filtering. Both differ in the conceptual distinction between the set of objects on which similarity is computed (tasks) and on which Collaborative Filtering is applied (information resources associated with tasks).

Google

Google is not per se a Collaborative Filtering system, but uses some techniques common to such systems. Google was created as a research web search prototype at the Stanford University by Sergey Brin and Lawrence Page [BP98].

Common web search engines at that time mostly rated web pages on the frequency of word occurrence. Because of that technique web designers soon realized that by putting certain words into non-visible parts of the document, they could push their document to the top of the search result lists.

Brin and Page added implicit voting to their rating algorithm called PageRank. The more links (citations) to a web page can be found in the known web space, the higher the page gets rated. Also the higher the page rankings of the citing pages are, the higher the page ranking of the cited page gets. In terms of Collaborative Filtering this means that the citing page votes for the linked page.

A problem not mentioned in [BP98] is that web sites with very specialized topics, which are cited only by a few other web sites, do not get a high page ranking.

Outlook

An important future task is to built up a database of training cases, so that the implementation and its future iterations can be tested in real-world scenarios and to make the switch to model-based Collaborative Filtering algorithms possible. It is important to further investigate the benefits of our Collaborative Filtering approach, how to further improve it and whether other CF approaches could be beneficial.

Another possibility to make recommendations would be to sort the voting information by categories instead by tasks, if the user requests that. Although this concept has to be further investigated, we feel that it could be useful to have recommendation on certain categories. This would give the recommendation another focus possibly uncovering more relevant task-specific information.

Another issue is the investigation of a suitable similarity measure for tasks. We expect that there is no single way to define such a similarity measure. A possible solution would be to use a particular similarity measure by default. Other similarity measures can be presented to user, who can select the measures he thinks of as appropriate in the context of his current task.

Implementation details

Data structures

Information resource class depencies after refactoring

We described in chapter 2, why and how we refactored the class structure in MILOS. The UML class diagramm in figure A.1 depicts a snapshot of the class dependencies concerning information resources. We showed only the aspects, that are important for this thesis.

**Figure A.1:** UML class diagram of information resources

Information Source Usage Recommendations

An ISUR contains the following aspects:

Information source: This references the information source the ISUR encapsulates.
Process type: The type of process representing the class of activities the information source might be useful for.
Activity constraints: Constraints concerning the actual activity.
Role constraints: Constraints concerning the role of the actual actor.
Skill constraints: Constraints concerning the skill of the actual actor.
Topic constraints: Constraints concerning the topic of the actual process.
Usage direction: A natural language description of where to find the information

An example of an ISUR object is depicted in table A.1.

**Table A.1:**Example of an ISUR object
Attribute	Example value
Information source	JDK 1.2 language specification
Process type	Implementation process
Activity constraints	Programming language is Java 2
Role constraints	Useful for role "programmer"
Skill contraints	Programmer is not a Java expert
Topic constraints	Useful for using exceptions
Usage direction	See chapter 3

Collaborative Filtering

The class CapturingContainer is used to hold the information we need for using Collaborative Filtering and contains the following aspects:

The information resource that was actually accessed.
The information resource that actually that should be recommended.
The task in which the information resource was accessed.
The category in which the information resource was accessed.
The actual vote value.
The category the information resources were in when accessed.

All aspects except the both first directly to the aspects of the formal description of the voting (cf section 4.1.1). The both information resources correspond to the information resource aspect.

This distinction was made because it is possible that the accessed information resource is part of an information need, which is the item we actually want to recommend. If the information resource is not part of an information resource, i.e. if the information resource is a bookmark or ISUR, which is created at top level, then both aspects are set to this information resources.

The distinction is important, since an information need can contain more than one information source (an ISUR), that can be accessed. Each access of such information source is a different case, although the item which has to be recommended is actually the information need.

Query expression parser

The query expression parser was integrated into MILOS by Mayer, who created a Java expression parser with SableCC, a pure Java Compiler compiler (http://www.sablecc.org). It is used to resolve queries, representation and preconditions in the information resource concept.

All expressions will be evaluated in an evaluation context, that defines, how variables are resolved. The formal language that is accepted by the query expression parser, is built up from three parts:

Namespaces: Namespaces enable the parser to distinguish between grammars and data structures of different tools. Presently, the namespace ``Orenge'' is defined. E.g. an expression in this namespace looks like ``<Orenge: expression>''.
Java expressions: The parser accepts standard Java expressions like ``1 == 0 || 2 * 2 == 4''.
Variable substitution: To be able to use variables, it is allowed to include variables in a special syntax. A variable is always surrounded in parenthesis, e.g. ``<var>''. This variable will always resolved before the whole expression is evaluated. E.g. consider that var has a value of ``Information Need''. Then the expression ``<var> == ``Information Resource'''' will become ``''Information Need'' == ``Information Resource'''', which will evaluate to false. Every variable is an object, whose attributes can be referenced with ``.''. E.g. if var has an attribute ``name'', then <variable.name> resolves to the name of this object.
The parser supports five pseudo-variables as follows:
1. agent: This variable resolves to the current agent as defined in the evaluation context.
2. task: This variable resolves to the current task as defined in the evaluation context.
3. this: This variable resolves to the current evaluation object as defined in the evaluation context.
4. infosource: This variable resolves to the current information source as defined in the evaluation context.
5. each: This variable resolves to the current element of a collection. This is only valid in the context of the control structures (see below).
Control structures: To be able to work with collection, the parser also allows some control structures, that are surrounded by parenthesis like variables. The pseudo-variable each can be used in each term expr as follows.
1. collection.detect[expr]: This expression returns the first collection object, that fufills the condition expr, otherwise an exception is thrown.
2. collection.andconcat[expr]: This expression returns a string built up all collection elements, surrounded by a prefix and a suffix and separated by a infix. Prefix, infix and suffix are defined by the andFix aspect (cf figure A.1) of the information resources.
3. collection.orconcat[expr]: This expression returns a string built up of all collection elements, surrounded by a prefix and a suffix and separated by a infix. Prefix, infix and suffix are defined by the orFix (cf figure A.1) aspect of the iformation resources.
4. collection.includes[expr]: This expression returns true, if the collection contains an object that fulfills the condition expr, otherwise it returns false.

As a more elaborated example consider that one wants to know if the current agent has at least one privately created information resource for the current task, he would use the following expression:

<Orenge:task.inforesources.detect[each.owningAgent == agent]>

The expression task.inforesources resolves to the list of inforesources attached to the current task, to which we can apply detect. The condition each.owningAgent == agent tests if the owning agent of the current list element is equal to the current agent.

Glossary

Agent: An agent is a team-member of a software development group.
Case-Based Reasoning (CBR): Case-Based Reasoning solves problems by trying to identify similar problems which were already solved (cases) and then adapt it to the given problem.
Collaborative Filtering (CF): Collaborative Filtering is a information retrieval technique. It gives a recommendation about an item based on previous votings, that might be of interest for the user. The main assumption is that users who voted similarly in the past, will do so in the future.
Information Need: An information need expresses the necessity of an agent to find appropriate information for his current task.
Information Need Object (INO): This is the actual representation object of an information need in MILOS.
Information Source: An information source represents a source for information which helps satisfying an information need.
Information Source Object (ISO): This is the actual representation object of an information source in MILOS.
Information Source Usage Recommendation (ISUR): An ISUR contains an ISO and directions for usage of that ISO.
Knowledge Department: The knowledge department consists of knowledge engineers, who deal with all knowledge tasks, e.g. with modelling preconditions.
MILOS (Minimally Invasive Long-term Organisational Support): MILOS is a PSEE, which is developed as research project at the universities of Kaiserslautern, Germany and Calgary, Canada.
Process: A process is a set of activities (tasks) with the aim of reaching a goal. A process is represented by a process model.
Process-centered Software Engineering Environment (PSEE): A PSEE supports the planning and managing of software projects in an automatic and systematic way through process models.
Process Model: A process model describes in a generic reusable way how a process should be executed.
Project Plan: A project plan is an instantiation of a process model, which satisfies the special needs of the modeled project.
Task: A task is an activity, which is executed in the context of a project plan.

Bibliography

AP94: A. Aamodt and E. Plaza, Case-based reasoning: Foundational issues, methodological variations, and system approaches, AI Communications 7 (1994), no. 1.
BHK98: John S. Breese, David Heckerman, and Carl Kadie, Empirical analysis of predictive algorithms for collaborative filtering, Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, July 1998, pp. 43-52.
BHY96: Robin Burke, Kristian Hammond, and Benjamin Young, Knowledge-based navigation of complex information spaces, AAAI/IAAI, Vol. 1, 1996, pp. 462-468.
BP98: Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems 30 (1998), no. 1-7, 107-117.
Bur99: Robin Burke, Integrating knowledge-based and collaborative-filtering recommender systems, 1999.
Bur00a: Robin Burke, Knowledge-based recommender systems, 2000.
Bur00b: Robin Burke, A case-based reasoning approach to collaborative filtering, EWCBR, 2000, pp. 370-379.
CH97: David Maxwell Chickering and David Heckerman, Efficient approximations for the marginal likelihood of bayesian networks with hidden variables, Tech. Report MSR-TR-96-08, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, April 1997.
CHM97: David Maxwell Chickering, David Heckerman, and Christopher Meek, A bayesian approach to learning bayesian networks with local structure, Tech. Report MSR-TR-97-07, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, August 1997.
GJ96: P.K. Garg and M. Jazayeri, Process-centered software engineering environments, IEEE Computer Science Press, 1996.
Hol02: Harald Holz, Process-based knowledge management support for software engineering, Ph.D. thesis, University of Kaiserslautern., 2002.
Kön00: Arne Könnecker, Extending a process-centered see by context-specific knowledge delivery, Diploma thesis, University of Kaiserslautern, 2000.
Lan03: David M. Lane, Pearson's correlation, http://davidmlane.com/hyperstat/A34739.html, 2003.
LBMW: David Leake, Travis Bauer, Ana Maguitman, and David Wilson, Capture, storage and reuse of lessons about information resources: Supporting task-based information search.
May01: Guido Mayer, Erweiterung einer prozeßorientierten software-entwicklungsumgebung um ähnlichkeitsbasiertes knowledge-management, Diploma thesis, University of Kaiserslautern, 2001.
Pra01: Andrea Prampolini, Modelling default correlation with multivariate intensity processes, December 2001.
RA99: Michael M. Richter and Klaus-Dieter Althoff, Similarity and utility in non-numerical domains, pp. 403-413, Physika-Verlag, 1999.
RBS00: Rachael Rafter, Keith Bradley, and Barry Smyth, Automated collaborative filtering applications for online recruitment services, Lecture Notes in Computer Science 1892 (2000), 363-??
Ric: Michael M. Richter, The knowledge contained in similarity measures, Invited talk ICCBR 1995.
RIS+94: P. Resnick, N. Iacovou, M. Suchak, P. Bergstorm, and J. Riedl, GroupLens: An Open Architecture for Collaborative Filtering of Netnews, Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work (Chapel Hill, North Carolina), ACM, 1994, pp. 175-186.
RKB97: Robin Burke, Kristian Hammond, and Benjamin Young, The findme approach to assisted browsing, IEEE Expert 12 (1997), no. 4, 32-40.
SM83: G. Salton and M. McGill, Introduction to modern information retrieval, McGray-Hill, 1983.
THA+97: Loren Terveen, Will Hill, Brian Amento, David McDonald, and Josh Creter, PHOAKS: A system for sharing recommendations, Communications of the ACM 40 (1997), no. 3, 59-62.

List of Figures

Process model editor in MILOS (from [Kön00])
Project plan editor in MILOS (from [Kön00])
To-Do-GUI of MILOS (from [Kön00])
UML class diagramm of Könnecker's INO implementation
The Information Assistant (from [Kön00])
The Information Need Manager (from [Kön00])
The Information Source Manager (from [Kön00])
An example IA workflow (from [Kön00]): (a) To-Do-List, (b) Information Assistant, (c) query result
CBR cycle according to Aamodt and Plaza [AP94]
Characterization type hierarchy
Characterization editor
Specialization example according to [Hol02]
Information Resources
IA in a posting scenario: (a) IA allows it user to post a request (b) transparently to a forum (c).
Information Assistant with focused recommendation
Mock-up GUI of the address book
IA after use by Anna
Bob's IA with peer recommendations.
UML class diagram of information resources

List of Tables

Information need object representation (from [Kön00])
Information source object representation (from [Kön00])
Information resources which are used in the scenario
Calculation matrix
Trade-offs between CBR and CF techniques (from [Bur00a])
Example of an ISUR object

List of Algorithms

Pseudo code of recommender system

Footnotes

... members ^1.1: Such team members will be called agents in the following.
...engineers ^1.2: This group will be called knowledge department in the following.
... Information Source Usage Recommendation ^2.1: We will also call an Information Source Usage Recommendation shortly ``Usage Recommendation''.
... techniques ^2.2: Social filtering is often called Collaborative Filtering. Since this term is becoming more known and is more precise concerning our aim, we prefer to use this term in the following.
... heuristics ^4.1: In the following terms in upper case denote constants that are used without considering their actual value.
...Items_c( t_k) ^4.2: Items_{c(t_k)} is defined as set of item, that are associated to task t_k and in category c.
...p_{x, j} ^4.3: p_x,j is defined as the predicted vote for user x on item j (cf. formula 3.2). The set of items for calculation of correlation is restricted to I_c.
... owner ^4.4: The ``Global'' owner denotes globally created information resources.
... RMI ^4.5: Remote Method Invocation
...CORBA ^4.6: Common Object Request Broker Architecture

About this document ...

Capture, Distribution and Evolution of Information Needs in a Process-Oriented Knowledge Management Environment

This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.48)