4 The analysis of the descriptions

Morpholexical, syntactic and semantic analysis of software descriptions is performed to map a description to a frame-like internal representation.

The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms, their grammatical categories and their semantic relationships with other words in a lexicon. Morpholexical analysis also performs the processing of collocations and idioms.

Two semantic relations between terms are currently considered: synonymy and hyponymy/hypernymy. The predicate synonym(x,y) means that the term `y' is a synonym of the term `x' in a particular lexical category. The predicate hyponym(x,y,d) means that the term `y' is a hyponym (a specialization) of the term `x' at a distance d in a thesaurus in a particular lexical category. The predicate hypernym(x,y,d) means that the term `y' is a hypernym (a generalization) of the term `x' at a distance d in a thesaurus in a particular lexical category.
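The three predicates can be sketched over a toy thesaurus. This is a minimal illustration, not the system's lexicon: the terms, the PARENT and SYNSETS tables and the helper distance_up are hypothetical, a single lexical category (nouns) is assumed, and the distance d is assumed to count generalization steps between the two terms.

```python
from typing import Optional

# Hypothetical toy thesaurus for one lexical category (nouns).
# PARENT maps each term to its direct generalization.
PARENT = {
    "directory": "container",
    "file": "container",
    "text_file": "file",
}

# Hypothetical synonym sets for the same lexical category.
SYNSETS = [{"file", "dataset"}, {"directory", "folder"}]


def synonym(x: str, y: str) -> bool:
    """True when y is a synonym of x (both belong to one synonym set)."""
    return x != y and any(x in s and y in s for s in SYNSETS)


def distance_up(term: str, ancestor: str) -> Optional[int]:
    """Number of generalization steps from term up to ancestor, or None."""
    d = 0
    while term in PARENT:
        term = PARENT[term]
        d += 1
        if term == ancestor:
            return d
    return None


def hyponym(x: str, y: str, d: int) -> bool:
    """True when y is a specialization of x at distance d."""
    return distance_up(y, x) == d


def hypernym(x: str, y: str, d: int) -> bool:
    """True when y is a generalization of x at distance d."""
    return distance_up(x, y) == d
```

Under these assumptions, hyponym("container", "text_file", 2) and hypernym("text_file", "container", 2) both hold, since the two predicates describe the same thesaurus path read in opposite directions.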

Immediately after morpholexical analysis, syntactic and semantic analysis of software descriptions are performed interactively by using a definite clause grammar. The defined grammar implements a subset of the grammar rules for imperative sentences in English [12] and is considered broad enough for our initial experimental purposes. The grammar supports the case system and states domain-independent knowledge of the English language through a set of syntactic and semantic rules. The classification mechanism uses the grammar to parse software descriptions.

A set of semantic structures is generated as a result of the parsing process, representing the internal structures of software descriptions. A language for modelling these semantic structures is shown in Figure 2.

Case_frame       --> FRAME Frame_name Hierarchical_link CASES Case_list. 
Hierarchical_link--> IS_A Frame_name | IS_A_KIND_OF Frame_name
Case_list        --> Case (Case_list)
Case             --> Case_name Facet
Case_name        --> Semantic_case | Other_case
Semantic_case    --> Action | Agent | Comparison | Condition | 
                     Destination| Duration | Goal | Instrument | 
                     Location | Manner| Purpose| Source | Time
Other_case       --> Modifier | Head | Adjective_modifier | 	
                     Participle_modifier | Noun_modifier
Facet            --> VALUE Value | DOMAIN Frame_name | 
                     CATEGORY Lexical_category
Value            --> string | Frame_name
Lexical_category --> verb | adj | noun | adv | component_id | string
Figure 2 - A language to model the semantic formalism

The language defines a frame-like classification scheme for software components based on the defined semantic cases. The classification scheme consists of a hierarchical structure of generic frames (`IS-A-KIND-OF' relationship). Frames that are instances of these generic frames (`IS-A' relationship) implement the indexing units of software descriptions.
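One way to hold such frames in memory is sketched below. It is an illustrative data structure only, assuming Python dataclasses; the class names Link, FacetKind, Facet and Frame and the render method are hypothetical and do not come from the paper. The two link kinds and the three facet kinds follow the language of Figure 2.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Link(Enum):
    IS_A = "IS_A"                  # instance of a generic frame
    IS_A_KIND_OF = "IS_A_KIND_OF"  # generic frame in the hierarchy


class FacetKind(Enum):
    VALUE = "VALUE"
    DOMAIN = "DOMAIN"
    CATEGORY = "CATEGORY"


@dataclass(frozen=True)
class Facet:
    kind: FacetKind
    content: str  # a string, a frame name or a lexical category


@dataclass
class Frame:
    name: str
    link: Link
    parent: str
    cases: Dict[str, Facet] = field(default_factory=dict)

    def render(self) -> str:
        """Write the frame in the FRAME ... CASES ... notation."""
        lines = [f"FRAME {self.name} {self.link.value} {self.parent}",
                 "     CASES"]
        for case, facet in self.cases.items():
            lines.append(f"          {case} {facet.kind.value} {facet.content}")
        return "\n".join(lines) + "."


# A generic frame with a single case, built by hand for illustration.
noun_phrase = Frame("noun_phrase", Link.IS_A_KIND_OF, "root_frame")
noun_phrase.cases["Head"] = Facet(FacetKind.CATEGORY, "noun")
print(noun_phrase.render())
```

Keeping the hierarchical link as an explicit field makes the distinction between generic frames and their instances visible in the data rather than in naming conventions.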

Major generic frames for the Knowledge Base are shown in Figure 3.

FRAME verb_phrase IS_A_KIND_OF root_frame
     CASES
          Action      CATEGORY verb
          Agent       DOMAIN   component
          Comparison  DOMAIN   noun_phrase
          Condition   DOMAIN   noun_phrase
          Destination DOMAIN   noun_phrase
          Duration    DOMAIN   noun_phrase
          Goal        DOMAIN   noun_phrase
          Instrument  DOMAIN   noun_phrase
          Location    DOMAIN   noun_phrase
          Manner      DOMAIN   noun_phrase
          Purpose     DOMAIN   verb_phrase
          Source      DOMAIN   noun_phrase
          Time        DOMAIN   noun_phrase.

FRAME noun_phrase  IS_A_KIND_OF root_frame
     CASES
          Adjective_modifier  CATEGORY  adj
          Participle_modifier CATEGORY  verb
          Noun_modifier       CATEGORY  noun
          Head                CATEGORY  noun.
  
FRAME component IS_A_KIND_OF root_frame
     CASES
          Name         CATEGORY  component_id
          Description  CATEGORY  string
          .
          . {Other information associated to        
          .  the component, e.g. source  code,       
             executable examples, reuse attributes, etc}
Figure 3 - Some generic frames of the Knowledge Base

The generic frames model the semantic structures associated with verb phrases and noun phrases, and the information associated with software components, such as name, description, source code and executable examples.

Semantic cases are represented as slots in the frames. `Facets' are associated with each slot in a frame. A facet describes one of three things: the value of the case or the name of the frame where the value is instantiated (`value' facet), the type of the frame that describes the internal structure of the case (`domain' facet), or the lexical category of the case (`category' facet). For instance, the `Location' slot in the verb phrase frame has a `domain' facet indicating that its constituents are described in a frame of type `noun phrase'.
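The interplay between the facets can be illustrated with a small consistency check. This sketch reads the `domain' facet as a constraint on the `value' facet of an instance frame, which is an interpretation rather than something the text states; the tables GENERIC and INSTANCES and the function slot_is_consistent are hypothetical.

```python
# Generic frames: slot -> (facet kind, facet content). Excerpt only.
GENERIC = {
    "verb_phrase": {
        "Action": ("CATEGORY", "verb"),
        "Location": ("DOMAIN", "noun_phrase"),
    },
    "noun_phrase": {
        "Head": ("CATEGORY", "noun"),
    },
}

# Instance frames: name -> (generic frame, slot -> VALUE facet content).
INSTANCES = {
    "verb_phrase_1": ("verb_phrase",
                      {"Action": "search", "Location": "grep_noun_phrase_1"}),
    "grep_noun_phrase_1": ("noun_phrase", {"Head": "file"}),
}


def slot_is_consistent(instance_name: str, case: str) -> bool:
    """Check a slot value against the facet declared in the generic frame."""
    generic_name, slots = INSTANCES[instance_name]
    kind, expected = GENERIC[generic_name][case]
    value = slots[case]
    if kind == "DOMAIN":
        # The value must name an instance frame of the declared type.
        return value in INSTANCES and INSTANCES[value][0] == expected
    # CATEGORY: the value is a literal word. Checking its lexical
    # category would need the lexicon, which this sketch leaves out.
    return isinstance(value, str)
```

Here slot_is_consistent("verb_phrase_1", "Location") holds because grep_noun_phrase_1 is an instance of noun_phrase, the type named by the `domain' facet of the `Location' slot.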

Through the parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and identified case generators.
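A highly simplified view of this mapping is sketched below. The actual case generators and syntactic features are not listed in this section, so the CASE_GENERATORS table is purely hypothetical; it is chosen so that the description `search a file for a string' yields the cases `Action', `Location' and `Goal'. The sketch also assumes the sentence has already been split into verb, direct object and prepositional phrases.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical case-generator table:
# (verb, syntactic marker) -> semantic case.
# The marker is "direct_object" or the preposition heading a phrase.
CASE_GENERATORS = {
    ("search", "direct_object"): "Location",
    ("search", "for"): "Goal",
}


def assign_cases(verb: str,
                 direct_object: Optional[str],
                 prepositional_phrases: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map the verb, direct object and prepositional phrases to cases."""
    cases = {"Action": verb}
    if direct_object is not None:
        cases[CASE_GENERATORS[(verb, "direct_object")]] = direct_object
    for preposition, head in prepositional_phrases:
        cases[CASE_GENERATORS[(verb, preposition)]] = head
    return cases


# "search a file for a string", pre-split by hand for illustration.
result = assign_cases("search", "file", [("for", "string")])
```

A pair missing from the table raises a KeyError in this sketch; a real interpreter would need a fallback or a richer set of generators.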

Figure 4 shows the indexing structure for the `grep' family of Unix commands built from the description `search a file for a string'. An instance of the verb_phrase frame is generated by instantiating the slots corresponding to the semantic cases identified in the description (`Action', `Location' and `Goal'). These cases have an associated `value' facet indicating either the value of the slot (such as `search' for the `Action' case) or the name of the instance frame that holds its value (grep_component, grep_noun_phrase_1 and grep_noun_phrase_2 for the semantic cases `Agent', `Location' and `Goal' respectively).

FRAME verb_phrase_1 IS_A verb_phrase
     CASES
           Agent      VALUE grep_component
           Action     VALUE `search'
           Location   VALUE grep_noun_phrase_1
           Goal       VALUE grep_noun_phrase_2.

FRAME grep_noun_phrase_1  IS_A noun_phrase
     CASES
           Head       VALUE `file'.

FRAME grep_noun_phrase_2  IS_A noun_phrase
     CASES 
           Head        VALUE `string'.

FRAME grep_component IS_A component
     CASES
           Name        VALUE `grep' 
           Description VALUE `search a file for a string'.
Figure 4 - An indexing structure for the "grep" command: "search a file for a string"
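The frames of Figure 4 reference one another by name. The following sketch stores them in a dictionary and expands those references into a single nested structure. The FRAMES layout and the function expand are hypothetical, and the sketch treats any value that matches a frame name as a reference, which is a simplification.

```python
# The four instance frames of Figure 4, keyed by frame name.
FRAMES = {
    "verb_phrase_1": {
        "is_a": "verb_phrase",
        "cases": {
            "Agent": "grep_component",
            "Action": "search",
            "Location": "grep_noun_phrase_1",
            "Goal": "grep_noun_phrase_2",
        },
    },
    "grep_noun_phrase_1": {"is_a": "noun_phrase", "cases": {"Head": "file"}},
    "grep_noun_phrase_2": {"is_a": "noun_phrase", "cases": {"Head": "string"}},
    "grep_component": {
        "is_a": "component",
        "cases": {
            "Name": "grep",
            "Description": "search a file for a string",
        },
    },
}


def expand(name: str) -> dict:
    """Replace frame-name values by the expanded frames they refer to."""
    cases = FRAMES[name]["cases"]
    return {case: expand(value) if value in FRAMES else value
            for case, value in cases.items()}


index = expand("verb_phrase_1")
# index["Location"] is {"Head": "file"}; index["Action"] stays "search".
```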

5 Similarity analysis


This is a section of a local copy of the paper A Similarity Measure for Retrieving Software Artifacts by M. R. Girardi and B. Ibrahim.