SPA Design

Overview

The design of the program is similar to the design provided, with the addition of a Lexer on the frontend that converts both SIMPLE source and PQL queries to a list of tokens before passing them to their respective parsers.

Components Design

PKB

The PKB contains all the knowledge of a SIMPLE program and provides an API to query it, such as finding whether a specific Follows relation between two statements holds, or getting all the statements that follow a given statement. All design entities are stored as strings, even statement numbers. This is possible because it is clear from the SIMPLE grammar that integers can only represent constants, which in the PKB are always part of an expression. The types of knowledge and how they are stored are described below. Most of these entities are stored in a specialized Table object capable of performing natural joins, cross products and recursive self-joins, all of which are described in detail in the next section. Design abstractions that do not translate directly to a table may be stored in another data structure, usually a map. The storage of each design entity is described in the table below.

| Design Entity | Data Structure |
| --- | --- |
| Variables | 1-column table with each entry being a variable |
| Procedures | Map from procedure name to starting and ending statement numbers |
| Constants | 1-column table with each entry being a constant |
| Statements | Map from type of statement to list of statement numbers of that type |
| Follows | 2-column table with each entry being the left and right side of the relation |
| FollowsT | 2-column table with each entry being the left and right side of the relation |
| Parent | 2-column table with each entry being the left and right side of the relation |
| ParentT | 2-column table with each entry being the left and right side of the relation |
| UsesS | 2-column table with each entry being the left and right side of the relation |
| UsesP | 2-column table with each entry being the left and right side of the relation |
| ModifiesS | 2-column table with each entry being the left and right side of the relation |
| ModifiesP | 2-column table with each entry being the left and right side of the relation |
| Calls | 2-column table with each entry being the left and right side of the relation |
| CallsT | 2-column table with each entry being the left and right side of the relation |
| Next | 2-column table with each entry being the left and right side of the relation |
| NextT | CFG (not stored but retrieved) |
| Affects | CFG (not stored but retrieved) |
| AffectsT | CFG (not stored but retrieved) |
| Assign Pattern | Map from assign statement number to the modified variable and a string representing the expression in postfix form |
| If Pattern | 2-column table with the first column being the statement number of the if statement and the second column being the variables in the condition expression of the corresponding statement |
| While Pattern | 2-column table with the first column being the statement number of the while statement and the second column being the variables in the condition expression of the corresponding statement |

The next table describes the setters for inserting design entities and how these setters interact with the data structures described above. The setters are used during the parsing of the SIMPLE source program and during design extraction.

Note that {} denotes a list or vector of items and <> denotes a tuple.

| Setter | Algorithm |
| --- | --- |
| setVar(STRING var) | Insert {var} into variable table |
| setProc(STRING proc, INT startStmtNo, INT endStmtNo) | Insert the key-value pair <proc, {startStmtNo, endStmtNo}> into the procedure map |
| setConst(INT cons) | Insert {cons} into constant table |
| setStmtType(INT stmtNo, STATEMENT_TYPE type) | Add stmtNo to the list pointed to by type in the statement map |
| setFollows(INT s1, INT s2) | Insert {s1, s2} into the Follows table |
| setFollowsT(INT s1, INT s2) | Insert {s1, s2} into the FollowsT table |
| setParent(INT s1, INT s2) | Insert {s1, s2} into the Parent table |
| setParentT(INT s1, INT s2) | Insert {s1, s2} into the ParentT table |
| setUses(INT s, STRING v) | Insert {s, v} into the UsesS table |
| setUses(STRING p, STRING v) | Insert {p, v} into the UsesP table |
| setModifies(INT s, STRING v) | Insert {s, v} into the ModifiesS table |
| setModifies(STRING p, STRING v) | Insert {p, v} into the ModifiesP table |
| setCalls(STRING p1, STRING p2) | Insert {p1, p2} into the Calls table |
| setCallsT(STRING p1, STRING p2) | Insert {p1, p2} into the CallsT table |
| setNext(INT s1, INT s2) | Insert {s1, s2} into the Next table |
| setCFG(CFG cfg) | Assign the provided CFG to the CFG in the PKB |
| setAssign(INT s, STRING v, STRING expr) | Insert the key-value pair <s, {v, expr}> into the assign pattern map |
| setIf(INT s, STRING v) | Insert {s, v} into the if pattern table |
| setWhile(INT s, STRING v) | Insert {s, v} into the while pattern table |

The next table describes the getters of the PKB. All of the getters return a Table object containing the requested information. The getters are used in design extraction and PQL evaluation.

| Getter | Algorithm |
| --- | --- |
| getVarTable() | Returns a copy of variable table |
| getProcTable() | Fill a single-column table with the keys of the procedure table and return it |
| getConstTable() | Returns a copy of constant table |
| getStmtType(STATEMENT_TYPE type) | Fill a single-column table with the statement numbers of type and return it |
| getFollows() | Returns a copy of Follows table |
| getFollowsT() | Returns a copy of FollowsT table |
| getParent() | Returns a copy of Parent table |
| getParentT() | Returns a copy of ParentT table |
| getUsesS() | Returns a copy of Uses table for statements |
| getUsesP() | Returns a copy of Uses table for procedures |
| getModifiesS() | Returns a copy of Modifies table for statements |
| getModifiesP() | Returns a copy of Modifies table for procedures |
| getCalls() | Returns a copy of Calls table |
| getCallsT() | Returns a copy of CallsT table |
| getNext() | Returns a copy of Next table |
| isNextT(INT start, INT end) | Call CFG traversal from start node, return true if end is in results |
| getNextT(INT start, BOOL isLeft) | Call CFG traversal from start node, create and return a table with the results |
| getNextT() | Call CFG traversal on every possible statement number, create and return a table with the results |
| isAffects(INT start, INT end) | Call CFG traversal from start node, return true if end is in results |
| getAffects(INT start, BOOL isLeft) | Call CFG traversal from start node, create and return a table with the results |
| getAffects() | Call CFG traversal on every possible statement number, create and return a table with the results |
| getAffectsT() | Call CFG traversal, create and return a table with the results |
| getAssignMatches(STRING expr, BOOL isPartialMatch) | Find key-value pair in the assign pattern map where the given expr is a substring of the postfix expression (if isPartialMatch) or a full match (otherwise). Fills a two-column table with the assign statement number and modified variable |
| getIfMatches() | Returns a copy of if pattern table |
| getWhileMatches() | Returns a copy of while pattern table |
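
For concreteness, below is a minimal sketch of what this interface could look like in C++, assuming a Table class like the one described in the next section. Member names and types here are illustrative and do not necessarily match the actual codebase.

```cpp
// Illustrative sketch of the PKB interface (names and types are assumptions).
#include <map>
#include <string>
#include <utility>

#include "Table.h"  // the Table class described in the next section

class PKB {
 public:
  // Setters, used during parsing and design extraction
  void setVar(const std::string& var) { varTable.insertRow({var}); }
  void setFollows(int s1, int s2) {
    followsTable.insertRow({std::to_string(s1), std::to_string(s2)});
  }
  void setProc(const std::string& proc, int startStmtNo, int endStmtNo) {
    procMap[proc] = {startStmtNo, endStmtNo};
  }
  // ... one setter per design entity and relation ...

  // Getters, used during design extraction and PQL evaluation;
  // each returns a copy of the underlying table
  Table getVarTable() const { return varTable; }
  Table getFollows() const { return followsTable; }

 private:
  Table varTable{{"var"}};                              // 1-column table of variables
  Table followsTable{{"s1", "s2"}};                     // 2-column Follows table
  std::map<std::string, std::pair<int, int>> procMap;   // proc -> {start, end}
};
```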

Design Decision

The storage and API described above differ greatly from those in iteration 1. Focusing on the retrieval of relations in the PKB, iteration 1 had a relation table object and the following methods for each relation:

  • BOOLEAN isRelation(STRING s, STRING t)
  • SET getRelation(STRING s)
  • SET getReverseRelation(STRING t)
  • SET<VECTOR> getRelationTable()

Meanwhile, iteration 2 simply has a Table object with the following method for each relation:

  • TABLE getRelation()

We evaluated the two API designs against the following considerations:

| Considerations | Iteration 1 | Iteration 2 | Importance of Consideration |
| --- | --- | --- | --- |
| Ease of implementation | Hard to implement: have to write 4 methods per relation that call different methods in the relation table | Easy to implement: simply copy and return the existing table | Important since our time is limited |
| Runtime of program | O(1) for isRelation, getRelation and getReverseRelation; O(n) for getRelationTable, where n is the number of relations | A single call that copies the entire table (O(n)) and returns it | Very important as query evaluation time is limited |
| Usage of API | The evaluator and design extractor mostly use the getRelationTable method to get the data table, which is then operated on manually; the other methods are used only in special cases | The evaluator and design extractor simply use the getRelation method to get a Table object containing the requested data. The Table object can be used directly to perform merges with other tables. | Very important since the whole point of the PKB is to expose an API to be used; the API exposed should be useful and meaningful |

With the above criteria, we went with the new design proposed for iteration 2. Not only did it simplify the implementation of the PKB, it also simplified the implementation of the PQL evaluator and improved its code quality, as described in the PQL evaluator section. Performance remained comparable, as the O(1) isRelation, getRelation and getReverseRelation calls are rare, and replacing them with an O(n) Table merge to achieve the same result did not impact the overall performance of evaluating a query.

Table

The Table class is used in the PKB as a storage object, and in the Design Extractor and PQL Evaluator to perform the natural joins and cross products required for deriving facts or answering queries. As a data structure it consists of 2 parts: a header row and a data table. The header row is simply a vector of strings, while the data table is a set of data rows, which are also vectors of strings. These data rows must be of the same size as the header row, which is set during the construction of the Table object.

The table provides a mergeWith(Table other) method, which merges the other table into itself without modifying the other table. Depending on whether the 2 tables have common columns (i.e. strings in the header row that exist in both tables), the mergeWith method performs either a natural join or a cross product. The table also provides a transitiveClosure() function that computes the transitive closure of a table with 2 columns.
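
A minimal sketch of this row-based Table, also reflecting the data structure decision discussed further below, might look as follows; the exact interface in the codebase may differ.

```cpp
// Illustrative sketch of the Table class (simplified; the real interface may differ).
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Table {
 public:
  explicit Table(std::vector<std::string> header) : headerRow(std::move(header)) {}

  // Each row must have the same number of entries as the header row.
  void insertRow(const std::vector<std::string>& row) {
    if (row.size() != headerRow.size()) {
      throw std::invalid_argument("row size does not match header size");
    }
    data.insert(row);
  }

  // Natural join if the two tables share column names, cross product otherwise.
  // Merges 'other' into this table without modifying 'other'.
  void mergeWith(const Table& other);

  // Replaces a 2-column table with its transitive closure.
  void transitiveClosure();

 private:
  std::vector<std::string> headerRow;       // column names
  std::set<std::vector<std::string>> data;  // rows, deduplicated and ordered
};
```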

Natural Join

The common column, s2, is identified. The result of the natural join is the set of all combinations of rows in table 1 and table 2 that are equal on their common columns (i.e. s2).

Final Table:

| s1 | s2 | s3 |
| --- | --- | --- |
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 2 |

Cross Product

No common columns can be identified. The result of the cross product is the set of all combinations of rows in table 1 and table 2.

Final Table:

| s1 | s2 |
| --- | --- |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |

Transitive Closure

Transitive closure is only valid for tables with 2 columns. At the end of the algorithm, the table will have 2 columns containing the transitive closure of the original entries. The algorithm for transitive closure is described in detail in the optimization sub-section below.

Optimization

Natural Join

The natural join is the function that takes up the most CPU time, both during evaluation of AffectsT and in the PQL Evaluator when merging results tables. As such, the algorithm used for natural join must be efficient and well optimized. In iteration 2, natural join was implemented as a naive nested loop join, which has a runtime of O(MN) (where M and N are the sizes of the tables). For iteration 3, a hash join was implemented instead, giving a much better runtime of O(M+N). After micro-optimizing the hash join (pre-allocating vectors and reducing copies), it used 3 times less CPU time than the nested loop join. The nested loop join is still used for the cross product (minus the check for common keys). Both algorithms are described below, followed by a sketch of the hash join.

Hash Join
  1. Given 2 tables R and S.
  2. For every entry A in R:
    1. Retrieve the key (the tuple of values of the common columns) and add a mapping from the key to the entry A in a map.
  3. For every entry B in S:
    1. Retrieve the key and probe the map with it for the list of matching entries from R.
    2. For every entry A in that list of entries:
      1. Create a new entry in the output table that contains all distinct columns from A and B.
Loop Join
  1. Given 2 tables R and S.
  2. For every entry A in R:
    1. For every entry B in S:
      1. Check if A and B contain a common key; if they do, create a new entry in the output table that contains all distinct columns from A and B.
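
For illustration, a simplified sketch of the hash join is given below. The row representation and the helper parameters (the indices of the common and distinct columns) are assumptions used to keep the example self-contained.

```cpp
// Illustrative sketch of the hash join between tables R and S, assuming rows
// are vectors of strings and the caller supplies the indices of the shared
// columns in each table plus the non-shared columns of S.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Row = std::vector<std::string>;

std::vector<Row> hashJoin(const std::vector<Row>& r, const std::vector<Row>& s,
                          const std::vector<std::size_t>& commonR,
                          const std::vector<std::size_t>& commonS,
                          const std::vector<std::size_t>& distinctS) {
  // Build phase: map each key (values of the common columns) to the rows of R with that key.
  std::unordered_map<std::string, std::vector<const Row*>> buckets;
  buckets.reserve(r.size());
  for (const Row& a : r) {
    std::string key;
    for (std::size_t i : commonR) key += a[i] + '\0';  // '\0' as a simple separator
    buckets[key].push_back(&a);
  }
  // Probe phase: for each row of S, combine it with every matching row of R.
  std::vector<Row> out;
  for (const Row& b : s) {
    std::string key;
    for (std::size_t i : commonS) key += b[i] + '\0';
    auto it = buckets.find(key);
    if (it == buckets.end()) continue;
    for (const Row* a : it->second) {
      Row joined = *a;                                        // all columns of A
      for (std::size_t i : distinctS) joined.push_back(b[i]); // plus distinct columns of B
      out.push_back(std::move(joined));
    }
  }
  return out;
}
```
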
Transitive Closure

Previously, transitive closure was implemented with repeated self-joins, n times, where n is the number of unique elements across both columns of the table. Since the size of any such table is upper-bounded by n², the runtime of the algorithm is O(n³).

For our use case, transitive closure is used in the design extractor to compute the Follows*, Parent* and Calls* relations, and in the CFG to compute the Affects* relation. The number of these relations is actually bounded by O(n), where n is the number of statements/procedures in the source program. This is because the graphs generated are mostly tree-like: Parent follows the AST, Calls follows the call graph (and cyclic calls are prohibited), while Follows, Next and Affects follow the CFG. Even for the CFG, which has loops, most vertices have an in-degree and out-degree of 1. While statements have an out-degree of 2. Statements following an if block have an in-degree of 2ⁱ, where i is the nesting level of the if block; since an if block generates 2ⁱ vertices anyway, the in-degree can be distributed over those vertices, resulting in a constant in- and out-degree for all vertices in the CFG. Therefore the number of relations in the CFG is still bounded by n.

This means we can use repeated depth-first search (DFS) instead to compute the transitive closure. This involves running DFS from every vertex of the graph generated from the relations in the table. The runtime of DFS in our case is O(n), and running it n times gives an O(n²) algorithm, a marked improvement over the previous O(n³).

In practice repeated DFS showed a 3-4 times reduction in CPU time compared to the repeated self-join algorithm.
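
A simplified sketch of the repeated-DFS transitive closure over a 2-column table is shown below; it is illustrative only and not the exact project code.

```cpp
// Illustrative sketch: transitive closure of a binary relation stored as
// 2-column rows, computed by running a DFS from every vertex.
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

using Row = std::vector<std::string>;

std::set<Row> transitiveClosure(const std::set<Row>& relations) {
  // Build an adjacency list from the 2-column rows.
  std::unordered_map<std::string, std::vector<std::string>> adj;
  for (const Row& row : relations) adj[row[0]].push_back(row[1]);

  std::set<Row> closure;
  for (const auto& entry : adj) {
    const std::string& start = entry.first;
    // Iterative DFS from every vertex that has outgoing edges.
    std::vector<std::string> stack(entry.second);
    std::set<std::string> visited;
    while (!stack.empty()) {
      std::string current = stack.back();
      stack.pop_back();
      if (!visited.insert(current).second) continue;
      closure.insert({start, current});  // current is reachable from start
      auto it = adj.find(current);
      if (it != adj.end()) {
        stack.insert(stack.end(), it->second.begin(), it->second.end());
      }
    }
  }
  return closure;
}
```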

Design decisions

Choice of data structures

The choice of data structure for representation of the header and data rows is highly important as it would affect the runtime and implementation of the mergeWith function, arguably the most important function to be optimized in the program. Therefore we have identified 2 possible ways to represent the headers and data as follow:

Column-based
VECTOR<MAP<STRING, LIST<STRING>>> table

Each map in the vector represents a column in the table, with the key being the column name and value being the list of entries in that column

Row-based
VECTOR<STRING> headerRow
SET<VECTOR<STRING>> data

HeaderRow contains the names for each column, while every vector in data represents a row in the table.

We evaluated the 2 options as follows:

Considerations Column-Based Row-Based Importance of Consideration
Sorting of rows by lexicographical order Must be done manually by going through each row and column Automatic due to the use of set Important in the implementation of in-place merge
Removal of duplicates Must be done manually by going through each row and column Automatic due to the use of set Important as duplicates increases the runtime of the merge algorithm
Dropping a column Trivial by popping the column Must be done manually by going into every entry Not very important as cases of dropping a column are rare (only with the use of "_" in the query)
With the above considerations, it is clear that a row-based representation of the table is much more suited for our problem, as such we adopted the row-based design in our implementation.

Control Flow Graph (CFG)

The CFG is a graph data structure populated by the design extractor. The following table shows the information stored in the CFG.

| Name | Description |
| --- | --- |
| InitialToCompressed | A map from initial statement numbers to compressed statement numbers |
| CompressedToInitial | A map from compressed statement numbers to initial statement numbers |
| InitialGraph | A normal CFG without compression |
| ForwardCompressedGraph | A compressed CFG following the normal CFG edges' direction |
| ReverseCompressedGraph | A compressed CFG following the normal CFG edges' reversed direction |
| NumCompressedNodes | No. of nodes in the compressed CFG |

The CFG also contains some other information from the PKB to perform the algorithms for retrieval of Next*, Affects and Affects* relations.

The following table shows the private methods in the CFG.

| Name | Description |
| --- | --- |
| getNextTForward(INT start, INT end) | Traverse the forward CFG from start till end (if any) and get all Next*(start, _)/Next*(start, end) relations. |
| getNextTReverse(INT end) | Traverse the reverse CFG from end and get all Next*(_, end) relations. |
| getAffectsForward(INT start, STRING variable) | Traverse the forward CFG from start, check if variable is modified and accumulate results along the way to get all Affects(start, _) relations. |
| getAffectsReverse(INT end, STRING variable) | Traverse the reverse CFG from end, check if variable is modified and accumulate results along the way to get all Affects(_, end) relations. |

Getting Next* relations

To retrieve Next* relations, a modified BFS traversal (explained in the Algorithm section below) is done either on the forward compressed graph for Next*(const, synonym), or on the reverse compressed graph for Next*(synonym, const), depending on the parameters passed in. If Next*(synonym, synonym) is required by the query, the same algorithm is applied once to every node in the forward compressed graph.

Algorithm

The algorithm to populate Next* is a modified BFS traversal. The only modification is that the start node is not marked as visited at the beginning of the traversal. The reason for this is to be able to retrieve Next*(s, s) relations inside a while loop of the source program. By not marking the start node as visited before going into the while loop, it allows the BFS traversal to go back to itself so that Next*(s,s) holds. Note that this implementation does not affect the rest of the nodes.

The algorithm is applied on the compressed CFG shown above.

Explanation:

  1. Suppose we start at node D with statement line 8. As stated above, the initial node would not be set to visited. The neighbours of node D are then added to the queue.
current node = D
visited array = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
result = <>
queue = <E, F>
  2. We process node E according to the normal BFS since it was not visited.
current node = E
visited array = [0,0,0,0,0,0,0,0,0,1,1,1,0,0,0]
result = <9, 10, 11>
queue = <F, D>
  3. Repeat the above for node F.
current node = F
visited array = [0,0,0,0,0,0,0,0,0,1,1,1,1,1,1]
result = <9, 10, 11, 12, 13, 14>
queue = <D>
  4. Since D was not marked at the beginning, we are able to process it now.
current node = D
visited array = [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1]
result = <8, 9, 10, 11, 12, 13, 14>
queue = <>
  5. BFS terminates.
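
A sketch of this modified BFS is shown below, simplified to one statement per CFG node; the graph representation and function name are assumptions, not the actual code.

```cpp
// Illustrative sketch of the modified BFS used for Next*: the start node is
// not marked visited initially, so a path that loops back to it (e.g. inside
// a while loop) still yields Next*(start, start).
#include <queue>
#include <set>
#include <unordered_map>
#include <vector>

using Graph = std::unordered_map<int, std::vector<int>>;  // node -> neighbours

std::set<int> getNextT(const Graph& graph, int start) {
  std::set<int> reachable;   // all nodes n such that Next*(start, n) holds
  std::set<int> visited;     // note: 'start' is intentionally NOT inserted here
  std::queue<int> frontier;
  frontier.push(start);
  while (!frontier.empty()) {
    int current = frontier.front();
    frontier.pop();
    auto it = graph.find(current);
    if (it == graph.end()) continue;
    for (int next : it->second) {
      if (visited.count(next)) continue;
      visited.insert(next);
      reachable.insert(next);  // Next*(start, next) holds
      frontier.push(next);
    }
  }
  return reachable;
}
```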

Getting Affects relations

To retrieve Affects relations, a modified BFS traversal (similar to the one used for Next*) is done either on the forward compressed graph for Affects(const, synonym), or on the reverse compressed graph for Affects(synonym, const), depending on the parameters passed in. If Affects(synonym, synonym) is required by the query, the same algorithm is applied once to every assignment node in the forward compressed graph.

Algorithm
Affects(const, synonym)
  1. Retrieve the variable v modified in the assignment statement given by the constant.
  2. Traverse the forward CFG using a modified BFS.
    1. Check the compressed node which the assignment statement is in for any following statements that modify/use v. If any statement uses v, store it as a result. If any statement modifies v, stop the traversal completely.
    2. Continue the traversal to other compressed nodes and repeat (i).
Affects(synonym, const)
  1. Retrieve the variables used in the assignment statement given by the constant.
  2. For each variable v found in step 1, traverse the reverse CFG using a modified BFS.
    1. Check the compressed node which the assignment statement is in for any prior statements that modify v. If these statements are assignment statements, store them as results. Else, stop the traversal completely.
    2. Continue the traversal to other compressed nodes and repeat (i) in each node, discontinuing the traversal along that path instead of stopping the traversal completely.
Affects(synonym, synonym)
  1. For each assignment statement in the source program, retrieve the modified variable and use the same algorithm as Affects(const, synonym) shown above.
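
The sketch below illustrates the forward traversal for Affects(const, synonym), simplified to one statement per CFG node. The helpers modifies, uses and isAssign stand in for PKB lookups and are assumptions rather than the actual API.

```cpp
// Illustrative sketch of Affects(a, synonym) by forward BFS over the CFG,
// simplified to one statement per node.
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

using Graph = std::unordered_map<int, std::vector<int>>;  // Next relation

std::set<int> getAffectsFrom(const Graph& next, int a, const std::string& v,
                             const std::function<bool(int, const std::string&)>& modifies,
                             const std::function<bool(int, const std::string&)>& uses,
                             const std::function<bool(int)>& isAssign) {
  std::set<int> results;
  std::set<int> visited;
  std::queue<int> frontier;
  // Start from the statements immediately after the assignment a.
  auto it = next.find(a);
  if (it == next.end()) return results;
  for (int s : it->second) frontier.push(s);
  while (!frontier.empty()) {
    int s = frontier.front();
    frontier.pop();
    if (!visited.insert(s).second) continue;
    if (isAssign(s) && uses(s, v)) results.insert(s);  // Affects(a, s) holds
    if (modifies(s, v)) continue;                      // v redefined: stop along this path
    auto nit = next.find(s);
    if (nit == next.end()) continue;
    for (int t : nit->second) frontier.push(t);
  }
  return results;
}
```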

Getting Affects* relations

To retrieve Affects* relations, three algorithms were considered and tested:

  1. Transitive closure using Floyd Warshall on Affects relations
  2. Modified BFS traversal on initial CFG
  3. Transitive closure using DFS on Affects relations

Evaluation of these three algorithms will be discussed in the Design Decision section.

Algorithm
Transitive closure using Floyd Warshall on Affects relations
  1. Get all Affects relations by using Affects(synonym, synonym) algorithm above.
  2. Run Floyd Warshall's algorithm for transitive closure to get Affects* relations.
Modified BFS traversal on initial CFG
  1. For the first statement of each procedure, do the following steps.
  2. Enqueue the first statement into the BFS queue with an empty vector. This vector is used to keep track of the variables modified and their corresponding line numbers during the traversal. At any point in the traversal, the vector only contains variables that can still take part in an Affects* relation.
  3. As per usual BFS, remove the first statement from the queue.
    Case 1: If it is a read/call statement, remove the variables modified by this statement from the vector, since they can no longer have an Affects* relation.
    Case 2: If it is an assignment statement, then for each modified variable stored in the vector, check whether that variable is used in the assignment statement, and store a result if it is. Next, use the Affects* relations found so far to update the other prior statements with this assignment statement. Finally, remove the variable modified by this assignment statement from the vector, since it can no longer have an Affects* relation.
  4. Visit the neighbouring statements of the current statement if they have not been visited enough times or if they are the first statement of a while block.
    Case 1: If the other statement is not in a while block, simply check whether it has been completely visited (i.e. its visited count equals its in-degree) and enqueue it if so.
    Case 2: If the other statement is in a while block and it is an assignment statement, retrieve the Affects relations from this statement using the same algorithm as Affects(const, synonym). These Affects relations are stored, and transitive closure is applied at the end of the while block to obtain all Affects* relations in the while block.
Transitive closure using DFS on Affects relations
  1. Get all Affects relations by using the Affects(synonym, synonym) algorithm above.
  2. Run DFS-based transitive closure to get the Affects* relations.

Design Decisions

Types of CFG

Normal CFG on the left; Compressed CFG on the right

We listed out some of our considerations between the two designs for the CFG below.

| Considerations | Normal CFG | Compressed CFG | Importance of Consideration |
| --- | --- | --- | --- |
| Query time complexity | O(N+R) where N is the number of program lines and R is the number of Next relations | O(n+r) where n<N and r<R, as explained below | Highly important since results for Next* are retrieved through the traversal of the CFG and query time is limited |
| Ease of implementation | Simple, as we can just use the Next relations as the CFG | Harder, since we need to write an algorithm to compress the CFG | Important as our time is limited |

The normal CFG contains one line number per node while the compressed CFG groups a few line numbers together where possible. This reduces the number of nodes and edges in the graph which makes traversal faster.

Using the above criteria, we decided to move ahead and implement the compressed CFG. The query time complexity is the most crucial point as it is part of the requirements, while the ease of implementation could be mitigated by planning the algorithm with the team before the actual implementation.

Affects* Algorithm

We listed out some of our considerations among the three different algorithms for the computation of Affects*.

| Considerations | Transitive Closure (Floyd Warshall) | Modified BFS | Transitive Closure (DFS) | Importance of Consideration |
| --- | --- | --- | --- | --- |
| Query time complexity | Slow, since the time complexity of Floyd Warshall is O(N³) where N is the no. of program lines | Medium, since transitive closure is only called on nodes in while blocks; however it is slowed down by other operations in the BFS | Fast on graphs that are not too dense. Averaged better than modified BFS using performance profiler tools and many test cases. Refer to the optimization of transitive closure in the Table section for the full theoretical analysis. | Highly important since results for Affects* are retrieved through the traversal of the CFG and query time is limited |
| Ease of implementation | Simple, as we can just retrieve Affects relations and use the same transitive closure algorithm as defined for other relations | Harder, since we need to modify the traversal algorithm for this case | Simple, as we can just retrieve Affects relations and use the same transitive closure algorithm as defined for other relations | Important as our time is limited |

Using the above criteria, we decided to move ahead and implement the transitive closure using DFS to get the Affects* relations. The query time complexity is the most crucial point as it is part of the requirements, while the ease of implementation could be mitigated by planning the algorithm with the team before the actual implementation.

Optimizations

Affects relations are cached whenever a call is made to retrieve an Affects relation. This helps to reduce the running time of algorithms that depend on this relation (e.g. Affects*).

General Lexer

The general lexer is in charge of performing lexical analysis on both SIMPLE source and PQL queries. It converts a string of SIMPLE or PQL into a series of lexical tokens. There are 3 types of tokens: Identifier, Number and Symbol tokens. Identifier and Number tokens are identified using the following rules.

LETTER: A-Z | a-z -- capital or small letter
DIGIT: 0-9
IDENTIFIER: LETTER (LETTER | DIGIT)*
NUMBER: DIGIT+

Symbols are any other characters encountered that are not whitespace. There are also a few special-case symbols containing 2 characters, as follows: >= <= == != && ||.

Comments, which start with \\ and run until the end of the line, are not converted to tokens at all.

The algorithm to lex a source string is as follows:

  1. Read in the source string character by character.
  2. If the character is '\' and the next character is '\', read until '\n' is encountered and discard everything read (the comment).
  3. If the character is a letter, keep reading alphanumeric characters and create an Identifier token.
  4. If the character is a digit, keep reading digits and create a Number token.
  5. If the character is a symbol, check whether the next character would result in a valid 2-character symbol; if so, consume the next character as well. Create a Symbol token.
  6. If the character is whitespace (i.e. tab, space, newline etc.), keep reading whitespace and discard it.
  7. Throw an error if the character fails to match any of the above cases.
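
A simplified sketch of this lexing loop is shown below; the token class names are illustrative and any character not covered above is simply treated as a symbol here.

```cpp
// Illustrative sketch of the general lexer's main loop (simplified).
#include <cctype>
#include <string>
#include <vector>

enum class TokenType { Identifier, Number, Symbol };
struct Token { TokenType type; std::string value; };

std::vector<Token> lex(const std::string& src) {
  std::vector<Token> tokens;
  std::size_t i = 0;
  while (i < src.size()) {
    char c = src[i];
    if (c == '\\' && i + 1 < src.size() && src[i + 1] == '\\') {
      while (i < src.size() && src[i] != '\n') ++i;            // skip comment
    } else if (std::isalpha(static_cast<unsigned char>(c))) {
      std::size_t start = i;
      while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
      tokens.push_back({TokenType::Identifier, src.substr(start, i - start)});
    } else if (std::isdigit(static_cast<unsigned char>(c))) {
      std::size_t start = i;
      while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
      tokens.push_back({TokenType::Number, src.substr(start, i - start)});
    } else if (std::isspace(static_cast<unsigned char>(c))) {
      ++i;                                                      // discard whitespace
    } else {
      std::string sym(1, c);
      // Check for the two-character symbols: >= <= == != && ||
      if (i + 1 < src.size()) {
        std::string two = sym + src[i + 1];
        if (two == ">=" || two == "<=" || two == "==" || two == "!=" ||
            two == "&&" || two == "||") sym = two;
      }
      tokens.push_back({TokenType::Symbol, sym});
      i += sym.size();
    }
  }
  return tokens;
}
```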

Design Decision

General Lexer vs SIMPLE and PQL Lexer

Previously, both SIMPLE and PQL had separate Lexers with separate tokens for the keywords of both grammars. We made the following considerations between rewriting the Lexers into a general version and sticking with the separate lexers.

| Considerations | General Lexer | Separate Lexers | Importance of Consideration |
| --- | --- | --- | --- |
| Maintainability | Easy to maintain, only 1 codebase | Harder to maintain, 2 separate codebases | Important as lexing is the same for both SIMPLE and PQL |
| Effort of reimplementation | Fairly simple, but may require a rewrite of the parsers | Nothing needs to be done | Important since time is constrained |
| Evaluation of codebase | Would be simpler | The PQL Lexer has close to 2.7k LoC and does things a Lexer shouldn't be doing | Highly important as more modification of PQL is needed |

In the end we did rewrite the Lexer, and therefore the SIMPLE Parser and PQL Parser too. The rewrite of the Lexer was, as predicted, not too difficult, coming in at only 130 LoC compared to 230 LoC in the old SIMPLE Lexer and 2700 LoC in the old PQL Lexer.

SIMPLE Parser

The SIMPLE Parser implements a recursive descent parser in order to parse the source program and extract out "base facts" into the PKB.

Recursive Descent Parser (RDP)

The SIMPLE Parser takes in a list of tokens generated by the General Lexer and parses it using a Recursive Descent Parser (RDP). Tokens specific to the SIMPLE language, such as keywords, operators and delimiters, are predefined, and the SIMPLE parser matches the tokens generated by the lexer against them.

The aim of an RDP is to implement each of the non-terminals of the SIMPLE grammar. Such parsing is only possible for context-free grammars where there exists some positive integer k that allows an RDP to decide which non-terminal to create by examining only the next k tokens of input. The following table shows a few examples of non-terminals found in the SIMPLE grammar.

| Type | SIMPLE grammar | Non-terminal(s) |
| --- | --- | --- |
| procedure | 'procedure' proc_name '{' stmtLst '}' | stmtLst |
| read | 'read' var_name | var_name |
| while | 'while' '(' cond_expr ')' '{' stmtLst '}' | cond_expr, stmtLst |

On encountering expr or cond_expr non-terminals, the RDP extracts the entire set of tokens representing that expr or cond_expr and switches to parsing that set of tokens with an expression parser implementing the Shunting-Yard algorithm, which converts the tokens to postfix notation.
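
As an illustration of the recursive descent structure, a sketch of how the while non-terminal could be parsed is shown below; the helper names (expect, parseCondExpr, parseStmtLst) are assumptions and not the actual parser members.

```cpp
// Illustrative sketch of recursive descent parsing for the 'while' non-terminal:
//   while: 'while' '(' cond_expr ')' '{' stmtLst '}'
#include <stdexcept>
#include <string>
#include <vector>

class SimpleParser {
 private:
  std::vector<std::string> tokens;
  std::size_t pos = 0;
  int stmtNo = 0;

  // Consume the next token if it matches 'expected', otherwise report a syntax error.
  void expect(const std::string& expected) {
    if (pos >= tokens.size() || tokens[pos] != expected) {
      throw std::runtime_error("expected '" + expected + "'");
    }
    ++pos;
  }

  void parseWhile() {
    int whileStmtNo = ++stmtNo;   // assign a statement number to the while statement
    expect("while");
    expect("(");
    parseCondExpr(whileStmtNo);   // extracts Uses and while-pattern facts
    expect(")");
    expect("{");
    parseStmtLst(whileStmtNo);    // parses children; extracts Parent/Follows/Next facts
    expect("}");
  }

  // Definitions omitted in this sketch.
  void parseCondExpr(int stmtNo);
  void parseStmtLst(int parentStmtNo);
};
```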

Parsing of statements and statement lists

Extraction of base facts

Base facts refer to knowledge about the SIMPLE source program that can be extracted in one pass of the program. The following table lists out the base facts and how these facts are extracted.

Base Facts

| Type | Algorithm | Calls to PKB |
| --- | --- | --- |
| Variables | Extracted when parsing var_name | setVar(v) |
| Procedures | Extracted when parsing proc_name | setProc(p) |
| Constants | Retrieved from the expression in assign statements | setConst(c) |
| Statement Type | After determining the type of statement to parse and assigning a statement number | setStmtType(stmtNo, stmtType) |
| follows(s1, s2) | When parsing a stmtLst, extract the relations by iterating through the statements in the stmtLst | setFollows(s1, s2) |
| parent(s1, s2) | When parsing while or if statements, iterate through the child statements in the stmtLst(s) and extract the relation | setParent(s1, s2) |
| uses(s, v) where s is an assign or print statement | When parsing assign or print statements, extract the variables from the expression (for assign) or directly (for print) | setUses(s, v) |
| uses(s, v) where s is a container statement (if or while) and v is a variable in the condition of the statement | At if or while statements, extract the variables from the conditional expression | setUses(s, v), setIf(s, v) / setWhile(s, v) |
| uses(p, v) where p is the procedure v is in | Extracted at the same locations as uses(s, v) described previously | setUses(p, v) |
| modifies(s, v) where s is an assign or read statement | At assign and read statements, extract the variables directly | setModifies(s, v) |
| modifies(p, v) where p is the procedure v is in | Extracted at the same locations as modifies(s, v) described previously | setModifies(p, v) |
| calls(p1, p2) | When parsing a call statement, extract the procedure called | setCalls(p1, p2) |
| next(s1, s2) | The extraction of next is more complicated and is described in the following subsection | setNext(s1, s2) |
| Assign pattern | When parsing an assignment statement s, extract the variable modified v and the expression converted to postfix, expr | setAssign(s, v, expr) |

Extraction of next(s1, s2)

Before the extraction of next(s1, s2) can be explained, the concept of "exit statement numbers" must be discussed. Each statement, after being parsed, will return a set of "exit statement numbers" that describe the statements that will fulfill s1 given follows(s1, s2) holds (if there does not exist s2 that fulfills the follows relation, a dummy statement may be conceptually considered). The following table describes what exit statement numbers are returned for each statement.

| Statement Type | Exit Statement Number(s) |
| --- | --- |
| assign/read/print/call/while | Returns its own statement number |
| if | Returns the union of the exit statement numbers of the last statement in each of the if statement's statement lists |

With the exit statement numbers of each statement known, these exit statements are connected to the next statement when parsing the following non-terminals:

| Non-Terminal | Algorithm |
| --- | --- |
| stmtLst | For every statement in the statement list, create a next relation from the statement's exit statements to the statement that follows it (if no statement follows, nothing is done) |
| while | Create a next relation from the exit statements of the last statement in the while statement's statement list back to the while statement |

The following diagrams demonstrate a complicated example of nested ifs connecting back to a while statement encompassing them.

This diagram shows the parse tree of a theoretical example source program with ifs nested 2 levels deep inside a while. Each node shows its statement number, statement type and exit statements derived from the rules described previously. StmtLst nodes are implicit within each edge. With this parse tree and following the rules for creating next, the CFG encoded by the next relations can be seen below.

Design Decision

In the previous iteration, we implemented the parser to output an AST instead of directly parsing into the PKB. The AST would then be traversed by the design extractor to fill the PKB with design abstractions. We have since changed the parser to extract design abstractions and insert them into the PKB during parsing, in a single pass. We listed out some of our considerations between the two designs below.

| Considerations | Direct | AST | Importance of consideration |
| --- | --- | --- | --- |
| Ease of implementation | Easier to implement by performing the extraction in the parsing of each non-terminal | Harder to implement, with the need to define an AST data structure and implement an AST traversal algorithm to perform the extraction | Highly important as our time is limited |
| Separation of concerns | Poor separation of concerns; a parser by definition should produce a parse tree and not extract information | Good separation of concerns; follows the definition of a parser and outputs a tree | Important for the maintainability of the program |
| Runtime of program | Performs 1 pass of the program to extract base facts | Performs 2 passes, 1 to generate the tree and 1 to traverse the tree | Unimportant as parsing time can be long |
| Ease of testing | Easy to test: check the PKB for the extracted design abstractions | Hard to test: unit testing must be performed for the AST builder, which usually involves writing the expected parse tree from scratch to match against the resulting parse tree. Another set of unit tests is also required for extracting design abstractions from the parse tree, with the same problem of writing the parse tree from scratch to feed into the extractor. | Highly important as testing is a vital part of our development process (test-driven development) |
| Work splitting | Harder to split since the parsing and extracting code are concentrated in one file with multiple overlapping functions | Easier to split: one person can work on parsing and one person on extracting from the AST | Important for iteration 1 but not as important for iteration 2 onwards, since there is more implementation and testing of other components to work on |

Using the above criteria, we decided to move ahead and implement direct parsing. The ease of implementation and testing made it easy for 1 person to fully implement the parser in a short period of time, while maintainability was not as much of an issue as we thought.

Design Extractor

The design extractor takes in the PKB that is already populated with the base facts (by SIMPLE parser) and performs the 3 tasks described next.

The first task is to check the PKB for any semantic errors caused by the SIMPLE program, specifically checking for calls to non-existent procedures and checking for cyclic calls using toposort.

The second task is to extract design abstractions that can be derived from the base facts in the PKB such as Follows*, Parent*, Calls*, Modifies and Uses for if/while containers, procedure calls and procedures.

The third task is to populate a compressed CFG to get results for Next* queries and also Affects and Affects* queries. More details about the CFG can be found in the previous section on CFG.

Sequence of Populating Information / Validation

The Design Extractor validates and extracts certain information in PKB in the following order:

  1. Validate procedures
  2. Validate cyclic calls
  3. Populate Follows*
  4. Populate Parent*
  5. Populate Calls*
  6. Populate Uses and Modifies for procedure calls and procedures
  7. Populate Uses for if/while statements
  8. Populate Modifies for if/while statements
  9. Populate CFG

Derived Facts

| Type | Method of Derivation | Calls to PKB |
| --- | --- | --- |
| Follows*(s1, s2) | Transitive closure with recursive self-joins using the base facts for Follows already in the PKB | setFollowsT(s1, s2) |
| Parent*(s1, s2) | Transitive closure with recursive self-joins using the base facts for Parent already in the PKB | setParentT(s1, s2) |
| Calls*(p, q) | Transitive closure with recursive self-joins using the base facts for Calls already in the PKB | setCallsT(p, q) |
| Uses(s, v) where s is a procedure call or procedure | Topologically sort the procedures in the call graph and update Uses for procedure calls and procedures in the toposorted order | setUses(s, v) |
| Modifies(s, v) where s is a procedure call or procedure | Topologically sort the procedures in the call graph and update Modifies for procedure calls and procedures in the toposorted order | setModifies(s, v) |
| Uses(s, v) where s is an if/while statement | Parent*(s, s1) and Uses(s1, v) | setUses(s, v) |
| Modifies(s, v) where s is an if/while statement | Parent*(s, s1) and Modifies(s1, v) | setModifies(s, v) |

Design Decision

We listed out some of our considerations between the two designs to derive Uses and Modifies for procedure call and procedure below.

| Considerations | Recursively populate Uses and Modifies from the first procedure to the last, while traversing the call graph | Topologically sort the call graph, then populate Uses and Modifies in that order | Importance of Consideration |
| --- | --- | --- | --- |
| Time complexity | Slow, since there may be Uses and Modifies relations that are populated more than once | Fast, since this only requires one pass to populate all Uses and Modifies relations without repeats | Unimportant since parsing time can be long |
| Ease of implementation | Harder to implement, since ancestors of a procedure have to be stored and continuously updated with Uses and Modifies relations | Simple, since it is just a toposort and traversal algorithm | Highly important as our time is limited |

Using the above criteria, we decided to move ahead and implement the second design. The ease of implementation made it easy for 1 person to fully implement this feature in a short period of time. Furthermore, the same toposort algorithm can be used to detect cyclic calls, a semantic error.
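
A simplified sketch of the toposort-based propagation (shown here for Uses; Modifies is analogous) is given below, assuming the call graph and the per-procedure direct Uses have already been collected. Names are illustrative.

```cpp
// Illustrative sketch of propagating Uses for procedures in reverse
// topological order of the call graph (callees before callers).
#include <map>
#include <set>
#include <string>
#include <vector>

// callGraph: caller -> callees; directUses: proc -> variables used directly in its body.
std::map<std::string, std::set<std::string>> propagateUses(
    const std::map<std::string, std::vector<std::string>>& callGraph,
    std::map<std::string, std::set<std::string>> directUses,
    const std::vector<std::string>& reverseTopoOrder) {
  // Process callees before callers: a caller uses everything its callees use.
  for (const std::string& proc : reverseTopoOrder) {
    auto it = callGraph.find(proc);
    if (it == callGraph.end()) continue;
    for (const std::string& callee : it->second) {
      const std::set<std::string>& calleeUses = directUses[callee];
      directUses[proc].insert(calleeUses.begin(), calleeUses.end());
    }
  }
  return directUses;  // now contains the transitive Uses for every procedure
}
```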

PQL Parser

The PQL Parser implements a recursive descent parser similar to the SIMPLE Parser. The goal of the PQL parser is to convert a PQL string to a valid Query object that the Query evaluator can use, and to throw an error if the PQL string is not valid.

Query Object

The data structure used to store a query is a class named Query, which contains:

  • A vector of QueryEntity's called selectors that stores all the declarations of synonyms;

  • A vector of Clause's called clauses that stores all the relationships in Select;

  • A vector of QueryEntity's called targets that stores all the entities that are to be selected.

More detailed structures of Query, QueryEntity and Clause are shown in the class diagram and descriptions below:

The selectors field is a vector of QueryEntity objects. Each QueryEntity object contains its QueryEntityType and name (i.e. QueryEntity(QueryEntityType, name)), where QueryEntityType is an enum class containing all the design entity types and name is stored as a string.

The clauses field is a vector of Clause objects. Each Clause object contains its ClauseType and clause parameters (i.e. Clause(ClauseType, parameters)), where ClauseType is an enum class containing all the design abstraction types and parameters is a vector of QueryEntity.

For Follows, Follows*, Parent, Parent*, Calls, Calls*, Next, Next*, With, IfPatt, WhilePatt, Uses and Modifies relationship clauses, the parameters contain 2 arguments: the left and right arguments of the relation (for IfPatt and WhilePatt, the 2 arguments are the synonym and the variable entity reference). For the AssignPatt clause, the parameters contain 3 arguments: the assign-synonym, the left argument and the right argument. The right argument is an expression if it is an exact or partial match, and the expression is converted to postfix.

The shunting-yard algorithm is used to convert the expression from infix to postfix. It maintains an output queue and an operator stack, and parses each token in the expression from left to right exactly once. Variable and constant tokens are always pushed to the output queue. Depending on the current operator token and which operator tokens are on the operator stack, operators may first be popped from the stack into the output queue before the current operator token is pushed onto the stack. When there are no more tokens to read, the remaining operators on the stack are popped into the output queue. Spaces are added before and after every token; the reason for this is described under pattern matching in the PKB.
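
A simplified sketch of this conversion is shown below; the operator set and helper names are assumptions, and parentheses are handled in the standard way.

```cpp
// Illustrative sketch of the shunting-yard conversion from infix tokens to a
// postfix string with spaces around every token.
#include <cctype>
#include <map>
#include <stack>
#include <string>
#include <vector>

std::string toPostfix(const std::vector<std::string>& tokens) {
  static const std::map<std::string, int> precedence = {
      {"+", 1}, {"-", 1}, {"*", 2}, {"/", 2}, {"%", 2}};
  std::string output;                 // output queue, built as a string
  std::stack<std::string> operators;  // operator stack
  auto emit = [&output](const std::string& tok) { output += " " + tok + " "; };

  for (const std::string& tok : tokens) {
    if (std::isalnum(static_cast<unsigned char>(tok[0]))) {
      emit(tok);                                     // variables and constants
    } else if (tok == "(") {
      operators.push(tok);
    } else if (tok == ")") {
      while (!operators.empty() && operators.top() != "(") {
        emit(operators.top());
        operators.pop();
      }
      operators.pop();                               // discard the "("
    } else {                                         // an operator
      while (!operators.empty() && operators.top() != "(" &&
             precedence.at(operators.top()) >= precedence.at(tok)) {
        emit(operators.top());
        operators.pop();
      }
      operators.push(tok);
    }
  }
  while (!operators.empty()) {                       // flush remaining operators
    emit(operators.top());
    operators.pop();
  }
  return output;
}

// Example: {"x", "+", "y", "*", "2"} becomes " x  y  2  *  + ".
```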

The select targets field is a vector of the QueryEntity objects that will be selected by this query.

Design Decision
Data Structure to Store a Query.
  1. Build a query tree to store the query
  2. Define a dedicated Query object to store a query as described above
| Considerations | Query tree | Query object | Importance of Consideration |
| --- | --- | --- | --- |
| Ease of implementation | Needs an auxiliary tree node data structure to store the entities and clauses, as well as extra effort to represent the parent-child relationships | Easy to instantiate a Query object and keep adding entities or clauses into it through its API | Important in reducing the complexity of the implementation |
| Query optimization | Shifting the position of a clause in a query tree is not easy, because it requires manipulating node pointers | For a Query object, it is much easier to shift clauses by swapping indexes or splitting the vector to group clauses | Important as the PQL Evaluator needs to apply optimization strategies to the query |
| Time complexity | To locate a clause in a query tree, one needs to search from the root node and walk along child pointers, which may take time linear in the number of nodes | For a Query object, a clause can easily be located by vector indexing, which runs in constant time | Important in increasing efficiency |

Based on the comparisons above, the Query object solution is better than the query tree for implementation, optimization and time complexity, so we chose to use the Query object.

Query Validation

Query validation is performed during parsing. Syntactic errors are caught by the recursive descent parser when there are unexpected tokens at the various stages of parsing. Semantic errors are detected during parsing but are only thrown at the end of parsing the entire query, to allow syntactic errors to be caught first. Semantic error validation is described below.

Declarations

There are 2 parts to validate when parsing declarations. The first part is checking the validity of the design-entity, which is simply a lookup in a predefined set of valid design entities. The second part is ensuring that synonyms have not been declared before, which is performed by looking up the synonym in the selectors vector and throwing an error if the synonym already exists.

Synonyms

Synonyms encountered outside of declarations must have been declared. This is validated by simply checking whether the synonym exists in the selectors vector.

Attribute references

Attribute references are of the following grammar:

attrRef: synonym '.' attrName

where only some types of attrName are valid for certain types of design entity that the synonym is for, as shown below.

procedure.procName, call.procName, variable.varName, read.varName, print.varName : NAME
constant.value : INTEGER
stmt.stmt#, read.stmt#, print.stmt#, call.stmt#, while.stmt#, if.stmt#, assign.stmt#: INTEGER

Thus, attribute references are validated by getting the design entity type of the synonym, pairing it with the attribute name and checking whether the pair exists in a predefined map of valid design entity and attribute name pairings. The map also maps these pairings to the type (NAME or INTEGER) of the attribute reference, which is useful for validating with clauses in the next section.
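
A sketch of such a validation map is shown below; the enum and map names are illustrative and not the actual code.

```cpp
// Illustrative sketch of the attribute-reference validation map.
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class DesignEntity { Procedure, Call, Variable, Read, Print, Constant,
                          Stmt, While, If, Assign };
enum class AttrType { Name, Integer };

// (design entity, attribute name) -> type of the attribute reference
static const std::map<std::pair<DesignEntity, std::string>, AttrType> kValidAttrRefs = {
    {{DesignEntity::Procedure, "procName"}, AttrType::Name},
    {{DesignEntity::Call, "procName"}, AttrType::Name},
    {{DesignEntity::Variable, "varName"}, AttrType::Name},
    {{DesignEntity::Read, "varName"}, AttrType::Name},
    {{DesignEntity::Print, "varName"}, AttrType::Name},
    {{DesignEntity::Constant, "value"}, AttrType::Integer},
    {{DesignEntity::Stmt, "stmt#"}, AttrType::Integer},
    {{DesignEntity::Read, "stmt#"}, AttrType::Integer},
    {{DesignEntity::Print, "stmt#"}, AttrType::Integer},
    {{DesignEntity::Call, "stmt#"}, AttrType::Integer},
    {{DesignEntity::While, "stmt#"}, AttrType::Integer},
    {{DesignEntity::If, "stmt#"}, AttrType::Integer},
    {{DesignEntity::Assign, "stmt#"}, AttrType::Integer},
};

AttrType validateAttrRef(DesignEntity entity, const std::string& attrName) {
  auto it = kValidAttrRefs.find({entity, attrName});
  if (it == kValidAttrRefs.end()) throw std::invalid_argument("invalid attrRef");
  return it->second;  // NAME or INTEGER, used later when validating with clauses
}
```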

With clauses

With clauses must ensure that the two references parsed are of the same type. The types of the references are either inferred (for constants, integers and synonyms) or looked up (for attribute references, refer to the previous section). A parsed synonym can also only be of design entity prog_line, which is validated after parsing the synonym.

Such that clauses

Such that clauses have 2 entities which must be of certain types to be valid for the given relation in the clause. The types are split into 2 categories: Stmt and Ent. Valid entity types for Stmt and Ent are described in the table below; a third category, Both, is the union of the two sets.

| Type | Entities |
| --- | --- |
| Ent | Procedure, Variable, Name, Underscore |
| Stmt | Stmt, Read, Print, Call, While, If, Assign, Progline, Line, Underscore |

Most of these entities are design entities that are declared for synonyms. Name is an identifier wrapped in quotes, Line is an integer, and Underscore is an underscore representing any value.

Once the entities' type is determined, a lookup is performed on the relation to determine what entity types are allowed. The lookup table is as follows:

| Relation | Left Entity | Right Entity |
| --- | --- | --- |
| Follows, FollowsT, Parent, ParentT, Next, NextT, Affects, AffectsT | Stmt | Stmt |
| Calls, CallsT | Ent | Ent |
| Uses, Modifies | Both | Ent |

A special case of validation is also performed for Uses and Modifies, ensuring that the Left Entity is not an underscore.

Pattern Clause

The first synonym in the pattern clause can only be of 3 design entity types: Assign, If and While. Depending on the type, the parser would descend to parse the specific type. If the synonym is not of the correct type, it still parses the clause as an assign but throws a semantic error at the end of parsing.

Design Decision
Method to Validate Query Relationships
  1. Use the switch-case style in the code to validate each kind of relationship separately.
  2. Store all the valid argument types in a table in the header file, and use a method to query this table to check the validity.
| Considerations | Switch-case style | Table-driven | Importance of Consideration |
| --- | --- | --- | --- |
| Ease of implementation | Need to consider all relationship types and validate each of them separately | Just get the relationship type and query the table with it to do the validation | Important in decreasing the complexity of the implementation |
| Maintainability and extensibility | Must make changes in the code to extend the rules | To further extend the validation rules, one just needs to add the new rules to the table in the header file without touching the code | Important for scalability and extensibility |

Based on the comparisons above, the table-driven method is easier to implement and extend, so we chose to use the table-driven validation method.

Query Evaluator

The Query Evaluator takes the Query object built by the Query builder and queries the PKB for answers. It first determines whether the query is "simple", where a simple query is one without clauses. Simple queries are evaluated directly by querying the PKB for what is being selected. A query that is not "simple", i.e. one with clauses, has each of its clauses evaluated separately, with the result stored as either a table or a boolean. These results are then merged depending on whether each is a table or a boolean, with some optimization strategies applied during merging. Finally, the columns for the selected synonyms are extracted and returned to the frontend.

Data Representation for Program Queries

The Query Evaluator gets a Query object from the Query builder, which contains all the information of the program query. The data representation of program queries is the same as the description of the Query object in the Query Builder section.

Strategy for Query Evaluation

As shown in the activity diagram of the PQL Evaluator, at the beginning of query evaluation the PQL Evaluator first checks whether the Query contains any clauses. A Query that doesn't contain clauses is a simple Query; otherwise, it is a complex Query. The evaluation strategies for simple Queries and complex Queries are different.

Simple Query

For a simple Query, if the Query selects BOOLEAN, the PQL Evaluator returns true; otherwise, the PQL Evaluator simply queries the PKB by calling the corresponding API in the Getters table for the data of each synonym or attrRef selected in the tuple and merges them.

Complex Query with Optimization

First, the PQL Evaluator evaluates each clause in the Query by calling the corresponding API in the Getters table for the data and filtering the data according to the parameters in the clause. The table merging method is reused here to filter the data. After that, we have the correct result for each clause.

Some optimization is then done to reduce the intermediate table sizes during merging and make it faster.

  1. The result tables of the clauses are divided into groups. Within a group, tables share some common attributes; across groups, there are no common attributes. Groups that contain selected attributes are tagged as relevant groups; the others are irrelevant groups.

  2. Within each group, we use a dynamic programming (DP) algorithm to merge the tables so that the cost of merging is minimised. Assume the number of tables in one group is N, and Ti represents the i-th table in the group in its original order. The cost of joining two tables is estimated as the size of the left table × the size of the right table × (1 + no. of comparisons), and the size of the result of joining two tables is estimated as the maximum of the sizes of the two tables. The DP algorithm is shown below:

for i = 2 to N {                                                        (1)
      forall S subset of {T1, T2, ..., TN} such that |S| = i {          (2)
            bestPlan = dummy plan with infinite cost                    (3)
            forall Sj subset of S, |S| > |Sj| > 0, with Rj = S \ Sj {   (4)
                  p = joinPlan(optPlan(Sj), optPlan(Rj))                (5)
                  if cost(p) < cost(bestPlan)                           (6)
                        bestPlan = p                                    (7)
            }
            optPlan(S) = bestPlan                                       (8)
      }
}
P = optPlan({T1, T2, ..., TN})                                          (9)

Explanation for Algorithm:

Input: a vector of tables

Basic idea behind algorithm:

Find the optimal join plan for subsets of the input, starting from size 2 up to N. When we are finding the optimal join plan for a subset of size i, all the subsets of size less than i have already been computed and can be used. At the end of the loop, we will have found the optimal join plan for size N, which is the join plan we are looking for, covering all the tables in this group.

Output: a binary tree containing the optimized join order of the tables, where the leaf nodes are the input tables and the internal nodes are join nodes. The left child and right child of a join node are merged to produce the result of that join node.

  3. According to the binary tree given by the DP algorithm, we traverse the tree and join the tables in the given order, which produces the join result for this group of tables. We do this for each group.

  4. After we get the merged result table for each group, we check whether the result of any irrelevant group is empty; if so, we return FALSE (or an empty result), else we ignore it and continue. We then join the relevant groups to obtain all the information needed for selection.

Then, according to the tuple selected, we drop the columns of the result table that are not selected and keep merging the result table with the tables of any selected synonym or attrRef not yet included in it. At this point, the PQL Evaluator has sufficient information to return. The last step is to format the result into a list of strings according to the requirements of the SPA and return it.

During evaluation, if any clause evaluates to false, meaning no data in the PKB can satisfy the clause, the evaluation process is terminated and an empty result is returned, or FALSE if BOOLEAN is selected.
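
A simplified sketch of the subset DP is shown below. It uses a bitmask over the tables in a group and the cost/size estimates described above (with the cost of a single join simplified to the product of the two table sizes); the Plan structure and names are assumptions, not the actual code.

```cpp
// Illustrative sketch of the subset DP over join orders for one group of
// result tables. Returns the root of a binary plan tree.
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

struct Plan {
  long long cost = 0;                 // estimated total cost of producing this plan
  long long size = 0;                 // estimated size of the resulting table
  int table = -1;                     // leaf: index of the input table (-1 for joins)
  std::shared_ptr<Plan> left, right;  // children of a join node
};

// sizes[i] is the (estimated) size of the i-th table in the group.
std::shared_ptr<Plan> optimalJoinPlan(const std::vector<long long>& sizes) {
  const int n = static_cast<int>(sizes.size());
  std::vector<std::shared_ptr<Plan>> best(1u << n);
  for (int i = 0; i < n; ++i) {       // base case: single-table plans
    auto leaf = std::make_shared<Plan>();
    leaf->size = sizes[i];
    leaf->table = i;
    best[1u << i] = leaf;
  }
  for (std::uint32_t s = 1; s < (1u << n); ++s) {
    if ((s & (s - 1)) == 0) continue;               // skip single-table subsets
    for (std::uint32_t sub = (s - 1) & s; sub > 0; sub = (sub - 1) & s) {
      std::uint32_t rest = s ^ sub;                 // the complementary subset
      const auto& l = best[sub];
      const auto& r = best[rest];
      // cost of this join simplified to size(left) * size(right);
      // result size estimated as the maximum of the two sizes
      long long joinCost = l->cost + r->cost + l->size * r->size;
      if (!best[s] || joinCost < best[s]->cost) {
        auto p = std::make_shared<Plan>();
        p->cost = joinCost;
        p->size = std::max(l->size, r->size);
        p->left = l;
        p->right = r;
        best[s] = p;
      }
    }
  }
  return best[(1u << n) - 1];  // optimal plan covering all tables in the group
}
```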

Examples of Evaluating Complex Queries with Optimization
Select <a,s> such that Follows(a,_) and Parent(_,ifs) and Modifies(s,v) and Uses(s,"a")
 and Parent*(s,w) and Follows*(w,w1) pattern a1(v,_"a+1"_)
  and w2(v1,_) and ifs1(v1,_) such that Follows(w2,v1) 
1. Result table for each clause after being divided into groups:
In this step, we divide the result tables of the clauses into connected groups. Groups that contain a selected item are relevant groups; the rest are irrelevant groups.

Relevant groups:

Irrelevant groups:

2. Result tables for each group after merging:
In this step, we apply the DP algorithm within each group to find the optimal join order, then perform the joins according to the binary tree given by the algorithm.

3. Result tables for the relevant and irrelevant groups after merging the relevant groups:
In this step, we merge all the relevant groups. As for the irrelevant groups, we just check whether they are empty. If the result for w2, ifs1, v1 is empty, we simply return `none`; else we ignore this irrelevant table and continue.

4. Select from the result table of the relevant groups:
This is the last step: we simply select [a, s] from [ifs, a, s, v, w, w1, a1] and return the result.

Design Decisions

Greedy algorithm vs DP algorithm
| Considerations | Greedy | Dynamic Programming | Importance of consideration |
| --- | --- | --- | --- |
| Ease of implementation | Easy to implement: order the clauses and choose the next clause to execute and join by following some heuristic | Harder to implement: need to estimate the cost of a join and the size of the join result to compute the best join plan, and need to construct a tree structure to represent the join plan | Important to make the code simple and easy to understand |
| Performance | Faster than randomly executing and joining the results; less time spent computing join plans | A bit more time spent computing the optimal join plan, but generally comes up with a better join plan than Greedy. Because the time spent computing join plans is a small portion of the time spent actually joining the tables, DP is faster than Greedy overall | Most important consideration in doing optimization |
| Space usage | Does not use extra space | Needs extra space to store the optimal join plan for subsets; the number of subsets is 2^(number of tables in a group) | Less important than performance |

Based on the comparisons above, although DP is harder to implement and takes more space, it performs better than Greedy. Since performance is more important, we chose the DP approach.

Divide the result tables into groups vs evaluating them together
| Considerations | Divide into groups | Evaluate all together | Importance of Consideration |
| --- | --- | --- | --- |
| Ease of implementation | Harder to implement: need to traverse all the result tables to find connected result tables | Easy to implement: simply give all the result tables to the optimizer to join | Important to make the code simple and easy to understand |
| Performance | Takes less time to join tables; groups that are irrelevant to the final results are not joined | Takes more time | Most important factor to consider |
| Space usage | Uses less space | Uses more space, because the number of subsets of the input tables is 2^(number of input tables); taking all tables without dividing causes a much larger plan space to search | Less important |

Without dividing the tables, both the time and space costs are higher. As performance is our first consideration, we proceeded with the divide-into-connected-groups approach.
