1. Modular Semantic Analysis

Todo

  • How are results consolidated?

  • Complete overview over *_Results data structures, in particular, add values.

  • In local mode, where do we get the interprocedural results Andreas is talking about? A: per SSI node, for example: get_global_values()

  • Inline TODO

Modular Semantic Analysis (MSA) will have multiple modes, of which only the first is implemented. You can see an overview of the basic approach on the right side.

An overview of the different analyses in FSPTA

Figure 1.1 An overview of the different analyses in FSPTA.

1.1. Analysis mode: Function-By-Function or Local

In this mode, each routine is analysed in isolation. Almost no information is propagated from other routines to the analysed routine, except information derived

  • from contracts and external specifications, and

  • from several interprocedural analysis steps (e.g. “called function is pure”)

The Function-By-Function analysis mode is intended to find “interesting” findings with respect to the investigated error categories. It is therefore expected to be fast and might be unsound: it might skip true positives, but reports not too many false positives either. It can apply heuristics in its error checks to achieve this task.

For each routine, the analysis performs the following steps.

  1. Perform basic intraprocedural analyses, including

    • (local) pointer analysis

    • creation of SSA/SSI form

    • numerical analyses like range computation

    • symbolic expression

  2. Execute configured semantic checks (see library libSemanticChecks)

See fscs_analysis.cpp in libIntegrated for creation of basic analyses for MSA.

1.1.1. Results and data structures of different analyses

Analysis results are passed between the different analysis stages through three result types: Permanent_Results, Interprocedural_Results, and Local_Results.

  • Permanent_Results holds the analysis configuration, a Global_Object_Manager, and a list of Interprocedural_Results.

  • Interprocedural_Results contain the call graph (the initial approximated one as well as a final refined one), and two Interprocedural_Routine_Results, which contain the following results.

    • Interprocedural_Control_Flow_Results

    • Interprocedural_Data_Flow_Results

    • Interprocedural_Pointer_Results

    • Interprocedural_Numerical_Results

  • Local_Results link to Permanent_Results, Interprocedural_Results, and the following five results.

    • Local_Control_Flow_Results
      Unreachable blocks and locally caught exceptions (no control flow graph)
    • Local_Data_Flow_Results
      SSI graph (i.e., data flow graph between SSI nodes), modification side effects (TODO: what is this exactly?), use side effects (TODO: what is this exactly?)
    • Local_Pointer_Results
      May-alias relation between two pointer targets.
    • Local_Numerical_Results
      Not yet implemented
    • Local_Relational_Values_Results
      Relations that hold at the beginning of a basic block. Relations are represented as Relational_Value which is NOT a Value.

For SSI nodes, analysis values can also be obtained by FSPTA::SSI_Node::get_local_values(), returning a reference to the instance of Analysis_Values associated with the node.

FSPTA value types

Figure 1.2 FSPTA value types (except the ones that do not inherit Analysis_Value).

1.1.2. Interprocedural Base Analysis

All interprocedural base analyses implement the interface FSPTA::Interprocedural_Analysis and store their results as Permanent_Results. This phase is based on the RCI technique and follows the four phases described in the paper.

  1. Callgraph_Overestimation (phase 0 in RCI)
    Preparation step for RCI: compute call graph overestimation to guide bottom-up traversal.
  2. Local_Analysis_And_Summary_Extraction (part of phase 1 in RCI)
    Determine control-flow, data-flow, pointer and numerical results by running std_local_analysis on an overapproximated call graph, i.e., by looking at every routine in isolation and generating summaries in terms of symbolic values from these results.
  3. Separate_Bottom_Up_phase (part of phase 1 in RCI)
    Propagate symbolic values found in previous analysis bottom up, i.e., compute transitive closure of summaries and global modifications (what part of the heap is untouched by a routine). Resolve symbolic values if possible.
  4. RCI_Top_Down_Phase (phase 2 and 3 in RCI)
    Propagate concrete values to routine entries and inside methods top-down to resolve remaining symbolic values with concrete values.

1.1.3. Intraprocedural Analysis

The various intraprocedural analyses can either run once (linear) or run until a fixed point is reached, i.e., until the current run did not produce any change.

fscs_analysis#std_local_analysis utilizes the following intraprocedural analyses. Their options and their context (linear or fixed point) depend on various settings. Figure 1.1 shows their order and their context.

  • Local_SSI_Analysis (DFA)
    Computes uses and definitions of variables. Creates SSI (default) or SSA graph.
  • Local_Exception_Analysis
    Computes which blocks throw or catch which exceptions.
  • Local_FSPTA
    Flow-sensitive pointer analysis. Computes points-to sets for all pointer targets as well as data and control flow information. Can perform alias analysis. Based on this PhD thesis.
  • Local_Interval_Analysis
    Is going to be merged with Range_Analysis – currently, does nothing.
  • Basic_Local_Control_Flow_Analysis
    Refines call graph by using pointer analysis results.
  • Range_Analysis
    Complete abstract interpreter (computes its own fixed point, uses simple interval domain together with symbolic expressions and disjunctive states).
  • Condition_Analysis
    Combination of multiple analyses, computes its own fixed point.
    • Reset_Alias_Annotations resets all alias annotations s.t. the next analysis can compute the correct ones.

    • Local_FSPTA with alias analysis.

    • Local_Node_Condition_Analysis infers conditions for the existence of nodes or abstract nodes, e.g, only if \(p=q\).

    • Remove_Impossible_Values clean-up step that removes impossible values (artefacts of previous analyses).

  • Symbolic_Access_Analysis
    Use abstract interpretation with symbolic expression domain to refine reachable states, distinguish vales that may contain 0 from those that do not, and relational “values” for direct nodes.

1.1.4. Analysing a routine locally in Function-By-Function mode

On the C++ side, the library libSemanticChecks contains implementations of checks for each supported semantic check (like null-deref, overflow, …) of modular semantic analysis. The checks can, for each analysed routine \(r\),

  • use local results for \(r\) coming from the basic analysis, obtained via FSPTA::Local_Results instance

  • additionally run their own custom analysis for \(r\).

Results are generated by creating Semantic_Checks::Issue instances (see check.h in libSemanticChecks)

Note: The corresponding library for error checks in FIPTA is libErrorChecks.

1.2. Data-Flow Representation

To represent the data flow, Modular Semantic Analysis supports two approaches: SSA and SSI. SSI creates more nodes but may help to achieve higher precision.

1.2.1. Nodes in the data-flow graph

We distinguish data-flow nodes and operation nodes. Data-flow nodes represent the typical action on a certain memory location at a certain position in the code, whereas operation nodes represent operations like casts, roughly similar to LIR instruction types.

Types of data-flow nodes:

  • basic nodes: definition, use, deref_use, postcall, precall, address

  • block nodes (SSA): entry, exit, uninit, phi_argument, phi_definition

  • block nodes (SSI): same as for SSA, plus sigma_argument, sigma_definition

A dereferencing use node is a special case of a use node and provides a link to the nodes created for the pointer targets. Precall nodes are used to represent data-flow going into a call; there are precall nodes for the arguments and a placeholder precall node for the use side effects (we only deal with data-flow side effects here, i.e. accessing non-local objects; other side effects are not represented by data-flow nodes). Postcall nodes represent the data-flow coming out of a call; there is a placeholder postcall node for the definition side effects. Taking the address of an object is also represented with a data-flow node, while taking a function’s address is represented with an operation node.

Phi and sigma nodes are split into their effects at the end of the preceding block(s) and the start of the next block(s). The arguments of a phi node represent use-like nodes at the end of branches, while the definition part of a phi node is a def-like node at the beginning of the join block. Likewise, a sigma node has a single argument at the end of the split block and a definition node at the beginning of each branch.

1.2.2. Edges in the data-flow graph

Edges connect nodes in the data-flow graph, available both in data-flow direction (e.g. from def to use of same object, or from RHS to LHS of an assignment, or from operands into the operation) and also against data-flow direction.

For an edge from the reaching definition to a data-flow node, we currently enforce that the source and target node of such an edge refer to the exact same object. Possible influences on the data-flow to the target node through other objects are represented as Alias_Definition points along the edge. These cover especially the case of pointer-indirect definitions in between the reaching definition and the node, but also the possible influence from definition side effects of a call in between. In contrast to FIPTA, the alias definitions do not involve abstract nodes (see below), but instances of them (or, in the case of weak updates for array elements, those weak update nodes). The alias definitions could also be used for other types of aliasing, e.g. union members.

We also store def-def edges, albeit without recording potential other definitions for them. In addition, definition-like nodes allow to access the nodes/edges for which they appear as a potential definition on the way from their reaching definition.

Argument and definition part of phi and sigma nodes are connected by special types of data-flow edges (sometimes called “internal” edges). The connection between RHS and LHS of an assignment is represented by a “copy edge”.

Note that these edges and alias definitions are not only useful for error checks, but they are also used within the algorithms for building up the data-flow graph and pointer information (e.g., when incrementally adding a new node and adjusting all the edges accordingly).

Currently, the algorithm also produces generic edges for the nodes known in the initial iteration. These are completely separate from the normal edges and resemble the FIPTA model. They link a use-like node with its reaching definition, listing the first possible intermediate definition along the way (others are found implicitly from there, see the notion of data-flow basic block below). There is no back-link from definitions to where they appear as such possible intermediate definitions, and these possible intermediate definitions are just indices into an array, no full Alias_Definition instances as for normal edges.

Note

Generic edges are unused at the moment.

1.2.3. Abstract nodes and instances thereof

In places where we cannot name the affected objects directly, we use abstract nodes and later on during the analysis create instances of these with the computed possible objects that might be affected there. Such places are pointer dereferences and placeholders for side effects, and block nodes generated by SSA/SSI based on them (array accesses are not directly causing abstract nodes - we might reconsider this in light of the similarities between arrays and pointers). We therefore have a special abstract object for which data-flow is generated as well, including block nodes like for example entry, phi or exit nodes. An abstract node provides links to its instances and vice versa. In contrast to FIPTA, we represent each instance to compute and store the data-flow and values of each instance separately.

1.2.4. Data-flow basic blocks

In SSA form, the use-like nodes for which a definition is the reaching definition form a tree, rooted at the definition and guided by dominance relationship. In SSI form, that becomes a linear list instead of a tree, also taking post-dominance into account; we sometimes call this a data-flow (basic) block.

This property can be beneficial for a compact representation as leveraged in FIPTA (but not yet in FSPTA). For example, it is sufficient to store potential (alias) definitions only once per data-flow block, as the alias definition then automatically applies to all use-like nodes following it - no need to duplicate it for all of them.

1.2.5. Some notes on aliasing

Data-Flow representation has to deal with aliasing inside the object model: Symbolic object p->next could be an alias for global. As aliasing is not transitive, we have to track possible alias-definitions separately for each object and cannot linearize the definitions across objects. That is:

  • Def-Use and Def-Def edges always connect nodes of the very same object (as if there was no aliasing inside the object model)

  • In addition, possible intermediate definitions through other, aliasing objects must be annotated at the edge (as for generic edges)

Also, nodes of different, possibly aliasing objects co-exist at the same location. For example, at *q, we might have an indirect definition for both p->next and global, even if both could be aliases, as they represent different abstractions. This also holds for phi/sigma nodes.

1.3. Bottom-up Analysis

1.3.1. General outline of bottom-up analysis

Once local analysis has produced local summaries for all functions, the bottom-up phase begins interprocedural analysis. It operates on the call graph and the function summaries without going into the details of functions again.

While local summaries may include both bottom-up unknowns (e.g. UMOD, UMV, URV) and top-down unknown (e.g. UIV), summaries after bottom-up analysis should only have top-down unknowns. It is the job of the bottom-up analysis to resolve the bottom-up unknowns (except for those from calls to external functions). Such unknowns may appear in objects (e.g. UMOD or the result of dereferencing some unknown value), in values (e.g. URV) or in alias conditions associated with values (e.g. if (URV == x)).

Bottom-up analysis for a given function f processes the summaries of all its callees (for which bottom-up analysis has already been done except in cycles). Such a callee summary is “lifted” to the call site in f and thereby translated into the vocabulary usable in f. That is, callee-relative unknowns are replaced by their counterparts from the call site in f. That includes UIV(param) and UIV(globalvar) cases. For parameters, bottom-up analysis replaces UIV(param) with the argument values which were collected in the top-down local summary during local analysis. So for example, if we process the call g(&x), this lifting step would translate UIV(param) into the value &x. For the unknown input value of a global variable there can be two cases: Either the call has directly collected the values for this particular variable at the call site, or we have to inspect the general values collected for the UUSE use side effect placeholder object. That object exists exactly for this purpose and bottom-up analysis identifies relevant values by looking at their associated aliasing conditions like if (UUSE == global).

After lifting the callee summaries, bottom-up analysis is ready to compute transitive modification side effects for f. For that, it starts with the local summary of f and replaces the URV/UMOD/UMV placeholders with the actual effects and values from the lifted callee summaries. Remember that these replacements have to be done for objects, values and aliasing conditions.

Finally, bottom-up analysis for function f concludes with some preparations for the top-down analysis in that it also resolves argument values and values for global variables as stored in the top-down summaries, removing the bottom-up unknowns in them as well.

1.3.2. Dependencies between lifted summaries

After lifting callee summaries into the caller, the lifted summaries may have interdependencies. For example, if we have g(h()) as two calls, and if g returns its parameter, we have the following situation:

  1. Local summary of g says URV(g) = UIV(param)

  2. Lifting that to the call site replaces UIV(param) with the argument value, which happens to be URV(h); so we have URV(g) = URV(h) and the lifted summary for the call to g now depends on the lifted summary for the call to h.

In general, such dependencies can be more complex and also include nested cases, like UIV(URV(h)->field) etc. Additionally, while we are lifting the summary for g, we might not yet have lifted the summary of h. Bottom-up analysis therefore postpones the resolution of these dependencies until after the summaries of all calls inside the function f have been lifted. It builds up a dependency graph that tracks the relations, and then processes this dependency graph in order to step-by-step replace dependencies to other summaries by their resolved versions.

There is one exception to this postponing: When we look at UUSE values to replace some UIV(global) with its values from the call site, and if this results in some value referring to the side effects of another call (e.g., UMV(global from h) if (global == UMOD(h))), we directly inspect the callee’s summary for whether there is such a side effect at all. This removes many items early, reducing runtime and memory consumption considerably.

1.4. Basic Analysis Facts: Accessing Analysis Results

1.4.1. Values produced for Basic Analysis Facts

Before analysing a single routine in local mode, an instance of Local_Results is created, which is passed into the analyses as function argument. Each execution of a local analysis stores its results or modifies existing result values in said instance of Local_Results.

Local analyses are instances of FSPTA::Local_Analysis, e.g. pointer analysis, control flow analysis, range analysis. The order of their execution(s) is currently specified in fscs_analysis.cpp, to allow one analysis to use results from its predecessors.

The same analysis can run multiple times during the overall analysis of a routine:

  1. Analyses can perform a fixed-point computation.

  2. An analysis can refine a result of a previously run different analysis by being executed a second time in a later stage of the analysis process.

Currently, within a fixed point computation, i.e., one or more analyses are repeated until a fixed point is reached, the Local_Results instance is not necessarily an overapproximation of the program behaviour within the fixed point computation loop. However, there are well-defined points within the analysis sequence where the analysis can assume that the results are in fact an overapproximation, most prominently at the end of the complete analysis run.

Analyses can produce or modify instances of Basic_Analysis_Values::Analysis_Value, the base type of analysis results (there are however other analyses that e.g. create SSI form or control flow information). These values have to be interpreted in the context of

  • an SSI node,

  • start of a basic block,

  • before or after an instruction,

  • a call or

  • an entire routine.

Subclasses of this value represent different kinds of analysis results.

1.4.1.1. Values having SSI node as context

In the context of SSI nodes, the class Analysis_Values groups together several analysis value kinds. Value kinds having an SSI node as context are:

  • Pointer_Value: for the case the context is of type pointer: represents a single abstraction for a memory object / routine etc.

  • Singleton_Value: Flags for information like “context is uninitialized”

  • Numerical_Value: Base class for information regarding the numerical value of the context, e.g. integer ranges with Interval_Value,

  • Relational_Value: represents knowledge about relations between variables holding before the actual context, e.g. a < b for a and b being variables.

Not each SSI node stores values of each of these categories: E.g., numerical values are stored for SSI nodes representing a memory location interpreted as numerical value; pointer values are computed for locations having pointer type.

Corresponding classes ending in _Values (e.g. Pointer_Values, Singleton_Values) represent sets of such values, allowing to describe the abstract state using disjunctions (e.g. sets of pointer values, sets of intervals).

Within an instance of Analysis_Values, an analysis can register its analysis kind by creating a field for its values and providing accessors. Currently, there are instances of Singleton_Values, Relational_Values, Pointer_Values and Interval_Values present. It is also possible to create additional comfort functions or facade functions for commonly required queries and tasks, like checking whether value 0 can occur.

1.4.1.2. Aliasing Conditions for Pointer Values

When doing whole-program analysis (or when explicitly activating them), we annotate pointer values with conditions that tell us more about the situations in which those values can appear. These conditions are useful for higher precision and especially to resolve uncertainty from potential side effects.

The conditions represent aliasing conditions: equality or inequality between two objects. For the special analysis objects that represent sets of (side effect) objects, the conditions correspond to set-inclusion (“if the object is (not) modified in this call”). At the moment, an annotated value like UMV(g) (if g == UMOD(call)) can be read as “if the value <unknown value from side effect of call on g> appears here, then g was in the set of modification side effects of the call”. As another example, &x (if x != *UIV(p)) means that if the address of x appears as value, then x was not the value of parameter p.

To avoid trouble with aliases only appearing later during analysis, our fixed point analysis for each function first computes the values without any conditions attached. After the fixed point analysis, a pass is done to add the conditions, and they are derived from the alias definitions along the edges along which the value is propagated.

For propagating a value from the definition to the use of an edge, the conditions relating to the alias definitions on the edge are combined with AND and tell us that all these alias relations must not hold in order for the definition’s value to arrive at the use node. The condition for a value coming in from one of these alias definitions starts with the requirement that the aliasing relation for that alias definition must hold, and then combines the non-alias-conditions for all subsequent alias definitions in the same way as the definition’s value did. If the alias definition is a phi node, care must be taken to create the correct condition, because we do not directly see the relationship between all the aliases at the same phi - there might be some from the same branch or from different branches, and it might also be that in some branch there is no alias definition at all. In general, the condition for a phi node would be an OR over individual conditions for the branches, and each branch condition can be complex in itself. We therefore overapproximate at the moment, by simply not attaching any non-alias-condition for bypassing a phi node.

1.4.2. Use cases

  1. During analysis phase: analysis execution step \(A_i\) gets results computed
    by \(A_1, \ldots, A_{i-1}\) to use them in its own analysis and adds as a result a value / values that are not yet present at all in the results, e.g. Local_FSPTA adds pointer targets.
  2. During analysis phase: analysis execution step \(A_i\) gets results computed
    by \(A_1, \ldots, A_{i-1}\) and refines the results of an already existing kind of values, and passes the modified results to \(A_{i+1}\) For example: Range_Analysis refines already existing range values computed by the preceding Loop_Counter_Relation analysis.