2. Abstract Interpreter

The C++ library libAbstraIR allows implementing analyses based on the Abstract interpretation framework.

2.1. Basics

For a user-defined abstract domain T, libAbstraIR allows specifying:

  • Domain operations like join or comparisons: Domain_Operations<T>

  • Semantic operations like instruction effects, effects of conditions or calls: Semantic_Operations<T>.

Additionally, the following domain functors are currently available for combining domains \(D_1, D_2\):

  • Partially reduced domain product \(D_1 \times D_2\) in base/reduced_product_domain.h and base/reduced_product_semantics.h and

  • domain map \(V \rightarrow D_1\), to create a composed domain that consists of a map from a key set \(V\) (most of the time memory locations) to a base domain \(D_1\).

For example, a domain of maps from variables to intervals is implemented in the directory intervals, with operations of the base domain in interval_domain.h and semantic operations defined in interval_semantics.h.

2.2. Interpreters

Analyses can be carried out interprocedurally and intraprocedurally:

  • base/modular_interpreter.h contains a classic abstract interpreter that can work

    either intraprocedurally or interprocedurally. It stores values at each control flow block, or, more precisely, at each control flow location.

  • base/local_ssi_interpreter.h contains an interpreter that is specifically designed for use with analyses in MSA where values are stored for each SSI node.

2.3. Recipe for creating an analysis

The following is a rough check list for adding a new analysis. More steps might be required.

  • Specify the type of your domain elements D.

  • Create subclasses or template instantiations of Domain_Operations<D> and Semantic_Operations<D>

  • in exchange/analysis_provider.h, add an enumerator to Domain_Kind, and

  • in exchange/analysis_provider.cpp, add a string name to domain_shortcuts, and

  • add instances of your domain operations, semantic operations, and an instance of the interpreter you like to use, and extend the switch-case statement handling the different analyses.

  • Use irviewer (with --iranalysis --abstract_domain my_domain_shortcut) to check the results.

  • Create the actual analysis (see analyses folder) that uses results of your domain.

2.4. Architecture of libAbstraIR

2.4.1. Domain framework (domain_operations.h)

Elements of a particular abstract domain are instances of a freely choosable class D. To obtain the lattice-theoretic operations and comparisons between elements of D, a partial specialization Bauhaus::Abstract_Interpreter::Domains::Domain_Operations<D> has to be created. Here operations like

  1. the constants bot() and top() for obtaining the bottom or top value of D,

  2. methods to_bot() and to_top() to mutate an element of D into the bottom or top value of D,

  3. is_leq() for comparing two elements of D,

  4. is_eq() checks two elements for equality,

  5. join_with() and widen_with() for performing (destructive) join and widening, and

  6. a print() operation for D-values

have to be implemented. The abstract interpreter creates an instance of Domain_Operations<D> before the analysis, and then calls the method Domain_Operations<D>::init().

2.4.2. Abstract semantics (semantic_operations.h)

Given an abstract domain D, the class Bauhaus::Abstract_Interpreter::Domains::Semantic_Operations<D> (or an instance of it) can be used to define an abstract semantics for D:

  1. Semantics for each instruction: process_instruction(),

  2. restricting the range of values when a case of a switch statement is taken: filter_range(),

  3. restricting the range of values after an if block, where one of the two outgoing branches has been taken: filter_conditional(),

  4. for preparing an abstract value as input to a routine call: process_precall(),

  5. for doing “clean-ups” at transferring information back to a caller: filter_at_routine_exit(),

  6. for processing an abstract value at the end block of a routine back to a caller, combining the value with the stored value pre_entry before the call: process_postcall().

2.5. Existing analyses

2.5.1. Interval Domain

Code for interval analysis can be found in folder intervals. A domain for interval analysis is a map from a set of (designators of) memory locations to single intervals. Since the representation of memory locations and also the used interpreters differ for FIPTA and MSA, there are different interval analyses. However, basic domain and semantic operations for all three kinds are shared and contained within interval_domain.h, interval_semantics.h, interval_instructions.h, and interval_conditions.h:

  • MSA-sparse: Variables are SSI nodes (aka const FSPTA::SSI_Node*), interpreter is local_ssi_interpreter.h, currently intraprocedural: fspta_sparse_interval*: stores one value per SSI node, therefore the term “sparse” is used. It uses sigma nodes to evaluate branch conditions, and propagates along data flow relations, including alias edges. It can be slow if there are many aliases present, and is not very precise with respect to branch conditions.

  • MSA-dense: Variables are objects (i.e., Basic_Analysis_Facts::Object_Ptr), interpreter is modular_interpreter.h, currently intraprocedural: fspta_dense_interval*: tracks multiple values of variables for each basic block, therefore the term “dense” is used. There is a risk of tracking too many (useless) range information, which we try to tackle by restricting the maximal number of tracked elements per block by the option performance.general.abstract_value_analysis_options.interval_precision_threshold of the rule StaticSemanticAnalysis. It is often more precise than the sparse approach. Currently, both analyses are activated since they complement each other.

  • FIPTA-dense: Variables are accesses (aka Access<Objects::Object), interpreter is modular_interpreter.h, intraprocedural: fipta_interval*

Todo

  • Meeting notes might contain some usable fragments.

  • MSA-dense should track less values that are easily derivable (e.g. constant values, see also FIPTA-dense version) and/or are not required (e.g. track only-once-written values using a separate analysis pass).

  • Alternative MSA-sparse vs. MSA-dense: both active or choose via config option?

2.5.2. Buffer Domain

2.5.2.1. Approach

Abstraction of buffer contents

For each memory location that represents a string buffer \(b\),

  • its complete length \(len(b)\), and

  • the location of the first null byte \(zero(b)\) set in the buffer

are tracked. These numbers are meant as number of elements (chars etc.). For example, after char a[20] = "abc"; the following holds: \(len(a)=20 \land zero(a)=3\). The simple idea is to abstract these values using intervals.

Abstraction of pointers

Consider the following example.

char* p = &a;
p = p+1;
*p = '\0';

We have to track both the object a pointer points to and the offsets. For example, p points to a+1 after executing the example, and \(zero(a)=1\).

2.5.2.2. Papers

2.5.3. Symbolic Expression Domain

2.5.3.1. Summary

This domain is used for a cheap analysis of division-by-zero and out-of-bounds checks. In its initial form, it is basically intraprocedural. It tracks

  • a selection of predicates that unconditionally hold at a program point and

  • values of objects in form of certain arithmetic or boolean expressions.

The code for this domain is located in abstrair/src/symbolic.

2.5.3.2. Expressions (symbolic/expression.h)

The domain tracks symbolic expressions (Abstract_Domains::Expression). Object is an abstraction of memory locations, ObjectSet is a set of Object elements. Both types are given as template arguments in the implementation. Delta is a non-negative offset.

Var       ::=  Object | *PtrExpr
Expr      ::=  BoolExpr | ArithExpr | PtrExpr
BoolExpr  ::=  (Var == 0) | (Var != 0) | (Var < Var) | (Var < = Var) | (Var == Var) | (Var != Var)
ArithExpr ::=  (Var - Var) | (Var + Var) | (Var * Var) | (Var / Var) | (Var % Var) | (- Var)
PtrExpr   ::=  PTR(Object + Object) | PTR(Object + Delta)

2.5.3.3. Abstract elements (symbolic/symbolic_domain.h)

A domain element (Abstract_Domains::Expression_State) is a tuple \((P, f, up)\) where

  • \(P\) is a set of boolean expressions (\(BoolExpr\)),

  • \(f : Var \nrightarrow Expr\) is a partial map assigning expressions to variables,

  • \(up : PtrExpr \nrightarrow ObjectSet\) is a partial map, assigning sets of objects to pointer expressions.

If \(x\), \(y\) and \(z\) are integer-valued program variables, \(a\) and \(b\) are array variables, and \(p\) and \(q\) are pointer variables, the following is an example of a domain element.

\[\left( P=\{ (x>y), (*PTR(p+x) \neq 0)\}, f=[q \rightarrow PTR(p + x), y \rightarrow (x-z)], up=[PTR(p+x) \rightarrow \{a[.], b[.]\}] \right)\]

The concretization of \((P, f, up)\) is the set of all concrete program states \(s\) such that

  • \(s(p)\) holds for each \(p \in P\),

  • if \(f(v) = e\), then \(s(v) = s(e)\), and

  • if \(up(PTR(p+x)) = O\), then \(s((p+x)) \in O\).

We add special elements \(\top\) and \(\bot\). The concretization of \(\top\) are all program states, the concretization of \(\bot\) is the empty set.

For elements \(x_1 = (P_1, f_1, up_1)\) and \(x_2 = (P_2, f_2, up_2)\), \(x_1 <= x_2\) if

  • \(P_1 \subset P_2\),

  • \(\forall v \in Var:\) if \(f_{1}(v) = e\), then \(f_{2}(v) = e\) or \(v\) is not defined for \(f_2\), and

  • \(\forall p \in PtrExpr:\) if \(up_1(p) = o_1\), then \(up_1(p) \subseteq up_2(p)\) or \(p\) is not defined for \(up_2\).

This could be made more precise by introducing an order on the expressions, e.g. \((x < y) \sqsubseteq (x \leq y)\).

The join can be defined as \(x_1 \sqcup x_2 = (P, f, up)\) with

  • \(P = P_1 \cap P_2\) (or something more precise),

  • \(f(v) = e\) if \(f_1(v) = e\) and \(f_2(v) = e\), and

  • \(up(p) = up_1(p) \cup up_2(p)\) if \(up_1(p)\) and \(up_2(p)\) are defined.

2.5.3.4. Transitive closure

Right now, equalities and inequalities are not directly inferred from existing predicates or stored values. E.g., if the predicates \(x=y\) and \(y=z\) hold, the predicate \(x=z\) is not automatically computed. However, this would help to reduce precision loss introduced at join operations, since predicates are removed from the join result if they do not occur in both operands of the join. But the transitive closure is expensive to compute and might lead to an explosion of stored predicates. A compromise would be a partial transitive computation step, e.g. up to level 1. If we were computing the transitive closure, the domain contains a kind of pre-stage of difference-bound-matrices.

2.5.3.5. Abstract semantics for semantics (symbolic/symbolic_instruction_transformer.h)

We give a semantics for each LIR instruction instr for a given abstract element \((P, f)\).

\(accessed\_objects : Operand \rightarrow Set of Object\) gives an overapproximation of all objects represented by the operand.

For each instruction, we do the following first. If Left is valid

  1. Compute \(accessed\_objects(\mathtt{instr.Left}) = L\),

  2. remove all predicates in \(P\) involving an object in \(L\) and

  3. if \(f(v) = e\) and \(v\) or \(e\) mentions an element in \(L\), remove \(v\) from the domain of f.

This action alone is already a safe abstract semantics for every non-call instruction.

Copy_Assignment_Instruction

Let \(instr = \mathtt{Left:=Right}\).

  1. Compute an abstract value \(val_r\) for the right-hand side.

  2. Compute \(R = accessed\_objects(\mathtt{instr.Left})\)

  3. If \(\mathtt{Left} = \mathtt{Indirect\_Write\_Operand}(h)\) and \(f(h) = PTR(o_1 + o_2, O)\), then set \(f(o_1[o_2]) = val_r\)

Arithmetic_Instruction
Left_Pointer_Add_Instruction
Right_Pointer_Add_Instruction

2.5.3.6. Example test cases for div-by-zero analysis

mapnik_001.tst
extern int ndet();
#define ARR_SIZE 10
int main()
{
    unsigned cols[ARR_SIZE];
    int max_colors = ndet();
    for (int i = 0; i < ARR_SIZE; ++i)
    {
        cols[i] = ndet();
    }
    cols[0] = cols[0]>0?1:0; // fully transparent color (one or not at all)
    if (max_colors>=64)
    {
        unsigned minCols = (13)/(max_colors-cols[0]); // !!!
    }
}

Required facts at division

  1. \(max\_colors \geq 64\)

  2. \(cols[0] \in [9, 63]\)

Consequence

(OK) The abstract domain tracks now values \(a[0], a[1], a[2]\) of each array. Maybe it makes sense to also track the last two values of an array. We require interval info (see second point). This is now very rudimentary done by choosing the (quite often occurring) interval \([0,1]\) in question and add a special predicate \(P(x) := x \in [0,1]\). For later, we should extend the interval analysis also to (constant) array values.

mapnik_002.tst
extern int ndet();

int main()
{
    int x1 = ndet();
    int x2 = ndet();

    if(x1 == x2)
    {
        return 1;
    }

    return double(100) / (x2 - x1);
}

Required facts at division

  1. \(x_2 - x_1 \neq 0\)

  2. The implicit conversion of \((x_2 - x_1)\) conserves (1).

Consequence

(OK) For an implicit conversion \(s \rightarrow t\) from an integer type to a floating type we check whether \(s \neq 0\). If so, we can deduce that \(t \neq 0.0\) as well. We propagate this info then in the domain. However, no actual floating point arithmetic is considered (yet).

bash_001.tst
...
if (((op == DIV) || (op == MOD)) && (val2 == 0))
  evalerror (_("division by 0")); // longjmp

if (op == MUL)
  val1 *= val2;
else if (op == DIV)
  val1 /= val2;  // val2 != 0
else if (op == MOD)
  val1 %= val2; // val2 != 0
...

Required facts at division

  1. \(\mathtt{op} = \mathtt{DIV} \Rightarrow \mathtt{val2} = 0\)

Consequence

(FAIL) This is a complicated one, since we would need an implication (or disjunction) to track it.

cppunit_001.tst
...
while (*)
{
    long long total_size = size * f();
    if (total_size == 0)
    {
        continue;
    }
    long long q = total_size / size;
...

Required facts at division

  1. \(\mathtt{total\_size} = \mathtt{size} \cdot ret_{\mathtt{f}} \wedge \mathtt{total\_size} \neq 0 \Rightarrow \mathtt{size} \neq 0\)

Consequence

This works, since we now also track multiplication / division.

2.5.4. Array Initialization Domain

2.5.4.1. Approach

The domain is used to detect whether array accesses like arr[i] access initialized memory. This is a challenging problem, since writing accesses to arrays have to be tracked, and arrays often are initialized using loops.

\(V\) refers to a memory object (e.g. Objects::Object or Basic_Analysis_Facts::Object_Ptr). The domain used here is a partially reduced product of a domain mapping arrays to intervals (using the primitive interval domain, but not its semantics) and a domain mapping pairs of array objects and counter variables to an automaton state:

\[D = (V \rightarrow Intv) \times (V \times V \rightarrow \{ \top, \text{START}, \text{WRITTEN}, \text{ACCEPT}, \bot \})\]

The intervals in the first domain component track an (over)approximation of an index range for each array. E.g. full initializations using initializers in C++ can be tracked, and also possibly uninitialized assignments.

int arr[5] = {0};   // arr -> []
arr[4] = undefined; // arr ->[4,4]

The second domain is inspired by the approach of Nikolic and Spoto (see paper reference), which uses a simple automaton to observe execution traces. It detects, for variables i used as index in array accesses, and arrays arr, the following execution pattern:

...
i = 0;
...
(arr[i] = initval; ...; ++i; ...)*
...
!(i < len(arr)) => // arr initialized

If this analysis detects that an array is fully initialized, the information is transferred to the first domain component, setting the uninit range of arr to the empty interval.

In addition, simple forms of (complete) initializations based on memset are detected, most prominently: memset(arr, init_val, sizeof(arr)).

2.5.4.2. Source files and Tests

The implementing source files have the path abstra_ir/src/must_uninit/array_init*.

  • The alphabet and transition specification of the trace automaton and helpers to detect the transitions in code can be found in array_init_automaton.h

  • The init state domain for the alphabet of the trace automaton is a simple lattice, defined in array_init_state_domain.{h,cpp}.

  • array_init_semantics.{h,cpp} contains the semantics of the product domain.

  • Stylecheck tests: stylecheck/test/axivion/fault_detection/uninit_local_array_uninit.c

  • Unit tests in test-abstrair: test_array_init_state.cpp.

2.5.4.3. Configuration options and accessing results

  • Main client of the analysis is the FaultDetection-UninitializedVariable check.

  • You can see the results of the analysis using irviewer and the --abstract_domain array_init options.

Improvements and open points
  • Approach has to be adapted to the new FSPTA framework.

  • Important problem: Detecting whether the right side of assignments is actually an initialized value. Currently, each assignment arr[i] = rhs is assumed as initialized if rhs is a direct access of a user variable (local / global etc.); if rhs is not initialized, there should be a preceding warning.

  • Support more syntactic cases of loop conditions; currently, only counter < array_len is properly handled.

  • Support reversed iteration orders: for (int i = array_len-1; i != 0; --i) { arr[i] = ... }; cases can be found in nethack.

2.5.4.4. Papers

2.5.5. Loop Counter Relations Domain

2.5.5.1. Approach

This domain is used to detect simple relations between loop counters c and other variables r that are incremented within the same loop. More precisely, it represents facts of the form \(r \leq \alpha \cdot c + [0, m]\), for \(\alpha\) a natural number.

Such relations could also be obtained e.g. by using a polyhedral abstract domain. However, an analysis using the polyhedral domain might be costly and seems to be overkill for the simple task of detecting variables that are linearly related to a counter variable. The analysis described in this section essentially is not more costly than an analysis using the interval domain. It also uses the primitive interval widening with widening raster points taken from literals from the routine.

\(V\) refers to a set of loop counter and related variables. Currently, these variables are always (routine-)local variables whose address is not taken, therefore, LIR::Variable is used as underlying data type for them in the implementation. The domain D is a map between pairs (c, r) of loop counter variables and related variables to an abstract “configuration”:

\[\begin{split}\begin{gather*} S = \{ \top, \text{C_INIT}, \text{R_INIT}, \text{INCR}, \bot \} \\ C = S \times Intv \times Intv \times \mathbb{N} \\ D = (V \times V \rightarrow C) \\ \end{gather*}\end{split}\]

The ingredients of C are

  • the state (within a loop) as element from S: Nothing known (\(\top\)), after initialization of c (\(\text{C_INIT}\)), after initialization of r (\(\text{RC_INIT}\)), both initialized and ready to check increments (\(\text{INCR}\)), unreachable state (\(\bot\))

  • the tracked range of increment of the related variable r during a loop iteration (occurring as \([0,m]\) in \(r \leq \alpha \cdot c + [0, m]\)); currently no negative values are considered

  • the tracked range of the counter variable c

  • the coefficient \(\alpha\) in \(r \leq \alpha \cdot c + [0, m]\)

For tracking overflows, C actually also tracks the type info of the related variable (related_type), this is skipped in the section above.

Similar to the array init analysis, simple trace information is tracked, using the first component of C. Additional information is used to keep the invariant r <= alpha*c + [0, m] up to date. E.g., join works as follows:

\[\begin{split}(r \leq \alpha_1 \cdot c + [0, m_1]) \sqcup (r \leq \alpha_2 \cdot c + [0, m_2]) = \begin{cases} (r \leq \max\{\alpha_1, \alpha_2\} \cdot c + [0, \max\{m_1, m_2\}]) ,\\ \top, \text{if overflow might occur} \end{cases}\end{split}\]

An interesting case of instruction is c += 1 for a state \(r \leq \alpha \cdot c + [0, m]\); here \(\alpha\) is set to \(\max\{\alpha, m\}\), and the increment range [0, m] is replaced by [0,0] (also checks for overflows take place).

int test1()
{
    int x = 0;
    int y = 0;
    while( x < 10)
    {
        if(ndet()) { y++; }
        if(ndet()) { y++; }
        if(ndet()) { y++; }
        x++;
    }
    x;
    y; // y <= 3*y  resp. y in [0, 30] inferred
/* irviewer output here:
[ LIR:107 ~ counter LIR:106 |-> INCR_MODE{ RELATED <= COUNTER * 3 + [ 0, 0 ] (uint64), range of counter: [ 10, 10 ] (uint64) } ]
*/
    return y;
}

The analysis is also integrated within the new FSPTA framework.

2.5.5.2. Source files and Tests

  • The implementing source files have the path abstra_ir/src/loops/loop_counter_relations.*.

  • The implementation of domains as well as semantics are located within one compilation unit for this analysis, since they heavily depend on each other.

  • Stylecheck tests: errorchecks/test/array_bounds/array_bounds_loop_counter_relations.c

  • Unit tests in test-abstrair: test_array_init_state.cpp.

2.5.5.3. Configuration options and accessing results

  • All analyses relying on numerical results might profit from the additional precision

  • You can see the results of the analysis using irviewer and the --abstract_domain loop_counter_relations options.

Improvements and open points
  • Also allow global variables as considered variables

  • More flexible handling of code patterns, e.g. support reset of related value / counter? within loops (for (...) { if (*) { related = 0; })

  • Reversed iteration (for (int i = C; i>=0; --i) {...})

2.6. See also

2.6.1. Papers

2.6.3. A collection of static analysis tools using different techniques

  • Interproc: Analysis of interprocedural programs using numerical domains (intervals, polyhedra…)