2. Abstract Interpreter¶
The C++ library libAbstraIR allows implementing analyses based on the Abstract
interpretation framework.
2.1. Basics¶
For a user-defined abstract domain T, libAbstraIR allows specifying:
Domain operations like join or comparisons:
Domain_Operations<T>Semantic operations like instruction effects, effects of conditions or calls:
Semantic_Operations<T>.
Additionally, the following domain functors are currently available for combining domains \(D_1, D_2\):
Partially reduced domain product \(D_1 \times D_2\) in
base/reduced_product_domain.handbase/reduced_product_semantics.handdomain map \(V \rightarrow D_1\), to create a composed domain that consists of a map from a key set \(V\) (most of the time memory locations) to a base domain \(D_1\).
For example, a domain of maps from variables to intervals is implemented in the
directory intervals, with operations of the base domain in interval_domain.h
and semantic operations defined in interval_semantics.h.
2.2. Interpreters¶
Analyses can be carried out interprocedurally and intraprocedurally:
base/modular_interpreter.hcontains a classic abstract interpreter that can workeither intraprocedurally or interprocedurally. It stores values at each control flow block, or, more precisely, at each control flow location.
base/local_ssi_interpreter.hcontains an interpreter that is specifically designed for use with analyses in MSA where values are stored for each SSI node.
2.3. Recipe for creating an analysis¶
The following is a rough check list for adding a new analysis. More steps might be required.
Specify the type of your domain elements
D.Create subclasses or template instantiations of
Domain_Operations<D>andSemantic_Operations<D>in
exchange/analysis_provider.h, add an enumerator toDomain_Kind, andin
exchange/analysis_provider.cpp, add a string name todomain_shortcuts, andadd instances of your domain operations, semantic operations, and an instance of the interpreter you like to use, and extend the switch-case statement handling the different analyses.
Use
irviewer(with--iranalysis --abstract_domain my_domain_shortcut) to check the results.Create the actual analysis (see
analysesfolder) that uses results of your domain.
2.4. Architecture of libAbstraIR¶
2.4.1. Domain framework (domain_operations.h)¶
Elements of a particular abstract domain are instances of a freely
choosable class D. To obtain the lattice-theoretic operations and
comparisons between elements of D, a partial specialization
Bauhaus::Abstract_Interpreter::Domains::Domain_Operations<D> has to
be created. Here operations like
the constants
bot()andtop()for obtaining the bottom or top value ofD,methods
to_bot()andto_top()to mutate an element ofDinto the bottom or top value ofD,is_leq()for comparing two elements ofD,is_eq()checks two elements for equality,join_with()andwiden_with()for performing (destructive) join and widening, anda
print()operation forD-values
have to be implemented. The abstract interpreter creates an instance of
Domain_Operations<D> before the analysis, and then calls the method
Domain_Operations<D>::init().
2.4.2. Abstract semantics (semantic_operations.h)¶
Given an abstract domain D, the class
Bauhaus::Abstract_Interpreter::Domains::Semantic_Operations<D> (or
an instance of it) can be used to define an abstract semantics for
D:
Semantics for each instruction:
process_instruction(),restricting the range of values when a case of a switch statement is taken:
filter_range(),restricting the range of values after an if block, where one of the two outgoing branches has been taken:
filter_conditional(),for preparing an abstract value as input to a routine call:
process_precall(),for doing “clean-ups” at transferring information back to a caller:
filter_at_routine_exit(),for processing an abstract value at the end block of a routine back to a caller, combining the value with the stored value
pre_entrybefore the call:process_postcall().
2.5. Existing analyses¶
2.5.1. Interval Domain¶
Code for interval analysis can be found in folder intervals. A domain for interval analysis is
a map from a set of (designators of) memory
locations to single intervals.
Since the representation of memory locations and also the used interpreters differ for
FIPTA and MSA, there are different interval analyses. However, basic
domain and semantic operations for all three kinds are shared and contained within
interval_domain.h, interval_semantics.h, interval_instructions.h, and
interval_conditions.h:
MSA-sparse: Variables are SSI nodes (aka
const FSPTA::SSI_Node*), interpreter islocal_ssi_interpreter.h, currently intraprocedural:fspta_sparse_interval*: stores one value per SSI node, therefore the term “sparse” is used. It uses sigma nodes to evaluate branch conditions, and propagates along data flow relations, including alias edges. It can be slow if there are many aliases present, and is not very precise with respect to branch conditions.MSA-dense: Variables are objects (i.e.,
Basic_Analysis_Facts::Object_Ptr), interpreter ismodular_interpreter.h, currently intraprocedural:fspta_dense_interval*: tracks multiple values of variables for each basic block, therefore the term “dense” is used. There is a risk of tracking too many (useless) range information, which we try to tackle by restricting the maximal number of tracked elements per block by the optionperformance.general.abstract_value_analysis_options.interval_precision_thresholdof the ruleStaticSemanticAnalysis. It is often more precise than the sparse approach. Currently, both analyses are activated since they complement each other.FIPTA-dense: Variables are accesses (aka
Access<Objects::Object), interpreter ismodular_interpreter.h, intraprocedural:fipta_interval*
Todo
Meeting notes might contain some usable fragments.
MSA-dense should track less values that are easily derivable (e.g. constant values, see also FIPTA-dense version) and/or are not required (e.g. track only-once-written values using a separate analysis pass).
Alternative MSA-sparse vs. MSA-dense: both active or choose via config option?
2.5.2. Buffer Domain¶
2.5.2.1. Approach¶
Abstraction of buffer contents¶
For each memory location that represents a string buffer \(b\),
its complete length \(len(b)\), and
the location of the first null byte \(zero(b)\) set in the buffer
are tracked. These numbers are meant as number of elements (chars etc.).
For example, after char a[20] = "abc"; the following holds:
\(len(a)=20 \land zero(a)=3\). The simple idea is to abstract these values using
intervals.
Abstraction of pointers¶
Consider the following example.
char* p = &a;
p = p+1;
*p = '\0';
We have to track both the object a pointer points to and the offsets. For example,
p points to a+1 after executing the example, and \(zero(a)=1\).
2.5.2.2. Papers¶
Xavier Allamigeon, Wenceslas Godard, and Charles Hymans. Static Analysis of String Manipulations in Critical Embedded C Programs. (Download)
David Wagner, Jeffrey S. Foster, Eric A. Brewer, Alexander Aiken. A First Step Towards Automated Detection of Buffer Overrun Vulnerabilities. (Download)
Xavier Rival, Kwangkeun Yi. Introduction to Static Analysis. Chapter 8. (Textbook about abstract interpretation)
2.5.3. Symbolic Expression Domain¶
2.5.3.1. Summary¶
This domain is used for a cheap analysis of division-by-zero and out-of-bounds checks. In its initial form, it is basically intraprocedural. It tracks
a selection of predicates that unconditionally hold at a program point and
values of objects in form of certain arithmetic or boolean expressions.
The code for this domain is located in abstrair/src/symbolic.
2.5.3.2. Expressions (symbolic/expression.h)¶
The domain tracks symbolic expressions (Abstract_Domains::Expression). Object
is an abstraction of memory locations, ObjectSet is a set of Object elements.
Both types are given as template arguments in the implementation. Delta is a
non-negative offset.
Var ::= Object | *PtrExpr
Expr ::= BoolExpr | ArithExpr | PtrExpr
BoolExpr ::= (Var == 0) | (Var != 0) | (Var < Var) | (Var < = Var) | (Var == Var) | (Var != Var)
ArithExpr ::= (Var - Var) | (Var + Var) | (Var * Var) | (Var / Var) | (Var % Var) | (- Var)
PtrExpr ::= PTR(Object + Object) | PTR(Object + Delta)
2.5.3.3. Abstract elements (symbolic/symbolic_domain.h)¶
A domain element (Abstract_Domains::Expression_State) is a tuple
\((P, f, up)\) where
\(P\) is a set of boolean expressions (\(BoolExpr\)),
\(f : Var \nrightarrow Expr\) is a partial map assigning expressions to variables,
\(up : PtrExpr \nrightarrow ObjectSet\) is a partial map, assigning sets of objects to pointer expressions.
If \(x\), \(y\) and \(z\) are integer-valued program variables, \(a\) and \(b\) are array variables, and \(p\) and \(q\) are pointer variables, the following is an example of a domain element.
The concretization of \((P, f, up)\) is the set of all concrete program states \(s\) such that
\(s(p)\) holds for each \(p \in P\),
if \(f(v) = e\), then \(s(v) = s(e)\), and
if \(up(PTR(p+x)) = O\), then \(s((p+x)) \in O\).
We add special elements \(\top\) and \(\bot\). The concretization of \(\top\) are all program states, the concretization of \(\bot\) is the empty set.
For elements \(x_1 = (P_1, f_1, up_1)\) and \(x_2 = (P_2, f_2, up_2)\), \(x_1 <= x_2\) if
\(P_1 \subset P_2\),
\(\forall v \in Var:\) if \(f_{1}(v) = e\), then \(f_{2}(v) = e\) or \(v\) is not defined for \(f_2\), and
\(\forall p \in PtrExpr:\) if \(up_1(p) = o_1\), then \(up_1(p) \subseteq up_2(p)\) or \(p\) is not defined for \(up_2\).
This could be made more precise by introducing an order on the expressions, e.g. \((x < y) \sqsubseteq (x \leq y)\).
The join can be defined as \(x_1 \sqcup x_2 = (P, f, up)\) with
\(P = P_1 \cap P_2\) (or something more precise),
\(f(v) = e\) if \(f_1(v) = e\) and \(f_2(v) = e\), and
\(up(p) = up_1(p) \cup up_2(p)\) if \(up_1(p)\) and \(up_2(p)\) are defined.
2.5.3.4. Transitive closure¶
Right now, equalities and inequalities are not directly inferred from existing predicates or stored values. E.g., if the predicates \(x=y\) and \(y=z\) hold, the predicate \(x=z\) is not automatically computed. However, this would help to reduce precision loss introduced at join operations, since predicates are removed from the join result if they do not occur in both operands of the join. But the transitive closure is expensive to compute and might lead to an explosion of stored predicates. A compromise would be a partial transitive computation step, e.g. up to level 1. If we were computing the transitive closure, the domain contains a kind of pre-stage of difference-bound-matrices.
2.5.3.5. Abstract semantics for semantics (symbolic/symbolic_instruction_transformer.h)¶
We give a semantics for each LIR instruction instr for a given
abstract element \((P, f)\).
\(accessed\_objects : Operand \rightarrow Set of Object\) gives an overapproximation of all objects represented by the operand.
For each instruction, we do the following first. If Left is valid
Compute \(accessed\_objects(\mathtt{instr.Left}) = L\),
remove all predicates in \(P\) involving an object in \(L\) and
if \(f(v) = e\) and \(v\) or \(e\) mentions an element in \(L\), remove \(v\) from the domain of f.
This action alone is already a safe abstract semantics for every non-call instruction.
Copy_Assignment_Instruction¶
Let \(instr = \mathtt{Left:=Right}\).
Compute an abstract value \(val_r\) for the right-hand side.
Compute \(R = accessed\_objects(\mathtt{instr.Left})\)
If \(\mathtt{Left} = \mathtt{Indirect\_Write\_Operand}(h)\) and \(f(h) = PTR(o_1 + o_2, O)\), then set \(f(o_1[o_2]) = val_r\) …
Arithmetic_Instruction¶
Left_Pointer_Add_Instruction¶
Right_Pointer_Add_Instruction¶
2.5.3.6. Example test cases for div-by-zero analysis¶
mapnik_001.tst¶
extern int ndet();
#define ARR_SIZE 10
int main()
{
unsigned cols[ARR_SIZE];
int max_colors = ndet();
for (int i = 0; i < ARR_SIZE; ++i)
{
cols[i] = ndet();
}
cols[0] = cols[0]>0?1:0; // fully transparent color (one or not at all)
if (max_colors>=64)
{
unsigned minCols = (13)/(max_colors-cols[0]); // !!!
}
}
Required facts at division
\(max\_colors \geq 64\)
\(cols[0] \in [9, 63]\)
Consequence
(OK) The abstract domain tracks now values \(a[0], a[1], a[2]\) of each array. Maybe it makes sense to also track the last two values of an array. We require interval info (see second point). This is now very rudimentary done by choosing the (quite often occurring) interval \([0,1]\) in question and add a special predicate \(P(x) := x \in [0,1]\). For later, we should extend the interval analysis also to (constant) array values.
mapnik_002.tst¶
extern int ndet();
int main()
{
int x1 = ndet();
int x2 = ndet();
if(x1 == x2)
{
return 1;
}
return double(100) / (x2 - x1);
}
Required facts at division
\(x_2 - x_1 \neq 0\)
The implicit conversion of \((x_2 - x_1)\) conserves (1).
Consequence
(OK) For an implicit conversion \(s \rightarrow t\) from an integer type to a floating type we check whether \(s \neq 0\). If so, we can deduce that \(t \neq 0.0\) as well. We propagate this info then in the domain. However, no actual floating point arithmetic is considered (yet).
bash_001.tst¶
...
if (((op == DIV) || (op == MOD)) && (val2 == 0))
evalerror (_("division by 0")); // longjmp
if (op == MUL)
val1 *= val2;
else if (op == DIV)
val1 /= val2; // val2 != 0
else if (op == MOD)
val1 %= val2; // val2 != 0
...
Required facts at division
\(\mathtt{op} = \mathtt{DIV} \Rightarrow \mathtt{val2} = 0\)
Consequence
(FAIL) This is a complicated one, since we would need an implication (or disjunction) to track it.
cppunit_001.tst¶
...
while (*)
{
long long total_size = size * f();
if (total_size == 0)
{
continue;
}
long long q = total_size / size;
...
Required facts at division
\(\mathtt{total\_size} = \mathtt{size} \cdot ret_{\mathtt{f}} \wedge \mathtt{total\_size} \neq 0 \Rightarrow \mathtt{size} \neq 0\)
Consequence
This works, since we now also track multiplication / division.
2.5.4. Array Initialization Domain¶
2.5.4.1. Approach¶
The domain is used to detect whether array accesses like arr[i] access initialized memory.
This is a challenging problem, since writing accesses to arrays have to be tracked, and arrays
often are initialized using loops.
\(V\) refers to a memory object (e.g. Objects::Object or Basic_Analysis_Facts::Object_Ptr).
The domain used here is a partially reduced product of a domain mapping arrays to intervals (using the primitive interval domain, but not its semantics)
and a domain mapping pairs of array objects and counter variables to an automaton state:
The intervals in the first domain component track an (over)approximation of an index range for each array. E.g. full initializations using initializers in C++ can be tracked, and also possibly uninitialized assignments.
int arr[5] = {0}; // arr -> []
arr[4] = undefined; // arr ->[4,4]
The second domain is inspired
by the approach of Nikolic and Spoto (see paper reference), which uses a simple automaton to observe execution traces.
It detects, for
variables i used as index in array accesses, and arrays arr, the following
execution pattern:
...
i = 0;
...
(arr[i] = initval; ...; ++i; ...)*
...
!(i < len(arr)) => // arr initialized
If this analysis detects that an array is fully initialized, the information is transferred to the first domain component, setting
the uninit range of arr to the empty interval.
In addition, simple forms of (complete) initializations based on memset are detected, most prominently: memset(arr, init_val, sizeof(arr)).
2.5.4.2. Source files and Tests¶
The implementing source files have the path abstra_ir/src/must_uninit/array_init*.
The alphabet and transition specification of the trace automaton and helpers to detect the transitions in code can be found in
array_init_automaton.hThe init state domain for the alphabet of the trace automaton is a simple lattice, defined in
array_init_state_domain.{h,cpp}.array_init_semantics.{h,cpp}contains the semantics of the product domain.Stylecheck tests:
stylecheck/test/axivion/fault_detection/uninit_local_array_uninit.cUnit tests in
test-abstrair:test_array_init_state.cpp.
2.5.4.3. Configuration options and accessing results¶
Main client of the analysis is the
FaultDetection-UninitializedVariablecheck.You can see the results of the analysis using irviewer and the
--abstract_domain array_initoptions.
Improvements and open points¶
Approach has to be adapted to the new FSPTA framework.
Important problem: Detecting whether the right side of assignments is actually an initialized value. Currently, each assignment
arr[i] = rhsis assumed as initialized ifrhsis a direct access of a user variable (local / global etc.); ifrhsis not initialized, there should be a preceding warning.Support more syntactic cases of loop conditions; currently, only
counter < array_lenis properly handled.Support reversed iteration orders:
for (int i = array_len-1; i != 0; --i) { arr[i] = ... }; cases can be found in nethack.
2.5.4.4. Papers¶
2.5.5. Loop Counter Relations Domain¶
2.5.5.1. Approach¶
This domain is used to detect simple relations between loop counters c
and other variables r that are incremented within the same loop.
More precisely, it represents facts of the form
\(r \leq \alpha \cdot c + [0, m]\), for \(\alpha\) a natural number.
Such relations could also be obtained e.g. by using a polyhedral abstract domain. However, an analysis using the polyhedral domain might be costly and seems to be overkill for the simple task of detecting variables that are linearly related to a counter variable. The analysis described in this section essentially is not more costly than an analysis using the interval domain. It also uses the primitive interval widening with widening raster points taken from literals from the routine.
\(V\) refers to a set of loop counter and related variables. Currently,
these variables are always (routine-)local variables whose address is not taken,
therefore, LIR::Variable is used as underlying data type for them in the
implementation. The domain D is a map between pairs (c, r) of loop
counter variables and related variables to an abstract “configuration”:
The ingredients of C are
the state (within a loop) as element from
S: Nothing known (\(\top\)), after initialization ofc(\(\text{C_INIT}\)), after initialization ofr(\(\text{RC_INIT}\)), both initialized and ready to check increments (\(\text{INCR}\)), unreachable state (\(\bot\))the tracked range of increment of the related variable
rduring a loop iteration (occurring as \([0,m]\) in \(r \leq \alpha \cdot c + [0, m]\)); currently no negative values are consideredthe tracked range of the counter variable
cthe coefficient \(\alpha\) in \(r \leq \alpha \cdot c + [0, m]\)
For tracking overflows, C actually also tracks the type info of the related
variable (related_type), this is skipped in the section above.
Similar to the array init analysis, simple trace information is tracked,
using the first component of C. Additional information is used to keep the
invariant r <= alpha*c + [0, m] up to date.
E.g., join works as follows:
An interesting case of instruction is c += 1 for a state
\(r \leq \alpha \cdot c + [0, m]\); here \(\alpha\) is set to
\(\max\{\alpha, m\}\), and the increment range [0, m] is replaced by
[0,0] (also checks for overflows take place).
int test1()
{
int x = 0;
int y = 0;
while( x < 10)
{
if(ndet()) { y++; }
if(ndet()) { y++; }
if(ndet()) { y++; }
x++;
}
x;
y; // y <= 3*y resp. y in [0, 30] inferred
/* irviewer output here:
[ LIR:107 ~ counter LIR:106 |-> INCR_MODE{ RELATED <= COUNTER * 3 + [ 0, 0 ] (uint64), range of counter: [ 10, 10 ] (uint64) } ]
*/
return y;
}
The analysis is also integrated within the new FSPTA framework.
2.5.5.2. Source files and Tests¶
The implementing source files have the path
abstra_ir/src/loops/loop_counter_relations.*.The implementation of domains as well as semantics are located within one compilation unit for this analysis, since they heavily depend on each other.
Stylecheck tests:
errorchecks/test/array_bounds/array_bounds_loop_counter_relations.cUnit tests in
test-abstrair:test_array_init_state.cpp.
2.5.5.3. Configuration options and accessing results¶
All analyses relying on numerical results might profit from the additional precision
You can see the results of the analysis using irviewer and the
--abstract_domain loop_counter_relationsoptions.
Improvements and open points¶
Also allow global variables as considered variables
More flexible handling of code patterns, e.g. support reset of related value / counter? within loops (
for (...) { if (*) { related = 0; })Reversed iteration (
for (int i = C; i>=0; --i) {...})
2.6. See also¶
2.6.1. Papers¶
ACM 2014. Gange, Navaz, Schachte, Søndergaard, Stuckey. Interval Analysis and Machine Arithmetic: Why Signedness Ignorance Is Bliss. (Download)
Book 2017. Antoine Miné. Tutorial on Static Inference of Numeric Invariants by Abstract Interpretation. Chapter 1: Introduction. (“Domain Tutorial”) (Download)
Formal Methods Syst. Des. 2009. Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, Xavier Rival. Why does Astrée scale up? (Download)
ATxWInG 2012. Antoine Miné. Abstract Domains for Bit-Level Machine Integer and Floating-point Operations (Download)
2.6.2. Links¶
Crab: A C++ Library for Building Program Static Analyses
2.6.3. A collection of static analysis tools using different techniques¶
Interproc: Analysis of interprocedural programs using numerical domains (intervals, polyhedra…)