Dynamic syntax tree
Dynamic Syntax Tree (DST), was created by M.Barzanti in 2001 as an alternative of Abstract syntax tree and Parse tree representation of the structure of source code written in a programming language. Unlike the Statically typed programming languages, the analysis of a Dynamic programming language has to estimate the Type system (types, variables, functions) from the code fragment semantics, as well as all Configuration files and Binary files. Dynamic Syntax Tree implementation also take into account non-static and non-typed implicit and explicit Declaration (computer programming)s, so that the resulting information covers most of readable dynamic code fragments.
Mapping of language constructs from a dynamic language into a Dynamic Syntax Tree (DST) is a kind of semantic analysis (compilers). Information implying from the syntax is analyzed and the results are inserted back into the same tree, but in the form of complete static information. Preprocessing the source code will create a specialized Syntax Tree for each Class found. That can be applied to traditional programming languages too, like COBOL, where Classes can be represented by Programs, Methods by calls/Performs, Parameter by Using etc.
Application in SAST tools
Dynamic syntax trees are mostly used in program analysis and program transformation systems. Dynamic syntax trees are data structures used in some Static program analysis tools, due to their property of fast representing an optimized structure of program code and its related Binary file.
Motivation
- Compared to the source code, a DST does not include certain elements, such as inessential punctuation and delimiters (braces, semicolons, parentheses, etc.).
- The whole source file, in a parsed form, must be processed and every expression or construct has to be analyzed. As a result, it extends some existing node information or add new declaration nodes (implicit declarations). In static languages it is typically sufficient to scan declaration constructs only (class declaration, variable declaration etc.); single expressions are not needed for collecting available symbols and their description.
- In addition to statically defined symbols other tree nodes are added using the semantic analysis of the DST. Also some existing information can be modified. Particular expressions are scanned to find, e.g. implicitly declared variables and dynamically added object members. Also possible variable values, estimated types and behavior influenced from configuration are discovered. The so far built DST is used for that. Process of mapping the Assignment Expression, containing Variable Usage (including the one declared in web pages or JavaScript) into the static declaration of used variable and Configuration files influencing the Behavior, is depicted on the Figure at right side.
- The mapping collects known information from the expressions and stores it into the static representation of the source code (DST). The process of single Dynamic Syntax Tree mapping is performed every time the corresponding source code changes, hence it is suitable to implement some optimizations. Also, there could be trees containing symbols from large static language libraries, e.g. from .NET assemblies, which manage typically thousands of symbols. Initializing all of them may be too slow.
Optimization
Differently from abstract syntax tree and CST parse tree, in case of huge Classes having more than 65535 objects, the DST Object Dictionary structure (68,083 nodes), will be paged into some small XML files, about 575KB sized each. The same will be done with the Dynamic Syntax Tree itself: only 4096 Classes at time will be processed, max 135KB each. There will be no case of RAM Random access memory usage over 700MB, that means it can be able to perform a Static Analysis using a low-end Windows XP notebook with 1GB of RAM and a single-core processor, and up to 5 simultaneously static analysis of different applications at time can be achieved using only a 4GB RAM, dual-core processor machine. So, performances will be scalable depending on hardware architecture, with the guarantee of performing at least a complete static analysis starting from a notebook and going up to multi-core machine. An important optimization respect than AST and CST will be the number of tree nodes. For example, GCC 2.52 from SPEC 95 Standard Performance Evaluation Corporation benchmark suite comprises 604,000 AST nodes vs only 127,067 nodes paged in 2 DST, and Word 1.1A (1990) comprises 607,700 AST nodes vs only 204,249 nodes, paged in 3 DST. GCC 2.52 is about 140,000 lines of code and Word 1.1A is about 322,000. Static Analysis processing times using DST were half an hour for GCC and 1 hour and half for Word 1.1A. Word 1.1A source was downloaded from Computer History Museum site. Some of additional information initialization can be postponed until it is requested, because not all of the source elements must be analyzed and mapped immediately.
See also
- Abstract semantic graph (ASG)
- Composite pattern
- Control-flow graph
- Document Object Model (DOM)
- Semantic resolution tree (SRT)
- Symbol table
- Term graph
References
- Barzanti, Marco. Dynamic Syntax Tree: Optimized Binary Sandboxing. https://www.academia.edu/27410383/Dynamic_Syntax_Tree_Optimized_Binary_Sandboxing.
- Barzanti, Marco. Dynamic Syntax Tree: Implementation Results Q1-2016. https://www.academia.edu/26717762/Dynamic_Syntax_Tree_Implementation_Results_Q1-2016.