Compilers!

Overview
As part of my internship at ARM, I worked on the LLVM Compiler toolchain.
A key part of automating the verification of Arm's architecture specification is a form of mechanical translation of the specification into an executable binary. The AIG department at Arm has various internal tools that work on the Architecture Specification Language (ASL) to achieve these goals.
The LLVM compiler toolchain was chosen as a candidate for potentially replacing the existing tools: it is open source, and excellent documentation is available online for beginners.
Motivation
Arm has various internal tools that read and transform ASL. These tools are neither efficient nor scalable in the face of the increasing size of the Arm ARM (Architecture Reference Manual).
This motivates the need for a well-documented tool that can serve as a common interface for all the existing tools and support their current use cases. The LLVM toolchain consists of modular and reusable compiler technologies, and could potentially meet these requirements.
Goal
The end goal of the project was to assess the feasibility of an LLVM-based compiler for supporting the existing use cases of the current tools.
Task 1 - Lexer + Parser + AST Design
a) Designing a Lexer using the Flex tool
Background
Flex is a lexer-generator tool that takes care of tokenizing the input source program. Its input is a specification file of regular expressions (regex), which are equivalent in power to a Type-3 (regular) grammar.
Procedure
This process involved writing regular expressions specifying the patterns for the tokens of the input ASL source: identifiers; integer, hexadecimal and floating-point constants; keywords; and operators.
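The rules section of a Flex specification pairs each regex with a C action that returns a token code. A minimal sketch of what this looks like (the token names and the generated header's name are assumptions, not the actual internal ones):

```lex
%{
#include "parser.tab.h"   /* token codes, generated by Bison */
%}
%option noyywrap

DIGIT  [0-9]
ID     [a-zA-Z_][a-zA-Z0-9_]*

%%
"if"                   { return IF; }
0[xX][0-9a-fA-F]+      { return HEX_LIT; }
{DIGIT}+"."{DIGIT}+    { return REAL_LIT; }
{DIGIT}+               { return INT_LIT; }
{ID}                   { return IDENT; }
"+"|"-"|"*"|"/"        { return yytext[0]; }
[ \t\r\n]+             { /* skip whitespace */ }
%%
```

Flex always prefers the longest match, so `0x2A` becomes one HEX_LIT token rather than the integer `0` followed by an identifier.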
b) Designing a Parser using the Bison (YACC) tool
Background
Bison is a bottom-up parser-generator tool: the structure of the language is expressed as a context-free grammar written in Backus-Naur Form (BNF), and the generated Parser checks the syntax of the input program.
Bison generates a bottom-up LR(1) Parser that uses shift-reduce parsing, as opposed to the more intuitive top-down approach. The Parser keeps a stack of the symbols parsed so far and uses a lookahead of one token to decide between two actions: if the lookahead token can extend a grammar rule that the stack is still in the middle of matching, the token is shifted onto the stack; if the symbols on top of the stack form the complete right-hand side of a rule, they are reduced to that rule's left-hand side.
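As a small illustration (using a classic expression grammar rather than ASL, whose grammar cannot be shown here), consider these Bison rules:

```yacc
expr   : expr '+' term
       | term
       ;
term   : term '*' factor
       | factor
       ;
factor : NUMBER
       | '(' expr ')'
       ;
```

While parsing `1 + 2 * 3`, the Parser at some point holds `expr + term` on its stack with `*` as the lookahead token. Reducing `expr '+' term` to `expr` now would be premature: since `*` can extend `term '*' factor`, the Parser shifts instead. This is exactly the decision the one-token lookahead resolves.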
Procedure
The first stage is modifying the Lexer to return tokens when the yylex() function is called. This is known as the "pull" mode of the Parser, since it pulls tokens from the Lexer as it needs them.

The next step is deciding what semantic value each recognized token must carry. For example, when an integer literal is tokenized, the Lexer sees it only as a C string; it does not assign a type or a value to it. It is up to the programmer to decide what type is attached to the recognized token on the Lexer end. These tokens and their types need to be declared in a header file that is included by both the Parser and the Lexer specification files.

The final stage involves writing the grammar productions in Backus-Naur Form. At this stage, it is important to decide what type of object each production yields when parsed. This is handled by a C union declaration in the Parser file, and it is the same problem as deciding the semantic value of tokens, explained above. The types in the union declaration are determined by the AST constructs; the remaining procedure is explained below.
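A minimal sketch of these declarations in the Bison file (the token names and the ExprAST type are hypothetical, chosen to mirror the AST classes described below):

```yacc
%{
#include "AST.h"   /* hypothetical header declaring the AST classes */
%}

%union {
    long long  intVal;   /* semantic value of an integer literal      */
    char      *strVal;   /* spelling of an identifier                 */
    ExprAST   *expr;     /* pointer to an AST node built by an action */
}

%token <intVal> INT_LIT
%token <strVal> IDENT
%type  <expr>   expr
```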
c) Designing an Abstract Syntax Tree (AST)
Background
An Abstract Syntax Tree is a syntactic representation of the input source and depends on the source language. It is called a syntax tree because, for an unambiguous grammar, it uniquely determines the structure of the input. It is called "abstract" because, in contrast to a parse tree, it does not represent every detail of the syntax; for example, it may omit parentheses or semicolons, if the input language has any. Constructing the AST by attaching action rules, which build the AST nodes, to the productions of the grammar is known as Syntax-Directed Translation (SDT).
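For instance, attaching actions to the expression grammar from the previous section turns parsing into AST construction (the node class names here are hypothetical; `$$` and `$n` are Bison's way of referring to a rule's result and its parsed components):

```yacc
expr   : expr '+' term    { $$ = new BinaryExprAST('+', $1, $3); }
       | term             { $$ = $1; }
       ;
term   : term '*' factor  { $$ = new BinaryExprAST('*', $1, $3); }
       | factor           { $$ = $1; }
       ;
factor : NUMBER           { $$ = new NumberExprAST($1); }
       | '(' expr ')'     { $$ = $2; /* parentheses parsed, not stored */ }
       ;
```

The last rule shows the "abstract" part in action: the parentheses are checked by the Parser but leave no trace in the tree.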
Procedure
This is a slightly more involved process than the earlier two. It involves declaring C++ classes for most productions in the BNF grammar; once a production is parsed, its action hands the Parser a pointer to the object that was built. The type of this pointer is determined by the union in the Bison file, which in turn is determined by the C++ classes declared in the AST header file.
An example of an AST node class is FunctionProtoTypeAST, which represents a function prototype with external linkage. Its member data consists of the function name as a string, the parameter names as a vector of strings, the parameter types as a vector of llvm::Type pointers and the return types as a vector of llvm::Type pointers. It exposes public getters and setters for its member data and also declares the virtual codegen() function that is used to generate the IR.
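A minimal C++ sketch of such a node, assuming the member layout described above (illustrative, not the exact internal code):

```cpp
#include <string>
#include <utility>
#include <vector>
#include "llvm/IR/Function.h"
#include "llvm/IR/Type.h"

class FunctionProtoTypeAST {
  std::string Name;                      // function name
  std::vector<std::string> ParamNames;   // formal parameter names
  std::vector<llvm::Type *> ParamTypes;  // llvm::Type of each parameter
  std::vector<llvm::Type *> ReturnTypes; // ASL functions may return several values

public:
  FunctionProtoTypeAST(std::string Name, std::vector<std::string> ParamNames,
                       std::vector<llvm::Type *> ParamTypes,
                       std::vector<llvm::Type *> ReturnTypes)
      : Name(std::move(Name)), ParamNames(std::move(ParamNames)),
        ParamTypes(std::move(ParamTypes)), ReturnTypes(std::move(ReturnTypes)) {}
  virtual ~FunctionProtoTypeAST() = default;

  const std::string &getName() const { return Name; }
  const std::vector<llvm::Type *> &getParamTypes() const { return ParamTypes; }

  // Every node knows how to lower itself to LLVM IR; see Task 2.
  virtual llvm::Function *codegen();
};
```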
Task 2 - Generating Valid IR
The second task was straightforward, as it was already accounted for when the AST was designed: the design of the class structures is influenced by the LLVM IR that needs to be generated in this step. This involved understanding the usage of the LLVM API, which is not explained here to avoid specific details.
An example of a consideration that you do need to make during IR code generation is the evaluation strategy for function calls. An example of a consideration that you do not need to make when using the LLVM API is the calling convention used for function and procedure calls: this is handled by the LLVM API, depending on the target platform.
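As a sketch of what the codegen() of the prototype node from Task 1 might look like, simplified to assume a single return type and a global module (neither necessarily matches the real implementation):

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

static llvm::LLVMContext TheContext;
static llvm::Module TheModule("asl", TheContext);

llvm::Function *FunctionProtoTypeAST::codegen() {
  // Build the signature from the stored parameter and return types.
  llvm::FunctionType *FT = llvm::FunctionType::get(
      ReturnTypes.front(), ParamTypes, /*isVarArg=*/false);

  // External linkage makes the symbol visible outside the module. Note
  // that no calling convention is chosen here: LLVM applies the target
  // platform's default.
  llvm::Function *F = llvm::Function::Create(
      FT, llvm::Function::ExternalLinkage, Name, &TheModule);

  // Name the IR arguments after the source parameters for readable IR.
  unsigned Idx = 0;
  for (auto &Arg : F->args())
    Arg.setName(ParamNames[Idx++]);
  return F;
}
```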
Task 3 - Transformations
This task was the most crucial in deciding the feasibility of using the LLVM toolchain. It involved implementing a basic version of two transformations required to compile Arm's Architecture Specification Language (ASL). ASL contains very high-level features, and implementing one of the transforms with the LLVM API meant that IR had to be generated first and the types then mutated.
Due to the confidential nature of this task, details/code cannot be provided.
Task 4 - Extending a Compiler Optimization
This was an attempt to improve the performance of the existing compiler at Arm. It involved a flow-insensitive, global range analysis of the integer variables and functions (with integer return types) in the ASL language, followed by replacing their types based on certain criteria.
What is Range Analysis, anyway?
Range Analysis refers to finding lower and upper bounds for the values of integer variables. This information enables several useful optimizations, such as branch prediction, array bound-check elimination and changing data types.
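A generic illustration of the core idea (not Arm's implementation): each variable carries a conservative [lo, hi] interval, intervals from different program paths are joined, and a final check decides whether a narrower native type is safe. The __int128 bounds are a GCC/Clang-specific simplification standing in for arbitrary precision:

```cpp
#include <cstdint>
#include <limits>

struct Range {
  __int128 Lo, Hi; // conservative bounds on the variable's value

  // Merge information from two program paths: widen to cover both.
  Range join(const Range &Other) const {
    return {Lo < Other.Lo ? Lo : Other.Lo, Hi > Other.Hi ? Hi : Other.Hi};
  }

  // True if every value the analysis allows fits in a signed 64-bit
  // integer, i.e. the variable could be narrowed to int64_t.
  bool fitsInInt64() const {
    return Lo >= std::numeric_limits<int64_t>::min() &&
           Hi <= std::numeric_limits<int64_t>::max();
  }
};
```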
The task involved extending a compiler optimization that changes data types from a custom extended-precision integer type to the native C int64_t type in order to speed up execution. Integer variables and functions (with integer return types) in the ASL language had their types replaced with the native C int64_t type whenever the bitwidth of the variable or return value was less than 64 in the signed domain.
Due to the confidential nature of this task, details/code cannot be provided.
Conclusion
I enjoyed the LLVM project: it was my first experience of using the LLVM API, and it gave me more exposure to compiler theory and optimizations. I also learnt the details of integer range analysis, as well as about the Arm Architecture Specification Language and how it describes the Arm architecture.