XCore Architecture
Designer | XMOS |
---|---|
Bits | 32-bit |
Introduced | 2007 |
Version | XS1, XS2, XS3 |
Design | RISC |
Type | Load-store |
Encoding | Variable |
Branching | Condition register |
Endianness | Little |
Registers | |
General purpose | 12 |
Floating point | 12 (shared, XS3) |
Vector | 3 (256-bit, XS3) |
The XCore Architecture is a 32-bit RISC microprocessor architecture designed by XMOS. The architecture is designed to be used in multi-core processors for embedded systems. Each XCore executes up to eight concurrent threads, each thread having its own register set, and the architecture directly supports inter-thread and inter-core communication and various forms of thread scheduling.
Three versions of the XCore architecture exist: the XS1 architecture,[1] the XS2 architecture,[2] and the XS3 architecture. Processors with the XS1 architecture include the XCore XS1-G4 and XCore XS1-L1. Processors with the XS2 architecture include xCORE-200, and processors with the XS3 architecture include xcore.ai.
The architecture encodes instructions compactly, using 16 bits for frequently used instructions (with up to three operands) and 32 bits for less frequently used instructions (with up to six operands). Almost all instructions execute in a single cycle, and the architecture is event-driven in order to decouple the timing requirements of a program from its execution speed. A program will normally perform its computations and then wait for an event (e.g. a message, time, or external I/O event) before continuing.
Versions and Devices
There are three versions of the xCORE architecture: XS1, XS2, and XS3. XS2 extends the XS1 architecture, and XS3 extends the XS2 architecture.
XS1
The XS1 architecture was the first xCORE architecture, defined in 2007. It is implemented by the XCore XS1-G4, XCore XS1-L1, XCore XS1-SU, and XCore XS1-AnA. The first is a four-core processing node; the other three are single- and dual-core processing nodes.
XS2
The XS2 architecture was defined in 2015. It is implemented by the xCORE-VOICE processors and xCORE-200 series processors. The latter are marketed as the XL2 series (general purpose), XU2 series (USB), XE2 series (RGMII), and versions with embedded flash.
XS2 extends the XS1 architecture with a limited form of Dual Issue execution.[2] The processor core comprises two lanes. The Resource lane can execute IO operations and general arithmetic. The Memory lane can execute memory operations, branches, and general arithmetic. Short resource or arithmetic instructions can be executed in the resource lane; short memory, branch, or arithmetic operations can be executed in the memory lane. Long instructions span both lanes. In dual issue mode all pairs of instructions are aligned on a 32-bit boundary.
A few instructions have been added to aid in high bandwidth processing, such as dual-word load/store, dual-word zip and unzip (bit and byte strings), dual word arithmetic saturation and shift.
XS3
The XS3 architecture was introduced in 2020. It is implemented by a new series of xcore.ai processors aimed at embedded and IoT devices utilizing AI acceleration in SoC-like designs.
XS3 extends the XS2 architecture with enhanced DSP performance, new 32-bit floating-point capability, and 256-bit-wide vector instructions. Processors based on this architecture also support a two-lane MIPI interface for cameras or other sensor input, as well as a 16-bit-wide LPDDR interface for external memory. Core clock speed has been increased to 800 MHz, and up to four xCONNECT links are available, providing scalability through the connection of additional xCORE processors.[3]
Of particular note, the new vector capability provides up to a theoretical maximum of 51 GOPS (billion operations per second, e.g. multiply-accumulate) of AI performance using 8-bit data, and 408 GOPS on binarized (1-bit) neural networks.[4][5]
Architecture
The architecture comprises a central execution unit that operates on a set of 25 registers and is surrounded by a number of resources that perform operations that interact with the environment. Each thread has its own set of hardware registers, enabling threads to execute concurrently. The instruction set comprises both a (more or less standard) sequential programming model, and instructions that implement multi-threading, multi-core and I/O operations.
Most instructions can access only the 12 general-purpose registers r0–r11. In general, they are completely interchangeable, except that some instructions use r11 implicitly. There are also 4 base registers usable by some instructions:
- r12 = cp = Constant pool pointer
- r13 = dp = Data pointer
- r14 = sp = Stack pointer
- r15 = lr = Link register
Registers 16 through 24 are only accessible to specialized instructions. Except for the first two (r16 = pc = program counter, r17 = sr = status register), they are dedicated to exception and interrupt handling.
The status register contains various mode bits, but the processor does not have the standard ALU result flags like carry, zero, negative or overflow. Add and subtract with carry instructions exist, but specify five operand registers: two inputs and input carry, and one output and output carry.
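Because there is no carry flag, a multi-word addition has to pass the carry through ordinary registers. The following is a minimal Python model of that idea, not XMOS code: the function names `add_with_carry` and `add_multiword` are hypothetical and do not correspond to instruction mnemonics.

```python
MASK32 = 0xFFFFFFFF

def add_with_carry(a, b, carry_in):
    """Model a five-operand add: two inputs and a carry in,
    producing a 32-bit sum and a carry out (hypothetical helper)."""
    total = (a & MASK32) + (b & MASK32) + (carry_in & 1)
    return total & MASK32, total >> 32

def add_multiword(xs, ys):
    """Add two equal-length little-endian lists of 32-bit words.
    The carry travels in a plain variable, as it would in a register."""
    carry = 0
    result = []
    for x, y in zip(xs, ys):
        word, carry = add_with_carry(x, y, carry)
        result.append(word)
    return result, carry
```

For example, `add_multiword([0xFFFFFFFF, 0], [1, 0])` carries out of the low word into the high word.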
Instruction encoding
Most instructions are 16-bit while a few have 32-bit encoding. Instructions can use between zero and six operands. Most common arithmetic operations (such as ADD, SUB, MULT) are three-operand instructions based on a set of 12 general purpose registers.
Format | Field layout, from bit 15 down to bit 0 |
---|---|
10/20-bit immediate | opcode (bits 15–10), immediate (bits 9–0) |
register & 6/16-bit immediate | opcode (bits 15–10), register (bits 9–6), immediate (bits 5–0) |
6/16-bit immediate | opcode (bits 15–10), 1 1 (bits 9–8), opc (bits 7–6), immediate (bits 5–0) |
three-operand register | opcode (bits 15–11), 9×a+3×b+c (bits 10–6), a a (bits 5–4), b b (bits 3–2), c c (bits 1–0) |
two-operand register | opcode (bits 15–11), 27+3×b+c (bits 10–6), ∗ (bit 5), o (bit 4), b b (bits 3–2), c c (bits 1–0) |
one-operand register | opcode (bits 15–11), 1 1 1 1 1 1 (bits 10–5), o (bit 4), c c c c (bits 3–0) |
zero-operand | opcode (bits 15–11), 1 1 1 1 1 1 (bits 10–5), o (bit 4), 1 1 (bits 3–2), opc (bits 1–0) |
The last four forms share the same opcode range, because the number of operands is determined by bits 5 through 10. The last three forms use bit 4 as an additional opcode bit. (And the last form uses bits 1 and 0 as well.)
In the second form, some instructions (loads and stores) use all four bits to encode the register number, allowing access to r12–r15. Other instructions (conditional branches) do not allow register numbers above 11, instead allowing the third form to share the opcode range.
Because constants are always unsigned, many instructions come in add/subtract pairs, e.g. jump forward and backward.
The form of an instruction is determined by its four most-significant bits:
- 00__: register operands (8 opcodes)
- 0100: register operands (2 opcodes)
- 0101: register + 6-bit immediate (4 opcodes, 16 registers allowed)
- 0110: register + 6-bit immediate (4 opcodes, 16 registers allowed)
- 0111: register + 6-bit immediate (4 opcodes, 12 registers allowed)
- 10__: register operands (8 opcodes)
- 1100: register operands (2 opcodes)
- 1101: 10-bit immediate (4 opcodes)
- 1110: 10-bit immediate (4 opcodes)
- 1111: Prefix opcodes:
  - 111100: 10 additional immediate bits, prepended to the following instruction's 6 or 10 bits.
  - 11111: three additional operands, in addition to the operands of the following register instruction.
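The list above can be read as a small decision procedure on the top bits of a 16-bit instruction word. The sketch below is an illustration only: the function name `instruction_form` and the returned strings are made up for this example, and bit patterns under 1111 that the list does not mention are reported as "unlisted" rather than guessed.

```python
def instruction_form(word16):
    """Classify a 16-bit instruction word by its most-significant bits."""
    top4 = (word16 >> 12) & 0xF
    if top4 == 0b1111:
        # Prefix opcodes: a 6-bit pattern 111100 or a 5-bit pattern 11111.
        if (word16 >> 10) & 0x3F == 0b111100:
            return "immediate prefix"
        if (word16 >> 11) & 0x1F == 0b11111:
            return "operand prefix"
        return "unlisted"
    if top4 in (0b0101, 0b0110, 0b0111):
        return "register + 6-bit immediate"
    if top4 in (0b1101, 0b1110):
        return "10-bit immediate"
    # Remaining patterns: 00__, 0100, 10__ and 1100.
    return "register operands"
```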
The encoding of the three-operand register opcodes is quite unusual, since 12 registers is not a power of 2. The encoding used fits zero to three operands, and the number of operands, into 11 bits. Thus, each 5-bit opcode can be assigned four times, once to a three-operand instruction, once to a two-operand, etc.
In all cases, the low 2 bits of the register number are placed in a 2-bit field, reducing the problem to encoding the high bits, which are in the range of 0 to 2.
The three-operand form places the low register numbers in the low 6 instruction bits. The high 2 bits of each register number are combined in base-3 into a number between 0 and 26 (using 9×a+3×b+c) and stored in the remaining 5 bits.
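The packing described above can be sketched in a few lines of Python. The helper names `encode_3r` and `decode_3r` are hypothetical, and the sketch assumes the field positions given in the table (opcode in bits 15–11, combined high parts in bits 10–6, then the low bits of a, b and c).

```python
def encode_3r(opcode5, a, b, c):
    """Pack a 5-bit opcode and three register numbers (0..11) into 16 bits."""
    if not 0 <= opcode5 < 32:
        raise ValueError("opcode must fit in 5 bits")
    if not all(0 <= r <= 11 for r in (a, b, c)):
        raise ValueError("register numbers must be in 0..11")
    # High parts (each 0..2) combined in base 3 give a value in 0..26.
    combo = 9 * (a >> 2) + 3 * (b >> 2) + (c >> 2)
    low = ((a & 3) << 4) | ((b & 3) << 2) | (c & 3)
    return (opcode5 << 11) | (combo << 6) | low

def decode_3r(word):
    """Inverse of encode_3r; rejects the values 27..31 used by other forms."""
    combo = (word >> 6) & 0x1F
    if combo > 26:
        raise ValueError("not a three-operand encoding")
    a_hi, rest = divmod(combo, 9)
    b_hi, c_hi = divmod(rest, 3)
    a = (a_hi << 2) | ((word >> 4) & 3)
    b = (b_hi << 2) | ((word >> 2) & 3)
    c = (c_hi << 2) | (word & 3)
    return word >> 11, a, b, c
```

With all three operands set to r11, the combined field reaches its maximum of 26, leaving 27 to 31 free for the other forms.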
The two-operand form uses the unused 5 combinations (27–31) in the 5-bit field. Operand a is not used, and the 2-bit field for its low bits is reassigned; one bit is used for an additional opcode bit, and the other is used as an additional combination register specifier, doubling the number of available combinations to 10, and allowing all 9 combinations of 3×b+c to be represented. This is done in a manner similar to bi-quinary coded decimal: the combination, modulo 5, is stored in the 5-bit field (as (3×b+c) mod 5 + 27), and the 1-bit quotient (⌊(3×b+c)/5⌋) is stored in instruction bit 5 (marked with an asterisk in the table above).[6]
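A sketch of this bi-quinary scheme, following the description in the paragraph above: the remainder modulo 5 goes into the 5-bit field (offset by 27), and the quotient goes into bit 5. The names `encode_2r` and `decode_2r` are hypothetical, and the placement of the extra opcode bit in bit 4 is taken from the table.

```python
def encode_2r(opcode5, extra_bit, b, c):
    """Pack a two-operand register instruction into 16 bits."""
    if not (0 <= opcode5 < 32 and extra_bit in (0, 1)):
        raise ValueError("bad opcode bits")
    if not all(0 <= r <= 11 for r in (b, c)):
        raise ValueError("register numbers must be in 0..11")
    combo = 3 * (b >> 2) + (c >> 2)            # 0..8
    quotient, remainder = divmod(combo, 5)     # quotient is 0 or 1
    field = 27 + remainder                     # 27..31
    return ((opcode5 << 11) | (field << 6) | (quotient << 5)
            | (extra_bit << 4) | ((b & 3) << 2) | (c & 3))

def decode_2r(word):
    """Inverse of encode_2r: subtract 27, add 5 times bit 5."""
    field = (word >> 6) & 0x1F
    if field < 27:
        raise ValueError("three-operand encoding")
    combo = (field - 27) + 5 * ((word >> 5) & 1)
    if combo == 9:
        raise ValueError("tenth combination: one- or zero-operand encoding")
    b_hi, c_hi = divmod(combo, 3)
    b = (b_hi << 2) | ((word >> 2) & 3)
    c = (c_hi << 2) | (word & 3)
    return word >> 11, (word >> 4) & 1, b, c
```

Only nine of the ten representable combinations are used here, so bits 10 through 5 are never all ones in a two-operand word.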
One-operand instructions use the tenth combination value, with all 6 bits set, and place the register number in the 4 available bits. Only operand c is specified, and the high bits are stored in the b field.
Finally, the one-operand encoding, with a register number 12 or more (the b field contains binary 11), is also used to encode zero-operand instructions. The two low-order bits of the c field are available for additional opcode bits (bringing the total to 8).
(A few instructions use the register c field value 0–11 as a small immediate constant, or use it to select one of 12 convenient bit-shift constants 0–8, 16, 24, or 32.)
Less frequently used instructions are encoded in 32 bits. 32-bit instructions allow 16- or 20-bit immediate operands (such as far branches), up to six register operands (for example long multiply which has four source and two destination operands) and additional opcode space for rarely used instructions.
One 10-bit immediate opcode (PFIX, opcode 111100) is used to add an additional 10 bits to the 6- or 10-bit immediate in the following instruction.
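As a rough illustration of how the prefix widens an immediate, the sketch below places the prefix's 10 bits above the 6 or 10 bits of the following instruction, giving a 16- or 20-bit value. The function name `extend_immediate` is hypothetical.

```python
def extend_immediate(prefix10, imm, imm_bits):
    """Prepend a 10-bit prefix to a 6- or 10-bit immediate."""
    if imm_bits not in (6, 10):
        raise ValueError("the following immediate is 6 or 10 bits wide")
    if not 0 <= prefix10 < (1 << 10):
        raise ValueError("prefix must fit in 10 bits")
    if not 0 <= imm < (1 << imm_bits):
        raise ValueError("immediate does not fit in its field")
    # The prefix supplies the high bits, the instruction the low bits.
    return (prefix10 << imm_bits) | imm
```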
One three-operand opcode (EOPR, opcode 11111) is reserved for an "additional operands" prefix. Its 3 operands are used along with those of the following instruction word to produce additional 32-bit instructions with up to six operands. This is also used for rarely used three- and two-operand instructions; in such cases the EOPR specifies all three or two operands, and the following instruction word is a zero-operand instruction. (In the two-operand case, the extra opcode bit in the leading EOPR is used.)
Programming model
Sequential programming model
Each thread has access to 12 general purpose registers R0...R11. In addition there are 4 special purpose registers: SP (stack pointer), LR (link register, which stores the return address), CP (constant pool pointer, which points to a part of memory that stores constants) and DP (data pointer, which points to global variables). In addition to those 16 there are another 9 registers that store the PC, kernel PC, exception type, exception data, and saved copies of all those in case of an exception or interrupt.[7] The instruction set is a load-store instruction set.
Almost all instructions execute in a single cycle. If an instruction does not need data from memory (for example, arithmetic operations), the instruction will prefetch a word of instructions. Because most instructions are encoded in 16 bits, and because most instructions are not loads or stores (a typical mix is 20% loads and stores, 80% other instructions[8]), the prefetch mechanism can stay ahead of the instruction stream. This acts like a very small instruction cache, but its behaviour can be predicted at compile time, making timing behaviour as predictable as functional behaviour.
Instructions that access memory all use a base register: SP, DP, CP, PC or any general purpose register. In a single 16-bit instruction a thread can access:
- Up to 64 words relative to the stack pointer (read or write, word access only)
- Up to 64 words relative to the data pointer (read or write, word access only)
- Up to 64 words relative to the constant pointer (read only, word access only)
- Up to 12 words relative to any general purpose register (read and write, word access only)
- An indexed word using any two general purpose registers
- An indexed 16-bit quantity using any two general purpose registers
- An indexed byte using any two general purpose registers
Larger sections of memory can be accessed by means of extended instructions, which extend the above ranges to 64 KBytes.
This scheme is designed in order to densely encode the common cases found in many programming patterns: access to small stack frames, a small set of globals and constants, structures, and arrays. Access to bit fields that have an odd length is facilitated by means of sign and zero extend instructions.
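To illustrate what sign and zero extension do for an odd-length bit field, here is a small Python model. It works on unbounded Python integers rather than 32-bit registers, and the names `zext`, `sext` and `extract_field` are chosen for this example; they are not claimed to match the exact semantics of the corresponding instructions.

```python
def zext(value, bits):
    """Keep the low `bits` bits and clear everything above them."""
    return value & ((1 << bits) - 1)

def sext(value, bits):
    """Keep the low `bits` bits and treat the top one as a sign bit."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def extract_field(word, offset, width, signed=False):
    """Read a `width`-bit field starting at bit `offset` of `word`."""
    shifted = word >> offset
    return sext(shifted, width) if signed else zext(shifted, width)
```

A 5-bit field holding the pattern 11011 reads as 27 when zero-extended and as -5 when sign-extended.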
All common arithmetic instructions are provided, including divide and remainder (which are the only instructions that are not single cycle). Comparison instructions compute a truth value (0 or 1) into a register, avoiding the use of flags. Many instructions have an immediate version that allows a single operand with a value of between 0 and 11 inclusive, encoding many common cases such as "i = i + 1". In the case of bit operations such as shift, the immediate value encodes common cases. Extra instructions are provided for reversing bits and bytes, count leading zeros, digital signal processing, and long integer arithmetic.
The branch instructions include conditional and unconditional relative branches. A branch using the address in a register is provided, and a relative branch which adds a scaled register operand to the program counter is provided to support jump tables. Branches to up to instructions distance are encoded in a single word. The procedure calling instructions include relative calls, calls via the constant pool, indexed calls via a dedicated register and calls via a register. Most calls within a single program module can be encoded in a single instruction; inter-module calling requires at most two instructions. The callee must save the link register if it is not a leaf function; a single instruction extends the stack and saves the link register.
Dual issue mode, available on XS2, enables one short load, store, or branch instruction to be paired with one short resource instruction. Short arithmetic instructions can be paired with any instruction. This enables inner-loops that, for example, transfer data from memory to IO to be halved in length by issuing the LOAD instruction together with the ADD instruction, and the change to the counter together with the branch instruction.
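The pairing rule can be modelled as a check that the two short instructions fit into different lanes. The sketch below is a simplification based only on the lane description in the XS2 section: the class names ("resource", "memory", "branch", "arithmetic") and the function `can_pair` are made up for this example, and alignment and long instructions are ignored.

```python
# Which classes of short instruction each lane accepts (simplified).
RESOURCE_LANE = {"resource", "arithmetic"}
MEMORY_LANE = {"memory", "branch", "arithmetic"}
KNOWN_CLASSES = RESOURCE_LANE | MEMORY_LANE

def can_pair(first, second):
    """Return True if two short instructions can issue together,
    one in the resource lane and the other in the memory lane."""
    for cls in (first, second):
        if cls not in KNOWN_CLASSES:
            raise ValueError(f"unknown instruction class: {cls}")
    return ((first in RESOURCE_LANE and second in MEMORY_LANE)
            or (second in RESOURCE_LANE and first in MEMORY_LANE))
```

Under this model a load pairs with an I/O operation or an add, but two memory-lane-only instructions (for example a load and a branch) do not pair.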
Parallel programming model
The XS1 instruction set is designed to support both multi-threaded and multi-core computations. To this end it supports channel communication (to support distributed memory computations) and barriers and locks (to support shared memory computations). A thread initiates execution on one or more newly allocated threads by setting their initial register values.
Communication between threads is performed using channels that provide full-duplex data transfer between channel-ends. This enables, among other things, the implementation of CSP-based languages and languages based on the pi calculus. The instruction set is agnostic as to where a channel is connected, whether inside a core or outside it. Channels carry messages constructed from data and control tokens between the two channel ends. The control tokens can be used to encode communication protocols.
Channel ends have a buffer able to hold sufficient tokens to allow at least one word to be buffered. If an output instruction is executed when the channel is too full to take the data then the thread which executed the instruction is paused. It is restarted when there is enough room in the channel for the instruction to successfully complete. Likewise, when an input instruction is executed and there is not enough data available then the thread is paused and will be restarted when enough data becomes available.
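The blocking behaviour described above resembles a bounded queue shared by two threads. The following Python sketch is an analogy using the standard `queue` and `threading` modules, not a model of the hardware: it shows one direction of a channel only, the buffer size of 4 is an arbitrary assumption, and `run_channel_demo` is a hypothetical name.

```python
import queue
import threading

def run_channel_demo(n_tokens=8, buffer_tokens=4):
    """Send n_tokens through a bounded buffer between two threads."""
    chan = queue.Queue(maxsize=buffer_tokens)
    received = []

    def producer():
        for i in range(n_tokens):
            chan.put(i)  # blocks (the thread pauses) while the buffer is full

    def consumer():
        for _ in range(n_tokens):
            received.append(chan.get())  # blocks while no data is available

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Neither thread polls: each simply stops at `put` or `get` until the other side makes progress, and the tokens arrive in the order they were sent.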
A thread can, with a single instruction, synchronise with a group of threads using a barrier synchronisation. Alternatively a thread can synchronise using a lock, providing mutual exclusion. In order to communicate data when using barriers and locks, threads can either write data into the registers of another thread, or they can access memory of another thread (provided both threads execute on the same core). If shared memory is used, then the compiler or the programmer must ensure that there are no race conditions.
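As an analogy for the two shared-memory mechanisms, the sketch below uses Python's standard `threading.Barrier` and `threading.Lock`. It is not XMOS code, and `run_barrier_demo` is a hypothetical name. The lock guards a shared list, and the barrier guarantees that every thread has written before any thread reads.

```python
import threading

def run_barrier_demo(n_threads=4):
    """Each worker writes, waits at a barrier, then reads the shared state."""
    barrier = threading.Barrier(n_threads)
    lock = threading.Lock()
    written = []
    seen_after_barrier = []

    def worker(i):
        with lock:                  # mutual exclusion on the shared list
            written.append(i)
        barrier.wait()              # nobody continues until all have written
        with lock:
            seen_after_barrier.append(len(written))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after_barrier
```

After the barrier every worker observes all `n_threads` writes, regardless of scheduling order.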
The XS2 architecture has a 'priority mode' that enables threads to run in high priority. Low priority threads are guaranteed progress, but high priority threads are guaranteed a thread cycle when they are ready to execute.
I/O and timing instructions
The XS1 architecture is event-driven. It has an instruction that can dispatch on external events in addition to traditional interrupts. If the program chooses to use events, then the underlying processor has to expect an event and wait in a specific place so that it can be handled synchronously. If desired, I/O can be handled asynchronously using interrupts. Events and interrupts can be used on any resource that the implementation supports.
Common resources that are supported are ports (for external input and output), timers (that allow timing to a reference clock), channels (that allow communication and synchronization between threads within a core, and threads on different cores), locks (which allow controlled access to shared memory), and synchronizers (which implement barrier synchronizations between threads).
References
- [1] "XMOS XS1 Architecture" (PDF). XMOS. 2016-12-21. https://www.xmos.com/published/xmos-xs1-architecture?format=pdf.
- [2] "xCORE-200: The XMOS XS2 Architecture" (PDF). XMOS. 2016-12-21. https://www.xmos.com/published/xs2-isa-specification.
- [3] "XMOS announces world’s lowest cost, most flexible AI processor". XMOS. https://www.xmos.com/xmos-announces-worlds-lowest-cost-most-flexible-ai-processor.
- [4] "XMOS adapts Xcore into AIoT ‘crossover processor’". EE Times. https://www.eetimes.com/xmos-adapts-xcore-into-aiot-crossover-processor.
- [5] "Insightful analysis of xcore.ai from The Linley Group". The Linley Group. https://www.xmos.com/insightful-analysis-of-xcore-ai.
- [6] The architecture manual documents bit 5 as the "most significant bit", but fails to mention the non-binary base; some XS-1 disassembler source code makes it clear. In the definition of parse-inssn-r2, the "1 #split 1b - swap 5 * +" portion splits the 6-bit register field into a 5-bit and a 1-bit part, subtracts 27 (hex 1b) from the high part, multiplies the low part by 5, and adds them.
- [7] David May. "XMOS XS1 Architecture" (PDF). XMOS. https://www.xmos.com/published/xmos-xs1-architecture. (Free registration required)
- [8] Jurij Šilc; Borut Robič; Theo Ungerer (1999). Processor Architecture. Springer. ISBN 3-540-64798-8. https://archive.org/details/processorarchite0000silc.